I'm a technology analyst from Wellington, New Zealand.
paperless@timmcnamara.co.nz @timClicks
Regular expressions are a uniquely powerful tool, but they have their problems. It’s not just that they’re slow, it’s that they’re brittle. The smallest change can crash the system. From personal experience, I’ve found crafting regular expressions difficult and time-consuming. So… what could we do to change them?
What about using a simplified expression language that could be compiled into regular regular expressions? We would lose a lot of the power, but often we don’t need it.
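Here are some examples, with the simplified expression on the left and the regular expression it would compile to on the right:

Simplified    Regular expression
nnnn          [0-9]{4}
YYYY:MM       [0-9]{4}:[01][0-9]
email         ([\w.-]+)@((?:\w+\.)+)([a-zA-Z]{2,4})
“email”       email
#ffffff       #?[0-9A-Fa-f]{6}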
Let’s walk through them:
n is a fairly ubiquitous symbol for number. Having four ns is far more readable to untrained eyes than the current regular expression. While long runs of ns may be harder to read, I think that a set of ns has an added benefit: the regular expression will look more like the target text. This should make expressions simpler and easier to debug. Which of these two looks more like a phone number: nnn nnn nnnn or [0-9]{3} [0-9]{3} [0-9]{4}?
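As a rough sketch of how the n notation could be compiled, the Python below turns each run of ns into a counted digit class and escapes every other character so that it matches itself. The function name compile_digits is made up for this illustration, and the rule that every n means a digit is an assumption taken from the examples above, not a finished design.

```python
import re


def compile_digits(pattern: str) -> str:
    """Translate runs of 'n' into digit classes; escape everything else."""
    parts = []
    # Each match is either a run of n's or one other character.
    for run, other in re.findall(r"(n+)|(.)", pattern, flags=re.DOTALL):
        if run:
            count = len(run)
            parts.append("[0-9]" if count == 1 else "[0-9]{%d}" % count)
        else:
            # re.escape keeps punctuation such as '.' from acting as a metacharacter.
            parts.append(re.escape(other))
    return "".join(parts)


phone = compile_digits("nnn nnn nnnn")
print(bool(re.fullmatch(phone, "021 555 0123")))
```

Because the input is scanned left to right in a single pass, the compiled expression keeps the same shape as the simplified one, which is the readability benefit described above.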
The YYYY:MM example should be clear. Here, I’ve tried to go with readability. Let’s create a system that’s easy for humans and uses notation they already relate to.
Creating regular expressions for emails, IP addresses, hex values and so forth can be a nightmare. Why don’t we replace these with keywords such as email, or with a sample value such as #ffffff? I’m sure this is effectively what most developers do anyway.
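One way to picture the keyword idea is a plain lookup table from keyword to a stock pattern. The sketch below is only an illustration: the table name KEYWORDS, the function expand_keyword and the "ip" keyword are hypothetical, and the IP pattern is deliberately loose (it would accept 999.999.999.999).

```python
import re

# Hypothetical keyword table. "email" and "#ffffff" follow the examples in
# this post; "ip" is an invented name with a deliberately loose pattern.
KEYWORDS = {
    "email": r"[\w.-]+@(?:\w+\.)+[a-zA-Z]{2,4}",
    "#ffffff": r"#?[0-9A-Fa-f]{6}",
    "ip": r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}",
}


def expand_keyword(name: str) -> str:
    """Return the stock regular expression registered for a keyword."""
    try:
        return KEYWORDS[name]
    except KeyError:
        raise ValueError(f"unknown keyword: {name!r}") from None


hex_colour = re.compile(expand_keyword("#ffffff"))
print(bool(hex_colour.fullmatch("#1a2B3c")))
```

Keeping the patterns in one table means a fix to, say, the email pattern is made once rather than in every expression that needs an email address.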
The downside of my approach is that characters no longer match themselves, so we need to add some form of literal syntax. Quote marks are probably the best bet: “email” in quotes matches the literal text email, while a bare email is the keyword.
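To make the quoting rule concrete, here is a minimal sketch of a compiler that treats quoted text as a literal and everything else as notation. It assumes straight double quotes, a small hypothetical token table (TOKENS) and an invented function name (compile_simple). An unmatched quote is simply treated as an ordinary character, which a real design might prefer to reject.

```python
import re

# Hypothetical token table; the patterns mirror the examples in this post.
TOKENS = {
    "email": r"[\w.-]+@(?:\w+\.)+[a-zA-Z]{2,4}",
    "YYYY": r"[0-9]{4}",
    "MM": r"[01][0-9]",
}

# Order matters: quoted literals first, then keywords, then runs of n,
# and finally any single character.
LEXER = re.compile(
    r'"(?P<literal>[^"]*)"'
    r"|(?P<keyword>" + "|".join(map(re.escape, TOKENS)) + r")"
    r"|(?P<digits>n+)"
    r"|(?P<char>.)",
    re.DOTALL,
)


def compile_simple(pattern: str) -> str:
    """Compile the simplified notation into an ordinary regular expression."""
    out = []
    for m in LEXER.finditer(pattern):
        if m.group("literal") is not None:
            out.append(re.escape(m.group("literal")))  # quoted: match as written
        elif m.group("keyword"):
            out.append(TOKENS[m.group("keyword")])     # keyword: stock pattern
        elif m.group("digits"):
            out.append("[0-9]{%d}" % len(m.group("digits")))
        else:
            out.append(re.escape(m.group("char")))     # anything else: itself
    return "".join(out)


# Without quotes the n in "phone" is read as a digit; with quotes it is literal.
print(bool(re.fullmatch(compile_simple("phone"), "phone")))
print(bool(re.fullmatch(compile_simple('"phone" nnn'), "phone 555")))
```

The design choice here is that quoting is the only escape mechanism: anything inside the quotes is passed through re.escape, so the writer never needs to know which characters are special in the underlying regular expression.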
Some final thoughts: