Jeremy W. Sherman

stay a while, and listen

The Gist of Regex

Regular expressions scare some people. They’re really quite warm and cuddly, or at least, conceptually very neat and tidy. If you don’t feel that way, this post is for you! Here’s how I think about regexen, in a nutshell.

I use this conception on a regular basis; when it comes to writing regex, I think about what I want to do in this model, then translate it into whatever regex notation the system I’m using at the time gives me. (I do the same thing with distributed version control and relation databases, but let’s stick to regexen for now.)

Regex Is Tiny Machines!

Regular expressions are a compact description of a symbol-matching machine. Like, “If you see an a, then maybe a b, and then one or more c, it’s a match!” for ab?c+.

But the machines can nest, so you can instead say stuff like, “If you see one thing matched by this machine, then maybe one thing matched by that one, followed by one or more things matched by that other, it’s a match!” So the a, b, and c from the last bit could actually be bigger regular expressions themselves.

But you have no variables in regex! So, instead, you plop the whole machine descriptions in there in parentheses, like (…)(…)?(…)+ And repeat the description if you need the same machine twice.

Pitch in self-referentiality - “if you see exactly the same thing as you ended up matched back there” - by using backrefs to parenthesized machines, you’re in our modern world of extended “regular” expressions. At that point, what we’re talking about is no longer actually expressions describing what’s technically known as a regular language, but they’re exceedingly useful extensions of the notation, so no-one cares. ;)

Compact Notation, Effective Expressive Power

What makes regular expressions so useful is:

  • Reach: A lot of stuff we want to match against can actually be described by them, especially when you pitch in a lot of the extended power-ups
  • Compactness: They’re a marvelously compact notation for what would otherwise be a lot of very boring code! Instead of writing that code, we dash of a regex, and we leave the translation into code to the regular expression engine.

For the More Curious

If that’s whet your whistle, Friedl’s Mastering Regular Expressions is excellent. And, as a bonus, you can probably just read the first few chapters and emerge enlightened. :)

P.S. You can also look at regular expressions as definitions of regular languages - as generators rather than consumers of text. Running them backwards like this can be a good way to think about whether a regex you’re writing captures exactly what you’re aiming at, or whether it might include a bit more than you intended!

P.P.S. And if you think about them in terms of machines, it’s really easy to start thinking about how to write fast regular expressions.

P.P.P.S. Hat-tip to @bazbt3 over at App.Net. What is dead can never die!