1313: Regex Golf
Title text: /bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls/ matches the last names of elected US presidents but not their opponents.
The comic talks about regular expressions, which are a way to specify textual patterns. Given a regular expression, one can search for the pattern it specifies inside a text string. If the pattern is found, it's said that the pattern "matches" the string; if it's not found, it's said they do not match.
The title of the comic and the first panel is based on "regex golf", which is a discipline of "code golf", a game in which programmers attempt to solve a given programming problem using as few characters as possible, analogous to the number of golf shots it takes to reach the goal. In regex golfing, the programmer is given two sets of text fragments, and he/she tries to write the shortest possible regular expression which would match all elements of one set, while at the same time not matching any element from the other set.
The regex golf challenge Megan faces consists of matching all subtitles of Star Wars episodes, while not matching any subtitle of Star Trek episodes. (Subtitles are the secondary titles of the movies, after the "Star Trek: " or "Star Wars Episode N: ". For example, in Star Wars Episode I: The Phantom Menace, the subtitle is The Phantom Menace.) In the first panel, she created a 12-character regex solving the challenge.
Then she moved on to building a tool which would automatically build such a regex for arbitrary lists of text, which could be described as meta- regex golfing. But as she has lost this tool, she needs to search through her files and chooses a tool called "grep" to find it. This implies that she needs a regular expression that would find any code that appears to be a regex golf generator, which leads to another "meta-" layer of abstraction. At the end, Megan notes this sequence of meta-meta-... might go to infinity and Cueball quips that she now has "infinite problems" as a result of her efforts; Megan retorts that she already had "infinite problems" because she's geeky enough to run meta-versions of programs on themselves, and stubborn enough to continue on until she fails, to the exclusion of all else. This also seems to be a reference to a famous quote (see also 1171: Perl Problems):
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
The first regex Megan uses is
/m | [tn]|b/, said to match Star Wars subtitles but not Star Trek.
The forward slashes
/ just mark the start and end of the regex. The
| character means "or", so the regex matches any string that contains the patterns "
m ", "
[tn]" or "
b" (including the spaces). The square brackets match one of the enclosed characters, meaning that "
[tn]" matches either "
t" or "
n". The regex is apparently case-insensitive, because it wouldn't work otherwise.
The Star Wars subtitles match the parts of the regex in the following way:
- "The Phantom Menace" is matched by "
- "Attack of the Clones" is matched by "
- "Revenge of the Sith" is matched by "
- "A New Hope" is matched by "
- "The Empire Strikes Back" is matched by "
- "Return of the Jedi" is matched by "
Note that if one included the animated film "Star Wars: The Clone Wars" it would be matched by "
On the other hand, none of the Star Trek subtitles contains an M followed by a space, a T or an N preceded by a space, or any B, so the regex does not match any of them. Note that in the original series all subtitles start with a "T" but it's the first character so it's not preceded by a space.
Here is the list that Megan probably used:
- Original series:
- The Motion Picture
- The Wrath of Khan
- The Search For Spock
- The Voyage Home
- The Final Frontier
- The Undiscovered Country
- The Next Generation:
- First Contact
- Reboot series:
- the one without a subtitle
- Into Darkness
In the last panel "and beyond" Megan uses the regular expression
/(meta-)*regex golf/ to describe her problem.
* means "zero or more" of the preceding character/group (parentheses
() group characters). So this regex matches "regex golf", "meta-regex golf", "meta-meta-regex golf", etc. In a way this is regex golf in itself, matching all levels of meta-regex golf while not matching anything else.
In the title text, there is a long regex that is the solution of another regex golf challenge: matching the last names of all elected US presidents but not their opponents. Note that the list of opponents include some people who were previously or later became presidents, so taken literally this is impossible. To make this work the list of opponents must exclude anyone who was also president. The regular expression itself works in a very similar way to the Star Wars/Trek one, including several different patterns separated by
|. Each elected president matches one pattern while each opponent matches none.
Here is a list of elected president and the patterns they match:
Note that some presidents are missing because they weren't elected but became presidents after the resignation/death of their formers.
Note also that Randall's regular expression must be modified slightly, because it also matches John C. Fremont, the runner-up to James Buchanan in 1856, as discussed by Peter Norvig at xkcd 1313: Regex Golf. Note that Norvig provides a small amount of Python code which actually plays regex golf with arbitrary lists, and found a shorter solution than Randall's for the Star Wars vs Star Trek game.
- Regex golf:
- [Megan is sitting at a laptop. Cueball is standing behind her.]
- Megan: You try to match one group but not the other.
- Megan: /m | [tn]|b/ matches Star Wars subtitles but not Star Trek.
- Cueball: Cool.
- Meta-regex golf:
- Megan: So I wrote a program that plays regex golf with arbitrary lists...
- Cueball: Uh oh...
- Meta-meta-regex golf:
- Megan: ...But I lost my code, so I'm grepping for files that look like regex golf solvers.
- ...And beyond:
- Megan: Really, this is all /(meta-)*regex golf/.
- Cueball: Now you have infinite problems.
- Megan: No, I had those already.
- There are now at least four comics that references regular expressions. The other three are: 208: Regular Expressions, 224: Lisp and 1171: Perl Problems.
- Additionally, regular expressions are mentioned in title text of 1277: Ayn Random.
- A regular expression is used in the 75th episode of what if? to calculate the answer that week's question.
- Also, Randall mentions a website with a regexp golf game he got distracted by while researching for the 78th episode of what if? (which was published one day after this comic).