208: Regular Expressions

Explain xkcd: It's 'cause you're dumb.
Revision as of 18:24, 24 August 2013 by Dgbrt (Talk | contribs)

Regular Expressions
Title text: Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee.

Explanation

In computing, a regular expression provides a concise and flexible means to "match" (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.
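
As a minimal sketch of what "matching" means, the Python snippet below uses the standard `re` module to pull date-like strings out of a sentence. The sample text and the pattern are illustrative assumptions, not something taken from the comic.

```python
import re

# Illustrative sample text (an assumption, not from the comic).
text = "The meeting moved from 2013-08-24 to 2013-09-02, room 208."

# Four digits, a hyphen, two digits, a hyphen, two digits.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# findall returns every non-overlapping match as a list of strings.
dates = date_pattern.findall(text)
print(dates)  # ['2013-08-24', '2013-09-02']
```

The pattern describes the shape of the text rather than any particular value, which is why "room 208" is not reported.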

Looking for a specific pattern in 200MB of text is equivalent to "looking for a needle in a haystack". But the task becomes easy with regular expressions ("regexes"): a short script can find every "match" of a specific string pattern far faster than a human could.
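
The search described in the comic might be sketched roughly as follows. The pattern `ADDRESS_RE` and the helper `find_addresses` are hypothetical names invented for this illustration, and the pattern is deliberately naive: it only recognizes a house number, some capitalized words and a handful of English street suffixes.

```python
import re
from typing import Iterable, Iterator

# Hypothetical, deliberately naive pattern for US-style street addresses:
# a house number, one or more capitalized words, then a street suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+)+(?:Street|St|Avenue|Ave|Road|Rd|Lane|Ln)\b"
)

def find_addresses(lines: Iterable[str]) -> Iterator[str]:
    """Yield every address-like substring, scanning one line at a time."""
    for line in lines:
        for match in ADDRESS_RE.finditer(line):
            yield match.group(0)

# Made-up sample lines standing in for the emails.
sample = [
    "See you at 742 Evergreen Terrace tonight.",
    "She checked in at 1600 Pennsylvania Avenue yesterday.",
    "Forward the bill to 12 Oak St please.",
]
found = list(find_addresses(sample))
print(found)  # ['1600 Pennsylvania Avenue', '12 Oak St']
```

Because `find_addresses` accepts any iterable of lines, an open text file could be passed in directly, so a large mailbox would be scanned without loading it all into memory. Note that the first sample line is missed because "Terrace" is not in the suffix list, which hints at how fragile such a pattern is.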

Perl is a popular scripting language, and is especially well known for the flexible and simple regular expression features that it offers.

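The title text shows how sensitive a regular expression is to small mistakes: a single unescaped space can change what the pattern matches, so the hero has to swing back in and fix it.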
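
The comic does not say how the forgotten escape broke the pattern. One plausible reading, assuming the pattern was written in extended mode (Perl's `/x` modifier, or `re.VERBOSE` in Python), is that the unescaped space was silently ignored. A minimal Python sketch of that effect, with made-up sample text:

```python
import re

# Made-up sample text for illustration.
text = "Meet me at 42 Elm Street."

# In verbose (extended) mode, unescaped whitespace in the pattern is ignored,
# so this pattern really means \d+Elm and finds nothing in the text.
forgot_escape = re.compile(r"\d+ Elm", re.VERBOSE)

# Escaping the space with a backslash makes it a literal character again.
escaped = re.compile(r"\d+\ Elm", re.VERBOSE)

missed = forgot_escape.search(text)
hit = escaped.search(text)
print(missed)        # None
print(hit.group(0))  # 42 Elm
```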

Transcript

Whenever I learn a new skill I concoct elaborate fantasy scenarios where it lets me save the day.
Megan: Oh no! The killer must have followed her on vacation!
[Megan points to computer.]
Megan: But to find them we'd need to search through 200MB of emails looking for something formatted like an address!
Cueball: It's hopeless!
Everybody stand back.
I know regular expressions.
[A man swings in on a rope, toward the computer.]
tap tap
PERL!
[The man swings away, and the other characters cheer.]

Trivia

  • This comic is featured on one of the T-shirts sold at the xkcd store.
  • Randall doesn't like Perl; he prefers Python.
  • Writing a good street address finder with regular expressions is difficult, especially for international addresses.

Discussion

Hi, sorry about the previous "poor explanation". This one is pretty straight-forward (I think), given enough information about "regex"es and Perl. On a side-note, how did you find out the date for the comic? -- NariOx

The explanation was mostly accurate, my gripe was with the fact that a number of fields were empty - the transcript, date and categories weren't done. You can find the date in the "all comics" page, accessible from the sidebar. Davidy22[talk] 23:26, 29 November 2012 (UTC)

*Grumble grumble*, email should be intrinsically treated as 7-bit ASCII. 200MB of email would be almost 210 million bytes, raw. Even assuming half of that is header overheads on top of a whole lot of short message bodies, that's a lot more characters still to be processed than 5 million. Of course, if people send more complex data through common MIME extensions (or the old favourites of uuencoding, etc) so as to send extended characters and binary-attachments (including various compressed formats of data, but again needing significant overheads when not containing large documents within), then that reduces the amount of characters needing searching (but probably needs a few more "use <foo>::<bar>;" bits on your Perl code, in order to give you the tools to isolate such blocks and decode what's in them for further searching within). But things went downhill ever since people could start sending emails with fancy graphical signatures and backgrounds... Yeah, I'm an Old Fogey from the dark ages. 178.98.31.27 00:00, 22 June 2013 (UTC)

I don't think a regex can isolate an address; it has so many forms that it cannot be considered a regular language. This discussion comes to similar conclusions: http://stackoverflow.com/questions/9397485/regex-street-address-match 108.162.246.117 05:41, 1 November 2013 (UTC)