One or More Things I Learned From Regular Expressions
By Dale Dougherty
The real business of business-to-business commerce on the Web is exchanging information. In the past, businesses might have exchanged information by sending it on tape or disk, where the receiving company would convert the data and import it into its own database. With the Internet, data can be exchanged in an instant. Still, the question is, how do I get data out of my database and into your database? Or data from your database into my database?
This is a data-conversion problem. To exchange information, each of us must write programs to import or export it. Writing conversion programs that read or write a simple format like a comma-delimited file is easy. Writing conversion programs for unstructured text like HTML documents is more challenging, but fun. I wrote a book, Sed and Awk, about two UNIX utilities used to build conversion tools. My book could have been titled All I Know About Programming I Learned from Writing Regular Expressions.
A regular expression is a way to identify patterns in text. Programs like sed, grep, awk, Perl, and even the vi editor let you use regular expressions to match text and then perform various manipulations. For instance, a simple sed script could extract all the headings from a set of HTML files. The regular expression <H[1-3]> will match lines containing <H1>, <H2>, or <H3>. When running the script on a sample file, I might notice that although the program does indeed match what was specified, it misses a lowercase <h2>, for example.
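To make that experiment concrete, here is a minimal sketch in Python rather than sed. The sample lines and the helper name matching_lines are invented for illustration, and the case-insensitive variant at the end is only one possible adjustment, not something the text above prescribes.

```python
import re

# The pattern from the text: an opening heading tag, levels 1 to 3, uppercase H only.
HEADING = re.compile(r"<H[1-3]>")

# Hypothetical sample input, invented for this illustration.
SAMPLE_LINES = [
    "<H1>Introduction</H1>",
    "<P>Some body text.</P>",
    "<h2>Background</h2>",
    "<H3>Details</H3>",
    "<H4>Too deep</H4>",
]


def matching_lines(pattern, lines):
    """Return the lines in which the pattern is found anywhere."""
    return [line for line in lines if pattern.search(line)]


# The pattern as specified: it does what it says, so the lowercase <h2> line is missed.
strict = matching_lines(HEADING, SAMPLE_LINES)

# One possible adjustment (an assumption, not from the text): ignore letter case.
relaxed = matching_lines(re.compile(r"<H[1-3]>", re.IGNORECASE), SAMPLE_LINES)

print(strict)
print(relaxed)
```

The first list contains only the <H1> and <H3> lines; the second also picks up the <h2> line. The pattern was never wrong, it was just narrower than the data it was pointed at.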