Parse for the Course
By Al Williams
In last month's "Java@Work" column, I showed you how to use JavaCC to build parsers for text processing. Although JavaCC is very powerful, sometimes it's more than you need.
Suppose you wanted to strip fancy formatting from arbitrary Web pages to make them more usable from a PDA, or another network appliance with limited display capabilities. You really wouldn't care much about the exact document structure in that case; you'd only need to pick out a few key tags, ignore comments, and extract the text.
Although you could write a full-blown grammar for JavaCC or YACC, that approach seems like overkill in this instance. A better solution would be to write a simple ad hoc parser, one that can read the HTML and process it the same way you might by hand.
The downside of this technique is that it can be difficult to get precise results and make complex changes. On the plus side, it's simpler to understand, and the technique carries over to any programming language. JavaCC, by comparison, is very Java-specific.
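To make the idea concrete, here is a minimal sketch of what "processing it by hand" might look like. It is not the class developed in this column: the names AdHocStripper and stripMarkup are hypothetical, and the sketch assumes the whole page is already in a String. It walks the text once, skips comments, drops every tag, and keeps whatever is left.

```java
// A hedged illustration only; class and method names are made up for this sketch.
class AdHocStripper {

    // Returns the text of an HTML fragment with comments and tags removed.
    static String stripMarkup(String html) {
        StringBuilder text = new StringBuilder();
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            if (lt < 0) {
                // No more markup: the rest is plain text.
                text.append(html, pos, html.length());
                break;
            }
            // Keep the text that precedes the markup.
            text.append(html, pos, lt);
            if (html.startsWith("<!--", lt)) {
                // Comment: skip to the closing "-->" (or to the end if unterminated).
                int end = html.indexOf("-->", lt + 4);
                pos = (end < 0) ? html.length() : end + 3;
            } else {
                // Ordinary tag: skip to the next '>' (or to the end if unterminated).
                // A real tool would inspect the tag name here to handle "key tags".
                int gt = html.indexOf('>', lt + 1);
                pos = (gt < 0) ? html.length() : gt + 1;
            }
        }
        return text.toString();
    }
}
```

Calling AdHocStripper.stripMarkup on a fragment such as a paragraph containing a comment returns only the visible text. The sketch also shows the imprecision mentioned above: it knows nothing about quoted attribute values, scripts, or character entities, so a '>' inside an attribute would end the tag early.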
Implementing A Parser
For flexibility, I decided to tackle this problem in Java with a parser that accepts an InputStream. That way, I could parse a file, a Web site, or anything that can be converted into an InputStream. (Just because I wanted an easier way to do things doesn't mean that I didn't want to reuse my finished code.) I wrote a general-purpose class that contains all of the parsing logic, which you can see in