
 New Architect > Archives > 2001 > 10 > Multilingual Methods  

Parse for the Course

By Al Williams

In last month's "Java@Work" column, I showed you how to use JavaCC to build parsers for text processing. Although JavaCC is very powerful, sometimes it's more than you need.

Suppose you wanted to strip fancy formatting from arbitrary Web pages to make them more usable from a PDA, or another network appliance with limited display capabilities. You really wouldn't care much about the exact document structure in that case—you'd only need to pick out a few key tags, ignore comments, and extract the text.

Although you could write a full-blown grammar for JavaCC or YACC, that approach seems like overkill in this instance. A better solution would be to write a simple ad hoc parser, one that can read the HTML and process it the same way you might by hand.
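To make the idea concrete, here is a minimal sketch of what such an ad hoc parser might look like. It is not the class from this column: the names `AdHocHtmlStripper`, `strip`, `stripString`, and `BREAK_TAGS` are hypothetical, the choice of "key tags" that become line breaks is an assumption, and the input is assumed to be UTF-8. It walks the input one character at a time with three states (text, tag, comment), much as you might scan the page by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class AdHocHtmlStripper {
    private enum State { TEXT, TAG, COMMENT }

    // Hypothetical list of "key tags": each one becomes a line break in the output.
    private static final Set<String> BREAK_TAGS = new HashSet<>(Arrays.asList(
        "p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"));

    // Reads HTML from the stream and returns the text with tags and comments removed.
    static String strip(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8); // assumed encoding
        StringBuilder out = new StringBuilder();
        StringBuilder tag = new StringBuilder();
        State state = State.TEXT;
        int dashes = 0; // consecutive '-' characters seen inside a comment
        int c;
        while ((c = reader.read()) != -1) {
            char ch = (char) c;
            switch (state) {
                case TEXT:
                    if (ch == '<') {
                        tag.setLength(0);
                        state = State.TAG;
                    } else {
                        out.append(ch);
                    }
                    break;
                case TAG:
                    if (ch == '>') {
                        if (isBreakTag(tag.toString())) {
                            out.append('\n');
                        }
                        state = State.TEXT;
                    } else {
                        tag.append(ch);
                        if (tag.length() == 3 && tag.toString().equals("!--")) {
                            dashes = 0;
                            state = State.COMMENT; // skip everything up to "-->"
                        }
                    }
                    break;
                case COMMENT:
                    if (ch == '-') {
                        dashes++;
                    } else {
                        if (ch == '>' && dashes >= 2) {
                            state = State.TEXT;
                        }
                        dashes = 0;
                    }
                    break;
            }
        }
        return out.toString();
    }

    // True if the raw tag contents (without the angle brackets) name a break tag.
    private static boolean isBreakTag(String rawTag) {
        String t = rawTag.trim().toLowerCase(Locale.ROOT);
        if (t.startsWith("/")) {
            t = t.substring(1); // treat closing tags like opening tags
        }
        int end = 0;
        while (end < t.length() && Character.isLetterOrDigit(t.charAt(end))) {
            end++;
        }
        return BREAK_TAGS.contains(t.substring(0, end));
    }

    // Convenience wrapper for in-memory strings.
    static String stripString(String html) {
        try {
            return strip(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `AdHocHtmlStripper.stripString(page)` on a small page returns its visible text with a newline wherever a break tag opened or closed. The sketch shows the tradeoff plainly: it is easy to follow, but it does not decode entities such as `&amp;`, and a `>` inside a quoted attribute value would end the tag early.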

The downside of this technique is that it can be difficult to get precise results or to make complex changes later. On the plus side, it's simpler to understand, and you can apply it in nearly any programming language. JavaCC, by comparison, is tied to Java.

Implementing A Parser

For flexibility, I decided to tackle this problem in Java with a parser that accepts an InputStream. That way, I could parse a file, a Web site, or anything that can be converted into an InputStream. (Just because I wanted an easier way to do things doesn't mean that I didn't want to reuse my finished code.) I wrote a general purpose class that contains all of the parsing logic, which you can see in





Copyright © 2006 CMP Media, LLC