Parallel User Agents for Link Checking
By Randal L. Schwartz
Roughly two years ago in this column, I took a look at a basic "link verifier" script, using off-the-shelf LWP technology to parse HTML, locate the outward links, and recursively descend the Web tree looking for bad links (see "Programming with Perl," Web Techniques, October 1996). Little did I know the interest I would stir -- it's become one of the most frequently referenced and downloaded scripts of all my columns. In my June 1997 column, I updated the script, adding a forward-backward line-number cross-reference to the listing. But this year, I've got something even cooler!
Just recently, the new "parallel user agent" has reached a fairly stable implementation. This user agent works like the normal LWP user agent, but it also lets me register a number of requests to be performed in parallel within a single process. This means I can scan a Web site in a small fraction of the time it would take otherwise. I decided that this year's annual update to the link checker would be to make it parallel.
As always, the program you see here is not intended as a "ready to run" script, but it's good enough as is that I'm using it to verify my Web site. The program is given in Listing One, and it's a long one, compared to previous columns. I'm gonna be rather brief (contrary to my normal style).