Gaining the Upper Hand with Rude Robots
By Lincoln D. Stein
For some time now I've been hearing complaints about "rude robots":
Web-crawling programs that don't abide by the informal robots exclusion
standard (RES) that Webmasters use to control the activities of robots
that visit their sites (see "Webmaster's Domain," Web Techniques,
October 1996 and October 1997). Some Webmasters report being crawled by
robots that don't bother to read the "robots.txt" file, or that check for
the file but fail to abide by the restrictions it contains. Others
tell of being assaulted by robots that fail to leave a decent delay between
fetches, producing a load that can swamp a server that generates its pages
with CGI scripts or database queries.
The most famous of the rude robots was Microsoft Internet Explorer
4.0, which comes with a built-in Web crawler (user agent "MSIECrawler")
to support its subscribe feature. When Explorer 4.0 was first released,
its crawler neither checked for robots.txt nor waited between page fetches.
After public criticism, good citizen Microsoft improved its crawler's manners.
It now checks for robots.txt, observes a decent delay of about a minute
between fetches, and limits the number of pages the crawler can fetch in
a single session.
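To make those three courtesies concrete, here is a minimal sketch of how a well-behaved robot might enforce them, written in Python with the standard urllib.robotparser module. It is an illustration only, not Microsoft's implementation: the class name PoliteSession, the user agent "ExampleBot", the example.com URLs, and the default 60-second delay and 50-page cap are all assumptions made for the example.

```python
import urllib.robotparser


class PoliteSession:
    """Hypothetical helper that tracks three courtesies for one crawl:
    obey robots.txt, pause between fetches, and cap the page count."""

    def __init__(self, robots_lines, user_agent,
                 delay_seconds=60.0, max_pages=50):
        # Parse robots.txt rules that the caller has already downloaded.
        self.rules = urllib.robotparser.RobotFileParser()
        self.rules.parse(robots_lines)
        self.user_agent = user_agent
        self.delay_seconds = delay_seconds  # assumed value, about a minute
        self.max_pages = max_pages          # assumed per-session limit
        self.pages_fetched = 0
        self.last_fetch = None              # time of the previous fetch

    def seconds_to_wait(self, now):
        """Return how long the robot should still sleep before fetching."""
        if self.last_fetch is None:
            return 0.0
        return max(0.0, self.last_fetch + self.delay_seconds - now)

    def may_fetch(self, url, now):
        """True only if the cap, robots.txt, and the delay all permit it."""
        if self.pages_fetched >= self.max_pages:
            return False
        if not self.rules.can_fetch(self.user_agent, url):
            return False
        return self.seconds_to_wait(now) == 0.0

    def record_fetch(self, now):
        """Call after each successful fetch to update the counters."""
        self.pages_fetched += 1
        self.last_fetch = now


# Example: a site that closes its CGI directory to all robots.
robots_txt = ["User-agent: *", "Disallow: /cgi-bin/"]
session = PoliteSession(robots_txt, "ExampleBot")
allowed = session.may_fetch("http://www.example.com/index.html", now=0.0)
blocked = session.may_fetch("http://www.example.com/cgi-bin/search", now=0.0)
```

The caller supplies the clock reading (now, in seconds) rather than having the class read the system time, which keeps the policy easy to check without actually waiting a minute between calls. A real robot would pass time.time() and sleep for seconds_to_wait() before each request.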
Unfortunately, there are many robots that fail to show such restraint.
These include spam-mail crawlers that capture email addresses from personal
Web pages, and some commercially available robots designed for text indexing,
mirroring, and page validation.