
 New Architect > Archives > 1998 > 05 > Webmaster's Domain  

Gaining the Upper Hand with Rude Robots

For some time now I've been hearing complaints about "rude robots": Web-crawling programs that don't abide by the informal robots exclusion standard (RES), which Webmasters use to control the activities of robots that visit their sites (see "Webmaster's Domain," Web Techniques, October 1996 and October 1997). Some Webmasters report being crawled by robots that don't bother to read the "robots.txt" file, or that check for the file but ignore the restrictions it contains. Others tell of being assaulted by robots that don't pause for a decent interval between fetches, producing a load that can swamp a server that generates its pages with CGI scripts or database queries.
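To make the contract concrete, here is a minimal sketch of the check a well-behaved robot is expected to perform before each fetch. It uses Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` user agent, the `www.example.com` URLs, and the `allowed` helper are all hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for illustration; not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("ExampleBot", "http://www.example.com/index.html"))
print(allowed("ExampleBot", "http://www.example.com/cgi-bin/search"))
```

A rude robot, in these terms, is one that either never runs such a check or runs it and fetches the disallowed URL anyway.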

The most famous of the rude robots was Microsoft Internet Explorer 4.0, which comes with a built-in Web crawler (user agent "MSIECrawler") to support its subscribe feature. When Explorer 4.0 was first released, its crawler neither checked for robots.txt nor waited between page fetches. After public criticism, good citizen Microsoft improved its crawler's manners. It now checks for robots.txt, observes a decent delay of about a minute between fetches, and limits the number of pages the crawler can fetch in a single session.
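The three courtesies described above (honoring robots.txt, pausing between fetches, and capping the pages fetched per session) can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not a description of how MSIECrawler is implemented: the `PoliteSession` class, its method names, and the default `max_pages` value are hypothetical, and only the one-minute delay comes from the text.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteSession:
    """Tracks three courtesies: robots.txt rules, a fetch delay, a page cap."""

    def __init__(self, user_agent, robots_lines, delay_seconds=60.0,
                 max_pages=100, clock=time.monotonic, sleep=time.sleep):
        # delay_seconds=60.0 mirrors the "about a minute" figure in the text;
        # max_pages=100 is an arbitrary assumption for this sketch.
        self.user_agent = user_agent
        self.delay_seconds = delay_seconds
        self.max_pages = max_pages
        self._clock = clock
        self._sleep = sleep
        self._rules = RobotFileParser()
        self._rules.parse(robots_lines)
        self._last_fetch = None
        self.pages_fetched = 0

    def may_fetch(self, url):
        """True if the page cap is not reached and robots.txt allows url."""
        if self.pages_fetched >= self.max_pages:
            return False
        return self._rules.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until delay_seconds have passed since the previous fetch,
        then record that a fetch is about to happen."""
        if self._last_fetch is not None:
            remaining = self.delay_seconds - (self._clock() - self._last_fetch)
            if remaining > 0:
                self._sleep(remaining)
        self._last_fetch = self._clock()
        self.pages_fetched += 1
```

A crawler loop would call `may_fetch(url)` first, skip the URL if it returns `False`, and otherwise call `wait_turn()` immediately before requesting the page. The `clock` and `sleep` parameters exist only so the timing logic can be exercised without actually waiting a minute.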

Unfortunately, there are many robots that fail to show such restraint. These include spam-mail crawlers that capture email addresses from personal Web pages, and some commercially available robots designed for text indexing, mirroring, and page validation.






Copyright © 2006 CMP Media, LLC