
 New Architect > Archives > 1997 > 05 > Features  

Crawling towards Eternity

Building An Archive of The World Wide Web

By Mike Burner

When you ask what he's up to, Brewster Kahle likes to surprise you. So when designing supercomputers became cliché, and everybody was developing new technologies for full-text indexing, Brewster decided to archive the Internet. He thought the Internet would fit nicely into a box. The box he had in mind was a tape robot that, if it held enough tapes, could hold terabytes of data. Into this box, Brewster wanted to cram all of the publicly accessible data from the World Wide Web, anonymous FTP sites, USENET news, and public Gopher sites.

His vision was that this would become an archive of digital history, available forever as a research tool and time capsule. By visiting this "library," the world would be able to trace the development of technologies, styles, and cultural trends. Brewster knew it would be necessary to transcribe the archive on a continuing basis, lest the media deteriorate or the ability to read it disappear.

Thus was born the Internet Archive. Housed in the Little House on the Presidio in San Francisco, the Archive is actively collecting all the Web data that can be pulled down two T1 lines. This article describes the experiences of Brewster and his archivists, and what they have learned from them.

Designing the Archive Crawler

Collecting Web data isn't a black art. You just need a program that can "speak" HTTP and parse the HTML to find links (in the form of URLs) to additional network objects.
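
To make those two ingredients concrete, here is a minimal sketch in Python. It is not the Archive's crawler, which this article does not show; the names LinkParser, extract_links, and fetch are hypothetical, and the example URLs are placeholders. It assumes that looking at href and src attributes is enough to find links, which ignores forms, image maps, and other ways a page can point elsewhere.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the URLs found in href/src attributes of a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL
                # and drop any #fragment, which names a spot inside a page.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Parse the HTML and return the absolute URLs it links to, in page order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, timeout=10):
    """'Speak' HTTP: issue a GET for the URL and return the body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Assumption: fall back to latin-1 when the server names no charset.
        charset = response.headers.get_content_charset() or "latin-1"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    page = (
        '<a href="/about.html">About</a> '
        '<img src="logo.gif"> '
        '<a href="http://example.org/x#top">X</a>'
    )
    for link in extract_links(page, "http://example.com/docs/index.html"):
        print(link)
```

A crawler would call fetch on a starting URL, pass the result to extract_links, and queue whatever comes back for later visits. The sketch leaves out everything that makes doing this at scale hard, such as politeness toward servers, error handling, and remembering what has already been collected.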






Copyright © 2006 CMP Media, LLC