Crawling towards Eternity
Building An Archive of The World Wide Web
By Mike Burner
When you ask what he's up to, Brewster Kahle likes to surprise you. So when designing supercomputers became cliche and everybody was developing new technologies for full-text indexing Brewster decided to archive the Internet. He thought the Internet would fit nicely into a box. The box he had in mind was a tape robot that, if it held enough tapes, could hold terabytes of data. Into this box, Brewster wanted to cram all of the publicly accessible data from the World Wide Web, anonymous FTP sites, USENET news, and public gopher sites.
His vision was that this would become an archive of digital history, available forever as a research tool and time capsule. By visiting this "library," the world would be able to trace the development of technologies, styles, and cultural trends. Brewster knew it would be necessary to transcribe the archive on a continuing basis, lest the media deteriorate or the ability to read it disappear.
Thus was born the Internet Archive. Housed in the Little House on the Presidio in San Francisco, the Archive is actively collecting all the Web data that can be pulled down two T1 lines. This article describes Brewster and his archivists' experiences, and what they have learned from them.
Designing the Archive Crawler
Collecting Web data isn't a black art. You just need a program that can "speak" HTTP and parse the HTML to find links (in the form of URLs) to additional network objects.