An Automatic Site Survey Tool
By Ray Valdés
Two months ago, I wrote about best practices among client-side technologies in a column titled "The Client-Side Conundrum." I presented an analysis of the home pages of the Web's 500 most-trafficked sites (as listed by Media Metrix). This analysis of client-side technology usage showed that about 80 percent of top-tier sites use constructs such as HTML tables and meta tags, and about 60 percent use JavaScript. About 20 percent of the sites rely on style sheets, and fewer than 2 percent take advantage of either Java applets or ActiveX objects.
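To make the idea of a client-side tally concrete, here is a minimal sketch in Python. It is not the Perl program discussed in this column. The names `FeatureScanner`, `scan_page`, and `survey` are hypothetical, and the detection rules are simplifying assumptions (for instance, any `<script>` element is counted as JavaScript).

```python
from collections import Counter
from html.parser import HTMLParser


class FeatureScanner(HTMLParser):
    """Record which client-side constructs appear in one HTML page."""

    def __init__(self):
        super().__init__()
        self.features = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "table":
            self.features.add("tables")
        elif tag == "meta":
            self.features.add("meta tags")
        elif tag == "script":
            # Assumption: any <script> element counts as JavaScript.
            self.features.add("JavaScript")
        elif tag == "style":
            self.features.add("style sheets")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            self.features.add("style sheets")
        elif tag == "applet":
            self.features.add("Java applets")
        elif tag == "object" and attrs.get("classid"):
            # Assumption: an <object> with a classid is treated as ActiveX.
            self.features.add("ActiveX objects")


def scan_page(html):
    """Return the set of features found in one page's HTML text."""
    scanner = FeatureScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.features


def survey(pages):
    """Return {feature: percentage of pages that use it}."""
    counts = Counter()
    for html in pages:
        counts.update(scan_page(html))
    total = len(pages)
    if total == 0:
        return {}
    return {feature: 100.0 * n / total for feature, n in counts.items()}
```

Each page contributes at most once per feature, so the result reads as "percent of home pages that use X," which is the form of the figures quoted above. Fetching the pages is left out of the sketch.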
A number of readers wrote in to ask for copies of the program I'd written to perform the analysis. I was initially reluctant to inflict this program on the world because I had written it quickly and did not intend it for public consumption. Long ago, I had strung various pieces of Perl code together into something that met my immediate needs, then promptly forgot about it.
For this month's column, I have revamped and generalized the code. The script now goes beyond client-side analysis to gather information about server-side technology usage. One new feature determines which HTTP server software packages are in use at a given set of sites. The program also tallies the home pages that use redirection and cookies.
Before getting into the details of the code, it is worth understanding the big picture behind automatic analysis of Web sites. The best-known example of this kind of analysis comes from Netcraft, an IT consultancy based in the U.K.
The Protocol Does Not Lie
Netcraft publishes an influential survey of Web server usage, which tallies over 17 million public Internet sites and tells us that 60 percent of them use the Apache Web server, compared to 20 percent for Microsoft IIS and 7 percent for iPlanet (formerly Netscape).
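A server-side tally of this kind can be approximated from HTTP response headers alone. The following Python sketch is an illustration under stated assumptions, not Netcraft's method and not the Perl script from this column. It assumes the server identifies itself in the `Server` header, that a 3xx status with a `Location` header means redirection, and that a `Set-Cookie` header means the home page uses cookies. The function names are hypothetical.

```python
import http.client
from collections import Counter


def fetch_home_page_headers(host, timeout=10):
    """Send HEAD / to host; return (status, [(name, value), ...])."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        return resp.status, resp.getheaders()
    finally:
        conn.close()


def summarize(status, headers):
    """Reduce one response to the three facts the survey tallies."""
    # Header names are case-insensitive; keep the first value seen.
    names = {}
    for name, value in headers:
        names.setdefault(name.lower(), value)
    server = names.get("server", "").strip()
    # "Apache/1.3.12 (Unix)" -> "Apache"; a missing header -> "unknown".
    product = server.split("/")[0].strip() or "unknown"
    return {
        "server": product,
        "redirects": 300 <= status < 400 and "location" in names,
        "sets_cookie": "set-cookie" in names,
    }


def tally(responses):
    """Return (Counter of server products, redirect count, cookie count)."""
    servers = Counter()
    redirects = 0
    cookies = 0
    for status, headers in responses:
        info = summarize(status, headers)
        servers[info["server"]] += 1
        redirects += info["redirects"]
        cookies += info["sets_cookie"]
    return servers, redirects, cookies
```

A survey run would call `fetch_home_page_headers` for each host and pass the collected results to `tally`. Note that some servers answer `HEAD` differently from `GET`, so the cookie count in particular may be lower than a full page fetch would show.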