Automated Web-Page Monitoring
Tracking Changes to Web Sites
By Andrew Davison
Many Web sites are in a state of constant change, both because Web publishers want to keep material fresh and because of the nature of the information presented on them: weather reports, Web-based magazines, sports results, and corporate financial pages, to name a few. Keeping up with these changes can be time-consuming, but neglecting to do so can cause you to miss useful, sometimes essential, updates. A change-monitoring application would eliminate the tedium of checking for changes and ensure that you don't miss an update: the software will report any changes soon after they occur.
This article presents two C programs. The first observes a specified URL by periodically checking its last modification date using HTTP HEAD messages. Only when the date changes is the page itself downloaded (using an HTTP GET message) and compared with an earlier version. The comparison is carried out using the UNIX diff utility, and the changes are reported to stdout. The second program is a variant of the first that reports changes by sending email messages to a specified address.
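As a rough illustration of the HEAD-based check described above, the following sketch builds a HEAD request and pulls the Last-Modified value out of a response's headers. It is not the article's code: the function names build_head_request, extract_last_modified, and page_changed are hypothetical, the request is assumed to be HTTP/1.0 with an added Host header, and the socket handling needed to send the request is omitted.

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Hypothetical helper: build an HTTP/1.0 HEAD request for host and path.
   Returns the request length, or -1 if buf is too small. */
int build_head_request(const char *host, const char *path,
                       char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen,
                     "HEAD %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/* Case-insensitive test: does line begin with the given header name? */
static int has_prefix_nocase(const char *line, const char *name)
{
    while (*name) {
        if (tolower((unsigned char)*line) != tolower((unsigned char)*name))
            return 0;
        line++;
        name++;
    }
    return 1;
}

/* Hypothetical helper: copy the Last-Modified value from a block of
   response headers into out. Returns 1 if the header is present, else 0. */
int extract_last_modified(const char *headers, char *out, size_t outlen)
{
    const char *name = "Last-Modified:";
    const char *line = headers;

    if (outlen == 0)
        return 0;
    while (line && *line) {
        if (has_prefix_nocase(line, name)) {
            const char *val = line + strlen(name);
            size_t len;
            while (*val == ' ' || *val == '\t')   /* skip leading blanks */
                val++;
            len = strcspn(val, "\r\n");           /* value ends at line end */
            if (len >= outlen)
                len = outlen - 1;                 /* truncate to fit */
            memcpy(out, val, len);
            out[len] = '\0';
            return 1;
        }
        line = strchr(line, '\n');                /* advance to next line */
        if (line)
            line++;
    }
    return 0;
}

/* The page counts as changed when the date string differs from the
   one saved on the previous check. */
int page_changed(const char *saved_date, const char *new_date)
{
    return strcmp(saved_date, new_date) != 0;
}
```

Comparing the date strings directly, rather than parsing them into time values, is a simplification: any difference in the header is treated as a change, which is enough to decide whether a GET is worthwhile. A server that omits Last-Modified would need a fallback, such as always downloading the page.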
Design Issues
First, you must decide how often your monitoring program should check for changes. For greatest flexibility, this should be determined by the user, who will probably have some prior knowledge of how often the page changes. You'll also need to determine what the program should actually monitor. One solution is to periodically retrieve the page and check the text against a copy obtained at the start of the observation, but if the page rarely changes, this is inefficient.
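One way to leave the checking frequency to the user is to accept it as a command-line argument and validate it before monitoring starts. The sketch below assumes the interval is given in whole seconds; the function name parse_interval and the bounds MIN_INTERVAL_SECS and MAX_INTERVAL_SECS are hypothetical choices for illustration, not values taken from the article's programs.

```c
#include <stdlib.h>
#include <errno.h>

/* Hypothetical bounds: refuse to poll more often than every 10 seconds
   or less often than once a week. */
#define MIN_INTERVAL_SECS 10L
#define MAX_INTERVAL_SECS (7L * 24L * 60L * 60L)

/* Parse a polling interval, in seconds, from a command-line string.
   Returns the interval, or -1 if the text is not a whole number
   within the allowed range. */
long parse_interval(const char *text)
{
    char *end;
    long secs;

    if (text == NULL || *text == '\0')
        return -1;
    errno = 0;
    secs = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0')   /* overflow or trailing junk */
        return -1;
    if (secs < MIN_INTERVAL_SECS || secs > MAX_INTERVAL_SECS)
        return -1;
    return secs;
}
```

A monitoring loop could then call sleep() with the returned value between checks. The lower bound is there to avoid flooding the server with requests when a user supplies a very small interval.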