May 17th, 2008

How To Clean Your Web Server Logs

One of the main differences between using web server logs and using page tags as your data collection method for web analytics is that robots and spiders are tracked within web logs but not in page tags.

The vast majority of people looking at their web analytics want to concentrate on analysing visitor traffic rather than on what robots and spiders are doing. Given this, it is important when analysing web server logs that you accurately identify robot and spider activity. With this in mind, I have written a short checklist of things to look out for that may indicate a robot or spider.

  • User agent: this is the browser that is displayed in your Browser report. Well-behaved robots and spiders identify themselves within their user agent string, so they can be quickly identified and removed.
  • Visitor duration: this can be seen in a visitor report along with the total time online metric. Given this information, you should be able to see, for any particular day, whether a visitor has viewed over 100 pages in 1 or 2 visits or spent all day on your website. These kinds of behaviour are typical of robots and spiders, and such visitors should be removed.
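
The two checks above can be sketched in Python. This is only an illustration: it assumes each log line has already been parsed into a dictionary with hypothetical `ip`, `day` and `user_agent` keys, and the marker list and the 100-page threshold are assumptions you would tune against your own logs.

```python
from collections import Counter

# Hypothetical substrings; real robots use many more identifiers than these.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_declared_robot(user_agent):
    """True if the user agent string contains one of the robot markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def heavy_visitors(hits, max_pages_per_day=100):
    """Return the (ip, day) pairs that requested more than max_pages_per_day pages."""
    counts = Counter((hit["ip"], hit["day"]) for hit in hits)
    return {key for key, n in counts.items() if n > max_pages_per_day}

# Made-up sample data: one declared robot, one ordinary visitor,
# and one visitor that requested 150 pages in a single day.
hits = [
    {"ip": "10.0.0.1", "day": "2008-05-17",
     "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"ip": "10.0.0.2", "day": "2008-05-17",
     "user_agent": "Mozilla/5.0 (Windows; U)"},
]
hits += [{"ip": "10.0.0.3", "day": "2008-05-17", "user_agent": "Mozilla/4.0"}] * 150

suspects = heavy_visitors(hits)
clean = [
    h for h in hits
    if not is_declared_robot(h["user_agent"])
    and (h["ip"], h["day"]) not in suspects
]
```

Identifying a visitor by IP address alone is a simplification; several people behind one proxy would be counted together, so treat the flagged visitors as candidates to review rather than certain robots.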

Other ways in which you can clean data represented in your web server logs are listed below:

  • Excluding internal IP addresses: the aim of this is to exclude any internal traffic that may be accessing the website, as this would skew the visitor behaviour. If you are running on dynamic IP addresses then you may want to think about modifying your browser’s user agent string to include your company name, then excluding internal traffic by user agent. A quick search on Google for modifying your user agent should point you in the right direction for this.
  • Excluding irrelevant pages: if you have pages like favicon.ico and robots.txt, or your /admin/ directory, then you may wish to exclude this data, as there will be lots of requests for these pages but they are more like web site resources than requested page views. Unless you are monitoring your admin area usage, this data will not be of use to you.
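
As a rough sketch, both exclusions can be combined into one filter. The network range, the `ExampleCorp` user agent tag and the dictionary keys (`ip`, `user_agent`, `path`) are all hypothetical stand-ins for your own values.

```python
import ipaddress

# Assumed internal range; replace with your own office network(s).
INTERNAL_NETWORKS = [ipaddress.ip_network("192.168.0.0/16")]
# Hypothetical company name added to internal browsers' user agent strings.
INTERNAL_UA_TAG = "ExampleCorp"

EXCLUDED_PATHS = ("/favicon.ico", "/robots.txt")
EXCLUDED_PREFIXES = ("/admin/",)

def is_internal(ip, user_agent):
    """True if the hit comes from an internal network or a tagged browser."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS) or INTERNAL_UA_TAG in user_agent

def is_irrelevant(path):
    """True for site resources and admin pages that are not real page views."""
    return path in EXCLUDED_PATHS or path.startswith(EXCLUDED_PREFIXES)

def keep(hit):
    """True if the hit should stay in the cleaned data."""
    return not is_internal(hit["ip"], hit["user_agent"]) and not is_irrelevant(hit["path"])
```

For example, `keep({"ip": "203.0.113.5", "user_agent": "Mozilla/5.0", "path": "/products"})` is true, while the same hit with the path `/favicon.ico` or the IP `192.168.1.20` is filtered out.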

The final thing I am going to mention is that an analytics package can only ever report on the data that it is given. So if your page report looks a bit cryptic, you may want to consider rewriting your URLs in a user-friendly way rather than using dynamic URLs. Making data available via the querystring in the URL will also help when it comes to analysing your web server logs.
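
To illustrate why querystring data helps, here is a small sketch that turns a logged URL into a more readable page label. The parameter names `category` and `id` and the example URL are made up; your own site will use different ones.

```python
from urllib.parse import urlsplit, parse_qs

def page_label(url, keys=("category", "id")):
    """Build a readable label from the path plus selected querystring values."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # Keep only the parameters of interest, in the order given by `keys`.
    details = [f"{k}={params[k][0]}" for k in keys if k in params]
    if details:
        return parts.path + " [" + ", ".join(details) + "]"
    return parts.path

label = page_label("/product.php?id=42&category=shoes&sessionid=abc")
# label is "/product.php [category=shoes, id=42]"
```

Parameters that carry no meaning for reporting, such as the session identifier in this example, are simply left out of the label.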

If you feel that I have missed any other important items that should be removed from your web logs then please post a comment and let me know.

Thanks
