May 12th, 2008

Using Packet Sniffing for Web Analytics


Packet Sniffing
Firstly, a packet sniffer is a simple application that passively listens to any network traffic that runs through or past a network card. When it ‘sniffs’ the network it picks up all the packets for every protocol, such as TCP/IP and ARP; it also picks up encrypted SSL packets.

This all sounds very technical and worlds away from anything related to marketing or web analytics, so how does it fit in?

Well, using a packet sniffer you can pick up all the packets contained within an HTTP or HTTPS request. If it is HTTPS traffic then you can provide the server’s SSL certificate and private key to the packet sniffer and access the requests in their unencrypted form.

Once the packet sniffer has recreated the HTTP and HTTPS traffic it can then create a log file, similar to one created by a web server. From this you can use your favourite web log analyzer to process the log files and provide you with website visitor data.
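
As a rough illustration of that last step, here is a minimal sketch in PHP of how one reassembled request might be written out as a web-server-style log line. The function name, its parameters and the sample values are all assumptions made for the example, and the output follows the Common Log Format rather than the format of any particular packet sniffer.

```php
<?php
// Hypothetical helper (not taken from any real packet sniffer): format one
// reassembled HTTP request as a Common Log Format style line.
function request_to_log_line($clientIp, $timestamp, $rawRequest, $status, $bytes)
{
    // The request line is the first line of the raw request,
    // for example "GET /index.html HTTP/1.1".
    $lines = preg_split('/\r\n|\n/', $rawRequest);
    $requestLine = trim($lines[0]);

    // Timestamp in UTC, in the layout used by Common Log Format.
    $date = gmdate('d/M/Y:H:i:s O', $timestamp);

    return sprintf('%s - - [%s] "%s" %d %d', $clientIp, $date, $requestLine, $status, $bytes);
}

// Sample values are made up for the example.
$raw = "GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n";
echo request_to_log_line('192.0.2.10', 1210550400, $raw, 200, 5120), "\n";
```

A line in this shape is what most web log analysers expect, which is why the sniffer's output can be fed to them in the same way as ordinary server logs.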

So where does packet sniffing fit into the data collection methodologies?

You might already know that the main difference between page tags and log files is that page tag data is collected on the client side whereas log files are generated on the web server. Packet sniffing also takes place on the web server, or at least on the Local Area Network (LAN). This means it has the same problems as log files with proxy caching and so is likely to be less accurate than page tags.

But there are advantages. Packet sniffers pick up every piece of TCP traffic, including form data that has been sent using the POST method, and packet sniffer applications can output that data. For technically minded web analysts there are also loads of performance statistics about the network that are output to the log files.

Another extremely useful aspect of packet sniffers is the ability to amalgamate data from multiple web servers into one log file. For example, let’s say that a large content provider has 20 servers that are load balanced, and in front of them there are 10 proxy servers. If we use standard log files then we need to either use the proxy logs, assuming the proxy servers are all on the same platform and can be configured correctly to output the required information, or cluster the 20 server log files during analysis. Using a packet sniffer in front of the proxies we can pick up all of the data from one point, and because it uses passive sniffing it will not slow down the network traffic.
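
For comparison, the clustering step that separate server logs would otherwise need can be sketched in PHP as follows. This is only an illustration: the entry layout (a `time` Unix timestamp plus the raw `line`) and the function name are assumptions, and real log analysers handle clustering in their own way.

```php
<?php
// Hypothetical helper: merge already-parsed log entries from several servers
// into one list ordered by request time.
function merge_server_logs(array $logsPerServer)
{
    $merged = array();
    foreach ($logsPerServer as $entries) {
        foreach ($entries as $entry) {
            $merged[] = $entry;
        }
    }
    // Sort by the assumed 'time' field (Unix timestamp), oldest first.
    usort($merged, function ($a, $b) {
        if ($a['time'] == $b['time']) {
            return 0;
        }
        return ($a['time'] < $b['time']) ? -1 : 1;
    });
    return $merged;
}

// Made-up sample entries from two servers.
$web1 = array(
    array('time' => 100, 'line' => 'web1: GET /a'),
    array('time' => 300, 'line' => 'web1: GET /c'),
);
$web2 = array(
    array('time' => 200, 'line' => 'web2: GET /b'),
);
foreach (merge_server_logs(array($web1, $web2)) as $entry) {
    echo $entry['line'], "\n";
}
```

With 20 servers this also relies on every server clock being in sync, which is one more thing a single sniffing point avoids.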

In any other situation I would suggest page tags or log files, depending upon your preference. If you are currently using a packet sniffer (like Clipen) in your analytics environment, I would be interested to hear of your experiences, which you can detail in a comment below.

How To Clean Your Web Server Logs

One of the main differences between using web server logs and using page tags as your data collection method for web analytics is that robots and spiders are tracked within web logs but not in page tags.

Now 99% of people looking at their web analytics will want to concentrate on analysing visitor traffic rather than what the robots and spiders are doing and looking at. Given this, it is important when analysing web server logs that you accurately identify robot and spider activity. With this in mind, I have written a short checklist of things to look out for that may indicate a robot or spider.

  • User agent: this is the browser that is displayed in your Browser report. Well-behaved robots and spiders will identify themselves within their user agent string, and as such they can be quickly identified and removed.
  • Visitor duration: this can be seen in a visitor report along with the total time online metric. Given this information you should also be able to see, for one particular day, whether a visitor has viewed over 100 pages in 1 or 2 visits or maybe spent all day on your website. These kinds of behaviour are typical of robots and spiders, and as such this traffic should be removed.
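
The two checks above can be sketched in PHP along these lines. This is only a rough illustration: the keyword list, the 100-page threshold applied per IP address, and the array layout of the log entries are all assumptions, and a real analytics package will have its own robot detection.

```php
<?php
// Assumed keyword list: well-behaved robots usually name themselves with
// words like these somewhere in the user agent string.
function is_robot_agent($userAgent)
{
    return preg_match('/bot|spider|crawl|slurp/i', $userAgent) === 1;
}

// Return the IP addresses that requested more than $maxPages pages.
// $entries is assumed to hold one day's log entries, each an array with an 'ip' key.
function heavy_visitors(array $entries, $maxPages = 100)
{
    $counts = array();
    foreach ($entries as $entry) {
        $ip = $entry['ip'];
        $counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
    }
    $heavy = array();
    foreach ($counts as $ip => $count) {
        if ($count > $maxPages) {
            $heavy[] = $ip;
        }
    }
    return $heavy;
}

// Made-up sample data: 150 requests from one crawler and 1 from an ordinary visitor.
$entries = array_fill(0, 150, array('ip' => '192.0.2.50', 'useragent' => 'Googlebot/2.1'));
$entries[] = array('ip' => '192.0.2.10', 'useragent' => 'Mozilla/5.0');

var_dump(is_robot_agent('Googlebot/2.1'));
print_r(heavy_visitors($entries));
```

Counting per IP address is a crude stand-in for a real visit count, so treat anything it flags as a candidate to check by hand rather than a confirmed robot.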

Other ways in which you can clean data represented in your web server logs are listed below:

  • Excluding internal IP addresses: the aim of this is to exclude any internal traffic that may be accessing the website, as this would skew the visitor behaviour. If you are running on dynamic IP addresses then you may want to think about modifying your browser’s user agent string to include your company name, then exclude internal traffic by user agent. A quick search on Google for modifying your user agent should point you in the right direction for this.
  • Excluding irrelevant pages: if you have pages like favicon.ico and robots.txt or your /admin/ directory then you may wish to exclude this data, as there will be lots of requests for these pages but they are more like website resources than requested page views. Unless you are monitoring your admin area usage, this data will not be of use to you.
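
A filter along these lines could apply both exclusions before analysis. It is a minimal sketch: the internal IP list, the `MyCompany` user agent marker and the entry layout are hypothetical, and the excluded paths are simply the ones mentioned above.

```php
<?php
// Hypothetical settings: replace with your own office addresses and marker.
$internalIps = array('203.0.113.5', '203.0.113.6');
$agentMarker = 'MyCompany';

// Decide whether one log entry should be left out of the analysis.
// $entry is assumed to be an array with 'ip', 'useragent' and 'request' keys.
function is_excluded(array $entry, array $internalIps, $agentMarker)
{
    // Internal traffic identified by a fixed IP address.
    if (in_array($entry['ip'], $internalIps, true)) {
        return true;
    }
    // Internal traffic identified by a marker added to the user agent string.
    if ($agentMarker !== '' && stripos($entry['useragent'], $agentMarker) !== false) {
        return true;
    }
    // Site resources and admin pages rather than real page views.
    $path = (string) parse_url($entry['request'], PHP_URL_PATH);
    if ($path === '/favicon.ico' || $path === '/robots.txt') {
        return true;
    }
    return strpos($path, '/admin/') === 0;
}

// Made-up sample entry: an outside visitor requesting an admin page.
$entry = array(
    'ip' => '198.51.100.7',
    'useragent' => 'Mozilla/5.0',
    'request' => '/admin/index.php?page=1',
);
var_dump(is_excluded($entry, $internalIps, $agentMarker));
```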

The final thing I am going to mention is that an analytics package can only ever report on the data that it is given. So if your page report looks a bit cryptic then you may want to consider rewriting your URLs in a user-friendly way rather than using dynamic URLs. Also, making data available via the query string in the URL will help when it comes to analysing your web server logs.

If you feel that I have missed any other important items that should be removed from your web logs then please post a comment and let me know.

Thanks

Web Log Generator Plugin Version 2

The first version of the web log generator was a quick fix to my problem of not having any analytics for this site. But I’ve had some time to sit down and improve it slightly by adding a 1st party cookie.
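
For readers curious what a first-party cookie adds, here is a minimal sketch of the general idea in PHP. It is not the plugin’s actual code: the cookie name `wlg_visitor`, the two-year lifetime and the function name are assumptions made for the example. The point is that a returning browser sends the same identifier back, so its requests can be grouped together in the logs.

```php
<?php
// Hypothetical cookie name, chosen for this example only.
define('VISITOR_COOKIE', 'wlg_visitor');

// Return array(visitor id, is new) for the given cookie array.
function visitor_id(array $cookies)
{
    // Reuse the identifier if the browser sent a well-formed one back.
    if (isset($cookies[VISITOR_COOKIE]) && preg_match('/^[a-f0-9]{32}\z/', $cookies[VISITOR_COOKIE])) {
        return array($cookies[VISITOR_COOKIE], false);
    }
    // Otherwise create a new random identifier (random_bytes needs PHP 7 or later).
    return array(bin2hex(random_bytes(16)), true);
}

list($visitorId, $isNew) = visitor_id($_COOKIE);
if ($isNew && !headers_sent()) {
    // First-party cookie: set by this site for this site, valid for roughly two years.
    setcookie(VISITOR_COOKIE, $visitorId, time() + 2 * 365 * 24 * 60 * 60, '/');
}
// $visitorId can now be written to the log line as an extra field.
```

The cookie has to be set before any page output is sent, so code like this belongs at the very top of the request.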

The plugin is a bit heavier in size now but that doesn’t appear to have an impact on page loading times. If you want to test this plugin side by side with its previous version then this is possible and should not cause any conflicts.

You can download the logger2.zip file containing the PHP script at http://www.webanalyticmatt.com/plugins/logger2.zip

As usual, just unzip the file into your plugins directory and activate it in your WordPress administration area under plugins. Once activated the screen will go blank, but don’t worry about that. If anyone knows how to get round that little bug then please let me know.

Thanks

Web Log Generator Plugin

I have updated the script for generating daily web logs and turned it into a fully working WordPress plugin.

To install the plugin just download it from here - Web Log Generator.

Then unzip it into your plugins directory /wp-content/plugins. You will need to edit logger.php and define the location of where your logs should be kept. This location will need to have 777 permissions but does not need to be in your public_html or http directory.

Then activate the plugin in the administration area of your website. When you activate it the screen will go blank; that’s quite normal.

Have a wander around your website, then check the location of your log files to see if one has been created. Assuming a log file has been created, just sit and wait for your logs to start filling up.

If you have any problems then please just add a comment to this post and I will reply with either a solution or a fix.

Edit:
The latest version (ver 2.0) of this script can be downloaded here http://www.webanalyticmatt.com/2007/03/30/web-log-generator-plugin-version-2/

I Can’t Access My Log Files!

As a web analyst it is important for me to track who comes to my site. For this reason I have installed the Google Analytics page tag; however, Google’s privacy policy prevents it from displaying detailed data about site visitors. This doesn’t help when trying to segment visitors or do any in-depth analysis.

This website, like many others on the web, is hosted by a third party and they don’t supply me with any log files. So…I decided to create a PHP script that will record web log data into my own log files. There are some problems with this, in that no requests for images, Flash files, PDFs etc. are recorded, but it should record page views quite accurately.

If you are interested in recording log file data but don’t have access to your logs then you can use the script below.

To use it effectively you will need to insert an include for this file in your website’s header.php and also set the $directory variable to where you want the logs to be stored.

Once you have log files you can then import them into a web log analyser for analysis.

Enjoy!

I have created a WordPress plugin based on this script which can be found here - Web Log Generator Wordpress Plugin


<?php
// Define configuration variables: the directory where the logs are stored
$directory = "/home/matt/logs/";
// Today's date, used as the log file name
$today = date("dmy");
// Construct the log entry
$datetime = date("d:m:y:H:i:s:O");
// Prefer an address passed on by a proxy, falling back to the connecting address
if (getenv("HTTP_CLIENT_IP")) $ip = getenv("HTTP_CLIENT_IP");
else if (getenv("HTTP_X_FORWARDED_FOR")) $ip = getenv("HTTP_X_FORWARDED_FOR");
else if (getenv("REMOTE_ADDR")) $ip = getenv("REMOTE_ADDR");
else $ip = "UNKNOWN";
$request = $_SERVER['REQUEST_URI'];
// The referrer and user agent headers are optional, so default to an empty string
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : "";
$useragent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "";
$method = $_SERVER['REQUEST_METHOD'];
$logline = $datetime . " " . $ip . " " . $request . " '" . $referrer . "' '" . $useragent . "' " . $method . "\n";
// Open today's log file in append mode
$logfile = $directory . $today . ".log";
$fh = fopen($logfile, 'a') or die("Can't open file");
// Write the log entry
fwrite($fh, $logline);
fclose($fh);
?>
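
If you want to check the generated files before handing them to a log analyser, the line format the script writes (date, IP address, request, quoted referrer, quoted user agent, method) can be read back with something like the sketch below. The function name and the sample line are made up for the example, and the pattern assumes the referrer does not itself contain a quote followed by a space and another quote.

```php
<?php
// Hypothetical helper: split one line written by the script above into fields.
// Returns null if the line does not match the expected layout.
function parse_log_line($line)
{
    $pattern = '/^(\S+) (\S+) (\S+) \'(.*?)\' \'(.*)\' (\S+)$/';
    if (!preg_match($pattern, rtrim($line, "\r\n"), $m)) {
        return null;
    }
    return array(
        'datetime'  => $m[1],
        'ip'        => $m[2],
        'request'   => $m[3],
        'referrer'  => $m[4],
        'useragent' => $m[5],
        'method'    => $m[6],
    );
}

// Made-up sample line in the same layout as $logline.
$sample = "12:05:08:14:30:00:+0100 192.0.2.10 /about/ 'http://example.com/' 'Mozilla/5.0 (Windows)' GET\n";
print_r(parse_log_line($sample));
```

Most log analysers will need to be told about this custom layout, so a parser like this is also a handy reference when defining the format there.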