Logs and Analysis
To develop a web site effectively, you will need to analyze its log files regularly; they contain data on everyone who accesses the site. From them you can determine the number of requests made, the identities (IP addresses) of the clients and the pattern of hyperlinks followed across the site. While small-scale information can be gained by viewing the log files manually, this technique is not appropriate for finding large-scale trends. Each request for a page adds 60 bytes or so of data to the log file, and more if images are requested along with the page, which is usually the case. With, say, 200 daily page requests, the pages alone account for about 12 kilobytes, and once the accompanying image requests are counted roughly 50-60 kilobytes of data are added to the log each day. Manual viewing is therefore in reality restricted to small samples of the logs.
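
The size estimate above can be checked with a few lines of Python. This is only a rough sketch: the 60 bytes per request and the 200 daily page requests come from the text, while the figure of four image requests per page is an assumption chosen purely for illustration.

```python
BYTES_PER_REQUEST = 60   # approximate size of one log entry (from the text)


def daily_log_bytes(page_requests, images_per_page):
    """Estimate the number of bytes added to the log per day.

    Each page request is assumed to trigger `images_per_page`
    further requests, each of which adds its own log entry.
    """
    requests_per_page = 1 + images_per_page
    return page_requests * requests_per_page * BYTES_PER_REQUEST


# 200 page requests a day, assuming four images per page (hypothetical).
print(daily_log_bytes(200, 4) / 1024, "KB per day")
```

With no images at all the same function gives 12,000 bytes a day, so the image requests account for most of the growth.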
To analyze the complete logs automatically, we will use Analog, a small yet powerful program that is configurable, scalable and free. It is currently the most popular log file analysis program on the web (a 25% market share according to a GVU report at http://www.gvu.gatech.edu). We will configure it to produce a separate report for each virtual host and to update the reports each morning, and we will make sure that only authorized people can read them.
Manual Logfile Analysis
While manual analysis is not suitable for viewing overall trends, it allows you to interpret the logs with human intelligence. For example, if you notice that lots of visitors request one page and then leave, you may want to investigate ways of encouraging them to stay on your site. Do you provide links to other relevant pages? Are they arriving directly into a frame and being trapped with no links out? Are your pages so large, or your connection so slow, that they give up waiting and leave the site?
You will have chosen where to place your logs when editing httpd.conf. Simply open one in an editor and concentrate on a small section. Below is an extract from access_log on my machine (with the IP addresses replaced by dummy ones):
231.231.231.231 - - [02/Oct/1999:19:47:35 +0000] "GET / HTTP/1.1" 200 9621 "-" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
231.231.231.231 - - [02/Oct/1999:19:47:41 +0000] "GET /trampnetmini.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
231.231.231.231 - - [02/Oct/1999:19:47:58 +0000] "GET /trampnetmini.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
231.231.231.231 - - [02/Oct/1999:19:47:58 +0000] "GET /coach.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /news.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /improve.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /merger.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /chat.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
132.132.132.132 - - [03/Oct/1999:16:30:45 +0000] "POST /cgi-bin/poll.pl?voted HTTP/1.1" 302 291 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
132.132.132.132 - - [03/Oct/1999:16:30:46 +0000] "GET / HTTP/1.1" 200 10137 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
132.132.132.132 - - [03/Oct/1999:16:30:47 +0000] "GET /trampnetmini.gif HTTP/1.1" 200 6971 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
132.132.132.132 - - [03/Oct/1999:16:30:47 +0000] "GET /improve.gif HTTP/1.1" 200 4727 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
132.132.132.132 - - [03/Oct/1999:16:30:49 +0000] "GET /merger.gif HTTP/1.1" 200 4526 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
The first field in each line is the IP address of the client. By following an IP address through the log, you can find the path an individual visitor took through your site. This technique is not foolproof: office networks and ISPs such as AOL that employ proxies represent around 25% of web traffic, and a proxy can cause a single user to appear to come from multiple IP addresses, or serve some pages from its cache without the requests ever appearing in your logs. The technique is accurate for the remaining traffic, and is normally accurate even for access via a proxy server, provided there are not multiple caches. There is as yet no way round this growing problem.
Next come the date and time, followed in double quotes by the request: the method, the requested filename and the version of HTTP. A single slash (/) here represents a directory request, which usually returns index.html. The number immediately following the request is the HTTP status code, which in the extract above is 200 (OK), 304 (Not Modified) or, for the POST request, 302 (a redirect). It is followed by the number of bytes sent, or a dash when no content was returned. Unsuccessful requests, such as those producing 403 (Access forbidden) or 404 (File not found) codes, are also reported in the error_log file.
The next field is the referrer, which in all but the first of the above log entries is http://www.trampolining.net/ (a dash means that no referrer was sent). The identity of the referrer depends on what file is being logged at the time. In the case of images, the referrer is simply the page that contains the image, but in the case of pages, it is the page the browser was previously viewing, which gives a good idea of where your visitors are coming from. The final field identifies the browser and the version of the operating system.
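
The field layout described above can also be expressed as a regular expression. The following Python sketch is one possible way to split an entry into its parts; the group names (ip, ident, user and so on) are labels chosen for this illustration, not names used by Apache, and the pattern assumes the entry is on a single line in the format shown in the extract.

```python
import re

# One named group per field of the log format described above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# An entry taken from the extract above.
sample = (
    '132.132.132.132 - - [03/Oct/1999:16:30:46 +0000] "GET / HTTP/1.1" '
    '200 10137 "http://www.trampolining.net/" '
    '"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"'
)
entry = parse_log_line(sample)
print(entry["ip"], entry["path"], entry["status"], entry["referrer"])
```

All values are returned as strings, so the size field is "-" rather than a number for entries such as the 304 responses.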
As you can see, each page can generate many lines of log, so to make this kind of tracking easier we can cut out some of the unwanted information. To follow the path of just one client, type:
$ grep 231.231.231.231 /usr/local/apache/logs/trampolining_access_log | more
This will display only log entries created by the client with IP address 231.231.231.231.
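There are many log file entries corresponding to images, which are often of little interest. To view only page entries, type:
$ grep 'html HTTP' /usr/local/apache/logs/trampolining_access_log | more
Note that this pattern shows only requests that name a .html file; directory requests such as GET / do not match it.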
You can even view page requests from a single client:
$ grep 'html HTTP' /usr/local/apache/logs/trampolining_access_log | grep 231.231.231.231
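
The same filtering can be sketched in Python, which makes it easy to count directory requests such as / as pages too. The function name and the list of page suffixes are assumptions made for this example, the user agent strings are shortened, and the /news.html entry is made up for illustration.

```python
def client_page_path(lines, client_ip, page_suffixes=(".html", ".htm", "/")):
    """Return the pages requested by one client, in log order."""
    pages = []
    for line in lines:
        fields = line.split('"')
        # fields[0] holds the IP address, identity fields and timestamp;
        # fields[1] holds the request itself.
        if len(fields) < 3 or not fields[0].startswith(client_ip + " "):
            continue
        request = fields[1].split()        # e.g. ['GET', '/', 'HTTP/1.1']
        if len(request) != 3:
            continue
        path = request[1].split("?")[0]    # ignore any query string
        if path.endswith(page_suffixes):
            pages.append(path)
    return pages


log = [
    '231.231.231.231 - - [02/Oct/1999:19:47:35 +0000] "GET / HTTP/1.1" 200 9621 "-" "Mozilla/4.0"',
    '231.231.231.231 - - [02/Oct/1999:19:47:41 +0000] "GET /trampnetmini.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0"',
    '132.132.132.132 - - [03/Oct/1999:16:30:46 +0000] "GET / HTTP/1.1" 200 10137 "http://www.trampolining.net/" "Mozilla/4.0"',
    # Hypothetical entry, not from the real log:
    '231.231.231.231 - - [02/Oct/1999:19:48:20 +0000] "GET /news.html HTTP/1.1" 200 5120 "http://www.trampolining.net/" "Mozilla/4.0"',
]
print(client_page_path(log, "231.231.231.231"))   # ['/', '/news.html']
```

Matching on the start of the line, rather than anywhere in it, avoids picking up entries where the address merely appears in another field.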
A final technique allows you to watch current requests in real time. The command is:
$ tail -f /usr/local/apache/logs/trampolining_access_log
You can make this easier to read by removing the image requests and displaying only page requests:
$ tail -f /usr/local/apache/logs/trampolining_access_log | grep 'html HTTP'
Automatic Analysis
Automatic analysis is the means by which we will obtain an overview of our system's usage. Installing Analog is quite simple:
- Download Analog from http://www.statslab.cam.ac.uk/~sret1/analog/ to /usr/local/analog/.
- Change to the /usr/local/analog directory.
- Open the analhead.h file for editing and change ANALOGDIR to /usr/local/analog/.
- Type make.
We also need to prepare a directory for the reports and populate it with the necessary images:
- Type mkdir /home/www/trampolining.net/analog.
- Copy /usr/local/analog/images/* to /home/www/trampolining.net/analog.
That's it: Analog is ready for use.
Analog is set up using configuration files. The default is analog.cfg, which we will edit now; later on we will create an additional configuration file for each virtual host.
LOGFORMAT specifies the format of the log used. Analog natively supports the Apache formats COMBINED and COMMON. LOGFILE tells Analog where to look for the access log.
LOGFORMAT COMBINED
LOGFILE /usr/local/apache/logs/access_log
HOSTNAME specifies the name to put at the top of the report.
HOSTNAME "www.trampolining.net"
Remember that we told Apache not to resolve IP addresses? The next section tells Analog to resolve them instead, which is much more efficient because each address is resolved only once and then written to the cache file specified in DNSFILE. DNSGOODHOURS is the number of hours to trust an entry in the cache file; DNSBADHOURS is the number of hours to wait before attempting to resolve a bad IP address again. DNS WRITE tells Analog to try to resolve unknown IP addresses and then write them to the dnsfile.txt file. The alternative, DNS READ, would tell Analog to skip IP addresses that are not in the dnsfile.txt file, thus saving time. On the first run, Analog will complain that dnsfile.txt does not exist; ignore this, as Analog will create the file.
DNSFILE /usr/local/analog/dnsfile.txt
DNSGOODHOURS 1250
DNSBADHOURS 350
DNS WRITE
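
The effect of the two time limits can be illustrated with a small Python sketch. It is a model of the rule as described here, not Analog's actual code: the function name is made up, and treating an entry as stale only once its age exceeds the limit is an assumption.

```python
def needs_lookup(age_hours, resolved, good_hours=1250, bad_hours=350):
    """Decide whether a cached DNS entry should be looked up again.

    age_hours: hours since the entry was written to the cache
    resolved:  True if the earlier lookup found a hostname
    """
    # Successful lookups are trusted for longer than failed ones.
    limit = good_hours if resolved else bad_hours
    return age_hours > limit


print(needs_lookup(400, resolved=True))    # still trusted
print(needs_lookup(400, resolved=False))   # old enough to retry
```

With the values above, a successfully resolved address is trusted for about 52 days, while a failed one is retried after roughly two weeks.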
OUTFILE tells Analog where to create the report.
OUTFILE /home/www/trampolining.net/analog/trampolining_net_report.html
HOSTEXCLUDE directives tell Analog to ignore accesses from a certain IP address or hostname. This allows you to report what your visitors do without the results being influenced by your own visits. In this example I exclude all accesses from Cambridge University using Cambridge's IP allocation, and all accesses from York University using the resolved hostnames.
HOSTEXCLUDE 131.111.*.*
HOSTEXCLUDE *.york.ac.uk
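
To see which hosts such wildcard patterns would catch, the matching can be approximated in Python. This sketch uses shell-style matching from the standard fnmatch module as a stand-in; Analog's own wildcard rules may differ in detail, and the function name and test addresses are invented for the example.

```python
from fnmatch import fnmatchcase

# The patterns from the HOSTEXCLUDE lines above.
EXCLUDE_PATTERNS = ["131.111.*.*", "*.york.ac.uk"]


def is_excluded(host, patterns=EXCLUDE_PATTERNS):
    """Return True if the host (IP address or hostname) matches any pattern."""
    host = host.lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)


for host in ["131.111.8.46", "www.york.ac.uk", "231.231.231.231"]:
    print(host, is_excluded(host))
```

Note that a hostname pattern can only match addresses that have actually been resolved, which is why the DNS settings come first.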
If your web site contains pages with extensions other than .htm or .html, for example .jsp or .shtml, you will need to add them here to include them in the page counts; otherwise Analog will treat them like images rather than pages.
PAGEINCLUDE *.htm,*.html,*.shtml
Save your completed file as analog.cfg, then type ./analog (or ./analog +g/other-config-file.cfg if you have an additional config file). If all goes well, you should get a report like this:
[Figure: example Analog report]
You will need to create a configuration file for each virtual host and save it under a different filename, e.g. trampolining_net.cfg. Finally, we are ready to schedule Analog to run each morning. We do this using cron jobs, a Linux feature that allows tasks to be run at regular times. We will need to create a separate job for each report, and run them at different times to prevent multiple simultaneous Analog processes clashing.
Cron jobs are set up by editing your crontab file with crontab -e, which opens it in the vi editor by default (vi is explained in Appendix B). Enter the following lines:
55 0 * * * /usr/local/analog/analog
55 1 * * * /usr/local/analog/analog +g/trampolining_net.cfg
Cron is now set up to run Analog at 0:55 a.m. each morning using the default configuration file, analog.cfg, and again at 1:55 a.m. using the configuration file trampolining_net.cfg.
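
With more than a couple of virtual hosts, writing the staggered entries by hand becomes tedious. The following Python sketch generates one crontab line per configuration file, an hour apart. The function and its parameters are hypothetical helpers for this illustration; only the Analog path and the +g argument come from the text.

```python
def crontab_lines(config_files, analog="/usr/local/analog/analog",
                  first_hour=0, minute=55):
    """Build one crontab line per config file, an hour apart.

    An entry of None stands for the default configuration (analog.cfg),
    which needs no +g argument.
    """
    if first_hour + len(config_files) > 24:
        raise ValueError("too many reports to run one per hour in a single day")
    lines = []
    for offset, config in enumerate(config_files):
        command = analog if config is None else f"{analog} +g{config}"
        lines.append(f"{minute} {first_hour + offset} * * * {command}")
    return lines


# Reproduces the two entries shown above.
for line in crontab_lines([None, "/trampolining_net.cfg"]):
    print(line)
```

An hour between runs is generous for a small site; the gap only needs to be longer than one Analog run takes.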
Our final task is to protect the reports from unwelcome visitors. To do this, we will create a directory container for the /home/www/trampolining.net/analog/ directory in the primary server section of httpd.conf:
<Directory "/home/www/trampolining.net">
Options Indexes FollowSymLinks
Options +Includes
AllowOverride None
Order allow,deny
Allow from all
</Directory>
<Directory "/home/www/trampolining.net/analog">
Order allow,deny
Allow from 123.123.123.123
</Directory>
This will deny access to the contents of www.trampolining.net/analog to anyone except the owner of IP address 123.123.123.123.
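
The way Order allow,deny combines the Allow and Deny lines can be modelled in a few lines of Python. This is a simplified sketch rather than Apache's implementation: the function name is invented, and it handles only complete IP addresses, CIDR ranges and the keyword all, not the hostnames and partial addresses that Apache also accepts.

```python
import ipaddress


def is_allowed(client_ip, allow, deny=()):
    """Mimic 'Order allow,deny': a client gets in only if it matches an
    Allow entry and matches no Deny entry; everyone else is refused."""
    address = ipaddress.ip_address(client_ip)

    def matches(entries):
        return any(entry == "all" or address in ipaddress.ip_network(entry)
                   for entry in entries)

    return matches(allow) and not matches(deny)


# The analog directory: only 123.123.123.123 is let in.
print(is_allowed("123.123.123.123", allow=["123.123.123.123"]))
print(is_allowed("231.231.231.231", allow=["123.123.123.123"]))
# The site as a whole: "Allow from all".
print(is_allowed("231.231.231.231", allow=["all"]))
```

Because the default under this ordering is to refuse, the analog container needs no explicit Deny line.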