Network Computing (powered by InformationWeek Business Technology Network)

Chapter 5: Deploying Web and FTP Servers (Part Two)

June 5, 2000


Logs and Analysis

To develop a web site effectively, you will need to regularly analyze the web site's log files, which contain data on everyone who accesses the site. From them you can determine the number of requests made, the identities (IP addresses) of the clients and the pattern of hyperlinks that are followed across the web site. While small-scale information can be gained by manually viewing the log files, this technique is not appropriate for finding large-scale trends. Each request for a page creates 60 bytes or so of data that is added to the log file, and more if images are requested along with the pages, which is usually the case. Multiplying this number by, say, 200 daily page requests, and allowing for the accompanying image requests, means that roughly 50-60 kilobytes of data are added to the log each day. Manual viewing is therefore restricted in practice to small samples of the logs.
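The arithmetic behind that estimate can be sketched as follows. The figure of five log entries per page view (the page plus four images) is an assumption chosen for illustration, not a number from the text.

```python
def daily_log_growth(page_requests, bytes_per_entry, entries_per_page=1):
    """Estimate the number of bytes appended to the access log per day."""
    return page_requests * entries_per_page * bytes_per_entry

# Pages alone: 200 requests at roughly 60 bytes each.
pages_only = daily_log_growth(200, 60)

# Assumption: each page view also logs four image requests (five entries in all).
with_images = daily_log_growth(200, 60, entries_per_page=5)

print(pages_only, with_images)
```

With these assumed inputs the pages alone account for about 12 kilobytes a day, and the image requests bring the total to the 50-60 kilobyte range quoted above.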

 

To analyze the complete logs automatically, we will be using Analog, a small yet powerful program that is configurable, scalable and free. It is currently the most popular log file analysis program on the web (a 25% market share according to a GVU report at http://www.gvu.gatech.edu). We will configure it to produce a separate report for each virtual host and to update the reports each morning, and we will make the reports readable only by authorized people.

Manual Logfile Analysis

While manual analysis is not suitable for viewing overall trends, it allows you to interpret the logs with human intelligence. For example, if you notice that lots of visitors request one page and then leave, you may want to investigate ways of encouraging them to stay on your site. Do you provide links to other relevant pages? Are they arriving directly into a frame and being trapped with no links out? Are your pages so large, or your connection so slow, that visitors give up waiting and leave the site?

 

You will have chosen where to place your logs when editing httpd.conf. Simply open one in an editor and concentrate on a small section. Below is an extract from access_log on my machine (with the IP addresses replaced by dummy ones):

 

231.231.231.231 - - [02/Oct/1999:19:47:35 +0000] "GET / HTTP/1.1" 200 9621 "-" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:41 +0000] "GET /trampnetmini.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:58 +0000] "GET /trampnetmini.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:58 +0000] "GET /coach.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /news.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /improve.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /merger.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /chat.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:45 +0000] "POST /cgi-bin/poll.pl?voted HTTP/1.1" 302 291 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:46 +0000] "GET / HTTP/1.1" 200 10137 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:47 +0000] "GET /trampnetmini.gif HTTP/1.1" 200 6971 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:47 +0000] "GET /improve.gif HTTP/1.1" 200 4727 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:49 +0000] "GET /merger.gif HTTP/1.1" 200 4526 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

 

The first field in each line is the IP address of the client. By following an IP address through the log, you can find the path an individual visitor took through your site. This technique is not always reliable: office networks and ISPs such as AOL that employ proxies represent around 25% of web traffic, and they can cause a single user to appear to come from multiple IP addresses, or allow users to receive some pages from a cache without the requests appearing in your logs. The technique is accurate the rest of the time, and is normally accurate even for access via a proxy server, provided there are not multiple caches. There is as yet no way round this growing problem.

 

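After the IP address come two fields shown here as hyphens: the remote identity and the authenticated user name, both empty in this extract. There follows the date and time, then, in double quotes, the request method, the requested filename and the version of HTTP. A single slash (/) here represents a directory request, which usually returns index.html. The number immediately following the request is the HTTP status code: 200 (OK), 302 (a redirect) or 304 (Not Modified) in the extract above. After it comes the number of bytes sent, or a hyphen when no body was returned. Unsuccessful requests, i.e. those producing 403 (Access forbidden) or 404 (File not found) codes, are also reported in the error_log file.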

 

The next field is the referrer, which in most of the above log entries is http://www.trampolining.net/ (a hyphen, as in the first entry, means the browser sent no referrer). What the referrer identifies depends on the file being requested. In the case of images, the referrer is simply the page that contains the image, but in the case of pages, it is the page the browser was previously viewing, which gives a good idea of where your visitors are coming from. The final field is the browser's user-agent string, which identifies the browser and the version of the operating system.
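The field layout just described can be made concrete with a short parsing sketch. The regular expression and the parse_line helper are illustrative names, not part of Apache or Analog, and the pattern assumes that no quoted field contains an embedded double quote.

```python
import re

# Combined log format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Split one combined-format log line into a dict, or return None."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A hyphen in the size field means no body was sent (e.g. a 304 reply).
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    parts = entry["request"].split()
    entry["path"] = parts[1] if len(parts) >= 2 else None
    return entry

sample = (
    '231.231.231.231 - - [02/Oct/1999:19:47:41 +0000] '
    '"GET /trampnetmini.gif HTTP/1.1" 304 - '
    '"http://www.trampolining.net/" '
    '"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"'
)
entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["referrer"])
```

Returning None for lines that do not match lets a caller skip malformed entries rather than stop at the first one.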

 

As you can see, each page can generate many log lines, so to make a visitor's path easier to follow, we can cut out some of the unwanted information. To follow the path of just one client, type:

 

$ grep -F '231.231.231.231' /usr/local/apache/logs/trampolining_access_log | more

 

This will display only log entries created by the client with IP address 231.231.231.231.
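The same filtering can be sketched in Python, which makes it easy to keep only the timestamp and the request from each matching line. The client_path function is a hypothetical helper, and the sample lines are taken from the extract above.

```python
def client_path(lines, client_ip):
    """Return (timestamp, request) pairs for one client, in log order."""
    path = []
    for line in lines:
        host, _, rest = line.partition(" ")
        if host != client_ip:
            continue  # exact comparison on the first field only
        if "[" not in rest or '"' not in rest:
            continue  # skip malformed lines
        timestamp = rest.split("[", 1)[1].split("]", 1)[0]
        request = rest.split('"')[1]
        path.append((timestamp, request))
    return path

agent = '"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"'
log = [
    '231.231.231.231 - - [02/Oct/1999:19:47:35 +0000] "GET / HTTP/1.1" 200 9621 "-" ' + agent,
    '132.132.132.132 - - [03/Oct/1999:16:30:46 +0000] "GET / HTTP/1.1" 200 10137 "http://www.trampolining.net/" ' + agent,
    '231.231.231.231 - - [02/Oct/1999:19:47:58 +0000] "GET /coach.gif HTTP/1.1" 304 - "http://www.trampolining.net/" ' + agent,
]

for timestamp, request in client_path(log, "231.231.231.231"):
    print(timestamp, request)
```

Comparing the first field exactly, rather than searching the whole line, avoids matching an address that happens to appear in a referrer.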

 

There are many log file entries corresponding to images, which are often of little interest. To view only page entries, type:

 

$ grep 'html HTTP' /usr/local/apache/logs/trampolining_access_log | more

 

You can even view page requests from a single client:

 

$ grep 'html HTTP' /usr/local/apache/logs/trampolining_access_log | grep -F '231.231.231.231'

 

A final technique allows you to watch current requests in real time. This command is:

 

$ tail -f /usr/local/apache/logs/trampolining_access_log

 

You can make this easier to read by removing the image requests and displaying only page requests:

 

$ tail -f /usr/local/apache/logs/trampolining_access_log | grep 'html HTTP'
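The 'html HTTP' pattern matches only requests whose path ends in html, so directory requests such as / are not shown. The sketch below counts those as pages too, and then picks out clients that requested exactly one page, which is the "one page then leaving" case raised at the start of this section. The function names and the /news.html path in the sample data are hypothetical.

```python
from collections import Counter

PAGE_SUFFIXES = (".html", ".htm")

def is_page(path):
    """Treat directory requests and .htm/.html files as pages."""
    path = path.split("?", 1)[0]  # ignore any query string
    return path.endswith("/") or path.endswith(PAGE_SUFFIXES)

def pages_per_client(lines):
    """Count page requests for each client IP address."""
    counts = Counter()
    for line in lines:
        fields = line.split('"')
        if len(fields) < 3:
            continue  # no quoted request field
        host = fields[0].split(" ", 1)[0]
        request = fields[1].split()
        if len(request) >= 2 and is_page(request[1]):
            counts[host] += 1
    return counts

def single_page_clients(lines):
    """Return the clients that requested exactly one page."""
    return sorted(h for h, n in pages_per_client(lines).items() if n == 1)

sample_log = [
    '231.231.231.231 - - [02/Oct/1999:19:47:35 +0000] "GET / HTTP/1.1" 200 9621 "-" "Mozilla/4.0"',
    '231.231.231.231 - - [02/Oct/1999:19:47:58 +0000] "GET /coach.gif HTTP/1.1" 304 - "-" "Mozilla/4.0"',
    '132.132.132.132 - - [03/Oct/1999:16:30:45 +0000] "POST /cgi-bin/poll.pl?voted HTTP/1.1" 302 291 "-" "Mozilla/4.0"',
    '132.132.132.132 - - [03/Oct/1999:16:30:46 +0000] "GET / HTTP/1.1" 200 10137 "-" "Mozilla/4.0"',
    '132.132.132.132 - - [03/Oct/1999:16:31:10 +0000] "GET /news.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]
print(single_page_clients(sample_log))
```

Because of the proxy effects described earlier, a count per IP address is only an approximation of a count per visitor.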

Automatic Analysis

Automatic analysis is how we will obtain an overview of our system's usage. Installing Analog is quite simple:

 

• Download Analog from http://www.statslab.cam.ac.uk/~sret1/analog/ to /usr/local/analog/.

• Change to the /usr/local/analog directory.

• Open the analhead.h file for editing and change ANALOGDIR to /usr/local/analog/.

• Type make.

 

We also need to prepare a directory for the reports and populate it with the necessary images:

 

• Type mkdir /home/www/trampolining.net/analog.

• Copy /usr/local/analog/images/* to /home/www/trampolining.net/analog.

 

That's it: Analog is ready for use.

 

Analog is set up using configuration files. The default is analog.cfg, which we will edit now; later on we will create an additional configuration file for each virtual host.

 

LOGFORMAT specifies the format of log used. Analog natively supports the Apache formats COMBINED and COMMON. LOGFILE tells Analog where to look for the access log.

 

LOGFORMAT COMBINED

LOGFILE /usr/local/apache/logs/access_log

 

HOSTNAME specifies the name to put at the top of the report.

 

HOSTNAME "www.trampolining.net"

 

Remember we told Apache not to resolve IP addresses? This section tells Analog to resolve them instead, which is much more efficient because each address is resolved only once and then written to the cache file specified in DNSFILE. DNSGOODHOURS is the number of hours to trust an entry in the cache file; DNSBADHOURS is the number of hours to wait before attempting to resolve a bad IP address again. DNS WRITE tells Analog to try to resolve unknown IP addresses, then write them to the dnsfile.txt file. The alternative command DNS READ would tell Analog to skip IP addresses that are not in the dnsfile.txt file, thus saving time. On the first run, Analog will complain that dnsfile.txt does not exist. Ignore this; Analog will create the file.

 

DNSFILE /usr/local/analog/dnsfile.txt

DNSGOODHOURS 1250

DNSBADHOURS 350

DNS WRITE
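To illustrate the idea behind the two expiry settings, here is a minimal sketch of a lookup cache with separate lifetimes for good and bad entries. It is not Analog's implementation: the DnsCache class, the injected resolver and clock, and the host.example.org name are all assumptions made for the example.

```python
import time

class DnsCache:
    """Cache lookups, trusting good and failed results for different periods."""

    def __init__(self, resolver, good_hours=1250, bad_hours=350, clock=time.time):
        self.resolver = resolver              # callable: ip -> hostname or None
        self.good_seconds = good_hours * 3600
        self.bad_seconds = bad_hours * 3600
        self.clock = clock                    # injectable so the demo can fake time
        self.entries = {}                     # ip -> (hostname or None, time stored)

    def lookup(self, ip):
        now = self.clock()
        if ip in self.entries:
            name, stored = self.entries[ip]
            max_age = self.good_seconds if name is not None else self.bad_seconds
            if now - stored <= max_age:
                return name                   # cached entry is still trusted
        name = self.resolver(ip)              # resolve, or retry a stale failure
        self.entries[ip] = (name, now)
        return name

# Demo with a fake resolver and a fake clock (hypothetical data).
now = [0.0]
calls = []

def fake_resolver(ip):
    calls.append(ip)
    return "host.example.org" if ip == "231.231.231.231" else None

cache = DnsCache(fake_resolver, clock=lambda: now[0])
cache.lookup("231.231.231.231")   # resolved and cached
cache.lookup("132.132.132.132")   # fails; the failure is cached too
now[0] = 351 * 3600               # 351 hours later
cache.lookup("231.231.231.231")   # still within the 1250 good hours
cache.lookup("132.132.132.132")   # past the 350 bad hours, so retried
print(len(calls))
```

The point of the longer lifetime for good entries is that a hostname that resolved once rarely changes, while a failed lookup is worth retrying sooner.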

 

This directive tells Analog where to create the report.

 

OUTFILE /home/www/trampolining.net/analog/trampolining_net_report.html

 

HOSTEXCLUDE directives tell Analog to ignore accesses from a certain IP address or hostname. This allows you to report what your visitors do, without being influenced by your own visits! In this example I exclude all page accesses from Cambridge University using Cambridge's IP allocation, and exclude all accesses from York University using the resolved hostnames.

 

HOSTEXCLUDE 131.111.*.*

HOSTEXCLUDE *.york.ac.uk
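The wildcard matching in these two directives can be sketched with shell-style patterns. This is only an approximation of the idea, not Analog's own matching code, and is_excluded is a hypothetical helper; the sample addresses and hostnames are made up.

```python
from fnmatch import fnmatchcase

EXCLUDES = ["131.111.*.*", "*.york.ac.uk"]

def is_excluded(host, patterns=EXCLUDES):
    """Return True if the host string matches any exclusion pattern."""
    host = host.lower()  # hostnames are case-insensitive
    return any(fnmatchcase(host, pattern) for pattern in patterns)

for host in ["131.111.8.46", "www.York.ac.uk", "231.231.231.231"]:
    print(host, is_excluded(host))
```

Note that in this sketch each pattern only matches the form of the host it was written for: the numeric pattern matches an unresolved address, and the hostname pattern matches a resolved name.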

 

If your web site contains pages with extensions other than .htm or .html, for example .jsp or .shtml, you will need to add them here to include them in the page counts; otherwise Analog will not treat them as pages.

 

PAGEINCLUDE *.htm,*.html,*.shtml

 

Save your completed file as analog.cfg then type ./analog (or ./analog +g/other-config-file.cfg if you have an additional config file). If all goes well, Analog will write its report to the file named in OUTFILE.

 

 

You will need to create a configuration file for each virtual host, and save it with a different filename, e.g. trampolining_net.cfg. Finally we are ready to schedule Analog to run each morning. We do this using a cron job, a Linux feature that allows tasks to be run at regular times. We will need to create a separate task for each report, and run them at different times to prevent multiple simultaneous Analog processes from clashing.

 

Setting up cron jobs requires you to use the vi editor, which is explained in Appendix B. This is what you type in vi:

 

55 0 * * * /usr/local/analog/analog

55 1 * * * /usr/local/analog/analog +g/trampolining_net.cfg

 

The cron job is now set up to run Analog at 0:55 a.m. each morning using the default configuration file, analog.cfg, and again at 1:55 a.m. using the configuration file trampolining_net.cfg.
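With more than a couple of virtual hosts, the staggered entries can be generated rather than typed. The sketch below reproduces the two lines above and spaces further reports one hour apart; crontab_lines is a hypothetical helper, and the one-hour spacing is an assumption that mirrors the example.

```python
def crontab_lines(config_names, analog="/usr/local/analog/analog",
                  minute=55, first_hour=0):
    """Build one crontab line per report, staggered an hour apart."""
    # The first line runs Analog with its default configuration file.
    lines = [f"{minute} {first_hour} * * * {analog}"]
    for offset, name in enumerate(config_names, start=1):
        hour = first_hour + offset
        if hour > 23:
            raise ValueError("too many reports to stagger hourly in one day")
        # Uses the same +g/<file> form as the crontab entries above.
        lines.append(f"{minute} {hour} * * * {analog} +g/{name}")
    return lines

for line in crontab_lines(["trampolining_net.cfg"]):
    print(line)
```

If a report takes longer than an hour to generate, the spacing would need to be widened to keep the runs from overlapping.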

 

Our final task is to protect the reports from unwelcome visitors. To do this, we will create a directory container for the /home/www/trampolining.net/analog/ directory in the primary server section of httpd.conf:

 

<Directory "/home/www/trampolining.net">

  Options Indexes FollowSymLinks

  Options +Includes

  AllowOverride None

  Order allow,deny

  Allow from all

</Directory>

 

<Directory "/home/www/trampolining.net/analog">

  Order allow,deny

  Allow from 123.123.123.123

</Directory>

 

This will deny the contents of www.trampolining.net/analog to anyone except the owner of IP address 123.123.123.123.
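The way Order allow,deny reaches that decision can be sketched as follows: the Allow rules are checked first, then any Deny rules, and a client that matches neither is refused. The allowed function is a simplified model that handles only exact addresses and the keyword all; Apache itself also accepts partial addresses, networks and hostnames.

```python
def allowed(client_ip, allow, deny=()):
    """Model 'Order allow,deny': Allow first, then Deny; the default is deny."""
    def matches(rules):
        return any(rule == "all" or rule == client_ip for rule in rules)

    if not matches(allow):
        return False      # no Allow rule matched, so the default applies
    if matches(deny):
        return False      # a matching Deny overrides a matching Allow
    return True

# The /analog container: only one address is allowed.
print(allowed("123.123.123.123", allow=["123.123.123.123"]))
print(allowed("231.231.231.231", allow=["123.123.123.123"]))

# The parent container: "Allow from all" admits everyone.
print(allowed("231.231.231.231", allow=["all"]))
```

Because the default under this ordering is to deny, the /analog container needs no explicit Deny line.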

 
