Network Computing

Chapter 5: Deploying Web and FTP Servers

May 22, 2000


Technologies for Effective Sites

Undoubtedly the biggest cost in deploying any web site is its design and maintenance. If you have used ASP in the past, you will be aware of the focus on reducing maintenance costs. If you need to edit every page on the server by hand each time you want to make a content or design change, maintenance becomes prohibitively expensive and error-prone. Server-Side Includes (SSI) allow text that is common to every page to be specified in one file. The other pages can then include it, using an SSI command, before the document is sent to the client. Furthermore, if the site is to present more than purely static content, it needs some way of adapting the content it serves to the actions of the client. It needs the intelligence provided by programming languages, such as the ability to perform calculations, handle information and provide feedback to the user.

 

CGI (Common Gateway Interface) allows scripts to run on the server. The main language used is Perl, a scripting language that provides excellent text processing power as well as standard programming tools. These scripts can handle user input, process it, store it and return customized pages. Apache implements standard script support, and can also provide much faster script support using mod_perl.

 

Alternatives to CGI for providing fast server-side processing come with Java servlets and JavaServer Pages. Java servlets are complete programs which run on your server, providing a complete portable development environment for your applications. Apache implements full servlet support with the help of ApacheJServ, which we will install later on in this section.

 

JavaServer Pages (JSP) are from the same family as servlets, but instead of being separate programs, they are HTML files with Java code inserted in-line and executed before the document leaves the server, combining considerable programming power with the simplicity of in-line code. Support for JSP is provided by the Jakarta project, which we will briefly discuss later on.

Server-Side Includes

The real trampolining.net web site contains over 100 separate HTML pages, and that may be small compared to the web sites you plan to deploy. Nearly every page follows exactly the same format in terms of design and layout, including a copyright statement at the end of every page. At the end of the year, all 100 pages will need the copyright statement updated to read, for example, © 1999, 2000.

 

To attempt to update all these manually would be tedious and error-prone. Instead, Server-Side Includes (SSI) are used to include one HTML file within the others. Any commonly repeated text could be inserted using SSI:

 

<!--#include virtual="stylesheet.shtml" -->   Includes a standard stylesheet

<!--#include virtual="navbar.shtml" -->       Starts the page table and includes a standard navigation bar

 

Main content here   

 

<!--#include virtual="copyright.shtml" -->    Closes table, adds copyright statement

 

The include commands insert the contents of the named files at that point. The named file is relative to the directory of the main file; files in subdirectories can be included (e.g. <!--#include virtual="subdir1/included.html" -->), as can files in parent directories (e.g. <!--#include virtual="../fromparent.html" -->), and the included files can themselves contain SSI commands if they end in .shtml. These inclusions are all performed before the document leaves the server, so the client only ever sees a normal HTML page. At the end of the year, I will only need to change copyright.shtml for all the pages on my site to be updated: a huge saving in maintenance time.
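The way nested includes are resolved can be sketched in a few lines of Python. This is only an illustration of the idea, not how Apache's mod_include is implemented: the function name expand_includes, the in-memory "site" dictionary and the sample page contents are all assumptions made for the example.

```python
import posixpath
import re

# Matches directives such as <!--#include virtual="navbar.shtml" -->
INCLUDE_RE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')


def expand_includes(path, files, depth=0):
    """Return the text of files[path] with its include directives expanded.

    `files` is a hypothetical in-memory stand-in for the document tree,
    mapping relative paths to file contents.
    """
    if depth > 10:
        raise RecursionError("includes nested too deeply: " + path)
    base_dir = posixpath.dirname(path)

    def replace(match):
        # The named file is resolved relative to the including file.
        target = posixpath.normpath(posixpath.join(base_dir, match.group(1)))
        if target.endswith(".shtml"):
            # Only .shtml files are parsed again for further directives.
            return expand_includes(target, files, depth + 1)
        return files[target]

    return INCLUDE_RE.sub(replace, files[path])


# Illustrative contents only; the real site's files are not shown in the text.
site = {
    "index.shtml": '<!--#include virtual="navbar.shtml" -->'
                   "Main content here\n"
                   '<!--#include virtual="copyright.shtml" -->',
    "navbar.shtml": "<table><tr><td>Home | News</td></tr>\n",
    "copyright.shtml": "<p>&copy; 1999, 2000</p></table>\n",
}
page = expand_includes("index.shtml", site)
```

Changing the single "copyright.shtml" entry changes every page that includes it, which is the maintenance saving described above.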

 

SSI includes other useful commands. CGI scripts can be called using the <!--#exec cgi="/cgi-bin/script.pl" --> command, with the output written directly into the page sent to the client. This prevents the client knowing the script even exists, so it is a useful security aid.

 

You can insert text which updates automatically using the <!--#echo var="LAST_MODIFIED" --> command. It inserts the value of one of an extended set of standard variables each time the page is requested, for example to show the date the page was last modified. Listings of the available commands are available online in the Apache mod_include documentation.

 

Apache provides excellent support for SSI with just a few directives. Because SSI increases server load, it is traditional to give any file containing SSI the suffix .shtml. Setting up Apache to parse only *.shtml means it won't waste time attempting to parse normal HTML (*.html) files. This part of the configuration takes place in the primary server section of httpd.conf, outside of any <Directory> containers. In fact, the directives are already there, about three quarters of the way through the file, and just need uncommenting. These two lines tell Apache what content type .shtml files should be given, and tell the internal SSI handler to parse .shtml files before serving them to the client.

 

AddType text/html .shtml

AddHandler server-parsed .shtml

 

Adding index.shtml to the DirectoryIndex directive allows .shtml files to be served as directory indexes by preference; in other words, if index.shtml exists in the root directory of my server, it will be served to someone requesting http://www.trampolining.net.

 

DirectoryIndex index.shtml index.html index.htm
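The preference order can be pictured as a first-match search over the names listed in the directive. The Python sketch below is only a model of that ordering (the function name pick_index and the example directory listing are assumptions), not a description of Apache's internals.

```python
def pick_index(candidates, files_in_directory):
    """Return the first candidate name present in the directory, or None."""
    for name in candidates:
        if name in files_in_directory:
            return name
    return None


# Same order as the DirectoryIndex line above.
DIRECTORY_INDEX = ["index.shtml", "index.html", "index.htm"]

# A hypothetical directory that happens to contain two index files.
chosen = pick_index(DIRECTORY_INDEX, {"index.html", "index.shtml", "news.html"})
```

Here chosen is "index.shtml", because it appears first in the list even though index.html also exists.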

 

It is then necessary to turn on SSI support in every directory container in which you wish to use it. If SSI is not working on a virtual host, check that the Includes option is enabled in that virtual host's directory container:

 

<Directory /home/www/trampolining.net>

Options Indexes FollowSymLinks Includes ExecCGI MultiViews

Options +Includes

AllowOverride None

Order allow,deny

Allow from all

</Directory>

 

For more information on Apache SSI look up http://www.apache.org/docs/mod/mod_include.html

Common Gateway Interface

Better known as CGI, this technology is the simplest way to deploy interactive content on your web site. Scripts are freely available to perform everything from form handling to maintaining complete discussion forums. Scripts are usually written in Perl and interpreted as they are used. However, as with any program running on your server, they represent a potential security risk. It is possible to configure Apache to interpret scripts from anywhere on the system, but this means anyone with access to directories containing web pages can create potentially harmful scripts.

 

To minimize this risk, CGI scripts are run from a special directory, usually called cgi-bin, with file permissions set so that remote users can execute them but only root has write access. The first line of each Perl script must also be changed to give the location of the Perl interpreter on your system; type which perl to find it.

 

The httpd.conf file already contains the necessary directives in the primary server section, so we just need to uncomment them and change any locations if necessary (note the trailing slashes):

 

ScriptAlias /cgi-bin/ "/home/www/cgi/"

 

The above directive tells Apache to treat any request to /cgi-bin/ as a request for a script, and to look for that script in the server directory /home/www/cgi/. This is inherited by any virtual hosts, unless we define a different ScriptAlias in the corresponding VirtualHost container, so in this example, http://www.trampolining.net/cgi-bin/script.pl and http://www.sport-science.net/cgi-bin/script.pl will each point to /home/www/cgi/script.pl.
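The mapping performed by this directive amounts to a prefix substitution, which the following Python sketch models. It is a simplification for illustration only: the function name resolve_script is an assumption, and real servers also normalize and validate the path.

```python
# URL prefix and directory taken from the ScriptAlias line above.
SCRIPT_ALIAS = ("/cgi-bin/", "/home/www/cgi/")


def resolve_script(url_path, alias=SCRIPT_ALIAS):
    """Map a request path to a script file, or None if the alias does not apply."""
    url_prefix, directory = alias
    if not url_path.startswith(url_prefix):
        return None
    # Replace the URL prefix with the script directory.
    return directory + url_path[len(url_prefix):]


script = resolve_script("/cgi-bin/script.pl")
```

Because both halves of the alias end in a slash, a request for /cgi-bin without the trailing slash does not match in this model, which is why the trailing slashes need to be kept consistent.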

 

Now look at this Directory container:

 

<Directory /home/www/cgi/>

    AllowOverride None

    Options None

    Order allow,deny

    Allow from all

</Directory>

 

This sets the permissions for your CGI directory to the absolute minimum necessary to run scripts. No-one will actually be able to read the scripts as any request will instead run them. These minimum permissions will also make life more difficult for hackers trying to access your scripts.
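To make the CGI contract concrete, here is a minimal sketch of a script that could live in such a directory. The chapter's scripts are Perl; Python is used here purely for illustration, and the build_response function and the "name" parameter are assumptions for the example. A CGI program reads its input from the environment (here QUERY_STRING) and writes headers, a blank line and then the page to standard output.

```python
#!/usr/bin/env python3
import html
import os
from urllib.parse import parse_qs


def build_response(query_string):
    """Return a complete CGI response: header, blank line, then the HTML body."""
    params = parse_qs(query_string)
    # "name" is a hypothetical form field; fall back to a default if absent.
    name = params.get("name", ["visitor"])[0]
    # Escape user input before echoing it back into the page.
    body = "<html><body><p>Hello, {}!</p></body></html>".format(html.escape(name))
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The web server passes the query string to the script in the environment.
    print(build_response(os.environ.get("QUERY_STRING", "")), end="")
```

Escaping the input before returning it is one small example of the care needed with anything a remote user can send to a script.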

mod_perl

The mod_perl module allows Perl scripts to be run very fast by a dedicated Perl interpreter within Apache, which does not need to be started separately for each request. Perl scripts are reported to run between two and twenty times faster under mod_perl than under mod_cgi, depending on the script itself. However, the increased speed of script processing comes at a price.

 

The mod_perl module is complicated to install and configure, and the actual steps needed depend on the versions of mod_perl and Apache being used; it also has three user modes and thirty configuration options during build. Therefore, detailed installation instructions are beyond the scope of this book, though you can get more help from the INSTALL text file that comes with the mod_perl download or from the Apache web site (www.apache.org). Furthermore, the installation of mod_perl will break your existing Apache configuration. It has to be installed first and Apache reinstalled on top, which means that you will have to customize Apache again from scratch. You have to decide, right from the outset, whether to include mod_perl in your server system, as it is currently very difficult to incorporate it later on.

 

The discussion of mod_perl has been left until now because its benefits only become apparent under very heavy server usage. For moderate or low usage, CGI is only marginally slower, and there is little advantage in the increased script processing power that mod_perl offers. The mod_perl module is an advanced application that should be considered only if very high server usage is anticipated.

An Example Installation

 Below is a very standard installation procedure for mod_perl. Download the module from www.apache.org and uncompress it to /usr/local. Then carry out the following steps.

The installation steps reproduced below are highly simplified and can only be said to work on most systems. You should look up the Apache documentation for more detailed instructions.

# cd /usr/local/mod_perl

# perl Makefile.PL APACHE_SRC=../apache_x.x.x/src \

> DO_HTTPD=1 USE_DSO=1 USE_APACI=1 EVERYTHING=1

# make && make test && make install

# cd ../apache_x.x.x

# make install

 

After the installation is complete, and mod_perl and Apache are working as they should, Apache needs to be configured for mod_perl. This consists of adding a few directives to httpd.conf. The first tells Apache to look for /home/www/fast-perl/anyscript.cgi when it receives a request for www.trampolining.net/fast-perl/anyscript.cgi.

 

Alias /fast-perl/ /home/www/fast-perl/

 

The next lines tell Apache to allow scripts to be executed in this directory, and to execute them by passing them to mod_perl:

 

<Location /fast-perl>

   AllowOverride None

   SetHandler perl-script

   PerlHandler Apache::PerlRun

   Options ExecCGI

   allow from all

   PerlSendHeader On

</Location>

 

mod_perl is a powerful and configurable module. Much more information on configuration is available from the Apache on-line documentation.

Java Servlets

Java is a programming language developed by Sun Microsystems. Its distinguishing feature is that, once compiled, Java programs will run on any machine architecture or operating system with the help of a Java Virtual Machine (JVM). The compiled program is not designed to run on any specific machine but instead on a JVM, a piece of software which provides a standard instruction set, much like that of a processor. A servlet is a compiled Java program of this kind that runs on the server. JVMs have been developed for nearly all the important operating systems, so well-written code should work on any platform without recompilation. This cross-platform portability is an important feature of the Java development environment: it helps ensure that your development investment will survive a change of hardware or platform.

 

Java servlets are invoked by requests from the browser, but run on the server, with the results being sent to the browser. This eliminates any need to worry about the browser type, as no code is sent. It is possible to implement arbitrarily complex algorithms using Java servlets, but if the servlet is designed to return its output as pure HTML, the results will be viewable by even the simplest text-based browsers.

 

Servlet support in Apache is provided by ApacheJServ, a fully featured Java servlet runtime container supporting the servlet API up to JSDK 2.0. While the ApacheJServ modules are not particularly big, the Java Development Kit, which is required to compile servlets and provide the JVM, is a huge 45 MB in size (the zipped archive is just over 19 MB), and using servlets will also cause a step increase in memory requirement of around 32 MB due to the JVM. However, the benefits of servlet technology far outweigh the cost of set-up, so read on!

 

Servlets are more difficult to configure than CGI, and Java may take some getting used to; it is a very powerful language with many similarities to C++. However, Java offers the increased security of its built-in security model, which makes it much more difficult for attackers to cause damage by passing harmful system commands to the servlet. Complex tasks like chat rooms or the server-side parts of games are also ideally suited to Java, because you can create servlets which stay alive from their initial instantiation onwards. While Perl is ideally suited to text processing applications, Java can be used to develop code of arbitrary complexity, with extensions available to make multi-tier distributed applications possible. With the help of MySQL, details of which you will find at www.mysql.org, it is possible to use SQL databases. You will find a more complete discussion of these topics in the Wrox publication Professional Java Server Programming.

 

And so we come to installing ApacheJServ.

 

Running ApacheJServ requires the Java Development Kit 1.2 for glibc 2.1 from http://www.blackdown.org. (Note that older Linux distributions may require the glibc 2.0 version.) The JDK is currently only available as a bzip2 archive, so you will need to install the bzip2 utility as well (http://sourceware.cygnus.com/bzip2/). You will also need the Java Servlet Development Kit (JSDK) version 2.0 from http://java.sun.com/products/Servlet. Download JServ from http://java.apache.org, extract it into the /usr/local/ApacheJServ-1.0 directory and type the following commands:

      

# mkdir /usr/local/apache/src/modules/jserv

# cd /usr/local/ApacheJServ-1.0

# ./configure --prefix=/usr/local/ApacheJServ-1.0 \
> --with-apache-install=/usr/local/apache \
> --with-jsdk=/usr/local/JSDK2.0/lib/jsdk.jar

# make

# make install

 

ApacheJServ should now be installed and configured. Open httpd.conf for editing and add this directive to the very end of the file:

 

Include /usr/local/ApacheJServ-1.0/example/jserv.conf

 

Appending this command forces Apache to read jserv.conf from its installed location. Future versions of ApacheJServ may instead install this file in the same directory as httpd.conf. The jserv.conf file contains all the commands to configure the Apache side of ApacheJServ.

 

Restart Apache, give JServ a moment or two to begin accepting requests, and if everything works, visiting http://localhost/example/Hello should produce a success page!

 

 

If this does not work, your version of ApacheJServ may configure /servlet as the test zone instead, in which case you would have to type http://localhost/servlet/Hello.

Java Server Pages

While Java servlets offer boundless possibilities for powerful server-side processing, for simple applications they can be quite unwieldy. Perhaps you want to insert the time and date at one point on your page, and perform a calculation at another; using JavaScript or a Java applet prevents older browsers from viewing your page correctly. You could use a single servlet to create the whole page. However, the page content itself is then mixed up with Java code, making maintenance difficult, particularly if the programmers and web designers are different groups of people. Alternatively, you could keep the page content in an HTML file which uses Server-Side Includes to call successive CGI scripts to insert the correct text at each point. This way the web designers can maintain the HTML without worrying about the code. However, this simple page now has one HTML file and several CGI scripts associated with it, which again makes maintenance complicated.

 

For simple applications, the ideal solution would be to have the Java code and HTML contained in a single file. It would have the look and 'feel' of HTML, so the web designers can understand it, but would contain additional code to be run on the server before the page is delivered to the client. Sun's new member of the Java family, JavaServer Pages (JSP), provides this solution. Code can be inserted in line within the HTML; it is executed on the server and the results are merged with the HTML in the output. This parallels how Microsoft's ASP works, and JSP is emerging as the open source challenger to ASP in this field.

 

The file which leaves the server is pure HTML, so unlike JavaScript and Java Applets, which have to be run on the client, you can have the interactivity and programming flexibility of Java while ensuring that all existing HTML browsers can display the output. Furthermore, you maintain all the advantages of Java's portability should you later decide to change operating system or web server. There are already many web sites that use JSP instead of ASP.

 

Up until recently, the main open source JSP implementations were GNU Server Pages (GSP) and GNU Java Server Pages (GNUJSP), which are independent development efforts despite their similar names. Both are written as regular Java servlets, and although they are difficult to install and configure, they can be used to create JSPs and develop web sites. Information on GSP and GNUJSP can be found at www.bitmechanic.com and www.klomp.org/gnujsp respectively.

 

However, JSP support in Apache now takes the form of a module called Jakarta, named after the project team which implemented it (or after the largest city on the Indonesian island of Java, which might or might not be a coincidence). At the time of going to press, Jakarta is in final pre-release form, so by the time you read this it will almost certainly be in production release. The latest version of Jakarta and its installation instructions are available online at http://jakarta.apache.org.

 

Logs and Analysis

To develop a web site effectively, you will need to analyze the web site's log files regularly, as they contain data on everyone who accesses the site. From them you can determine the number of requests made, the identities (IP addresses) of the clients and the pattern of hyperlinks that are followed across the web site. While small-scale information can be gained by manually viewing the log files, this technique is not appropriate for finding large-scale trends. Each request creates 60 bytes or so of data that is added to the log file, and a page usually generates further requests for its images. Multiplying 60 bytes by, say, 200 daily page requests gives about 12 kilobytes for the pages alone; once the accompanying image requests are counted, roughly 50-60 kilobytes of data are added to the log each day. Therefore, manual viewing is in reality restricted to small samples of the logs.
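The size estimate is simple arithmetic, sketched below in Python. The figure of five log entries per page view (the page plus four images) is an assumption chosen for illustration, as is the function name daily_log_bytes; only the 60 bytes and 200 page requests come from the text.

```python
def daily_log_bytes(page_requests, entries_per_page, bytes_per_entry=60):
    """Estimate how many bytes one day's traffic adds to the access log."""
    return page_requests * entries_per_page * bytes_per_entry


# Pages only: 200 * 1 * 60 = 12000 bytes, about 12 KB.
pages_only = daily_log_bytes(200, 1)

# Assuming each page view also logs four image requests:
# 200 * 5 * 60 = 60000 bytes, about 60 KB.
with_images = daily_log_bytes(200, 5)
```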

 

To automatically analyze the complete logs, we will be using Analog, a small yet powerful program which is configurable, scalable and free. It is currently the most popular log file analysis program on the web (a 25% market share according to a GVU report at http://www.gvu.gatech.edu). It will be configured to produce separate reports for each virtual host, and update them each morning, and the reports will only be read by authorized people.

Manual Logfile Analysis

While manual analysis will not be suitable for viewing overall trends, it allows you to interpret the logs with human intelligence. For example, if you notice lots of visitors are requesting one page then leaving, you may want to investigate ways of encouraging them to stay on your site. Do you provide links to other relevant pages? Are they arriving directly into a frame and being trapped with no links out? Are your pages so large, or your connection so slow, they are giving up waiting and leaving the site?

 

You will have chosen where to place your logs when editing httpd.conf. Simply open one in an editor and concentrate on a small section. Below is an extract from access_log on my machine (with the IP addresses replaced by dummy ones):

 

231.231.231.231 - - [02/Oct/1999:19:47:35 +0000] "GET / HTTP/1.1" 200 9621 "-" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:41 +0000] "GET /trampnetmini.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:58 +0000] "GET /trampnetmini.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:58 +0000] "GET /coach.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /news.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /improve.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /merger.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

231.231.231.231 - - [02/Oct/1999:19:47:59 +0000] "GET /chat.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:45 +0000] "POST /cgi-bin/poll.pl?voted HTTP/1.1" 302 291 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:46 +0000] "GET / HTTP/1.1" 200 10137 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:47 +0000] "GET /trampnetmini.gif HTTP/1.1" 200 6971 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:47 +0000] "GET /improve.gif HTTP/1.1" 200 4727 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

132.132.132.132 - - [03/Oct/1999:16:30:49 +0000] "GET /merger.gif HTTP/1.1" 200 4526 "http://www.trampolining.net/" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

 

The first field in each line is the IP address of the client. By following an IP address through the log, you can find the path an individual visitor took through your site. Proxies complicate this: office networks and ISPs such as AOL that employ proxies represent around 25% of web traffic, and they can cause a single user to appear to come from multiple IP addresses, or allow users to receive some pages without those requests appearing in your logs. The technique remains accurate the remainder of the time, and is normally accurate even for access via a proxy server, assuming there are not multiple caches. However, there is as yet no way round this growing problem.

 

There follows the date and time, then, in double quotes, the request method, the requested filename and the version of HTTP. A single slash (/) here represents a directory request, which usually returns index.html. The number immediately following the request is the HTTP status code, which in the extract above is 200 (OK), 304 (Not Modified) or, after the POST, 302 (a redirect). After it comes the number of bytes sent, shown as - when no body was returned. Unsuccessful requests, i.e. those producing 403 (Access forbidden) or 404 (File not found) codes, are also recorded in the error_log file.

 

The next field is the referrer, which in all but the first of the above log entries is http://www.trampolining.net/ (the first entry shows - because no referrer was sent). The identity of the referrer depends on what file is being logged at the time. In the case of images, the referrer is simply the page that contains the image, but in the case of pages, it is the page the browser was previously viewing; this gives a good idea of where your visitors are coming from. The final field is the user agent string, which identifies the browser and the version of the operating system.
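The field layout just described can be captured in a regular expression. The Python sketch below splits one entry of the extract into named fields; the names COMBINED_RE and parse_entry, and the choice to report a size of - as 0, are assumptions made for the example.

```python
import re

# One named group per field of the combined log format described above.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent (e.g. for a 304 response).
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# First line of the extract above.
sample = ('231.231.231.231 - - [02/Oct/1999:19:47:35 +0000] "GET / HTTP/1.1" '
          '200 9621 "-" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"')
entry = parse_entry(sample)
```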

 

As you can see, each page can generate many lines of log, so to make a visitor's path easier to follow, we can cut out some of the unwanted information. To follow the path of just one client, type:

 

$ grep '^231\.231\.231\.231 ' /usr/local/apache/logs/trampolining_access_log | more

 

This will display only log entries created by the client with IP address 231.231.231.231.

 

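There are many log file entries corresponding to images, which are often of little interest. The following command shows only requests for files whose names end in html; note that directory requests such as GET / do not match the pattern: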

 

$ grep 'html HTTP' /usr/local/apache/logs/trampolining_access_log | more

 

You can even view page requests from a single client:

 

$ grep 'html HTTP' /usr/local/apache/logs/trampolining_access_log | grep '^231\.231\.231\.231 '

 

A final technique allows you to watch current requests in real time. This command is:

 

$ tail -f /usr/local/apache/logs/trampolining_access_log

 

You can make this easier to read by removing the image requests and displaying only page requests:

 

$ tail -f /usr/local/apache/logs/trampolining_access_log | grep 'html HTTP'
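The same filtering can be done for every client at once. The Python sketch below groups the non-image requests by client IP address, giving each visitor's path through the site; unlike the html pattern used with grep, it also keeps directory requests such as /. The names ENTRY_RE, IMAGE_SUFFIXES and page_paths are assumptions for the example, and the sample lines are shortened versions of entries from the extract above.

```python
import re
from collections import defaultdict

# Captures client address, requested path and status from a log line.
ENTRY_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3})'
)
# Assumed list of image extensions to ignore.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")


def page_paths(lines):
    """Map each client IP to the ordered list of non-image paths it requested."""
    paths = defaultdict(list)
    for line in lines:
        match = ENTRY_RE.match(line)
        if match is None:
            continue
        host, path, _status = match.groups()
        # Ignore any query string when checking the file extension.
        if path.lower().split("?")[0].endswith(IMAGE_SUFFIXES):
            continue
        paths[host].append(path)
    return dict(paths)


sample_lines = [
    '231.231.231.231 - - [02/Oct/1999:19:47:35 +0000] "GET / HTTP/1.1" 200 9621 "-" "Mozilla/4.0"',
    '231.231.231.231 - - [02/Oct/1999:19:47:41 +0000] "GET /trampnetmini.gif HTTP/1.1" 304 - "http://www.trampolining.net/" "Mozilla/4.0"',
    '132.132.132.132 - - [03/Oct/1999:16:30:45 +0000] "POST /cgi-bin/poll.pl?voted HTTP/1.1" 302 291 "http://www.trampolining.net/" "Mozilla/4.0"',
    '132.132.132.132 - - [03/Oct/1999:16:30:46 +0000] "GET / HTTP/1.1" 200 10137 "http://www.trampolining.net/" "Mozilla/4.0"',
]
visits = page_paths(sample_lines)
```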

Automatic Analysis

Automatic analysis is the vehicle by which we will obtain an overview of our system's usage. Installation of Analog is quite simple:

 

- Download Analog from http://www.statslab.cam.ac.uk/~sret1/analog/ to /usr/local/analog/.

- Change to the /usr/local/analog directory.

- Open the analhead.h file for editing and change ANALOGDIR to /usr/local/analog/.

- Type make.

 

We also need to prepare a directory for the reports and populate it with the necessary images:

 

- Type mkdir /home/www/trampolining.net/analog.

- Copy /usr/local/analog/images/* to /home/www/trampolining.net/analog.

 

That's it: Analog is ready for use.

 

Analog is set up using configuration files; the default is analog.cfg which we will edit now, and later on we will create an additional configuration file for each virtual host.

 

LOGFORMAT specifies the format of log used. Analog natively supports the Apache formats COMBINED and COMMON. LOGFILE tells Analog where to look for the access log.

 

LOGFORMAT COMBINED

LOGFILE /usr/local/apache/logs/access_log

 

HOSTNAME specifies the name to put at the top of the report.

 

HOSTNAME "www.trampolining.net"

 

Remember we told Apache not to resolve IP addresses? This little section tells Analog to resolve them instead, which is much more efficient because each address is resolved only once and then written to the cache file specified in DNSFILE. DNSGOODHOURS is the number of hours to trust a successfully resolved entry in the cache file; DNSBADHOURS is the number of hours to wait before attempting to resolve a bad IP address again. DNS WRITE tells Analog to try to resolve unknown IP addresses, then write them to the dnsfile.txt file. The alternative command DNS READ would tell Analog to skip IP addresses which don't exist in the dnsfile.txt file, thus saving time. On the first run, Analog will complain that dnsfile.txt does not exist; ignore this, as Analog will create it.

 

DNSFILE /usr/local/analog/dnsfile.txt

DNSGOODHOURS 1250

DNSBADHOURS 350

DNS WRITE
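The caching policy behind these settings can be sketched in Python. This is a model of the idea only, not Analog's implementation or file format: the DnsCache class, its injected resolver and its clock argument are all assumptions made so that the behaviour can be shown without real DNS lookups.

```python
import time

GOOD_HOURS = 1250   # mirrors DNSGOODHOURS above
BAD_HOURS = 350     # mirrors DNSBADHOURS above


class DnsCache:
    """Cache mapping an IP address to (hostname or None, time of lookup)."""

    def __init__(self, resolver, now=time.time):
        self.resolver = resolver   # callable: ip -> hostname, or None on failure
        self.now = now             # clock returning seconds; injectable for tests
        self.entries = {}

    def lookup(self, ip):
        cached = self.entries.get(ip)
        if cached is not None:
            name, stamp = cached
            # Successful lookups are trusted for longer than failed ones.
            max_age_hours = GOOD_HOURS if name is not None else BAD_HOURS
            if self.now() - stamp < max_age_hours * 3600:
                return name
        # Unknown or expired entry: resolve it and remember the result.
        name = self.resolver(ip)
        self.entries[ip] = (name, self.now())
        return name
```

With these values, an address that resolved successfully is not looked up again for 1250 hours, while one that failed is retried once 350 hours have passed.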

 

This directive tells Analog where to create the report.

 

OUTFILE /home/www/trampolining.net/analog/trampolining_net_report.html

 

HOSTEXCLUDE directives tell Analog to ignore accesses from a certain IP address or hostname. This allows you to report what your visitors do, without being influenced by your own visits! In this example I exclude all page accesses from Cambridge University using Cambridge's IP allocation, and exclude all accesses from York University using the resolved hostnames.

 

HOSTEXCLUDE 131.111.*.*

HOSTEXCLUDE *.york.ac.uk
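The effect of these wildcard patterns can be illustrated with Python's fnmatch module. This is a sketch of the matching idea rather than Analog's own matcher; the function name is_excluded and the example hosts are assumptions.

```python
from fnmatch import fnmatchcase

# Patterns copied from the HOSTEXCLUDE lines above.
EXCLUDE_PATTERNS = ["131.111.*.*", "*.york.ac.uk"]


def is_excluded(host, patterns=EXCLUDE_PATTERNS):
    """Return True if the host (IP address or resolved name) matches a pattern."""
    host = host.lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)


# Hypothetical hosts for illustration.
from_cambridge = is_excluded("131.111.8.46")
from_york = is_excluded("www.york.ac.uk")
from_elsewhere = is_excluded("231.231.231.231")
```

The first pattern works on raw IP addresses, while the second only takes effect once an address has been resolved to a hostname.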

 

If your web site contains pages with other extensions than .htm or .html, for example JSPs or .shtml