2.3 Internet and the World Wide Web (WWW)
The Internet originated as the ARPANET (Advanced Research Projects Agency Network), which was initiated in 1969 to support researchers on DOD (Department of Defense) projects. For many years, the Internet was used mainly by scientists and programmers to transfer files and send/receive electronic mail. Internet users relied on text-based user interfaces and tedious commands to access remote computing resources. In 1989, this changed with the introduction of the World Wide Web (WWW), commonly referred to as the Web. The Web has been a major contributor in turning the Internet, once an obscure tool, into a household word. Why? Mainly because the Web allows users to access, navigate, and share information around the globe through GUI clients ("Web browsers") that are available on almost all computing platforms. Web browsers allow users to access information that is linked through hypermedia links. Thus a user transparently browses, or "surfs," around different pieces of information that are located on different computers in different cities and even in different countries.
2.3.1 Internet, Intranets, and Extranets
Simply stated, the Internet is a network of networks. Technically, however, the Internet is a collection of networks based on the IP (Internet Protocol) stack. This protocol stack, initially referred to as the DOD (Department of Defense) or ARPANET Protocol Suite and now commonly referred to as TCP/IP (Transmission Control Protocol/Internet Protocol), was designed to support e-mail, file transfer, and terminal emulation for ARPANET users. The services and protocols supported by IP have grown dramatically in popularity and have become the de facto standards for heterogeneous enterprise networks. At present, the term Internet is used to symbolize the IP (or loosely speaking, TCP/IP) networks in the following situations:
• Public Internet, or just the Internet, is not owned by any single entity; it consists of many independent IP networks that are tied together loosely. Initially, the public Internet was used to tie different university networks together. With time, several commercial and private networks have joined the public Internet. The computers on the public Internet have publicly known Internet Protocol (IP) addresses that are used to exchange information over the public Internet (see discussion on addressing below). The public Internet at present consists of thousands of networks.
• Private Internets, or Intranets, are the IP networks that are used by corporations for their own businesses. Technically, an Intranet is the same as the public Internet, only smaller and privately owned (thus hopefully better controlled and more secure). Thus any applications and services that are available on the public Internet are also available on the Intranets. This is an important point for WWW, because many companies are using WWW technologies on their Intranets for internal applications (e.g., employee information systems).
• Business-to-business Internets, or Extranets, are the TCP/IP networks that are used for business-to-business activities. Technically, an Extranet is the same as the public Internet but is better controlled and more secure. Many electronic commerce applications between business partners are beginning to use Extranets. Any applications and services that are available on the public Internet are also available on the Extranets.
Basically: Internet = Public Internet + Intranets + Extranets
The following protocols (the first three belong to the original DOD Suite) are among the best known application protocols in the Internet (see Figure 2.6):
• Telnet: This protocol is used to provide terminal access to hosts and runs on top of TCP.
• File Transfer Protocol (FTP): This TCP-based protocol provides a way to transfer files between hosts on the Internet.
• Simple Mail Transfer Protocol (SMTP): This TCP-based protocol is the Internet electronic mail exchange mechanism.
• Trivial File Transfer Protocol (TFTP): This UDP-based protocol also transfers files between hosts, but with less functionality (e.g., no authentication mechanism). This protocol is typically used for "booting" over the network.
• Network File System (NFS) Protocol: This UDP-based protocol has become a de facto standard for building distributed file systems with transparent access to remote files.
• X Window: This is a windowing system that provides uniform user views of several executing programs and processes on bit-mapped displays. Although X Window is supposedly network independent, it has been implemented widely on top of TCP.
• SUN Remote Procedure Call (RPC): This protocol allows programs to execute subroutines that are actually at remote sites. RPCs, similar to X Window, are supposedly network independent but have been implemented widely on top of TCP. SUN RPC is one of the oldest RPCs. Examples of other RPCs are OSF DCE RPC and Netwise RPC.
• Domain Naming Services: This protocol defines hierarchical naming structures that are much easier to remember than the IP addresses. The naming structures define the organization type, organization name, etc.
• SNMP (Simple Network Management Protocol): This is a protocol defined for managing (monitoring and controlling) networks.
• Kerberos: This is a security authentication protocol developed at MIT.
• Time and Daytime Protocol: This provides machine-readable time and day information.
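The hierarchical naming mentioned under Domain Naming Services can be illustrated with a small sketch. It only splits a name into its levels, from the most general (the organization type, such as "edu") down to the individual machine; it does not perform any lookup. The function name domain_hierarchy is an illustrative assumption, and "cs.um.edu" is the example host used later in this chapter.

```python
def domain_hierarchy(name):
    """Return the levels of a domain name, most general level first."""
    # Labels are separated by dots; the rightmost label is the most general.
    labels = name.rstrip(".").lower().split(".")
    # Build "edu", then "um.edu", then "cs.um.edu".
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


if __name__ == "__main__":
    for level in domain_hierarchy("cs.um.edu"):
        print(level)
```

Reading the result from first to last corresponds to walking down the naming hierarchy: organization type, then organization name, then the machine within that organization.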
The World Wide Web (WWW) has introduced additional application protocols and services. For example, the Web browsers, the Web servers, and the HTTP protocol used in WWW reside on top of the IP stack (see next section). As the use of the Internet grows, more services and protocols for the IP application layer will emerge.
Figure 2.6 Technical View of Internet and World Wide Web
2.3.2 Overview of World Wide Web
The World Wide Web (WWW) is a wide area information retrieval project that was started in 1989 by Tim Berners-Lee at the European Laboratory for Particle Physics in Geneva (known as CERN, based on the laboratory's French name) [Berners-Lee 1993] and [Berners-Lee 1996]. The initial proposal suggested development of a "hypertext system" to enable efficient and easy information sharing among geographically separated teams of researchers in the High Energy Physics community.
Technically speaking, WWW is a collection of middleware that operates on top of IP networks (i.e., the Internet). Figure 2.6 shows this layered view. The purpose of the WWW middleware is to support the growing number of users and applications ranging from entertainment to corporate information systems. Like many other (successful) Internet technologies, the WWW middleware is based on a few simple concepts and technologies such as the following (see Figure 2.7):
• Web servers
• Web browsers
• Uniform Resource Locator (URL)
• Hypertext Transfer Protocol (HTTP)
• Hypertext Markup Language (HTML)
• Web navigation and search tools
• Gateways to non-Web resources
Figure 2.7 Conceptual View of World Wide Web
Let us briefly review these components and show how they tie with each other through an example.
Web sites provide the content that is accessed by Web users. Web sites are populated and in many cases managed by the content providers. For example, Web sites provide the commercial presence for each of the content providers doing business over the Internet. Conceptually, a Web site is a catalog of information for each content provider over the Web. In reality, a Web site consists of three types of components: a Web server (a program), content files ("Web pages"), and/or gateways (programs that access non-Web content). A Web server is a program (technically a server process) that receives calls from Web clients and retrieves Web pages and/or receives information from gateways (we will discuss gateways later). Once again, a Web user views a Web site as a collection of files on a computer, usually a UNIX or Windows NT machine. In many cases, a machine is dedicated or designated as a Web site on which Web-accessible contents are stored. As a matter of convention, the entry point to a Web site is a "home page" that advertises the company business. Very much like storefront signs in a shopping mall, the home pages include company logos, fancy artwork for attention, special deals, overviews, pointers to additional information, etc. The large number of Web sites containing a wide range of information that can be navigated and searched transparently by Web users is the main strength of WWW. Figure 2.7 shows two Web sites: one for a shoe shop (www.shoes.com) and the other for the computer science department of a university (cs.um.edu).
Web browsers are the clients that typically use graphical user interfaces to wander through the Web sites. The first GUI browser, Mosaic, was developed at the National Center for Supercomputing Applications at the University of Illinois. Mosaic runs on PC Windows, Macintosh, UNIX, and X terminals. At present, Web browsers are commercially available from Netscape, Microsoft, and many other software/freeware providers. These Web browsers provide an intuitive view of information where hyperlinks (links to other text information) appear as underlined items or highlighted text/images. If a user points and clicks on the highlighted text/images, then the Web browser uses HTTP to fetch the requested document from an appropriate Web site. Web browsers are designed to display information prepared in a markup language known as HTML. We will discuss HTTP and HTML later. Three different browsers are shown in Figure 2.7. Even though these are different browsers residing on different machines, they all use the same protocol (HTTP) to communicate with the Web servers (HTTP compliance is a basic requirement for Web browsers).
Most browsers at present are relatively dumb (i.e., they just pass user requests to Web servers and display the results). However, this is changing very quickly because of Java, a programming language developed by Sun Microsystems. Java programs, known as Java applets, can run on Java-compatible browsers. This is creating many interesting possibilities where Java applets are downloaded to Java-enabled browsers, where they run to produce graphs/charts, invoke multimedia applications, and access remote databases.
Uniform Resource Locator (URL) is the basis for locating resources in WWW. A URL consists of a string of characters that uniquely identifies a resource. A user can connect to resources by typing the URL in a browser window or by clicking on a hyperlink that implicitly invokes a URL. Perhaps the best way to explain URLs is through an example. Let us look at the URL "http://cs.um.edu/faculty.html" shown in Figure 2.7. The "http" in the URL tells the browser that an HTTP request is to be initiated (if you substitute http with ftp, then an FTP session is initiated). The "cs.um.edu" is the name of the machine running the Web server. (This is actually the domain name used to locate machines on the Internet.) The "/faculty.html" is the name of a file on the machine cs.um.edu. The "html" suffix indicates that this is an HTML file. When this URL is clicked or typed, the browser initiates a connection to the "cs.um.edu" machine and issues a "Get" request for the "faculty.html" file. Depending on the type of browser you are using, you can see these requests flying around in an appropriate window spot. Eventually, this document is fetched, transferred to, and displayed at the Web browser. You can access any information through the Web by issuing a URL (directly or indirectly). As we will see later, the Web search tools basically return a list of URLs in response to a search query. The general format of a URL is:
protocol://host:port/path
where
protocol represents the protocol to retrieve or send information. Examples of valid protocols are HTTP, FTP, Telnet, Gopher, and NNTP (Network News Transfer Protocol)
host is the computer host on which the resource resides
port is an optional port number (this is not needed unless you want to override the HTTP default port, port 80)
path is an identification, typically a file name, on the computer host
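The general format above can be illustrated with a short sketch that splits a URL into its protocol, host, port, and path. It uses urlsplit from Python's standard library. The function name split_url is an illustrative assumption, as is the table of default ports: the text states only the HTTP default (80), and the other entries are the conventional port numbers for those protocols. The port 8080 in the second example is hypothetical.

```python
from urllib.parse import urlsplit

# Conventional default ports (an assumption beyond the text, which
# mentions only the HTTP default of 80).
DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23, "gopher": 70, "nntp": 119}


def split_url(url):
    """Split protocol://host:port/path into its four components."""
    parts = urlsplit(url)
    # The port is optional; fall back to the protocol's default if absent.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
    }


if __name__ == "__main__":
    print(split_url("http://cs.um.edu/faculty.html"))
    print(split_url("http://cs.um.edu:8080/courses.html"))
```

In the first call no port is given, so the HTTP default applies; in the second, the explicit port overrides it.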
Hypertext Markup Language (HTML) is an easy-to-use language that tags text files for display at Web browsers. HTML also helps in the creation of hypertext links, usually called hyperlinks, which provide a path from one document to another. The hyperlinks contain URLs for the needed resources. The main purpose of HTML is to allow users to flip through Web documents in a manner similar to flipping through a book, magazine, or catalog. The Web site "cs.um.edu" shown in Figure 2.7 contains two HTML documents: "faculty.html" and "courses.html." HTML documents can embed text, images, audio, and video.
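To make the idea of tagged text and hyperlinks concrete, the following sketch holds a small HTML document and extracts the URLs from its hyperlinks, using the HTMLParser class in Python's standard library. The page content is invented for illustration (the text does not show what "faculty.html" contains), and the names LinkCollector and extract_links are assumptions.

```python
from html.parser import HTMLParser

# A hypothetical "faculty.html"; the actual page content is not given in the text.
PAGE = """<html>
<head><title>Faculty</title></head>
<body>
<h1>Computer Science Faculty</h1>
<p>See also the <a href="courses.html">course list</a>.</p>
<p>Sponsored by <a href="http://www.shoes.com/">a shoe shop</a>.</p>
</body>
</html>"""


class LinkCollector(HTMLParser):
    """Collect the URL of every hyperlink (<a href=...>) in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Hyperlinks are written as anchor tags whose href attribute holds the URL.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(extract_links(PAGE))
```

The first link is relative to the current document, while the second is a full URL that leads to a different Web site.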
Hypertext Transfer Protocol (HTTP) is an application-level protocol designed for Web users. It is intended for collaborative, distributed, hypermedia information systems. HTTP uses an extremely simple request/response model that establishes a connection with the Web server specified in the URL, retrieves the needed document, and closes the connection. Once the document has been transferred to your Web browser, the browser takes over. Keep in mind that every time you click on a hyperlink, you are initiating an HTTP session to transfer the needed information to your browser. The Web users shown in Figure 2.7 access the information stored in the two servers by using HTTP.
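The request/response model described above can be sketched as follows. This is a minimal illustration, assuming HTTP/1.0-style messages: it builds the text of a "Get" request, parses a response, and shows the connect, send, receive, and close sequence in a fetch function. The function names and the sample response are illustrative assumptions, and fetch is not called here because the host "cs.um.edu" is only the example name from Figure 2.7.

```python
import socket


def build_get_request(host, path):
    """Return the bytes of a simple HTTP/1.0 GET request."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw response into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


def fetch(host, path, port=80):
    """Open a connection, send one request, read the reply, and close."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return parse_response(b"".join(chunks))


# A made-up response, used so that the sketch runs without a network.
SAMPLE_RESPONSE = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Faculty list</body></html>"
)

if __name__ == "__main__":
    print(build_get_request("cs.um.edu", "/faculty.html").decode("ascii"))
    print(parse_response(SAMPLE_RESPONSE))
```

Each call to fetch corresponds to one click on a hyperlink in the description above: one connection, one document, and then the connection is closed.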
Web navigation and search services are used to search and surf the vast resources available in "cyberspace." The term cyberspace, as stated previously, was first introduced through a science fiction book by Gibson [1984] but currently refers to the computer-mediated experiences for visualization, communication, and browser/decision support. In the general search paradigm, each search service maintains an index of information available on Web sites. This index is almost always created and updated by "spiders" that crawl around the Web sites chasing hyperlinks for different pieces of information. Search engines support keyword and/or subject-oriented browsing through the index. The result of this browsing is a "hit list" of hyperlinks (URLs) that the user can click on to access the needed information. For example, the Web users in Figure 2.7 can issue a keyword search by using a search service for shoe stores in Chicago. This will return a hit list of potential shoe stores that are Web content providers. You then point and click until you find a shoe store of your choice. Many search services are currently available on the Web. Examples are Yahoo, Lycos, and Alta Vista. At present, many of these tools are being integrated with Web pages and Web browsers. For example, the Netscape browser automatically invokes the Netscape home page, which displays search tools that you can invoke by just pointing and clicking. It is beyond the scope of this book to describe the various Web navigation and search tools. Many books on the Internet describe these search tools quite well.
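The search paradigm described above (a spider follows hyperlinks, builds an index, and keyword queries return a hit list of URLs) can be sketched on a tiny in-memory "Web." Everything in the WEB table is invented for illustration; only the host names come from Figure 2.7. A real spider would fetch pages over HTTP rather than read them from a dictionary, and real search services use far more elaborate indexing and ranking.

```python
import re
from collections import deque

# Hypothetical pages: URL -> (page text, hyperlinks found on the page).
WEB = {
    "http://cs.um.edu/faculty.html": (
        "Computer science faculty", ["http://cs.um.edu/courses.html"]),
    "http://cs.um.edu/courses.html": (
        "Computer science courses", ["http://www.shoes.com/index.html"]),
    "http://www.shoes.com/index.html": (
        "Shoes for sale in Chicago", ["http://www.shoes.com/boots.html"]),
    "http://www.shoes.com/boots.html": (
        "Leather boots and winter shoes", []),
}


def crawl(start_url, web):
    """Follow hyperlinks from start_url and build a word -> URLs index."""
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        for word in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(word, set()).add(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Return the hit list: URLs of pages that contain every query word."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


if __name__ == "__main__":
    index = crawl("http://cs.um.edu/faculty.html", WEB)
    print(search(index, "shoes Chicago"))
```

The query for shoes in Chicago returns only the one page that mentions both words, which plays the role of the hit list in the shoe store example.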
Gateways to non-Web resources are used to bridge the gap between Web browsers and the corporate applications and databases. Web gateways are used for accessing information from heterogeneous data sources (e.g., relational databases, indexed files, and legacy information sources) and can be used to handle almost anything that is not designed with an HTML interface. The basic issue is that Web browsers can display only HTML information. These gateways are used to access non-HTML information and convert it to HTML format for display at a Web browser. The gateway programs typically run on Web sites and are invoked by the Web servers. At present, the Common Gateway Interface (CGI) is used frequently. "Relational gateways" that provide access to relational databases from Web browsers are an area of active work.
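As a rough illustration of what a "relational gateway" does, the sketch below reads rows from a relational database and converts them to an HTML page, preceded by the content-type header that a CGI-style program hands back to the Web server. It uses an in-memory SQLite database as a stand-in for a corporate database; the employees table, its contents, and the function names are all assumptions made for the example.

```python
import sqlite3
from html import escape


def rows_to_html(title, columns, rows):
    """Render query results as a simple HTML table."""
    head = "".join(f"<th>{escape(col)}</th>" for col in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(value))}</td>" for value in row) + "</tr>"
        for row in rows
    )
    return (
        f"<html><head><title>{escape(title)}</title></head>"
        f"<body><table><tr>{head}</tr>{body}</table></body></html>"
    )


def gateway_response(conn, query):
    """Run a query and return a CGI-style response: header, blank line, HTML."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    page = rows_to_html("Query result", columns, cursor.fetchall())
    return "Content-Type: text/html\r\n\r\n" + page


def make_demo_database():
    """Create a small hypothetical employee table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?)",
        [("A. Smith", "Sales"), ("B. Jones", "R&D")],
    )
    return conn


if __name__ == "__main__":
    print(gateway_response(make_demo_database(),
                           "SELECT name, dept FROM employees ORDER BY name"))
```

Note that the values are escaped before being placed in the page, so that characters such as "&" in the data are displayed as text rather than interpreted as markup.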
2.3.3 A Simple Example
Figure 2.8 illustrates how the Web components can be used for a department store "Clothes-XYZ." This store wants to advertise its products on the Web (i.e., it wants to be a Web content provider). The store first designates a machine or buys services on a machine called "clothes.com" as a Web site. It then creates an overview document "overview.html" that tells the potential customers of the product highlights (think of this as the first few pages of a catalog). In addition, several HTML documents for different types of clothes (men.html, women.html, kids.html) are created on the Web site with pictures of clothes, size information, etc. (once again think of this as a catalog). We can assume that the overview page has hyperlinks to the other documents (as a matter of fact, it could have hyperlinks to other branches of Clothes-XYZ). In reality, design of the Web pages would require a richer, deeper tree structure design as well as sequential links for alphabetical and keyword searches needed to support the "flipping through" catalog behavior.
Once HTML documents have been created on the Web server, an Internet user can browse through them as if he/she were flipping through a catalog. The customers typically supply the URL, directly or indirectly, for the overview (http://clothes.com/overview.html) and then use the hyperlinks to look at different types of clothes. Experienced customers may go directly to the type of clothes needed (e.g., men may go directly to the "men.html" document). As shown in Figure 2.8, the URL consists of three components: the protocol (http), the Web server name (clothes.com), and the needed document (overview.html). HTTP provides the transfer of information between the Web users (the clients) and the Web servers.
At first, Clothes-XYZ is only using the Web to store an electronic catalog. After a customer has browsed through the catalog and has selected an item, he/she calls the store and places an order. Let us say that Clothes-XYZ wants to be more forward-looking and wants the customers to purchase the items over the Internet. In this case, a "Purchasing Gateway" program is developed and installed at the Web site. This gateway program gets into action when a user clicks on the "purchase" button on the screen. It prompts the user with a form (HTML supports forms) that the user fills out. The gateway program uses this form information to interact with a purchasing system that processes the purchase (see Figure 2.8). The purchasing system can be an existing system that is used for traditional purchasing. The role of the gateway is to provide a Web interface to the purchasing system.
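The role of the purchasing gateway can be sketched as follows. The sketch assumes that the browser submits the filled-out form as URL-encoded text (for example, "item=blue+shirt&quantity=2&customer=J.+Doe"), which is the usual encoding for HTML forms. The PurchasingSystem class is a hypothetical stand-in for the existing purchasing system, and the field names item, quantity, and customer are invented for the example.

```python
from html import escape
from urllib.parse import parse_qs


class PurchasingSystem:
    """Stand-in for the existing (non-Web) purchasing system."""

    def __init__(self):
        self.orders = []

    def place_order(self, item, quantity, customer):
        self.orders.append((item, quantity, customer))
        return len(self.orders)  # order number


ERROR_PAGE = "<html><body><p>The order form is incomplete.</p></body></html>"


def purchasing_gateway(form_body, system):
    """Translate submitted form data into a purchase and an HTML reply."""
    fields = parse_qs(form_body)  # blank fields are dropped by default
    try:
        item = fields["item"][0]
        customer = fields["customer"][0]
        quantity = int(fields.get("quantity", ["1"])[0])
    except (KeyError, ValueError):
        return ERROR_PAGE
    if quantity < 1:
        return ERROR_PAGE
    order_number = system.place_order(item, quantity, customer)
    return (
        "<html><body><p>"
        f"Order {order_number}: {quantity} x {escape(item)} for {escape(customer)}."
        "</p></body></html>"
    )


if __name__ == "__main__":
    system = PurchasingSystem()
    print(purchasing_gateway("item=blue+shirt&quantity=2&customer=J.+Doe", system))
    print(system.orders)
```

The purchasing system itself knows nothing about HTML or forms; the gateway does the translation in both directions, which is why an existing system can be reused unchanged.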
Figure 2.8 A Simple Web Example