The Technology of the Web

We now return to the question of how a page gets from the server to your browser. Lets use the idea of a URI as our starting point for this section. In fact lets start by addressing a very common point of confusion What is a URI and what is URL, you probably more commonly think of URL’s.

All locators are also identifiers, so all URLs are URI’s. Think about it this way, Bradley Miller is an identifier, it identifies me, but does not tell you how to locate me or how to contact me. 321 Olin Hall, Luther College, Decorah IA, is a locator and an identifier for a specific place in the world. On the web URL’s have three main pieces, lets look at one and identify the parts:

http://interactivepython.org/runestone/static/webfundamentals/index.html

This is a common URL with three parts:

This URL contains a lot of information. When you type this into the address bar of your browser you have given your browser enough information to get the specific page you wanted. A diagram will help with the next part of this discussion.

../_images/SimpleRequest.png

Lets suppose we continue with the URL above. You, the user, type in the URL to your browser, The browser knows that the host name is the part that comes after the first two slashes, but before the next slash. So it is able to extract that part of the URL. Using patterns like this occurs frequently in Computer Science. The name is then passed on to another server called the Domain Name Service (DNS) in the diagram above. The DNS is actually very large and complex network of servers, but we’ll just take it for granted that it works and gives us back the IP address correctly. With the address in hand, the browser establishes a connection to the server through a mechansim known as a socket. This socket allows the browser and the server to reliably communicate over the internet.

An HTTP Request

With the connection established the browser sends the following message:

GET runestone/static/webfundamentals/index.html HTTP/1.1
Host: myname.luther.edu
User-Agent: Mozilla/6.0
Accept-Language: en
_

The format of this message is defined by HTTP. Remember a protocol is a well defined set of rules for how one program communicates with another. Let us break down this message into its component parts.

  • GET – HTTP defines several commands, but the most common are GET and POST. As you might imagine, GET is a request to retrieve some resource from the server.
  • runestone/static/webfundamentals/index.html This is the URI for the resource that we want to get. Recall that this will correspond to a file on the disk drive of the server.
  • HTTP/1.1 tells the server that the browser is “speaking” the 1.1 version of the HTTP protocol.

This is all of the information that can be contained on the first line of the message. If there is anything else on the line it may confuse the server, or the server may just ignore it. This line is called the request line.

What follows the request line are a series of lines called the headers There may be just one header or there may be many. However, there can only be one header per line and they all follow the pattern of name : value. So the first line the name is host and the value is myname.luther.edu this lets the server know the name of the namchine that is contacting it. The User-Agent header tells the server that the user is running version 6.0 of the mozilla browser.

Finally the last line is blank. The server knows that there are no more headers when it gets a blank line.

An HTTP Response

The server extracts the URI from the request line and goes out to its disk to locate the file. Assuming the file is there it will make the following response through the socket back to the browser:

HTTP/1.1 200 OK
Date: Tues, 2 Sept 2014 15:39:21
Server: Apache/1.4.1
Content-Type: text/html

<html>
<head>
</head>
<body>
<h1>Hello World</h1>
</body>
</html>

Just like the request, the format of the response is spelled out by the HTTP. The first line again has three parts HTTP/1.1 tells the browser that the server is following the HTTP 1.1 protocol. the 200 is the response code that tells the browser that everything is fine and the OK is a descriptive message: Some other common response lines are:

  • HTTP/1.1 404 Not Found – you may have seen a 404 if you’ve every mistyped a URL
  • HTTP/1.1 500 Internal Server Error – The server itself has crashed for some reason
  • HTTP/1.1 302 Resource Moved

Following the response line are the headers from the server. Most of these are self explanator, but the Content-Type header is particularly important. This tells the browser that what follows is going to be some HTML. Another common Content-type is image/jpg this tells the browser that what comes next is going to be an image. Note that each image in a page will require its own request / response from the browser to the server.

Finally, notice the blank line followed by the contents of the web page itself expressed in HTML, which is the subject of the next chapter.

Next Section - The HyperText Markup Language