|What will we cover?|
The story of the world wide web and its invention by Tim Berners-Lee is probably one of the best known in computing. However, it is worth revisiting some of the key points in the story to provide a background to why the web and its technology are the way they are today. The web was invented by Berners-Lee to solve a real problem that he and his colleagues at CERN were experiencing, namely document sharing. They had a large number of documents in various locations and in different formats. Each required understanding of concepts that were explained in other documents. Many were works in progress, being developed by multiple authors. Berners-Lee realised that a technology existed which could greatly ease the situation: Hypertext. Hypertext had been around in various forms for many years; in fact the idea pre-dated the invention of the modern computer! There were several Hypertext solutions available on computers when Berners-Lee studied the problem, but they were either expensive and proprietary or only capable of running on a single computer. Berners-Lee's big contribution was to take the idea and put it on a network. This was what enabled the collaboration of many workers at once and the access to many diversely located documents. By making the network access transparent to the user it was as if the library was one gigantic document, all cross-linked and seamlessly joined together, truly a world wide web of information.
Berners-Lee had already built a type of hypertext system and he had experience with the internet, so it was fairly natural for him to take these two ideas and join them together. In doing so he invented several component parts which together form the web. The Hypertext concept required a mechanism for linking documents together, and so Berners-Lee invented a text formatting, or markup, system which he called HyperText Markup Language (HTML), based on an existing standard called Standard Generalised Markup Language (SGML). All web pages are written in HTML (or XHTML, which is an XML compliant variant of HTML), either by a human author or by a computer program. Web browsers display their pages by interpreting the HTML markup tags and presenting the formatted content. This is not the place to try and teach HTML, so if you don't already know how to create simple HTML documents then I suggest you find a good reference or tutorial, like the one here.
Having defined a markup language Berners-Lee could now create basic hypertext documents, like this tutorial. The pages could be formatted, images inserted and links to other documents created. So far no networking is involved. The next step was to make these documents accessible over the network, and to do that required the definition of a standard address format for content. This is the now ubiquitous Uniform Resource Locator or URL. The first part of a URL identifies the protocol to be used, usually http for the web. The next part uses standard internet naming to identify a server (and optional port, the default being 80) and the final part is the logical location of the content on the server. I say logical because, although the specification looks like an absolute directory path in Unix, it is in fact relative to some fixed location known to the server, and indeed may not be a real location at all, since the server may translate the location into an application invocation or other form of data source. So if we look at the full URL of this page it is:
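As an illustration of how such an address divides up, the sketch below splits a made-up URL (an assumption for demonstration, not this page's real address) into the parts just described. The try/except import is there only so that the sketch runs on both Python 2, as used in this tutorial, and Python 3.

```python
# Split a URL into the parts described above.
try:
    from urllib.parse import urlsplit   # Python 3
except ImportError:
    from urlparse import urlsplit       # Python 2

# A made-up address, used purely for illustration
example = 'http://www.example.com:80/tutor/webclient.htm'
parts = urlsplit(example)

print(parts.scheme)    # the protocol
print(parts.hostname)  # the server name
print(parts.port)      # the port, or None if the URL gives none
print(parts.path)      # the logical location on the server
```

Note that the path is only what the client asks for. How the server maps it onto a file or a program is entirely up to the server.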
One problem with the static hypertext content that this scheme provided was that it was read-only. There was no way to interact with the content, nor indeed could the content be easily modified. Berners-Lee wanted more: he wanted the ability to interact with the content dynamically.
The solution to providing dynamic content lay in appending data to the end of the URL in what is known as an encoded string or, sometimes, an escaped string. These are the long strings of letters, numbers and percent signs that you sometimes see in the address bar of your browser after visiting a link. We will discuss these strings in more detail later, since to communicate with dynamic web sites we will need to be able to construct these encoded strings.
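To give a feel for what these encoded strings look like, here is a minimal sketch using the standard library's quoting helpers. The field names and the example.com address are assumptions chosen for illustration, and the try/except import only exists so the code runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import quote, urlencode   # Python 3
except ImportError:
    from urllib import quote, urlencode         # Python 2

# quote() escapes a single value: unsafe characters become %XX codes
escaped = quote('fish & chips')
print(escaped)

# urlencode() builds a whole name=value string from pairs
query = urlencode([('type_of_search', 'soft'), ('words', 'python web')])
print(query)

# Append the encoded string to a (hypothetical) base address
print('http://www.example.com/search?' + query)
```

The space and ampersand in the first value turn into %20 and %26, which is where the percent signs in the address bar come from.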
The ability to interact with the content provided several features. One important advantage was the ability to provide secured access to pages by requiring the user to enter a username and password. Users could be required to register interest in pages before being granted access. And of course the same ability to capture usernames etc. could also be used to capture credit card details and selections from catalogs, and thus internet shopping became possible.
The technology for creating dynamic web content is called the Common Gateway Interface or CGI. CGI had the advantage of being very easy to implement and was entirely language neutral. Early CGI based web applications were written in languages as diverse as C and C++, DOS Batch files and Unix shell scripts, with everything in between. However, it didn't take long for the scripting language Perl to become a favourite, largely due to Perl's built in text handling power and the existence of a powerful CGI library module for handling web requests. As you would expect, Python also has a cgi module for building basic dynamic web applications and we will look at that in the web server topic.
There is one major snag to standard CGI programming, which is that every request requires the web server to create a new process. This is slow and resource hungry, especially on Windows computers. It was only a matter of time, therefore, before new, more efficient mechanisms were developed that still utilised the CGI protocol but enabled the web servers to handle the requests more efficiently. Nowadays relatively few web applications are written using basic CGI and Perl has lost its monopoly as the web language of choice, with technologies like Active Server Pages (ASP), Java Server Pages (JSP), Servlets, PHP and so on becoming dominant. We will look at one such technology based on Python (of course!) in the web server topic.
Before we get to the server, however, I want to look at how we can access existing web servers using web client techniques. Essentially, how can we write a program that emulates a web browser, one that can fetch data from a URL and process it? Taken to extremes these techniques could even be used to create a web browser if you were sufficiently keen! I will limit myself to less ambitious goals for now!
Before we can start programming web clients we need a basic understanding of how the web works. And that means looking at the HyperText Transfer Protocol, or http. If you followed the link you'll see that there is quite a lot to learn to become an expert on http. Fortunately we don't need to be experts and the level of knowledge required to use http is quite low.
Basically when a web client sends a request to a web server it issues an http GET request. The server responds either with an error message, for which there are several standard codes, or with an HTML document. It is possible to do all of that manually using sockets and formatted strings. But a much easier method is to use the urllib module provided in the Python standard library. More specifically I recommend using the more recent urllib2 module which has some very useful extra features.
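To show roughly what "formatted strings" means here, the sketch below builds the text of a minimal GET request and picks apart the first line of a reply. It deliberately stops short of opening a socket, so nothing is sent over the network. The function names and the example.com host are hypothetical, chosen only for illustration.

```python
def build_get_request(host, path='/'):
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        'GET %s HTTP/1.1' % path,
        'Host: %s' % host,
        'Connection: close',
        '',   # a blank line marks the end of the headers
        '',
    ]
    return '\r\n'.join(lines)

def parse_status_line(line):
    """Split a status line like 'HTTP/1.1 404 Not Found' into its parts."""
    version, code, reason = line.strip().split(' ', 2)
    return version, int(code), reason

request = build_get_request('www.example.com', '/index.html')
print(repr(request))
print(parse_status_line('HTTP/1.1 404 Not Found'))
```

Doing this by hand for every request soon becomes tedious, which is exactly the work that the library modules take off our hands.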
Urllib2 tries to make working with URLs nearly as easy as working with files. Let's look at it in action.
The sequence of steps to fetch a basic web page is very simple. Basically you just need to import urllib2 then open a url and read the result. Like this:
>>> import urllib2 as url
>>> site = url.urlopen('http://www.sourceforge.net')
>>> page = site.read()
>>> print page
That really is all that's required. You now have a string containing all the HTML for the Sourceforge homepage. You can compare it, if you like, with the effects of using View->Source from your browser. Apart from the fact that the browser formats the HTML to make it more readable it is exactly the same content. The next question is what to do with the content now that we have it? We need to parse it to extract the information we need. At its simplest level parsing just involves searching for a specific string or regular expression. For example, you could find all of the image tags in a page by searching for the string '<img' or '<IMG' or the regular expression '<img' with the re.IGNORECASE flag set.
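A minimal sketch of that regular expression search might look like this. The short page string stands in for the result of site.read(), and the pattern is extended slightly (an assumption on my part) so that it captures each whole tag rather than just its first four characters.

```python
import re

# A small stand-in for the string returned by site.read()
page = '''<html><body>
<img src="logo.png" alt="Logo">
<p>Some text</p>
<IMG SRC="photo.jpg">
</body></html>'''

# Match every tag that starts with <img, whatever the case
img_pattern = re.compile(r'<img[^>]*>', re.IGNORECASE)
images = img_pattern.findall(page)

for tag in images:
    print(tag)
```

This works well enough for flat searches like this one, but it takes no account of where in the page structure a tag appears.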
However, web pages usually return multiple bits of information that we want to extract from specific parts of the page. Web pages are constructed in a particular hierarchical form, often with nested elements (such as headings within cells of tables which are within headings of the document). Fortunately for us there are several libraries available that will parse html into a data structure for us. One of the best is Beautiful Soup which is exceptionally good at handling the irregular forms of HTML that are frequently found in real web pages. That is an add-on module not found in the standard Python library, so we will take a look at the standard tools instead. But if you do find you need to examine some complex or badly written HTML do take a look at Beautiful Soup.
Before we look at parsing HTML however we need to consider how to send data in our original request, for example a search string to SourceForge.
There are basically two ways to send data to a web server using http. These are known as GET and POST. GET is the easier to use but is also limited in the amount of data that can practically be transmitted. POST is more powerful and a little bit more secure.
From which it's pretty obvious that our search string goes after the &words= and forms the end of the URL string.
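The difference between the two methods can be sketched without sending anything, by building request objects and inspecting them. This is only an illustration: it reuses the SourceForge search address and field names from this page, and whether a given server accepts POST for an address depends entirely on that server. The try/except import lets the same code run on Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode          # Python 3
    from urllib.request import Request
except ImportError:
    from urllib import urlencode                # Python 2
    from urllib2 import Request

base = 'http://sourceforge.net/search/'
fields = urlencode([('type_of_search', 'soft'), ('words', 'python')])

# GET: the encoded data is appended to the URL after a '?'
get_request = Request(base + '?' + fields)

# POST: the same data travels in the body of the request instead
post_request = Request(base, data=fields.encode('ascii'))

print(get_request.get_full_url())
print(get_request.get_method())
print(post_request.get_method())
```

Either request object could then be passed to urlopen() in place of a plain URL string.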
Now by viewing the page source we can see what kind of string we can expect to get back. Doing a search for some keywords in the first result finds a section of html that looks like:
<div class=g><!--m-->
<link rel="prefetch" href="http://www.python.org/">
<h2 class=r><a href="http://www.python.org/" class=l>
<b>Python</b> Programming Language -- Official Website</a>
</h2><table border=0 cellpadding=0 cellspacing=0><tr>
<td class=j><font size=-1>Home page for <b>Python</b>,...
By examining that closely we see that the hyperlink has a class=l attribute. Scanning the rest of the page we see that this class is only used for search result links. So if we parse the returned string for this attribute of an a tag we should be able to build a list of the links returned by SourceForge.
Let's try that using urllib2:
import urllib2 as url

target = 'http://sourceforge.net/search/?type_of_search=soft&words=python'
page = url.urlopen(target)
for line in page:
    if line: print line
There are several other gotchas waiting to bite you but I won't cover those here. Things you might look out for and have to do some research on are handling login prompts, using cookies and handling encrypted https connections. All of these are possible with a bit of effort but beyond the scope of this tutorial.
We mentioned above the need to parse the html string received from a web server. Parsing really just means understanding and interpreting the structure of text. For example every HTML page has a certain structure defined by tags. The rules of html determine which tags can be used where. Parsers understand these rules and use them in one of two ways. Either they build up a hierarchical data structure representing the web page, which we can then access to extract the data, or they generate 'events' corresponding to each tag found and we then provide functions which respond to these events. The standard Python HTML parsers that we use in this section follow the second route.
ElementTree, which we discuss later, uses the first method, as indeed its name suggests: it builds a tree structure.
Python's standard library includes a couple of HTML parsers that are fine for basic use. They are good for stripping tags from a page or pulling out a piece of data from a well formatted HTML page. Unfortunately they are not so easy to use for more complex parsing. The more powerful and flexible of the two is the one in the htmllib module, however with flexibility comes complexity so we will look at the slightly easier to use option found in the HTMLParser module. Within this module is a single class, HTMLParser, that is used as a base class for your task-specific parsing needs.
The HTMLParser class acts as an event driven parser in that it calls event handling functions in response to finding particular constructs or tags within the HTML. These event handlers are predefined within the base class but by default do nothing. Thus to actually produce output we must override the appropriate handlers within our new sub class.
Perhaps the easiest parser is one which simply strips out all HTML tags and special characters and prints a plain text version of an HTML string. It looks like this:
html = '''
<html><head><title>Test page</title></head>
<body>
<center>
<h1>Here is the first heading</h1>
</center>
<p>A short paragraph
<h1>A second heading</h1>
<p>A paragraph containing a
<a href="www.google.com">hyperlink to google</a>
</body></html>
'''

from HTMLParser import HTMLParser

class PlainTextParser(HTMLParser):
    def handle_data(self, data):
        print data

parser = PlainTextParser()
parser.feed(html)
parser.close()
Obviously instead of the static HTML string we used here we could have used the result of reading a web site using urllib2 as we did earlier.
In practice we usually want to extract more specific data from a page, maybe the content of a particular row in a table or similar. For that we need to use the handle_starttag() and handle_endtag() methods. As an example let's extract the text of the second H1 level header:
html = '''
<html><head><title>Test page</title></head>
<body>
<center>
<h1>Here is the first heading</h1>
</center>
<p>A short paragraph
<h1>A second heading</h1>
<p>A paragraph containing a
<a href="www.google.com">hyperlink to google</a>
</body></html>
'''

from HTMLParser import HTMLParser

class H1Parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.h1_count = 0
        self.isHeading = False

    def handle_starttag(self,tag,attributes=None):
        if tag == 'h1':
            self.h1_count += 1
            self.isHeading = True

    def handle_endtag(self,tag):
        if tag == 'h1':
            self.isHeading = False

    def handle_data(self,data):
        if self.isHeading and self.h1_count == 2:
            print "Second Header contained: ", data

parser = H1Parser()
parser.feed(html)
parser.close()
Here we introduce a counter to detect the second instance and a flag to indicate when we have found what we want. We use the starttag and endtag event handlers to control the flag and counter variables and the handle_data method to produce the output depending on the state of the two variables.
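The same handlers can also look at a tag's attributes. Going back to the search result we examined earlier, where the result links carried a class=l attribute, a sketch along the following lines could collect those links into a list. The LinkParser name and the cut-down snippet (including the second, unrelated link) are my own assumptions for illustration. The try/except import only exists so the code runs on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

snippet = '''
<h2 class=r><a href="http://www.python.org/" class=l>
<b>Python</b> Programming Language</a></h2>
<p><a href="http://www.example.com/advert">An unrelated link</a></p>
'''

class LinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attributes):
        # attributes arrives as a list of (name, value) pairs
        attrs = dict(attributes)
        if tag == 'a' and attrs.get('class') == 'l':
            self.links.append(attrs.get('href'))

parser = LinkParser()
parser.feed(snippet)
parser.close()
print(parser.links)
```

Storing the results on the parser object, rather than printing them inside the handler, makes them available to the rest of the program once parsing has finished.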
By combining various permutations of these techniques it is possible to extract most bits of information from an HTML page using this parser. However you can get caught out by badly written (or badly formed) HTML. In those cases you may need to write custom code to correct the HTML. If it is extensively malformed it might be better to write the HTML to a text file and use an external HTML checker like HTMLtidy to clean up and correct the HTML before trying to parse it. Alternatively investigate the third party package 'Beautiful Soup' which can cope with most of the common problems in HTML.
One other thing you need to be able to detect is an error returned from the web server. In a browser these are displayed for us as "Page not Found" or similar, relatively friendly, error strings. However if you are fetching data from the server yourself you will find that the error comes back in the form of an error code in the http header, which urllib2 converts to an exception (this is one of the extras provided by urllib2 over the older urllib module). If the web server address is wrong you will get an IOError exception from the underlying socket code. Now, we know how to catch exceptions using a normal try/except construct, so we can catch these errors quite easily like so:
import urllib2 as url

try:
    asock = url.urlopen("http://www.google.com/map")
except url.HTTPError, e:
    print e.code
The value in urllib2.HTTPError.code comes from the first line of the web server's HTTP response, just before the headers begin (e.g. "HTTP/1.1 200 OK", or "HTTP/1.1 404 Not Found"), and consists of the error number. The standard HTTP return codes are described here; the most interesting are those starting with 4 or 5.
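As a rough sketch of acting on the code value, the function below tries again when the server answers 503 or 504 and passes every other error straight back to the caller. The fetch_with_retry name, the number of attempts, the delay and the opener parameter are all assumptions made for illustration, not part of urllib2 itself. The try/except import lets the same code run on Python 2 and Python 3.

```python
import time
try:
    from urllib.request import urlopen   # Python 3
    from urllib.error import HTTPError
except ImportError:
    from urllib2 import urlopen, HTTPError   # Python 2

RETRY_CODES = (503, 504)   # assumed to be worth a second attempt

def fetch_with_retry(address, attempts=3, delay=1.0, opener=urlopen):
    """Return the page at address, retrying on the codes above."""
    for attempt in range(attempts):
        try:
            return opener(address).read()
        except HTTPError as e:
            last_try = (attempt == attempts - 1)
            if e.code not in RETRY_CODES or last_try:
                raise   # give up: pass the error on to the caller
            time.sleep(delay)
```

Calling fetch_with_retry("http://www.google.com/map") would then behave like the earlier example, except that a temporarily overloaded server gets a few more chances before the exception reaches your code. The opener parameter is there so that a stand-in function can be supplied when testing without a network connection.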
The most common error codes you will encounter are:
Some of these (e.g. 503, 504) simply require a retry, possibly after a short delay; others (e.g. 407) require significantly more work on your part to access the web page! OK, now let's look at another form of parser.
ElementTree for XML/XHTML. Based on DOM like model. Part of standard library since v2.5
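As a taste of the tree-building style, here is a minimal sketch that finds the second heading again, this time with ElementTree. ElementTree is an XML parser, so it needs well-formed input. The test page used here is therefore a tidied-up variant of the earlier one, with every tag closed, which is an assumption on my part.

```python
import xml.etree.ElementTree as ET

# A small, well-formed XHTML-style document (every tag is closed)
xhtml = '''<html><head><title>Test page</title></head>
<body>
<h1>Here is the first heading</h1>
<p>A short paragraph</p>
<h1>A second heading</h1>
</body></html>'''

root = ET.fromstring(xhtml)       # build the whole tree in one go
headings = root.findall('.//h1')  # search the tree rather than wait for events
print(headings[1].text)
```

Compared with the H1Parser class above there are no counters or flags to maintain, because the whole document is available as a data structure before we start looking for anything.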
A Parsing Pleasure
|Things to remember|