4. Our server attempts to download a copy of the page
Now the fun bit begins. The url.php page is currently running the show and, as far as we can tell at this point, it is all good to go, ready for downloading a page.
It goes something like this:
- We try and open a connection via HTTP to the server where the page is located. E.g. if I were after kissmyfaerie.org/contact/, the first thing would be to try and open a connection to kissmyfaerie.org itself. If this fails, the server is having some kind of trouble, so the request is rejected at the first hurdle and we abort with an appropriate error message.
- Once the connection is open, we specify what page we are after. Going on the above example, we pass on the message that we want /contact/ from kissmyfaerie.org. The server then reports back to us how it went, using a numeric status code (part of HTTP). There are several possible outcomes; the ones below are by far the most common, although there are others. Please note that the writing after each code is my interpretation of what is going on...
- 200 OK - It's fine to get the page - so I (the server) am sending it.
- 301/302 Object Moved - The page has moved, here's the new location for you. Ask for it again over there.
- 400 Bad Request - The server couldn't make any sense of the request. This page has failed.
- 401 Authentication Required - You have to enter a password to get here. We do not support this, so it fails.
- 403 Forbidden - The server has been told to forbid access to this page, so it fails.
- 404 Page Not Found - pretty obvious really.
- If the information came back as 200 or 301/302, we can process it, because the server has still given us what we need to carry on. In the case of 301/302, what we do is interpret the extra information coming from the server. Before it sends any content, it sends over a 'header', a block of information about the page itself, including information like the page length. If a 301/302 comes up, the server lets us know that the page has moved and tells us where it went, inside the header (on the "Location" line). We then take note of that, ask for the page at its new address, and use that address to work out where links should go, etc.
- Next, we have another look at the above header. Specifically, there is a line called "Content-Type" which the server uses to report what type of information is being sent. Normal webpages and plain text files are sent as "text/html" and "text/plain" respectively, whilst other file types have their own definitions (these are MIME types). GIF images, for example, are "image/gif", whilst MP3 sounds are "audio/mpeg". If it isn't "text/html" or "text/plain", we don't want to know, and abort straight away.
- Otherwise, everything is open and ready, so we keep requesting the information from the server until we hit the end of the page. It sends the page to us in chunks, so we process each chunk as it turns up, until we have the entire page contents stored in the temporary memory of our server, just long enough to modify it.
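For the curious, the steps above can be sketched in a few lines of code. This is only a rough illustration in Python, not the actual url.php code: the names `fetch_page`, `wanted_content_type` and `FetchError`, the 4096-byte chunk size, the 10-second timeout and the limit of five redirects are all assumptions made up for the example. It handles plain HTTP only.

```python
import http.client
from urllib.parse import urljoin, urlsplit

ACCEPTED_TYPES = {"text/html", "text/plain"}  # the only types we want
REDIRECTS = {301, 302}
CHUNK_SIZE = 4096  # bytes per read; an arbitrary choice for this sketch


class FetchError(Exception):
    """Hypothetical error type: the page could not be fetched or is unwanted."""


def wanted_content_type(header_value):
    """Return True if a Content-Type header names text/html or text/plain."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ACCEPTED_TYPES


def fetch_page(url, max_redirects=5):
    """Return (final_url, body_bytes); raise FetchError for the failures above."""
    for _ in range(max_redirects + 1):
        parts = urlsplit(url)
        if parts.scheme != "http" or not parts.hostname:
            raise FetchError(f"not a plain HTTP address: {url}")
        # Step 1: open a connection to the server itself.
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=10)
        try:
            # Step 2: say which page we are after on that server.
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            try:
                conn.request("GET", path)
                response = conn.getresponse()
            except (OSError, http.client.HTTPException) as exc:
                raise FetchError(f"could not reach {parts.hostname}: {exc}") from exc

            # Step 3: act on the status code the server reported.
            if response.status in REDIRECTS:
                location = response.getheader("Location")
                if not location:
                    raise FetchError("redirect without a Location header")
                url = urljoin(url, location)  # note the new address, try again
                continue
            if response.status != 200:
                raise FetchError(f"{response.status} {response.reason}")

            # Step 4: check the Content-Type line in the header.
            if not wanted_content_type(response.getheader("Content-Type")):
                raise FetchError("not text/html or text/plain")

            # Step 5: read the page in chunks until the server has no more.
            chunks = []
            while True:
                chunk = response.read(CHUNK_SIZE)
                if not chunk:
                    break
                chunks.append(chunk)
            return url, b"".join(chunks)
        finally:
            conn.close()
    raise FetchError("too many redirects")
```

Calling `fetch_page("http://kissmyfaerie.org/contact/")` would, under these assumptions, hand back the final address (after any 301/302) together with the raw page contents, or raise `FetchError` at whichever hurdle the page fell.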
Now that we have a temporary copy of the webpage in our server, let's move on to part 5 where we modify the page itself.