
3. Our server checks whether we can and should do the link

By this time, the action is with the server: it knows which website you have asked for, and which style of modification it should apply to the page once it has fetched it.

But we have to check several things first.

  1. Is the selected voice on our list of webpage-supporting voices?
  2. Is the link an actual http:// link?
  3. Is the link in our banned list?
  4. Does the site have a robots.txt?
    1. If so, is The Voices of Many listed?
    2. If not, is there a * entry (which covers TVoM)?
    3. If there is an appropriate entry, is the URL requested a banned one?

Looks complicated, doesn't it? Well, let's go through them, and hopefully it'll be clearer.

  1. Is the selected voice on our list of webpage-supporting voices?
    Some voices by their nature are too awkward to support webpages (e.g. binary would multiply the size of pages drastically, because it turns every non-HTML letter, number or symbol into a group of 8 digits and a space). Others haven't yet been made efficient enough to cope with the rigours of webpages. So we have to check that nothing went wrong at this stage: only entries that actually appear in the drop-down menus should be supported. If something odd turns up, we will let you know.
  2. Is the link an actual http:// link?
    You've probably noticed many website addresses starting with the prefix http://. This signifies that the page is being transmitted through the Hypertext Transfer Protocol (a series of instructions between browser and server about sending and receiving pages). There are other protocols, such as HTTPS (a secure form of HTTP), FTP (File Transfer Protocol), news:// (for newsgroups), and others. Unless the address begins with http://, showing that it is a proper HTTP address, we can't even attempt it. The prefix really should be in lower case, but we sort that out if it isn't. If the link isn't http://, we abort and tell you that there is a problem.
  3. Is the link in our banned list?
    As explained in this article, not all webmasters like us doing this sort of thing. Some of them have a robots.txt file (see next item), but others don't want us having anything to do with their server. If this is the case, they should let me know, and I add their website's details into our banned list. This is an internal list stored on The Voices of Many. What happens is that url.php looks through the list, to see if it can find any mention of the same website as requested in your link. If it does, it aborts immediately, and explains the situation. In that event, I will require confirmation from the website owner - not you, as the user - that it would be OK to remove the item from the banned list.
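
The first three checks can be sketched roughly as follows. This is a minimal illustration in Python, not the site's actual url.php: the voice names, the banned host and the function names (normalise_http_url, is_banned, precheck) are all made up for the example, and matching the banned list on host name alone is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical data: the real voice list and banned list live on the server.
SUPPORTED_VOICES = {"voice-a", "voice-b"}   # "binary" is deliberately absent
BANNED_HOSTS = {"www.example.org"}


def normalise_http_url(link):
    """Return the link with a lower-case http:// prefix, or None if it is not HTTP."""
    prefix = "http://"
    if link[:len(prefix)].lower() != prefix:
        return None
    # Tidy up the prefix only; the rest of the address is left untouched.
    return prefix + link[len(prefix):]


def is_banned(url, banned_hosts=BANNED_HOSTS):
    """True if the requested website appears in the banned list (assumed: by host)."""
    host = urlsplit(url).hostname or ""
    return host in banned_hosts


def precheck(voice, link):
    """Run checks 1 to 3 in order and return (ok, message_or_url)."""
    if voice not in SUPPORTED_VOICES:
        return False, "That voice does not support webpages."
    url = normalise_http_url(link)
    if url is None:
        return False, "Only http:// links can be processed."
    if is_banned(url):
        return False, "The owner of that website has asked us not to process it."
    return True, url
```

With these made-up lists, precheck("voice-a", "Http://www.example.com/page") passes and hands back the tidied address, while an https:// link or a link to www.example.org is turned away before anything is downloaded.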
  4. Does the site have a robots.txt?
    A common feature these days is for webmasters to put a file at the very top of their website containing instructions on how to process the website, including areas to avoid. It's normally meant for 'robots', automated programs which scour the web for content (mainly search engines). But because we're nice, we have added the facility to check such things. The first thing we have to do is see if there is one. It will always live in the 'top level' of the website. E.g. if The Voices of Many had one, it would be thevoicesofmany.com/robots.txt. (As it happens, it does have one, but it's empty.) So, having established where it would be, we try to download a copy of it for temporary use. If we can't download it, we take it that the website authors haven't set up any rules, so we can move on to part 4 straight away. Otherwise, if we managed to download one, we have to check whether it's OK to proceed. It works like this:
    1. If so, is The Voices of Many listed?
      What we are looking for is a line that looks like this: "User-agent: thevoicesofmany". If we find that, we know that we have to look further on. We'll come back to that in a minute, because we know at this point there is a rule which covers The Voices of Many.
    2. If not, is there a * entry (which covers TVoM)?
      Some sites have generic rules which cover everybody, not just - say - Google, or The Voices of Many. They won't maintain massive rulesets for everybody. So if they don't list The Voices of Many specifically, it may be that a general rule covers the situation instead. If there is a generic rule, the line will look something like this: "User-agent: *". If we find that, and we don't find "User-agent: thevoicesofmany" anywhere, we'll follow that rule. It may be that we don't find either. If we don't have either, we jump out at this point, and move on to part 4 straight away.
    3. If there is an appropriate entry, is the URL requested a banned one?
      If we're here, we found a rule either for The Voices of Many or for generic robots. We accept either, so we need to look further. Rules follow the form "Disallow: path". For example, suppose I were the webmaster of www.example.com, had a directory called "dontgoin" off the top level, and didn't want The Voices of Many to go in. I'd put the following in www.example.com/robots.txt:
          User-agent: thevoicesofmany
          Disallow: /dontgoin
          
      That means that any webpage requested which began "www.example.com/dontgoin" would fail. So, www.example.com/dontgointhere and www.example.com/dontgoin/specialcontent would both fail. The same would be true if we put a * for the user-agent instead of thevoicesofmany (although with an asterisk, it would block all compliant robots from going in there). If we find a matching entry against the requested webpage, we abort with an appropriate message. Otherwise, we move on to part 4.
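
The robots.txt steps above can be sketched like this. It is an illustrative Python sketch under stated assumptions, not the code that actually runs on The Voices of Many: the helper names (parse_groups, robots_allows, fetch_robots) are hypothetical, user-agent names are compared case-insensitively as a guess, and only "User-agent" and "Disallow" lines are interpreted.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

AGENT = "thevoicesofmany"


def parse_groups(robots_text):
    """Map each lower-cased user-agent name to its list of Disallow paths."""
    groups = {}
    current = []            # agents named by the group being read
    reading_rules = False   # True once the current group has had a rule line
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:                 # a new group starts here
                current, reading_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            reading_rules = True
            if field == "disallow" and value:  # an empty Disallow bans nothing
                for agent in current:
                    groups[agent].append(value)
    return groups


def robots_allows(robots_text, url, agent=AGENT):
    """Apply the three sub-steps: own entry, else * entry, else no rules."""
    groups = parse_groups(robots_text)
    if agent in groups:
        banned = groups[agent]
    elif "*" in groups:
        banned = groups["*"]
    else:
        return True                            # neither entry found
    path = urlsplit(url).path or "/"
    return not any(path.startswith(prefix) for prefix in banned)


def fetch_robots(url):
    """Try to download robots.txt from the top level; return None on failure."""
    host = urlsplit(url).netloc
    try:
        with urlopen(f"http://{host}/robots.txt", timeout=10) as reply:
            return reply.read().decode("utf-8", errors="replace")
    except OSError:
        return None
```

A caller would fetch first and only consult the rules if something came back, for example: text = fetch_robots(url), then proceed if text is None or robots_allows(text, url). Because the match is a plain prefix comparison, the example rule "Disallow: /dontgoin" blocks both /dontgointhere and /dontgoin/specialcontent, as described above.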