
5. Our server modifies the original page

Well, if we're at this stage, we should have gotten all the webpage content in and checked that everything is in order to proceed. So, now the real fun begins...

Several things happen behind the scenes before we get onto the real fun of processing the page. This section runs just to get us started.

  • Entities are decoded. (Some special characters, including most common accented letters, are wrapped up as 'entities'. Common ones include the copyright symbol, which is written as &copy;, and the & (ampersand), which is written as &amp;. Basically, we translate entities like &copy; and &amp; into their respective symbols, © and &, so that we can process the text without any awkwardness and without modifying the characters themselves. They might be important, so we leave them alone.)
  • We then load the appropriate module for text translation, depending on whichever module was requested.
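
The entity decoding step in the list above can be sketched in a few lines. This is only an illustration in Python, not the site's own code; the function name decode_entities is made up for the example, and it leans on Python's standard html module.

```python
import html

def decode_entities(text):
    """Turn entities such as &copy; and &amp; into the characters they stand for."""
    return html.unescape(text)

sample = "Caf&eacute; &copy; 2004 Fish &amp; Chips"
print(decode_entities(sample))  # Café © 2004 Fish & Chips
```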

At that point, we are ready to go with the main page itself. In each module, there is a special function (miniature program) called 'revoice_text', which is given the text to work with and outputs the changed text back to url.php in order to set it up nicely for you. However, revoice_text is set up to handle text only; it doesn't touch HTML at all.

The actual process just goes through the page looking for the starts and ends of HTML tags (the < and > symbols). When it finds one, it stops for a moment.

Anything between the last > and the current < is assumed to be normal text (and should be at this point), so it is thrown over to revoice_text to play with, and the output is bolted onto the temporarily-stored new page contents (we don't send the page contents until we're actually done).

Meanwhile, anything between the current < and the next > is an HTML tag and we must evaluate it, in case we need to manipulate it.
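
To make the scanning loop a bit more concrete, here is a minimal sketch in Python. It is not the site's actual code: revoice_text here is a stand-in that just upper-cases the text, handle_tag is a hypothetical placeholder that leaves tags alone, and the sketch assumes no > character hides inside an attribute value.

```python
def revoice_text(text):
    # Stand-in for a module's revoice_text: simply upper-cases the text.
    return text.upper()

def handle_tag(tag):
    # Hypothetical placeholder: a fuller version would inspect the attributes.
    return tag

def process_page(page):
    output = []  # the temporarily-stored new page contents
    pos = 0
    while pos < len(page):
        start = page.find("<", pos)
        if start == -1:
            # No more tags, so everything left is plain text.
            output.append(revoice_text(page[pos:]))
            break
        if start > pos:
            # Text between the last '>' and the current '<'.
            output.append(revoice_text(page[pos:start]))
        end = page.find(">", start)
        if end == -1:
            # Unterminated tag: pass the remainder through untouched.
            output.append(page[start:])
            break
        output.append(handle_tag(page[start:end + 1]))
        pos = end + 1
    return "".join(output)

print(process_page('<p class="x">hello <b>world</b></p>'))
# <p class="x">HELLO <b>WORLD</b></p>
```

Collecting the pieces in a list and joining them at the end mirrors the idea of not sending anything until the whole page is done.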

Most tags require no special handling, because they don't reference external documents.

The first type of manipulation is to take a specified weblink and preserve it, turning it into an absolute link, so it will still work and point to the original site without problem. The obvious example is IMG for images (so that we still keep images properly). The following tags, and their attributes, need this kind of modification:

  • APPLET tag (codebase attribute) - for Java/other applets
  • AREA tag (href attribute) - for image maps
  • BASE tag (href attribute) - base URL for references
  • BODY tag (background attribute) - for any background images
  • FORM tag (action attribute) - for any forms to be sent back to original server
  • FRAME tag (longdesc attribute) - for long descriptions of frames
  • HEAD tag (profile attribute) - for any site which defines a general profile of itself
  • IFRAME tag (longdesc attribute) - for long descriptions of inline frames
  • IMG tag (src, lowsrc attributes) - for images
  • INPUT tag (src, usemap attributes) - for input controls on forms which use external data
  • LINK tag (href attribute) - for stylesheets and other meta-level links
  • OBJECT tag (codebase, classid, data, usemap attributes) - for inline applications to preserve their data
  • SUBMISSION tag (action attribute) - for any forms, like the FORM tag
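
Turning a link into an absolute one, as used for the attributes listed above, might look something like this. It is a sketch in Python using the standard urljoin function rather than the site's own code; absolutise is a made-up name and the example.com addresses are purely illustrative.

```python
from urllib.parse import urljoin

def absolutise(link, page_url):
    """Resolve a link found on page_url into a full, absolute address."""
    return urljoin(page_url, link)

page_url = "http://example.com/news/index.html"  # illustrative address only
print(absolutise("images/logo.png", page_url))   # http://example.com/news/images/logo.png
print(absolutise("/css/site.css", page_url))     # http://example.com/css/site.css
print(absolutise("http://other.example/a.gif", page_url))  # already absolute, so unchanged
```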

The second kind of manipulation is a little more involved. Here we also make the links absolute, not to preserve them as they are, but so that we can plug them into thevoicesofmany.com. Basically, what happens here is we set up each link so that you can move from one page to the next without any work. These links are:

  • A tag (href attribute) - for hyperlinks
  • FRAME tag (src attribute) - for frames
  • IFRAME tag (src attribute) - for inline frames
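
Here is a rough sketch of that second kind of rewriting, again in Python. The address format the site really uses is not given on this page, so the PROXY value below is an entirely hypothetical placeholder, and proxied is a made-up name.

```python
from urllib.parse import urljoin, quote

# Hypothetical placeholder: the real address format is not described here.
PROXY = "https://proxy.example/url.php?page="

def proxied(link, page_url):
    """Make the link absolute, then wrap it so it is fetched via the proxy."""
    absolute = urljoin(page_url, link)
    # Percent-encode everything so the address survives as a query value.
    return PROXY + quote(absolute, safe="")

print(proxied("../about.html", "http://example.com/news/index.html"))
# https://proxy.example/url.php?page=http%3A%2F%2Fexample.com%2Fabout.html
```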

There is a third kind of manipulation, which is used on a few tags which contain modifiable text, e.g. the writing you get when you hover over some links (like on The Voices of Many's menu). All we do is work out which bits are the text, and plug those straight into revoice_text. These are:

  • A tag (title attribute) - for hyperlinks
  • ABBR tag (title attribute) - for full forms of abbreviations
  • ACRONYM tag (title attribute) - for full forms of acronyms
  • IMG tag (alt attribute) - for alternative text instead of images
  • INPUT tag (alt attribute) - for alternative text instead of images
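
As a sketch of this third kind of manipulation, the Python below picks out the title and alt values for the tags listed above and hands only those to revoice_text. It is an illustration under assumptions, not the site's code: revoice_text is a stand-in that upper-cases text, revoice_attributes is a made-up name, and only double-quoted attribute values are handled.

```python
import re

def revoice_text(text):
    # Stand-in for a module's revoice_text: simply upper-cases the text.
    return text.upper()

# Tag name -> attributes whose values are readable text (from the list above).
TEXT_ATTRS = {
    "a": {"title"},
    "abbr": {"title"},
    "acronym": {"title"},
    "img": {"alt"},
    "input": {"alt"},
}

ATTR_RE = re.compile(r'(\s)([A-Za-z-]+)="([^"]*)"')

def revoice_attributes(tag):
    name = re.match(r"<\s*([A-Za-z0-9]+)", tag)
    if not name:
        return tag  # closing tag or something unexpected: leave it alone
    wanted = TEXT_ATTRS.get(name.group(1).lower(), set())

    def swap(match):
        space, attr, value = match.groups()
        if attr.lower() in wanted:
            value = revoice_text(value)
        return f'{space}{attr}="{value}"'

    return ATTR_RE.sub(swap, tag)

print(revoice_attributes('<img src="cat.png" alt="a cat">'))
# <img src="cat.png" alt="A CAT">
```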

STYLE and SCRIPT tags, despite needing some of the above, must be dealt with separately, along with comments (<!-- comment --> tags).

For these, we look to see if any external files are referenced (SCRIPT tag, src attribute) and preserve the address if so. Otherwise, we just look for the end of the tag, returning everything in between as absolutely normal and unmodified, for CSS or JavaScript as required.
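
One way to picture the "look for the end and return everything in between" step is the Python sketch below. It covers only the pass-through part (not preserving a SCRIPT src address), skip_verbatim is a made-up name, and a regular expression stands in for whatever search the real code does.

```python
import re

# A comment, SCRIPT block or STYLE block, matched as one untouched chunk.
VERBATIM = re.compile(
    r"<!--.*?-->|<script\b.*?</script\s*>|<style\b.*?</style\s*>",
    re.IGNORECASE | re.DOTALL,
)

def skip_verbatim(page, start):
    """Return the index just past a verbatim chunk beginning at start, else None."""
    match = VERBATIM.match(page, start)
    return match.end() if match else None

page = "a<script>if (a < b) { x(); }</script>b"
end = skip_verbatim(page, 1)
print(page[1:end])  # <script>if (a < b) { x(); }</script>
```

The point of treating the chunk as a whole is that a stray < inside JavaScript or CSS is never mistaken for the start of a tag.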

Phew. Having got all of the HTML processing out of the way, let's look at the actual text translation itself: revoice_text, a function which every translation module defines, but which is different in each one.

The problem is that it is very hard to explain how revoice_text works, because it is different for every single module, and isn't just a case of searching for one thing and replacing it with another.

Some modules, like Morse code, can function pretty simply like that (although that one removes anything which isn't a letter or digit before processing), but the Elmer Fudd module, for example, is a lot more complicated.
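
For a feel of the simple search-and-replace style, here is what a Morse-like revoice_text could look like in Python. This is a sketch, not the site's module: the output layout (a space between letters, " / " between words) is an assumption made for the example.

```python
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}

def revoice_text(text):
    words = []
    for word in text.lower().split():
        # Anything that isn't a letter or digit is simply dropped.
        codes = [MORSE[ch] for ch in word if ch in MORSE]
        if codes:
            words.append(" ".join(codes))
    return " / ".join(words)

print(revoice_text("Hi, Bob"))  # .... .. / -... --- -...
```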

A snippet of the logic behind the Elmer Fudd module goes like this:

  • Is the current letter an R?
  • NO: return the letter and move on to the next one.
  • YES:
    • Is the letter after it a vowel?
    • NO: return the two letters as they are and move on.
    • YES:
      • Is the letter before it a vowel?
      • NO: Change the R to a W (careful of getting upper/lower case right) and move on.
      • YES: Return the r and move on to the next letter.
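
The decision list above translates fairly directly into code. The Python below is a sketch of that logic only, not of the real module (which does a good deal more); treating only a, e, i, o and u as vowels is an assumption made for the example.

```python
VOWELS = set("aeiouAEIOU")  # assumption: y is not counted as a vowel

def revoice_text(text):
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch not in "rR":
            out.append(ch)          # not an R: return the letter and move on
            i += 1
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt not in VOWELS:
            out.append(ch + nxt)    # return the two letters as they are
            i += 1 + len(nxt)
            continue
        prev = text[i - 1] if i > 0 else ""
        if prev in VOWELS:
            out.append(ch)          # vowel on both sides: keep the R
        else:
            out.append("W" if ch == "R" else "w")  # keep upper/lower case
        i += 1
    return "".join(out)

print(revoice_text("Really scary rabbit"))  # Weally scary wabbit
```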

At the time of writing, the Elmer Fudd dialect was actually the most complex on the site, due to the rulesets and the nature of rules being within rules. It is not implemented quite as above, but that is how the logic would follow through.

Once the modifications are applied to all the text, the module sends it back to url.php with a message which basically says, "All done, you can send that back to the browser now". url.php does a little more processing, converting special characters back into entities (the reverse of the decoding described above), and setting up a bit more of the page. Basically, the result is the transformed text plus the now-sorted HTML.

There are a couple of final things to do before we can return the newly translated webpage. Please move on to part 6 to see what.