In Outline: Multilingual Websites

Multi-national organizations sometimes operate multiple domains with the same base name but different country codes. If the content served under these domains is similar, but each has a different default language, the whole could be considered a form of multilingual website. This discussion, however, is about how a single webapp, tied to a single domain, may serve content in multiple languages: in other words, a multilingual website.

Dissemino imposes the common addressing convention of using the applicable language code as the first part of the URL path, but only where the applicable language is other than the default. So with a domain of www.mydomain.com, English as the default language and German as a supported language, a URL of http://www.mydomain.com/ finds the English homepage, and a URL of http://www.mydomain.com/de/ finds the homepage in German.

Ideally, a multilingual website would have a version of each page for each supported language, written by human translators. Where the funding for this is unavailable, there is the option of machine translation. It may be that only some of the supported languages are human translated, or that for a given language, only certain pages are.

The options are: use human translation throughout; prioritize some languages for human translation and use machine translation for the others; or use machine translation throughout. The objective is to support all three options.

On the face of it, multilingual websites ought to be simple, at least conceptually. Starting from an original website in the original language, the pages are translated into identical sets of pages in each of the languages to be supported. Addressing is achieved by adding the language code to the start of the path part of the page URLs, and links within pages are adjusted accordingly. Visitors can switch language by such means as a pull-down menu of language names alongside national flags. Clicking on any flag takes the visitor to the current page in the target language. It is a lot of work but, in theory, simple.

Dissemino does not have in-built translation capability but works with online translation services such as Google Translate. These services translate words and sentences, but leave alphanumeric sequences unchanged so "p1.4 Hello World" translated to German is returned as "p1.4 Hallo Welt". Dissemino relies on this behaviour to match translated strings back to their originals.

Multilingual URLs

The convention of prefixing the URL path with a language code gives the appearance of subdirectories of the document root, with language codes as names. Such subdirectories are fictitious and do not exist. As explained in article 3.2 "Basic HTML Pages", the document root directory of a Dissemino web application, unlike that of an Apache website, is intended strictly for passive, fixed content resources. Files placed in the document root and any subdirectories thereof, such as image files, are served verbatim as inert binaries and have no processing applied to them. The homepage and other pages of the web application are entities defined in the configs. They are not files and do not reside in directories. The recommended practice when assigning URLs to pages is for there to be no apparent directory tree structure, i.e. all pages are given URLs of the form http://mydomain/pagename. This, however, is only a recommendation, and it is common practice for pages to be grouped in some way and given URLs of the form http://mydomain/groupname/pagename. Bear in mind that where this is done, the groupname directories are fictitious.

Language Support Operation

In the web application configs, page and article definitions must specify the URL, and also the language unless the default is to be assumed. Where the supported languages are served entirely by machine translation, or not at all, each page or article has only one definition. Where a page or article has a human translation, however, there is an additional definition for that translation, with the language specified but the exact same URL.

This addressing regime means that nothing other than text needs to be changed by human translators. Crucially, there is no need to consider any impact that supporting multiple languages may have on links, as Dissemino automatically adjusts these in the output HTML. Link adjustment is simple: if the request URL contains a language specifier, all links in the output HTML will contain the same language specifier. There are no cookies involved, so the behaviour is the same irrespective of any user sessions. Language selection, usually by means of a pull-down menu of languages alongside national flags, is also simple. In all cases the current page is refetched, but with the desired language specifier.
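The link adjustment rule can be sketched in a few lines of Python. This is an illustration only, not Dissemino's own code: the function name adjust_link is hypothetical, and the decision to leave external and relative links untouched is an assumption, since the text does not say how such links are treated.

```python
def adjust_link(href: str, lang: str, default_lang: str = "en") -> str:
    """Carry the request's language specifier over to a link (hypothetical helper)."""
    # Assumption: only site-relative links (starting with a single '/') are adjusted.
    if not href.startswith("/") or href.startswith("//"):
        return href
    # The default language is addressed without any language prefix.
    if lang == default_lang:
        return href
    return "/" + lang + href


print(adjust_link("/contact", "de"))   # link on a page requested under /de/
print(adjust_link("/contact", "en"))   # link on a page in the default language
```

Because the language is taken from the request URL alone, the same function gives the same result for every visitor, with or without a session.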

Automatic Machine Translations

Dissemino assumes that machine translation results in paragraphs in the target language document mapping 1:1 to paragraphs in the original language document. Other text items, such as table headings and cells, are treated likewise. This assumption may be unsound, particularly for paragraphs, but that is how things currently stand. In practice this is a string to string mapping, as paragraphs are split by the subtags within paragraph tags (please see notes on Visual Entity text).

Dissemino assigns each string occurring anywhere in the website a USL (Universal String Label), which is an id of alphanumeric and hierarchical form. During HTML generation, the USL is used to look up the actual string value in a map held by an instance of hdsLang, of which there is one for every language the website supports.

The languages you want to support are listed in the configs under a <siteLanguages> tag as follows:-

<siteLanguages default="en-US">
    <language code="en-US" name="English" native="English" flag="/img/natflag.us.png"/>
    <language code="de"    name="German"  native="Deutsch" flag="/img/natflag.de.png"/>
</siteLanguages>

The attributes of the <language> sub-tag are 'code' which is the international language code, 'name' which is the language name in the default language, 'native' which is the language name in the language itself - and 'flag' which is the URL of the flag image for presentation of a language menu. Including the Dissemino tag <xlangslct> in a page will generate the HTML to present a default language selection menu, although you may wish to formulate your own approach to this. Essentially though, the menu will request the current page, but prefix the URL with the requested language code followed by a / character. For example "/de/thispage.html".
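As a rough sketch of what a language menu has to compute, the Python below derives one URL per language for the current page. It is not the output of <xlangslct>: the names Language, strip_language and menu_entries are hypothetical, and the sketch assumes the default language never appears as a URL prefix.

```python
from typing import NamedTuple


class Language(NamedTuple):
    code: str    # international language code
    native: str  # language name in the language itself
    flag: str    # URL of the flag image


# Mirrors the <siteLanguages> example above.
LANGUAGES = [
    Language("en-US", "English", "/img/natflag.us.png"),
    Language("de", "Deutsch", "/img/natflag.de.png"),
]
DEFAULT = "en-US"


def strip_language(path):
    """Return (language code, path without any language prefix)."""
    first, _, rest = path.lstrip("/").partition("/")
    for lang in LANGUAGES:
        if first == lang.code and lang.code != DEFAULT:
            return lang.code, "/" + rest
    return DEFAULT, path


def menu_entries(current_path):
    """Return (native name, flag URL, target URL) for each supported language."""
    _, bare = strip_language(current_path)
    entries = []
    for lang in LANGUAGES:
        url = bare if lang.code == DEFAULT else "/" + lang.code + bare
        entries.append((lang.native, lang.flag, url))
    return entries


for entry in menu_entries("/thispage.html"):
    print(entry)
```

Stripping any existing prefix first means the menu works the same whether the visitor is currently on the default-language page or on a translated one.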

Currently, translation is not built into Dissemino itself. Instead, the website is put together in an original language such as en-US (American English) and then exported as text files. The exported files are written to the document root directory, as defined in the <DisseminoWebsite> tag in the configs, and are named 'website.code.txt', where 'code' is the default language code. Separate text files are produced for each article group, with filenames of the form 'artGroup.code.txt'. In these files each text string appears as a line of the form 'string_id) string' followed by a blank line. This format goes through most online translators with the strings translated but the identifiers unchanged. The translated text should then be put in files named according to the same convention, e.g. website.de.txt. Upon startup, Dissemino automatically reads these files in and populates the appropriate string map. If a language is listed as supported but the text file for it is missing, the pages will continue to appear but revert to the default language.
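The file format and the fallback to the default language can be illustrated with a short Python sketch. It is not how Dissemino itself loads the files: parse_strings, load_language and lookup are hypothetical names, the string ids are made up, and falling back string by string (rather than for the whole language) is an assumption made to keep the example small.

```python
import os


def parse_strings(text):
    """Build a map of string id to string from 'string_id) string' lines."""
    strings = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank separator line between entries
        usl, sep, value = line.partition(") ")
        if sep:  # lines without the 'id) ' prefix are ignored
            strings[usl.strip()] = value
    return strings


def load_language(doc_root, code):
    """Read website.<code>.txt from the document root; empty map if missing."""
    path = os.path.join(doc_root, "website." + code + ".txt")
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return parse_strings(fh.read())


def lookup(usl, lang_map, default_map):
    """Return the translated string, or the default-language one if absent."""
    if usl in lang_map:
        return lang_map[usl]
    return default_map[usl]


english = parse_strings("p1.4) Hello World\n\np1.5) Goodbye\n\n")
german = parse_strings("p1.4) Hallo Welt\n\n")
print(lookup("p1.4", german, english))
print(lookup("p1.5", german, english))
```

With a missing translation file the language map is simply empty, so every lookup lands on the default language and the pages still render.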

In addition to the above there is the <x> tag. This is for translation control. One important consequence of using XML to encapsulate HTML is how the tag contents are actually comprised. Technically, the contents of any tag (XML or HTML), is the sum total of everything between the tag and corresponding anti-tag. However that model poses considerable complexity. Consider the following very simple example:-

 <p>The cat <b>sat</b> on the mat</p>

What is the content of the <p> tag in this instance? It is not "The cat " followed by a sub-tag of <b> followed by " on the mat". Instead it is just the last string, " on the mat". The <p> has a sub-tag of <b> whose content is "sat" but which also has a 'pre-text' value of "The cat ". The <p> tag is rendered by concatenating the pre-text value and recursively rendered content of each sub-tag in turn, and only after the last sub-tag is the actual content of <p> added. Such is the design of the HadronZoo XML parser! It sounds messy but it solves a problem: the tags are simple to implement. Each has a pointer to its first sub-tag (child) and a pointer to the next tag (sibling), and each has two strings, pre-text and content.
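The tag model just described can be mimicked in Python as follows. This is a sketch of the idea rather than the HadronZoo parser: the Tag class and render function are hypothetical, and attributes are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tag:
    name: str
    pretext: str = ""                # text that precedes this tag inside its parent
    content: str = ""                # text that follows the last sub-tag
    child: Optional["Tag"] = None    # first sub-tag
    sibling: Optional["Tag"] = None  # next tag at the same level


def render(tag):
    """Emit each sub-tag's pre-text and rendering in turn, then the tag's own content."""
    out = "<" + tag.name + ">"
    sub = tag.child
    while sub is not None:
        out += sub.pretext + render(sub)
        sub = sub.sibling
    return out + tag.content + "</" + tag.name + ">"


bold = Tag("b", pretext="The cat ", content="sat")
para = Tag("p", content=" on the mat", child=bold)
print(render(para))  # <p>The cat <b>sat</b> on the mat</p>
```

Note that "The cat " is stored on the <b> tag, not on <p>, which is exactly the point made above.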

On top of this basic model sit a few tag-specific rules. Only certain tags may be 'text-bearing', as it were. The <td> tag is considered to be a 'text container', as is the <pre> tag and, of course, the <p> tag. A text container tag has content but no pre-text value, and may only contain sub-tags that have both a pre-text value and content but cannot themselves have sub-tags. Surprisingly, these rules are not particularly restrictive, and probably the most noticeable restriction is that you cannot put text in the <body> tag but outside of a text container tag and expect it to appear in the page.

Language support applies by default to everything within a <p> tag but not within either the <td> or <pre> tags. The role of the <x> tag is to reverse this default behavior. If you are using <pre> to display a bit of code you can do something like this:-

 Some code ;  // <x>Some comment</x>

This will allow the comment to be translated but not the code. Likewise within the <p> tag the <x> tag can be used to isolate part of the text that you wish to keep in its original language.
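The default and its reversal by <x> amount to a small truth table, sketched here in Python. The function name is_translated and the lookup table are hypothetical; only the three text container tags named above are covered.

```python
# Whether text directly inside each text container is translated by default.
TRANSLATED_BY_DEFAULT = {"p": True, "td": False, "pre": False}


def is_translated(container, inside_x):
    """Apply the container's default, reversed when the text sits inside <x>."""
    default = TRANSLATED_BY_DEFAULT[container]
    return (not default) if inside_x else default


print(is_translated("pre", inside_x=True))   # comment wrapped in <x> within <pre>
print(is_translated("p", inside_x=True))     # text kept in its original language
```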