Multilingual Web Site - WWW2006




  • 1 Introduction
  • 2 Background
    • 2.1 Transparent Content Negotiation (TCN)
    • 2.2 Translation
    • 2.3 Authorship, Translation and Publication chain (ATP-chain)
    • 2.4 Multilingual parallel text
      • 2.4.1 Source and target languages
      • 2.4.2 Dimensions
  • 3 MWS aspects
    • 3.1 User side
    • 3.2 Webmaster side
      • 3.2.1 General approach
        • Input
        • Output
      • 3.2.2 Skeleton
      • 3.2.3 Language table
        • Language key
        • Language value
      • 3.2.4 ATP-chain in MWS
      • 3.2.5 Generating techniques
      • 3.2.6 Multilingual Web Content Management System
      • 3.2.7 Ampersand page
      • 3.2.8 Language neutral URIs
      • 3.2.9 User translation request
  • 4 Next steps
  • 5 Author


Introduction

A Multilingual Web Site (MWS) is a site that contains multilingual parallel texts; i.e., texts that are translations of each other. For example, most sites of the European institutions, such as Europa, are MWS.

The objective is to specify a comprehensive open architecture (and not just an application) that allows the creation of high-quality, low-cost MWS. Many existing applications have some multilingual facilities and (stating the obvious) one should harvest the best techniques around.

Comprehensive here means one integrated architecture that addresses the whole Authorship, Translation and Publication chain (ATP-chain).

MWS are of great practical relevance: they are often very important portals with many hits, and they are very complex and costly to create and maintain.

This document is a position paper for the Multilingual Web Site BOF at WWW2006.




Background

Transparent Content Negotiation (TCN)

One URI can have several variants. For example, http://mysite/doc could have:

  • English in HTML
  • English in PDF
  • Spanish in HTML
  • Spanish in PDF


  • Variant list: list of available variants
  • Language variant list: list of available linguistic versions; a subset of the variant list

Some of the HTTP header fields involved are:

  • Accept-Language
  • Content-Language
  • Alternates
  • Referer

In addition to language and format (MIME type), there are other dimensions. For details (and strict definitions), see RFC 2295, Transparent Content Negotiation in HTTP.

Often, servers do not return the variant list. For example, Apache seems to return the variant list only with 406 Not Acceptable. One can make Apache always return the variant list (in the Alternates header) by changing only one line in the source code and recompiling it (thanks to K. Holtman for pointing out the line). But the requirement is for servers to be configurable to return the variant list or subsets of it. For example:

  • VariantList All
  • VariantList Language
  • VariantList Language MediaType

Note that these directives do not exist in Apache. They are just an example and a proposal.
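A minimal sketch of how such a setting might select what the server advertises. The `Variant` class, the `variant_list` function and the interpretation of the setting string are assumptions made for illustration only; as stated above, no such directive exists in Apache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    language: str    # e.g. "en"
    media_type: str  # e.g. "text/html"


# The four variants of http://mysite/doc used as the example in this section.
VARIANTS = [
    Variant("en", "text/html"),
    Variant("en", "application/pdf"),
    Variant("es", "text/html"),
    Variant("es", "application/pdf"),
]


def variant_list(variants, setting):
    """Return the rows to advertise for a hypothetical VariantList setting.

    setting is "All" or a space-separated list of dimensions,
    e.g. "Language" or "Language MediaType".
    """
    getters = {
        "Language": lambda v: v.language,
        "MediaType": lambda v: v.media_type,
    }
    dims = setting.split()
    if dims == ["All"]:
        dims = list(getters)  # every dimension known to this sketch
    rows = []
    for v in variants:
        row = tuple(getters[d](v) for d in dims)
        if row not in rows:  # drop duplicates, keep first-seen order
            rows.append(row)
    return rows
```

With the setting `Language`, the sketch returns only the language variant list, here `en` and `es`; with `All` it returns every language and media type combination.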



Translation

The greatest cost of an MWS is translating:

  • Original pages
  • Maintenance; in particular, linguistic segments

The public expects web sites to be up to date; errors are expected to be corrected immediately. This is very different from paper publications, where the public expects errors to be corrected in the next edition.

Hence, one often has to translate many linguistic segments; a costly business, as there is a fixed overhead for each translation request, independently of its size. Indeed, most translation services are geared to the translation of full documents.


Authorship, Translation and Publication chain (ATP-chain)

The ATP-chain is the cycle for multilingual publishing. Traditionally it was a one-way path:

  • Authorship: The author writes the source material
  • Translation: The translator(s) translate(s) into the target language(s)
  • Publication: The typographer composes the publication

For non-literary materials, nowadays this chain can be two-way; e.g., the translator could send the source material back to the author with change requests to facilitate the translation. Also, markup is applied from the beginning to automate the whole process.


Multilingual parallel text

Multilingual parallel texts are translations of each other. For example, the Treaty of Rome in 22 languages.


Source and target languages

The most common case is that the author writes in one source language and the text is translated into other target languages. But it is not rare to have multilingual sources; e.g., a document with three chapters, each written in a different language. Indeed, in the case of MWS this is quite common.

From a legal point of view, one can have multilingual parallel texts where all the linguistic versions are considered source languages.



Dimensions

Multilingual parallel texts have several dimensions:

  • Completeness: full translation, partial translation (ongoing translation), summary, etc.
  • Alignment: aligned at document level, paragraph level, term or even word level.
  • Resource: human, machine and anything in between.

Each of these dimensions should be considered a continuum between two extremes.


MWS aspects

The two main aspects are:

  • User side
  • Webmaster side

User side

The user should have at least the following facilities:

  • The best language variant with TCN
  • A mechanism to access all the language variants
  • If none of the requested languages is available, an automatic offer of the available language variants with a selection mechanism

The access mechanism to the language variants can be:

  • Browser side: a language button in the browser (in the same row as File, Edit, etc.), enabled when other language variants are available. When none of the requested language variants is available, it will be enabled even for one language variant.
  • Server side: a language link that, when followed, will trigger the server to generate an HTML page with the language variants, if any. When none of the requested language variants is available, it will be triggered automatically.
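The server-side behaviour can be sketched as follows: pick the best language variant from the Accept-Language header and, when none of the requested languages is available, fall back to offering the available language variants. The function names and the simplified matching rule (exact match, or the requested range as a prefix of the available tag) are assumptions for illustration and not a full TCN implementation.

```python
def parse_accept_language(header):
    """Return (language range, q) pairs from an Accept-Language value, best first."""
    prefs = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        q = 1.0  # default weight when no q parameter is given
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag, q))
    # The sort is stable, so ranges with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_language(header, available):
    """Return (chosen, offer).

    chosen is the best available language, or None when nothing matches;
    offer is then the list of language variants to present to the user.
    """
    for language_range, q in parse_accept_language(header):
        if q <= 0:
            continue
        for lang in available:
            tag = lang.lower()
            if (language_range == "*" or tag == language_range
                    or tag.startswith(language_range + "-")):
                return lang, []
    return None, list(available)
```

For example, `choose_language("fr;q=0.9, es;q=0.8, en;q=0.5", ["en", "es"])` selects `es`, while `choose_language("de, fr", ["en", "es"])` selects nothing and returns both available languages as the offer.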

Webmaster side

In this context, Webmaster refers to all aspects of the construction of an MWS: authoring, translation, etc.


General approach

The general approach should be to generate all the linguistic versions in parallel. It is based on the following two components:

  • Skeleton
  • Language table

The intention is to replace each language key in the skeleton by its language value from the language table.





Input

Skeleton

<html lang="&lang;">
   <p>&hello;</p>

Language table (as two text files)

hello=Hello world
hello=Hola mundo

Output

<html lang="en">
   <p>Hello world</p>

<html lang="es">
   <p>Hola mundo</p>


Skeleton

This construction is format (MIME type) dependent; e.g., it can be done in HTML and XML, but it might not be possible in other formats.
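A minimal sketch of the replacement step for an HTML skeleton, assuming that language keys are written in entity-like form (`&key;`) and that `&lang;` stands for the language code. The `generate` function, the regular expression and the in-memory tables are illustrative assumptions, not a prescribed implementation.

```python
import re

# Skeleton with two language keys: &lang; and &hello;
SKELETON = '<html lang="&lang;">\n   <p>&hello;</p>\n'

# Language table: one mapping per language (in-memory stand-in for the text files).
TABLES = {
    "en": {"hello": "Hello world"},
    "es": {"hello": "Hola mundo"},
}

KEY_PATTERN = re.compile(r"&([A-Za-z][A-Za-z0-9_.-]*);")


def generate(skeleton, lang, table):
    """Replace each language key in the skeleton by its language value."""
    def replace(match):
        key = match.group(1)
        if key == "lang":
            return lang
        # Keys missing from the table (e.g. real entities such as &amp;)
        # are left untouched.
        return table.get(key, match.group(0))
    return KEY_PATTERN.sub(replace, skeleton)


# Generate all the linguistic versions in parallel from the same skeleton.
pages = {lang: generate(SKELETON, lang, table) for lang, table in TABLES.items()}
```

Here `pages["es"]` contains the Spanish page with `lang="es"` and the text "Hola mundo"; adding a language only requires adding one more table.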


Language table

Given the pair (language key, language), one must obtain the corresponding language value.

The language table can be implemented in at least the following ways:

  • Text files: one file per language; e.g., the line k1 in
  • URI: e.g., http://es.mysite/k1 or http://mysite/es/k1
  • Database
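As an illustration of the text file option, the sketch below parses one `key=value` file per language and resolves the pair (language key, language) to a language value. The function names, the treatment of `#` lines as comments and the in-memory file contents are assumptions for the example.

```python
def parse_table(text):
    """Parse the content of one language file made of key=value lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without "="
            table[key.strip()] = value.strip()
    return table


def language_value(tables, key, lang):
    """Given the pair (language key, language), return the language value."""
    return tables[lang][key]


# One (hypothetical) file content per language.
tables = {
    "en": parse_table("hello=Hello world\n"),
    "es": parse_table("hello=Hola mundo\n"),
}
```

A URI-based or database-backed table would only need to replace `language_value`; the rest of the generation process can stay the same.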

Language key

The language key is a unique identifier. The language keys in a skeleton could be abbreviated; e.g., http://es.mysite/k1 could be abbreviated to just k1.

The HTML generator program must know how to compose the full key; e.g., from a parameter or a meta declaration in the skeleton.
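One possible way to compose the full key is a template supplied as a parameter. The `full_key` function and the template syntax are assumptions for illustration; the two templates correspond to the two URI styles mentioned earlier.

```python
def full_key(key, lang, template="http://{lang}.mysite/{key}"):
    """Expand an abbreviated language key into a full key for one language.

    Keys that already look like full URIs are returned unchanged.
    """
    if "://" in key:
        return key
    return template.format(lang=lang, key=key)
```

For example, `full_key("k1", "es")` gives `http://es.mysite/k1`, and `full_key("k1", "es", "http://mysite/{lang}/{key}")` gives `http://mysite/es/k1`.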


Language value

A language value is whatever is pointed to by a language key. Typically it is a text phrase, but it could be in any format; e.g., a sound file.

In this context, phrase does not have any grammatical connotation. In the case of text, one can think of it as a string.


ATP-chain in MWS

  • The author produces the skeleton and the source language values
  • The translator produces the other language values
  • A program generates the HTML pages

Generating techniques

Generation could be:

  • Internal to the server
    • Dynamic: pages are generated when they are requested
    • Cached: pages are generated the first time they are requested and kept until stale; for example, because one of the entities has changed
  • Separate program
    • Batch; i.e., all in one go
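The "generate the first time requested and keep until stale" option can be sketched as a cache keyed by page and language, where a fingerprint of the skeleton and the language table decides staleness. The `PageCache` class, the `render` helper and the use of a SHA-256 fingerprint are assumptions for the example.

```python
import hashlib


def render(skeleton, lang, table):
    """Very small generator: replace &lang; and each &key; by its value."""
    page = skeleton.replace("&lang;", lang)
    for key, value in table.items():
        page = page.replace(f"&{key};", value)
    return page


class PageCache:
    """Generate a page on first request and keep it until its inputs change."""

    def __init__(self, generate):
        self._generate = generate
        self._cache = {}   # (page name, language) -> (fingerprint, page)
        self.builds = 0    # number of actual generations, for inspection

    @staticmethod
    def _fingerprint(skeleton, table):
        digest = hashlib.sha256(skeleton.encode("utf-8"))
        for key in sorted(table):
            digest.update(f"\0{key}\0{table[key]}".encode("utf-8"))
        return digest.hexdigest()

    def get(self, name, lang, skeleton, table):
        fingerprint = self._fingerprint(skeleton, table)
        cached = self._cache.get((name, lang))
        if cached and cached[0] == fingerprint:
            return cached[1]  # still fresh
        page = self._generate(skeleton, lang, table)  # first request or stale
        self.builds += 1
        self._cache[(name, lang)] = (fingerprint, page)
        return page
```

A batch program would instead loop over all pages and languages and call the generator once for each, without any cache.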

Multilingual Web Content Management System



Ampersand page

The language link can be generalized as the ampersand page. In addition to the language variants, the server could also return, in a well-presented form:

  • Other variants
  • Metadata, in particular copyrights
  • The Accept-Language sent to the server


Ampersand page for http://mysite/doc
This page available in: English, Spanish, French
Copyleft for http://mysite/doc: Somebody
Copyright for the site http://mysite: see http://mysite/copyright
Your preferred languages: English, Spanish

The language of the ampersand page will be as per the TCN. If the server does not know any of the languages requested, there is an intermediate step with a menu, so the user chooses one of the languages available for the ampersand page.

The ampersand link (the href attribute) should be the same for all pages. For example, http://mysite/amp. One can use the Referer header field to generate the page. Having different ampersand links for each page would make it much harder.
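A minimal sketch of how a single ampersand link could work: the handler reads the Referer and Accept-Language header fields and builds the page text for the referring URI. The `ampersand_page` function, the `LANGUAGE_NAMES` table and the `variants` mapping are hypothetical; a real server would obtain the language variants from its own variant list.

```python
# Display names for a few language codes (illustrative subset).
LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "fr": "French"}


def ampersand_page(referer, variants, accept_language):
    """Build the text of the ampersand page for the page named by Referer.

    variants maps a URI to the list of its available language codes.
    """
    def names(codes):
        return ", ".join(LANGUAGE_NAMES.get(code, code) for code in codes)

    available = variants.get(referer, [])
    # Keep only the language tags, dropping q parameters.
    preferred = [part.split(";")[0].strip()
                 for part in accept_language.split(",") if part.strip()]
    return "\n".join([
        f"Ampersand page for {referer}",
        f"This page available in: {names(available)}",
        f"Your preferred languages: {names(preferred)}",
    ])
```

Because the referring URI arrives in the request, every page can carry the same link, e.g. http://mysite/amp, and the handler still knows which page to describe.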


Language neutral URIs

e.g., numbers



User translation request

A mechanism that allows users to request translations.



Next steps

  • Write a report from the BOF
  • Start a working group
  • It could be within an existing organization; e.g., W3C


Author

M.T. Carrasco Benitez

Disclaimer: I only speak for myself.
