Thursday, March 15, 2012

EU's Machine Translation: Moving from Rules to Statistics

A MultilingualWeb workshop on the topic "The Way Ahead" takes place on 15 and 16 March 2012 in Luxembourg, organized by W3C and chaired by Richard Ishida. The MultilingualWeb project explores standards and best practices that support the creation, localization and use of multilingual web-based information. We have already written about the first MultilingualWeb workshop.

The current workshop features many interesting presentations that cover different aspects of the field. They are given by representatives of Microsoft, Intel, the Wikimedia Foundation, Joomla, Mozilla and many others.

A remarkable presentation was given by Spyridon Pilos from the Directorate General for Translation of the European Commission. He described the new machine translation service of the European Commission. The notable aspect here was the strong commitment to the data-driven approach. This development was already hinted at in 2007 by Juhani Lönnroth, the former Director General of the Directorate General for Translation. The commitment means that the European Commission is abandoning its old rule-based system, which was based on Systran. Pilos said that the data-driven approach makes it easier to make the best use of the Commission's language resources and its internal linguistic expertise (1700 translators for 23 languages).

Pilos described the overall framework, emphasizing its flexibility and openness, which includes the possibility of incorporating components from external partners. He also stressed the importance of data: MT systems cannot be trained without it. He noted that, for the moment, many useful resources on the web are in a form that cannot be accessed in a standard manner, and he called for developments in this area.

A historical remark on European innovative capacity can be made here. Google used Systran until about 2007. An important development for Google was Franz Och's move to work there. Och's roots are strongly European: he received his PhD in Computer Science in 2002 at the Technical University of Aachen (RWTH), Germany. An important figure behind these developments has been Prof. Hermann Ney. Philipp Koehn also deserves to be mentioned in this context. Among other things, Koehn and his team originated the Moses environment, which is used as the core component of the Commission's new translation system.

A personal note can also be made here. My own interest in using data-driven methods for natural language processing began at the end of the 1980s. The first publication, in 1991, already mentioned machine translation as a potential application, but my own research then concentrated on information retrieval, where the Websom method for visual information organization and retrieval became a well-known result. Since 1996, similar ideas on data-driven modeling and visual exploration of semantic space have recurred through several scientific generations. From a linguistic point of view, the use of independent component analysis (and the related non-negative matrix factorization) may prove to be an interesting possibility for automatic resource construction.

