Friday, March 22, 2013

Pylkkönen: Towards Efficient and Robust Automatic Speech Recognition

Janne Pylkkönen is defending his PhD thesis "Towards Efficient and Robust Automatic Speech Recognition: Decoding Techniques and Discriminative Training" on Friday, 22nd of March, 2013 at Aalto University School of Science. The research has been conducted in a research group focusing on speech technology, led by Prof. Mikko Kurimo. Dr. Erik McDermott, Google, is serving as the opponent and Prof. Erkki Oja as the custos. Research on speech recognition has long roots in Otaniemi: Academician Teuvo Kohonen was already conducting active research in this area in the early 1980s and developed the famous neural phonetic typewriter with his team.

Pylkkönen's thesis presents methods for decoding and acoustic modeling in large vocabulary continuous speech recognition, with two main contributions. First, he has developed a large vocabulary decoder that is especially suitable for morphologically rich languages. Second, he has improved discriminative training of acoustic models to increase their robustness. The thesis also includes a theoretical analysis of discriminative training where the extended Baum-Welch algorithm is formulated as a constrained optimization method. The methods have been tested using a speech recognition system that has been developed over the years with contributions from a number of researchers.

In the lectio precursoria, Pylkkönen described the background and motivation of his research on speaker-independent large vocabulary continuous speech recognition. He introduced the main ideas of acoustic modeling, language modeling and decoding, which combines information from the speech signal with language statistics. Pylkkönen gave an example of the complexity of the task. When the system "knows" 20,000 morphs (word segments), 24 phonemes, 1500 phoneme models and 40,000 Gaussian components, there are 3,200,000 parameters to be estimated.

In his opening statement, Erik McDermott first mentioned that he works in the speech division at Google, which develops the speech recognition capabilities of Android phones. He noted that Android has speech recognition for 40 languages, including Finnish. He reminded the audience of the challenges related to speech recognition. There are still fundamental problems with the technology, and therefore active research is still needed. McDermott recognized the important contributions by Teuvo Kohonen and Erkki Oja in providing what he called an organic view of pattern recognition systems. In essence, this refers to the contributions related to unsupervised machine learning, where systems improve over time based on a data-driven approach.

In the discussion, several methodological themes were considered in detail, including decoding techniques, the potential use of finite-state transducers, pruning techniques, discriminative training, maximum likelihood modeling and Gaussian mixtures.

3 comments:

Janne H. said...

How do Pylkkönen's work and other academic advances compare with the technologies Google has to offer? To what extent does academia offer applicable results? Did McDermott say anything that would reflect this?

Unknown said...

This is a good question to which there are many answers. The system developed in Otaniemi is of very high quality, and in that sense Google does not have much more to offer methodologically. The most important differences seem to be that Google has huge text collections for training its language models and can put more resources into the practical implementation of its systems. Therefore, it can, for example, provide systems for a great many languages.

The majority of the core methodologies in speech recognition were originally developed in universities. Large companies like Google have, of course, the chance to hire high-level experts in the area and thus also have very high-level research and development. However, at the level of ideas there is no gap and, moreover, the academic research community as a whole can cover a wider scope of ideas.

Unknown said...

Machine translation is another interesting area from this point of view. In 2004, Google hired a talented researcher, Franz Josef Och, who had defended his PhD thesis in 2002 at Technical University of Aachen, Germany. It seems that Och benefited greatly from the knowledge that was available in Aachen. Aachen remains one of the leading universities in the world in the area of statistical machine translation (SMT), along with the University of Edinburgh, where Philipp Koehn is the acknowledged mastermind in SMT. In Aachen, Professor Hermann Ney is the leading figure whose publications with Och, Kneser and others belong to the core knowledge in this area. Google has, in essence, relied on expertise much of which is of European origin. This pattern is a challenge for European enterprises to consider.