Sami Virpioja defended his comprehensive dissertation "Learning Constructions of Natural Language: Statistical Models and Evaluations" at the Aalto University Department of Information and Computer Science. In his thesis, Virpioja studies the problem of lexical unit selection for the automatic processing of text. He proposes unsupervised and semi-supervised statistical methods in place of simple heuristic or grammatical rule-based methods.
The work is based on the previously developed unsupervised Morfessor method, which learns to segment words into surface morphemes (morphs) based solely on the statistical regularities found in a text corpus. In Virpioja's thesis, the discovered morphs are shown to improve applications such as automatic speech recognition and statistical machine translation.
Virpioja has extended the Morfessor method to handle allomorphic variation, allowing it to model the morphological relations between morphs. He has also developed a minimally semi-supervised variant of the original method that takes a very small number of manually segmented words as additional input and finds morphs that better match a known linguistic segmentation. In addition, Virpioja has developed methods for evaluating the match between the discovered morphs and linguistic morphemes. The same problem of finding relationships between features in multidimensional data was solved with canonical correlation analysis (CCA), which leverages an existing bilingual corpus to evaluate learned semantic vector spaces for documents.
Prof. Brian Roark (Oregon Health & Science University) and Doc. Krister Lindén (University of Helsinki) served as the opponents and provided expertise on both the computational and linguistic sides of the dissertation. The questions ranged from possible extensions and applications of the work to philosophical ruminations on linguistic theory. In their final statement, they thanked the candidate for his excellent work, which covered experiments both in vivo and in vitro.