Mari-Sanna Paukkeri defends her dissertation "Language- and domain-independent text mining". In her dissertation for the Aalto University Department of Information and Computer Science, Paukkeri has studied how textual data can be processed and analysed automatically with machine learning methods. She has developed computational methods for text processing independent of language or domain.
Paukkeri considers fully automatic methods, language independence and subjectivity in several natural language processing tasks. A fully automatic, language-independent approach to keyphrase extraction called Likey is presented, and its performance is demonstrated on 11 European languages, including English and Finnish.
The thesis proposes an approach for learning taxonomies from encyclopedia documents. The work is an early step towards automating the construction of ontologies and making ontologies more applicable to multilingual settings.
In the work on lexical choice, machine learning methods are applied to as large a collection of linguistic features as possible, to study how these features contribute to the learning task.
Paukkeri's thesis also studies the feature extraction step in text mining. It analyzes the effect of different dimensionality reduction methods, normalization methods and distance measures on document clustering, and proposes an evaluation method for feature extraction (or document representation). To show how language-independent these methods are, the experiments are run on several languages from different language families.
The article "Assessing user-specific difficulty of documents" has been published in the prestigious Information Processing & Management journal.
Jussi Karlgren (Gavagai AB and KTH, Stockholm) served as opponent. Drawing on his long experience in the field, Karlgren asked questions that ranged from fundamental methodological issues to views on future research possibilities. In his final statement, he thanked the candidate for her solid work in an important area of research.