Friday, November 09, 2012

Paukkeri: Language- and domain-independent text mining

Mari-Sanna Paukkeri defends her dissertation "Language- and domain-independent text mining". In her dissertation for the Aalto University Department of Information and Computer Science, Paukkeri has studied how textual data can be processed and analysed automatically with machine learning methods. She has developed computational methods for text processing independent of language or domain.

Paukkeri considers fully automatic methods, language independence and subjectivity in several natural language processing tasks. A fully automatic and language-independent approach for keyphrase extraction called Likey is presented and its performance is shown for 11 European languages, including English and Finnish.

In the thesis, an approach for learning taxonomies from encyclopedia documents is proposed. The work is an early step towards automating the construction of ontologies and making ontologies more applicable to multilingual settings.

In the work related to lexical choice, machine learning methods are applied to as large a collection of linguistic features as possible, in order to study how much each kind of feature helps in the learning task.

In Paukkeri's thesis, the feature extraction step in text mining is studied by analyzing the effect of different dimensionality reduction methods, normalization schemes and distance measures on the task of document clustering, and by proposing an evaluation method for feature extraction (or document representation). To further show the level of language independence of these methods, the experiments are run with several languages from different language families.
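To make the three choices mentioned above concrete, here is a minimal sketch of a document representation pipeline: a weighting and normalization step, a dimensionality reduction step and a distance measure. It is only an illustration and not the setup used in the thesis. The tf-idf weighting, the Gaussian random projection, the cosine distance, the toy documents and all function names are assumptions chosen for the example.

```python
import math
import random
from collections import Counter


def tf_idf(docs):
    """Map tokenized documents to L2-normalized tf-idf vectors (assumed weighting)."""
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))  # document frequency per word
    n = len(docs)
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[w] * math.log((1 + n) / (1 + df[w])) for w in vocab]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v] if norm > 0 else v)
    return vocab, vectors


def random_projection(vectors, k, seed=0):
    """Project vectors to k dimensions with a seeded Gaussian random matrix."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(dim)]
              for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, v)) for row in matrix]
            for v in vectors]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)


# Toy corpus (hypothetical): two related documents and one unrelated.
docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock markets fell sharply today".split(),
]
vocab, full = tf_idf(docs)
reduced = random_projection(full, k=5)

d_related = cosine_distance(full[0], full[1])
d_unrelated = cosine_distance(full[0], full[2])
```

Swapping any one of the three functions for an alternative (for example a different normalization or distance) and comparing the resulting clusterings is the kind of experiment the paragraph above describes.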

The third main theme of the thesis, subjectivity of language use, is considered specifically in the task of assessing the difficulty of a text. A novel approach is proposed in which the difficulty assessment is done separately for each user. In contrast to traditional readability measures, the proposed method is intended for finding suitable documents for adults whose areas of expertise vary. The article on this topic, "Assessing user-specific difficulty of documents", has been published in the prestigious Information Processing & Management journal.

Jussi Karlgren (Gavagai AB and KTH, Stockholm) served as an opponent. Karlgren's questions ranged from fundamental methodological issues to views on future research possibilities, based on his long experience in this field. In the final statement, he thanked the defender for her solid work in an important area of research.

1 comment:

Vahid Moosavi said...

Hi dear Timo,
This seems to be interesting research, and I would like to read the full text in detail.
But at the same time, I am wondering why there are no more publications or active research on WEBSOM.
In my opinion, it is a unique way to keep not just the frequencies of words in the text but their semantic relations as well. To my knowledge, WEBSOM is the only work that does not go rapidly to statistics and works with probabilistic networks.
I am really amazed by the idea, and I found that WEBSOM can be described as a Markovian-SOM in which the first layer acts as a compressor and filter based on a Markov chain of word sequences. I think this combination can be applied to any network-based representation of real phenomena (e.g. people's movements in city networks).
Further, it can be optimized with a target-based SOM in the second layer for classification tasks like sentiment analysis.
I also had a question about the performance of the random projection method compared to other dimensionality reduction methods.
If I am not too lazy to write it, I will post a technical explanation of WEBSOM as a Markovian-SOM.