Jefrey Lijffijt successfully defended his thesis, "Computational methods for comparison and exploration of event sequences", yesterday, 16 December. In his thesis work, Lijffijt developed computationally efficient methods for comparing and exploring event sequences such as natural language texts, DNA sequences or sensor data. Central terms in the thesis are burstiness and dispersion, both of which measure the variability of the frequency of an event. An event that is bursty, or that has low dispersion, tends to be frequent in some parts of an event sequence and infrequent in all other parts.
Lijffijt and his colleagues, both linguists and machine learning specialists, have applied the methods developed in the thesis to data sets from different domains. The text corpora include the British National Corpus, the Corpus of Early English Correspondence, and the novel "Pride and Prejudice" by Jane Austen. Other kinds of event sequences studied are the spatial occurrence patterns of nucleotides and dinucleotides in the human reference genomes, and train sensor time series from the Hollandse Brug, a bridge in the Netherlands.
To model the contextual behaviour of words, Lijffijt considers their spatial distribution throughout texts. The primary unit used in modelling is the inter-arrival time, the interval between two consecutive occurrences of a word in a text. Bursty words tend to exhibit long inter-arrival times followed by short inter-arrival times, while the inter-arrival times of non-bursty words have smaller variance. One of the case studies tested whether there are linguistic differences between fiction prose written by male and by female authors. The results indicated, consistent with earlier research, that male-authored fiction is dominated by frequent use of noun-related forms, while female-authored fiction is more verb-oriented. Moreover, the personal pronouns overrepresented in male-authored texts are the first-person plural forms "us" and "we" and the third-person forms "its", "their", and "they", while female-authored texts overuse the second-person forms "you" and "your", which can have singular and plural referents. One important methodological conclusion was that the choice of statistical test matters both in theory and in practice.
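The inter-arrival-time view of burstiness described above can be illustrated with a small Python sketch. It is not the method developed in the thesis. The function names, the toy token sequences and the choice of the coefficient of variation as a summary of gap variability are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def inter_arrival_times(tokens, word):
    """Return the gaps (in tokens) between consecutive occurrences of `word`."""
    positions = [i for i, token in enumerate(tokens) if token == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def coefficient_of_variation(gaps):
    """Population standard deviation of the gaps divided by their mean.

    Illustrative summary only (an assumption, not the thesis's measure):
    higher values mean the gaps vary more, i.e. the word looks burstier.
    """
    if len(gaps) < 2:
        raise ValueError("need at least two inter-arrival times")
    return pstdev(gaps) / mean(gaps)


# Two made-up sequences of 24 tokens, each containing "x" six times.
evenly_spread = ["x", "a", "a", "a"] * 6
bursty = ["x", "x", "x"] + ["a"] * 18 + ["x", "x", "x"]

even_gaps = inter_arrival_times(evenly_spread, "x")    # [4, 4, 4, 4, 4]
bursty_gaps = inter_arrival_times(bursty, "x")         # [1, 1, 19, 1, 1]

print(coefficient_of_variation(even_gaps))
print(coefficient_of_variation(bursty_gaps))
```

Both toy sequences contain the word equally often, so a plain frequency count cannot tell them apart. Only the spread of the gaps differs: it is zero for the evenly spaced word and large for the clustered one.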