Thursday, November 24, 2011

Biological Evolution and the Diversification of Languages

BEDLAN (Biological Evolution and the Diversification of Languages) is a project funded by the Kone Foundation. The background of the project and its main objectives were presented by its director, Prof. Urho Määttä from the University of Tampere. The project is a collaboration between the universities of Tampere, Turku and Helsinki and the Research Institute for the Languages of Finland. The presentation took place at a meeting organized by the Society for the Study of Finnish.

BEDLAN conducts research in two main areas:
  • development of dialects ("microevolution")
  • development of languages ("macroevolution")
The project is interdisciplinary, with researchers from linguistics, biology and philosophy. Its methodology combines methods for modeling complex dynamical systems (stemming from population genetics) with methods of historical linguistics.

Kaj Syrjänen was unable to attend the meeting, but a detailed description of the research results was given by his collaborators Jyri Lehtonen and Terhi Honkonen. Lehtonen introduced the Uralic vocabulary data collection used in the project. The data covers 17 languages (e.g. Finnish, Sami, Estonian, Komi, Udmurt, Hungarian, Mordvin, Mansi, Khanty, Livonian, Tundra Nenets, Karelian and Veps) and records the connections between lexical items in these languages. Etymological dictionaries were used to analyze the historical connections. The number of words was 226. Examples of words include "meat", "moon" and "mother". These are in Finnish "liha", "kuu" and "äiti", in Karelian "liha", "kuu" and "emä", and in Veps "liha", "ku" and "mam".

Terhi Honkonen gave a presentation on the computational analysis of the data. In introducing the methodology, she referred to McMahon and McMahon (2005): "Language Classification by Numbers" (Oxford), and Atkinson and Gray (2005): "Curious parallels and curious connections - Phylogenetic thinking in biology and historical linguistics" (Systematic Biology, 54:513-526). The method used was Bayesian phylogenetic analysis (using a program called MrBayes).
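Bayesian phylogenetic programs are often fed lexical data as a binary presence/absence matrix, with one column per cognate class of each meaning. The talk summary does not say how the project coded its data, so the following is only a minimal sketch of that common approach. The cognate class labels (A, B, C) are assumptions made up for illustration using the three example words above; they are not the project's etymological judgments.

```python
# Illustrative cognate judgments: meaning -> {language: cognate class label}.
# The class labels are assumptions for demonstration, not project data.
COGNATES = {
    "meat":   {"Finnish": "A", "Karelian": "A", "Veps": "A"},
    "moon":   {"Finnish": "A", "Karelian": "A", "Veps": "A"},
    "mother": {"Finnish": "A", "Karelian": "B", "Veps": "C"},
}


def binary_matrix(cognates):
    """Code cognate classes as 0/1 characters.

    Returns the list of (meaning, class) columns and a dict mapping each
    language to a string of '0'/'1' values, one per column.
    """
    characters = sorted(
        {(meaning, cls) for meaning, langs in cognates.items() for cls in langs.values()}
    )
    languages = sorted({lang for langs in cognates.values() for lang in langs})
    rows = {
        # A language missing a meaning is coded '0' here for simplicity;
        # a real analysis would more likely mark it as unknown.
        lang: "".join(
            "1" if cognates[meaning].get(lang) == cls else "0"
            for meaning, cls in characters
        )
        for lang in languages
    }
    return characters, rows


characters, rows = binary_matrix(COGNATES)
for lang, row in rows.items():
    print(f"{lang:10s} {row}")
```

With these toy labels the columns are meat/A, moon/A, mother/A, mother/B and mother/C, so Finnish is coded as 11100, Karelian as 11010 and Veps as 11001.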

A major strength of this kind of research is that conclusions about the relationships between languages and dialects are based on patterns across the vocabulary rather than on individual words.

Jyri Lehtonen continued by presenting research that uses network analysis methods. First he discussed the differences between tree-based models and network models (e.g. Heggarty et al. 2010) and showed results of network analysis on Uralic languages. The network analysis divided the languages into the groups of Baltic-Finnic, Saami, Samoyedic, Ugric and Permic languages. Meadow Mari and Mordvin were not clearly connected with any of these groups. Lehtonen mentioned the classical lexico-statistical research by Swadesh in the 1950s and continued by presenting research on the effect of using more or less central vocabulary. Usually central vocabulary is used, because its lexical items are typically stable and morphologically simple. In the Loanword Typology Project, 1400 meanings are considered. This has further led to the Leipzig-Jakarta list, which includes the 100 "most central" meanings. Lehtonen argued that less central vocabulary may help in detecting language connections in a fine-grained manner.
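The lexico-statistical idea mentioned above can be illustrated with a small sketch: for each pair of languages, count the share of meanings whose words fall into different cognate classes. Distance tables of this kind are a typical starting point for distance-based tree and network methods, although the talk summary does not say which method or distance the project used. The cognate labels and the function name lexical_distance are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Illustrative cognate judgments: meaning -> {language: cognate class label}.
# The class labels are assumptions for demonstration, not project data.
COGNATES = {
    "meat":   {"Finnish": "A", "Karelian": "A", "Veps": "A"},
    "moon":   {"Finnish": "A", "Karelian": "A", "Veps": "A"},
    "mother": {"Finnish": "A", "Karelian": "B", "Veps": "C"},
}


def lexical_distance(cognates, lang_a, lang_b):
    """Share of meanings (attested in both languages) with different cognate classes."""
    shared = [m for m, langs in cognates.items() if lang_a in langs and lang_b in langs]
    if not shared:
        raise ValueError(f"no meanings attested in both {lang_a} and {lang_b}")
    differing = sum(1 for m in shared if cognates[m][lang_a] != cognates[m][lang_b])
    return differing / len(shared)


languages = sorted({lang for langs in COGNATES.values() for lang in langs})
distances = {
    (a, b): lexical_distance(COGNATES, a, b) for a, b in combinations(languages, 2)
}
for (a, b), d in distances.items():
    print(f"{a}-{b}: {d:.2f}")
```

With only three meanings every pair differs in exactly one of them, so the toy data cannot separate the languages. This is a small illustration of why a list of a couple of hundred meanings, and possibly less central vocabulary, is needed for fine-grained distinctions.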

Lehtonen's presentation inspired thoughts about the future of scientific representation and the role of animations in it. In this case, an animation showing how the network structure develops could be useful.

At the end of the meeting, Terhi Honkonen presented research results on the timing of the divergence of languages. The analysis used BEAST (Bayesian Evolutionary Analysis Sampling Trees), a program originally developed for Bayesian MCMC analysis of molecular sequences.
