C.M. Downey

PhD Student, University of Washington Linguistics

Low-Resource NLP, Unsupervised Morphosyntax, Typology, Endangered Language Revitalization

About Me

I am currently a PhD student in Computational Linguistics in the University of Washington Department of Linguistics, advised by Dr. Fei Xia. My research focuses on NLP at roughly the morphosyntactic level, including unsupervised morpheme segmentation, syntactic parsing, and POS tagging. I am also very interested in applying NLP to low-resource and endangered languages, and more generally to the task of language preservation and revitalization.

My languages of research are most often North American Indigenous varieties, including Lakota (Lakȟótiyapi) and Inuktitut. In my typological studies, I also often work with highly inflecting Slavic and Uralic languages such as Russian, Hungarian, and Saami.

For the 2019-2020 academic year, my research will be funded by a Foreign Language and Area Studies (FLAS) fellowship from the Henry M. Jackson School of International Studies, for research and coursework involving Inuktitut.

My Research

My research focuses on Natural Language Processing methods for low-resource languages, and specifically on how these methods can be applied to the task of Language Revitalization for Indigenous languages. Before my PhD program, I earned a B.S. in Linguistics and completed extensive coursework in Computer Science, including algorithmic analysis, machine learning, and computational linguistics.

I am interested in topics such as unsupervised segmentation of units below the word level, because many Indigenous and otherwise understudied languages have complex structure within a single “word” (as delimited by the spaces on either side of it). This requires NLP that diverges significantly from models built for English and other European languages, in which single “words” tend to have much less internal structure.

Indigenous NLP also requires ways of working around data scarcity. Most state-of-the-art NLP models today assume very large bodies of text, with a corpus of 1 million words being considered “small”. Most Indigenous and endangered languages have bodies of text only a small fraction of the size of English NLP datasets, so my research also aims to discover methods of leveraging statistics, manipulating data, and incorporating information or data from related languages to work around this problem.

Finally, I aim to channel my research not just towards the acquisition of linguistic knowledge, but towards the more concrete task of Language Revitalization among groups who have experienced serious attrition of their native language due to colonialism and other global influences. Many of these communities have experienced outside pressure to give up their traditional language, and many of these languages now exist only in archives that are by and large inaccessible to Indigenous groups. It is my hope that NLP can be a tool to restore such languages to some level of vibrancy within Indigenous communities by making technology-based language learning easy and accessible for both students and teachers, even if the language in question has been dormant for generations.