Natural Language Processing and the challenge of Kalaallisut
Natural Language Processing (NLP) has made substantial leaps in recent years, not least in the form of large language models (LLMs) and generative artificial intelligence. For many NLP engineers, language is essentially a solved problem – all we need is more data and more computing power. However, the theoretical constructs underlying contemporary NLP involve assumptions about languages which are, by and large, based on how major Indo-European languages behave. The engineer can reply with empirical results demonstrating the efficacy of their systems on goal-oriented language tasks – if it works, it works. But does it actually work?
In this talk, I present a work-in-progress experiment which seeks to test the limits of contemporary NLP on a language which looks and behaves very differently to major Indo-European languages, namely Kalaallisut (West Greenlandic). The project tests the feasibility of creating a neural machine translation model from Kalaallisut to Danish. I discuss why this is an important problem and what the results might suggest going forward.
Ross Deans Kristensen-McLachlan is Associate Professor in Cognitive Science and Humanities Computing, based at the Center for Humanities Computing, Aarhus University. He works at the intersection of a number of related disciplines, such as linguistics, cognitive science, natural language processing, and artificial intelligence. Drawing on these disciplines, he studies the following questions:
- How do we construct meaning with and through natural language?
- What are the dynamic, historical processes underlying cultural evolution and change?
- How can we study these phenomena at scale using textual cultural heritage data?
He collaborates with humanities scholars from domains as varied as literary history, classics, art history, and the study of religion. In practice, this collaborative and interdisciplinary approach combines detailed, domain-specific knowledge from the humanities with state-of-the-art technical methods derived from natural language processing and computer vision.