Understanding Language
At the Understanding Language seminar, five researchers give talks on different approaches to meaning, drawing on cognitive science, computational linguistics, linguistics and philosophy. The speakers are Mikkel Wallentin from Aarhus University, Desmond Elliott from the University of Copenhagen, Anna Rogers from the IT University of Copenhagen, Juan Luis Gastaldi from Eidgenössische Technische Hochschule Zürich and Stéphane Polis from the University of Liège.
Program
14.00 | Mikkel Wallentin | Language as a window into the mind
14.50 | Stéphane Polis | Computer-assisted approaches to semantic maps: History, methods and future prospects in semantic typology
15.40 | Coffee break
16.10 | Anna Rogers | Training data for Large Language Models: how can we collect it ethically and study it?
17.00 | Juan Luis Gastaldi | Understanding Language (Models): AI-Language Models From a Structuralist Perspective
17.50 | Desmond Elliott | Language Modelling from Pixels
Abstracts and bio notes
Mikkel Wallentin
Language as a window into the mind
Abstract: Linguistics is grounded in the assumption that language can be studied independently of other human behaviour. This talk takes as its point of departure that this assumption is wrong. Language is deeply intertwined with everything we do and think; it is never independent of cognition, and cognition is never independent of language. In this talk, Mikkel Wallentin will present some of his studies that exemplify how language affects cognition and how language can be used as a means for studying the human mind.
Bio note: Mikkel Wallentin is a Professor of cognitive science at Aarhus University. He is affiliated with the School of Communication, Culture and Cognitive Science at Aarhus University and the Centre for Functionally Integrative Neuroscience at Aarhus University Hospital. Central to his research is the way language is processed in the brain, but he also works with a range of other topics related to cognitive science. His latest publications deal with the role of inner speech in task completion, gender differences in children’s word production, and the way music and literature relate to cognition. Mikkel is also a successful fiction writer and playwright.
Stéphane Polis
Computer-assisted approaches to semantic maps: History, methods and future prospects in semantic typology
Abstract: A semantic map is a way to visually represent the relationships (or similarity) between meanings based on patterns of co-expression across languages (Georgakopoulos & Polis 2018; 2022). In this talk, I will first present a brief history of research in the field, starting with the so-called ‘classical semantic maps’ (Haspelmath 2003; van der Auwera 2013) — which typically take the form of a graph, with nodes standing for meanings and edges between nodes standing for relationships between meanings — before turning to the so-called ‘proximity maps’ — which resort to statistical techniques that position the meanings in a two-dimensional space (Klis & Tellings 2022; Wälchli 2023). In a second step, I will demonstrate how both types of maps may be inferred from large-scale typological data and evaluate the pros and cons (including the goodness of fit) of these two methods for creating co-expression maps (Croft 2022). Finally, three fundamental questions of the semantic map model will be addressed based on case studies in the fields of perception/cognition and of emotions/values: (1) the validity of the basic assumption, namely, to what extent does co-expression reflect semantic similarity; (2) the central problem of identifying analytical primitives in the domain of semantics; and (3) the possible use of semantic maps to support diachronic and synchronic descriptions of individual languages.
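For readers who want a concrete picture of the two map types described above, the following sketch builds a toy 'classical' semantic map as a graph (nodes for meanings, weighted edges for attested co-expression) and a toy 'proximity' map by applying multidimensional scaling to a co-expression dissimilarity matrix. The meanings, the co-expression data and the choice of MDS are illustrative assumptions, not the datasets or exact methods used in the talk; this is a minimal Python sketch using networkx and scikit-learn.

# Illustrative sketch of the two map types described above, using made-up
# co-expression data (not the speaker's datasets or exact methods).
import networkx as nx
import numpy as np
from sklearn.manifold import MDS

# Toy cross-linguistic data: for each language, sets of meanings that are
# co-expressed by a single form (entirely hypothetical).
coexpression_patterns = {
    "lang_A": [{"SEE", "UNDERSTAND"}, {"HEAR", "OBEY"}],
    "lang_B": [{"SEE", "UNDERSTAND", "KNOW"}],
    "lang_C": [{"HEAR", "UNDERSTAND"}, {"SEE", "KNOW"}],
}

meanings = sorted({m for patterns in coexpression_patterns.values()
                   for pattern in patterns for m in pattern})

# Count how often each pair of meanings is co-expressed across languages.
counts = {}
for patterns in coexpression_patterns.values():
    for pattern in patterns:
        for a in pattern:
            for b in pattern:
                if a < b:
                    counts[(a, b)] = counts.get((a, b), 0) + 1

# 1) 'Classical' map: a graph whose nodes are meanings and whose edges mark
#    attested co-expression (edge weight = number of supporting languages).
classical_map = nx.Graph()
classical_map.add_nodes_from(meanings)
for (a, b), n in counts.items():
    classical_map.add_edge(a, b, weight=n)

# 2) 'Proximity' map: place meanings in two dimensions so that frequently
#    co-expressed meanings end up close together (metric MDS on dissimilarities).
n_langs = len(coexpression_patterns)
dissim = np.ones((len(meanings), len(meanings)))
np.fill_diagonal(dissim, 0.0)
for (a, b), n in counts.items():
    i, j = meanings.index(a), meanings.index(b)
    dissim[i, j] = dissim[j, i] = 1.0 - n / n_langs

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)

for meaning, (x, y) in zip(meanings, coords):
    print(f"{meaning:12s} x={x:+.2f} y={y:+.2f}")
print("edges:", list(classical_map.edges(data="weight")))

In an actual study, the co-expression counts would of course come from large typological databases rather than a hand-written dictionary, and the choice of dimensionality-reduction technique would itself be part of the evaluation the abstract describes.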
Bio note: Stéphane Polis is a Senior Research Associate at the National Fund for Scientific Research (FNRS/Belgium) and Associate Professor at the University of Liège. His fields of research include Ancient Egyptian linguistics, Late Egyptian philology and linguistic typology. His work focuses mostly on language variation and change in Ancient Egyptian–Coptic, and on the publication and analysis of hieratic material from the community of Deir el-Medina. He supervises the development of the Ramses Project at the University of Liège with Jean Winand (http://ramses.ulg.ac.be) and of the Thot Sign-List (Berlin-Brandenburg Academy of Sciences and Humanities & the University of Liège). He coordinates the semantic maps project Le Diasema with Thanasis Georgakopoulos (http://web.philo.ulg.ac.be/lediasema/) and the project Crossing Boundaries. Understanding Complex Scribal Practices in Ancient Egypt, with Antonio Loprieno (http://web.philo.ulg.ac.be/x-bound/). Publications: tinyurl.com/bxfddr5d
Anna Rogers
Training data for Large Language Models: how can we collect it ethically and study it?
Abstract: The continued growth of LLMs and their wide-scale adoption in commercial applications such as ChatGPT make it increasingly important to (a) develop ways to source their training data in a more transparent way, and (b) investigate it, both for research and for ethical issues. This talk will discuss the current state of affairs and some data governance lessons learned from BigScience, an open-source effort to train a multilingual LLM, including an ongoing effort to investigate the 1.6 TB multilingual ROOTS corpus.
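As a concrete illustration of what studying such training data can look like in practice, here is a minimal Python sketch that streams a small sample from a released corpus subset via the Hugging Face datasets library and tallies simple per-document statistics. The dataset identifier and the record fields ("text", "meta", "source") are assumptions for illustration only and do not refer to the exact data or tooling discussed in the talk.

# Minimal sketch of inspecting a large text corpus by streaming a sample.
# The dataset identifier and column names below are illustrative assumptions;
# consult the BigScience/ROOTS documentation for the actual released subsets.
from collections import Counter
from datasets import load_dataset

DATASET_ID = "bigscience-data/roots_en_wikipedia"  # hypothetical subset name

sample_size = 1000
lengths = []
sources = Counter()

# streaming=True avoids downloading the full corpus before inspecting it.
stream = load_dataset(DATASET_ID, split="train", streaming=True)
for i, record in enumerate(stream):
    if i >= sample_size:
        break
    text = record.get("text", "")
    lengths.append(len(text.split()))
    meta = record.get("meta", {})  # metadata layout is an assumption
    if isinstance(meta, dict):
        sources[meta.get("source", "unknown")] += 1

print(f"documents sampled: {len(lengths)}")
print(f"mean length (words): {sum(lengths) / max(len(lengths), 1):.1f}")
print("top sources:", sources.most_common(5))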
Bio note: Anna Rogers is an Assistant Professor in the Computer Science Department at the IT University of Copenhagen. She was previously a postdoc at the University of Massachusetts and later at the Centre for Social Data Science at the University of Copenhagen. Her main research area is Natural Language Processing. She holds a PhD degree from the Department of Language and Information Sciences at the University of Tokyo (Japan). She has worked with sentiment analysis, question answering and analysis of meaning representations. Her current research focuses on model analysis and evaluation of natural language understanding systems.
Juan Luis Gastaldi
Understanding Language (Models): AI-Language Models From a Structuralist Perspective
Abstract: Neural language models have revolutionised the study of natural language in the last decade. However, their surprisingly high performance has not yet been followed by solid theoretical advances concerning language itself, to the extent that the evolution of this new line of research often leaves the impression that our theoretical understanding of language is actually receding. To address this problem, I propose to assess the technical advances of current neural models from the perspective of classical structuralist linguistics. Such a perspective should result in a clarification of the theoretical stakes of current research trends as well as a critical renewal of classical theoretical principles for the study of language.
Bio note: Juan Luis Gastaldi is a philosopher and historian of science, specialising in the philosophy and history of formal knowledge (mathematics, logic and computer science) from the beginning of the 19th century to the present day. His research focuses on the formalisation of meaning as a central problem of modern and contemporary philosophy, grasping the complex articulation between logic, mathematics and linguistics. Developing the principles of historical epistemology, he seeks to understand the emergence and evolution of formal knowledge, from early 19th-century mathematical logic to contemporary computer science, in the light of a theory of formal languages at the crossroads of the natural and social sciences.
Desmond Elliott
Language Modelling from Pixels
Abstract: Language models are defined over a finite set of inputs, which creates a representational and computational bottleneck when we attempt to scale the number of languages supported by a model. Tackling this bottleneck usually results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. Desmond Elliott will present the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pre-trained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. It is trained on predominantly English data from the Wikipedia and BookCorpus datasets to reconstruct the pixels of masked patches instead of predicting a probability distribution over tokens. Desmond will present the results of an 86M-parameter model on downstream syntactic and semantic tasks in 32 typologically diverse languages across 14 scripts. He will also show that PIXEL is robust to noisy text inputs, further confirming the benefits of modelling language with pixels. Finally, he will discuss some more recent experiments on further improving pixel-based language representations.
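To make the rendering-and-masking idea concrete, here is a minimal Python sketch that renders a sentence to a greyscale image, splits it into fixed-size patches, and masks a random subset of them for reconstruction. The image size, patch size, mask ratio and use of PIL/NumPy are illustrative assumptions, not the actual PIXEL implementation.

# Conceptual sketch of the pixel-based input pipeline described above:
# render text as a greyscale image, cut it into fixed-size patches, and mask
# a random subset of patches. Image size, patch size and mask ratio are
# illustrative choices, not the actual PIXEL configuration.
import numpy as np
from PIL import Image, ImageDraw

def render_text(text, height=16, width=256):
    """Render text onto a greyscale canvas and return it as a NumPy array."""
    canvas = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(canvas).text((2, 2), text, fill=0)  # default PIL font
    return np.asarray(canvas, dtype=np.float32) / 255.0

def to_patches(image, patch=16):
    """Split an (H, W) image into non-overlapping patch-by-patch tiles."""
    h, w = image.shape
    tiles = image.reshape(h // patch, patch, w // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch, patch)

def mask_patches(patches, ratio=0.25, seed=0):
    """Zero out a random subset of patches; return masked patches and the mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(patches)) < ratio
    masked = patches.copy()
    masked[mask] = 0.0
    return masked, mask

pixels = render_text("Language modelling from pixels")
patches = to_patches(pixels)          # shape: (num_patches, 16, 16)
masked, mask = mask_patches(patches)

# A PIXEL-style model would be trained to reconstruct the original pixel
# values of the masked patches from the visible ones.
print("patches:", patches.shape, "masked:", int(mask.sum()))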
Bio note: Desmond Elliott is an Assistant Professor and a Villum Young Investigator at the University of Copenhagen where he develops multimodal and multilingual models. His work received the Best Long Paper Award at EMNLP 2021 and an Area Chair Favourite paper at COLING 2018. His research is funded by the Velux Foundations, the Innovation Foundation Denmark, and the Novo Nordisk Foundation. He co-organised the Multimodal Machine Translation Shared Task from 2016 to 2018, the 2018 Frederick Jelinek Memorial Workshop on Grounded Sequence-to-Sequence Learning, the How2 Challenge Workshop at ICML 2019, and the Workshop on Multilingual Multimodal Learning at ACL 2022.