Materials (180 minutes)
Part I (15 minutes): Introduction to grounding.
[Slides] [Video]
Presenter: Freda Shi
We will review the history of grounding and introduce a unified definition of the term.
In this tutorial, grounding refers to processing primary data with supervision
from another source, where the two data sources have positive mutual information.
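Writing X for the primary data and Y for the supervising source (notation ours, not from the tutorial materials), this condition can be stated, for discrete X and Y, as
I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} > 0,
i.e., observing the supervising source reduces uncertainty about the primary data.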
We will exemplify the definition through connections to existing work such as visual grounding,
acoustic grounding, factual grounding, and cross-lingual grounding.
We refer to NAACL 2024 Tutorial 6 on
spatial and temporal grounding,
ACL 2020 Tutorial 5 on
building common ground through communication, and
the AAAI 2013 Keynote
for early work on grounded language learning.
Part II (25 minutes): Learning lexicons through grounding.
[Slides] [Video]
Presenter: Martin Ziqiao Ma
Word acquisition is a core challenge in both cognitive science and robotics.
Recent advances in neural networks and multimodal machine learning have enabled efforts to
ground the meanings of written and spoken words in visual signals.
In this talk, we will explore research on grounding noun and verb meanings through changes in the physical
world.
We will also briefly discuss extensions of lexicon grounding beyond the visual modality,
as well as approaches to bootstrapping grounded word acquisition through meta-learning.
We will introduce the background in the first 10 minutes and focus on recent advances in the
remaining time.
Work on vision-language models, on learning lexical semantics through interaction, and on
learning lexicons to compose sentence-level meanings will be deferred to Part IV.
Part III (25 minutes): Learning syntax through grounding.
[Slides] [Video]
Presenter: Freda Shi
Constituency parses of sentences can be learned by grounding to visual signals.
Follow-up work has demonstrated the effectiveness of such visually grounded systems in learning
variants of constituency and dependency grammars.
In another line of work, word-alignment-based cross-lingual transfer can also be viewed as an
instantiation of learning syntax through cross-lingual grounding,
where the text in the target language(s) is grounded to existing knowledge in the source
language(s).
A brief introduction to related syntactic knowledge, such as constituency, dependency, and
combinatory categorial grammars,
will be presented in the first 10 minutes of this part to help the audience better understand the
content.
We will focus on recent approaches to learning syntax through visual grounding and cross-lingual
grounding in the rest of the time.
Efforts on joint learning of syntax and semantics will be covered in Part IV.
Part IV (100 minutes): Learning complex meanings (semantics and pragmatics) through grounding.
Part IV-1 (25 minutes): Learning concepts through grounding.
[Slides] [Video]
Presenter: Jiayuan Mao
Grounded lexicon learning and grounded syntax learning come together to enable the formation of complex,
compositional grounded concepts.
Lexicon learning maps individual words to grounded perceptual or executable representations,
while syntax learning governs how these word-level representations are composed into structured meanings.
By integrating both, models can not only learn visual or perceptual concepts from language
but also generalize to novel compositions, facilitating systematic and interpretable understanding
of grounded semantics across diverse domains.
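As a toy illustration of how the two pieces fit together, consider the following minimal Python sketch; the scene, the lexicon entries, and the left-to-right composition are hypothetical simplifications for exposition, not the actual models presented in this part.

# Minimal sketch: word meanings are executable filters over a toy scene,
# and composition follows a given syntactic structure (simplified here to
# left-to-right application). Scene and lexicon are hypothetical.

scene = [
    {"shape": "cube", "color": "red"},
    {"shape": "cube", "color": "blue"},
    {"shape": "sphere", "color": "red"},
]

lexicon = {
    "red":    lambda objs: [o for o in objs if o["color"] == "red"],
    "blue":   lambda objs: [o for o in objs if o["color"] == "blue"],
    "cube":   lambda objs: [o for o in objs if o["shape"] == "cube"],
    "sphere": lambda objs: [o for o in objs if o["shape"] == "sphere"],
}

def interpret(words, objs):
    # In grounded concept learners, the composition order would be dictated by
    # a learned or induced parse rather than simple left-to-right application.
    for word in words:
        objs = lexicon[word](objs)
    return objs

print(interpret(["red", "cube"], scene))  # -> [{'shape': 'cube', 'color': 'red'}]

Once the word-level groundings and the composition mechanism are in place, the same machinery interprets novel combinations such as "red sphere" without additional supervision.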
Part IV-2 (25 minutes): Grounding language to world representations: The case of space.
[Slides] [Video]
Presenter: Parisa Kordjamshidi
We cover how spatial semantics are represented, the available datasets and annotations,
and the connection between information extraction models, qualitative spatial reasoning,
and end-to-end deep learning approaches.
We review recent large language models for spatial language comprehension, their evaluation,
and the key limitations and challenges in this area.
We clarify the role of spatial language in downstream applications,
highlighting tasks such as grounding language in the visual world for navigation, wayfinding agents,
human-machine interaction, and situated dialogue systems.
Part IV-3 (25 minutes): Scaling vision-language models with grounding.
[Slides] [Video]
Presenter: Martin Ziqiao Ma
While modern vision-language models (VLMs) have made remarkable progress,
achieving fine-grained grounding of linguistic units to perceptual referents remains an open challenge.
We will review recent advances in mechanistically grounded VLMs, spanning both encoder-based and generative
models.
We highlight how these models offer more detailed perceptual understanding and greater interpretability,
providing new insights into the mechanisms underlying grounded language acquisition.
Part IV-4 (25 minutes): Learning pragmatics through grounding.
[Slides] [Video]
Presenter: Joyce Chai
Grounded interaction provides a powerful source of supervision for language learning,
connecting linguistic expressions directly to perception and action.
Beyond mapping words to perceptual referents, successful communication requires models
to interpret language in context — leveraging shared goals, conventions, and the visual and embodied
environment.
We discuss research on grounded settings and pragmatic modeling, analyzing how grounding
in physical and social contexts shapes linguistic meaning, and how task goals, environmental structure,
and communicative affordances enrich the process of language grounding.
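One standard formalization of such context-sensitive interpretation is the Rational Speech Acts framework; the minimal Python sketch below (a hypothetical two-referent reference game, with uniform priors and speaker rationality fixed to 1, not necessarily the models covered in this part) shows how a pragmatic listener reasons about what a cooperative speaker would have said.

import numpy as np

# Hypothetical reference game: two faces, one wearing only a hat,
# the other wearing both a hat and glasses.
utterances = ["hat", "glasses"]
referents = ["face with hat", "face with hat and glasses"]

# Literal semantics: truth[u][r] = 1 if utterance u is true of referent r.
truth = np.array([
    [1.0, 1.0],   # "hat" is true of both faces
    [0.0, 1.0],   # "glasses" is true only of the second face
])

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

L0 = normalize(truth, axis=1)   # literal listener: P(referent | utterance)
S1 = normalize(L0.T, axis=1)    # pragmatic speaker: P(utterance | referent)
L1 = normalize(S1.T, axis=1)    # pragmatic listener: P(referent | utterance)

# The pragmatic listener infers that "hat" most likely refers to the face with
# only a hat: had the speaker meant the other face, "glasses" would have been
# the more informative choice.
print({r: float(p) for r, p in zip(referents, L1[0])})
# {'face with hat': 0.75, 'face with hat and glasses': 0.25}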
Part V (15 minutes): Future directions and open problems.
[Slides] [Video]
Presenter: Freda Shi
A key discussion of future directions centers on whether grounding should emerge naturally from
scaling models or whether we should enforce grounded supervision to achieve more efficient learning.
Additionally, the scope of grounding can be broadened beyond traditional modalities,
incorporating touch, olfaction, non-human sensors, video and temporal data, 3D environments,
proprioception, episodic experiences, and even other forms of meta-cognition.