Materials (180 minutes)
Part I (15 minutes): Introduction to grounding.
[Slides] [Video]
Presenter: Freda Shi
We will review the history of grounding and introduce a unified definition of the term.
In this tutorial, grounding refers to processing primary data with supervision
from another source, where the two data sources have positive mutual information.
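Writing X for the primary data and Y for the supervising source (notation ours, not from the tutorial materials), this condition can be stated, for discrete X and Y, as
I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} > 0,
i.e., observing the supervising source reduces uncertainty about the primary data.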
We will exemplify the definition through connections to existing work such as visual grounding,
acoustic grounding, factual grounding, and cross-lingual grounding.
We refer to NAACL 2024 Tutorial 6 on
spatial and temporal grounding,
ACL 2020 Tutorial 5 on
building common ground through communication, and
the AAAI 2013 Keynote
for early work on grounded language learning.
Part II (25 minutes): Learning lexicons through grounding.
[Slides] [Video]
Presenter: Martin Ziqiao Ma
Word acquisition is a core challenge in both cognitive science and robotics.
Recent advances in neural networks and multimodal machine learning have enabled efforts to
ground the meanings of written and spoken words in visual signals.
In this talk, we will explore research on grounding noun and verb meanings through changes in the physical
world.
We will also briefly discuss extensions of lexicon grounding beyond the visual modality,
as well as approaches to bootstrapping grounded word acquisition through meta-learning.
We will introduce the background in the first 10 minutes and focus on recent advances in the
remaining time.
Work on vision-language models, on learning lexical semantics through interaction, and on
learning lexicons to compose sentence-level meanings will be deferred to Part IV.
Part III (25 minutes): Learning syntax through grounding.
[Slides] [Video]
Presenter: Freda Shi
Constituency parses of sentences can be learned by grounding to visual signals.
Follow-up work has demonstrated the effectiveness of such visually grounded systems in learning
variants of constituency and dependency grammars.
In another line of work, word-alignment-based cross-lingual transfer can also be viewed as an
instantiation of learning syntax through cross-lingual grounding,
where the text in the target language(s) is grounded to existing knowledge in the source
language(s).
A brief introduction to related syntactic knowledge, such as constituency, dependency, and
combinatory categorial grammars,
will be presented in the first 10 minutes of this part to help the audience better understand the
content.
We will focus on recent approaches to learning syntax through visual grounding and cross-lingual
grounding in the rest of the time.
Efforts on joint learning of syntax and semantics will be covered in Part IV.
Part IV (100 minutes): Learning complex meanings (semantics and pragmatics) through grounding.
Part IV-1 (25 minutes): Learning concepts through grounding.
[Slides] [Video]
Presenter: Jiayuan Mao
Grounded lexicon learning and grounded syntax learning come together to enable the formation of complex,
compositional grounded concepts.
Lexicon learning maps individual words to grounded perceptual or executable representations,
while syntax learning governs how these word-level representations are composed into structured meanings.
By integrating both, models can not only learn visual or perceptual concepts from language
but also generalize to novel compositions, facilitating systematic and interpretable understanding
of grounded semantics across diverse domains.
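As a toy illustration of how the two pieces fit together, consider the following minimal Python sketch; the scene, the lexicon entries, and the left-to-right composition are hypothetical simplifications for exposition, not the actual models presented in this part.

# Minimal sketch: word meanings are executable filters over a toy scene,
# and composition follows a given syntactic structure (simplified here to
# left-to-right application). Scene and lexicon are hypothetical.

scene = [
    {"shape": "cube", "color": "red"},
    {"shape": "cube", "color": "blue"},
    {"shape": "sphere", "color": "red"},
]

lexicon = {
    "red":    lambda objs: [o for o in objs if o["color"] == "red"],
    "blue":   lambda objs: [o for o in objs if o["color"] == "blue"],
    "cube":   lambda objs: [o for o in objs if o["shape"] == "cube"],
    "sphere": lambda objs: [o for o in objs if o["shape"] == "sphere"],
}

def interpret(words, objs):
    # In grounded concept learners, the composition order would be dictated by
    # a learned or induced parse rather than simple left-to-right application.
    for word in words:
        objs = lexicon[word](objs)
    return objs

print(interpret(["red", "cube"], scene))  # -> [{'shape': 'cube', 'color': 'red'}]

Once the word-level groundings and the composition mechanism are in place, the same machinery interprets novel combinations such as "red sphere" without additional supervision.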
Part IV-2 (25 minutes): Grounding language to world representations: The case of space.
[Slides] [Video]
Presenter: Parisa Kordjamshidi
We cover how spatial semantics are represented, the available datasets and annotations,
and the connection between information extraction models, qualitative spatial reasoning,
and end-to-end deep learning approaches.
We review recent large language models for spatial language comprehension, their evaluation,
and the key limitations and challenges in this area.
We clarify the role of spatial language in downstream applications,
highlighting tasks such as grounding language in the visual world for navigation, wayfinding agents,
human-machine interaction, and situated dialogue systems.
Part IV-3 (25 minutes): Scaling vision-language models with grounding.
[Slides] [Video]
Presenter: Martin Ziqiao Ma
While modern vision-language models (VLMs) have made remarkable progress,
achieving fine-grained grounding of linguistic units to perceptual referents remains an open challenge.
We will review recent advances in mechanistically grounded VLMs, spanning both encoder-based and generative
models.
We highlight how these models offer more detailed perceptual understanding and greater interpretability,
providing new insights into the mechanisms underlying grounded language acquisition.
Part IV-4 (25 minutes): Learning pragmatics through grounding.
[Slides] [Video]
Presenter: Joyce Chai
Grounded interaction provides a powerful source of supervision for language learning,
connecting linguistic expressions directly to perception and action.
Beyond mapping words to perceptual referents, successful communication requires models
to interpret language in context — leveraging shared goals, conventions, and the visual and embodied
environment.
We discuss research on grounded settings and pragmatic modeling, analyzing how grounding
in physical and social contexts shapes linguistic meaning, and how task goals, environmental structure,
and communicative affordances enrich the process of language grounding.
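One standard formalization of such context-sensitive interpretation is the Rational Speech Acts framework; the minimal Python sketch below (a hypothetical two-referent reference game, with uniform priors and speaker rationality fixed to 1, not necessarily the models covered in this part) shows how a pragmatic listener reasons about what a cooperative speaker would have said.

import numpy as np

# Hypothetical reference game: two faces, one wearing only a hat,
# the other wearing both a hat and glasses.
utterances = ["hat", "glasses"]
referents = ["face with hat", "face with hat and glasses"]

# Literal semantics: truth[u][r] = 1 if utterance u is true of referent r.
truth = np.array([
    [1.0, 1.0],   # "hat" is true of both faces
    [0.0, 1.0],   # "glasses" is true only of the second face
])

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

L0 = normalize(truth, axis=1)   # literal listener: P(referent | utterance)
S1 = normalize(L0.T, axis=1)    # pragmatic speaker: P(utterance | referent)
L1 = normalize(S1.T, axis=1)    # pragmatic listener: P(referent | utterance)

# The pragmatic listener infers that "hat" most likely refers to the face with
# only a hat: had the speaker meant the other face, "glasses" would have been
# the more informative choice.
print({r: float(p) for r, p in zip(referents, L1[0])})
# {'face with hat': 0.75, 'face with hat and glasses': 0.25}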
Part V (15 minutes): Future directions and open problems.
[Slides] [Video]
Presenter: Freda Shi
A key discussion of future directions centers on whether grounding should emerge naturally from
scaling models or whether we should enforce grounded supervision to achieve more efficient learning.
Additionally, the scope of grounding can be broadened beyond traditional modalities,
incorporating touch, olfaction, non-human sensors, video and temporal data, 3D environments,
proprioception, episodic experiences, and even other forms of meta-cognition.