Materials
Part I (20 minutes): Introduction to grounding.
We will review the history of grounding and introduce a unified definition of the term.
In particular, grounding in this tutorial refers to processing primary data with supervision
from another source, where the two sources of data have positive mutual information.
We will exemplify this definition through connections to existing work on visual grounding,
acoustic grounding, factual grounding, and cross-lingual grounding.
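As a minimal formal sketch of this condition (notation ours, not tied to any particular prior formulation): for primary data $X$ and a grounding source $Y$, we require
\[
I(X;Y) \;=\; \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right] \;>\; 0,
\]
i.e., the grounding source must carry nonzero information about the primary data; for instance, a caption paired with its image satisfies this, whereas a randomly sampled caption does not.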
We refer the audience to ACL 2020 Tutorial 5 on
building common ground through communication, and to
the AAAI 2013 keynote
for early work on grounded language learning.
Part II (30 minutes): Learning lexicons through grounding.
Word acquisition is a fundamental problem in language acquisition that concerns both cognitive
science and robotics.
With the advancement of neural networks and multimodal machine learning, there has been work on
learning the meanings of written or spoken words by grounding language to visual signals.
In particular, some work has focused on grounding verb semantics to changes in the physical
world.
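As a minimal illustrative sketch (hypothetical code, not any specific published model) of how visual signals can supervise word meaning, a contrastive objective aligns word embeddings with the features of co-occurring images:

import torch
import torch.nn.functional as F

# Hypothetical toy objective: words and images are embedded in a shared space;
# the i-th word co-occurs with the i-th image, and the other images in the
# batch serve as negatives.
def contrastive_grounding_loss(word_emb, image_emb, temperature=0.07):
    word_emb = F.normalize(word_emb, dim=-1)    # (batch, dim)
    image_emb = F.normalize(image_emb, dim=-1)  # (batch, dim)
    logits = word_emb @ image_emb.t() / temperature  # similarity of every word to every image
    targets = torch.arange(word_emb.size(0))         # the correct image for each word
    return F.cross_entropy(logits, targets)

Minimizing such a loss pushes each word embedding toward the visual contexts it co-occurs with, which is one way of operationalizing word meaning acquisition through visual grounding.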
Another line of work learns lexicons through cross-lingual grounding.
We will introduce the background in the first 10 minutes and focus on recent advances in the
remaining time.
Work on learning lexical semantics through interaction, or on learning lexicons that compose into
sentence-level meanings, will be deferred to Part IV.
Part III (30 minutes): Learning syntax through grounding.
Constituency parses of sentences can be learned by grounding to visual signals.
Follow-up work has demonstrated the effectiveness of such visually grounded systems on learning
variants of constituency and dependency grammars.
Along another line, word-alignment-based cross-lingual transfer can also be considered an
instantiation of learning syntax through cross-lingual grounding,
where text in the target language(s) is grounded to existing knowledge in the source
language(s).
A brief introduction to relevant syntactic formalisms, such as constituency, dependency, and
combinatory categorial grammars,
will be presented in the first 10 minutes of this part to help the audience better understand the
content.
We will focus on recent approaches to learning syntax through visual grounding and cross-lingual
grounding in the rest of the time.
Efforts on jointly learning syntax and semantics will be covered in Part IV.
Part IV (60 minutes): Learning complex meanings (semantics
and pragmatics) through grounding.
There has been significant interest in learning and evaluating meaning acquisition in visually
grounded settings.
In addition to visual grounding, interaction is another common source of supervision, where
considerations of pragmatics and theory of mind often come into play.
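As one simplified illustration of how such pragmatic reasoning can be operationalized, the sketch below implements a rational-speech-acts-style listener on a toy scalar-implicature lexicon; the utterances, worlds, and numbers are hypothetical and serve only to show the nested speaker-listener reasoning:

import numpy as np

# Toy lexicon: rows = utterances ("some", "all"), columns = worlds (some-but-not-all, all).
# truth[u, w] = 1 if utterance u is literally true in world w.
truth = np.array([[1.0, 1.0],   # "some" is true in both worlds
                  [0.0, 1.0]])  # "all" is true only in the all-world
prior = np.array([0.5, 0.5])

# Literal listener L0: P(w | u) proportional to truth * prior
L0 = truth * prior
L0 = L0 / L0.sum(axis=1, keepdims=True)

# Pragmatic speaker S1: P(u | w) proportional to exp(alpha * log L0(w | u))
alpha = 1.0
S1 = np.exp(alpha * np.log(L0 + 1e-12)).T  # transpose: rows = worlds, cols = utterances
S1 = S1 / S1.sum(axis=1, keepdims=True)

# Pragmatic listener L1: P(w | u) proportional to S1(u | w) * prior -- reasons about the speaker
L1 = S1.T * prior
L1 = L1 / L1.sum(axis=1, keepdims=True)

print(L1[0])  # hearing "some", L1 shifts belief toward the some-but-not-all world

Hearing "some", the pragmatic listener assigns higher probability to the some-but-not-all world, because it reasons that a speaker who had observed the all-world would have said "all".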
Similar to the lexicon case in Part II, cross-lingual transfer of sentence- or
document-level meanings,
particularly transfer from high-resource to low-resource languages, should also be
considered an instantiation of cross-lingual grounding.
This part will cover three topics for 20 minutes each: learning semantics through grounding,
learning pragmatics through grounded interaction,
and learning cross-lingual text representations through cross-lingual grounding.
Part V (15 minutes): Discussion on future directions and
open problems.
A key discussion of future directions centers on whether grounding should emerge naturally from
scaling up models or whether grounded supervision should be enforced to achieve more efficient learning.
Additionally, the scope of grounding can be broadened beyond traditional modalities to
incorporate touch, olfaction, non-human sensors, video and temporal data, 3D environments,
proprioception, episodic experiences, and even forms of meta-cognition.