
Artificial Intelligence Review (2023) 56:365–427

https://doi.org/10.1007/s10462-022-10174-9

Visual language navigation: a survey and open challenges

Sang‑Min Park1 · Young‑Gab Kim2

Accepted: 11 March 2022 / Published online: 25 March 2022


© The Author(s), under exclusive licence to Springer Nature B.V. 2022

Abstract
With the recent development of deep learning, AI models are widely used in various domains. AI models show good performance on well-defined tasks such as image classification and text generation. With the recent development of generative models (e.g., BigGAN, GPT-3), AI models also show impressive results on diverse generation tasks (e.g., photorealistic image and paragraph generation). As the performance of individual AI models improves, interest in comprehensive tasks, such as visual language navigation (VLN), in which an agent follows language instructions from an egocentric view, is also growing. However, model integration for VLN is difficult because of model complexity, modal heterogeneity, and paired-data shortage. This study provides a comprehensive survey of VLN with a systematic approach for reviewing recent trends. First, we define a taxonomy of the fundamental techniques needed to perform VLN. We analyze VLN from four perspectives: representation learning, reinforcement learning, components, and evaluation. We investigate the pros and cons of each component and methodology proposed recently. Unlike conventional surveys, this survey categorizes the approaches of major research institutes using the taxonomy defined from these four perspectives. Finally, we discuss current open challenges and conclude the study by suggesting possible future directions.

Keywords Artificial intelligence · Visual language navigation · Representation learning · Reinforcement learning

1 Introduction

With the success of artificial intelligence (AI) models in individual domains, interest in comprehensive tasks through model integration is also increasing. In particular, a large body of research has emerged on integration and translation between vision and language. This is because

* Young‑Gab Kim
alwaysgabi@sejong.ac.kr
Sang‑Min Park
wiyard@korea.ac.kr
1 Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
2 Department of Computer and Information Security, and Convergence Engineering for Intelligent Drone, Sejong University, Seoul 05006, Republic of Korea


both the vision and language domains have high technological maturity and use similar methods (e.g., transformers and attention). Representative examples of comprehensive tasks are visual question answering (VQA), which answers natural language queries about images, and the video pre-trained model (VPM), a pre-training model that trains on images and language simultaneously. Furthermore, studies on embodied question answering (EQA) and visual language navigation (VLN) are also increasing by introducing the concept of embodiment. EQA is a task that extends VQA by adding the concept of agent movement. VQA (e.g., How many bananas are there?) is a series of processes that interprets the user's natural language query with a natural language model and finds the answer in the image with a computer vision model. EQA adds the process of finding a location in order to answer a question that cannot be answered from the current viewpoint. For example, to answer how many cups are in the kitchen, an agent needs to understand the meaning of the query with a natural language model, then go to the kitchen and count the cups with a computer vision model. VLN is a task in which an agent follows a person's instruction with an egocentric view. A VLN agent moves to a specific place to obtain information or achieve a given goal. Suppose there is a VLN instruction "Go out the door and turn right. Go forward and turn left to reach the refrigerator." To perform such a VLN task, the natural language model must decompose and interpret the user's instruction into low-level actions. After that, in order to move to the target, the agent recognizes objects (e.g., doors and refrigerators) with the computer vision model and extracts the information necessary for movement (e.g., obstacles and free space).
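To make this decomposition step concrete, the toy sketch below (a hypothetical phrase table and parse_instruction helper, not taken from any cited system) maps the example instruction to a sequence of low-level action primitives:

```python
# Toy illustration (not from any cited system): splitting a VLN instruction
# into clauses and mapping each clause to a low-level action primitive.
from typing import List

ACTION_VOCAB = {
    "go out": "MOVE_FORWARD",
    "go forward": "MOVE_FORWARD",
    "turn right": "TURN_RIGHT",
    "turn left": "TURN_LEFT",
    "reach": "STOP",
}

def parse_instruction(instruction: str) -> List[str]:
    """Very rough decomposition: split on sentence/clause boundaries and
    match each clause against a small phrase-to-action table."""
    clauses = [c.strip().lower() for part in instruction.split(".")
               for c in part.split(" and ") if c.strip()]
    actions = []
    for clause in clauses:
        for phrase, action in ACTION_VOCAB.items():
            if phrase in clause:
                actions.append(action)
                break
    return actions

print(parse_instruction(
    "Go out the door and turn right. Go forward and turn left to reach the refrigerator."))
# ['MOVE_FORWARD', 'TURN_RIGHT', 'MOVE_FORWARD', 'TURN_LEFT']
```

Real VLN systems replace this table lookup with learned language and vision models, but the input/output contract is the same: an instruction in, a sequence of executable low-level actions out.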
Since VLN tasks suffer from model complexity, modal heterogeneity, and paired-data shortage, VLN research has introduced multimodal, multi-task, and end-to-end (E2E) learning. Recently, novel methodologies have been proposed to reduce model complexity and enable scalable AI integration for VLN tasks (Mei et al. 2020; Garcia-Ceja et al. 2018). They argue for the necessity of research on modal coherence, augmented paired data, and multi-agent simulation environments.
This survey comprehensively examines the latest approaches to VLN tasks from four perspectives: representation learning, reinforcement learning, components, and evaluation (see Fig. 1). Representation learning is the beginning of a VLN task and requires a comprehensive understanding of the natural language instruction and visual recognition of the surrounding environment. Humans do not interpret only the meaning of the textual instruction when they communicate with others. The current multimodal dialog system compensates for the ambiguity of the instruction to determine the speaker's underlying intention based on direct and indirect representations of the surrounding environment. A VLN agent obtains more detailed visual information by recognizing the text in the image instead of simply identifying the object type. However, a multi-modal system requires a large amount of paired data because it needs to learn different modalities simultaneously. For example, a multi-modal interaction system requires various kinds of sensor information (e.g., expression, tone, lip-reading, conversation, eye-gazing) processed on the same timeline. On the other hand, multimodal joint training of similar tasks has recently been proposed to obtain better performance based on the common characteristics of the tasks. After a large pre-trained model is created, it is used in downstream tasks by applying fine-tuning and few-shot learning. A VLN model built on a pre-trained model obtains better performance and robustness with a relatively small model size.
Reinforcement learning plays an important role in path planning and task planning in a complex VLN environment. A VLN agent aims to improve progressively by making choices that maximize rewards according to each environmental state, without explicit feedback. Language-grounded RL utilizes the characteristics of extensible language and splits tasks into sub-tasks and low-level skills. Offline RL has recently been proposed; it provides stable learning owing to batch training and performs well in closed-loop environments. We cover the latest studies on meta-learning, planning, hierarchical, and self-supervised approaches to improve the performance of RL for VLN tasks.

Fig. 1  Organization of the survey of visual language navigation
Diverse VLN approaches have been proposed to solve the inconsistency between multiple tasks because a coherent combination of vision, language, and action is important. Some studies provide agent models with generality and robustness for unseen environments by training end-to-end and using pre-trained models (e.g., BERT, GPT-3) built on large amounts of data. Graph models for structural approaches and generative models for providing more diverse responses are also used in VLN tasks. In the case of complex VLN tasks, the problem is solved by inferring from history and facts using memory and reasoning. Since VLN comprises several complex tasks, research on methodologies such as data and benchmarks is required. Simulation is important for comprehensive tasks because it is difficult to analyze and evaluate various modalities and tasks in the real world due to cost and complexity. However, general-purpose simulators that consistently handle all of these modalities are still insufficient.
VLN tasks can be divided into simply navigating through space according to an intuitive and detailed description, finding an object at a specified location, and finding an object at an unknown location. Recently, by communicating with an oracle, agents have become able to find things in more complex tasks. Instructions are becoming shorter and more implicit. Depending on whether the agent has prior knowledge of the space and common sense, the problem becomes more complicated, for example requiring coreference resolution. Conversations carry increasingly connotative meanings and deal with more objects.
Since VLN deals with diverse tasks, including representation, reasoning, mental models, pose, and movement in a complex manner, survey research for each area is essential. Modal conversion and fusion are crucial in VLN tasks, and there have been several comprehensive studies. Table 1 shows a comparison of existing surveys concerned with VLN.
Mei et al. reviewed the approaches for 'vision to language' and 'language to vision' (Mei et al. 2020). Mogadala et al. summarized vision and language integration tasks in terms of problem formulation, methods, datasets, evaluation measures, and comparison of results (Mogadala et al. 2021). The integrated tasks include VQA, visual dialog, referring expression, MMT, and visual reasoning. Mshali et al. summarized multi-modal data analytics, including critical components, collaboration, adversarial competition, and fusion over multi-modal spaces (Mshali et al. 2018). Madureira and Schlangen presented a survey that focuses on linguistically interpretable and grounded representations, justification of design decisions, and evaluation of the effectiveness of natural language state representations for RL (Madureira and Schlangen 2020).
Various types of surveys, such as those on reasoning, pose, and health monitoring, provide information relevant to VLN tasks. Chen et al. presented a survey that focuses on the problem of formalizing reasoning for embedded agents in a modeled environment (Chen et al. 2020e). Garcia-Ceja et al. surveyed mental health monitoring systems (MHMS) and introduced a classification taxonomy for mental disorders and conditions (e.g., depression, anxiety, stress) (Garcia-Ceja et al. 2018). Wang reviewed recent deep learning-based 2D and 3D human pose estimation in terms of frameworks, benchmark datasets, evaluation metrics, and performance comparison (Wang 2021). Zeng et al. provided a systematic review of visual navigation based on deep reinforcement learning (Zeng et al. 2020). They introduce methods, challenges, and opportunities focusing on RL-based visual navigation. In contrast, this survey deals with visual language navigation using a broad taxonomy that includes representation learning and compares the recent research of major research institutes as a case study. The study's main contributions are as follows:

• A VLN taxonomy is proposed by summarizing the relevant technologies and is used to classify the studies of major research institutes (see Fig. 2). We classify the VLN models and major approaches into representation learning, reinforcement learning, components, and evaluation. For each approach, we summarize the technologies that have recently attracted issues and interest.
• Based on the classification results, we summarize the technologies that the major research institutes are focusing on and their research directions.

Table 1  Descriptions and comparisons of surveys concerned with visual language navigation

Vendor | Domain | Topic | Data/Metric/Model/Task | Highlights
Mei et al. (2020) | Conversation | VL and LV | O O O | Review the recent advances along vision-to-language and language-to-vision
Mogadala et al. (2021) | Integration | VL | O O O | Focus on ten prominent tasks that integrate language and vision by discussing their problem formulation, methods, existing datasets, and evaluation measures
Wang (2021) | Integration | Multimodal | O O O O | Multimodal data analytics for collaboration, adversarial competition, and fusion over multi-modal spaces
Garcia-Ceja et al. (2018) | Sensor | Multimodal | O O O | Focus on mental disorders and conditions (depression, anxiety, bipolar disorder, stress)
Chen et al. (2020a, b, c, d, e) | Sensor | Pose estimation | O O O | Review the recent deep learning-based 2D and 3D human pose estimation methods
Mshali et al. (2018) | Application | Home environment | O O O | Review of health smart monitoring systems for individuals, especially elderly and dependent persons
Madureira and Schlangen (2020) | Method | RL with NL | O O | Dive into NLP papers that apply RL methods and whose state representations have to capture linguistic features that influence decision-making
Zeng et al. (2020) | Method | VLN | O | Provide a comprehensive and systematic review of visual navigation based on deep reinforcement learning
Proposed | Method | VLN | O O O O | Classify the VLN models and major approaches into representation learning, reinforcement learning, components, and evaluations; summarize technologies and use them to classify the studies of major research institutes via the taxonomy

Fig. 2  Taxonomy of visual language navigation

• Finally, we discuss the technologies that will be necessary in the near future based on the approaches of major research institutes.

The remainder of this study is organized as follows. Section 2 summarizes the approaches of VLN representation learning, the first of the main stages. Section 3 deals with VLN reinforcement learning. Section 4 summarizes various VLN methods, from graph models to interdisciplinary models. Section 5 presents various evaluation methods, including datasets, simulations, and benchmarks. After that, we apply the taxonomy to the research results of major research institutes: DeepMind, Google Research, and Facebook Research. Section 6 presents and comprehensively discusses the studies and insights of these three research institutes on VLN tasks. We predict the direction of development and suggest the necessary challenges in Sect. 7. Finally, Sect. 8 concludes the paper.


2 VLN Representation Learning

Representation learning is a machine learning method for learning useful representations (e.g., latent features) by compressing and expressing input information in a specific form. There has been a growing trend of research on multi-modal representation learning that analyzes an object with multiple modalities: vision, language, and sound. Robust recognition and accurate classification based on these representations are important for adapting to unseen environments in VLN tasks. VLN begins with recognizing the visual difference between the current location and the target object's location.

2.1 Visual representation

Visual representation is divided into object and scene representation. Object representation is used to classify target objects, and recently, research on acquiring textual information about objects has been attracting attention. Scene representation expresses the entire image
in detail, taking into account the arrangement and relation of multiple objects with seg-
mentation and classification. In VLN, object and scene representation are used for agents
to identify their location by recognizing surrounding objects from an egocentric view.
Agents decompose the scene in more detail by utilizing semantic information and relation
between components inside a scene. Engelcke et al. proposed an object-centric generative
model, GENESIS, which decomposes and generates scenes based on the relation of scene
components (Engelcke et al. 2020). GENESIS decodes object-centric latent variables by
parameterizing a spatial Gaussian Mixture Model. Latent variables are sampled from an
auto-regressive prior and sequentially inferred.
In order to analyze a scene where objects are overlapped or unseen, information about
the scene is specified using semantic information, geometric information, and the structure
of the human body. Zhan et al. recovered hidden scene structures without ordering and
amodal annotations as supervisions via progressive ordering recovery, amodal completion,
and content completion (Zhan et al. 2020). Zheng et al. proposed an active semantic under-
standing of unseen indoor scenes with RGB-D reconstruction and semantic segmentation
(Zheng et al. 2019). Qi et al. proposed an agent that learns a spatial representation of a
scene that is trained to be effective when coupled with traditional geometric planners (Qi
et al. 2020a). Rosenberger et al. used human body segmentation and finger segmentation to enable real-time handover between human and robot in robotic manipulation (Rosenberger et al. 2020).
From a methodological point of view, unsupervised and self-supervised learning are commonly used in visual representation to exploit unlabeled data. Gidaris et al. used a self-supervised ConvNet to quantize feature maps and applied k-means to build a visual-word vocabulary (Gidaris et al. 2020). They predict the histogram of visual words in perturbed images using another ConvNet. The ConvNet performs well on downstream tasks because it learns perturbation-invariant and context-aware features.
Previous scene description work mainly focuses on the passive analysis of images. The semantic understanding of the scenario is separated from the interaction between the agent and the environment. Chen et al. proposed a method for improving sample complexity by using a general and easy-to-decode intermediate-level visual representation in order to avoid the difficulty of learning from scratch (Chen et al. 2020a). Tan et al. proposed an Embodied Scene Description method that finds the optimal viewpoint in the environment for scene description by utilizing the agent's ability to act in the environment (Tan et al. 2020). As such, visual representation learning has improved to handle photorealistic scenes and is used in various forms across domains.

2.2 Language representation

Recently, pre-trained models that embed natural language together with visual information have been studied for VLN tasks. Most language representations in VLN tasks are used to handle the intents of instructions. Since VLN is initiated by user instructions, research on natural language representation plays an important role. Some researchers utilize natural language instructions directly in reinforcement learning (RL) as goals and decompose them into low-level skills for better understanding. Beyond simple instruction following, communication between human and agent is developing into multi-turn dialogue for more advanced interaction, narrowing down the target by querying an oracle.
From a methodological point of view, the use of pre-trained models such as BERT is increasing and evolving in two directions: contraction and expansion. BERT models are gradually being compressed and made lightweight (e.g., MobileBERT, DistilBERT) (Sun et al. 2020; Sanh et al. 2019). Sun et al. proposed MobileBERT, which compresses and accelerates the BERT model. MobileBERT uses a student model that receives knowledge transferred from a large teacher BERT model (Sun et al. 2020). Sanh et al. proposed a general-purpose language representation model, DistilBERT, a distilled BERT model. DistilBERT utilizes knowledge distillation with a triple loss combining language modeling, distillation, and cosine-distance losses (Sanh et al. 2019). Conversely, BERT is also being scaled up massively to handle long sequences that require modeling long-term dependencies. Zaheer et al. proposed a sparse attention model, BigBird, which reduces the quadratic dependence on sequence length (Zaheer et al. 2020). It handles context eight times longer while remaining a universal approximator. GPT-3, a pre-trained auto-regressive model, is being applied in various forms (e.g., generating code, writing novels) without explicit supervised training. Brown et al. proposed a 175-billion-parameter autoregressive model, GPT-3, which applies few-shot learning without any gradient update or fine-tuning for downstream tasks (Brown et al. 2020).
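As a rough illustration of the distillation idea behind DistilBERT-style compression, the following PyTorch sketch combines a temperature-softened KL term, a hard-label term, and a cosine loss between hidden states; the loss weights and tensor shapes are illustrative assumptions, not the published configuration:

```python
# Sketch of a DistilBERT-style distillation objective (illustrative weights and shapes).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                      labels, temperature=2.0, alpha_ce=5.0, alpha_mlm=2.0, alpha_cos=1.0):
    # Soft-target loss: KL divergence between temperature-softened distributions.
    ce_soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    # Hard-target (masked language modeling) loss on the original token labels.
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    # Cosine loss aligning student and teacher hidden states.
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1),
                        device=student_hidden.device)
    cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)), target)
    return alpha_ce * ce_soft + alpha_mlm * mlm + alpha_cos * cos
```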
Language bias is sometimes exploited to improve performance in natural language interpretation. Using these linguistic characteristics, agents decompose complex instructions into relatively simple commands. In addition, language bias is also used to augment training data in VLN tasks. Chevalier-Boisvert et al. proposed BabyAI, which supports human-in-the-loop and curriculum learning for grounded language learning (Chevalier-Boisvert et al. 2019). It has an extensible suite of 19 levels of increasing difficulty, with a combinatorially rich synthetic language. Waytowich et al. proposed a model that receives rewards through a natural language narration paradigm, which eases the problem of reward scarcity (Waytowich et al. 2019). The narration-guided RL agent is characterized by projecting a sequence of natural language commands onto the same high-dimensional representation space as the target state.

2.3 Visual language representation

An image is translated into a description with captioning and, conversely, generated from a description with text-to-image translation. Natural language is encoded in a representation network and used to generate an image in the generation network. For the representation learning of VL tasks, some researchers utilize a joint representation of language and

13
Visual language navigation: a survey and open challenges 373

image. Landi et al. exploited dynamic convolutional filters to encode visual and language
information. They abstract into a high-level navigation space and decode a series of low-level, user-friendly actions with dynamic convolution (Landi et al. 2019).
There is also research on directly handling unstructured commands and raw image data at the pixel level (Yu et al. 2018; Goyal et al. 2020). Yu et al. proposed a language grounding model that handles raw pixels, unstructured commands, and sparse rewards as inputs for embodied agents (Yu et al. 2018). They use a language-guided transformation of visual
features and latent sentence embedding as the transformation matrices. Goyal et al. pro-
posed a model that directly maps pixels to rewards for free-form natural language explana-
tions, thereby improving the sample efficiency of policy learning in an environment where
language-based rewards are scarce (Goyal et al. 2020).
Memory attention and reasoning methods are also used in visual language representa-
tion (Sammani and Melas-Kyriazi 2020; Prabhudesai et al. 2020). Sammani and Melas-
Kyriazi proposed a caption-editing model which consists of EditNet and DCNet (Sammani
and Melas-Kyriazi 2020). EditNet uses adaptive and selective memory attention. DCNet
uses an LSTM-based denoising auto-encoder to copy and modify existing captions directly.
Prabhudesai et al. made two dependency trees for utterance and referential expression of
related images (Prabhudesai et al. 2020). They generated a 3D feature map with plausible
reasoning and localized object referents in the 3D map. Guiding the agent's search with natural language is possible when the expression is structured or when a partial structure of the natural language command is assumed.
Natural language elements such as dictionaries and vokenization enrich the visual lan-
guage representation (Guo et al. 2019; Tan and Bansal 2020). Guo et al. designed an evalu-
ation metric to quantitatively measure the range of language dictionary effects of the VQA
model and proposed a normalization method for the VQA model (Guo et al. 2019). Tan
and Bansal contextually mapped language tokens to related images through vokenization and performed multimodal alignment. They apply the generated model over a relatively small set of image-caption data (Tan and Bansal 2020).

2.4 Video representation

Video is used to compensate for insufficient training data (e.g., imitation learning) and to
utilize comprehensive multi-modal information of vision, voice, and sound. In the past,
research has been conducted to decompose video as continuous images and speech, but
recent studies perform multi-modal learning directly using video information. There are
many studies on joint learning that maps images and speech contents to interpret together.
Video representation places more emphasis on continuous temporal relationships than visual language representation does. Wu et al. proposed Boundary Adaptive Refinement (BAR), which gradually refines temporal boundaries for the temporal grounding of natural language in untrimmed video. They also extend RL to temporal localization with weak supervision (Wu et al. 2020a). Li et al. designed video-subtitle matching and frame order modeling, predicting global and local temporal alignment based on the correct order of shuffled video frames. This work focuses on narrated instructional videos (e.g., cooking) because it requires accurate alignment (Li et al. 2020d).
Attention methods (Liu et al. 2020c; Chu et al. 2020) and hierarchical RL methods
(Chang et al. 2020) are also used in-depth for video representation. Liu et al. proposed a
multi-concept video self-attention model that summarizes multi-modal information with
context diversity and jointly exploits temporal and concept video features (Liu et al. 2020c).


Chu et al. proposed a multi-step joint-modality attention network that handles visual and textual representations to integrate information and reason about videos (Chu et al. 2020). Chang et al. improved the efficiency of the ObjectGoal task through a hierarchical search policy and Q-learning on pseudo-labeled transition quadruples (image, action, next image, reward) to cope with unlabeled Internet video (Chang et al. 2020). They used YouTube videos to train Q-learning on these pseudo-labeled quadruples, learning semantic cues to objects of interest for unseen navigation environments.
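A minimal sketch of a Q-learning update over such (image, action, next image, reward) quadruples is shown below; q_network and target_network are generic placeholders rather than the architecture used by Chang et al.:

```python
# Sketch: one Q-learning update over pseudo-labeled (obs, action, next_obs, reward)
# quadruples mined from videos. `q_network` and `target_network` are generic
# placeholders, not the modules of the cited work.
import torch
import torch.nn.functional as F

def q_update(q_network, target_network, optimizer, batch, gamma=0.99):
    obs, action, next_obs, reward = batch          # images, action ids, images, scalars
    q_values = q_network(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_network(next_obs).max(dim=1).values
        target = reward + gamma * next_q           # one-step bootstrapped target
    loss = F.smooth_l1_loss(q_values, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```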

2.5 Multi‑modal representation

There has been a continuing interest and demand for multi-modal learning for a long time.
Such multi-modal analysis is generally superior to unimodal analysis and performs well on complex tasks (e.g., captioning, sentiment analysis) by supplementing context from other modalities. In particular, with the recent advent of attention models, the use of transformers (Landi et al. 2021; Tsai et al. 2019) in the multi-modal domain is increasing, alongside CNNs (Joze et al. 2020) and RNNs (Mou et al. 2020). Such models are robust to unseen environments and are adapted to specific tasks with fine-tuning. Fusion is an important process in multimodal learning and is divided into late fusion and early fusion depending on the point at which fusion occurs. Landi et al. proposed Perceive, Transform, and Act (PTA), a full attention trans-
of fusion. Landi et al. proposed Perceive, Transform, and Act (PTA), a full attention trans-
former model in VLN tasks (Landi et al. 2021). They use early fusion between linguistic
and visual information in the encoder and a late fusion between action history and percep-
tion in the decoding phase. Tsai et al. used E2E multi-modal transformer to solve long-
range dependencies and non-alignment due to various sampling rates of multi-modal data
(Tsai et al. 2019). They used cross-modal attention for interaction between multi-modal
sequences and adapted a stream of different latency and time steps. Joze et al. utilized
multi-modal transfer and slow fusion to add a different level of feature hierarchy in CNN
(Joze et al. 2020). Mou et al. used multi-modal fusion, attention, and BiGRU/BiLSTM
encoders for multi-modal open-domain QA (Mou et al. 2020). Teacher forcing and sched-
uled sampling are used to analyze a sequence of question–answer pairs about a video and
decide the final answer among them.
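The difference between the two fusion strategies can be sketched as follows, using generic transformer encoders (hypothetical modules, not the cited PTA or multimodal-transformer architectures):

```python
# Schematic early vs. late fusion (hypothetical modules, not the cited architectures).
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate language and vision tokens so one transformer attends across both."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, lang_tokens, vis_tokens):           # (B, L, D), (B, V, D)
        return self.encoder(torch.cat([lang_tokens, vis_tokens], dim=1))

class LateFusion(nn.Module):
    """Encode each modality separately, then combine pooled features at the end."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.lang_enc = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.vis_enc = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.head = nn.Linear(2 * dim, dim)

    def forward(self, lang_tokens, vis_tokens):
        lang = self.lang_enc(lang_tokens).mean(dim=1)      # pooled language feature
        vis = self.vis_enc(vis_tokens).mean(dim=1)         # pooled visual feature
        return self.head(torch.cat([lang, vis], dim=-1))
```

Early fusion lets cross-modal interactions happen at every layer, while late fusion keeps the modality encoders independent and merges only their summaries.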
While many previous studies focused on fusion methods, researchers now also pursue simpler and more versatile E2E approaches based on pre-trained models. Li et al. used a universal multi-modal transformer to learn joint representations and generate multi-modal dialogue with a pre-trained language model (Li et al. 2021). Multi-modal representation is used in various fields such as joint speech recognition and intent classification (Rao et al. 2020) and social learning (Shridhar et al. 2021). Rao et al. proposed jointly trained end-to-end spoken language understanding (SLU) and transcript generation models to extract an intent directly from speech without intermediate text output (Rao et al. 2020). Shridhar et al. proposed SociAPL (Auxiliary Prediction Loss), a new social learning method that teaches agents to solve difficult exploration tasks using expert cues with model-based auxiliary losses (Shridhar et al. 2021). It promotes social learning through reputation cues and through an environment that penalizes purely individual exploration.

2.6 Pre‑trained multi‑modal representation

In recent years, the use of pre-trained models has been essential in performing natu-
ral language tasks. A video-based pre-training model that processes images and natural


language together has been proposed. Pre-trained models are one of the effective meth-
ods for visual language tasks (Su et al. 2020; Le and Hoi 2020; Huang et al. 2019).
Su et al. proposed pre-trained Visual-Linguistic BERT for the generic representation
of visual-linguistic tasks (Su et al. 2020). The inputs are words of the sentence and a
region-of-interest (RoI) of the image. Le and Hoi extended a pre-trained language model, GPT-2, to video features and dialogue features for video-grounded dialogue (Le and Hoi 2020). Video-grounded dialogue is a difficult task because video has spatial and temporal dimensions and multi-turn dialogue is needed. Huang et al. adapted pre-trained
vision and language representations for cross-modal sequence alignment and sequence
coherence tasks (Huang et al. 2019). They improve the performance of the success rate
weighted by path length (SPL) and transfer the domain-adapted representations.
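For reference, SPL is commonly computed as the success indicator weighted by the ratio of the shortest-path length to the length of the path actually taken; a small sketch of this standard definition is given below (the input lists are assumed to come from an evaluator):

```python
# Success weighted by Path Length (SPL), as commonly defined for navigation/VLN:
# SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
# S_i: success indicator, l_i: shortest-path length, p_i: length of the agent's path.
def spl(successes, shortest_lengths, path_lengths):
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += float(s) * l / max(p, l)
    return total / len(successes)

print(spl([1, 0, 1], [10.0, 8.0, 5.0], [12.0, 20.0, 5.0]))  # ~0.611
```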
Recently, models that train on merged tasks, including actions, have been proposed for more complex VLN tasks. Hao et al. proposed a pre-trained model for the generic representation of VLN tasks (Hao et al. 2020). They pre-train on image-text-action triplets in a self-supervised manner and fine-tune for adaptation to VLN tasks.
Table 2 shows the summarizations and comparisons of VLN representation learning.

2.7 Open challenges

VLN basically aims to perform tasks through egocentric, vision-based movement. To specify a target object, utilizing additional acoustic information is more accurate than using vision alone. In VLN tasks, considering the limited perception and distance-measurement capability of vision without a depth sensor, performance can be improved by exploiting acoustic information. However, few studies have been conducted because of complexity limitations (e.g., echoes, simulation environments). Still, some studies gather location information or supplement visual information using acoustic cues.
Gan et al. described audio-visual embodied navigation, consisting of visual percep-
tion, sound perception, and dynamic path planner. It constructs sparse memory and
infers the relative location of the environment (Gan et al. 2020). On the other hand,
Thomason et al. insisted that unimodal models sometimes outperform multi-modal
models because they can better capture and reflect dataset biases (Thomason et al.
2019). Therefore, depending on the task, we should consider using unimodal methods for relatively less complex tasks rather than being limited to multi-modal approaches.

3 VLN reinforcement learning

In reinforcement learning, the agent aims to improve progressively with choices that
maximize rewards according to each environmental state. Unlike supervised learning,
RL does not use explicit feedback but uses rewards of action in an exploration. Rein-
forcement learning in VLN tasks focuses on path navigation and task planning for com-
plex situations (see Fig. 3) (Wijmans et al. 2020). Language-grounded RL is popularly used to reduce the complexity of modality conversion and to exploit the characteristics of extensible language.

Table 2  Descriptions and comparisons of VLN representation learning

Vendor | Paper | Method | Purpose | Highlights
Visual representation | Engelcke et al. (2020) | GENESIS | Decompose scene | Decompose and generate scenes based on the relation of scene components; parameterize a spatial Gaussian Mixture Model
Visual representation | Zheng et al. (2019) | Active semantic understanding | Unseen environment | Unseen indoor scenes with RGB-D reconstruction and semantic segmentation
Visual representation | Gidaris et al. (2020) | Self-supervised ConvNet | Predict the histogram of visual words | Quantize feature maps and use k-means methods for vocabulary; learn perturbation-invariant and context-aware features
Language representation | Sanh et al. (2019) | DistilBERT | Fine-tune BERT model | Utilize knowledge distillation, triple loss, distillation, and cosine-distance
Language representation | Zaheer et al. (2020) | Big Bird | Reduce the dependency on sequence length | Eight times longer context with a universal approximator
Language representation | Brown et al. (2020) | GPT-3 | Pre-trained language model | 175 billion parameterized autoregressive model; few-shot learning without any gradient update
Visual language representation | Yu et al. (2018) | Language grounding model | Input for embodied agents | Handle raw pixels, unstructured commands, and sparse rewards; a language-guided transformation of visual features and latent sentence embedding
Visual language representation | Sammani and Melas-Kyriazi (2020) | EditNet and DCNet | Caption editing | Adaptive and selective memory attention; LSTM-based denoising auto-encoder to directly copy and modify existing captions
Video representation | Li et al. (2020d) | Video-subtitle matching | Accurate alignment | Frame order modeling; predicts both global and local temporal alignment; focuses on narrated instructional videos
Video representation | Chang et al. (2020) | YouTube video | Tackle Q-learning | Learn semantic cues to objects of interest for unseen navigation environments
Video representation | Chu et al. (2020) | A multi-step joint-modality attention | Reason about videos | Handle visual and textual representations to integrate information
Multi-modal representation | Mou et al. (2020) | Multi-modal fusion | Multi-modal open-domain QA | Attention and BiGRU/BiLSTM encoders; teacher forcing and scheduled sampling
Multi-modal representation | Landi et al. (2021) | Perceive, Transform, and Act | VLN task | Full attention transformer model in VLN tasks; early fusion between linguistic and visual information; late fusion between action history and perception
Multi-modal representation | Tsai et al. (2019) | E2E multi-modal transformer | Solve long-range dependencies | Use cross-modal attention for interaction between multi-modal sequences; adapt streams of different latency and time steps
Pre-trained multi-modal representation | Su et al. (2020) | Visual-Linguistic BERT | Generic representation | The inputs are words of the sentence and a region-of-interest (RoI) of the image
Pre-trained multi-modal representation | Huang et al. (2019) | Pre-trained VL representations | Transfer the domain-adapted representations | Improve the performance of the success rate weighted by path length (SPL)
Pre-trained multi-modal representation | Hao et al. (2020) | VLN pre-trained model | Generic representation | Pre-train image-text-action triplets in self-supervised learning; fine-tune for adaptation to VLN tasks

Fig. 3  Decentralized distributed proximal policy optimization (DD-PPO) for distributed RL in resource-
intensive simulated environments (Wijmans et al. 2020; get the citation permission from the authors)

3.1 RL representation

It is difficult to define rewards in RL tasks. Since the VLN task starts from human instruction, language-guided RL, which has been widely used in recent years, has the advantage of reducing the burden of integrating the natural language module and the reinforcement learning module. Natural language is used to generate goals (Colas et al. 2020a) and con-
straint goals (Hutsebaut-Buysse et al. 2020; Co-Reyes et al. 2019) in RL. Colas et al. used
language to condition goal generators that generate language-agnostic goals for the agent
(Colas et al. 2020a). It decouples sensorimotor learning from language acquisition and
demonstrates a diversity of behaviors for the given instruction. For the sample efficiency of goal-conditioned RL agents, Hutsebaut-Buysse et al. examined pre-trained task-independ-
ent word embedding and facilitated transfer learning between different tasks on navigation
(Hutsebaut-Buysse et al. 2020). For better generalization on unseen environments, they
proposed to learn invariant environment-agnostic representations for navigation. Co-Reyes
et al. proposed language-guided policy learning that integrates instruction and a sequence
of corrections to acquire skills. They make iterative language corrections to guide an agent
in obtaining the desired skill (Co-Reyes et al. 2019). Feng et al. proposed a tracking-by-
detection formulation with natural language descriptions (Feng et al. 2020). It generates
regions related to the given description during the detection phase of the tracker and then
predicts the update of the target from those regions. To generalize to new goals, Siriwardhana et al.
proposed Hybrid Asynchronous Universal Successor Representations (HAUSR) with Uni-
versal Successor Representations and A2C Agents (Siriwardhana et al. 2018).
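A minimal sketch of the underlying idea of language-conditioned RL, in which an instruction embedding acts as the goal vector of the policy, is shown below; module names and sizes are illustrative and not taken from the cited papers:

```python
# Sketch of a language-conditioned policy: the instruction embedding is treated as a
# goal vector and concatenated with the state before selecting an action.
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, vocab_size, state_dim, n_actions, embed_dim=64, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.instr_enc = nn.GRU(embed_dim, hidden, batch_first=True)
        self.policy = nn.Sequential(
            nn.Linear(state_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, instruction_tokens, state):
        _, h = self.instr_enc(self.embedding(instruction_tokens))   # (1, B, hidden)
        goal = h.squeeze(0)                                          # language goal vector
        return self.policy(torch.cat([state, goal], dim=-1))         # action logits
```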

3.2 Planning

Planning is used in two forms in VLN tasks: path planning and task planning. Path planning is used for optimal movement to the target location in order to perform the instructed task. Task planning is used not simply to move to a place but to organize the sub-tasks necessary for carrying out an action (e.g., holding an object). Wang et al. proposed a planned-
ahead hybrid reinforcement learning model which integrates model-based and model-free
reinforcement learning for real-world VLN (Wang et al. 2018). They use a look-ahead pol-
icy to predict the next state and the reward.
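A look-ahead step of this kind can be sketched as follows, where dynamics_model and reward_model are generic learned-model placeholders rather than the modules of the cited work:

```python
# Sketch of a look-ahead step: score each candidate action by rolling it through a
# learned dynamics model and a learned reward model, then pick the best.
import torch

def look_ahead_action(state, candidate_actions, dynamics_model, reward_model):
    best_action, best_value = None, float("-inf")
    with torch.no_grad():
        for action in candidate_actions:
            predicted_next = dynamics_model(state, action)          # imagined next state
            predicted_reward = reward_model(state, action, predicted_next).item()
            if predicted_reward > best_value:
                best_action, best_value = action, predicted_reward
    return best_action
```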
Some studies use symbolic layers and waypoints for efficient planning. Li et al. proposed a Hamilton–Jacobi (HJ) reachability-based method, which predicts waypoints with generated supervision in an unseen environment (Li et al. 2020a). It consists of a perception, planning, and control module for the dynamic planning and tracking of a trajectory.
Language is also used in agent planning and decision-making (Patel et al. 2020; Deng et al. 2020). To explore the potential of language in sequential decision-making, Patel et al. analyzed the relationship between language elements and classes of decision processes (Patel et al. 2020). They also measure the usefulness of language information (e.g., verbs, nouns, adjectives) according to partial observability. Deng et al. introduced an evolving graphical planner (EGP), which links natural language guidance with progressively growing knowledge for effective long-term planning and decision-making (Deng et al. 2020). It enables more flexible decision-making by constructing a dynamic graph representation, resolving the ambiguity of the guidance, and generalizing over the workspace.

3.3 Meta RL

RL suffers from sample inefficiency due to the high dimensionality of the action space, so it is difficult to generalize to new tasks and unseen environments. Meta-learning, which reduces this complexity using meta-data, has been studied for several years to solve sampling inefficiency problems. Meta-learning has two major approaches: architecture search with hyperparameter optimization, and learning-to-learn. In this paper, we focus on the latter learning-to-learn approaches, which include few-shot and zero-shot learning. Wortsman et al. pro-
posed a meta-reinforcement learning approach, the self-adaptive visual navigation method
(SAVN), which learns a self-supervised interaction loss and adapts without explicit super-
vision for generalization to unseen scenes (Wortsman et al. 2019). For visual navigation
with low resources, Li et al. proposed an unsupervised RL approach to learn transferable
meta-skills such as bypassing obstacles and going straight on unannotated environments
without supervisory signals (Li et al. 2020c). Chen et al. proposed Meta Module Network
(MMN), preserving compositionality and interpretability with modularized design (Chen
et al. 2021). 2019a, b). Program Generator parses an input question into a functional pro-
gram, and the recipe encoder translates the functions into their corresponding specifica-
tions. They used the teacher-student model with a symbolic teacher to provide guidelines
for the instantiated modules.

3.4 Hierarchical RL

Some studies utilize the hierarchical characteristics of language for RL tasks. Cerda-Mar-
dini et al. utilized multi-head attention as a blending layer and translated natural language
to a high-level behavioral language for VLN (Cerda-Mardini et al. 2020). Hierarchical
learning divides tasks into high-level commands that people can understand and low-level skills that the agent can execute as sub-tasks (Blukis et al. 2018; Xie et al. 2020; Chen et al. 2021). Blukis et al. proposed a high-level instruction-following model that maps directly from images, navigation instructions, and estimated pose to continuous low-level velocity commands for real-time control (Blukis et al. 2018). Xie et al. proposed
SnapNav, which uses a few snapshots of the environment (Xie et al. 2020). It consists of a
high-level commander, which provides directional commands, and a low-level controller,
which provides real-time control and obstacle avoidance.
Chen et al. proposed a sequential human demonstration method in the form of natural
language instruction and behavioral trajectories. They use high-level language generators


and low-level policies to reuse and share low-level policies in multitask learning for effi-
cient sampling (Chen et al. 2020d).
Hierarchical structures (Arumugam et al. 2019; Kipf et al. 2019) and hierarchical modeling (Wang et al. 2020a) are also used in RL based on natural language. Arumugam et al. proposed multiple grounding models with semantic goal representations based on the hierarchical structure of tasks and compositional language (Arumugam et al. 2019). Kipf et al. proposed Compositional Imitation Learning and Execution (CompILE), which learns reusable and variable-length segments of hierarchically structured demonstrations with unsupervised and fully-differentiable sequence segmentation (Kipf et al. 2019). In a task-oriented dialogue system, Wang et al. proposed hierarchical modeling between the dialogue policy and the natural language generator (NLG) as an options framework, HDNO, using hierarchical reinforcement learning (HRL) and an LM-based discriminator (Wang et al. 2020a).
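The common structure behind these approaches, a high-level policy that emits sub-goals and a low-level controller that executes them, can be sketched as follows (both policies are generic placeholders, not the cited models):

```python
# Sketch of a two-level hierarchy: a high-level policy emits a sub-goal (e.g., a
# directional command), and a low-level controller turns it into primitive actions.
import torch.nn as nn

class HierarchicalAgent(nn.Module):
    def __init__(self, high_policy: nn.Module, low_controller: nn.Module, horizon: int = 5):
        super().__init__()
        self.high_policy = high_policy        # observation -> sub-goal / command
        self.low_controller = low_controller  # (observation, sub-goal) -> primitive action
        self.horizon = horizon                # steps before a new sub-goal is chosen
        self._steps, self._subgoal = 0, None

    def act(self, observation):
        # Re-plan at a coarser timescale than the low-level control loop.
        if self._subgoal is None or self._steps % self.horizon == 0:
            self._subgoal = self.high_policy(observation)
        self._steps += 1
        return self.low_controller(observation, self._subgoal)
```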

3.5 Self‑supervised RL

Self-supervised learning, a form of unsupervised learning, augments training data without explicit labels and trains without a supervised policy. It is a reasonable solution to the lack of training data and augments data specialized for the task. Recently, self-supervised learning has improved performance by utilizing contrastive learning and adversarial training. Ma et al. proposed a self-monitoring agent with a visual-textual co-grounding module and a progress monitor (Ma et al. 2019a). It co-grounds the instruction and the moving direction and checks whether the grounded instruction correctly reflects the navigation progress. Zhu et al. proposed Auxiliary Reasoning Navigation
(AuxRN) with four self-supervised auxiliary reasoning tasks by using additional training
signals of the semantic information (Zhu et al. 2020a). Zhou and Small proposed an adver-
sarial inverse reinforcement learning algorithm with a variational goal generator that rela-
bels trajectories and samples diverse goals for a language-conditioned reward (Zhou and
Small 2020).

3.6 Multi‑agent RL

Multi-agent RL deals with collaboration and confrontation between agents. While existing
interactions simply considered peer-to-peer methods between humans and agents, recent
research proposes establishing relationships between agents and utilizing collective intel-
ligence. Rosano et al. compared single-target and multi-target approaches for image-based
localization in RL-based visual navigation (Rosano et al. 2020). Jain et al. divided col-
laborative communication into explicit messages and implicit perceptions of another agent
(Jain et al. 2019). They focus on when and what to communicate and how to act with
these communications based on the perception of the visual world. Wang et al. proposed
a decentralized multi-agent learning method, Bayesian delegation, which rapidly infers hidden intentions by inverse planning (Wu et al. 2020c).
Table 3 shows the summarizations and comparisons of VLN reinforcement learning.

Table 3  Descriptions and comparisons of VLN reinforcement learning

Vendor | Paper | Method | Purpose | Highlights
RL representation | Co-Reyes et al. (2019) | Language-guided policy learning | Acquiring the desired skill | Integrate an instruction and a sequence of corrections to acquire skills; make iterative language corrections to guide an agent
RL representation | Colas et al. (2020a, b) | Condition goal generators | Demonstrate a diversity of behaviors | Use language to condition goal generators; decouple sensorimotor learning from language acquisition and demonstrate a diversity of behaviors for the given instruction
RL representation | Hutsebaut-Buysse et al. (2020) | Transfer learning | Sample efficiency of goal-conditioned RL | Better generalization on unseen environments; propose to learn invariant environment-agnostic representations for navigation
Planning | Wang et al. (2018) | Planned-ahead hybrid RL | Predict the next state and the reward | Integrate model-based and model-free RL for real-world VLN; use a look-ahead policy
Meta RL | Wortsman et al. (2019) | SAVN | Generalization to unseen scenes | Self-adaptive visual navigation method (SAVN); learn a self-supervised interaction loss and adapt without any explicit supervision
Meta RL | Li et al. (2020c) | Unsupervised RL approach | Visual navigation with low resources | Learn transferable meta-skills on unannotated environments without supervisory signals
Meta RL | Chen et al. (2021) | Meta module network | Providing guidelines | Preserve compositionality and interpretability; teacher-student model with a symbolic teacher
Hierarchical RL | Blukis et al. (2018) | High-level instruction following | Real-time control | High-level instruction-following model; instructions for navigation; continuous low-level velocity commands
Hierarchical RL | Xie et al. (2020) | SnapNav | Navigation | High-level commander (providing directional commands); low-level controller (providing real-time control and obstacle avoidance)
Hierarchical RL | Kipf et al. (2019) | CompILE | Unseen environment | Compositional Imitation Learning and Execution (CompILE); learns reusable and variable-length segments; hierarchically structured demonstration; fully-differentiable sequence segmentation
Self-supervised RL | Ma et al. (2019a) | Self-monitoring agent | Navigation | Visual-textual co-grounding module and progress monitor; co-ground both instruction and moving direction; check the grounded instruction
Self-supervised RL | Zhu et al. (2020a) | Auxiliary reasoning | Navigation | Four self-supervised auxiliary reasoning tasks using additional training signals from semantic information
Self-supervised RL | Zhou and Small (2020) | Adversarial inverse RL | Language-conditioned reward | Variational goal generator that relabels trajectories and samples diverse goals
Multi-agent RL | Rosano et al. (2020) | Image-based localization | Navigation | Compare single-target and multi-target approaches in RL-based visual navigation
Multi-agent RL | Jain et al. (2019) | Perception | Communications | Divide collaborative communication into explicit messages and implicit perception

3.7 Open challenges

Reinforcement learning plays an important role in path planning and task planning in
VLN. However, in order to solve various problems (e.g., sample inefficiency, unstable
training), various approaches such as offline RL and contrastive RL are emerging. Unlike existing off-policy RL and model-based RL, offline RL uses only pre-collected training data, not online interaction. Offline RL shows stable learning due to batch training and good performance in closed-loop environments. When the test environment differs from the training environment, careful implementation is necessary because of covariate shift. Some studies improved performance by adding auxiliary tasks (Kulhánek et al. 2019) and auxiliary losses (Srinivas et al. 2020) in imitation-learning and contrastive manners. Kulhánek et al. proposed a learning method for agent navigation with three auxiliary tasks (Kulhánek et al. 2019). It predicts the segmentation of the observation image, the target image, and the depth map. Srinivas et al. proposed Contrastive Unsupervised Representations for Reinforcement Learning (CURL), which uses contrastive learning and
off-policy control to utilize high-level features from raw pixels (Srinivas et al. 2020). Puig
et al. proposed a WAH (Watch-And-Help) model that understands the goal of the task and
solves the problem in cooperation with an agent using a single demonstration of an agent
performing the same task (Puig et al. 2020).
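The contrastive objective used in this line of work can be sketched with an InfoNCE-style loss over two augmented views of the same observations (illustrative and simplified; CURL additionally uses a learned bilinear similarity and a momentum encoder):

```python
# Sketch of an InfoNCE-style contrastive loss between two augmented views of the
# same batch of observations, in the spirit of CURL (not the exact published code).
import torch
import torch.nn.functional as F

def info_nce_loss(query, key, temperature=0.1):
    """query, key: (B, D) embeddings of two augmentations of the same batch."""
    query = F.normalize(query, dim=-1)
    key = F.normalize(key, dim=-1)
    logits = query @ key.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)            # positives lie on the diagonal
```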

4 VLN Components

In this section, we look at concepts that can be applied more broadly rather than being confined to VLN tasks. We cover diverse component models (e.g., generative models, graph models) that are used as VLN components. Furthermore, we introduce memory-based models and reasoning models to deal with long-term dependency and sequential issues. We extend the scope of VLN with interdisciplinary models and extended tasks (e.g., hide-and-seek, text-based games).

4.1 VLN model

VLN is a comprehensive task, and various frameworks and component models have been proposed because component-based research is effective. To perform tasks with insufficient information, some studies propose models that query for the missing information and proceed with the task using the answers of an oracle (Fried et al. 2018; Nguyen and Daumé 2019; Nguyen et al. 2019). Fried et al. proposed a speaker model for new instruction augmentation and pragmatic reasoning (Fried et al. 2018). They evaluate how well action sequences explain an instruction using a panoramic action space that reflects the granularity of human instructions. Nguyen and Daumé proposed an interactive photo-realistic simulator, Help Anna (HANNA), which extends Automatic Natural Navigation Assistants (ANNA) by leveraging simulated human assistants in object-finding tasks (Nguyen and Daumé 2019). It provides multi-modal instructions on request to direct the agent towards the goal. Nguyen et al. proposed Vision-based Navigation with Language-based Assistance (VNLA), which guides the agent with visual perception and instructions to find objects in photorealistic indoor environments (Nguyen et al. 2019). When an agent loses its way, it queries an advisor and obtains language sub-goals to make progress.


Some methods that predict and roll out subsequent results have also been proposed (Liang et al. 2020; Wu et al. 2019b; Ma et al. 2019b). Liang et al. proposed an E2E PnPNet which, at each time step, predicts object tracks and their future trajectories from sequential sensor inputs (Liang et al. 2020). Wu et al. proposed NeoNav, which is guided by conceiving
the next expected observations to improve the cross-target and cross-scene generalization
on VLN tasks (Wu et al. 2019). Ma et al. proposed an end-to-end progress monitor which
searches on a navigation graph with regret module and progress marker (Ma et al. 2019b).
The regret module performs backtracking to continue moving forward or roll back to a pre-
vious state. Progress marker helps the agent decide which direction to go next by showing
directions.
Object relation graphs, penalty functions, stop methods, and augmented-reality data are also used for accurate navigation in VLN tasks (Qi et al. 2020b; Xiang et al. 2020b; Blukis et al. 2020). Qi et al. proposed an Object-and-Action Aware Model (OAAM), which flexibly matches object-centered/action-centered instructions to the corresponding visual perception/action orientation (Qi et al. 2020b). They penalized trajectories deviating from the ground truth to predict viewpoints on the shortest path. Xiang et al. proposed Learning to Stop (L2STOP), a policy module that differentiates STOP from other actions in order to recognize and stop at the correct location in complicated outdoor environments (Xiang et al. 2020b). Blukis et al. proposed constructing an object-centered map from natural language mentions of objects for grounding language-specified objects, trained with augmented-reality data (Blukis et al. 2020). Adversarial methods and content separation enhance classification performance in visual representation for VLN tasks (Liu et al. 2020a; Li et al. 2020e). Liu et al. proposed a method for generating spatiotemporal perturbations using adversarial attacks with 3D adversarial examples and interaction records (Liu et al. 2020a). Li et al. proposed policy-based image translation (PBIT) as an unsupervised method of separating content and style in images (Li et al. 2020e). The representations learned by the search policy are consistent across images with the same content but different styles.

4.2 Generative model

The generative model is used to augment training data or to enrich expression in interaction. For example, it transforms a language description into motion based on the body and skeleton (Li et al. 2020b; Zhang et al. 2020; Lin et al. 2018). Li et al. proposed a method that learns domain-invariant descriptors and fertilizes word-level sign language recognition (WSLR) models by transferring knowledge from subtitled news (Li et al. 2020b). Zhang et al. proposed an automatic system that generates plausibly and naturally posed 3D human bodies in a given 3D scene (Zhang et al. 2020). Lin et al. map an NL description to an animation of a humanoid skeleton with an E2E sequence-to-sequence model pre-trained with an auto-encoder (Lin et al. 2018).
From the relationship between a pair of objects, it is possible to capture the correlation
between the meaning of the scene and the agent’s self-centered viewpoint (Moghaddam
et al. 2021; Hong et al. 2020; Du et al. 2020). Moghaddam et al. used a graph neural
network that naturally expresses prior knowledge to encode the object-object relation-
ship in the pre-trained knowledge graph (Moghaddam et al. 2021). Hong et al. proposed
Language-Conditioned Visual Graph for modeling the inter-modal relationships between
text and vision and the intra-modal relationships among visual entities (Hong et al. 2020).


For visual navigation, Du et al. proposed three complementary techniques: an object relation graph (ORG), trial-driven imitation learning (IL), and a memory-augmented tentative policy network (TPN) (Du et al. 2020). For tasks such as integrating object relationships and summarizing lecture videos, graph-based algorithms are used to enhance performance.

4.3 Graph model

External knowledge based on a knowledge graph is necessary to infer new facts and to
compensate for insufficient information with commonsense (Yu et al. 2019a, 2019b; Fang
et al. 2020). Yu et al. built commonsense layouts for path planning and enforced semantic
grounding of scenes in each step as auxiliary tasks (Yu et al. 2019a, 2019b). For better gen-
eralization, they update the semantically grounded navigator for the unseen environments.
Fang et al. generated commonsense captions directly from videos to describe latent aspects and presented a Video-to-Commonsense (V2C) dataset which contains various actions of human agents with three types of commonsense descriptions (Fang et al. 2020).
Scene graphs are necessary to structurally analyze information about the environment
and effectively perform complex tasks (Bear et al. 2020; Ji et al. 2020). Bear et al. intro-
duced Physical Scene Graphs (PSGs), which represent scenes as hierarchical graphs (Bear
et al. 2020). Nodes are considered as object parts at different scales, and edges are physical
connections between parts. For the identification of meaningful scene elements, spatially-
uniform feature maps convert to object-centric graph structures with graph pooling and
vectorization. Ji et al. proposed Action Genome, which decomposes actions into spatio-
temporal scene graphs. It captures changes between objects and relationships while action
occurs (Ji et al. 2020).
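To make the scene-graph idea above concrete, the following minimal Python sketch shows one possible container for object nodes and typed relation edges in the spirit of PSGs and Action Genome; the class and field names are purely illustrative and are not taken from either work.

# A minimal, hypothetical sketch of a scene-graph container: nodes are detected
# object parts, edges are typed spatial/physical relations.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneGraph:
    nodes: Dict[int, str] = field(default_factory=dict)              # node id -> object label
    edges: List[Tuple[int, str, int]] = field(default_factory=list)  # (src, relation, dst)

    def add_object(self, node_id: int, label: str) -> None:
        self.nodes[node_id] = label

    def add_relation(self, src: int, relation: str, dst: int) -> None:
        self.edges.append((src, relation, dst))

    def relations_of(self, label: str) -> List[Tuple[str, str, str]]:
        """Return all (subject, relation, object) triples touching a label."""
        return [
            (self.nodes[s], r, self.nodes[d])
            for s, r, d in self.edges
            if self.nodes[s] == label or self.nodes[d] == label
        ]


if __name__ == "__main__":
    g = SceneGraph()
    g.add_object(0, "agent")
    g.add_object(1, "table")
    g.add_object(2, "mug")
    g.add_relation(2, "on_top_of", 1)
    g.add_relation(0, "facing", 1)
    print(g.relations_of("table"))
    # [('mug', 'on_top_of', 'table'), ('agent', 'facing', 'table')]

Such a structure can then be pooled or vectorized into graph features, as done by the graph-based methods above.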

4.4 Memory model

Memory is used in a variety of forms in areas that require prior knowledge and is also
used for joint training between vision and language. Episodic memory, which is updated
with temporal vectors, is used to remember specific events and improve model performance
(Zhu et al. 2020c; Chen et al. 2020c; Jaunet et al. 2020). Zhu et al. proposed the Cross-
modal Memory Network (CMN) for historical actions of navigation (Zhu et al. 2020c).
Language memory learns latent relationships between textual interaction and a dialog his-
tory with a multi-head attention mechanism. Visual memory learns to associate the cur-
rent views and the cross-modal memory about the previous actions. Chen et al. proposed
an Iterative Matching with Recurrent Attention Memory (IMRAM) method, which pro-
gressively explores the fine-grained correspondence between images and texts (Chen et al.
2020c). They use memory distillation to refine alignment knowledge from early steps to
later ones. Jaunet et al. proposed a visual analytics interface, DRLViz, to interpret the inter-
nal memory of an agent (Jaunet et al. 2020). When the agent moves in an environment,
large temporal vectors are updated in the memory. It deals with the number of dimensions,
dependencies to past vectors, spatial/temporal correlations, and co-correlation between
dimensions.
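As an illustration of the memory mechanisms described above, the following sketch performs a single cross-modal memory read implemented as scaled dot-product attention over stored step embeddings; it is a simplified, hypothetical stand-in for CMN/IMRAM-style modules, and all names are invented.

# Hypothetical sketch of a cross-modal memory read: the current visual
# observation queries a memory of past dialog/action embeddings with
# scaled dot-product attention.
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def read_memory(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """query: (d,) current observation embedding; memory: (n, d) stored vectors."""
    scores = memory @ query / np.sqrt(query.shape[0])   # (n,) similarity scores
    weights = softmax(scores)                            # attention weights over history
    return weights @ memory                              # (d,) attended memory summary


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mem = rng.normal(size=(5, 16))     # five past steps, 16-d embeddings
    obs = rng.normal(size=16)          # current egocentric view embedding
    context = read_memory(obs, mem)
    print(context.shape)               # (16,)
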
Several studies (Gordon et al. 2018; Qiu et al. 2020; Loynd et al. 2020) are conducted
to improve the performance of VLN tasks by effectively applying the hierarchy and graph
structure to memory. Gordon et al. proposed the Hierarchical Interactive Memory Network
(HIMN), which consists of a factorized set of controllers operating at multiple levels of
temporal abstraction (Gordon et al. 2018). They also introduced IQUAD V1, which simu-
lates a photo-realistic environment of configurable indoor scenes with interactive objects.
Qiu et al. proposed a target-driven visual navigation algorithm, Memory-utilized Joint
hierarchical Object Learning for Navigation in Indoor Rooms (MJOLNIR), which learns
to associate an object with prior knowledge (Qiu et al. 2020). They compare PointGoal
navigation and target-driven navigation. They deal with the role of learning semantic con-
text and reward shaping for policy networks. Loynd et al. proposed the Working Memory
Graph (WMG), an agent that employs multi-head self-attention to reason over the observed
and recurrent state for sequential decision-making agents (Loynd et al. 2020).

4.5 Reasoning model

When the instruction is clear and simple, tasks are easily performed with internal knowl-
edge and policies. However, when it is complex and ambiguous, a comprehensive judg-
ment is required by using history, common sense, current context, and prior knowledge
(Zhu et al. 2020b; Shamsian et al. 2020; Pan et al. 2021). Zhu et al. identified func-
tionality, physics, intent, causality, and utility (FPICU) as the core domains of cognitive
AI (Zhu et al. 2020b). They proposed to use common sense to solve a wide range of tasks
with small data. Shamsian et al. conceptualized that localizing non-visible objects requires
two types of reasoning, occluded objects, and carried ones (Shamsian et al. 2020). They
proposed unified architecture, OPNet, for four subtasks that define subtypes of localization
tasks. Pan et al. proposed Actor-Context-Actor Relation Network (ACAR-Net) to model
indirect relations for spatio-temporal action localization with a high-order relation reason-
ing operator (Pan et al. 2021).
To solve complex tasks, some studies use counterfactual training (Fu et al. 2020b),
causal models (Shah et al. 2019; Abbasnejad et al. 2020), and visual reasoning (Wang et al.
2019a; Marasović et al. 2020; Hong et al. 2020). Fu et al. proposed a model-agnostic
adversarial path sampler (APS) and an adversarial-driven counterfactual reasoning model
which considers effective conditions rather than low-quality augmentation (Fu et al.
2020b). Abbasnejad et al. utilized counterfactual training to leverage structural causal mod-
els (Abbasnejad et al. 2020). Shah et al. utilized an algorithm based on Maximum Causal
Entropy IRL for the implicit state preference in a suite of proof-of-concept environments
(Shah et al. 2019). Wang et al. proposed Reinforced Cross-Modal Matching (RCM), which
enhances local and global grounding (Wang et al. 2019a). A reasoning navigator enhances
the local visual scene, and a matching critic encourages global matching with an intrinsic
reward. Marasović et al. proposed a RATIONALEVT TRANSFORMER model for natu-
ral language-based visual reasoning in a complex environment (Marasović et al. 2020). It
combines pre-trained language models with object recognition, grounded visual semantic
frames, and visual commonsense graphs. Hong et al. proposed a Recursive Grounding Tree
(RVG-TREE) method that automatically constructs a binary tree from a natural language
expression and performs visual reasoning (Hong et al. 2019, 2020). This method recursively decom-
poses the language expression and repeatedly accumulates the grounding scores of the sub-
tree to obtain the grounding confidence score.
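A simple way to picture the intrinsic-reward idea behind RCM-style matching critics is the hypothetical shaping scheme below, where a dummy critic scores how well a trajectory matches the instruction and the score is mixed into the extrinsic reward; it is only a sketch, not the authors' implementation.

# Hypothetical sketch of reward shaping with a matching critic: the total
# reward mixes the extrinsic task reward with an intrinsic score measuring
# how well the executed trajectory matches the instruction. `matching_critic`
# stands in for a learned model and is a dummy here.
from typing import Callable, List


def shaped_reward(
    extrinsic: float,
    instruction: str,
    trajectory: List[str],
    matching_critic: Callable[[str, List[str]], float],
    beta: float = 0.5,
) -> float:
    intrinsic = matching_critic(instruction, trajectory)  # assumed to lie in [0, 1]
    return extrinsic + beta * intrinsic


if __name__ == "__main__":
    # Dummy critic: fraction of trajectory steps that mention any instruction token.
    def dummy_critic(instruction: str, trajectory: List[str]) -> float:
        tokens = set(instruction.lower().split())
        hits = sum(any(t in step.lower() for t in tokens) for step in trajectory)
        return hits / max(len(trajectory), 1)

    r = shaped_reward(1.0, "walk past the sofa to the kitchen",
                      ["turn left", "pass sofa", "enter kitchen"], dummy_critic)
    print(round(r, 2))
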

4.6 Interdisciplinary model

VLN tasks are supplemented and expanded in various forms through linkage with neuro-
science, such as vector cells, bio-inspired mental models, and exemplars (Heinrich et al. 2020;
Li et al. 2019b; Tamari et al. 2020). Heinrich et al. proposed a neuro-cognitive model,
Embodied Multimodal Interaction in Language learning (EMIL), which reflects bio-
inspired mechanisms (e.g., an implicit adaptation of time scales) (Heinrich et al. 2020).
They use cross-modal integration and E2E multi-modal abstraction for language ground-
ing. Li et al. proposed a Mental Imagery eNhanceD (MIND) module, which models the
dynamics of the environment and creates an object for better understanding for the embod-
ied agent (Li et al. 2019b). Tamari et al. described that in Embodied Cognitive Linguistics (ECL), nat-
ural language is essentially executable and is driven by metaphorical mapping and mental
simulation to the schema learned through hierarchical construction and interaction (Tamari
et al. 2020). Furthermore, it is argued that the use of grounding by metaphorical reasoning
and simulation will be of great help in NLU applications.
Curiosity and intrinsic motivation are necessary and effective methods of exploration
(Patro and Namboodiri 2018; Chaplot et al. 2020c; Karch et al. 2020). Patro et al. pro-
posed a predictive and curious neural model which learns with external agents inspired
by animate behaviors in social environments (Patro and Namboodiri 2018). Chaplot et al.
proposed self-supervised embodied interactive learning with a semantic curiosity policy
for training the exploration policy (Chaplot et al. 2020c). Karch et al. proposed an intrinsically
motivated RL architecture, IMAGINE, which leverages descriptions to interpret imagined
goals (Karch et al. 2020). It uses a goal generator of NL goals with a decomposition, gated
attention, and object-centered representations.

4.7 Extended model

Using an environment that users can easily access (e.g., games) is also an excellent oppor-
tunity to facilitate research on VLN tasks. Ammanabrolu and Hausknecht proposed KG-
A2C, which builds a dynamic knowledge graph (KG) and generates template-based
actions (Ammanabrolu and Hausknecht 2020). It uses the KG to reason about game states and con-
strains NL generation for scalable exploration of large NL actions. Silva et al. trained mod-
els with multi-modal sensory input in the game and then executed them with unimodal input
(Silva et al. 2020). It can generalize policy and reuse different modalities with multi-modal
latent representation.
Hide-and-seek is a simple yet effective simulation environment for multi-agent tasks
that use object and scene visual representations with egocentric views (Baker et al. 2019;
Chen et al. 2019a; Ilinykh et al. 2019). Baker et al. used a multi-agent competition, hide-
and-seek, and found that agents create a self-supervised auto-curriculum that induces multiple
distinct rounds of emergent strategy (see Fig. 4) (Baker et al. 2019). Chen et al. proposed
Visual Hide and Seek, which trains embodied agents to avoid capture from a predator in
a simulated environment (Chen et al. 2019a). It contains a variety of obstacles to hide
behind, and agents are partially observable with an egocentric perspective. Ilinykh et al.
proposed a two-player coordination game, MeetUp, which finds each other in a visual envi-
ronment by talking about what they see and achieving mutual understanding (Ilinykh et al.
2019).
In the navigation task, geometric mapping (Martins et al. 2020; Morad et al. 2021) and
visual servoing (Li and Košecka 2020; Harish et al. 2020) are the necessary technologies.
Martins et al. enhanced map representation with object-level information for human–robot
interaction and visual navigation (Martins et al. 2020). It leverages a CNN-based object
detector with a 3D model-based segmentation and uses Kalman filters to combine sensor
measurements over time.

Fig. 4  Emergent tool use from multi-agent autocurricula (Adapted from Baker et al. 2019)

Morad et al. proposed an automatic curriculum learning method,
NavACL, which selects relevant tasks using
geometric features for the navigation task (Morad et al. 2021). Li and Košecka proposed
methods to learn viewpoint invariant and target invariant visual servoing for local mobile
robot navigation (Li and Košecka 2020). Harish et al. proposed Image-based visual servo-
ing methods which consider optical flow as the visual features and systematically integrate
estimated depth with the interaction matrix (Harish et al. 2020).
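For visual servoing, the classical image-based control law v = -λ L⁺ e underlies many of the methods above; the following generic sketch computes a 6-DoF camera velocity from a feature error and an (assumed known) interaction matrix, and is a textbook illustration rather than the specific controller of Li and Košecka or Harish et al.

# Classical image-based visual servoing (IBVS) control law, v = -lambda * L^+ * e,
# where e = s - s* is the visual feature error and L is the interaction matrix.
import numpy as np


def ibvs_velocity(features: np.ndarray, desired: np.ndarray,
                  interaction: np.ndarray, gain: float = 0.5) -> np.ndarray:
    """features, desired: (k,) image features; interaction: (k, 6) matrix L."""
    error = features - desired
    return -gain * np.linalg.pinv(interaction) @ error   # 6-DoF camera velocity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L = rng.normal(size=(8, 6))          # stand-in interaction matrix for 4 point features
    s_star = rng.normal(size=8)          # desired feature vector
    s = s_star + 0.1 * rng.normal(size=8)
    print(ibvs_velocity(s, s_star, L).shape)   # (6,)
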
Table 4 summarizes and compares the VLN applications.

Table 4  Descriptions and comparisons of VLN applications

Vendor | Paper | Method | Purpose | Highlights
VLN model | Nguyen and Daumé (2019) | HANNA | Object-finding tasks | Help Anna (HANNA), an extension of Automatic Natural Navigation Assistants (ANNA); provides multi-modal instructions to direct the agent towards the goals; leverages simulated human assistants in object-finding tasks
VLN model | Nguyen et al. (2019) | VNLA | Find objects | Guides the agent with visual perception and instruction; queries an advisor and obtains language sub-goals to make progress
Generative model | Zhang et al. (2020) | Automatic system | Human body modeling | Generates plausible and naturally-posed 3D human bodies in a given 3D scene
Generative model | Lin et al. (2018) | Seq-to-Seq | Skeleton animation | Maps an NL description to an animation of a humanoid skeleton; E2E pre-trained with an auto-encoder
Graph model | Yu et al. (2019a, 2019b) | Commonsense layouts | Path planning | Enforces semantic grounding of scenes in each step as auxiliary tasks; updates the semantically grounded navigator for unseen environments
Graph model | Bear et al. (2020) | Physical scene graphs | Represent scenes | Spatially-uniform feature maps convert to object-centric graph structures; graph pooling and vectorization
Memory model | Zhu et al. (2020c) | Cross-modal memory network | Historical actions of navigation | Learns latent relationships between textual interaction (language memory) and a dialog history (multi-head attention); learns to associate the current views (visual memory) with previous actions (cross-modal memory)
Memory model | Gordon et al. (2018) | HIMN | Visual navigation | Hierarchical Interactive Memory Network (HIMN); factorized set of controllers operating at multiple levels of temporal abstraction; simulates a photo-realistic environment (IQUAD V1)
Memory model | Qiu et al. (2020) | MJOLNIR | Target-driven visual navigation | Memory-utilized Joint hierarchical Object Learning for Navigation in Indoor Rooms (MJOLNIR); learns to associate objects with prior knowledge
Reasoning model | Shamsian et al. (2020) | OPNet | Object permanence | Conceptualizes that localizing non-visible objects requires two types of reasoning, for occluded objects and for carried ones
Reasoning model | Pan et al. (2021) | ACAR-Net | Spatio-temporal action localization | Actor-Context-Actor Relation Network (ACAR-Net); models indirect relations with a high-order relation reasoning operator
Reasoning model | Wang et al. (2019a) | Reinforced cross-modal matching | Reasoning navigator | Enhances local and global grounding; the reasoning navigator enhances the local visual scene; a matching critic encourages global matching with an intrinsic reward
Interdisciplinary model | Heinrich et al. (2020) | EMIL | Language grounding | Embodied Multimodal Interaction in Language learning (EMIL); cross-modal integration and E2E multimodal abstraction
Interdisciplinary model | Karch et al. (2020) | IMAGINE | Leverage descriptions | Interprets imagined goals; NL goal generator with decomposition, gated attention, and object-centered representations
Extended model | Baker et al. (2019) | Emergent strategy | Hide-and-seek | Finds that agents create a self-supervised auto-curriculum
Extended model | Ammanabrolu and Hausknecht (2020) | KG-A2C | Reason game state | Builds a dynamic knowledge graph (KG) and generates template-based actions
Extended model | Martins et al. (2020) | Enhanced map representation | Human–robot interaction | CNN-based object detector with a 3D model-based segmentation; Kalman filters to combine sensor measurements

4.8 Open challenges

Since VLN is composed of complex and diverse models, robust VLN components are
required for successful task execution. In particular, the field is gradually expanding from
simple instruction-based methods to dialogue-history-based methods. At the task level,
there is a growing number of extensions to valuable tasks (e.g., hide-and-seek, games)
beyond point navigation towards a specific location and object navigation towards a
particular object. With the development of generative models and graph models, studies
using both models are also increasing in VLN (Park and Kim 2021). In particular, response
generation based on generative models and image analysis based on scene graphs enrich
VLN tasks. Memory and reasoning are necessary technologies for VLN agents to perform
more complex tasks and to handle new tasks effectively based on prior knowledge.
Regarding interdisciplinary research, interest in cognitive psychology and neuroscience is
increasing because they provide knowledge from different domains for interaction. However,
since such knowledge is currently used only at the concept level or partially, more in-depth
research is needed.

5 VLN Evaluations

In order for a large number of researchers to reliably share experimental results and com-
pare solutions, standard data and general-purpose simulation environments (e.g. Replica;
see Fig. 5) are required. In addition, benchmarks are needed to reduce redundant studies
and to ensure a fair evaluation. As VLN tasks cover various domains, extensive and in-
depth knowledge of each domain is required, so the VLN taxonomy and survey research
are also increasing. Lastly, interdisciplinary research becomes the basis for a more funda-
mental and explainable view of VLN issues.

5.1 Datasets

VLN tasks need various data, such as natural instructions (including conversations for
interaction), multi-modal recognition data, and navigation paths, as shown in Table 5.
There are datasets for indoor navigation (Thomason et al. 2020; Qi et al. 2020c), out-
door navigation (Chen et al. 2019b), and action of embodied agents (Jia et al. 2020).

Fig. 5  Replica, photo-realistic 3D indoor scene reconstructions consisting of dense meshes and high-resolution high-dynamic-range (HDR) textures for embodied agents (https://github.com/facebookresearch/Replica-Dataset)

Table 5  Descriptions and comparisons of VLN datasets

Method | Purpose | Dataset size | Highlights | URL
CVDN (Thomason et al. 2020) | Embodied dialogs dataset | 2050 human–human navigation dialogs, 7 k navigation trajectories, 83 house scans | Shortest path planner and dialog history to infer navigation actions towards the goal in unseen environments; photo-realistic home environment; oracle navigator | https://github.com/mmurray/cvdn/
REVERIE (Qi et al. 2020c) | Complex robot tasks | 21,702 instructions, 1600-word vocab, 18 words average length | Remote Embodied Visual referring Expression in Real Indoor Environments (REVERIE); natural language and visible objects in a large set of real images; remote refExp content; localise remote objects | https://github.com/YuankaiQi/REVERIE
TOUCHDOWN (Chen et al. 2019b) | Instruction following | 9326 instances, 5626 vocab, 107 mean text length | Spatial reasoning; contains instructions paired with actions and spatial descriptions for navigation; outdoor navigation and SDR | https://github.com/lil-lab/touchdown
LEMMA (Jia et al. 2020) | Daily activities | 324 samples, 4.6 M frames, 641 action classes, 11,781 action segments, RGB-D | Single home with meticulously designed settings to highlight different learning objectives and annotate the atomic actions with human-object interactions; goal-directed daily tasks; multi-view video dataset; multi-agent, multi-task activities | https://github.com/Buzz-Beater/LEMMA
VIOLIN (Liu et al. 2020b) | VL inference | 95,322 video-hypothesis pairs, 15,887 video clips, 582 h of video | Joint multimodal understanding of video and text; from surface-level grounding to commonsense reasoning; paired with a natural language hypothesis based on the video | https://github.com/jimmy646/violin
Quda (Fu et al. 2020a) | Understand free-form NL | 14,035 instances, 10 categories, 1.331 cardinality, 0.133 density | Understands free-form natural language in visualization-oriented natural language interfaces; multi-label text classification; five criteria | https://github.com/freenli/quda_corpus
SNLI-VE (Do et al. 2020) | VL task prediction | 401,717 image-hypothesis pairs for training and 14,740 pairs for testing, 7.4 mean hypothesis length | Generating natural language explanations for VL task prediction; recognising visual-textual entailment; fine-grained multi-modal reasoning | https://github.com/maximek3/e-ViL
VideoNavQA (Cangea et al. 2019) | EQA | 622 houses and 84,990 samples for training, 56 houses and 7587 samples for testing | Contains pairs of questions and videos to handle a more complete variety of questions | https://github.com/catalina17/VideoNavQA

Thomason et al. proposed an embodied dialogs dataset, Cooperative Vision-and-Dialog
Navigation (CVDN), collected in simulated photorealistic home environments (Thomason
et al. 2020). The navigator asks questions of an oracle who knows the best next steps. It uses
the shortest path planner and dialog history to infer navigation actions towards the goal
in unseen environments. Qi et al. proposed Remote Embodied Visual referring Expres-
sion in Real Indoor Environments (REVERIE), which provides a dataset of varied and
complex robot tasks consisting of natural language and visible objects in a large set of
real images (Qi et al. 2020c). Chen et al. introduced the TOUCHDOWN dataset for
instruction following and spatial reasoning, which contains instructions paired with
actions and spatial descriptions for navigation (Chen et al. 2019b). Jia et al. proposed
the LEMMA dataset, which provides a single home with meticulously designed settings
to highlight different learning objectives and annotate the atomic actions with human-
object interactions in daily activities (Jia et al. 2020). The difference between indoor
and outdoor VLN tasks lies in whether the agent considers the target macroscopically or
in detail. Indoors, the main task is usually to find an object, so the object itself is most
important. Outdoors, however, direction matters more, since the agent typically has to find
a route to the target.
There are datasets to effectively combine and utilize vision and natural language
information in a visual-language task (Liu et al. 2020b; Do et al. 2020; Cangea et al.
2019). Liu et al. proposed VIOLIN (VIdeO-and-Language Inference) for the joint multi-
modal understanding of video and text (Liu et al. 2020b). This dataset provides sophisti-
cated reasoning skill tests from surface-level grounding to commonsense reasoning. Fu
et al. presented Quda dataset to understand free-form natural language in Visualization-
oriented natural language interfaces (Fu et al. 2020a). Do et al. proposed a real-world
dataset SNLI-VE corpus to recognize visual-textual entailment and reason with fine-
grained multi-modal features (Do et al. 2020). Cangea et al. introduced the VideoN-
avQA dataset, which contains pairs of questions and videos to handle a more complete
variety of questions on EQA tasks (Cangea et al. 2019).
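To illustrate what such navigation datasets typically contain, the sketch below loads a split of hypothetical episode records (scan, instruction, reference path, initial heading); the JSON schema is only loosely modeled on R2R/CVDN-style entries and is not the exact format of any dataset in Table 5.

# Hypothetical loader sketch for a VLN-style dataset split; field names are
# illustrative only.
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    scan: str                # environment / house identifier
    instruction: str         # natural language navigation instruction
    path: List[str]          # sequence of viewpoint ids along the reference route
    heading: float           # initial agent heading in radians


def load_split(path: str) -> List[Episode]:
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return [Episode(r["scan"], r["instruction"], r["path"], r.get("heading", 0.0))
            for r in records]


if __name__ == "__main__":
    sample = [{"scan": "house_01", "instruction": "Go down the hall and stop at the sofa.",
               "path": ["vp_0", "vp_3", "vp_7"], "heading": 1.57}]
    with open("/tmp/vln_train.json", "w", encoding="utf-8") as f:
        json.dump(sample, f)
    episodes = load_split("/tmp/vln_train.json")
    print(episodes[0].instruction, len(episodes[0].path))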

5.2 Simulations

Simulation is the biggest bottleneck in performing VLN tasks. The key to VLN simula-
tion is to close the gap between the real world and a simulator (Anderson et al. 2018;
Zhu et al. 2017; Deitke et al. 2020; Juliani et al. 2019; Jiang et al. 2020b). Anderson
et al. introduced the Matterport3D Simulator, which provides a large-scale RL environ-
ment with real images (Anderson et al. 2018). Zhu et al. used an actor-critic model to
generalize better to unseen environments and proposed AI2-THOR, which provides an
environment with 3D scenes and a physics engine (Zhu et al. 2017). These E2E train-
able models show faster convergence and generalization across targets and scenes. Deitke
et al. introduced ROBOTHOR to democratize simulated environments paired with physi-
cal counterparts and simulation-to-real transfer for interactive and embodied visual AI
(Deitke et al. 2020).
Juliani et al. presented a real-world joint simulation that maps navigation instructions
and raw egocentric observations to continuous control (Juliani et al. 2019). They
estimate the need for exploration and predict the likelihood of visiting positions and con-
trols. They also introduce Supervised Reinforcement Asynchronous Learning (SuReAL),
which combines supervised prediction of positions to visit with RL for continuous control.
Jiang et al. presented a lightweight RL simulation environment, WordCraft, which is fast
to run and builds relations from real-world semantics (Jiang et al. 2020b). Matterport3D is
a simulation environment commonly used as a 3D-scan simulator for indoor environments.
Methods that consider the physical environment (e.g., AI2-THOR and ROBOTHOR) have
advantages when applying VLN tasks to the real environment. Furthermore, lightweight
simulation environments (e.g., WordCraft) remain competitive because multi-modal and
RL training is computationally heavy.
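The simulators above expose broadly similar observation/action loops; the following simulator-agnostic sketch illustrates that cycle with a hypothetical VLNEnv class and a random policy, and deliberately does not reproduce the actual APIs of Habitat, AI2-THOR, or Matterport3D.

# A hypothetical, simulator-agnostic interaction loop. `VLNEnv` is not any
# real simulator's API; it only illustrates the observation/action cycle.
from dataclasses import dataclass
from typing import Dict, Tuple
import random


@dataclass
class Observation:
    rgb: bytes          # egocentric frame (placeholder)
    instruction: str    # language goal for the episode


class VLNEnv:
    ACTIONS = ("MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")

    def reset(self) -> Observation:
        self._steps = 0
        return Observation(rgb=b"", instruction="Find the red chair in the kitchen.")

    def step(self, action: str) -> Tuple[Observation, float, bool, Dict]:
        self._steps += 1
        done = action == "STOP" or self._steps >= 20
        reward = 1.0 if done and action == "STOP" else 0.0   # toy success signal
        return Observation(rgb=b"", instruction=""), reward, done, {}


if __name__ == "__main__":
    env, done = VLNEnv(), False
    obs = env.reset()
    while not done:
        action = random.choice(VLNEnv.ACTIONS)               # random policy placeholder
        obs, reward, done, info = env.step(action)
    print("episode finished with reward", reward)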

5.3 Benchmarks

Competition based on benchmarks (e.g., Habitat 2.0, Obstacle Tower, AVSD) is also nec-
essary to generalize the problem and share research results (Xia et al. 2020; Shridhar et al.
2020; Blukis et al. 2019; Le and Chen 2020). Xia et al. provided the
Interactive Gibson Environment, which simulates high fidelity physical dynamics in visual
scenes, and a set of Interactive Navigation metrics to the interplay between navigation and
physical interaction (Xia et al. 2020). Shridhar et al. presented household tasks benchmark,
Action Learning From Realistic Environments and Directives (ALFRED) which map from
NL instructions and egocentric vision to sequences of actions (Shridhar et al. 2020). It con-
tains long and compositional tasks with nonreversible state changes for a real-world exam-
ple. Blukis et al. proposed a Unity benchmark, Obstacle Tower, which has high fidelity and
is procedurally generated in a 3D environment (Blukis et al. 2019). Alamri et al. summa-
rized an AVSD challenge in DSTC8 (Alamri et al. 2018). Le and Chen adapted dot-product
attention to combining text and non-text features (Le and Chen 2020). Pointer networks are
used for pointing tokens from multiple source sequences in the generation step.
Table 6 shows the summarizations and comparisons of the VLN evaluation. Habitat 2.0
is a benchmark that is being actively developed through competition every year. ALFRED
has practical advantages as it focuses on housekeeping, which is the primary target of VLN
operations. Obstacle Tower is a slightly different benchmark, but it has the potential to
scale with Unity-based games.
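Most of these benchmarks report success rate together with path-efficiency metrics such as Success weighted by Path Length (SPL); a minimal reference computation of SPL is sketched below (the three-episode example is invented for illustration).

# Success weighted by Path Length (SPL), a standard embodied-navigation metric:
# SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i),
# where S_i is episode success, l_i the shortest-path length, and p_i the
# path length actually taken by the agent.
from typing import List


def spl(successes: List[bool], shortest: List[float], taken: List[float]) -> float:
    assert len(successes) == len(shortest) == len(taken)
    total = 0.0
    for s, l, p in zip(successes, shortest, taken):
        total += (l / max(p, l)) if s else 0.0
    return total / len(successes)


if __name__ == "__main__":
    # Two successful episodes (one with detours) and one failure.
    print(round(spl([True, True, False], [10.0, 8.0, 12.0], [10.0, 16.0, 30.0]), 3))
    # 0.5  -> (1.0 + 0.5 + 0.0) / 3
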

5.4 Open challenges

VLN can be divided into photo-realistic simulation and relatively simple tabular maze sim-
ulation in terms of the simulation environment. In addition, the data of the photo-realistic
environment is divided into an in-door dataset and an out-door dataset. From the task point
of view, VLN is divided into a method of performing navigation and a method of perform-
ing more complex tasks (e.g., picking up objects). While simple simulations have the
advantage of clearly isolating the influence of individual components, they make it
difficult to confirm how a method would behave in a realistic, complex application. On the
other hand, photorealistic simulation makes it possible to evaluate complex tasks in a form
similar to the real world, but many sub-modules still need to be prepared according to the task. In
particular, simulators that can reflect results according to echo, noise, and distance are rare
in tasks related to sound events.

Table 6  Descriptions and comparisons of VLN evaluation

Vendor | Paper | Method | Purpose | Highlights
Simulation | Zhu et al. (2017) | AI2-THOR | Better generalization | Actor-critic model; provides an environment with 3D scenes and a physics engine; faster convergence and generalization across targets and scenes
Simulation | Anderson et al. (2018) | Matterport3D | Large-scale | Provides a large-scale RL environment with real images
Simulation | Jiang et al. (2020b) | WordCraft | Lightweight | Lightweight RL simulation environment; fast to run and builds relations from real-world semantics
Simulation | Juliani et al. (2019) | Real-world joint simulation | Map instructions | Raw egocentric observations to continuous control; estimates the need for exploration and predicts the likelihood of visiting positions
Benchmarks | Alamri et al. (2018), Le and Chen (2020) | AVSD | DSTC8 | Audio-Visual Scene-Aware Dialog (AVSD) is an extension of Video Question Answering (QA); The Eighth Dialog System Technology Challenge (DSTC8)
Benchmarks | Xia et al. (2020) | Interactive Gibson | Physical dynamics | Simulates high-fidelity physical dynamics in visual scenes; a set of Interactive Navigation metrics; the interplay between navigation and physical interaction
Benchmarks | Shridhar et al. (2020) | ALFRED | Household tasks benchmark | Action Learning From Realistic Environments and Directives (ALFRED); maps NL instructions and egocentric vision to sequences of actions
Benchmarks | Blukis et al. (2019) | Obstacle Tower | Unity benchmark | High fidelity and procedurally generated 3D environment

Fig. 6  Tree research diagram of major research institutes for visual language navigation

6 VLN applications

Many research institutes work on VLN, but this survey targets three of them. As the
selection criterion, if a paper included authors from one of these three institutes, it was
classified under that institute’s results. Most of the articles are less than two years old, and
the most recent articles were prioritized. Papers were included even if they were not used
directly in VLN tasks, as long as they were applicable. In this section, as depicted in Fig. 6,
we apply the VLN taxonomy to the research of three institutes: DeepMind, Google research,
and Facebook research.

6.1 DeepMind

DeepMind researches VLN by drawing inspiration from brain structure and neurosci-
ence. It consistently conducts cognitive navigation research based on reinforcement
learning, and its strength is interpreting and validating its results from the perspective
of neuroscience. They conduct various VLN studies using image-based maze simula-
tions (e.g., DeepMind Lab) and photo-realistic city navigation.

6.1.1 VLN representation learning

DeepMind proposed diverse methods (e.g., BERT, QA, shared embeddings, common repre-
sentations, video representations, unsupervised methods) to enhance representation learning.
These methods make representations more robust, reduce misalignment, and compensate rewards.
Language representation Das et al. used QA to decode and understand representation
with predictive modeling, action-conditional CPC, and SimCore (Das et al. 2020). They
use internal state representations for synthetic questions without backpropagation on the
decoder.
Visual-language representation Sigurdsson et al. proposed a hybrid visual-text map-
ping algorithm, MUVE, which maps words between the languages, particularly the ‘visual’
words (Sigurdsson et al. 2020). Using the shared embedding of existing unsupervised text-
based translation shows robustness and works with relatively low-resource languages. Jaderberg
et al. introduced methods that simultaneously maximize pseudo-reward functions and
rapidly adapt to the actual task (Jaderberg et al. 2017). It focuses on common representa-
tion for continual development when extrinsic rewards are absent.
Video representation Miech et al. proposed MIL-NCE, which learns strong video rep-
resentations from scratch without manual annotation, to reduce misalignments inherent in
narrated videos (Miech et al. 2020).
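A simplified picture of such contrastive video-text training is the InfoNCE-style loss sketched below; it is not the exact MIL-NCE objective, which additionally pools over multiple candidate clips per narration, and the toy embeddings are synthetic.

# A minimal contrastive (InfoNCE-style) loss between paired video and text
# embeddings; matched pairs sit on the diagonal of the similarity matrix.
import numpy as np


def info_nce(video: np.ndarray, text: np.ndarray, temperature: float = 0.1) -> float:
    """video, text: (batch, dim) L2-normalized embeddings; row i of each is a pair."""
    logits = video @ text.T / temperature                # (batch, batch) similarity matrix
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))         # negative log-likelihood of pairs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.normal(size=(4, 32)); v /= np.linalg.norm(v, axis=1, keepdims=True)
    t = v + 0.05 * rng.normal(size=(4, 32)); t /= np.linalg.norm(t, axis=1, keepdims=True)
    print(round(info_nce(v, t), 3))   # small loss because pairs nearly align
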

6.1.2 VLN reinforcement learning

DeepMind has done a great deal of RL research; however, we only cover a few studies that
are relevant here. They propose methods that use multi-modal sensory inputs, compressed
sensing, and semantic relations of natural language.
RL representation Mirowski et al. formulated the navigation question and proposed
additional auxiliary tasks about multi-modal sensory inputs to improve data efficiency and
task performance (Mirowski et al. 2017). They jointly train the goal-driven RL problem
with auxiliary depth prediction and loop closure classification tasks.
Hierarchical RL Eysenbach et al. proposed Diversity is All You Need (DIAYN), which
learns useful skills without a reward function by maximizing an information-theoretic
objective using a maximum entropy policy (Eysenbach et al. 2019).
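The skill-discovery objective of DIAYN can be summarized by its intrinsic reward log q(z|s) − log p(z); the sketch below computes that reward from dummy discriminator logits under a uniform skill prior, as one illustrative reading of the method rather than the authors' implementation.

# Hypothetical sketch of the DIAYN-style intrinsic reward: a skill z is sampled
# from a fixed prior, and the reward is log q(z|s) - log p(z), where q is a
# learned skill discriminator (a dummy stand-in here).
import numpy as np


def diayn_reward(disc_logits: np.ndarray, skill: int, num_skills: int) -> float:
    """disc_logits: discriminator logits over skills for the current state."""
    log_q = disc_logits[skill] - np.log(np.exp(disc_logits).sum())  # log q(z|s)
    log_p = -np.log(num_skills)                                     # uniform prior log p(z)
    return float(log_q - log_p)


if __name__ == "__main__":
    num_skills = 8
    skill = 3
    rng = np.random.default_rng(0)
    logits = rng.normal(size=num_skills)      # stand-in for the discriminator output
    print(round(diayn_reward(logits, skill, num_skills), 3))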
Self-supervised RL DeepMind proposed a system that captures the semantics of spatial
relations from natural language. They introduced a multi-modal objective based on gener-
ating images of scenes from descriptions for spatial reasoning. DeepMind proposed meth-
ods that learn a skill without rewards and a self-supervised model for better generalization
in noisy environments. Gruslys et al. proposed a model-free RL algorithm, the Advantage
Regret-Matching Actor-Critic (ARMAC), which saves a buffer of past policies (Gruslys
et al. 2020). They use retrospective value estimates to predict conditional advantages with
regret matching to produce a new policy.


6.1.3 VLN components

The hippocampus-based episodic memory is the primary mechanism of long-term mem-
ory, and reasoning is applied in the driving and robot domains. In particular, they con-
ducted a detailed study on driving with an egocentric view in a photorealistic environment
based on Google Street view.
VLN model Bruce et al. proposed efficient learning for goal-directed navigation poli-
cies in a large heterogeneous environment (Bruce et al. 2018). They use multiple forms of
efficient stochastic augmentation to learn policy with pre-computed embedding and dem-
onstrate on the real robot without fine-tuning. Li et al. transferred the ground view policy
to the unseen city with aerial observations instead of training a multi-modal policy on the
ground and aerial views (Li et al. 2019a). They learn a joint policy with similar embedding
space from transferable across views. Mirowski et al. pointed out the difficulty of naviga-
tion tasks such as perception, planning, memory, exploration, and optimization (Mirowski
et al. 2019). For the end-to-end goal-driven navigation, they proposed StreetLearn, which
has an interactive, first-person, partially-observed visual environment of Google Street
View. Hermann et al. proposed an instruction-following task to combine the practical-
ity of simulated environments with real-world data (Hermann et al. 2020). It provides
agents driving instructions to learn the interpretation of navigation in StreetNav. Mirowski
et al. proposed a dual pathway architecture that encapsulates locale-specific features and
transfers to multiple cities in an interactive navigation environment, Google Street View
(Mirowski et al. 2018).
Memory model Ritter et al. proposed a recursive implicit planning module with episodic
memories (Ritter et al. 2020). It uses prior knowledge and efficient exploring, and knowl-
edge-based planning for unseen environments. They consider the situation of severe par-
tial observability and long memory durations. Banino et al. proposed MEMO, which can
reason over longer distances (Banino et al. 2020). It separates facts stored in external memory
from the items that comprise them. It then uses an adaptive retrieval mechanism that can rea-
son with a variable memory hop.
Interdisciplinary model A wide range of approaches to mammalian navigation, moti-
vation-based rewards, and rapid planning is proposed for route navigation. For the mam-
mal-like navigational abilities of the RL agent, Banino et al. trained a recurrent network
with representations resembling grid cells for path integration (Banino et al. 2018). They
showed that grid-like representations are effective for an agent to locate goals in unseen
environments and enable agents to conduct shortcut behaviors of mammals. Colas et al.
proposed IMAGINE, an intrinsically motivated deep RL architecture in which a social peer
provides guiding language descriptions (Colas et al. 2020b). To use goal imagination, the agent inter-
prets these descriptions to imagined out-of-distribution goals. They use a decomposition
between reward function and policy, gated attention, and object-centered representations
for generalization. Khetarpal et al. proposed an affordance model to enable faster planning
by reducing the number of actions in a given situation and learning transition models more
efficiently and accurately from data with function approximation (Khetarpal et al. 2020).
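One way to picture affordance-based pruning is the toy sketch below, where a hand-crafted affordance table restricts the actions considered in a given state; the state tags and action names are hypothetical and not taken from Khetarpal et al.

# Hypothetical sketch of affordance-based action pruning: an affordance map
# restricts the action set considered in a given state, which can speed up planning.
from typing import Dict, List

AFFORDANCES: Dict[str, List[str]] = {
    "facing_door": ["open", "move_forward", "turn_left", "turn_right"],
    "facing_wall": ["turn_left", "turn_right"],
    "open_space": ["move_forward", "turn_left", "turn_right", "stop"],
}


def afforded_actions(state_tag: str, full_action_set: List[str]) -> List[str]:
    allowed = set(AFFORDANCES.get(state_tag, full_action_set))
    return [a for a in full_action_set if a in allowed]


if __name__ == "__main__":
    actions = ["open", "move_forward", "turn_left", "turn_right", "stop"]
    print(afforded_actions("facing_wall", actions))   # ['turn_left', 'turn_right']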

6.1.4 Summary

DeepMind focuses on visual representation learning in the multi-modal domain. MONET
is used to abstract visual representations to quickly process the vision context
in games and tasks (e.g., flag game, StarCraft). On the other hand, research has been
conducted using natural language for RL, but research on natural language itself is less
than that of other research institutes. Spatial relations from natural language were used, or
sparse imitation learning was applied to text-based games. They show that an embodied
agent exhibits a credible understanding of language with similar one-shot word learning
(Hill et al. 2020).
DeepMind is a research group that leads reinforcement learning with OpenAI and
Berkeley. While there is a lot of research on reinforcement learning in DeepMind, there
aren’t many studies directly related to VLN because they focus on general-purpose AI.
Above all, generalization approaches (e.g., DIAYN), in which skills are not defined in
advance and learn subtasks by themselves through learning, can be well utilized for gen-
eral-purpose VLN tasks. They are inspired by neuroscience and are familiar with evaluating
the validity of RL through in vivo experiments. For example, based on mammal-like navi-
gation, there are studies on grid cell-based shortest-distance search and path finding using
episodic memory (e.g., MERLIN). In addition, as it is interested in scalability, it is showing
strength in street learning based on photorealistic Google Street view, outdoor navigation,
and autonomous driving on a city scale. Unfortunately, they do not provide open-source
codes and, in practical terms, are more restrictive than other research groups. However, as
an organization that has the potential to do moonshot technology (e.g., AlphaGo), we can
expect a general-purpose VLN that is more human-like and interpretable.

6.2 Google research

Google research deals with more tangible tasks (e.g., outdoor navigation) that can be
extended with Google navigation. Therefore, they are strong candidates for commercializ-
ing VLN outdoors. It also has the potential to expand VLN more widely in connection with
autonomous driving, Google Assistant, and even robots.

6.2.1 VLN representation learning

Google research conducted studies to abstract and transform text-based representations
using structural characteristics. A study on caption-to-video or controllable image caption-
ing was also proposed for multi-modal representation learning as well as natural language.
In addition, a characteristic study of multi-modal training and unimodal evaluation was
also conducted in EQA tasks.
Language representation Raffel et al. introduced a unified framework that converts lan-
guage problems into text-to-text formats using transfer learning (Raffel et al. 2020). Jiang
et al. used language as the abstraction across tasks to reason using structured language
(Jiang et al. 2019). Language provides compositional structure and enables fast learning. Kreutzer
et al. enabled NLP practitioners to utilize interaction logs for offline RL (Kreutzer et al.
2020). In addition, they suggested ways for RL researchers to adapt their algorithms to the
demanding applications of NLP. Inverse reinforcement learning was used to recover the
reward function from demonstrations or to warm-start an agent to reduce sample complexity.
Visual-language representation For caption-to-video retrieval, Gabeur et al. utilized a
multimodal transformer to jointly encode the different modalities in a video for optimiz-
ing the language embedding (Gabeur et al. 2020). They use S3D, VGGish, DenseNet161,
word2vec embedding, and SSD face detector for multi-modal perception. Alikhani et al.
proposed coherence-aware, controllable image captioning models which learn inferences
in imagery and text, coherence relation prediction (Alikhani et al. 2020). They use coher-
ence annotations to learn relation classifiers as an intermediary step. To answer a question
with intelligent navigation, Anand et al. proposed a model that uses egocentric visual infor-
mation in EmbodiedQA and evaluates question-only baselines in an unseen environment
(Anand et al. 2018). Koh et al. proposed TReCS, a sequential model that generates images
using language grounding with Limited hidden Markov models and BERT models (Koh
et al. 2021). They retrieve images with a text-image dual encoder and input a complete
segmentation into the mask-image conversion model to synthesize a natural and realistic
image. Ku et al. proposed a model that resolves the known biases on the trajectory and
emphasizes the role of language in VLN by making more references to visible entities (Ku
et al. 2020). It has an advantage in a multilingual environment by setting a single language
and the standard score for multi-task learning.
Pretrained multimodal representation He et al. proposed to design a pre-trained model
of each domain and the UI function expression by matching the user’s action with natural
language in the web user interface (He et al. 2021).

Fig. 7  Google visual language navigation, FollowNet (Adapted from Shah et al. 2018)


6.2.2 VLN Reinforcement learning

Google research also has many papers on RL, but inverse reinforcement learning, multi-
context imitation, and zero-shot imitation learning based on language-RL appear to be par-
ticularly useful for VLN tasks.
Self-supervised RL Agarwal et al. proposed a robust Q-learning algorithm, Random
Ensemble Mixture (REM), that enforces optimal Bellman consistency on random convex
combinations to enhance generalization for offline RL (Agarwal et al. 2020). For high-
quality navigation and cost-effective data collection, Pan et al. proposed a zero-shot imita-
tion learning framework to train a goal-driven visual navigation policy on a legged robot
from demonstrations (Pan et al. 2020).

6.2.3 VLN Components

Google research proposed SEED RL-based VALAN and SAPIEN considering physical
characteristics as a simulation environment for VLN. They proposed Follownet, which
translates complex sentences into simple executable forms (see Fig. 7). In addition, as
it considers extension with dialog beyond VLN, it is showing a wide range of progress
in natural language. Google considers spatio-temporal dependence and proposes several
memory models for handling long episodes.
VLN model Xiang et al. proposed SAPIEN, a realistic and physics-rich simulated environ-
ment with a large-scale set of articulated objects (Xiang et al. 2020a). Wang et al. introduced
a multi-task navigation model for VLN and Navigation from Dialog History (NDH) tasks
(Wang et al. 2020b). It is guided by natural language and transfers knowledge across tasks. It
learns environment-agnostic representations for the navigation policy on unseen environments.
Wang et al. proposed a generalized multitask navigation model which seamlessly trains on
language-grounded navigation tasks such as VLN and NDH (Wang et al. 2019b). It can effi-
ciently transfer knowledge across related tasks by using natural language. Shah et al. proposed
an end-to-end differentiable model, FollowNet, which learns multi-modal navigation policies
(Shah et al. 2018). It matches instructions to visual and depth inputs to locomotion primitives.
It uses attention to focus on the relevant parts of the command during navigation.
Memory model Fang et al. proposed a memory-based policy, Scene Memory Trans-
former (SMT), which embeds each observation to memory for efficient training over long
episodes (Fang et al. 2019). They also use the attention mechanism to exploit spatio-tem-
poral dependencies. Ren et al. proposed a few-shot learning dataset and contextual proto-
typical memory model for large-scale indoor environments (Ren et al. 2020). It mimics the
visual experience of an agent wandering within a world.

6.2.4 Summary

The papers including Google authors are summarized in terms of representation
learning (e.g., T5) for transfer learning and the composition of multi-modal joint
encoding for video captions. They consider coherence relations, train with vision, and
evaluate with question-only baselines in unseen environments. In research using reinforce-
ment learning, they utilize language conditional reward and extend multi-context imitation
to the robot domain. They also use environment-agnostic representations and efficiently
transfer knowledge to propose a model that is easy to transfer to new tasks. In particular,
in the case of FollowNet, it interprets and executes relatively
complex natural language that includes a large number of actions in an indoor environment.
However, the performance deteriorates as sentence length increases. In terms of memory uti-
lization, scene memory and contextual prototypical memory are used to show stable opera-
tion even on a large scale. In the aspect of framework and simulation, they proposed Seed
RL-based VALAN framework and SAPIEN simulator for the photo-realistic environment.
Google has a broad spectrum of tasks, from Google Maps to robots. In addition, it is one
of the major companies that can commercialize VLN because autonomous driving or exten-
sion of Google Maps is possible using proficient natural language and vision processing.

6.3 Facebook research

Facebook research has conducted the Habitat competition based on open source and takes
a practical approach to language and vision. They are directly involved in VLN, and because
they provide open-source code and competitions, extensive research exchange is possible.
In addition, they provide practical software for the component technologies required for VLN
(e.g., Habitat-Sim, Detectron2, CraftAssist, ParlAI).

6.3.1 VLN representation learning

Facebook research used point clouds in VLN tasks to identify only partially visible objects
and conducted research on visual representations to measure visual similarity. In VLN tasks,
there is little research on audio representation. However, for the task of locating sound,
they proposed several audio-visual navigation studies that could process audio information.
They proposed ViLBERT, a multi-modal pre-trained model that can handle video repre-
sentation, and further proposed VLN-BERT, which can be used directly for VLN tasks. In
addition, studies on multi-modal representation considering visual grounding, object rela-
tions, and multi-modal transformers were also proposed.
Vision representation Wiles et al. proposed a differentiable point cloud renderer that
transforms latent 3D features into the target view (Wiles et al. 2020). They train real images
without any ground-truth 3D information in an E2E manner and then use a single image at
test time.
Language representation Perez et al. decomposed hard questions into easier sub-ques-
tions to improve the answering performance in QA tasks (Perez et al. 2020). They use an
unsupervised approach to produce sub-questions and map from the distribution of multi-
hop questions to the distribution of single-hop sub-questions.
Visual-language representation Datta et al. proposed an E2E caption-to-image retrieval
model, which guides the process of phrase localization, infers the latent correspondences
between regions-of-interest and phrases in the caption (Datta et al. 2019). They also create
a discriminative image representation using these matched RoIs to guide visual grounding.
Hu et al. proposed multimodal transformer architecture accompanied by a rich represen-
tation for text in images for the TextVQA task (Hu et al. 2020). Sadhu et al. proposed a
VOGNet framework to encode multi-modal object relations via self-attention with relative
position encoding (Sadhu et al. 2020).
Video representation Nagarajan et al. proposed a human-centric model which captures
the primary spatial zones of interaction and activities (Nagarajan et al. 2020). They decom-
pose space into a topological map derived from a first-person activity that learns directly
from the egocentric video. Many objects in the real world change their visual appearance
dramatically, and this appearance provides situational cues about how objects appear in
the scene. Bertasius et al. proposed the Contextualized Object Embeddings (COBE) that
trains a visual detector using the transcription narration of educational video to predict the
contextualized word embedding of the object and the related narration (Bertasius and Tor-
resani 2020). It infers the ongoing or scheduled human-object interaction.
Multimodal representation As an alternative form of curiosity, Dean et al. introduced
multimodal rewards in which both sight and sound play a critical role in exploration (Dean et al.
2020). It exploits multiple modalities for more efficient exploration on several Atari envi-
ronments and Habitat 2.0. Chen et al. introduced audio-visual navigation for acoustical and
visual 3D environments (Chen et al. 2020b). They reverberate audio and follow sound-
emitting targets. Gao et al. explored the spatial cues contained in echoes for spatial reason-
ing (Gao et al. 2020). After capturing echo responses in photo-realistic environments, they
use an interaction-based representation learning framework to learn useful visual features
via echolocation. Moon et al. proposed Situated Interactive Multi-Modal Conversations
(SIMMC), which trains multi-modal actions grounded in a multi-modal context with the
dialog history on two shopping domains, furniture, and fashion (Moon et al. 2020).
Pre-trained multimodal representation Majumdar et al. proposed VLN-BERT, which
uses a visio-linguistic transformer model that scores the compatibility between instruction
and panoramic images (Majumdar et al. 2020). Lu et al. proposed Vision-and-Language
BERT (ViLBERT), which learns task-agnostic joint representations of image and natural
language (Lu et al. 2019). It extends BERT to a multi-modal two-stream model that inter-
acts with co-attentional transformer layers.
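A much-simplified view of scoring instruction-path compatibility is sketched below: candidate paths are re-ranked by the cosine similarity between an instruction embedding and the mean of the panorama embeddings along each path. Real VLN-BERT uses a full visio-linguistic transformer; this is only an illustrative stand-in with synthetic embeddings.

# Simplified path re-ranking sketch: score each candidate path by cosine
# similarity between the instruction embedding and the mean panorama embedding.
import numpy as np


def path_score(instr_emb: np.ndarray, pano_embs: np.ndarray) -> float:
    """instr_emb: (d,); pano_embs: (steps, d) embeddings of views along the path."""
    path_emb = pano_embs.mean(axis=0)
    num = float(instr_emb @ path_emb)
    den = float(np.linalg.norm(instr_emb) * np.linalg.norm(path_emb)) + 1e-8
    return num / den


def rerank(instr_emb: np.ndarray, candidate_paths) -> int:
    """Return the index of the highest-scoring candidate path."""
    scores = [path_score(instr_emb, p) for p in candidate_paths]
    return int(np.argmax(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    instr = rng.normal(size=64)
    candidates = [rng.normal(size=(5, 64)) for _ in range(3)]
    candidates[1] = np.tile(instr, (5, 1)) + 0.1 * rng.normal(size=(5, 64))
    print(rerank(instr, candidates))   # the aligned path (index 1)
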
VLN reinforcement learning

Facebook research proposed fused representation and the
nav-graph-based representation for effective RL representation learning. They proposed
DD-PPO to increase sample and time efficiency in the PointNav environment and con-
ducted experiments on Embodied agents in various environments. For VLN tasks, vari-
ous approaches such as imitation learning, subsequent joint hierarchical training, Inflec-
tion Weighting, and sub-question decomposition were added as the language RL method to
enhance performance.
RL representation Shen et al. proposed an action-level representation fusion scheme
that trains an agent to fuse a large visual representation of diverse perceptions (Shen et al.
2019). Fused representations are used to predict an action candidate from each representa-
tion and adaptively consolidate to the final action. To reduce redundancies and improve
generalization, they use an inter-task affinity regularization. Krantz et al. extended the
VLN task to continuous 3D environments by removing assumptions imposed by the nav-
graph-based representation (Krantz et al. 2020). They map observations directly to low-
level control in an E2E manner.
Planning Das et al. proposed a modular approach for learning policies over long plan-
ning horizons from the language in navigation tasks (Das et al. 2018b). They use imitation
learning for warm-start policies and subsequent joint hierarchical training to adapt to the
sub-policies from the master policy.
Self-supervised RL Wijmans et al. proposed Decentralized Distributed Proximal Policy
Optimization (DD-PPO), a method for distributed RL in resource-intensive simulated envi-
ronments (Wijmans et al. 2020). DD-PPO uses multiple machines without a centralized
server in a synchronous manner for simple and easy implementation. To increase sam-
ple and time efficiency in PointNav, Ye et al. developed a method using self-supervised
auxiliary tasks with DD-PPO (Ye et al. 2020). They can predict the action between two
egocentric observations and the distance between two observations from a trajectory. Wij-
mans et al. proposed a loss-weighting scheme, Inflection Weighting, which trains recurrent
models for navigation with behavior cloning (Wijmans et al. 2019). Point clouds provide
a richer signal than RGB images for obstacle avoidance in the embodied navigation task.

6.3.2 VLN components

Facebook research is one of the research institutes that proposes various models (e.g., inter-
active role-play models such as EQA and Talk The Walk) and frameworks through Habitat
competition related to VLN. In particular, papers that apply the SLAM model and mid-
level vision through visual representation improvement based on the Habitat framework
have been proposed. In addition, a study was conducted to construct a top-down semantic
map for navigation and to apply curiosity to VLN tasks. They also lower the entry barrier
for VLN tasks through game environments such as Minecraft, real-time strategy games,
text adventure games, and role-playing games. Their natural language response generation
also shows strengths, such as using personas or transforming answers into a form that a
five-year-old can understand. Furthermore, various studies on Bayesian Relational Memory,
geometric reasoning, spatial reasoning, multi-hop reasoning, joint reasoning, relation
reasoning, and context reasoning have been proposed.
VLN model Das et al. proposed Embodied Question Answering (EmbodiedQA), which
intelligently navigates to explore the environment and answer visual questions with ego-
centric vision (Das et al. 2018a). They use active perception, common sense reason-
ing and language grounding, and imitation learning for EQA tasks. Yu et al. proposed a
generalization of EQA, Multi-Target EQA, which can handle comparison questions of mul-
tiple targets in VLN tasks (Yu et al. 2019a, 2019b).

Fig. 8  Facebook visual language navigation framework, Habitat (Adapted from Savva et al. 2019)

Savva et al. proposed an embodied AI platform, Habitat, which trains embodied agents
in photorealistic 3D simulation (see Fig. 8) (Savva et al. 2019). Chaplot et al. proposed a
modular and hierarchical approach, Active Neural SLAM, to learn policies for exploring
3D environments (Chaplot et al. 2020a). They use analytical path planners with SLAM
modules based on global and local policies. To learn faster and generalize better in RL
frameworks, Chaplot et al. also proposed Goal-Oriented Semantic Exploration, which uses
an episodic semantic map to explore the environment efficiently in habitat tasks (Chap-
lot et al. 2020b). Sax et al. integrated a generic perceptual skill set, mid-level vision, on
an RL framework (Sax et al. 2019). It provides the policy with a more processed state
rather than raw images. It shows good performance in habitat tasks. Gordon et al. proposed
decoupling visual perception and policy learning methods, SplitNet, which utilize auxil-
iary tasks, selective learning, and transferring between simulators (Gordon et al. 2019).
Ramakrishnan et al. inferred the occupancy state with egocentric RGB-D observations and
top-down maps (Ramakrishnan et al. 2020). They facilitate rapid spatial awareness and
efficient exploration in 3D environments.
Narasimhan et al. proposed the model to predict top-down belief maps of regions and
generate top-down semantic maps to predict a target point for room navigation (Narasim-
han et al. 2020). Nagarajan and Grauman proposed RL for exploration for the interaction of
an embodied agent with an egocentric RGB-D camera and a high-level action space (Naga-
rajan and Grauman 2020). They maximize interaction rewards for simultaneously training
an image-based affordance segmentation model.
Generative model Shuster et al. proposed Personality-captions, which incorporate con-
trollable style and personality traits (Shuster et al. 2019). Gafni et al. proposed the model
which generates sequences of person images according to arbitrary user-defined control sig-
nals (Gafni et al. 2019). The generated video shows the dynamics and appearance of a
person with an arbitrary background. Fan et al. proposed an Explain Like I’m Five
(ELI5), which provides a comprehensible answer to five-year-olds (Fan et al. 2019). ELI5
comprises various questions requiring multi-sentence answers.
Memory model Wu et al. proposed a Bayesian Relational Memory (BRM) to improve
the generalization for semantic visual navigation agents in unseen environments (Wu et al.
2019). BRM uses a probabilistic relation graph over semantic entities by capturing the lay-
out prior from training and estimating the posterior layout with updating memory at test
time.
Reasoning model Chaplot et al. proposed topological representations with semantics and
afforded approximate geometric reasoning (Chaplot et al. 2020d). Lewis and Fan proposed
a multi-hop reasoning model to explain the question, not just to answer it (Lewis and Fan
2018). They generate a conditional language model with the joint distribution of questions
and answers because discriminative question answering models over-fit to superficial biases.
Zhong et al. proposed a grounded policy learning problem, Read to Fight Monsters (RTFM),
which is needed to jointly reason over a language goal, relevant dynamics, and observations
(Zhong et al. 2019). It procedurally generates environment dynamics and corresponding lan-
guage descriptions to prevent memorizing. They also use the txt2π model to capture interac-
tions between the goal, document, and observations. Wu et al. used a probabilistic graphical
model to learn sub-policies for semantic concepts and a prior distribution over pairwise rela-
tionships (Wu et al. 2018b). The agent dynamically updates its belief of semantic relation-
ships during exploration and plans interpretable routes. Singh et al. proposed Look, Read, Reason & Answer (LoRRA), which predicts an answer by reading text in the image and
reasoning with the context of the image and the question in TextVQA (Singh et al. 2019).
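To illustrate the generative question-answering idea mentioned above, the sketch below ranks candidate answers by the joint probability p(question, answer | context) = p(answer | context) * p(question | answer, context), in the spirit of Lewis and Fan (2018); the log-probabilities are fabricated placeholders rather than outputs of a trained model.

import math

# Minimal sketch: rank answers by a generative joint score instead of a
# discriminative p(answer | question, context). Numbers are made up.

def joint_log_prob(log_p_answer, log_p_question_given_answer):
    return log_p_answer + log_p_question_given_answer

candidates = {
    # answer: (log p(a | c), log p(q | a, c)) from some trained generative model
    "the kitchen": (math.log(0.30), math.log(0.20)),
    "the hallway": (math.log(0.50), math.log(0.05)),  # likely a priori, but explains the question poorly
}
best = max(candidates, key=lambda a: joint_log_prob(*candidates[a]))
print(best)  # "the kitchen"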
Extended model Szlam et al. built an open assistant in the Minecraft game that learns from dialogue with a human in the loop (Szlam et al. 2019). Hu et al. introduced a real-
time strategy game environment that coordinates the various actions of units across long
time scales (Hu et al. 2019). They use latent instructions as a compositional representation
of complex activities for hierarchical decision-making. Suhr et al. built a game environ-
ment to map user instructions to system actions (Suhr et al. 2019). It focuses on recovery
from cascading errors between instructions and explicit reasoning about multiple instruc-
tions. Shuster et al. proposed a role-playing game to get more realistic conversation data in
an open-domain fantasy world based on the conversations they have with humans (Shuster
et al. 2020). Nagarajan and Grauman introduced various downstream tasks such as finding
a knife and putting it in a drawer to use the new home environment intelligently (Nagarajan
and Grauman 2020).

6.3.3 Evaluations

Facebook research proposed simulators such as House3D and Replica that focus on 3D environments, as well as ParlAI and SIMMC, which are frameworks focused on dialogue. They proposed the Sim2Real Correlation Coefficient, which measures the gap between simulation and reality, and investigated the compositional arrangement of constituent objects. They also proposed several datasets for evaluation and standard scenarios for benchmarking.
Simulation Kadian et al. measured the correlation between simulation and reality by virtualizing reality and executing parallel experiments in visual navigation tasks (Kadian et al. 2019). To measure simulation predictivity, they proposed the Sim2Real Correlation Coefficient (SRCC); a minimal sketch of this metric follows this paragraph. To measure generalization to unseen environments, Wu et al. proposed House3D, consisting of human-designed 3D scenes of visually realistic houses and a diverse set of fully labeled 3D objects (Wu et al. 2018a). They consider robustness to low-level variations (e.g., color) and high-level variations (e.g., layout). Crook et al. proposed SIMMC, an extension to ParlAI for multi-modal conversation and system evaluation (Crook et al. 2019). SIMMC simulates environments with AI Habitat or Unity while the agent engages in a conversation. Fan et al. investigated methods to compositionally arrange locations, characters, and objects for world creation in the multiplayer text adventure game environment (Fan et al. 2020).
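As a minimal sketch, assuming SRCC is computed as the Pearson correlation between paired performance scores (e.g., SPL) of the same agents evaluated in simulation and in reality, the computation reduces to the following; the scores shown are made up for illustration.

import numpy as np

# Minimal sketch of a sim-to-real correlation coefficient: correlate the
# performance of the same set of agents in simulation and on a real robot.

def sim2real_correlation(sim_scores, real_scores):
    sim = np.asarray(sim_scores, dtype=float)
    real = np.asarray(real_scores, dtype=float)
    return float(np.corrcoef(sim, real)[0, 1])

sim_spl  = [0.81, 0.74, 0.62, 0.55, 0.40]   # agents ranked in simulation
real_spl = [0.70, 0.69, 0.51, 0.52, 0.33]   # the same agents evaluated on a robot
print(round(sim2real_correlation(sim_spl, real_spl), 3))  # close to 1 if simulation is predictive

A value close to 1 indicates that improvements measured in simulation are likely to transfer to the real platform.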
Metric To investigate relationships between vision-and-language tasks, Lu et al. used
a single model on 12 datasets from four broad categories of tasks, including VQA, cap-
tion-based image retrieval, grounding referring expressions, and multi-modal verifica-
tion (Lu et al. 2020). Batra et al. revisited the problem of Object-Goal Navigation (Batra et al. 2020). They recommend evaluation metric, agent, and environment specifications for visual navigation towards objects (ObjectNav). Jiang et al. revisited grid features for VQA and compared the solution of bottom-up attention with regions against grid features from the same layer (Jiang et al. 2020a).
Table 7 summarizes and compares the approaches of the research institutes on VLN.

Table 7  Descriptions and comparisons of research institutes

Vendor | Topics | Type | Paper | Descriptions
DeepMind | Representation | Representation | Sigurdsson et al. (2020), Jaderberg et al. (2017), Miech et al. (2020) | Utilize common and joint representation learning; MONet as an abstraction for visual representation
DeepMind | RL | RL | Khetarpal et al. (2020) | DIAYN for self-supervised RL
DeepMind | RL | Self-supervised RL | Eysenbach et al. (2019), Gruslys et al. (2020) | Learn a skill without rewards; better generalization
DeepMind | Component | VLN model | Bruce et al. (2018), Li et al. (2019a), Mirowski et al. (2019), Hermann et al. (2020), Mirowski et al. (2018) | StreetLearn based on Google Street View; training a multimodal policy and transferring to the unseen city
DeepMind | Component | Memory model | Ritter et al. (2020), Banino et al. (2020) | MERLIN with episodic memory; MEMO for long-distance reasoning
DeepMind | Component | Interdisciplinary model | Banino et al. (2018), Colas et al. (2020a, b), Khetarpal et al. (2020) | Mammal-like navigation with grid cells; IMAGINE as intrinsically motivated RL
Google | Representation | Representation | Raffel et al. (2020), Jiang et al. (2019), Gabeur et al. (2020), Alikhani et al. (2020), Anand et al. (2018) | A multimodal transformer; training multimodal models and evaluating on question-only baselines
Google | RL | Self-supervised RL | Agarwal et al. (2020), Pan et al. (2020) | Multi-context imitation; zero-shot imitation learning
Google | Component | VLN model | Xiang et al. (2020a), Wang et al. (2020b, 2019b), Shah et al. (2018) | VALAN based on SEED RL; SAPIEN as a realistic and physics-rich environment; FollowNet for multi-modal navigation; language-grounded navigation
Google | Component | Memory model | Fang et al. (2019), Ren et al. (2020) | Scene Memory Transformer; contextual prototypical memory model
Facebook | Representation | Visual representation | Wiles et al. (2020) | Differentiable point cloud renderer
Facebook | Representation | Multimodal representation | Dean et al. (2020), Chen et al. (2020a-e), Gao et al. (2020) | Audio-visual navigation
Facebook | Representation | Pre-trained multimodal representation | Majumdar et al. (2020), Lu et al. (2019) | VLN-BERT with visio-linguistic transformer; ViLBERT as multimodal pre-trained model
Facebook | RL | RL representation | Shen et al. (2019), Krantz et al. (2020) | Action-level representation fusion; the nav-graph-based representation
Facebook | RL | Self-supervised RL | Wijmans et al. (2020), Das et al. (2018a, b), Ye et al. (2020), Wijmans et al. (2019) | DD-PPO as distributed RL in resource-intensive environments
Facebook | Component | VLN model | Das et al. (2018a, b), Yu et al. (2019a, b), Ramakrishnan et al. (2020), Chaplot et al. (2020a), Sax et al. (2019), Gordon et al. (2019) | EmbodiedQA to answer visual questions with egocentric vision; Active Neural SLAM with a modular and hierarchical approach
Facebook | Component | Generative model | Gafni et al. (2019), Fan et al. (2019) | ELI5 for answers comprehensible to five-year-olds; Personality-Captions
Facebook | Component | Reasoning model | Chaplot et al. (2020a-d), Lewis and Fan (2018), Zhong et al. (2019), Wu et al. (2018b), Singh et al. (2019) | Geometric reasoning; multi-hop reasoning; joint reasoning
Facebook | Component | Extended model | Szlam et al. (2019), Hu et al. (2019), Suhr et al. (2019), Shuster et al. (2020) | Minecraft game to learn from dialogue
Facebook | Evaluation | Simulation | Kadian et al. (2019), Wu et al. (2018a), Crook et al. (2019), Fan et al. (2020) | SIMMC for multimodal conversation extended from ParlAI
Facebook | Evaluation | Metric | Lu et al. (2020), Batra et al. (2020), Jiang et al. (2020a) | Revisiting Object-Goal Navigation and VQA

6.3.4 Summary

Facebook research covers a wide spectrum, from visual representation learning using point clouds to multi-modal learning using ViLBERT. In particular, VLN-BERT and audio-visual navigation are strengths that differentiate it from other research groups. In reinforcement learning, work on nav-graph-based representations and DD-PPO is notable. They proposed various RL technologies such as long-range planning, hierarchical training, and behavior cloning for natural language and navigation. Facebook's strength is practical AI: it has proposed representative VLN tasks (e.g., EQA, Habitat) and has shown expansion to games (e.g., Minecraft). In addition, it has shown strengths in multi-modal tasks (e.g., VQA) and conversation (e.g., Persona-Chat, ParlAI). They perform multi-hop and joint reasoning in VQA and expand to VLN tasks by applying the concept of embodiment.
Facebook research is one of the most promising research institutes because it offers various technologies such as language, vision, and reinforcement learning as open-source code, which makes them easy to connect. In addition, it is among the most active organizations in providing competitions and the Habitat platform for VLN and is expanding by linking with PyRobot.

7 Discussion and Challenges

Although the issues and open challenges are summarized in each section, we look back at
the significant points from a methodological and comprehensive perspective for VLN, as
depicted in Table 1.

7.1 Integration

One of the reasons visual language navigation is difficult is the integration of disparate components across heterogeneous multimodal scales and multiple tasks. Each data source has a different unit of measure and sampling rate, so different pre-processing steps and integration are required. Solving these problems needs in-depth knowledge of each component and broad knowledge of the similarities and differences between solutions. VLN research currently focuses mainly on instruction-based navigation, but it can expand to the level of interaction between humans and agents. Performing these interaction-level tasks requires more diverse element technologies and more complex integration than before. For this integration, designing for scalable expansion in advance is also necessary for more complete tasks.
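As a small illustration of the sampling-rate mismatch noted above, the following sketch resamples two feature streams with different rates onto a shared timeline before fusion; the 30 Hz "camera" and 16 kHz "audio" streams and their features are synthetic placeholders, not data from any particular platform.

import numpy as np

# Minimal sketch: sensors with different sampling rates are interpolated
# onto a common timeline so they can be fused step by step.

def resample(timestamps, values, target_times):
    """Linearly interpolate a 1-D feature stream onto a shared timeline."""
    return np.interp(target_times, timestamps, values)

duration = 2.0                                     # seconds
cam_t = np.arange(0.0, duration, 1 / 30)           # 30 Hz frame timestamps
cam_feat = np.sin(2 * np.pi * cam_t)               # stand-in per-frame feature
audio_t = np.arange(0.0, duration, 1 / 16000)      # 16 kHz sample timestamps
audio_feat = np.random.default_rng(0).normal(size=audio_t.shape)

shared_t = np.arange(0.0, duration, 1 / 10)        # fuse at 10 Hz
aligned = np.stack([resample(cam_t, cam_feat, shared_t),
                    resample(audio_t, audio_feat, shared_t)], axis=-1)
print(aligned.shape)                               # (20, 2): 20 fused steps, 2 modalities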

7.2 Simulation

It is essential to prepare simulation tools that provide VLN environments similar to the real world. Some research institutions provide their own tools and environments, but they are not universally used because they are limited to specific tasks and are less versatile. Methodologically, competitions and standardized metrics compensate for this problem by reusing existing simulation tools. On the technology side, general-purpose development tools are needed that can handle various inputs (e.g., sounds, finger-pointing, sign language, body language, eye gazing, and head direction).

7.3 Evaluation

VLN involves a variety of complex environments, but the evaluation metrics are limited. Just as defining rewards in reinforcement learning and evaluating novelty in generative models is difficult, designing comprehensive evaluation metrics is also crucial for VLN tasks. Primarily, VLN metrics are based on the success rate or the minimum distance or time to reach the goal. These are good metrics for navigation but cannot provide interaction-level evaluation beyond simply measured conversation time or subjective user ratings. Language-vision generation has quantitative measures but generally relies on human evaluation. Since VLN tasks are driven by instructions and questions, these metrics alone are not enough, and additional methods used for chatbots should be considered. In addition, we need to design how to determine the appropriate amount of evaluation data along with the amount of training data.
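For reference, the sketch below shows a minimal formulation of the two navigation measures mentioned here, success rate and success weighted by path length (SPL); the episode fields are assumptions for illustration and are not tied to any particular simulator API.

# Minimal sketch of common navigation metrics: success rate and SPL.

def success_rate(episodes):
    """Fraction of episodes in which the agent stopped within the success radius."""
    return sum(ep["success"] for ep in episodes) / len(episodes)

def spl(episodes):
    """Success weighted by normalized inverse path length.

    Each episode provides:
      success      -- 1 if the goal was reached, else 0
      shortest     -- geodesic shortest-path length to the goal
      path_length  -- length of the path the agent actually took
    """
    total = 0.0
    for ep in episodes:
        denom = max(ep["path_length"], ep["shortest"])
        total += ep["success"] * ep["shortest"] / denom if denom > 0 else 0.0
    return total / len(episodes)

episodes = [
    {"success": 1, "shortest": 4.0, "path_length": 5.0},   # reached goal, slightly inefficient
    {"success": 0, "shortest": 6.0, "path_length": 9.0},   # failed episode contributes 0
]
print(success_rate(episodes), spl(episodes))  # 0.5, 0.4

Because a failed episode contributes zero regardless of how the agent behaved along the way, such path-based metrics cannot capture interaction quality, which is exactly the gap discussed above.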

7.4 Extensibility

Immersive interaction for VLN requires interdisciplinary research but is currently used only in a limited form at the conceptual level. For example, sentiment analysis is an important factor in rich interaction, but in practice it deals with a small number of classes based on scalar values, compared with the abundant theory available in psychology. Due to the characteristics of deep learning, research on causal reasoning and explainable AI is needed to analyze why we encounter abnormal results. On the other hand, using information such as storytelling and modal translation can also be a new approach as an alternative to language instruction.

7.5 Multimodal

For cooperation among various modalities, fast fusion and abstraction can help reduce the latency of performing an action. Natural language state representations provide the ability to distill, organize, and maintain semantic information, as well as an interpretation of meaning. Natural language is advantageous for expressing the environment and implicitly encoding multimodal signals such as images, sounds, and videos. We can perform richer visual-language tasks using finer-grained visual information (e.g., finger shape, foot direction) and language processing, including common-knowledge reasoning. Language-grounded RL and pre-trained video models can reduce the complexity of building environments and provide rich representations. By constructing a scene graph of the entities in the environment, agents can perform complex tasks at a more fine-grained level. There is a lot of domain data that has not been paired across vision and language; using such data with unsupervised and semi-supervised methods will be an interesting research direction. Since video data is scarce and its utilization is more complicated than that of other modalities, it is not used much, but VLN needs more research using video for imitation learning and multimodal interaction synchronization.
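A minimal late-fusion sketch of combining modalities before action prediction is shown below; the feature dimensions, random projections, and simple concatenation are illustrative assumptions rather than the fusion scheme of any specific VLN model.

import numpy as np

# Minimal late-fusion sketch: project per-modality features into a shared
# space and combine them before action prediction.

rng = np.random.default_rng(0)
d_shared = 128

def project(x, d_out, rng):
    """Linear projection into the shared space (random weights stand in for learned ones)."""
    w = rng.normal(scale=x.shape[-1] ** -0.5, size=(x.shape[-1], d_out))
    return x @ w

vision = rng.normal(size=(1, 2048))   # e.g., a CNN image feature
language = rng.normal(size=(1, 768))  # e.g., a sentence embedding of the instruction
audio = rng.normal(size=(1, 512))     # e.g., an audio event embedding

fused = np.concatenate(
    [project(vision, d_shared, rng),
     project(language, d_shared, rng),
     project(audio, d_shared, rng)], axis=-1)          # (1, 384)
action_logits = project(fused, 4, rng)                 # e.g., forward / left / right / stop
print(action_logits.shape)                             # (1, 4)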

7.6 Multi‑tasking

Operating multiple tasks with one model has the advantage of producing robust behaviors in new environments. This is more effective in complex environments and facilitates task expansion by reducing the number of models required. Recent developments in pre-trained models have led to an increasing number of studies on fine-tuning and few-shot learning for downstream tasks. In the case of VLN-BERT, actions are added to natural language and vision to make VLN tasks easier to perform. Pre-trained models allow VLN to be used in a wider variety of fields. It is also essential to study how to apply them to various downstream tasks effectively and to find practical downstream tasks.
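The "one shared model, several task heads" idea can be sketched as follows; the encoder and heads are random linear maps standing in for learned networks, and the task names are illustrative assumptions.

import numpy as np

# Minimal sketch of multi-tasking with one model: a shared encoder feeds
# several lightweight task-specific heads, so adding a task does not add a model.

rng = np.random.default_rng(1)

def linear(d_in, d_out):
    w = rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))
    return lambda x: x @ w

shared_encoder = linear(384, 256)            # shared multimodal representation
heads = {
    "navigation": linear(256, 4),            # action logits
    "question_answering": linear(256, 1000), # answer vocabulary logits
    "progress": linear(256, 1),              # progress-estimation auxiliary task
}

fused_features = rng.normal(size=(1, 384))   # output of a multimodal fusion step
shared = shared_encoder(fused_features)
outputs = {task: head(shared) for task, head in heads.items()}
print({task: out.shape for task, out in outputs.items()})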

7.7 Embodiment

Embodiment is an important element that will be the basis of future interaction between AI agents and humans. Agents can be extended to recognize body expressions (e.g., gestures) and non-verbal instructions (e.g., exophora) as intent. However, since these weak-signal intents are noisy and strongly affected by time scale, it is essential to consider how to normalize them. Embodiment is a domain that still lacks research on its concepts and applications compared with other technologies. However, it benefits from interdisciplinary research and has significant growth potential. Because VLN is a suitable task for combining embodiment with other modalities, the embodiment of VLN requires a variety of trials and studies.

7.8 Realization

Having a model that works well in a simulation does not mean that an agent can work properly
in a real-world environment. Current VLNs are being tested mainly in a fixed environment
without taking into account the movement or interference of dynamic objects. Because each
individual member is unique, physiological and behavioral patterns vary from user to user.
It is necessary to consider realistic simulation environments and safe solutions that take into
account dynamic environments such as humans and animals. Also, operating on a device with a small model may be limited by battery life and model performance. Solving these problems requires an effort to address VLN tasks with lightweight models. Safety verification is an essential part, and
it must be able to guarantee the safety of the user as well as the safety of the agent.

8 Conclusion

In this paper, we classified visual language navigation into four perspectives: representation learning, reinforcement learning, component, and evaluation. In addition, we described the required technologies and limitations for each step. Rather than simply summarizing the technologies required for VLN, we have described technologies that can be used in a broader sense.
We analyzed representation learning in terms of vision, language, visual-language, video, and multi-modal representations. In addition, we have dealt with a wide variety of methods, ranging from video pre-trained models, which have recently drawn attention, to the use of YouTube videos for imitation learning. We also covered the various hierarchical and self-supervised approaches currently used for reinforcement learning, a key technology required for VLN path navigation and task planning. In addition, we extensively introduced diverse component models and frameworks for VLN. For the evaluation, we also described the simulation, dataset, and benchmark papers currently in use.
Each research institute is contributing to VLN tasks using its respective strengths. DeepMind is researching technologies that can be used in fast environments for navigation and tasks (e.g., games). Google research, with its broad technology base, has the potential for large-scale navigation using Google Maps data. Facebook research is also expected to perform realistic VLN tasks across a wide spectrum and to expand them with natural interaction such as visual dialog.
VLN environments started at the maze level and are developing into photo-realistic environments that support textures and light reflection. The significance of VLN research is not limited to solving mazes: VLN can develop into various applications ranging from robots to autonomous driving and immersive multi-modal interaction.
These component technologies and integration methodologies for visual language navi-
gation are evolving, but they are still limited and require large-scale improvements for
deeper interactions. Immersive interactions are critical to more realistic tasks that users
can actually benefit from. Therefore, visual language navigation is still an open issue and
requires further investigation in this direction.

Appendix

See Table 8.

Table 8  List of main acronyms

Acronym | Full form | Acronym | Full form
ACAR-Net | Actor-Context-Actor Relation Network | IL | Imitation Learning
ANNA | Automatic Natural Navigation Assistants | IMAGINE | Intrinsic Motivations And Goal INvention for Exploration
APS | Adversarial Path Sampler | IMRAM | Iterative Matching with Recurrent Attention Memory
ARMAC | Advantage Regret-Matching Actor-Critic | IQUAD | Interactive Question Answering Dataset
AuxRN | Auxiliary Reasoning Navigation | KG | Knowledge Graph
AVSD | Audio Visual Scene-aware Dialog | L2STOP | Learning to Stop
BAR | Boundary Adaptive Refinement | LEMMA | LEarning Multi-agent Multi-task Activities
BERT | Bidirectional Encoder Representations from Transformers | LoRRA | Look, Read, Reason & Answer
BiGRU | Bidirectional Gated Recurrent Unit | MERLIN | Memory, RL, and Inference Network
BiLSTM | Bidirectional Long Short-Term Memory | MHMS | Mental Health Monitoring Systems
BRM | Bayesian Relational Memory | MIL-NCE | Multiple Instance Learning and Noise Contrastive Estimation
CMN | Cross-modal Memory Network | MIND | Mental Imagery eNhanceD
CNN | Convolutional Neural Network | MJOLNIR | Memory-utilized Joint hierarchical Object Learning for Navigation in Indoor Rooms
COBE | Contextualized Object Embeddings | MMN | Meta Module Network
CompILE | Compositional Imitation Learning and Execution | MMT | Multimodal Machine Translation
ConvNet | Convolutional Neural Network | MONet | Multi-Object Network
CPC | Contrastive Predictive Coding | MUVE | Multilingual Unsupervised Visual Embeddings
CURL | Contrastive Unsupervised Representations for Reinforcement Learning | NDH | Navigation from Dialog History
DCNet | Diffusion Convolutional Network | NeoNav | Next expected observation Navigation
DD-PPO | Decentralized Distributed Proximal Policy Optimization | NL | Natural Language
DIAYN | Diversity Is All You Need | NLG | Natural Language Generator
DistilBERT | Distilled BERT | NLU | Natural Language Understanding
DSTC8 | Dialog State Tracking Challenge 8 | OAAM | Object-and-Action Aware Model
E2E | End-to-End | OPNet | Object Permanence Network
ECL | Embodied Cognitive Linguistics | ORG | Object Relation Graph
EGP | Evolving Graphical Planner | ORL | Offline RL
ELI5 | Explain Like I'm Five | PBIT | Policy-Based Image Translation
EMIL | Embodied Multimodal Interaction in Language Learning | PnPNet | Perception and Prediction
EQA | Embodied Question-and-Answer | PSGs | Physical Scene Graphs
FPICU | Functionality, Physics, Intent, Causality, and Utility | PTA | Perceive, Transform, and Act
GENESIS | GENErative Scene Inference and Sampling | RCM | Reinforced Cross-Modal Matching
GPT-3 | Generative Pre-trained Transformer 3 | REM | Random Ensemble Mixture
HANNA | Help Automatic Natural Navigation Assistants | REVERIE | Remote Embodied Visual referring Expression in Real Indoor Environments
HAUSR | Hybrid Asynchronous Universal Successor Representations | RL | Reinforcement Learning
HDR | High-Dynamic-Range | RNN | Recurrent Neural Network
HIMN | Hierarchical Interactive Memory Network | RoI | Region-of-Interest
HJ | Hamilton-Jacobi | RTFM | Read to Fight Monsters
HRL | Hierarchical Reinforcement Learning | RVG-TREE | Recursive Grounding Tree
S3D | Separable 3D CNN | THOR | The House Of inteRactions
SAPIEN | Simulated Part-based Interactive Environment | TPN | Tentative Policy Network
SAVN | Self-Adaptive Visual Navigation | V2C | Video-to-Commonsense
SEED | Scalable, Efficient Deep-RL | VALAN | Vision and Language Agent Navigation
SIMMC | Situated Interactive Multi-Modal Conversations | ViLBERT | Vision-and-Language BERT
SLAM | Simultaneous Localization and Mapping | VIOLIN | VIdeO-and-Language Inference
SLU | Spoken Language Understanding | VLN | Visual Language Navigation
SMT | Scene Memory Transformer | VNLA | Vision-based Navigation with Language-based Assistance
SociAPL | Auxiliary Prediction Loss | VPM | Video Pre-trained Model
SPL | Shortest Path Length | VQA | Visual Question-and-Answer
SRCC | Sim2Real Correlation Coefficient | WAH | Watch-And-Help
SSD | Single Shot Detection | WMG | Working Memory Graph
SuReAL | Supervised Reinforcement Asynchronous Learning | WSLR | Word-level Sign Language Recognition
T5 | Text-To-Text Transfer Transformer | |

Acknowledgements This work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MSIT) (No. 2021R1A2C2012635)

Declarations
Conflict of interest The authors declare that they have no conflict of interest.

References
Abbasnejad E, Teney D, Parvaneh A, Shi J, Hengel AVD (2020) Counterfactual vision and language learn-
ing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
10044–10054
Agarwal R, Schuurmans D, Norouzi M (2020) An optimistic perspective on offline reinforcement learning.
In: International Conference on Machine Learning. PMLR, pp 104–114
Alamri H, Hori C, Marks TK, Batra D, Parikh D (2018) Audio visual scene-aware dialog (AVSD) track for natural language generation in DSTC7. In: DSTC7 at AAAI 2019 Workshop 2
Alikhani M, Sharma P, Li S, Soricut R, Stone M (2020) Clue: Cross-modal coherence modeling for caption
generation. In: Association for Computational Linguistics (ACL), 2020
Ammanabrolu P, Hausknecht M (2020) Graph constrained reinforcement learning for natural language
action spaces. In: International Conference on Learning Representations (ICLR), 2020
Anand A, Belilovsky E, Kastner K, Larochelle H, Courville A (2018) Blindfold baselines for embodied QA.
In: NIPS 2018 Visually-Grounded Interaction and Language (ViGilL) Workshop
Anderson P et al (2018) Vision-and-language navigation: Interpreting visually-grounded navigation instruc-
tions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp 3674–3683
Arumugam D, Karamcheti S, Gopalan N, Williams EC, Rhee M, Wong LL, Tellex S (2019) Grounding nat-
ural language instructions to semantic goal representations for abstraction and generalization. Auton
Robot 43(2):449–468
Baker B, Kanitscheider I, Markov T, Wu Y, Powell G, McGrew B, Mordatch I (2019) Emergent tool use
from multi-agent autocurricula. In: International Conference on Learning Representations, 2019
Banino A et al (2018) Vector-based navigation using grid-like representations in artificial agents. Nature
557(7705):429–433
Banino A et al (2020) Memo: a deep network for flexible combination of episodic memories. In: Interna-
tional Conference on Learning Representations (ICLR), 2020
Batra D et al (2020) Objectnav revisited: on evaluation of embodied agents navigating to objects. CoRR
2020
Bear DM et al (2020) Learning physical graph representations from visual scenes. In: 34th Conference on
Neural Information Processing Systems (NeurIPS), 2020
Bertasius G, Torresani L (2020) COBE: contextualized object embeddings from narrated instructional
video. In: Neurips 2020
Blukis V, Brukhim N, Bennett A, Knepper RA, Artzi Y (2018) Following high-level navigation instructions
on a simulated quadcopter with imitation learning. Robot Sci Syst (RSS)
Blukis V, Terme Y, Niklasson E, Knepper RA, Artzi Y (2019) Learning to map natural language instruc-
tions to physical quadcopter control using simulated flight. In: Conference on Robot Learning (CoRL)
2019
Blukis V, Knepper RA, Artzi Y (2020) Few-shot object grounding and mapping for natural language robot
instruction following. In: 4th Conference on Robot Learning (CoRL 2020)
Brown TB et al (2020) Language models are few-shot learners. In: 34th Conference on Neural Informa-
tion Processing Systems (NeurIPS), 2020
Bruce J, Sünderhauf N, Mirowski P, Hadsell R, Milford M (2018) Learning deployable navigation poli-
cies at kilometer scale from a single traversal. In: Proceedings of The 2nd Conference on Robot
Learning. PMLR 87, pp 346–361
Cangea C, Belilovsky E, Liò P, Courville A (2019) VideoNavQA: bridging the gap between visual and
embodied question answering. In: BMVC 2019
Cerda-Mardini P, Araujo V, Soto A (2020) Translating natural language instructions for behavioral robot navigation with a multi-head attention mechanism. In: ACL 2020 WiNLP workshop
Chang M, Gupta A, Gupta S (2020) Semantic visual navigation by watching youtube videos. In: Neu-
rIPS 2020
Chaplot DS, Gandhi D, Gupta S, Gupta A, Salakhutdinov R (2020a) Learning to explore using active
neural slam. In: International Conference on Learning Representations (ICLR), 2020a
Chaplot DS, Gandhi DP, Gupta A, Salakhutdinov RR (2020b) Object goal navigation using goal-ori-
ented semantic exploration. Adv Neural Inf Process Syst 33:4247
Chaplot DS, Jiang H, Gupta S, Gupta A (2020c) Semantic curiosity for active visual learning. In: Euro-
pean Conference on Computer Vision. Springer, Cham, pp 309–326
Chaplot DS, Salakhutdinov R, Gupta A, Gupta S (2020d) Neural topological slam for visual navigation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
12875–12884
Chen B, Song S, Lipson H, Vondrick C (2019a) Visual hide and seek. In: Artificial Life Conference Pro-
ceedings. One Rogers Street, MIT Press, Cambridge, MA
Chen H, Suhr A, Misra D, Snavely N, Artzi Y (2019b) Touchdown: natural language navigation and
spatial reasoning in visual street environments. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp 12538–12547
Chen B et al (2020a) Robust policies via mid-level visual representations: an experimental study in
manipulation and navigation. CoRL 2020
Chen C et al (2020b) Soundspaces: audio-visual navigation in 3d environments. In: Computer Vision–
ECCV 2020a: 16th European Conference, vol 16. Springer, pp 17–36
Chen H, Ding G, Liu X, Lin Z, Liu J, Han J (2020c) Imram: Iterative matching with recurrent attention
memory for cross-modal image-text retrieval. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pp 12655–12663
Chen V, Gupta A, Marino K (2020d) Ask Your Humans: Using Human Instructions to Improve Gener-
alization in Reinforcement Learning. In: ICLR 2021
Chen Y, Tian Y, He M (2020e) Monocular human pose estimation: a survey of deep learning-based
methods. Comput Vision Image Underst 192:102897
Chen W, Gan Z, Li L, Cheng Y, Wang W, Liu J (2021) Meta module network for compositional visual
reasoning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, pp 655–664
Chevalier-Boisvert M, Bahdanau D, Lahlou S, Willems L, Saharia C, Nguyen TH, Bengio Y (2019)
BabyAI: first steps towards grounded language learning with a human in the loop. In: International
Conference on Learning Representations, p 105
Chu YW, Lin KY, Hsu CC, Ku LW (2020) Multi-step joint-modality attention network for scene-aware dialogue system. In: DSTC8 collocated with the Association for the Advancement of Artificial Intelligence (AAAI) 2020
Colas C, Akakzia A, Oudeyer PY, Chetouani M, Sigaud O (2020a) Language-conditioned goal genera-
tion: a new approach to language grounding for RL. In: ICML 2020a Workshop
Colas C, Karch T, Lair N, Dussoux JM, Moulin-Frier C, Dominey PF, Oudeyer PY (2020b) Language as
a cognitive tool to imagine goals in curiosity-driven exploration. In: NeurIPS 2020b
Co-Reyes JD et al (2019) Guiding policies with language via meta-learning. In: International Conference
on Learning Representations (ICLR), 2019
Crook PA, Poddar S, De A, Shafi S, Whitney D, Geramifard A, Subba R (2019) SIMMC: situated Inter-
active Multi-Modal Conversational Data Collection And Evaluation Platform. In: ASRU 2019
Das A, Datta S, Gkioxari G, Lee S, Parikh D, Batra D (2018a) Embodied question answering. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–10
Das A, Gkioxari G, Lee S, Parikh D, Batra D (2018b) Neural modular control for embodied question
answering. In: Conference on Robot Learning. PMLR, pp 53–62
Das A et al. (2020) Probing emergent semantics in predictive agents via question answering. In: Interna-
tional Conference on Machine Learning (ICML), 2020
Datta S, Sikka K, Roy A, Ahuja K, Parikh D, Divakaran A (2019) Align2ground: Weakly supervised phrase
grounding guided by image-caption alignment. In: Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pp 2601–2610
Dean V, Tulsiani S, Gupta A (2020) See, hear, explore: curiosity via audio-visual association. In: NeurIPS
2020
Deitke M et al (2020) Robothor: an open simulation-to-real embodied ai platform. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3164–3174
Deng Z, Narasimhan K, Russakovsky O (2020) Evolving graphical planner: contextual global planning for
vision-and-language navigation. In: NeurIPS 2020
Do V, Camburu OM, Akata Z, Lukasiewicz T (2020) e-SNLI-VE-2.0: Corrected Visual-Textual Entailment with Natural Language Explanations. In: IEEE CVPR Workshop on Fair, Data Efficient and Trusted Computer Vision
Du H, Yu X, Zheng L (2020) Learning object relation graph and tentative policy for visual navigation. In:
European Conference on Computer Vision. Springer, Cham, pp 19–34
Engelcke M, Kosiorek AR, Parker Jones O, Posner H (2020) GENESIS: generative scene inference and
sampling of object-centric latent representations. In: Proceedings of the ICLR, 2020
Eysenbach B, Gupta A, Ibarz J, Levine S (2019) Diversity is all you need: learning skills without a reward
function. In: ICLR 2019 Conference 752
Fan A, Jernite Y, Perez E, Grangier D, Weston J, Auli M. (2019) ELI5: long form question answering. In:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Fan A et al (2020) Generating interactive worlds with text. Proc AAAI Conf Artif Intell 34(02):1693–1700
Fang K, Toshev A, Fei-Fei L, Savarese S (2019) Scene memory transformer for embodied agents in long-
horizon tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pp 538–547
Fang Z, Gokhale T, Banerjee P, Baral C, Yang Y (2020) Video2commonsense: generating commonsense
descriptions to enrich video captioning. In: Conference on Empirical Methods in Natural Language
Processing (EMNLP), 2020
Feng Q, Ablavsky V, Bai Q, Li G, Sclaroff S (2020) Real-time visual object tracking with natural language
description. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, pp 700–709
Fried D et al (2018) Speaker-follower models for vision-and-language navigation. In: Advances in Neural
Information Processing Systems
Fu S, Xiong K, Ge X, Tang S, Chen W, Wu Y (2020a) Quda: natural language queries for visual data analyt-
ics. CoRR 2020a
Fu TJ, Wang XE, Peterson MF, Grafton ST, Eckstein MP, Wang WY (2020b) Counterfactual vision-and-
language navigation via adversarial path sampler. In: European Conference on Computer Vision.
Springer, Cham, pp 71–86
Gabeur V, Sun C, Alahari K, Schmid C (2020) Multi-modal transformer for video retrieval. In: Computer
Vision–ECCV 2020: 16th European Conference, vol 16. Springer, Berlin, pp 214–229
Gafni O, Wolf L, Taigman Y (2019) Vid2game: controllable characters extracted from real-world videos.
In: ICLR 2020
Gan C, Zhang Y, Wu J, Gong B, Tenenbaum JB (2020) Look, listen, and act: towards audio-visual embod-
ied navigation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp
9701–9707
Gao R, Chen C, Al-Halah Z, Schissler C, Grauman K (2020) Visualechoes: spatial image representation
learning through echolocation. In: European Conference on Computer Vision. Springer, Cham, pp
658–676
Garcia-Ceja E, Riegler M, Nordgreen T, Jakobsen P, Oedegaard KJ, Tørresen J (2018) Mental health moni-
toring with multimodal sensing and machine learning: a survey. Pervasive Mob Comput 51:1–26
Gidaris S, Bursuc A, Komodakis N, Pérez P, Cord M (2020) Learning representations by predicting bags of
visual words. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pp 6928–6938
Gordon D, Kembhavi A, Rastegari M, Redmon J, Fox D, Farhadi A (2018) Iqa: visual question answering
in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 4089–4098
Gordon D, Kadian A, Parikh D, Hoffman J, Batra D (2019) Splitnet: Sim2sim and task2task transfer for
embodied visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp 1022–1031
Goyal P, Niekum S, Mooney RJ (2020) PixL2R: guiding reinforcement learning using natural language by
mapping pixels to rewards. In: Conference on Robot Learning (CoRL) 2020
Gruslys A et al (2020) The advantage regret-matching actor-critic. CoRR 2020
Guo Y, Cheng Z, Nie L, Liu Y, Wang Y, Kankanhalli M (2019) Quantifying and alleviating the language
prior problem in visual question answering. In: Proceedings of the 42nd International ACM SIGIR
Conference on Research and Development in Information Retrieval, pp 75–84
Hao W, Li C, Li X, Carin L, Gao J (2020) Towards learning a generic agent for vision-and-language naviga-
tion via pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp 13137–13146
Harish YVS, Pandya H, Gaud A, Terupally S, Shankar S, Krishna KM (2020) DFVS: deep flow guided
scene agnostic image based visual servoing. In: 2020 IEEE International Conference on Robotics and
Automation (ICRA), pp 9000–9006
He Z et al (2021) ActionBert: leveraging user actions for semantic understanding of user interfaces. In:
AAAI Conference on Artificial Intelligence (AAAI-21) 2021
Heinrich S et al (2020) Crossmodal language grounding in an embodied neurocognitive model. Front Neu-
rorobot. https://doi.org/10.3389/fnbot.2020.00052
Hermann KM, Malinowski M, Mirowski P, Banki-Horvath A, Anderson K, Hadsell R (2020) Learning to
follow directions in street view. Proc AAAI Conf Artif Intell 34(07):11773–11781
Hill F, Tieleman O, von Glehn T, Wong N, Merzic H, Clark S (2020) Grounded language learning fast and
slow. In: ICLR 2021
Hong R, Liu D, Mo X, He X, Zhang H (2019) Learning to compose and reason with language tree struc-
tures for visual grounding. IEEE Trans Pattern Anal Mach Intell 44:684
Hong Y, Rodriguez-Opazo C, Qi Y, Wu Q, Gould S (2020) Language and visual entity relationship graph
for agent navigation. In: NeurIPS 2020
Hu H, Yarats D, Gong Q, Tian Y, Lewis M (2019) Hierarchical decision making by generating and follow-
ing natural language instructions. In: Advances in neural information processing systems, 2019
Hu R, Singh A, Darrell T, Rohrbach M (2020) Iterative answer prediction with pointer-augmented multi-
modal transformers for textvqa. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp 9992–10002
Huang H, Jain V, Mehta H, Ku A, Magalhaes G, Baldridge J, Ie E (2019) Transferable representation learn-
ing in vision-and-language navigation. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision, pp 7404–7413
Hutsebaut-Buysse M, Mets K, Latré S (2020) Pre-trained word embeddings for goal-conditional transfer
learning in reinforcement learning. In: International Conference on Machine Learning (ICML) 2020
Language in Reinforcement Learning (LaReL) Workshop
Ilinykh N, Zarrieß S, Schlangen D (2019) Meetup! a corpus of joint activity dialogues in a visual environ-
ment. In: Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue (semdial/
LondonLogue)
Jaderberg M, Mnih V, Czarnecki WM, Schaul T, Leibo JZ, Silver D, Kavukcuoglu K (2017) Reinforcement
learning with unsupervised auxiliary tasks. In: ICLR 2017
Jain U et al (2019) Two body problem: collaborative visual task completion. In: Proceedings of the IEEE/
CVF Conference on Computer Vision and Pattern Recognition, pp 6689–6699
Jaunet T, Vuillemot R, Wolf C (2020) DRLViz: understanding decisions and memory in deep reinforcement
learning. Comput Gr Forum 39(3):49–61
Ji J, Krishna R, Fei-Fei L, Niebles JC (2020) Action genome: actions as compositions of spatio-temporal
scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pp 10236–10247
Jia B, Chen Y, Huang S, Zhu Y, Zhu SC (2020) Lemma: a multi-view dataset for learning multi-agent multi-
task activities. In: European Conference on Computer Vision. Springer, Cham, pp 767–786
Jiang Y, Gu SS, Murphy KP, Finn C (2019) Language as an abstraction for hierarchical deep reinforcement
learning. Adv Neural Inf Process Syst 32:9419–9431
Jiang H, Misra I, Rohrbach M, Learned-Miller E, Chen X (2020a) In defense of grid features for visual
question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp 10267–10276
Jiang M, Luketina J, Nardelli N, Minervini P, Torr PH, Whiteson S, Rocktäschel T (2020b) WordCraft: an
environment for benchmarking commonsense agents. In: ICML, 2020b Workshop
Joze HRV, Shaban A, Iuzzolino ML, Koishida K (2020) MMTM: multimodal transfer module for CNN
fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp 13289–13299
Juliani A et al (2019) Obstacle tower: a generalization challenge in vision, control, and planning. In: Interna-
tional Joint Conferences on Artificial Intelligence (IJCAI), 2019
Kadian A et al (2019) Are we making real progress in simulated environments? Measuring the sim2real
gap in embodied visual navigation. In: IEEE Robotics and Automation Letters (RA-L), 2019
Karch T, Lair N, Colas C, Dussoux JM, Moulin-Frier C, Dominey PF, Oudeyer PY (2020) Language-
goal imagination to foster creative exploration in Deep RL. In: ICML 2020 Workshop
Khetarpal K, Ahmed Z, Comanici G, Abel D, Precup D (2020) What can I do here? A theory of affor-
dances in reinforcement learning. In: International Conference on Machine Learning. PMLR, pp
5243–5253
Kipf T et al (2019) Compile: Compositional imitation learning and execution. In: International Confer-
ence on Machine Learning. PMLR, pp 3418–3428
Koh JY, Baldridge J, Lee H, Yang Y (2021) Text-to-image generation grounded by fine-grained user
attention. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, pp 237–246
Krantz J, Wijmans E, Majumdar A, Batra D, Lee S (2020) Beyond the nav-graph: vision-and-language
navigation in continuous environments. In: European Conference on Computer Vision. Springer,
Cham, pp 104–120
Kreutzer J, Riezler S, Lawrence C (2020) Learning from human feedback: Challenges for real-world
reinforcement learning in nlp. In: Real-World RL Workshop at NeurIPS, 2020
Ku A, Anderson P, Patel R, Ie E, Baldridge J (2020) Room-across-room: multilingual vision-and-lan-
guage navigation with dense spatiotemporal grounding. In: Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing (EMNLP), 2020
Kulhánek J, Derner E, De Bruin T, Babuška R (2019) Vision-based navigation using deep reinforcement
learning. In: 2019 European Conference on Mobile Robots (ECMR). IEEE, pp 1–8
Landi F, Baraldi L, Corsini M, Cucchiara R (2019) Embodied vision-and-language navigation with
dynamic convolutional filters. In: The British Machine Vision Conference (BMVC), 2019
Landi F, Baraldi L, Cornia M, Corsini M, Cucchiara R (2021) Multimodal attention networks for low-
level vision-and-language navigation. Comput Vision Image Underst 210:103255
Le H, Chen NF (2020) Multimodal transformer with pointer network for the dstc8 avsd challenge. In:
DSTC Workshop at Association for the Advancement of Artificial Intelligence (AAAI), 2020
Le H, Hoi SC (2020) Video-grounded dialogues with pretrained generation language models. Assoc
Comput Linguist (ACL). https://doi.org/10.48550/arXiv.2006.15319
Lewis M, Fan A (2018) Generative question answering: learning to answer the whole question. In: Inter-
national Conference on Learning Representations
Li Y, Košecka J (2020) Learning view and target invariant visual servoing for navigation. In: 2020 IEEE
International Conference on Robotics and Automation (ICRA), pp 658–664
Li A, Hu H, Mirowski P, Farajtabar M (2019a) Cross-view policy learning for street navigation. In: Pro-
ceedings of the IEEE/CVF International Conference on Computer Vision, pp 8100–8109
Li J, Tang S, Wu F, Zhuang Y (2019b) Walking with mind: Mental imagery enhanced embodied qa. In:
Proceedings of the 27th ACM International Conference on Multimedia, pp 1211–1219
Li A, Bansal S, Giovanis G, Tolani V, Tomlin C, Chen M (2020a) Generating robust supervision for
learning-based visual navigation using hamilton-jacobi reachability. In: Learning for Dynamics
and Control. PMLR, pp 500–510
Li D, Yu X, Xu C, Petersson L, Li H (2020b) Transferring cross-domain knowledge for video sign lan-
guage recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp 6205–6214
Li J, Wang X, Tang S, Shi H, Wu F, Zhuang Y, Wang WY (2020c) Unsupervised reinforcement learning
of transferable meta-skills for embodied navigation. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp 12123–12132
Li L, Chen YC, Cheng Y, Gan Z, Yu L, Liu J (2020d) Hero: hierarchical encoder for video+ language
omni-representation pre-training. In: Conference on Empirical Methods in Natural Language Pro-
cessing (EMNLP), 2020d
Li S, Chaplot DS, Tsai YHH, Wu Y, Morency LP, Salakhutdinov R (2020e) Unsupervised domain adap-
tation for visual navigation. In: Deep Reinforcement Learning Workshop at NeurIPS, 2020e
Li Z, Li Z, Zhang J, Feng Y, Zhou J (2021) Bridging text and video: a universal multimodal transformer
for video-audio scene-aware dialog. In: IEEE/ACM Transactions on Audio, Speech, and Language
Processing
Liang M, Yang B, Zeng W, Chen Y, Hu R, Casas S, Urtasun R (2020) Pnpnet: end-to-end perception and
prediction with tracking in the loop. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp 11553–11562
Lin AS, Wu L, Corona R, Tai K, Huang Q, Mooney RJ (2018) Generating animated videos of human
activities from natural language descriptions. Learning 1
Liu A et al (2020a) Spatiotemporal attacks for embodied agents. In: European Conference on Computer
Vision. Springer, Cham, pp 122–138
Liu J, Chen W, Cheng Y, Gan Z, Yu L, Yang Y, Liu J (2020b) Violin: a large-scale dataset for video-and-
language inference. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp 10900–10910
Liu YT, Li YJ, Wang YCF (2020c) Transforming multi-concept attention into video summarization. In: Pro-
ceedings of the Asian Conference on Computer Vision
Loynd R, Fernandez R, Celikyilmaz A, Swaminathan A, Hausknecht M (2020) Working memory graphs. In: International Conference on Machine Learning. PMLR, pp 6404–6414
Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for
vision-and-language tasks. In: Advances in Neural Information Processing Systems 2019
Lu J, Goswami V, Rohrbach M, Parikh D, Lee S (2020) 12-in-1: Multi-task vision and language representa-
tion learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pp 10437–10446
Ma CY, Lu J, Wu Z, AlRegib G, Kira Z, Socher R, Xiong C (2019a) Self-monitoring navigation agent
via auxiliary progress estimation. In: International Conference on Learning Representations (ICLR),
2019a
Ma CY, Wu Z, AlRegib G, Xiong C, Kira Z (2019b) The regretful agent: heuristic-aided navigation through
progress estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp 6732–6740
Madureira B, Schlangen D (2020) An overview of natural language state representation for reinforcement
learning. In: ICML 2020 Workshop on Language in Reinforcement Learning (LaReL), vol 4
Majumdar A, Shrivastava A, Lee S, Anderson P, Parikh D, Batra D (2020) Improving vision-and-language
navigation with image-text pairs from the web. In: European Conference on Computer Vision.
Springer, Cham, pp 259–274
Marasović A, Bhagavatula C, Park JS, Bras RL, Smith NA, Choi Y (2020) Natural language rationales with
full-stack visual reasoning: from pixels to semantic frames to commonsense graphs. In: Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings
Martins R, Bersan D, Campos MF, Nascimento ER (2020) Extending maps with semantic and contextual
object information for robot navigation: a learning-based framework using visual and depth cues. J
Intell Robot Syst 2020:1–15
Mei T, Zhang W, Yao T (2020) Vision and language: from visual perception to content creation. APSIPA
Trans Signal Inf Process. https://doi.org/10.1017/ATSIP.2020.10
Miech A, Alayrac JB, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual rep-
resentations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp 9879–9889
Mirowski P et al (2017) Learning to navigate in complex environments. In: International Conference on
Learning Representations (ICLR), 2017
Mirowski P et al (2018) Learning to navigate in cities without a map. Adv Neural Inf Process Syst
31:2419–2430
Mirowski P et al (2019) The streetlearn environment and dataset. CoRR2019
Mogadala A, Kalimuthu M, Klakow D (2021) Trends in integration of vision and language research: A sur-
vey of tasks, datasets, and methods. J Artif Intell Res 71:1183
Moghaddam MK, Wu Q, Abbasnejad E, Shi J (2021) Optimistic agent: accurate graph-based value estima-
tion for more successful visual navigation. In: Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision, pp 3733–3742
Moon S et al (2020) Situated and interactive multimodal conversations. In: Proceedings of the 28th Interna-
tional Conference on Computational Linguistics, pp 1103–1121
Morad SD, Mecca R, Poudel RP, Liwicki S, Cipolla R (2021) Embodied visual navigation with automatic
curriculum learning in real environments. IEEE Robot Autom Lett 6(2):683–690
Mou X, Sigouin B, Steenstra I, Su H (2020) Multimodal dialogue state tracking by qa approach with
data augmentation. In: Association for the Advancement of Artificial Intelligence (AAAI) DSTC8
Workshop
Mshali H, Lemlouma T, Moloney M, Magoni D (2018) A survey on health monitoring systems for health
smart homes. Int J Ind Ergon 66:26–56
Nagarajan T, Grauman K (2020) Learning affordance landscapes for interaction exploration in 3D environ-
ments. In: NeurIPS 2020
Nagarajan T, Li Y, Feichtenhofer C, Grauman K (2020) Ego-topo: environment affordances from egocentric
video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
163–172
Narasimhan M, Wijmans E, Chen X, Darrell T, Batra D, Parikh D, Singh A (2020) Seeing the un-scene:
Learning amodal semantic maps for room navigation. In: European Conference on Computer Vision.
Springer, Cham, pp 513–529
Nguyen K, Daumé III H (2019a) Help, anna! visual navigation with natural multimodal assistance via ret-
rospective curiosity-encouraging imitation learning. In: Conference on Empirical Methods in Natural
Language Processing (EMNLP), 2019
Nguyen K, Dey D, Brockett C, Dolan B (2019b) Vision-based navigation with language-based assistance
via imitation learning with indirect intervention. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp 12527–12537
Pan X, Zhang T, Ichter B, Faust A, Tan J, Ha S (2020) Zero-shot imitation learning from demonstrations for
legged robot visual navigation. In: 2020 IEEE International Conference on Robotics and Automation
(ICRA), pp 679–685
Pan J, Chen S, Shou MZ, Liu Y, Shao J, Li H (2021) Actor-context-actor relation network for spatio-tempo-
ral action localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp 464–474
Park SM, Kim YG (2021) Survey and challenges of story generation models-A multimodal perspective with
five steps: data embedding, topic modeling, storyline generation, draft story generation, and story
evaluation. Inf Fusion 67:41–63
Patel R, Rodriguez-Sanchez R, Konidaris G (2020) On the relationship between structure in natural lan-
guage and models of sequential decision processes. In: The 1st Workshop on Language in Reinforce-
ment Learning, International Conference on Machine Learning (ICML), 2020
Patro B, Namboodiri VP (2018) Differential attention for visual question answering. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, pp 7680–7688
Perez E, Lewis P, Yih WT, Cho K, Kiela D (2020) Unsupervised question decomposition for question
answering. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Prabhudesai M, Tung HYF, Javed SA, Sieb M, Harley AW, Fragkiadaki K (2020) Embodied language
grounding with 3d visual feature representations. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp 2220–2229
Puig X et al (2020) Watch-and-help: a challenge for social perception and human-AI collaboration. In:
ICLR2021
Qi W, Mullapudi RT, Gupta S, Ramanan D (2020a) Learning to move with affordance maps. In: Interna-
tional Conference on Learning Representations (ICLR), 2020a
Qi Y, Pan Z, Zhang S, van den Hengel A, Wu Q (2020b) Object-and-action aware model for visual lan-
guage navigation. In: Computer Vision–ECCV 2020b: 16th European Conference, vol 16. Springer,
pp 303–317
Qi Y, Wu Q, Anderson P, Wang X, Wang WY, Shen C, Hengel AVD (2020c) Reverie: remote embodied
visual referring expression in real indoor environments. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp 9982–9991
Qiu Y, Pal A, Christensen HI (2020) Target driven visual navigation exploiting object relationships. In:
CoRL 2020
Raffel C et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach
Learn Res 21:1–67
Ramakrishnan SK, Al-Halah Z, Grauman K (2020) Occupancy anticipation for efficient exploration and nav-
igation. In: European Conference on Computer Vision. Springer, Cham, pp 400–418
Rao M, Raju A, Dheram P, Bui B, Rastrow A (2020) Speech to semantics: improve asr and nlu jointly via
all-neural interfaces. In: Proceedings of INTERSPEECH, 2020
Ren M, Iuzzolino ML, Mozer MC, Zemel RS (2020) Wandering within a world: online contextualized few-
shot learning. In: ICML 2020 Workshop LifelongML
Ritter S, Faulkner R, Sartran L, Santoro A, Botvinick M, Raposo D (2020) Rapid task-solving in novel envi-
ronments. In: ICLR 2021
Rosano M, Furnari A, Gulino L, Farinella GM (2020) A comparison of visual navigation approaches based
on localization and reinforcement learning in virtual and real environments. In: VISIGRAPP, pp
628–635
Rosenberger P, Cosgun A, Newbury R, Kwan J, Ortenzi V, Corke P, Grafinger M (2020) Object-independ-
ent human-to-robot handovers using real time robotic vision. IEEE Robot Autom Lett 6(1):17–23
Sadhu A, Chen K, Nevatia R (2020) Video object grounding using semantic roles in language descrip-
tion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
10417–10427
Sammani F, Melas-Kyriazi L (2020) Show, edit and tell: a framework for editing image captions. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4808–4816
Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster,
cheaper and lighter. In: 5th Workshop on Energy Efficient Machine Learning and Cognitive Comput-
ing—NeurIPS, 2019
Savva M et al (2019) Habitat: a platform for embodied ai research. In: Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pp 9339–9347
Sax A, Zhang JO, Emi B, Zamir A, Savarese S, Guibas L, Malik J (2019) Learning to navigate using mid-
level visual priors. In: Conference on Robot Learning, 2019
Shah P, Fiser M, Faust A, Kew JC, Hakkani-Tur D (2018) Follownet: robot navigation by following natural
language directions with deep reinforcement learning. In: Third Machine Learning in Planning and
Control of Robot Motion Workshop at ICRA, 2018
Shah R, Krasheninnikov D, Alexander J, Abbeel P, Dragan A (2019) The implicit preference information in
an initial state. In: International Conference on Learning Representations
Shamsian A, Kleinfeld O, Globerson A, Chechik G (2020) Learning object permanence from video. In:
European Conference on Computer Vision. Springer, Cham, pp 35–50
Shen WB, Xu D, Zhu Y, Guibas LJ, Fei-Fei L, Savarese S (2019) Situational fusion of visual representa-
tion for visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp 2881–2890
Shridhar M et al (2020) Alfred: a benchmark for interpreting grounded instructions for everyday tasks.
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp
10740–10749
Shridhar M, Yuan X, Côté MA, Bisk Y, Trischler A, Hausknecht M (2021) ALFWorld: aligning text and
embodied environments for interactive learning. In: ICLR2021
Shuster K, Humeau S, Hu H, Bordes A, Weston J (2019) Engaging image captioning via personality.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
12516–12526
Shuster K, Urbanek J, Dinan E, Szlam A, Weston J (2020) Deploying lifelong open-domain dialogue learn-
ing. CoRR 2020
Sigurdsson G et al (2020) Visual grounding in video for unsupervised word translation. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10850–10859
Silva R, Vasco M, Melo FS, Paiva A, Veloso M (2020) Playing games in the Dark: an approach for cross-
modality transfer in reinforcement learning. In: Proceedings of the 19th International Conference on
Autonomous Agents and MultiAgent Systems, 2020
Singh A et al (2019) Towards vqa models that can read. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp 8317–8326
Siriwardhana S, Weerasekera R, Nanayakkara S (2018) Target driven visual navigation with hybrid asyn-
chronous universal successor representations. In: Deep Reinforcement Learning Workshop, NeurIPS,
2018
Srinivas A, Laskin M, Abbeel P (2020) CURL: contrastive unsupervised representations for reinforcement learning. In: International Conference on Machine Learning (ICML), 2020
Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2020) VL-BERT: pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (ICLR), 2020
Suhr A et al (2019) Executing instructions in situated collaborative interactions. In: Conference on Empiri-
cal Methods in Natural Language Processing (EMNLP), 2019
Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D (2020) MobileBERT: a compact task-agnostic BERT for resource-limited devices. In: ACL, 2020
Szlam A et al (2019) Why build an assistant in Minecraft? CoRR 2019
Tamari R, Shani C, Hope T, Petruck MR, Abend O, Shahaf D (2020) Language (re)modelling: towards embodied language understanding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Tan H, Bansal M (2020) Vokenization: improving language understanding with contextualized, visual-
grounded supervision. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), 2020
Tan S, Liu H, Guo D, Zhang X, Sun F (2020) Towards embodied scene description. Robot Sci Syst
Thomason J, Gordon D, Bisk Y (2019) Shifting the baseline: single modality performance on visual navigation QA. In: NAACL 2019, pp 1977–1983
Thomason J, Murray M, Cakmak M, Zettlemoyer L (2020) Vision-and-dialog navigation. In: Conference on
Robot Learning. PMLR, pp 394–406
Tsai YHH, Bai S, Liang PP, Kolter JZ, Morency LP, Salakhutdinov R (2019) Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2019, p 6558
Wang X, Xiong W, Wang H, Wang WY (2018) Look before you leap: bridging model-free and model-
based reinforcement learning for planned-ahead vision-and-language navigation. In: Proceedings
of the European Conference on Computer Vision (ECCV), pp 37–53
Wang X et al (2019a) Reinforced cross-modal matching and self-supervised imitation learning for
vision-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp 6629–6638
Wang X, Jain V, Ie E, Wang WY, Kozareva Z, Ravi S (2019b) Natural language grounded multitask
navigation. In: ViGIL@ NeurIPS
Wang J, Zhang Y, Kim TK, Gu Y (2020a) Modelling hierarchical structure between dialogue policy and
natural language generator with option framework for task-oriented dialogue system. In: ICLR
2021
Wang XE, Jain V, Ie E, Wang WY, Kozareva Z, Ravi S (2020b) Environment-agnostic multitask learning for natural language grounded navigation. In: Computer Vision–ECCV 2020: 16th European Conference, vol. 16. Springer, pp 413–430
Wang Y (2021) Survey on deep multi-modal data analytics: collaboration, rivalry, and fusion. ACM
Trans Multimed Comput Commun Appl TOMM 17(1s):1–25
Waytowich N, Barton SL, Lawhern V, Warnell G (2019) A narration-based reward shaping approach
using grounded natural language commands. In: The Imitation, Intent and Interaction (I3) work-
shop, ICML 2019
Wijmans E et al (2019) Embodied question answering in photorealistic environments with point cloud
perception. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pp 6659–6668
Wijmans E et al (2020) DD-PPO: learning near-perfect PointGoal navigators from 2.5 billion frames. In: ICML 2020 Workshop
Wiles O, Gkioxari G, Szeliski R, Johnson J (2020) SynSin: end-to-end view synthesis from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7467–7477
Wortsman M, Ehsani K, Rastegari M, Farhadi A, Mottaghi R (2019) Learning to learn how to learn: self-adaptive visual navigation using meta-learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6750–6759
Wu Y, Wu Y, Gkioxari G, Tian Y (2018a) Building generalizable agents with a realistic and rich 3D environment. In: ICLR, 2018
Wu Y, Wu Y, Gkioxari G, Tian Y, Tamar A, Russell S (2018b) Learning a semantic prior for guided navigation. In: European Conference on Computer Vision (ECCV), 2018
Wu Y, Wu Y, Tamar A, Russell S, Gkioxari G, Tian Y (2019) Bayesian relational memory for seman-
tic visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp 2769–2779
Wu J, Li G, Han X, Lin L (2020a) Reinforcement learning for weakly supervised temporal grounding of
natural language in untrimmed videos. In: Proceedings of the 28th ACM International Conference
on Multimedia, pp 1283–1291
Wu Q, Manocha D, Wang J, Xu K (2020b) NeoNav: improving the generalization of visual navigation via generating next expected observations. Proc AAAI Conf Artif Intell 34(06):10001–10008
Wu SA, Wang RE, Evans JA, Tenenbaum J, Parkes DC, Kleiman-Weiner M (2020c) Too many cooks:
coordinating multi-agent collaboration through inverse planning. In: CogSci
Xia F et al (2020) Interactive Gibson benchmark: a benchmark for interactive navigation in cluttered environments. IEEE Robot Autom Lett 5(2):713
Xiang F et al (2020a) SAPIEN: a simulated part-based interactive environment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11097–11107
Xiang J, Wang XE, Wang WY (2020b) Learning to stop: a simple yet effective approach to urban vision-
language navigation. In: Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2020
Xie L, Markham A, Trigoni N (2020) SnapNav: learning mapless visual navigation with sparse direc-
tional guidance and visual reference. In: 2020 IEEE International Conference on Robotics and
Automation (ICRA), pp 1682–1688
Ye J, Batra D, Wijmans E, Das A (2020) Auxiliary tasks speed up learning PointGoal navigation. In: CoRL 2020
Yu H, Lian X, Zhang H, Xu W (2018) Guided feature transformation (GFT): a neural language grounding module for embodied agents. In: Conference on Robot Learning. PMLR, pp 81–98
Yu D et al (2019a) Commonsense and semantic-guided navigation through language in embodied envi-
ronment. In: ViGIL@ NeurIPS
Yu L, Chen X, Gkioxari G, Bansal M, Berg TL, Batra D (2019b) Multi-target embodied question answer-
ing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
6309–6318
Zaheer M et al (2020) Big Bird: transformers for longer sequences. In: NeurIPS
Zeng F, Wang C, Ge SS (2020) A survey on visual navigation for artificial agents with deep reinforcement
learning. IEEE Access 8:135426–135442
Zhan X, Pan X, Dai B, Liu Z, Lin D, Loy CC (2020) Self-supervised scene de-occlusion. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3784–3792
Zhang Y, Hassan M, Neumann H, Black MJ, Tang S (2020) Generating 3D people in scenes without people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6194–6204
Zheng L, Zhu C, Zhang J, Zhao H, Huang H, Niessner M, Xu K (2019) Active scene understanding via online semantic reconstruction. Comput Graph Forum 38(7):103–114
Zhong V, Rocktäschel T, Grefenstette E (2019) RTFM: generalising to novel environment dynamics via
reading. In: International Conference on Learning Representations (ICLR), 2020
Zhou L, Small K (2020) Inverse reinforcement learning with natural language goals. CoRR 2020
Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L, Farhadi A (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp 3357–3364
Zhu F, Zhu Y, Chang X, Liang X (2020a) Vision-language navigation with self-supervised auxiliary reasoning tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10012–10022
Zhu Y et al (2020b) Dark, beyond deep: a paradigm shift to cognitive ai with humanlike common sense.
Engineering 6(3):310–345
Zhu Y, Zhu F, Zhan Z, Lin B, Jiao J, Chang X, Liang X (2020c) Vision-dialog navigation by exploring
cross-modal memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp 10730–10739

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.