Disentanglement in Conceptual Space During Sensorimotor Interaction
Research Article
E-mail: zhong@junpei.eu
Abstract: The disentanglement of different objective properties of the external world is the foundation of language development for agents. The basic target of this process is to summarise common natural properties and then to name them so that those properties can be described in the future. To realise this purpose, a new learning model is introduced for the disentanglement of several sensorimotor concepts (e.g. sizes, colours and shapes of objects) while the causal relationship is being learnt during interaction, without much a priori experience or external instruction. This learning model links predictive deep neural models with the variational auto-encoder (VAE) and makes it possible for independent concepts to be extracted and disentangled from both perception and action. Moreover, such extraction is further learnt by the VAE to memorise their common statistical features. The authors examine this model in an affordance-learning setting, where the robot tries to learn to disentangle the shapes of tools and objects. The results show that such a process can be found in the neural activities of the β-VAE unit, which indicates that using similar VAE models is a promising way to learn concepts and, thereby, the causal relationship of the sensorimotor interaction.
In the VAFA-PredNet architecture, the latent vectors in the VAE are used to provide a conceptual compression of the internal memory of the topmost convLSTM units. Meanwhile, in the temporal domain, it gives rise to the values of the internal memory of the next time step through the reconstruction.

The first version of the VAE was proposed by Kingma and Welling [26]. It is a directed graphical model but is calculated as an auto-encoder. However, differing from ordinary auto-encoders, the VAE learns to construct the latent space z ∈ ℝⁿ to encode the statistical regularities of the continuous data, by selecting the best parameters of the Gaussian distribution in the latent space. Therefore, the common features of the data can be observed, in the form of variational inference, in the latent space.

To accomplish this, rather than directly modelling the data distribution p(x) in the reconstruction, the VAE first models the posterior p(z|x) as an approximate Gaussian probability density qϕ(z|x) with parameters ϕ. Then, with the assumption that the variational posterior qϕ(z|x) approximates the true posterior p(z|x), which follows a Gaussian distribution, p(z) should also be an isotropic unit Gaussian distribution:

p(z) = ∑_x p(z|x)p(x) = ∑_x N(0, I)p(x) = N(0, I) ∑_x p(x) = N(0, I)    (8)

A further feature of the VAE is its ability to extract basic concepts from the input data, which will be shown in the next section.

To enable efficient training at both ends, the variational Bayes approach simultaneously learns the parameters of the generative model pθ(x, z) as well as those of the posterior approximation qϕ(z|x). To accomplish this, the posterior qϕ(z|x) is regularised by its Kullback–Leibler (KL) divergence from the prior distribution p(z). This is why the prior is typically chosen to also be a Gaussian with zero mean and unit variance (8), such that the KL term between the posterior and this prior can be computed in closed form. The loss function optimised by the stochastic construction of the VAE can be defined as

ℒ(ϕ, θ; x) = E_qϕ(z|x)[log pθ(x|z)] − D_KL(qϕ(z|x) ‖ p(z))    (11)

This loss function is called the evidence lower bound (ELBO). There also exists an improved version of the VAE which is able to disentangle the concepts of the input data by putting a constraint β on the regularisation term, so that separate concepts in the properties of the input data, such as size and angle, can be further disentangled in the latent vector. In the case of the β-VAE, the KL term of the ELBO loss function is weighted by the coefficient β:

ℒ(ϕ, θ; x) = E_qϕ(z|x)[log pθ(x|z)] − βD_KL(qϕ(z|x) ‖ p(z))    (12)

From the perspective of neural networks, when this graphical model is implemented as an auto-encoder, the representation of the highest layer z is treated as the latent variable from which the generative process starts. Since p(z) represents the prior distribution of the latent variable z, the data generation results in the reconstruction of the data x̂. Therefore, for the VAE units on top of the VAFA-PredNet model, the inputs and outputs of the encoder and decoder are the internal memory of the topmost convLSTM units and its reconstruction, respectively.
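As a concrete illustration of (11) and (12), the closed-form KL term against the N(0, I) prior and the β-weighted loss can be sketched as below. This is a minimal NumPy sketch, not the authors' implementation: the diagonal-Gaussian parameterisation via (mu, log_var), the array shapes, and the Bernoulli (binary cross-entropy) reconstruction term for inputs in [0, 1] are all assumptions.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over latent dimensions and averaged over the batch."""
    kl_per_dim = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return kl_per_dim.sum(axis=1).mean()

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0, eps=1e-7):
    """Negative ELBO with the KL term weighted by beta: beta = 1 recovers
    the plain VAE loss of (11); beta > 1 gives the beta-VAE loss of (12).
    The expected log-likelihood is approximated here by a Bernoulli
    (binary cross-entropy) reconstruction term -- an assumption."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    recon = -(x * np.log(x_hat)
              + (1 - x) * np.log(1 - x_hat)).sum(axis=1).mean()
    return recon + beta * kl_to_standard_normal(mu, log_var)
```

Because the KL term is available in closed form for the Gaussian choice of prior and posterior, no sampling is needed to evaluate it; increasing β simply penalises posteriors that stray from N(0, I) more strongly, which is what encourages the individual latent units to encode independent factors.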
4 Discussion
4.1 Learning concepts by affordance learning
The grounding theory in psychology suggests that the usage of natural language relies on situational context. To understand language dependent on the physical environment and to capture such possible common abstract values, the concepts that ground linguistic meaning are neither internal nor external to language users but instead span the objective–subjective boundary. As reviewed in [7], the higher level of concepts is rooted in the lower level of grounded meaning, such as motor actions, perception and the integration of both.

Fig. 9 Latent units in case 1 – tap from right: (a) Latent unit 6, (b) Latent unit 3, (c) Latent unit 5

In our case, the disentanglement of concepts relies on interaction with the environment, so it involves the factors of the motor action, the tool and the objects. The disentanglement is therefore a notation of the perceived affordances: how the interaction and its effect can be abstracted as structured units, which can be further used for prediction in future interaction. Specifically, in our case, concepts such as the 'shapes' and 'colours' of 'tools' and 'objects' are the common properties that contribute to affordance learning during the tool-use experiment. Such abstract concepts are first rooted in the low-level grounded meaning of both the motor action and the visual perception. This is similar to the abstract grounding problem but not identical to it: the abstract examples are rooted directly in the sensorimotor interaction (such as 'give' and 'accept', or the numerical concepts shown in [7]), but the conceptualisation depends heavily on the abstraction of the common physical properties of the sensory inputs. Besides that, it needs an additional 'meta-learning' procedure which mathematically attempts to align the axes of the 'concepts' of the visual information while the affordance is being learnt. Importantly, such a meta-learning procedure can be bridged by affordance learning. Since affordance was first introduced to better understand the basic properties of objects independent of the specific actions of individuals, it links the learning agent and its subjective understanding of the world, through its motor action and sensing, to the world.

In our experiment, the concepts are disentangled from the tool-use data in an unsupervised manner. Specifically, the recent development of the β-VAE seems to provide an example of disentangling different concepts without any a priori knowledge or supervised learning. As shown in the analysis in [32], the additional constraint parameter provides additional representation capacity in the latent units z. This is done by adjusting the relative weighting of the KL divergence of independent concepts in the latent space while fitting the estimated posterior q(z|x) to the true posterior p(z|x). In the case of the β-VAE, since the components of z are independent (i.e. disentangled) concepts with independent physical properties, we can emphasise different concepts. Nevertheless, from the comparison between the VAE and β-VAE methods, we can conclude that such disentanglement is sensitive to the relative weighting (i.e. the parameter β) of the effect features that are encoded in different units and channels.

The results of this model also suggest the importance of the role of intentionality in embodied learning. We have seen that the disentanglement of concepts emerges from the agent's sensorimotor exploration, which guides the agent to select the regions that are at an intermediate level of difficulty. Furthermore, during the process of finding disentangled categories of concepts, we observed that different concepts sometimes still have the same values in the β-VAE. This is related to the fact that those concepts have similar values in the effect space (e.g. the movements of the ball and the lemon are similar, given the same motor action and tool). It also indicates that naive clustering using a simple motor action is not enough while the agent is learning the affordance. In that case, applying further exploratory actions is necessary for the interaction with the objects to handle this complexity. For example, in the case of the tool-use experiment, after the action of pushing, the robot can poke both of the objects one by one, so that the effect categories can be distinguished when the observations obtained from the poking action are taken into account, without any extrinsic reward. This can be driven by the intrinsic motivation mechanism [33, 34].

In the next step, our work can be extended to extracting the pre-symbolic representation of the 'concepts' of objects while the robot is situated in a robot–object interactive scenario. In the robotics community, though there are several existing works on obtaining and using object–action relations, not much work has successfully incorporated the different aspects of 'concepts' from the visual inputs and motor outputs in the process of affordance learning. On the basis of this interaction setting, we argue that when humans conceptualise the world, they do so not only based on the geometric appearance features of the objects but also on their integration with voluntary motor actions. On the other hand, though we have witnessed state-of-the-art deep learning methods perform well in object categorisation, we need more learning methods that understand concepts by using tool functions and observing the visual effects, connecting the objective world and the subjective prediction. We believe similar learning models (deep learning together with probabilistic learning) could be useful in future development in terms of their function of linking the concepts in the low-dimensional attention space with the peripheral signals in the high-dimensional representation space.

4.2 From conceptualisation to language development

The approaches to language acquisition can be coarsely divided into nativism and empiricism. Presently, most of the evidence (e.g. [29, 30]) suggests that environmental factors contribute to the majority of the process by which children acquire their language skills from an early age: the learning of language is embedded in the behaviour within the child's social context. This mechanism can be observed, for instance, when children gradually acquire the rules of grammar and the complexities of word comprehension that eventually lead to the production of words and sentences. This contradicts the views of nativists such as Chomsky and Pinker, who consider that lexical rules are not passed down through the child's environment [31]: the basic mechanism of learning a language, such as its grammar and syntax, is already rooted in the birth and development of the biological organism, which contains that language instinct.

4.3 Consciousness prior

The idea of building an abstract, low-dimensional representation of the features for tasks with a high-dimensional representation is an essential bridge between state-of-the-art deep learning methods and the higher cognitive abilities of artificial general intelligence (AGI). To accomplish this task, a new prior [35] should be used in the conscious state, which allows a low-dimensional computation, such as unconscious attention, to switch between different cognitive statuses. Such a prior is an abstraction of the high-dimensional observed representation, such as specific kinds of motor action outputs or sensory inputs. The cognitive function of such a prior is to cluster them into a concept (or a 'vector of thoughts'), so that the higher level of cognitive thought can be computed by the attention mechanism at the unconscious level. Such a vector can be physically conceptualised by the different dimensions of multi-modality inputs, a control signal for motor actions, or an inference result from a higher level of manipulation. Moreover, another advantage of such a prior is that it can also connect with language and symbolic representation.

In terms of machine learning, one of the key advantages of using the latent unit, which has a small-dimensional but rich representation, is to allow for improved generalisation results with novel data and unexpected noise. Our proposed model covers the