VR Facial Animation via Multiview Image Translation
SHIH-EN WEI, JASON SARAGIH, TOMAS SIMON, ADAM W. HARLEY∗, STEPHEN LOMBARDI,
MICHAL PERDOCH, ALEXANDER HYPES, DAWEI WANG, HERNAN BADINO, and YASER SHEIKH
Facebook Reality Labs
Fig. 1. We present a VR realtime facial animation system with headset mounted cameras (HMC) which augment a standard head mounted display (HMD).
Our method establishes precise correspondence between 9 camera images from a training HMC and the parameters of a photorealistic avatar. Finally, we use
a common subset of 3 cameras on a tracking HMC to animate the avatar in realtime.
A key promise of Virtual Reality (VR) is the possibility of remote social interaction that is more immersive than any prior telecommunication media. However, existing social VR experiences are mediated by inauthentic digital representations of the user (i.e., stylized avatars). These stylized representations have limited the adoption of social VR applications in precisely those cases where immersion is most necessary (e.g., professional interactions and intimate conversations). In this work, we present a bidirectional system that can animate avatar heads of both users' full likeness using consumer-friendly headset mounted cameras (HMC). There are two main challenges in doing this: unaccommodating camera views and the image-to-avatar domain gap. We address both challenges by leveraging constraints imposed by multiview geometry to establish precise image-to-avatar correspondences, which are then used to learn an end-to-end model for real-time tracking. We present designs for a training HMC, aimed at data collection and model building, and a tracking HMC for use during interactions in VR. Correspondences between the avatar and the HMC-acquired images are automatically found through self-supervised multiview image translation, which does not require manual annotation or one-to-one correspondence between domains. We evaluate the system on a variety of users and demonstrate significant improvements over prior work.

CCS Concepts: • Human-centered computing → Virtual reality; • Computing methodologies → Computer vision; Unsupervised learning; Animation.

Additional Key Words and Phrases: Face Tracking, Unsupervised Image Style Transfer, Differentiable Rendering

ACM Reference Format:
Shih-En Wei, Jason Saragih, Tomas Simon, Adam W. Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh. 2019. VR Facial Animation via Multiview Image Translation. ACM Trans. Graph. 38, 4, Article 67 (July 2019), 16 pages. https://doi.org/10.1145/3306346.3323030

∗Currently at Carnegie Mellon University; work done while at Facebook Reality Labs.

Authors' address: Shih-En Wei, shih-en.wei@fb.com; Jason Saragih, jason.saragih@fb.com; Tomas Simon, tomas.simon@fb.com; Adam W. Harley, aharley@cmu.edu; Stephen Lombardi, stephen.lombardi@fb.com; Michal Perdoch, michal.perdoch@oculus.com; Alexander Hypes, alexander.hypes@oculus.com; Dawei Wang, dawei.wang@oculus.com; Hernan Badino, hernan.badino@oculus.com; Yaser Sheikh, yasers@fb.com, Facebook Reality Labs, Pittsburgh, PA.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2019 Copyright held by the owner/author(s). 0730-0301/2019/7-ART67. https://doi.org/10.1145/3306346.3323030

1 INTRODUCTION

Virtual Reality (VR) has seen increased ubiquity in recent years. This has opened up the possibility for remote collaboration and interaction that is more engaging and immersive than achievable through other media. Concurrently, there has been great progress in generating accurate digital doubles and avatars. Driven by the gaming and movie industries, a number of compelling demonstrations of state-of-the-art systems have recently attracted interest in the community [Epic Games 2017; Hellblade 2018; Magic Leap 2018; Seymour et al. 2017; Unreal Engine 4 2018]. These systems show
highly photo-realistic avatars driven and rendered in real-time. Although impressive results are achieved, they are all designed for one-way interactions, where the actor is equipped with sensors optimally placed to capture facial expression. Unfortunately, these sensor placements are not compatible with existing VR-headset designs, which largely occlude the face. Thus, these systems are better suited to live performances than interaction.

If we consider instead works that are aimed at bidirectional communication in VR, we discover that existing systems mostly use non-photorealistic/stylized avatars [BinaryVR 2019; Li et al. 2015; Olszewski et al. 2016]. These representations tend to have a more limited range of expressivity, which renders errors from facial expression tracking less perceptible than with photo-real avatars. In this work, we argue that this is no coincidence, and that precise face tracking from consumer-friendly headset-mounted camera configurations is significantly harder than in conventional settings. There are two main reasons for this. First, instead of capturing a complete and unobscured view of the face, headset-mounted camera placements tend to provide only partial and non-overlapping views of the face at extreme and oblique angles. Minimizing reconstruction errors in these viewpoints often does not translate to correct results when viewed frontally. Secondly, these cameras often operate in the infrared (IR) spectrum, which is not directly comparable to the avatar's RGB appearance and makes analysis-by-synthesis techniques less effective. To partially alleviate these difficulties, existing systems are designed to work using structurally and mechanically sub-optimal sensor designs that are more accommodating to the computer-vision tracking problem. Despite this, their performance is still only suitable for stylized avatar representations.

Although the difficult sensor configurations of headset-mounted cameras (HMC) can prove challenging for classical face alignment approaches like [Saragih et al. 2009; Xiong and De la Torre 2013], in this work we show that end-to-end deep neural networks are capable of learning the complex mapping from sensor measurements to avatar parameters, given the availability of sufficient high-precision training examples relating the two domains. We demonstrate compelling results where fully expressive and accurate performances can be tracked in real time, at a precision matching the representation capacity of modern photo-realistic avatars. The challenge, then, is how to acquire the correspondences required to pose the problem as supervised learning.

Our solution for acquiring correspondence is to leverage multiview geometry in addressing both the problem of oblique viewpoints as well as the sensor-avatar domain gap. Specifically, we propose the use of a training HMC design that shares a sensor configuration with a consumer-friendly design (the tracking HMC), but has additional cameras to support better coverage and viewing angles while minimally disturbing the quality of data acquired from the shared cameras. With these additional viewpoints, classical analysis-by-synthesis constraints become more meaningful, and results generalize better to common vantage points (i.e., frontal views of the face). These additional views also provide more signal to improve the fidelity at which we can perform domain translation to address the sensor-avatar domain gap. With data collected from these cameras along with a pre-trained personalized avatar, our system learns to discover correspondences through self-supervision, without requiring any manual input or semantically defined labels. The results are highly precise estimates of the avatar's parameters that match the user's performance, which are suitable for learning an end-to-end mapping relating them to the tracking HMC.

In the following, we discuss prior work in §2, and present our method for finding image-to-avatar correspondence in §3, including hardware design, image style transfer, and the core optimization problem. Real-time facial animation is then covered in §4. We present results of our approach in §5, and conclude in §6 with a discussion and directions for future work.

2 RELATED WORK

2.1 Face Tracking in VR

Tracking faces in VR is a unique and challenging problem because the object we want to track is largely occluded by the VR headset. In the literature, solutions vary in how hardware designs are used to circumvent sensing challenges, as well as in methods to bridge the domain gap between sensor data and the face representation in order to find their correspondence. Specifically, the sensors used to build the face model and those used to later drive it typically comprise different camera configurations. Modeling sensors (i.e., cameras for building a shape and appearance model of the face) are typically multiview, high-fidelity RGB cameras with an unobstructed view of the face [Beeler et al. 2011; Dimensional Imaging 2016; Fyffe et al. 2014; Lombardi et al. 2018]. Tracking sensors (i.e., cameras for driving the face model to match a user's expressions) are typically mounted on the HMD, resulting in a patchwork of oblique and partially overlapping views of different facial parts with narrow depth of field, and typically operate in the infrared (IR) spectrum [Li et al. 2015; Olszewski et al. 2016]. Moreover, since the HMDs obscure the face, and modeling sensors require an unobscured view, data from modeling and tracking sensors cannot be captured concurrently.

As the eyebox in a VR headset is enclosed, the face is typically divided into an occluded upper face and a visible lower face. As such, some works use specialized strategies to obtain the facial state for each part separately, followed by a compositing step to get the full face state in realtime. In [Li et al. 2015] and [Thies et al. 2018], an RGBD sensor is used, which allows direct registration of the lower face with the model's geometry. To have a better viewpoint, Li et al. [2015] attach the sensor with a protruding mount, placing it slightly below the mouth. Similar to [BinaryVR 2019], this frontal viewpoint is optimal for modeling lower mouth expressions, but is not ideal from a hardware design standpoint. In [Thies et al. 2018], the sensor is placed in the environment, which limits the range of a user's head pose. For the upper face, Li et al. [2015] use strain gauges to sense voltage changes that accompany facial expressions. By building a skeleton headset without a display unit that otherwise occludes the upper face, they can acquire face model parameters corresponding to strain gauge measurements through depth registration. They use this to train a real-time upper face blendshape regressor. However, this input signal has low SNR, exhibits drift, and contains only limited information about facial expressions. In comparison, Thies et al. [2018] use infrared (IR) cameras pointing at the eye region to avoid interfering with VR usage. They use a calibration process where subjects are instructed to gaze at known positions to obtain
correspondence for training. Although effective in estimating gaze direction and eyelid blinks, it does not capture other upper-face expressions such as eyebrow motion and the temporal pattern of wrinkles in the forehead, nose, and areas surrounding the eyes.

If the parametric facial model exhibits appearance statistics matching those from the sensors used to drive the model, then "analysis-by-synthesis" or "vision as inverse graphics" approaches [Kulkarni et al. 2015; Nair et al. 2008; Yildirim et al. 2015] can be used to find correspondences. Here, the challenge is to find the parameters of the models that, when rendered from the corresponding camera view, match the images obtained from the driving sensors. This is typically achieved through variants of the gradient-descent algorithm [Blanz and Vetter 1999; Cootes et al. 1998; Thies et al. 2016]. Unfortunately, for VR applications, the domain gap between imaging sensors and modeling sensors makes it unlikely that minimizing the difference between the rendered model and the sensor's image will result in an expression matching that of the user. Although explicit landmark detectors can be used to define sparse geometric correspondence between the model and tracking images [Cao et al. 2014], landmarks alone lack expressiveness, and the oblique viewpoints from HMD-mounted cameras make reprojection errors less effective.

Other approaches correspond sensor data and face parameters using non-visual information such as manual semantic annotations of facial expressions and audio signals. For example, in [Olszewski et al. 2016], to model the entire face using a single RGB sensor for the lower face and IR cameras for the eyes, the subjects were instructed to perform a predefined set of expressions and sentences that were put in correspondence with a blendshape face animation of the same content. To generate more correspondence for training a neural network with a small temporal window, dynamic time warping of the audio signal was used to align the sentences with the animation. Although the approach demonstrated realistic results, the animation tends to exhibit tokenized expressions which look plausible but are not faithful reconstructions of the user's facial motion. This approach is also limited by the granularity at which facial expressions can be expressed, as well as by their repeatability.

Unpaired learning-based methods have also been investigated. In [Lombardi et al. 2018], synthetic renderings of the avatar are used together with real tracking sensor images to build a domain-conditioned variational autoencoder (VAE) while simultaneously learning a mapping from the latent space to face model parameters, using correspondences from the rendered domain exclusively. In that work, correspondences are found by leveraging parsimony in the VAE network, and the model is never trained directly for the target setting (i.e., input is headset images, output is avatar parameters). Although compelling results were demonstrated for speech sequences, its performance deteriorates with expressive content. Our method also leverages unpaired learning, but differs in that we explicitly transfer multiview IR images to avatar-like rendered images with Generative Adversarial Networks (GANs), which can generate high-quality modality-transferred images while better preserving facial expression. We also leverage the additional views provided by the training HMC, and infer more precise correspondences through differentiable rendering.

2.2 Image Style Transfer

Image style transfer is the task of transforming one image to match the appearance of another while retaining its semantic and structural content [Gatys et al. 2016]. Recent works [Isola et al. 2017] have tackled this task using GANs [Goodfellow et al. 2014], which can generate images with high realism. CycleGAN [Zhu et al. 2017] introduced the concept of cycle-consistency, which improves on the mode-collapse problem of GANs. Although architectural choices in CycleGAN, such as the U-net architecture, encourage structural consistency during transfer, it can still suffer from semantic drift, especially when the distributions of the two domains are not balanced. In [Fu et al. 2018], a geometric loss was added to encourage the preservation of spatial structure, and in [Mueller et al. 2018], a silhouette matching term was used instead. To further encourage the preservation of semantic information during transfer, [Bansal et al. 2018] extend the idea of cycle-consistency by utilizing temporal structures of data. Instead of adding additional terms, [Harley et al. 2019] employ an "uncooperative" optimization strategy to prevent the semantic drift caused by the forward and backward transformations colluding to complement errors produced by each other. In our method, the preservation of facial expression is particularly important, because we rely on generated (fake) images as supervision for optimizing face parameters. Besides carefully matching distributions and using uncooperative optimization, we present cross-view cycle consistency to further enforce semantic preservation, utilizing synchronized multiview data from our designed HMC.

2.3 Differentiable Rendering

In analysis-by-synthesis approaches [Tewari et al. 2017; Thies et al. 2018], differentiable rendering of a textured mesh is necessary to allow the error signal to flow from pixel errors to the parameters of the graphics engine in a system trained end-to-end. However, rendering involves a discrete rasterization step that assigns triangles to every pixel; this step is non-differentiable, making it hard to propagate gradients from pixel errors to the mesh geometry. Tewari et al. [2017] circumvent this issue by formulating the loss over the vertices instead of the image pixels. Kato et al. [2018] address the problem by approximating gradients according to the geometric relationship between a triangle and a pixel. In our application, this problem manifests as failures in matching face silhouettes, and we address it through a formulation whereby gradients of background pixels that are not covered by any rasterized triangle can still be backpropagated to affect mesh geometry.
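The following toy sketch illustrates this idea in isolation; it is our own simplification for exposition, not the paper's formulation (detailed in §3.4). A soft coverage map is built by splatting projected 2D vertices (here `verts`, `sigma`, and the box-shaped `target` silhouette are arbitrary stand-ins), so a photometric loss at pixels the geometry does not cover still produces gradients on the vertices:

```python
import torch

H = W = 64
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")
pix = torch.stack([xs, ys], dim=-1)                      # (H, W, 2) pixel centers

# Projected 2D vertices of the geometry (toy values, optimizable).
verts = torch.tensor([[20.0, 30.0], [42.0, 30.0]], requires_grad=True)
sigma, fg, bg = 6.0, 1.0, 0.0                            # splat width, face/background intensity

# Soft coverage: union of Gaussian splats around each projected vertex.
d2 = ((pix[None] - verts[:, None, None]) ** 2).sum(-1)   # (V, H, W) squared distances
alpha = 1.0 - torch.prod(1.0 - torch.exp(-d2 / (2 * sigma ** 2)), dim=0)  # (H, W)

target = torch.zeros(H, W)
target[25:40, 15:50] = 1.0                               # observed silhouette

# Photometric loss on the full composite, including background pixels.
loss = ((alpha * fg + (1 - alpha) * bg - target) ** 2).mean()
loss.backward()

# Background pixels (alpha ~ 0) still contribute: the mismatch there pulls
# the vertices toward the uncovered parts of the target silhouette.
print(verts.grad)
```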
3 ESTABLISHING CORRESPONDENCE

In recent years, end-to-end regression from images has proven effective for high-quality and real-time facial motion estimation [Laine et al. 2017; Tewari et al. 2017]. These works leverage the representation capacity of deep neural networks to model the complex mapping between raw image pixels and the parameters of a parametric face model. With input-output pairs, the problem can be posed within a supervised learning framework, where a precise mapping that generalizes well can be found (see §4). The main challenge, therefore, is to acquire a high-quality training set: a sufficient
number of precise correspondences between input HMC images and avatar parameters spanning diverse facial expressions. This is particularly challenging for VR applications, where existing methods have significant limitations, as described in §2.1.

Fig. 2. Headset mounted cameras (HMC): (a) Training HMC, with standard cameras (circled in red) and additional cameras (circled in blue). It is used for collecting data to help establish better correspondence between HMC images and avatar parameters. (b) Tracking HMC, used for realtime face animation, with a minimal camera configuration. (c) Example of captured images, with colored frames indicating standard views (red) and additional views (blue). (d) Multi-plane calibration pattern used to geometrically calibrate all cameras in the training and tracking HMCs. (e) An illustration of the challenges of ergonomic camera placement (blue), where large motions such as mouth opening project to small changes in the captured image, in comparison to more accommodating camera placements (orange).

In this work, we establish high-quality correspondence using domain-transferred multi-view analysis-by-synthesis. We describe our training- and tracking-HMC designs in §3.1. Our approach for multiview-consistent correspondence estimation is then described in §3.2, with a detailed treatment of domain transfer in §3.3 and background-aware differentiable rendering in §3.4.

3.1 Data Capture

Consumer-grade VR headset designs need to be structurally and mechanically robust, ergonomic, and aesthetically pleasing. HMC designs that fit these criteria typically have partial and oblique views of the face, which are challenging to use with the image-space reconstruction losses typically employed in registration algorithms; Fig. 2(e) illustrates this problem. For this reason, most published [Li et al. 2015; Olszewski et al. 2016] and commercial [BinaryVR 2019] VR face tracking systems are designed to operate with more accommodating designs, at the expense of size, weight, and cost considerations. The core idea of our work is that the challenging images acquired from […]

All cameras are synchronized and capture at 90Hz. They were geometrically calibrated using a custom 3D-printed calibration pattern that ensures that parts of the pattern are within the depth of field of each camera, as shown in Fig. 2(d).

To build the dataset, we captured each subject twice using the same stimuli; once using the training HMC, and again using the tracking HMC. The content included 73 expressions, 50 sentences, a range of motion, a range of gaze directions, and 10 minutes of free conversation. This set was designed to cover the range of natural expressions. Collecting the same content with both devices ensures a roughly balanced distribution of facial expressions between the two domains, which is important for unpaired domain transfer algorithms to work well (see §3.3).

3.2 Overall Algorithm

[Fig. 3. Overview of the correspondence pipeline: HMC images are passed through image style translation and background-aware differentiable rendering (using the HMC camera calibration and an ℓ1 loss) to establish correspondence between HMC images and avatar parameters, i.e., the headset's 6-DOF pose and the latent code driving the avatar's mesh and view-dependent texture.]

We illustrate our overall algorithm to establish correspondences in Fig. 3. It assumes the availability of a pre-trained personalized parametric face model. For the experiments in this paper, we use the deep appearance model [Lombardi et al. 2018], which generates geometry and view-conditioned texture from an l-dimensional latent code z ∈ ℝ^l and a 6-DOF rigid transform v ∈ ℝ^6 from the avatar's reference frame to the headset (represented by a reference camera), using a deep deconvolutional neural network D:

    M, T ← D(z, v).    (1)

Here, M ∈ ℝ^{n×3} is the facial shape comprising n vertices, and T ∈ ℝ^{w×h} is the generated texture. A rendered image R can be generated from this shape and texture through rasterization:

    R ← ℛ(M, T, A(v)),    (2)

where ℛ denotes the rasterizer and A denotes the camera's projection function.
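As a concrete reading of Eqs. (1)-(2), the following PyTorch sketch wires a stand-in decoder to a pinhole projection. The toy MLP `AvatarDecoder`, the camera intrinsics in `project`, and the axis-angle pose encoding are illustrative assumptions; the paper's D is a deep deconvolutional network, and rasterization of the projected mesh is omitted here:

```python
import torch
import torch.nn as nn

class AvatarDecoder(nn.Module):
    """Stand-in for D in Eq. (1): maps latent code z and 6-DOF pose v to
    mesh vertices M (n x 3) and a texture T. A toy MLP replaces the
    paper's deconvolutional, view-conditioned texture decoder."""
    def __init__(self, latent_dim=128, n_verts=1024, tex_size=64):
        super().__init__()
        self.n_verts, self.tex_size = n_verts, tex_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 6, 256), nn.ReLU(),
            nn.Linear(256, n_verts * 3 + tex_size * tex_size),
        )

    def forward(self, z, v):
        out = self.net(torch.cat([z, v], dim=-1))
        M = out[..., : self.n_verts * 3].reshape(-1, self.n_verts, 3)
        T = out[..., self.n_verts * 3 :].reshape(-1, self.tex_size, self.tex_size)
        return M, T

def project(M, v, f=400.0, cx=320.0, cy=240.0):
    """A(v) in Eq. (2): rigid transform by the headset pose v
    (axis-angle rotation + translation), then pinhole projection.
    Batch size 1 for brevity; intrinsics are arbitrary placeholders."""
    rot, t = v[..., :3], v[..., 3:]
    theta = rot.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = rot / theta
    K = torch.zeros(3, 3)                      # skew-symmetric matrix of axis k
    K[0, 1], K[0, 2] = -k[0, 2], k[0, 1]
    K[1, 0], K[1, 2] = k[0, 2], -k[0, 0]
    K[2, 0], K[2, 1] = -k[0, 1], k[0, 0]
    R = torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)
    X = M @ R.T + t[:, None, :]                # vertices in the camera frame
    uv = X[..., :2] / X[..., 2:].clamp(min=1e-6)   # perspective divide
    return torch.stack([f * uv[..., 0] + cx, f * uv[..., 1] + cy], dim=-1)

z, v = torch.zeros(1, 128), torch.zeros(1, 6)
M, T = AvatarDecoder()(z, v)    # Eq. (1)
uv = project(M, v)              # 2D vertex positions for one camera
```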
Given multiview images H = {H_i}_{i∈C} acquired from a set of headset cameras C, our goal is to estimate the user's facial expression as seen in these views. We solve this by estimating the latent parameters z and the headset's pose v that best align the rendered face model with the acquired images. However, instead of performing this task separately for each frame in a recording, similar to Tewari et al. [2017], we simultaneously estimate these attributes over the entire dataset comprising thousands of multiview frames.
Specifically, we estimate the parameters θ of a predictor E_θ that extracts {z_t, v_t}, the latent code and the headset's pose for frame t ∈ T, by only considering data from all cameras at that time instant:

    {z_t, v_t} ← E_θ({H_it}_{i∈C}), ∀t ∈ T.    (3)

Note that the same predictor is used for all frames in the dataset. Analogous to non-rigid structure from motion [Xiao et al. 2006], this has the benefit that regularity in facial expression across time can further constrain the optimization process, making the estimation process more resistant to terminating in poor local minima.

Due to the domain gap between the rendered images R_it and the camera images H_it, they are not directly comparable. To address this, we also learn the parameters ϕ of a view-dependent domain transfer network:

    R̂_it = F_ϕ(H_it), ∀i ∈ C.    (4)

In its simplest form, this function is comprised of independent networks for each camera. With this, we can formulate the analysis-by-synthesis reconstruction loss:

    L(θ, ϕ) = Σ_{t∈T} ( Σ_{i∈C} ‖R̂_it − R_it‖₁ + λ δ(z_t) ),    (5)

where R_it is the rendered face model from Eq. (2), rasterized using the known projection function A, whose parameters are obtained from the calibrated cameras as mentioned in §3.1. Here, δ is a regularization term over the latent codes z_t, and λ weights its contribution against the ℓ1-norm reconstruction of the domain-transferred image.

Although at first glance this formulation appears to be reasonable, with high-capacity encoder E_θ and domain-transfer F_ϕ functions there is a space of solutions where one network can compensate for the semantic error incurred by the other, leading to low reconstruction errors but incorrect estimation of the expression z_t and headset pose v_t. Without additional constraints, we observe that this phenomenon often occurs in practice; we refer to it as collaborative self-supervision. When the domain gap consists primarily of appearance differences, we conjecture that collaborative self-supervision tends to be more prominent in architectures that do not retain spatial structure. This is the case in our system, where the latent code z_t is a vectorized encoding of the image. This problem has been observed in the style-transfer community, where a common solution is to use fully convolutional architectures with skip connections [Isola et al. 2017; Zhu et al. 2017]. As spatial structure is propagated through the entire network, it is easier to retain structure, and it thus tends to remain unchanged when differences in style can be well explained by appearance alone.

Motivated by the compelling results achieved by methods such as [Zhu et al. 2017], we decouple Eq. (5) into two stages. First, we learn the domain transfer F_ϕ that converts headset images H_it into fake rendered images R̂_it without changing the apparent expression (i.e., the semantics). In the second stage, we fix F_ϕ and optimize Eq. (5) with respect to E_θ.
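A condensed sketch of this second stage is given below, assuming placeholder callables `predictor` (E_θ), `decoder` and `render` (Eqs. (1)-(2)), and a `domain_transfer` network (F_ϕ) trained in the first stage; the ℓ2 stand-in for δ, the data layout, and the signatures are our own assumptions:

```python
import torch

def stage_two_step(predictor, decoder, render, domain_transfer,
                   frames, optimizer, lam=1e-3):
    """One optimization step of Eq. (5) w.r.t. E_theta only.
    `frames` is a minibatch of multiview sets {H_it} for frames t in T;
    F_phi (domain_transfer) was trained in stage one and is kept fixed.
    `lam` is the regularizer weight (illustrative value)."""
    optimizer.zero_grad()
    loss = 0.0
    for H_views in frames:                      # H_views: (num_cams, C, H, W)
        z, v = predictor(H_views)               # Eq. (3): one (z_t, v_t) from all views
        M, T = decoder(z, v)                    # Eq. (1)
        for i, H_i in enumerate(H_views):
            R_i = render(M, T, cam=i, pose=v)   # Eq. (2) per calibrated camera
            with torch.no_grad():               # F_phi is frozen in stage two
                R_hat_i = domain_transfer(H_i.unsqueeze(0), cam=i)
            loss = loss + (R_hat_i - R_i).abs().mean()   # l1 term of Eq. (5)
        loss = loss + lam * z.pow(2).mean()     # simple l2 stand-in for delta(z_t)
    loss.backward()                              # gradients flow to E_theta only
    optimizer.step()
    return loss.item()
```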
3.3 Expression-Preserving Domain Transfer

Our expression-preserving transfer is based on unpaired image translation networks [Zhu et al. 2017]. This architecture learns a bidirectional domain mapping, F_ϕ and G_ψ, by enforcing cyclic consistency between the domains and an adversarial loss for each of the
[…]

[Fig. 4. Matching distributions across domains: captured HMC images H arise from real-world imaging of the unknown P(z, v), while synthetically rendered avatar images R̂ are produced by controlled rendering from an approximated P̂(z, v), constructed from the capture stimuli using landmark detection, P(v), and P(z).]

Fig. 5. Landmarks on the texture map of the avatar (left) and HMC images (right). Colors of landmarks indicate the point-to-point correspondence across domains. We detect landmarks on all 9 views (only 3 views are shown).

[…]
Fig. 6. Cross-view cycle consistency in multiview image style translation. In addition to the original cycle consistency within each domain, we further constrain the style transformers given paired multiview data in each domain, to encourage preserving the spatial structure of the face images. Only one loss term (out of all 4 directions) is shown here.

of the face modeling capture [Lombardi et al. 2018], independently. Together with {H_i^t}, we now have training data for unpaired style transfer. We discard the estimated z^t solved by Eq. (6), since these estimates tend to be inaccurate when relying solely on landmarks.

3.3.2 Multiview Image Style Translation. Given images from the two domains {H_i^t} and {R_i^s}, one can utilize a method such as [Zhu et al. 2017] to learn view-specific mappings F_ϕ,i and G_ψ,i that translate images back and forth between the domains. Since we are careful to encourage a balanced distribution P(z, v) between the domains, this straightforward approach already produces reasonable results. The main failure cases occur due to the limited rendering fidelity of the parametric face model. This is most noticeable in the eye images, where eyelashes and glints are almost completely absent due to poor generalization of the view-conditioned rendering method of [Lombardi et al. 2018] to very close viewpoints. Specifically, the style transferred image often exhibits a modified gaze direction compared to the source. In addition, this modification is often inconsistent across different camera views. This last effect is also observed for the rest of the face, though not to the same extent as for the eyes. When solving for (z, v) using constraints from all camera views in Eq. (5), these inconsistent and independent errors have an averaging effect, which manifests as dampened facial expressions.

To overcome this problem, we propose a method that exploits the spatial relationship between cameras during image style translation. Inspired by [Bansal et al. 2018], where additional cyclic-consistency constraints are obtained through temporal prediction, we enforce cyclic consistency through cross-view prediction. Specifically, for a pair of views, denoted 0 and 1 for simplicity, we train "spatial-predictors" P_0 and P_1 to transform images in view 0 to view 1 and vice versa. These pairs are chosen in such a way that they observe similar parts of the face to ensure that their contents are mutually predictable, for example the stereo eye-camera pair, or lower-face cameras on the same side. Together with the standard terms of CycleGAN [Zhu et al. 2017], our loss function takes the form:

L = L_C + λ_G L_G + λ_P L_P + λ_V L_V,  (7)

where L_C = L_CH + L_CR is the cycle-consistency loss,

L_CH = Σ_{i ∈ {0,1}} Σ_t ‖ G_ψ(F_ϕ(H_i^t)) − H_i^t ‖_1,  (8)

L_G = L_GH + L_GR is the GAN loss (for both the generator and the discriminator),

L_GH = Σ_{i ∈ {0,1}} ( Σ_t log D_H(H_i^t) + Σ_s log(1 − D_H(G_ψ(R_i^s))) ),  (9)

L_P is the loss for the view predictor,

L_P = Σ_{i ∈ {0,1}} ( Σ_t ‖ P_i(H_i^t) − H_{1−i}^t ‖_1 + Σ_s ‖ P_i(R_i^s) − R_{1−i}^s ‖_1 ),  (10)

and finally the cross-view cycle consistency L_V = L_VH + L_VR,

L_VH = Σ_{i ∈ {0,1}} Σ_t ‖ P_i(F_ϕ(H_i^t)) − F_ϕ(H_{1−i}^t) ‖_1,  (11)

where L_CR, L_GR, and L_VR are defined symmetrically, and D_H and D_R are the discriminators in the two domains, respectively. Note that while we do not have paired H_i^t and R_i^s, we do have paired H_i^t and H_{1−i}^t, and likewise paired R_i^s and R_{1−i}^s. An illustration of these components is shown in Fig. 6. Different from [Bansal et al. 2018], in our formulation we share P_0 and P_1 across domains, since the relative structural difference between the views should be the same in both domains.

Our problem takes the form of a minimax optimization problem:

min_{P_0, P_1, F_ϕ, G_ψ} max_{D_H, D_R} L(P_0, P_1, F_ϕ, G_ψ, D_H, D_R),  (12)

which we repeat for every pair of views. Compared to standard CycleGAN, our problem involves the additional P_i modules. If we simply alternately train the parameters in {P_0, P_1, F_ϕ, G_ψ} and the parameters in {D_H, D_R} as in [Bansal et al. 2018], we find that collusion between P and F_ϕ (or G_ψ) can minimize the loss function without preserving expression across the domains, effectively learning different behaviors on real and fake data to compensate for errors made by each other. As a result, the semantics that we want to keep unchanged get lost during the style transformation. To address this problem, we apply "uncooperative training", recently proposed by Harley et al. [2019], which prevents this "cheating" by breaking the optimization into more steps. At each step, the loss function is readjusted so that only terms that operate on real data remain, and only modules that take real data as input are updated. An outline of the algorithm is presented in Algorithm 1. In this way, modules have no chance to learn to compensate for errors made by previous modules. As a result, expressions are better preserved through domain transfer and the cross-view predictions ensure multiview consistency.
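To make the structure of the loss terms above concrete, the following is a minimal PyTorch sketch of the per-pair terms of Eqs. (8), (10), and (11) as we read them; the helper names (l1, cross_view_losses) are ours, the symmetric R-domain terms are analogous, and this is an illustration rather than the authors' released code:

import torch

def l1(a, b):
    # Mean absolute error, standing in for the L1 norms in Eqs. (8)-(11).
    return (a - b).abs().mean()

def cross_view_losses(F, G, P, H, R):
    # H and R map a view index i in {0, 1} to a batch of images from the
    # HMC domain and the rendered domain; P[i] predicts view 1-i from view i.
    L_CH = sum(l1(G(F(H[i])), H[i]) for i in (0, 1))            # Eq. (8)
    L_P = sum(l1(P[i](H[i]), H[1 - i]) +
              l1(P[i](R[i]), R[1 - i]) for i in (0, 1))         # Eq. (10)
    # Eq. (11): predicting across views should commute with translating
    # across domains, which ties the two style transformers together.
    L_VH = sum(l1(P[i](F(H[i])), F(H[1 - i])) for i in (0, 1))
    return L_CH, L_P, L_VH

Here F, G, and P[i] are image-to-image networks; under the uncooperative schedule of Algorithm 1 below, each of these terms would be recomputed with a fresh forward pass before updating the module it supervises.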
ALGORITHM 1: Uncooperative training for multiview image style translation
Input: Unpaired real HMC images {H_i^t} and real rendered images {R_i^s}, in two common views i ∈ {0, 1}.
Output: Converged F_ϕ and G_ψ.
Initialize parameters in all modules;
repeat
    Sample (t, s) to get H_i^t, R_i^s for i ∈ {0, 1} (total 4 images);
    Update ϕ for F using gradients minimizing L_CH + L_GH + L_VH;
    Update ψ for G using gradients minimizing L_CR + L_GR + L_VR;
    Update P using gradients minimizing L_P;
    Update D_H and D_R using gradients maximizing L_G;
until converged;
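As a concrete reading of this schedule, here is a minimal PyTorch sketch of one iteration of Algorithm 1; the optimizer plumbing and names (opts, make_losses) are hypothetical:

import torch

def uncooperative_iteration(opts, make_losses):
    # opts: dict mapping module-group names ("F", "G", "P", "D") to
    # torch.optim optimizers over the corresponding parameters.
    # make_losses: dict of zero-argument callables that run a fresh forward
    # pass and return the scalar objective for that group, e.g.
    #   make_losses["F"] = lambda: L_CH + L_GH + L_VH
    #   make_losses["D"] = lambda: -L_G   # maximize L_G by minimizing -L_G
    # Updating each group only on freshly recomputed terms, in separate
    # steps, is what denies modules the chance to collude.
    for name in ("F", "G", "P", "D"):
        opts[name].zero_grad()
        make_losses[name]().backward()
        opts[name].step()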
Fig. 7. Background-aware differentiable rendering. (a) Input HMC image with a puffed cheek. (b) Style transferred result (target image). (c) Currently rendered avatar. (d) Overlay of (b) and (c): many pixels in the red box are currently rendered as background pixels, but should be in the foreground. (e) A closer look at pixels around the face contour. For a background pixel p within a bounding box of any projected triangle (dashed rectangle), its color is blended from the color C_t at the closest point on the closest edge p1p2 and the background color C_b, with a weighting related to the distance d_p. The red arrows indicate that the gradient generated from the dark gray pixels can be back-propagated to the geometry of the face. [Legend: foreground pixel; background pixel (contributing gradients); background pixel (not contributing gradients); projected mesh; target face contour.]
3.4 Background-aware Differentiable Rendering
A core component of our system is a differentiable renderer R for the parametric face model. It is used to generate synthetic samples for the domain transfer described above, but also for evaluating the reconstruction accuracy given the estimated expression and pose parameters q = (z, v) in Eq. (5). Our rendering function blends a rasterization of the face model's shape with the background, so that the pixel color C(p) at image position p is defined as:

C(p) = W(p) C_t(p) + (1 − W(p)) C_b,  (13)

where C_t(p) is the color rasterized from the texture at position p, and C_b is a constant background color. If we simply define W as a binary mask of the rasterization's pixel coverage, where W(p) = 1 if p is within a triangle and W(p) = 0 otherwise, then dW(p)/dq would be zero for all p because of the discreteness of rasterization. In this case, for a foreground pixel (i.e., W(p) = 1) the gradient of C(p) can still be calculated from dC_t(p)/dq, by parameterizing the coordinates in the texture (from which the pixel color is sampled) by the barycentric coordinates of that pixel in its currently enclosing triangle. Although this way of formulating the rendering function and its derivative can already produce good results, in practice, in the presence of multiview constraints, it exhibits failure cases that result from the zero gradients of W(p). Specifically, if p is rendered as background (i.e., W(p) = 0) but the target for that pixel is a foreground value, no gradients are propagated to the parameters q. Similarly, a foreground pixel at the boundary of the rasterization has no pressure to expand. In practice, this can lead to terminating in poor local minima with large reconstruction errors. For example, in a puffed-cheek expression, where the foreground of the target image tends to occupy a larger area of the image, the estimated expression often fails to match the contour of the cheek well (see Fig. 7).

Intuitively, the force expanding the foreground area should come from a soft blending around the boundary between foreground and background. Therefore, instead of a binary assignment of pixels around the boundary to either a color sampled from the texture map or the background color, we employ soft blending similar to anti-aliasing. It is important to parameterize the blending weight as a function of the face model's projected geometry so that reconstruction errors along the rasterization's boundary can be back-propagated:

W(p) = exp(−d_p² / σ²),  (14)

where d_p is the perpendicular 2D distance from a background pixel p to the closest edge of any projected triangle, for pixels outside the rasterization coverage, and σ controls the rate of decay. The value of C_t used in Eq. (13) for such a pixel is set to the color in the texture of the triangle at the closest edge. For pixels within the coverage, d_p = 0. In practice, we set σ = 1 and, for efficiency, evaluate W only for pixels within the enclosing rectangles of each projected triangle. With this background-aware rendering, even though only a small portion of background pixels contribute gradients to expand or contract the boundary at each iteration, it is enough to prevent the optimization from terminating in poor local minima.
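The soft blend of Eqs. (13) and (14) is simple to express; the following sketch (hypothetical helper names, with the distance map d_p assumed to be precomputed as a differentiable function of the projected vertices) illustrates it:

import torch

def blend_weight(d_p, sigma=1.0):
    # Eq. (14): W(p) = exp(-d_p^2 / sigma^2). d_p is zero inside the
    # rasterization coverage, so W = 1 there and decays smoothly for
    # background pixels near the projected triangle edges.
    return torch.exp(-d_p.pow(2) / sigma ** 2)

def composite(C_t, C_b, d_p, sigma=1.0):
    # Eq. (13): blend the texture color C_t with the background color C_b.
    # For a boundary background pixel, C_t holds the texture color at the
    # closest edge, so image-loss gradients can reach the face geometry
    # through both C_t and, via d_p, the blending weight W.
    W = blend_weight(d_p, sigma)
    return W * C_t + (1.0 - W) * C_b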
3.5 Implementation Details
For the style transformation described in §3.3, we use (256 × 256)-sized images for both domains, and adapt the architecture design from [Zhu et al. 2017]. For F_ϕ, G_ψ, and P_i, we use a ResNet with 4× downsampling followed by 3 ResNet modules and another 4× upsampling. For the discriminators D_H and D_R, we apply spectral normalization [Miyato et al. 2018] for better quality of the generated images and more stable training. For E_θ in the overall training described in §3.2, we build separate convolutional networks to convert the individual H_i^t to |C| vectors, which are concatenated and then converted into both z^t and v^t using separate MLPs. For the prior δ(z^t) in Eq. (5), we use an L2-loss δ(z^t) = ‖z^t‖²_2, because the latent space associated with D is learned with a KL divergence against a standard normal distribution [Lombardi et al. 2018].
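A sketch of this encoder structure follows; the channel counts, depths, and pose parameterization are illustrative guesses, not the paper's exact architecture:

import torch
import torch.nn as nn

class TrackingEncoder(nn.Module):
    """Sketch of E_theta from Sec. 3.5: one small CNN per camera view, whose
    output vectors are concatenated and mapped by separate MLPs to the
    expression code z and the headset pose v."""
    def __init__(self, n_views=9, feat_dim=256, z_dim=256, v_dim=6):
        super().__init__()
        self.backbones = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(128, feat_dim),
            ) for _ in range(n_views)])
        self.z_head = nn.Sequential(nn.Linear(n_views * feat_dim, 256),
                                    nn.ReLU(), nn.Linear(256, z_dim))
        self.v_head = nn.Sequential(nn.Linear(n_views * feat_dim, 256),
                                    nn.ReLU(), nn.Linear(256, v_dim))

    def forward(self, views):
        # views: list of |C| grayscale image batches, one per camera.
        f = torch.cat([net(x) for net, x in zip(self.backbones, views)], dim=1)
        return self.z_head(f), self.v_head(f)

The same structure with n_views=3 matches the tracking-HMC setting.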
Fig. 8. Qualitative results of established correspondence using the training HMC. For each subject, the first row shows input images from the training HMC and detected landmarks. The second row shows the avatar rendered with the corresponding facial expression and headset position found by our method (coupled with the camera calibration of each view), with projected landmarks whose uv-coordinates are optimized by Eq. (6). The leftmost image is rendered from a fixed frontal view. The last row overlays the previous two, where good alignment indicates high-fidelity correspondence.
show evidence that background-aware rendering leads to better convergence. Finally, we discuss the effects of using different loss functions in the optimizations of Eq. (5) and Eq. (15).

5.2.1 Effect of in-domain pixel matching, additional viewpoints, and distribution matching. To show the importance of using multiview style transfer in Eq. (5), we compare our results with (a) only using domain-invariant features: minimizing the 2D distance between detected landmarks and projected landmarks of the avatar on all 9 views, plus matching image gradients (edges); (b) only using 3 tracking views: training independent per-view style transfer and solving Eq. (5) with only the tracking views; and (c) not matching the distribution of HMC pose: simply using a mean pose to render {R_i^s} for style transfer, instead of solving Eq. (6) to collect a set of HMC poses to sample from.
Fig. 9. More examples of established correspondence. Each subfigure shows 8 facial expressions for one identity. The 9 grayscale images, with 3 tracking views (large images) and 6 additional training views (small images), are the input HMC images; the rendered 3D avatar is shown in color on the right. Our method can find high quality mappings even for subtle expressions on the upper face, where the camera angle is oblique and close to the subject. Our method can also capture subtle differences in the tongue, teeth, and eyes, where the avatar does not have detailed geometry.
Fig. 10. Predictions on held out data for a tracking HMC. In each subfigure, the target image is “ground-truth” from our established correspondences,
and the prediction image is the output from Ẽ. While most of the time we can map from only 3-view inputs to targets successfully, there are some failure
cases highlighted in the red boxes, where the nuances are hard to observe without the additional cameras on the training HMC.
In Fig. 11, we show 8 expressions to compare. Results that only use landmark and edge constraints can roughly match the shape of the mouth, but completely fail to match parts such as the teeth, tongue, and gaze direction, as landmarks cannot describe these details on a photorealistic avatar. Matching image gradients does not improve the results because the domain gap between HMC images and rendered avatars is too large for naive edge matching to be useful. Using only 3 tracking views with view-independent style transfer and Eq. (5) already generates much better results than the 9-view results in (a), showing the importance of bridging the domain gap. But because of the obliqueness of the viewpoints, it sometimes fails to generate precise results when compared to using the full 9 views, such as in estimating the amount of mouth openness (1st, 5th expressions), the mouth interior (6th), and the lip shape (7th). Finally, the results from (c) are in general worse than the full method, showing the importance of matching the distribution of spatial structure. The quality of the style transferred results degrades, leading to uncanny faces (1st, 4th, 8th), and semantics are sometimes modified, such as gaze (3rd).

5.2.2 Effect of view-consistent image style transfer. When training the style transformers F_ϕ and G_ψ for all 9 views, we found that training them independently for each view already gives good results, despite artifacts on the fake rendered images such as checkerboard patterns. By jointly considering multiple views and a robust L1 loss in Eq. (5), these uncorrelated artifacts are often averaged out. However, other artifacts, such as a wrong color of the mouth interior or changes in gaze direction, do impact the final result, as illustrated in Fig. 12. To show the effect of the cross-view training described in §3.3, we group the 9 camera views into 4 pairs and run Algorithm 1 for each pair (and still train the remaining view individually). The resulting style transferred images have fewer artifacts on the face in general. We highlight the cases where the per-view CycleGAN baseline fails to capture the visibility of the tongue and a consistent gaze direction. Using the cross-view consistency loss in our method, in contrast, gives consistent images across views, and therefore the final optimized avatar better matches the inputs.

5.2.3 Effect of background-aware differentiable rendering. We compare our background-aware differentiable rendering with a baseline where background pixels are simply assigned a constant color and therefore have no path to back-propagate gradients from errors in pixel color to the mesh geometry. In particular, we show the image loss curve during convergence to a target puffed cheek in Fig. 13. Even though there are constraints from 9 views, which already greatly alleviates the background problem, the baseline still plateaus at a higher loss, and the cheek does not fully expand to match the inputs. In contrast, our method allows gradients from the image loss to affect the geometry through Eq. (13) and therefore shows a lower loss and better face alignment.
Fig. 11. Ablation on system design. We compare our method with simpler settings: (a) without style transformation, only using landmark and edge matching; (b) without additional cameras, only using the 3 tracking views; and (c) without matching the distribution of headset positions, only using the mean headset pose to render data for style transfer. Each of these settings fails to obtain good correspondences, in different ways.
5.2.4 Is image loss alone enough to make Eq. (5) converge? While style transferred images provide supervision with great detail, it is a common observation that the convergence basin when using a differentiable renderer can be small and therefore requires careful initialization. However, we found that the convergence basin of our method is large enough that no sample-specific initialization is required. In Fig. 14, we compare two training schemes with the same initialization for θ: (1) use landmark constraints (an L2-loss of 2D distances on images) for the first 2000 iterations to bring the geometry into roughly the correct position, and then remove them to leave only the image loss to fine-tune the result, and (2) only use an image loss, as presented in our method. The results show that these two schemes achieve the same level of image loss at convergence, showing that the convergence basin of (2) is good enough to avoid getting stuck in local minima. Fig. 14 also shows that minimizing a landmark loss can lead to gradients that contradict the image loss (the red solid curve goes up after iteration 2000). This is because of the imprecision of landmark detection in HMC images and of uv-coordinate estimation on the avatar's texture map. While the landmark loss can be used to find good headset positions, for facial expression, which requires high precision, it can only provide a relatively rough signal that ultimately becomes unnecessary given the good convergence basin of the background-aware renderer.

5.2.5 Why do we need geometry and texture terms in Eq. (15)? To fit the established correspondences in §4, we did not simply apply a loss on the predicted z^t but also on the geometry M and texture T. Ideally, minimizing differences in z^t should also lead to reduced differences in the appearance of the avatar. However, Fig. 15 shows that if we don't add the geometry and texture terms, while the geometry error is still well minimized, the texture suffers from a much higher error, which leads to mismatched expressions such as changes in gaze direction or mouth interior (i.e., regions that do not have detailed geometry). It is also interesting to see that the decrease of the distance in the z^t latent space becomes much slower if we add these terms, showing that z^t and appearance (at least in texture space) are not strongly coupled through the decoder.
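Eq. (15) is defined earlier in the paper and is not reproduced in this section; consistent with the discussion above and the legend of Fig. 15, a plausible form (our hedged reconstruction, with z̃^t the prediction of the tracking model Ẽ and M, T the decoded geometry and texture) is

L_fit = Σ_t ( ‖ z̃^t − z^t ‖²_2 + λ1 ‖ M(z̃^t) − M(z^t) ‖²_2 + λ2 ‖ T(z̃^t) − T(z^t) ‖²_2 ),

so that setting λ1 = λ2 = 0 corresponds to fitting the latent code alone, as in the Fig. 15 legend.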
5.3 Comparison with prior art
We compare our animation results with our reimplementations of [Olszewski et al. 2016] and [Lombardi et al. 2018]. To compare with [Olszewski et al. 2016], we select a short temporal window of HMC images labeled as a certain peak expression, and map them to the encoded z^t of the corresponding expression captured with the modeling sensors during avatar building. We also performed dynamic time warping to determine correspondence labels for speech data.
Fig. 12. Comparison between per-view style transfer and cross-view style transfer. In each subfigure, we show the style transferred multiview images and the result optimized from Eq. (5) using these target images, respectively. The cross-view style transferred images have better quality and better preserve semantics. Here we show that the visibility of the tongue and the gaze direction are better mapped to the input with cross-view style transfer.

Fig. 13. Image loss during convergence for the baseline and our method, with a target "puffed cheeks" on all 9 views (only 1 is shown). While the baseline gets stuck at a higher loss, our method fully expands the cheek.

Fig. 14. Comparison of two training schemes for Eq. (5): (1) use both the landmark loss and the image loss during the first 2000 iterations, and only the image loss after that (red curves), and (2) only use the image loss (blue curves). The result shows the convergence basin is large enough that landmark initialization is not necessary. We also observe the contradiction between the two losses, as the red landmark loss curve goes back to a similar level as the blue one after iteration 2000. The learning rate is decreased by 10× at iterations 3500 and 7000, hence the loss drops in both methods. [Plot: image loss vs. iteration.]
Fig. 16 (left) clearly shows the limitations. It not only suffers from inconsistency due to subjects not repeating the same expression twice, but also cannot generalize well to unlabeled expressions, such as those in a range of motion clip (a short period of free-form facial movements) or in transitions from peak expressions to neutral. Moreover, even though speech data is densely labeled through dynamic time warping based on audio, the expression, especially in the upper face, can be quite different across captures.

[Lombardi et al. 2018] only demonstrated their animation results on speech data. When applied to extreme expressions, it often generates far worse results, as shown in Fig. 16 (right). However, their method only uses 3 views and does not measure headset position.
[Fig. 15: plots of the geometry/texture losses and the latent space loss during fitting; the legend includes "Texture loss (λ1 = λ2 = 0)".]
[Fig. 16 layout: rows show peak expressions, range of motion, speech, and transitions from peak; the left columns compare [Olszewski et al. 16] with ours, and the right columns compare [Lombardi et al. 18], "w/ training HMC", "w/ headset pose", and ours.]
Fig. 16. Comparison with prior art. On the left, we compare our results with [Olszewski et al. 2016], which suffers from the subject's inconsistency in peak expressions and generalizes poorly to data for which semantic labels cannot be acquired. On the right, we compare with [Lombardi et al. 2018], and also with their method augmented with our designs, including more input views ("w/ training HMC") and matching the distribution of headset positions ("w/ headset pose"). Their results improve when augmented with our designs, but our method still captures nuances on the face significantly better.
REFERENCES
Chen Cao, Qiming Hou, and Kun Zhou. 2014. Displaced Dynamic Expression Regression for Real-time Facial Tracking and Animation. ACM Transactions on Graphics (TOG).
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Niessner. 2018. FaceVR: Real-Time Gaze-Aware Facial Reenactment in Virtual Reality. ACM Transactions on Graphics (TOG) 37, 2, Article 25 (June 2018), 25:1–25:15 pages.
Unreal Engine 4. 2018. Unreal Engine 4. https://www.unrealengine.com/.
Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional Pose Machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jing Xiao, Jinxiang Chai, and Takeo Kanade. 2006. A Closed-Form Solution to Non-Rigid Shape and Motion Recovery. International Journal of Computer Vision (IJCV) 67, 2 (April 2006), 233–246.
Xuehan Xiong and Fernando De la Torre. 2013. Supervised Descent Method and Its Applications to Face Alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ilker Yildirim, Winrich Freiwald, Tejas Kulkarni, and Joshua B. Tenenbaum. 2015. Efficient Analysis-by-synthesis in Vision: A Computational Framework, Behavioral Tests, and Comparison with Neural Representations. In Proceedings of the 37th Annual Conference of the Cognitive Science Society.
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV).