
VR Facial Animation via Multiview Image Translation

SHIH-EN WEI, JASON SARAGIH, TOMAS SIMON, ADAM W. HARLEY∗, STEPHEN LOMBARDI, MICHAL PERDOCH, ALEXANDER HYPES, DAWEI WANG, HERNAN BADINO, and YASER SHEIKH
Facebook Reality Labs
[Figure 1 overview: Data Collection with a training HMC (9 synchronized IR cameras) and a tracking HMC (3 cameras); Establishing Correspondence via multiview image style translation (fake rendered images), an image loss, background-aware differentiable rendering, and the deep appearance avatar decoder [Lombardi et al. 2018] mapping (z, v) to geometry and texture maps; Realtime Animation of the avatar in VR.]

Fig. 1. We present a VR realtime facial animation system with headset mounted cameras (HMC) which augment a standard head mounted display (HMD).
Our method establishes precise correspondence between 9 camera images from a training HMC and the parameters of a photorealistic avatar. Finally, we use
a common subset of 3 cameras on a tracking HMC to animate the avatar in realtime.

A key promise of Virtual Reality (VR) is the possibility of remote social interaction that is more immersive than any prior telecommunication media. However, existing social VR experiences are mediated by inauthentic digital representations of the user (i.e., stylized avatars). These stylized representations have limited the adoption of social VR applications in precisely those cases where immersion is most necessary (e.g., professional interactions and intimate conversations). In this work, we present a bidirectional system that can animate avatar heads of both users' full likeness using consumer-friendly headset mounted cameras (HMC). There are two main challenges in doing this: unaccommodating camera views and the image-to-avatar domain gap. We address both challenges by leveraging constraints imposed by multiview geometry to establish precise image-to-avatar correspondences, which are then used to learn an end-to-end model for real-time tracking. We present designs for a training HMC, aimed at data-collection and model building, and a tracking HMC for use during interactions in VR. Correspondences between the avatar and the HMC-acquired images are automatically found through self-supervised multiview image translation, which does not require manual annotation or one-to-one correspondence between domains. We evaluate the system on a variety of users and demonstrate significant improvements over prior work.

CCS Concepts: • Human-centered computing → Virtual reality; • Computing methodologies → Computer vision; Unsupervised learning; Animation.

Additional Key Words and Phrases: Face Tracking, Unsupervised Image Style Transfer, Differentiable Rendering

ACM Reference Format:
Shih-En Wei, Jason Saragih, Tomas Simon, Adam W. Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh. 2019. VR Facial Animation via Multiview Image Translation. ACM Trans. Graph. 38, 4, Article 67 (July 2019), 16 pages. https://doi.org/10.1145/3306346.3323030

∗ Currently at Carnegie Mellon University; work done while at Facebook Reality Labs.

1 INTRODUCTION
Virtual Reality (VR) has seen increased ubiquity in recent years. This has opened up the possibility for remote collaboration and interaction that is more engaging and immersive than achievable through other media. Concurrently, there has been great progress in generating accurate digital doubles and avatars. Driven by the gaming and movie industries, a number of compelling demonstrations of state of the art systems have recently attracted interest in the community [Epic Games 2017; Hellblade 2018; Magic Leap 2018; Seymour et al. 2017; Unreal Engine 4 2018].
These systems show highly photo-realistic avatars driven and rendered in real-time. Although impressive results are achieved, they are all designed for one-way interactions, where the actor is equipped with sensors optimally placed to capture facial expression. Unfortunately, these sensor placements are not compatible with existing VR-headset designs, which largely occlude the face. Thus, these systems are better suited to live performances than interaction.

If we consider instead works that are aimed at bidirectional communication in VR, we discover that existing systems mostly use non-photorealistic/stylized avatars [BinaryVR 2019; Li et al. 2015; Olszewski et al. 2016]. These representations tend to have a more limited range of expressivity, which renders errors from facial expression tracking less perceptible than with photo-real avatars. In this work, we argue that this is no coincidence, and that precise face tracking from consumer-friendly headset-mounted camera configurations is significantly harder than in conventional settings. There are two main reasons for this. First, instead of capturing a complete and unobscured view of the face, headset-mounted camera placements tend to provide only partial and non-overlapping views of the face at extreme and oblique views. Minimizing reconstruction errors in these viewpoints often does not translate to correct results when viewed frontally. Secondly, these cameras often operate in the infrared (IR) spectrum, which is not directly comparable to the avatar's RGB appearance and makes analysis-by-synthesis techniques less effective. To partially alleviate these difficulties, existing systems are designed to work using structurally and mechanically sub-optimal sensor designs that are more accommodating to the computer-vision tracking problem. Despite this, their performance is still only suitable for stylized avatar representations.

Although the difficult sensor configurations of headset-mounted cameras (HMC) can prove challenging for classical face alignment approaches like [Saragih et al. 2009; Xiong and la Torre 2013], in this work we show that end-to-end deep neural networks are capable of learning the complex mapping from sensor measurements to avatar parameters, given the availability of sufficient high-precision training examples relating the two domains. We demonstrate compelling results where fully expressive and accurate performances can be tracked in real time, at a precision matching the representation capacity of modern photo-realistic avatars. The challenge, then, is how to acquire the correspondences required to pose the problem as supervised learning.

Our solution for acquiring correspondence is to leverage multiview geometry in addressing both the problem of oblique viewpoints as well as the sensor-avatar domain gap. Specifically, we propose the use of a training HMC design that shares a sensor configuration with a consumer-friendly design (the tracking HMC), but has additional cameras to support better coverage and viewing angles while minimally disturbing the quality of data acquired from the shared cameras. With these additional viewpoints, classical analysis-by-synthesis constraints become more meaningful, and results generalize better to common vantage points (i.e., frontal view of the face). These additional views also provide more signal to improve the fidelity at which we can perform domain translation to address the sensor-avatar domain gap. With data collected from these cameras along with a pre-trained personalized avatar, our system learns to discover correspondences through self-supervision, without requiring any manual input or semantically defined labels. The results are highly precise estimates of the avatar's parameters that match the user's performance, which are suitable for learning an end-to-end mapping relating them to the tracking HMC.

In the following, we discuss prior work in §2, and present our method for finding image-to-avatar correspondence in §3, including hardware design, image style transfer and the core optimization problem. Real-time facial animation is then covered in §4. We present results of our approach in §5, and conclude in §6 with a discussion and directions of future work.

2 RELATED WORK

2.1 Face Tracking in VR
Tracking faces in VR is a unique and challenging problem because the object we want to track is largely occluded by the VR headset. In the literature, solutions vary in how hardware designs are used to circumvent sensing challenges, as well as in methods to bridge the domain gap between sensor data and the face representation in order to find their correspondence. Specifically, sensors used to build the face model and later drive it are typically comprised of different sets of camera configurations. Modeling sensors (i.e., cameras for building a shape and appearance model of the face) are typically multiview, high-fidelity RGB cameras with an unobstructed view of the face [Beeler et al. 2011; Dimensional Imaging 2016; Fyffe et al. 2014; Lombardi et al. 2018]. Tracking sensors (i.e., cameras for driving the face model to match a user's expressions) are typically mounted on the HMD, resulting in a patch-work of oblique and partially overlapping views of different facial parts with narrow depth of field, and typically operate in the infra-red (IR) spectrum [Li et al. 2015; Olszewski et al. 2016]. Moreover, since the HMDs obscure the face, and modeling sensors require an unobscured view, data from modeling and tracking sensors cannot be captured concurrently.

As the eyebox in a VR headset is enclosed, the face is typically divided into an occluded upper face and a visible lower face. As such, some works use specialized strategies to obtain the facial state for each part separately, followed by a compositing step to get the full face state in realtime. In [Li et al. 2015] and [Thies et al. 2018], an RGBD sensor is used, which allows direct registration of the lower face with the model's geometry. To have a better viewpoint, Li et al. [2015] attach the sensor with a protruding mount, placing it slightly below the mouth. Similar to [BinaryVR 2019], this frontal viewpoint is optimal for modeling lower mouth expressions, but is not ideal from a hardware design standpoint. In [Thies et al. 2018], the sensor is placed in the environment, which limits the range of a user's head pose. For the upper face, Li et al. [2015] use strain gauges to sense voltage changes that accompany facial expressions. By building a skeleton headset without a display unit that otherwise occludes the upper face, they can acquire face model parameters corresponding to strain gauge measurements through depth registration. They use this to train a real-time upper face blendshape regressor. However, this input signal has low SNR, exhibits drift, and contains only limited information about facial expressions. In comparison, Thies et al. [2018] use infrared (IR) cameras pointing at the eye region to avoid interfering with VR usage.
They use a calibration process where subjects are instructed to gaze at known positions to obtain correspondence for training. Although effective in estimating gaze direction and eyelid blinks, it does not capture other upper-face expressions such as eyebrow motion and the temporal pattern of wrinkles in the forehead, nose and areas surrounding the eyes.

If the parametric facial model exhibits appearance statistics matching those from the sensors used to drive the model, then "analysis-by-synthesis" or "vision as inverse graphics" approaches [Kulkarni et al. 2015; Nair et al. 2008; Yildirim et al. 2015] can be used to find correspondences. Here, the challenge is to find the parameters of the models that, when rendered from the corresponding camera view, match the images obtained from the driving sensors. This is typically achieved through variants of the gradient-descent algorithm [Blanz and Vetter 1999; Cootes et al. 1998; Thies et al. 2016]. Unfortunately, for VR applications, the domain gap between imaging sensors and modeling sensors makes it unlikely that minimizing the difference between the rendered model and the sensor's image will result in an expression matching that of the user. Although explicit landmark detectors can be used to define sparse geometric correspondence between the model and tracking images [Cao et al. 2014], landmarks alone lack expressiveness, and the oblique viewpoints from HMD mounted cameras make reprojection errors less effective.

Other approaches correspond sensor data and face parameters using non-visual information such as manual semantic annotations of facial expressions and audio signals. For example, in [Olszewski et al. 2016], to model the entire face using a single RGB sensor for the lower face and IR cameras for the eyes, the subjects were instructed to perform a predefined set of expressions and sentences that were used as correspondence with a blendshape model face animation of the same content. To generate more correspondence for training a neural network with a small temporal window, dynamic time warping of the audio signal was used to align the sentences with the animation. Although the approach demonstrated realistic results, the animation tends to exhibit tokenized expressions which look plausible but are not faithful reconstructions of the user's facial motion. This approach is also limited by the granularity at which facial expressions can be expressed, as well as their repeatability.

Unpaired learning-based methods have also been investigated. In [Lombardi et al. 2018], synthetic renderings of the avatar are used together with real tracking sensor images to build a domain-conditioned variational autoencoder (VAE) while simultaneously learning a mapping from the latent space to face model parameters, using correspondences from the rendered domain exclusively. In that work, correspondences are found by leveraging parsimony in the VAE network, and the model is never trained directly for the target setting (i.e., input is headset images, output is avatar parameters). Although compelling results were demonstrated for speech sequences, its performance deteriorates with expressive content. While our method also leverages unpaired learning methods, differently, we explicitly transfer multiview IR images to avatar-like rendered images with Generative Adversarial Networks (GANs), which can generate high quality modality-transferred images while better preserving facial expression. We also leverage the additional views provided by the training HMC, and infer more precise correspondences through differentiable rendering.

2.2 Image Style Transfer
Image style transfer is the task of transforming one image to match the appearance of another while retaining its semantic and structural content [Gatys et al. 2016]. Recent works [Isola et al. 2017] have tackled this task using GANs [Goodfellow et al. 2014], which can generate images with high realism. CycleGAN [Zhu et al. 2017] introduced the concept of cycle-consistency, which improves on the mode-collapse problem with GANs. Although architectural choices, such as the U-net architecture in CycleGAN, encourage structural consistency during transfer, it can still suffer from semantic drift, especially when the distributions between the domains are not balanced. In [Fu et al. 2018], a geometric loss was added to encourage the preservation of spatial structure, and in [Mueller et al. 2018], a silhouette matching term was used instead. To further encourage the preservation of semantic information during transfer, [Bansal et al. 2018] extend the idea of cycle-consistency by utilizing temporal structures of data. Instead of adding additional terms, in [Harley et al. 2019], an "uncooperative" optimization strategy is employed to prevent semantic drift caused by the forward and backwards transformations colluding to complement errors produced by each other. In our method, the preservation of facial expression is particularly important, because we rely on generated (fake) images as supervision for optimizing face parameters. Besides matching distributions carefully and the use of uncooperative optimization, we present cross-view cycle consistency to further enforce semantic preservation, utilizing synchronized multiview data from our designed HMC.

2.3 Differentiable rendering
In analysis-by-synthesis approaches [Tewari et al. 2017; Thies et al. 2018], differentiable rendering of a textured mesh is necessary to allow the error signal to flow from pixel errors to the parameters of the graphics engine in a system trained end-to-end. However, it involves a discrete rasterization step to assign triangles to every pixel; this non-differentiable step makes it hard to propagate gradients from pixel errors to the mesh geometry. Tewari et al. [2017] circumvent this issue by formulating the loss over the vertices instead of the image pixels. Kato et al. [2018] address the problem by approximating gradients according to the geometric relationship between a triangle and a pixel. In our application, this problem manifests as failures in matching face silhouettes, and we address it through a formulation whereby gradients of background pixels that are not covered by any rasterized triangle can still be backpropagated to affect mesh geometry.

3 ESTABLISHING CORRESPONDENCE
In recent years, end-to-end regression from images has proven effective for high-quality and real-time facial motion estimation [Laine et al. 2017; Tewari et al. 2017]. These works leverage the representation capacity of deep neural networks to model the complex mapping between raw image pixels and the parameters of a parametric face model. With input-output pairs, the problem can be posed within a supervised learning framework, where a precise mapping that generalizes well can be found (see §4).
The main challenge, therefore, is to acquire a high quality training set: a sufficient number of precise correspondences between input HMC images and avatar parameters spanning diverse facial expressions. This is particularly challenging for VR applications, where existing methods have significant limitations as described in §2.1.

In this work, we establish high-quality correspondence using domain-transferred multi-view analysis-by-synthesis. We describe our training- and tracking-HMC designs in §3.1. Our approach for multiview consistent correspondence estimation is then described in §3.2, with a detailed treatment of domain transfer in §3.3, and background-aware differentiable rendering in §3.4.

Fig. 2. Headset mounted cameras (HMC); (a) Training HMC, with standard cameras (circled in red) and additional cameras (circled in blue). It is used for collecting data to help establish better correspondence between HMC images and avatar parameters. (b) Tracking HMC, used for realtime face animation, with minimal camera configuration. (c) Example of captured images, with colored frames indicating standard views (red) and additional views (blue). (d) Multi-plane calibration pattern used to geometrically calibrate all cameras in the training and tracking HMCs. (e) An illustration of the challenges of ergonomic camera placement (blue), where large motions such as mouth opening project to small changes in the captured image (θ1 ≪ θ2), in comparison to more accommodating camera placements (orange).

3.1 Data Capture
Consumer-grade VR headset designs need to be structurally and mechanically robust, ergonomic, and aesthetically pleasing. HMC designs that fit these criteria typically have partial and oblique views of the face, which are challenging to use with image-space reconstruction losses typically employed in registration algorithms. Fig. 2(e) illustrates this problem. For this reason, most published [Li et al. 2015; Olszewski et al. 2016] and commercial [BinaryVR 2019] VR face tracking systems are designed to operate with more accommodating designs, at the expense of size, weight, and cost considerations. The core idea of our work is that the challenging images acquired from optimal camera mounting configurations do, in fact, contain sufficient information for precise expression estimation, as long as there is a sufficient number of high-quality samples relating those images to the face model's expression space. In support of this idea, we built two versions of the same headset; one with a consumer-friendly design with a minimally intrusive camera configuration (i.e., the tracking HMC), and another with an augmented camera set with more accommodating viewpoints to support correspondence finding (i.e., the training HMC). Shown in Fig. 2, the training HMC is used to collect data and build a mapping between the minimal headset camera configuration and the user's facial expressions. Specifically, the minimal set consists of 3 IR VGA cameras for the mouth, left eye and right eye, respectively. The training HMC has 6 additional cameras: an additional view of each eye, and 4 additional views of the mouth, strategically placed lower to capture lip-touching and vertical mouth motion, and on either side to capture lip protrusion. All cameras are synchronized and capture at 90Hz. They were geometrically calibrated using a custom 3D printed calibration pattern that ensures that parts of the pattern are within the depth of field of each camera, as shown in Fig. 2(d).
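To make the shared-subset relationship concrete, the camera layout described above can be written down as a small sanity check. The camera names below are our own illustrative labels, not identifiers used by the system:

```python
# Illustrative sketch of the camera sets in Sec. 3.1 (names are ours).
# Minimal tracking set: 3 IR VGA cameras (mouth, left eye, right eye).
TRACKING_CAMERAS = {"mouth", "left_eye", "right_eye"}

# 6 additional training-only views: one extra view per eye, plus 4 extra
# mouth views placed lower (lip touching, vertical motion) and to the sides
# (lip protrusion).
ADDITIONAL_CAMERAS = {
    "left_eye_extra", "right_eye_extra",
    "mouth_lower_left", "mouth_lower_right",
    "mouth_side_left", "mouth_side_right",
}

TRAINING_CAMERAS = TRACKING_CAMERAS | ADDITIONAL_CAMERAS

# The tracking HMC shares its 3 cameras with the 9-camera training HMC,
# so a model built from training-HMC data can later be driven by the
# tracking HMC alone.
assert TRACKING_CAMERAS < TRAINING_CAMERAS
assert len(TRACKING_CAMERAS) == 3 and len(TRAINING_CAMERAS) == 9
```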
To build the dataset, we captured each subject twice using the same stimuli; once using the training HMC, and again using the tracking HMC. The content included 73 expressions, 50 sentences, a range of motion, a range of gaze directions and 10 minutes of free conversation. This set was designed to cover the range of natural expressions. Collecting the same content in both devices ensures a roughly balanced distribution of facial expressions between the two domains, which is important for unpaired domain transfer algorithms to work well (see §3.3).

3.2 Overall Algorithm
We illustrate our overall algorithm to establish correspondences in Fig. 3. It assumes the availability of a pre-trained personalized parametric face model. For the experiments in this paper, we use the deep appearance model [Lombardi et al. 2018], which generates geometry and view-conditioned texture from an l-dimensional latent code z ∈ R^l and a 6-DOF rigid transform v ∈ R^6 from the avatar's reference frame to the headset (represented by a reference camera), using a deep deconvolutional neural network D:

M, T ← D(z, v). (1)

Here, M ∈ R^{n×3} is the facial shape comprising n vertices, and T ∈ R^{w×h} is the generated texture. A rendered image R can be generated from this shape and texture through rasterization:

R ← R(M, T, A(v)), (2)

where A denotes the camera's projection function.
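As a concrete reading of Eq. (1), a decoder with this interface might be sketched as follows in PyTorch. The layer shapes, vertex count and texture size are illustrative placeholders, not the architecture of [Lombardi et al. 2018], and the rasterization of Eq. (2) is only indicated as a comment since it requires a differentiable renderer (§3.4):

```python
import torch
import torch.nn as nn

class AvatarDecoder(nn.Module):
    """Sketch of D in Eq. (1): (z, v) -> (M, T).

    z: l-dimensional latent code, v: 6-DOF rigid transform.
    M: mesh of n vertices (n x 3), T: view-conditioned texture map.
    """
    def __init__(self, latent_dim=256, n_vertices=4096):
        super().__init__()
        self.n_vertices = n_vertices
        # Branch predicting the facial shape M.
        self.geometry = nn.Sequential(
            nn.Linear(latent_dim + 6, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, n_vertices * 3),
        )
        # Deconvolutional branch predicting the texture T (here 4 -> 1024 px).
        self.texture = nn.Sequential(
            nn.Linear(latent_dim + 6, 512 * 4 * 4), nn.LeakyReLU(0.2),
            nn.Unflatten(1, (512, 4, 4)),
            nn.ConvTranspose2d(512, 64, kernel_size=4, stride=4), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(64, 16, kernel_size=4, stride=4), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(16, 3, kernel_size=16, stride=16),
        )

    def forward(self, z, v):
        x = torch.cat([z, v], dim=-1)          # condition on latent code and pose
        M = self.geometry(x).view(-1, self.n_vertices, 3)
        T = self.texture(x)                    # (batch, 3, 1024, 1024)
        # Eq. (2) would then rasterize (M, T) under the calibrated camera
        # projection A(v) with a differentiable renderer to obtain R.
        return M, T
```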
Given multiview images H = {H_i}_{i∈C} acquired from a set of headset cameras C, our goal is to estimate the user's facial expression as seen in these views. We solve this by estimating the latent parameters z and the headset's pose v that best align the rendered face model to the acquired images. However, instead of performing this task separately for each frame in a recording, similar to Tewari et al. [2017], we simultaneously estimate these attributes over the entire dataset comprising thousands of multiview frames.


Specifically, we estimate the parameters θ of a predictor Eθ that extracts {z_t, v_t}, the latent code and the headset's pose for frame t ∈ T, by jointly considering data from all cameras at that time instant:

z_t, v_t ← Eθ({H_{t,i}}_{i∈C}), ∀t ∈ T. (3)

Note that the same predictor is used for all frames in the dataset. Analogous to non-rigid structure from motion [Xiao et al. 2006], this has the benefit that regularity in facial expression across time can further constrain the optimization process, making the estimation process more resistant to terminating in poor local minima.

Fig. 3. Overall optimization for establishing correspondence. We train a network Eθ to jointly estimate avatar parameters z_t and headset position v_t from 9-view images H_t from a training HMC. These estimated values are later used by D and R to render multiview images of the avatar R_t. In order to apply an L1 reconstruction loss, we bridge the domain gap between HMC images and rendered images by also training an image style transformer Fϕ (along with Gψ) to convert H_t to the rendered domain R̂_t with preserved semantics. Colored modules (Eθ and Fϕ) have trainable parameters, while the parameters of D are frozen and R is parameter-free. On the right, the found correspondences between H_t and z_t can be used as pairs of 3-view images (of a tracking HMC, shown in pink boxes) and avatar parameters for later regression (see §4).

Due to the domain-gap between the rendered image R and the camera images H, they are not directly comparable. To address this, we also learn the parameters ϕ of a view-dependent domain transfer network:

R̂_{t,i} = Fϕ(H_{t,i}), ∀i ∈ C. (4)

In its simplest form, this function is comprised of independent networks for each camera. With this, we can formulate the analysis-by-synthesis reconstruction loss:

L(θ, ϕ) = Σ_{t∈T} Σ_{i∈C} ( ||R̂_{t,i} − R_{t,i}||_1 + λ δ(z_t) ), (5)

where R_{t,i} is the rendered face model from Eq. (2), rasterized using the known projection function A_i, whose parameters are obtained from the calibrated camera as mentioned in §3.1. Here, δ is a regularization term over the latent codes z_t, and λ weights its contribution against the L1-norm reconstruction of the domain-transferred image.
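The following sketch shows how Eqs. (3)–(5) fit together for a single frame. It assumes an encoder E_theta, the frozen decoder D of Eq. (1), a per-camera style-transfer module F_phi (e.g., an nn.ModuleDict keyed by camera), calibrated projections A_i from §3.1, and some differentiable rasterizer render(M, T, camera); all function and argument names are ours, and the regularizer and term weighting are illustrative stand-ins:

```python
import torch

def correspondence_loss(E_theta, D, F_phi, render, cameras, H_t, lam=1.0):
    """Eq. (5) for one time instant t; summing over a dataset of frames is analogous.

    H_t     : dict camera_id -> IR image from the training HMC at frame t
    cameras : dict camera_id -> calibrated projection A_i (Sec. 3.1)
    E_theta : trainable predictor of Eq. (3), (all views) -> (z_t, v_t)
    D       : frozen avatar decoder of Eq. (1), (z, v) -> (M, T)
    F_phi   : frozen per-camera style transfer of Eq. (4), H -> R_hat
    render  : differentiable rasterizer implementing Eq. (2)
    """
    # Eq. (3): a single prediction per frame, from all camera views jointly.
    views = torch.stack([H_t[i] for i in sorted(H_t)], dim=0)
    z_t, v_t = E_theta(views)

    # Eq. (1): decode geometry and view-conditioned texture from the frozen avatar.
    M, T = D(z_t, v_t)

    loss = 0.0
    for i, A_i in cameras.items():
        R_ti = render(M, T, A_i(v_t))        # Eq. (2): avatar rendered into camera i
        R_hat_ti = F_phi[i](H_t[i])          # Eq. (4): domain-transferred HMC image
        loss = loss + (R_hat_ti - R_ti).abs().sum()   # L1 image term of Eq. (5)

    # Regularizer delta(z_t) on the latent code (an L2 penalty as a stand-in),
    # weighted by lambda.
    return loss + lam * z_t.pow(2).sum()
```

Because θ and ϕ are shared across all frames and cameras, minimizing the sum of this loss over the whole capture couples the per-frame estimates, as described above.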
Although at first glance this formulation appears to be reasonable, with high capacity encoder Eθ and domain-transfer Fϕ functions there is a space of solutions where one network can compensate for the semantic error incurred by the other, leading to low reconstruction errors but incorrect estimation of the expression z and headset pose v. Without additional constraints, we observe that this phenomenon often occurs in practice, which we refer to as collaborative self-supervision. When the domain gap comprises primarily appearance differences, we conjecture that collaborative self-supervision tends to be more prominent in architectures that do not retain spatial structure. This is the case in our system, where the latent code z is a vectorized encoding of the image. This problem has been observed in the style-transfer community, where a common solution is to use fully convolutional architectures with skip connections [Isola et al. 2017; Zhu et al. 2017]. As spatial structure is propagated through the entire network, it is easier to retain structure, and it thus tends to remain unchanged when differences in style can be well explained by appearance alone.

Motivated by the compelling results achieved by methods such as [Zhu et al. 2017], we decouple Eq. (5) into two stages. First, we learn the domain transfer Fϕ that converts headset images H into fake rendered images R̂ without changing the apparent expression, i.e., the semantics. In the second stage, we fix Fϕ and optimize Eq. (5) with respect to Eθ.
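This two-stage decoupling can be summarized as a training schedule. The sketch below reuses correspondence_loss from the previous block, assumes F_phi and D are nn.Modules, leaves the unpaired style-transfer training of stage one abstract (it is the subject of §3.3), and uses an optimizer and learning rate chosen only for illustration:

```python
import torch

def establish_correspondence(E_theta, F_phi, D, render, cameras, frames,
                             train_style_transfer, epochs=10):
    # Stage 1: learn the domain transfer F_phi (together with its inverse
    # G_psi, inside train_style_transfer) on unpaired HMC images and avatar
    # renderings; E_theta is untouched in this stage.
    train_style_transfer(F_phi, frames)

    # Stage 2: freeze F_phi and D, then optimize Eq. (5) w.r.t. E_theta only.
    for module in (F_phi, D):
        for p in module.parameters():
            p.requires_grad_(False)
    opt = torch.optim.Adam(E_theta.parameters(), lr=1e-4)

    for _ in range(epochs):
        for H_t in frames:                  # multiview HMC images of one frame
            opt.zero_grad()
            loss = correspondence_loss(E_theta, D, F_phi, render, cameras, H_t)
            loss.backward()
            opt.step()
    return E_theta
```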
3.3 Expression-Preserving Domain Transfer
Our expression-preserving transfer is based on unpaired image translation networks [Zhu et al. 2017]. This architecture learns a bidirectional domain mapping Fϕ and Gψ by enforcing cyclic consistency between the domains and an adversarial loss for each of the




[Figure residue, page 67:6: captured HMC images H (real-world imaging) contrasted with a synthetically rendered avatar R (controlled rendering); landmark detection; distributions P(v), P(z), P(z, v), and an approximation P̂(z, v); labels "Unknown" and "Approximated".]
sha1_base64="xE6FGVkboBQFjFMtt7LjSkqyIUo=">AAAB8HicbVBNSwJRFL1jX2ZfVtCmFoMSGITMa9NWatPSwFFBB3nzvNrLN2+GeW8EG/wPLauNtG3Xj2nRv2kcJVI7cOFwzj1w7nUDwZW2rG8js7a+sbmV3c7t7O7tH+QPj+rKj0KGNvOFHzZdqlBwibbmWmAzCJF6rsCGO7id+o0hhor7sqZHAToe7Uve44zqRKpXS0+Xw4tOvmiVrRTmKiFzUqwUIDM5+TyrdvJf7a7PIg+lZoIq1SJWoJ2YhpozgeNcO1IYUDagfWwlVFIPlROnbcfmeaJ0zZ4fJiO1map/EzH1lBp5brLpUf2glr2p+K8nuItJA7lY4FeOlCZpdqFejTjxtAdKNk7+QJavXiX1qzKxyuSeFCs3MEMWTqEAJSBwDRW4gyrYwOARnuEV3ozQeDEmxvtsNWPMM8ewAOPjB/noktU=</latexit>

Shared <latexit sha1_base64="LWXrrwdJ+QDaWx71U0SmEF5OL/8=">AAAB9nicbVDLSsNAFL2pr1pfVZdugkWoICXTjduiG5cV+oImlMn0th06mYTMpFhDfsOluhG3/owL/8Y0BrGtBy4czrkHzr1uILjSlvVlFDY2t7Z3irulvf2Dw6Py8UlH+VHIsM184Yc9lyoUXGJbcy2wF4RIPVdg153eLvzuDEPFfdnS8wAdj44lH3FGdSrZ9oTquJlUH69ml4NyxapZGcx1QnJSgRzNQfnTHvos8lBqJqhSfWIF2olpqDkTmJTsSGFA2ZSOsZ9SST1UTpx1TsyLVBmaIz9MR2ozU/8mYuopNffcdNOjeqJWvYX4rye4i2kDuVzgV46UJll2qV6LOPGiB0qWpH8gq1evk069RqwauSeVxk3+kSKcwTlUgcA1NOAOmtAGBgE8wQu8Gg/Gs/FmvP+sFow8cwpLMD6+AUYdk7w=</latexit>


sha1_base64="jFK5r5chdLIVmiZxzoj8cdpKlZw=">AAAB9nicbVDLSsNAFL2pr1pfVZduQotQQUrGjduiG5cV+oImlMn0th06mYTMpFhDfsOldiNu/RkX/o1pWsS2HrhwOOceOPe6geBKW9a3kdva3tndy+8XDg6Pjk+Kp2ct5UchwybzhR92XKpQcIlNzbXAThAi9VyBbXd8P/fbEwwV92VDTwN0PDqUfMAZ1alk2yOq43pSeb6eXPWKZatqZTA3CVmScq0EuRkA1HvFL7vvs8hDqZmgSnWJFWgnpqHmTGBSsCOFAWVjOsRuSiX1UDlx1jkxL1Olbw78MB2pzUz9m4ipp9TUc9NNj+qRWvfm4r+e4C6mDeRqgV85Uppk2ZV6DeLE8x4oWZL+gaxfvUlaN1ViVckjKdfuYIE8XEAJKkDgFmrwAHVoAoMAXuANZsaT8Wq8Gx+L1ZyxzJzDCozPH6gAlMA=</latexit>
sha1_base64="5TPDhkoL5T7I9uTF8fglRCrPa58=">AAAB9nicbVDLSsNAFL2pr1pfVcGNLkqLUEFKxo3bohuXFfqCJpTJ9LYdnExCZlKsIb/h0roRt/oxLvwb07SIbT1w4XDOPXDudXzBlTbNbyOztr6xuZXdzu3s7u0f5A+PmsoLA4YN5gkvaDtUoeASG5prgW0/QOo6AlvOw+3Ub40wUNyTdT320XbpQPI+Z1QnkmUNqY5qcfnpcnTRzZfMipmisErInJSqRchMTj7Pat38l9XzWOii1ExQpTrE9LUd0UBzJjDOWaFCn7IHOsBOQiV1UdlR2jkunCdKr9D3gmSkLqTq30REXaXGrpNsulQP1bI3Ff/1BHcwaSAXC/zKodIkzS7UqxM7mvZAyeLkD2T56lXSvKoQs0LuSal6AzNk4RSKUAYC11CFO6hBAxj48AwTeDUejRfjzXifrWaMeeYYFmB8/ADYmZWi</latexit>

Fig. 5. Landmarks on the texture map of the avatar (left) and HMC
Stimuli
ˆ images (right). Colors of landmarks indicate the point-to-point correspon-
dence across domains. We detect landmarks on all 9 views (only 3 views are
<latexit sha1_base64="sQyJpZ9CGBYzZe9g8XrqTDgxCBk=">AAAB9HicbVDLSsNAFL1TX7W+qi7dBItQNyXjxm3RjcsKfWEbymR62w6dTEJmUqghf+FS3Yhb/8aFf2Mag9jWAxcO59wD5143kEIb2/4ihY3Nre2d4m5pb//g8Kh8fNLWfhRybHFf+mHXZRqlUNgywkjsBiEyz5XYcae3C78zw1ALXzXNPEDHY2MlRoIzk0oP/QkzcSOpPl4OyhW7Zmew1gnNSQVyNAblz/7Q55GHynDJtO5ROzBOzEIjuMSk1I80BoxP2Rh7KVXMQ+3EWePEukiVoTXyw3SUsTL1byJmntZzz003PWYmetVbiP96UriYNlDLBX7lSBuaZZfqNakTL3qg4kn6B7p69TppX9WoXaP3tFK/yT9ShDM4hypQuIY63EEDWsBBwRO8wCuZkWfyRt5/Vgskz5zCEsjHN/zJkwY=</latexit>
sha1_base64="vkLC73hYIoKYib5MVs1NchItUZo=">AAAB9HicbVDLSsNAFL2pr1pfVZduQotQNyXjxm3RjcsKfWEbymR62w5OJiEzKdSQv3CpIohb/8aFf+M0LWJbD1w4nHMPnHu9UHClHefbym1sbm3v5HcLe/sHh0fF45OWCuKIYZMFIog6HlUouMSm5lpgJ4yQ+p7AtvdwM/PbE4wUD2RDT0N0fTqSfMgZ1Ua6742pTupp5fGiXyw7VSeDvU7IgpRrJci9AUC9X/zqDQIW+yg1E1SpLnFC7SY00pwJTAu9WGFI2QMdYddQSX1UbpI1Tu1zowzsYRCZkdrO1L+JhPpKTX3PbPpUj9WqNxP/9QT30DSQywV+5VhpkmWX6jWIm8x6oGSp+QNZvXqdtC6rxKmSO1KuXcMceTiDElSAwBXU4Bbq0AQGEp7gBV6tifVsvVsf89WctcicwhKszx9eu5QK</latexit>
sha1_base64="GTbVURbaJlKQeJYDClvQPtZcTfk=">AAAB9HicbVDLSsNAFL2pr1pfVcGNLkKLUDcl48Zt0Y3LCn1hG8pketsOnUxCZlKoIX/h0sdC3Ap+jAv/xjQtYlsPXDiccw+cex1fcKUt69vIrK1vbG5lt3M7u3v7B/nDo4bywoBhnXnCC1oOVSi4xLrmWmDLD5C6jsCmM7qZ+s0xBop7sqYnPtouHUje54zqRLrvDKmOqnHp4aKbL1plK4W5SsicFCsFyLyefJ5Vu/mvTs9joYtSM0GVahPL13ZEA82ZwDjXCRX6lI3oANsJldRFZUdp49g8T5Se2feCZKQ2U/VvIqKuUhPXSTZdqodq2ZuK/3qCO5g0kIsFfuVQaZJmF+rViB1Ne6BkcfIHsnz1KmlclolVJnekWLmGGbJwCgUoAYErqMAtVKEODCQ8wjO8GGPjyXgz3merGWOeOYYFGB8/j1SU7A==</latexit>

shown) from detectors trained from manual annotations. The uv-coordinates


Fig. 4. Matching the distribution of spatial structures in images across of these landmarks are then jointly solved with z t and v t across multiple
domains is important for training expression-preserving image style transfer frames. Note that we do not minimize projected distance of landmarks in
networks Fϕ and Gψ . We use landmark detection to estimate a distribution Eq. (5) to find image-to-avatar correspondence.
over headset pose, and use the same expression stimuli during data capture
(for both avatar building and HMC capture) to ensure the distribution of
expressions is comparable.
To achieve preservation of expression, we need to eliminate the tendency of generators to modify spatial structure in the images. With a fully convolutional architecture, where random initialization already leads to retained image structures, this tendency mainly comes from the pressure to prevent opposing discriminators from spotting fake images from their spatial structure. In other words, if the distribution of spatial structure, which in our case is jointly determined by headset positions v and facial expressions z, is balanced, the generators then have no pressure to begin modifying them. In the following, we begin by describing how we balance the distribution when preparing real datasets before training, followed by the learning problem itself.

3.3.1 Matching Distribution of Image Spatial Structure. While we don't have control over the underlying expression z^t and headset position v^t in captured headset data {H_i^t}_{t∈T}, we can generate a set of rendered images {R_i^s}_{s∈S} with the desired statistics if we have an estimate P̂(z, v) of the joint distribution P(z, v). However, since estimating individual z^t and v^t for headset data is our original problem, we need to find a proxy to approximate P̂ ≈ P.

Fig. 4 summarizes our strategy. Here, we assume an independent distribution between z and v, i.e., P̂(z, v) = P̂(z)P̂(v), and estimate P̂(z) and P̂(v) individually. For P̂(z), we rely on the data capture process, where the subject is captured twice with the same stimuli as mentioned in § 3.1. Even though this does not lead to a frame-to-frame mapping between the captures, we can assume the distribution of facial expression is comparable. Therefore, we can use the set of expression codes from the modeling capture as approximate samples from P(z).

For the distribution over headset pose P̂(v), we resort to fitting the face model's 3D geometry to detected landmarks on headset images, by collecting 2D landmark annotations and training landmark detectors [Wei et al. 2016]. Although landmark fitting alone does not produce precise enough estimates of expression (because landmarks are not able to describe complete and subtle facial expressions), it can often give reasonable estimates for v owing to its low dimensionality and limited range of variation. One of the challenges in fitting a 3D mesh to 2D detections is defining correspondence between mesh vertices and detected landmarks. Typically, the landmark set for which annotations are available does not match exactly to any vertex in a particular mesh topology. Manually assigning a single vertex to each landmark can lead to suboptimal fitting results for coarse mesh topologies. To address this problem, while fitting individual meshes, we simultaneously solve for each landmark's mesh correspondence (used across all frames) in the texture's uv-space {u_j ∈ R^2}_{j=1}^m, where m is the number of available landmarks. To project each landmark j on rendered images of every view, we calculate a row vector of the barycentric coordinates b_j ∈ R^{1×3} of the current u_j in its enclosing triangle, with vertices indexed by a_j ∈ N^3, and then linearly interpolate projections of the enclosing triangle's 3D vertices, M_{a_j} ∈ R^{3×3}, where M is the mesh from Eq. (1). Specifically, we solve the following optimization problem:

    \min_{u_j, v^t, z^t} \sum_{t \in T} \sum_{i \in C} \sum_{j=1}^{m} w_{ij}^t \left\| p_{ij}(H_i^t) - b_j \, P(M_{a_j}^t, A_i v^t) \right\|_2,    (6)

where p_{ij} ∈ R^2 is the 2D detection of landmark j in HMC camera i, P is a camera projection generating 2D points in R^{3×2}, and w_{ij}^t is the landmark's detection confidence in [0, 1]. Note that for a landmark j not observable by view i, w_{ij}^t is zero. In practice, we initialize the u_j's to a predefined set of vertices in the template mesh to prevent divergence. We also avoid using landmarks in regions where the avatar doesn't have mesh vertices, such as the pupils and the mouth interior. Fig. 5 shows an example of the u_j's at convergence.

By solving Eq. (6), we obtain a set of HMC poses {v^t}_{t∈T}, one from each frame H^t. We can now render the required dataset {R_i^s}_{s∈S} by randomly sampling an HMC pose |S| times from {v^t}_{t∈T}, and an expression code also |S| times from the set of encoded values of the face modeling capture [Lombardi et al. 2018], independently. Together with {H_i^t}, we now have training data for unpaired style transfer. We discard the estimated z^t solved by Eq. (6), since they tend to be inaccurate when relying solely on landmarks.

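To make the role of the barycentric correspondence in Eq. (6) concrete, the following is a minimal sketch of one residual term for a single landmark and camera. It assumes the camera parameters already encode the headset pose A_i v^t, and the helper names (project_points, barycentric_of_uv) as well as the tensor shapes are illustrative only, not the paper's implementation.

```python
# Illustrative sketch of one residual term of Eq. (6): the 2D detection of
# landmark j in camera i is compared against the barycentric interpolation of
# the projected vertices of the triangle enclosing u_j in uv-space.
# Helper functions and shapes are assumptions for exposition only.
import numpy as np

def project_points(verts_3d, camera):
    """Pinhole projection of (N, 3) points with assumed intrinsics/extrinsics."""
    K, R, t = camera["K"], camera["R"], camera["t"]
    cam = verts_3d @ R.T + t              # (N, 3) in camera coordinates
    uv = cam @ K.T                        # homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]         # (N, 2) pixel coordinates

def landmark_residual(u_j, mesh_verts, faces, uv_coords, camera, p_ij, w_ij):
    """Weighted residual for one landmark j observed in one HMC camera i."""
    # 1. Find the triangle enclosing u_j in uv-space and its barycentric
    #    weights b_j in R^{1x3} (barycentric_of_uv is an assumed helper).
    face_idx, b_j = barycentric_of_uv(u_j, uv_coords, faces)
    a_j = faces[face_idx]                 # vertex indices of enclosing triangle
    # 2. Project the triangle's 3D vertices (M_{a_j} in R^{3x3}) to the image.
    tri_2d = project_points(mesh_verts[a_j], camera)           # (3, 2)
    # 3. Interpolate the projections with the barycentric weights.
    pred_2d = b_j @ tri_2d                # (2,) predicted landmark position
    # 4. Confidence-weighted distance to the detected landmark p_ij.
    return w_ij * np.linalg.norm(p_ij - pred_2d)
```
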

[Fig. 6 diagram: domains H (HMC images) and R (rendered avatar) in two views 0 and 1, with style transformers F_ϕ, G_ψ, cross-view predictors P_0, P_1, cycle consistency within each domain, and cross-view consistency across views.]

Fig. 6. Cross-view cycle consistency in multiview image style translation. In addition to the original cycle consistency within each domain, we further constrain the style transformers given paired multiview data in each domain, to encourage preserving the spatial structure of the face images. Only one loss term (out of all 4 directions) is shown here.

3.3.2 Multiview Image Style Translation. Given images from the two domains {H_i^t} and {R_i^s}, one can utilize a method such as [Zhu et al. 2017] to learn view-specific mappings F_{ϕ,i} and G_{ψ,i} that translate images back and forth between the domains. Since we are careful to encourage a balanced distribution P(z, v) between the domains, this straightforward approach already produces reasonable results. The main failure cases occur due to the limited rendering fidelity of the parametric face model. This is most noticeable in the eye images, where eyelashes and glints are almost completely absent due to poor generalization of the view-conditioned rendering method of [Lombardi et al. 2018] to very close viewpoints. Specifically, the style transferred image often exhibits a modified gaze direction compared to the source. In addition, this modification is often inconsistent across different camera views. This last effect is also observed for the rest of the face, though not to the same extent as for the eyes. When solving for (z, v) using constraints from all camera views in Eq. (5), these inconsistent and independent errors have an averaging effect, which manifests as dampened facial expressions.

To overcome this problem, we propose a method that exploits the spatial relationship between cameras during image style translation. Inspired by [Bansal et al. 2018], where additional cyclic-consistency constraints are obtained through temporal prediction, we enforce cyclic consistency through cross-view prediction. Specifically, for a pair of views, denoted 0 and 1 for simplicity, we train "spatial predictors" P_0 and P_1 to transform images in view 0 to view 1 and vice versa. These pairs are chosen in such a way that they observe similar parts of the face to ensure that their contents are mutually predictable, for example the stereo eye-camera pair, or lower-face cameras on the same side. Together with the standard terms of CycleGAN [Zhu et al. 2017], our loss function takes the form:

    L = L_C + \lambda_G L_G + \lambda_P L_P + \lambda_V L_V,    (7)

where L_C = L_{CH} + L_{CR} is the cycle-consistency loss for each domain and for each view,

    L_{CH} = \sum_{i \in \{0,1\}} \sum_{t} \left\| G_\psi \circ F_\phi (H_i^t) - H_i^t \right\|_1,    (8)

L_G = L_{GH} + L_{GR} is the GAN loss (for both the generator and discriminator) for each domain and for each view,

    L_{GH} = \sum_{i \in \{0,1\}} \sum_{t} \left[ \log D_H(H_i^t) + \log\left(1 - D_R \circ F_\phi(H_i^t)\right) \right],    (9)

L_P is the loss for the view predictors,

    L_P = \sum_{i \in \{0,1\}} \left( \sum_{t} \left\| P_i(H_i^t) - H_{1-i}^t \right\|_1 + \sum_{s} \left\| P_i(R_i^s) - R_{1-i}^s \right\|_1 \right),    (10)

and finally the cross-view cycle consistency is L_V = L_{VH} + L_{VR},

    L_{VH} = \sum_{i \in \{0,1\}} \sum_{t} \left\| P_i \circ F_\phi(H_i^t) - F_\phi(H_{1-i}^t) \right\|_1,    (11)

where L_{CR}, L_{GR}, and L_{VR} are defined symmetrically, and D_H and D_R are the discriminators of the two domains, respectively. Note that while we do not have paired H_i^t and R_i^s, we do have paired H_i^t and H_{1-i}^t, as well as paired R_i^s and R_{1-i}^s. An illustration of these components is shown in Fig. 6. Different from [Bansal et al. 2018], in our formulation we share P_0 and P_1 across domains, since the relative structural difference between the views should be the same in both domains.

Our problem takes the form of a minimax optimization problem:

    \min_{P_0, P_1, F_\phi, G_\psi} \; \max_{D_H, D_R} \; L\left(P_0, P_1, F_\phi, G_\psi, D_H, D_R\right),    (12)

which we repeat for every pair of views. Compared to standard CycleGAN, our problem involves the additional P_i modules. If we simply alternately train the parameters in {P_0, P_1, F_ϕ, G_ψ} and the parameters in {D_H, D_R}, as in [Bansal et al. 2018], we find that collusion between P and F_ϕ (or G_ψ) can minimize the loss function without preserving expression across the domains, effectively learning different behaviors on real and fake data to compensate for errors made by each other. As a result, the semantics that we want to keep unchanged get lost during the style transformation. To address this problem, we apply "uncooperative training", recently proposed by Harley et al. [2019], which prevents this "cheating" by breaking the optimization into more steps. At each step, the loss function is readjusted so that only terms that operate on real data remain, and only modules that take real data as input are updated. An outline of the algorithm is presented in Algorithm 1. In this way, modules have no chance to learn to compensate for errors made by previous modules. As a result, expressions are better preserved through domain transfer and the cross-view predictions ensure multiview consistency.


ALGORITHM 1: Uncooperative training for multiview image style translation
Input: Unpaired real HMC images {H_i^t} and real rendered images {R_i^s}, in two common views i ∈ {0, 1}.
Output: Converged F_ϕ and G_ψ.
Initialize parameters in all modules;
repeat
    Sample (t, s) to get H_i^t, R_i^s for i ∈ {0, 1} (total 4 images);
    Update ϕ for F using gradients minimizing L_{CH} + L_{GH} + L_{VH};
    Update ψ for G using gradients minimizing L_{CR} + L_{GR} + L_{VR};
    Update P using gradients minimizing L_P;
    Update D_H and D_R using gradients maximizing L_G;
until converged;

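A minimal sketch of the update schedule in Algorithm 1, assuming PyTorch modules with one optimizer per module group; the data loader, learning rates, and the `losses` helper (standing for Eqs. (8)–(11) and their R-domain counterparts) are assumptions, not the paper's training code.

```python
# Sketch of the "uncooperative" update schedule of Algorithm 1.
import itertools
import torch

def train_pair(loader, F_phi, G_psi, P0, P1, D_H, D_R, losses, steps=100000):
    opt_F = torch.optim.Adam(F_phi.parameters(), lr=2e-4)
    opt_G = torch.optim.Adam(G_psi.parameters(), lr=2e-4)
    opt_P = torch.optim.Adam(itertools.chain(P0.parameters(), P1.parameters()), lr=2e-4)
    opt_D = torch.optim.Adam(itertools.chain(D_H.parameters(), D_R.parameters()), lr=2e-4)
    opts = (opt_F, opt_G, opt_P, opt_D)

    def step(optimizer, loss):
        # Only the chosen module group is updated for this loss; the other
        # modules are left untouched, so no module learns to compensate for
        # errors made by another.
        for o in opts:
            o.zero_grad()
        loss.backward()
        optimizer.step()

    for _, (H, R) in zip(range(steps), loader):     # H, R: dicts of views {0, 1}
        step(opt_F, losses.cycle_H(H) + losses.gan_H(H) + losses.crossview_H(H))
        step(opt_G, losses.cycle_R(R) + losses.gan_R(R) + losses.crossview_R(R))
        step(opt_P, losses.predictor(H, R))         # Eq. (10), real data only
        step(opt_D, -(losses.gan_H(H) + losses.gan_R(R)))   # maximize the GAN loss
```
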
[Fig. 7 diagram: panels (a)–(e); legend: foreground pixel, background pixel (contributing gradients), background pixel (not contributing gradients), projected mesh, target face contour.]

Fig. 7. Background-aware differentiable rendering. (a) Input HMC image with a puffed cheek. (b) Style transferred result (target image). (c) Current rendered avatar. (d) Overlay of (b) and (c). Many pixels in the red box are currently rendered as background pixels, but should be in the foreground. (e) A closer look at pixels around the face contour. For a background pixel p within a bounding box of any projected triangle (dashed rectangle), its color is blended from the color C_t at the closest point on the closest edge p1p2 and the background color C_b, with a weighting related to the distance d_p. The red arrows indicate that the gradient generated from the dark gray pixels can be back-propagated to the geometry of the face.

from terminating in poor local minima.
3.4 Background-aware Differentiable Rendering
A core component of our system is a differentiable renderer R for 3.5 Implementation Details
the parametric face model. It is used to generate synthetic samples For the style transformation described in §3.3, we use (256 × 256)-
for the domain transfer described above, but also for evaluating the sized images for both domains, and adapt the architecture design
reconstruction accuracy given the estimated expression and pose from [Zhu et al. 2017]. For Fϕ , Gψ , and Pi , we use a ResNet with
parameters q = (z, v) in Eq. (5). Our rendering function blends a 4× downsampling followed by 3 ResNet modules and another 4×
rasterization of the face model’s shape and background, so that the upsampling. For discriminators DH and DR , we apply spectral nor-
pixel color C(p) at image position p is defined as: malization [Miyato et al. 2018] for better quality of generated images
C(p) = W (p) Ct (p) + (1 − W (p)) Cb (13) and more stable training. For Eθ in the overall training described
in 3.2, we build separate convolutional networks to convert individ-
where Ct (p) is the rasterized color from texture at position p, and ual Hit to |C| vectors, which are concatenated and then converted
Cb is a constant background color. If we simply define W as a binary into both z t and v t using separate MLPs. For the prior δ (z t ) in Eq.(5),
mask of the rasterization’s pixel coverage, where W (p) = 1 if p is we use an L 2 -loss δ (z t ) = z t 2 because the latent space associated
2
dW (p)
within a triangle and otherwise W (p) = 0, then dq would be with D is learned with a KL divergence against a standard normal
zero for all p because of the discreteness of rasterization. In this case, distribution [Lombardi et al. 2018].

4 REALTIME FACE ANIMATION
After minimizing the loss in Eq. (5), we can apply a converged E_θ to all H^t to obtain per-frame correspondences {(H^t, z^t)}_{t∈T}. We can now drop all the auxiliary views in H^t and only retain the views available in a tracking HMC, H̃^t = {H_i^t}_{i∈C′}, where |C′| = 3. This forms the training data {(H̃^t, z^t)}_{t∈T} for training the regressor that will be used during realtime animation.

Rather than simply minimizing an L2-loss in the latent space of z^t, we find it is important to measure the loss in a way that encourages the network to spend capacity on the most visually sensitive parts, such as subtle lip shape and gaze direction. We additionally minimize the error in geometry and in the texture map, particularly in the eye and mouth regions, because the avatar does not have detailed geometry there and relies on view-dependent texture to be photorealistic in these regions. Specifically, we build a regressor Ẽ_θ̃ to convert H̃^t to the target z^t:

    \min_{\tilde{\theta}} \sum_{t} \left\| z^t - \tilde{z}^t \right\|_2 + \lambda_1 \left\| M^t - \tilde{M}^t \right\|_2 + \lambda_2 \left\| \kappa(T_0^t) - \kappa(\tilde{T}_0^t) \right\|_2,    (15)

where

    \tilde{z}^t = \tilde{E}_{\tilde{\theta}}(\tilde{H}^t),    (16)
    \tilde{M}^t, \tilde{T}_0^t \leftarrow D(\tilde{z}^t, v_0),    (17)
    M^t, T_0^t \leftarrow D(z^t, v_0),    (18)

where κ is a crop on the texture maps focusing on the eye and mouth areas, and v_0 is a fixed frontal view of the avatar.

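The sketch below spells out Eqs. (15)–(18) as a per-sample training objective, assuming a decoder D that returns geometry and a frontal-view texture and a crop helper kappa for the eye/mouth regions; the function names and default weights are illustrative assumptions.

```python
# Sketch of the regressor training loss in Eqs. (15)-(18). `decoder` stands for
# the avatar decoder D, `kappa` crops the eye/mouth regions of a texture map,
# and `v0` is a fixed frontal viewpoint. Names and weights are assumptions.
import torch

def regressor_loss(E_tilde, decoder, kappa, H_tilde, z_target, v0,
                   lambda1=1.0, lambda2=1.0):
    z_pred = E_tilde(H_tilde)                          # Eq. (16)
    M_pred, T_pred = decoder(z_pred, v0)               # Eq. (17)
    with torch.no_grad():                              # targets from Eq. (18)
        M_target, T_target = decoder(z_target, v0)
    return (torch.norm(z_target - z_pred)
            + lambda1 * torch.norm(M_target - M_pred)
            + lambda2 * torch.norm(kappa(T_target) - kappa(T_pred)))
```
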
The architectural design of Ẽ_θ̃ should allow good fitting to the target z^t, be robust against real-world variations such as surrounding illumination and headset wearing positions, and, at the same time, achieve realtime inference speed. These requirements are different from those on E_θ, whose function is solely to minimize Eq. (5), i.e., to overfit. Therefore, we use smaller input images (192 × 192) and a smaller number of convolutional filters and layers for Ẽ_θ̃ compared to E_θ. The architectural design is otherwise similar: we build 3 separate branches of convolutional networks to convert the H̃_i^t into 3 one-dimensional vectors, because the input images observe different parts of the face and hence don't share spatial structure. Finally, these vectors are concatenated and converted to z̃^t through an MLP. During training, we augment the input images with a random small-angle homography, to simulate camera rotation and account for manufacturing variance in camera mounts, as well as with a directional image intensity histogram perturbation to account for lighting variation. Our architecture design balances speed with almost no obvious quality drop (from 9-view input to 3-view input) on both training data and validation data, at an inference speed of 73 fps on an NVIDIA Titan 1080Ti.

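As an illustration of the two augmentations mentioned above, the sketch below applies a random small-angle homography and a directional intensity ramp to a grayscale image using OpenCV and NumPy; the perturbation model, parameter ranges, and the assumption of 8-bit images are all illustrative choices, not the paper's exact recipe.

```python
# Assumed sketch of the training-time augmentations described above: a random
# small homography (simulating slight camera-mount rotation) and a directional
# intensity ramp (simulating uneven illumination). Parameter ranges are
# illustrative, not the values used in the paper.
import cv2
import numpy as np

def augment(img, max_shift=4.0, max_gain=0.2, rng=None):
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    # Random small homography: jitter the four image corners by a few pixels.
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
    Hmat = cv2.getPerspectiveTransform(src, dst)
    out = cv2.warpPerspective(img, Hmat, (w, h), borderMode=cv2.BORDER_REPLICATE)
    # Directional intensity perturbation: a linear gain ramp along a random direction.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    xs, ys = np.meshgrid(np.linspace(-1, 1, w), np.linspace(-1, 1, h))
    ramp = 1.0 + max_gain * rng.uniform(-1, 1) * (np.cos(theta) * xs + np.sin(theta) * ys)
    # Assumes 8-bit input images.
    return np.clip(out.astype(np.float32) * ramp, 0, 255).astype(img.dtype)
```
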
Given that Ẽ and D can both be evaluated in realtime [Lombardi et al. 2018], we can build a two-way social VR system, where both users can see a high-fidelity animation of each other's personalized avatar while wearing an HMC. On each side, the computing node runs the user's Ẽ on one GPU and sends the encoded z^t over the network. At the same time, the node receives z^t from the other side, runs the decoder D of the other person, and renders stereo images (for the left and right eye) of the other user's avatar on a second GPU.

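The bidirectional setup just described can be summarized as two symmetric per-frame loops; the sketch below is a schematic of one side's loop, where the camera interface, network transport, GPU placement, and helper names are all assumptions, not the deployed system.

```python
# Schematic of one side of the two-way social VR loop described above: encode
# the local user's HMC frames on one GPU, exchange latent codes over the
# network, and decode + render the remote user's avatar on a second GPU.
# `hmc`, `link`, `hmd`, and `stop` are assumed interfaces.
import torch

def run_one_side(E_tilde, decoder_remote, hmc, link, hmd, stop):
    while not stop.is_set():
        views = hmc.read_frames()                       # 3 tracking-camera images
        with torch.no_grad():
            z_local = E_tilde([v.cuda(0) for v in views])
        link.send(z_local.cpu().numpy().tobytes())      # latent code per frame
        z_remote = torch.from_numpy(link.recv_latest()).cuda(1)
        with torch.no_grad():
            left, right = decoder_remote.render_stereo(z_remote, hmd.eye_poses())
        hmd.present(left, right)
```
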
5 RESULTS AND EXPERIMENTS
We first show qualitative results of both the correspondences found using the training HMC (§ 3) and the realtime predictions using the tracking HMC (§ 4). Then we characterize the importance of various components in our system design, including multiview style transfer, distribution matching, cross-view consistency, and background-aware differentiable rendering, in an ablation study. Finally, we compare our animation results with [Olszewski et al. 2016] and [Lombardi et al. 2018]. For more results, such as speech and dynamics, which are better shown as consecutive frames with audio, please refer to our supplementary video. Our method achieves natural speech dynamics with only frame-by-frame inference, without temporal constraints.

5.1 Qualitative Results
5.1.1 Established Correspondence. Fig. 8 shows three examples of corresponding HMC images and rendered avatars for three different subjects. In the last row of each example, we show the alignment between the HMC images and the rendered avatar for every view of the training HMC, where good alignment demonstrates the high fidelity of the obtained correspondence. We present more results in Fig. 9, showing that our method robustly produces high-fidelity results for a large range of facial expressions, including extreme expressions with occluded face parts, such as biting lips, puffed cheeks, wrinkles on the forehead and nose, and tightly closed eyes. Note that the inside of the eyes and mouth are not modeled with detailed geometry [Lombardi et al. 2018], so the mapping for nuanced expressions, like gaze directions and the visibility of teeth and the shape of the tongue, relies completely on minimizing the pixel loss of the rendered avatar against the style transferred images.

5.1.2 Realtime Animation. Fig. 10 shows example outputs from a trained regressor Ẽ that runs in realtime. We captured each subject with a training HMC in 4 different environments, with different lighting and different HMC placement relative to the head (the subjects had to take off and put on the HMC between captures). We use our method described in § 3 to generate "ground-truth" correspondences for all data, but only train Ẽ on 3 of the captures and use the remaining capture as the testing set. For most of the facial expressions, the regressor using only the 3 tracking views performs very well and matches the correspondence established using the 9 training views, indicating that the 3 tracking views contain sufficient information despite their obliqueness. For a few cases highlighted in the red boxes, the tracking views have poor coverage of the mouth interior (illustrated in Fig. 2(e)) and of the gaze direction when the eyelids are almost closed (the additional eye cameras in the training HMC are lower), resulting in slightly different results. While the use of region-focused texture and geometry terms in Eq. (15) already helps concentrate the capacity of the network on perceptually important parts, the error for extreme gaze directions is still observable.

5.2 Ablation Study
We first validate the importance of having the additional camera views of the training HMC, style transfer, and distribution matching for finding correspondences. We then compare independent per-view style transfer with our cross-view cycle consistency. We also show evidence that background-aware rendering leads to better convergence. Finally, we discuss the effects of using different loss functions in the optimizations of Eq. (5) and Eq. (15).


Fig. 8. Qualitative results of established correspondence using the training HMC. For each subject, the first row shows input images from the training
HMC and detected landmarks. The second row shows the avatar rendered with found corresponding facial expressions and headset position (coupled with
camera calibration at each view), with projected landmarks whose uv-coordinates are optimized by Eq. (6). The leftmost image is rendered from a fixed frontal
view. The last row overlays the previous two, where good alignment shows high fidelity correspondence.

5.2.1 Effect of in-domain pixel matching, additional viewpoints, and distribution matching. To show the importance of using multiview style transfer in Eq. (5), we compare our results with (a) only using domain-invariant features: minimizing the 2D distance between detected landmarks and projected landmarks of the avatar on all 9 views, plus matching image gradients (edges); (b) only using 3 tracking views: training independent per-view style transfer and solving Eq. (5) with only the tracking views; and (c) not matching the distribution of HMC pose: simply using a mean pose to render {R_i^s} for style transfer, instead of solving Eq. (6) to collect a set of HMC poses to sample from.


Fig. 9. More examples of established correspondence. Each subfigure shows 8 facial expressions for each identity. The 9 grayscale images, with 3 tracking
views (large images) and 6 additional training views (small images), are input HMC images, and in color on the right is the rendered 3D avatar. Our method
can find high quality mappings even for subtle expressions on the upper face, where the camera angle is oblique and close to the subject. Our method can also
capture subtle differences in tongues, teeth, and eyes where the avatar does not have detailed geometry.

Fig. 10. Predictions on held out data for a tracking HMC. In each subfigure, the target image is “ground-truth” from our established correspondences,
and the prediction image is the output from Ẽ. While most of the time we can map from only 3-view inputs to targets successfully, there are some failure
cases highlighted in the red boxes, where the nuances are hard to observe without the additional cameras on the training HMC.

In Fig. 11, we show 8 expressions to compare. Results that only use landmark and edge constraints can roughly match the shape of the mouth, but completely fail to match parts such as the teeth, tongue, and gaze direction, as landmarks cannot describe these details on a photorealistic avatar. Matching image gradients does not improve the results, because the domain gap between HMC images and rendered avatars is too large for naive edge matching to be useful. Using only 3 tracking views with view-independent style transfer and Eq. (5) already generates much better results than the 9-view results in (a), showing the importance of bridging the domain gap. But because of the obliqueness of the viewpoints, it sometimes fails to generate precise results when compared to using the full 9 views, such as in estimating the amount of mouth openness (1st, 5th expressions), the mouth interior (6th), and lip shape (7th). Finally, the results from (c) are in general worse than the full method, showing the importance of matching the distribution of spatial structure. The quality of the style transferred results degrades, leading to uncanny faces (1st, 4th, 8th), and semantics are sometimes modified, such as gaze (3rd).

5.2.2 Effect of view-consistent image style transfer. When training style transformers F_ϕ and G_ψ for all 9 views, we found that training them independently for each view already gives good results, despite artifacts on the fake rendered image such as checkerboard patterns. By jointly considering multiple views and a robust L1 loss in Eq. (5), these uncorrelated artifacts are often averaged out. However, other artifacts, such as a wrong color of the mouth interior or changes in gaze direction, do impact the final result, as illustrated in Fig. 12. To show the effect of the cross-view training described in § 3.3, we group the 9 camera views into 4 pairs and run Algorithm 1 for each pair (and still train the remaining 1 view individually). The resulting style transferred images have fewer artifacts on the face in general. We highlight the cases where the per-view CycleGAN baseline fails to capture the visibility of the tongue and a consistent gaze direction. Using the cross-view consistency loss in our method, in contrast, gives consistent images across views and therefore the final optimized avatar better matches the inputs.

5.2.3 Effect of background-aware differentiable rendering. We compare our background-aware differentiable rendering with a baseline where background pixels are simply assigned a constant color and therefore have no path to back-propagate gradients from errors in pixel color to the mesh geometry. In particular, we show the image loss curve during convergence in Fig. 13, for a target puffed cheek. Even though there are constraints from 9 views, which already greatly alleviates the background problem, the baseline still plateaus at a higher loss, and the cheek does not fully expand to match the inputs. In contrast, our method allows gradients from the image loss to affect the geometry through Eq. (13) and therefore shows a lower loss and better face alignment.

5.2.4 Is image loss alone enough to make Eq. (5) converge? While style transferred images provide supervision with great detail, it is a common observation that the convergence basin when using a differentiable renderer can be small and therefore requires careful initialization. However, we found that the convergence basin of our method is large enough that no sample-specific initialization is required. In Fig. 14, we compare two training schemes with the same initialization for θ: (1) use landmark constraints (an L2-loss of 2D distances on images) for the first 2000 iterations to bring the geometry into roughly the correct position, and then remove them to leave only the image loss to fine-tune the result, and (2) only use the image loss, as presented in our method. The results show that these two schemes achieve the same level of image loss at convergence, showing that the convergence basin of (2) is good enough to avoid getting stuck in local minima. Fig. 14 also shows that minimizing a landmark loss can lead to gradients that contradict the image loss (the red solid curve goes up after iteration 2000). This is because of the imprecision of the landmark detections in HMC images and of the uv-coordinate estimation on the avatar's texture map. While the landmark loss can be used to find good headset positions, for facial expression, which requires high precision, it can only provide a relatively rough signal that ultimately becomes unnecessary given the good convergence basin from the background-aware renderer.


[Fig. 11 panels: for each of 8 numbered expressions, columns show Input, Landmark & edge, 3 views only, 9 views (not matching pose), and Full.]

Fig. 11. Ablation on system design. We compare our method with simpler settings such as (a) without style transformation, only using landmark and edge matching, (b) without additional cameras, only using 3 tracking views, and (c) without matching the distribution of headset positions, only using the mean headset pose to render data for style transfer. These settings fail to obtain good correspondences, each in a different way.


5.2.5 Why do we need geometry and texture terms in Eq. (15)? To fit the established correspondences in § 4, we did not simply apply a loss on the predicted z^t, but also on the geometry M and texture T. Ideally, minimizing differences in z^t should also lead to reduced differences in the appearance of the avatar. However, Fig. 15 shows that if we don't add the geometry and texture terms, while the geometry error is still well minimized, the texture suffers from a much higher error, which leads to mismatched expressions such as changes in gaze direction or mouth interior (i.e., regions that do not have detailed geometry). It is also interesting to see that the decrease of the distance in the z^t latent space becomes much slower if we add these terms, showing that z^t and appearance (at least in texture space) are not strongly coupled through the decoder.

5.3 Comparison with prior art
We compare our animation results with our reimplementations of [Olszewski et al. 2016] and [Lombardi et al. 2018]. To compare with [Olszewski et al. 2016], we select a short temporal window of HMC images labeled as a certain peak expression, and map them to the encoded z^t of the corresponding expression captured with the modeling sensors during avatar building. We also performed dynamic time warping to determine correspondence labels for speech data.


Fig. 16 (left) clearly shows the limitations. It not only suffers from inconsistency due to subjects not repeating the same expression twice, but also cannot generalize well to unlabeled expressions, such as those in a range-of-motion clip (a short period of free-form facial movements) or in transitions from peak expressions to neutral. Moreover, even though the speech data is densely labeled through dynamic time warping based on audio, the expression, especially in the upper face, can be quite different across captures.

[Lombardi et al. 2018] only demonstrated their animation results on speech data. When applied to extreme expressions, it often generates far worse results, as shown in Fig. 16 (right). However, their method only uses 3 views and does not measure headset position. For a better comparison, we augment their algorithm with our 9-view training HMC, as well as with additionally matching the distribution of headset poses. We find that both of these design choices improve their results, showing their benefit even on other approaches. Additionally, our method still captures subtleties much more precisely, such as the tongue sticking out, lip shapes, visibility of the teeth, and

[Fig. 12 panels: for each example, the Input HMC images, the Style transferred (target) images, and the Optimized Result, shown for Per-view and Multiview style transfer.]

Fig. 12. Comparison between per-view style transfer and cross-view style transfer. In each subfigure, we show the style transferred multiview images and the optimized result from Eq. (5) using these target images, respectively. The cross-view style transferred images have better quality and better preserve semantics. Here we show that the visibility of the tongue and the gaze direction are better mapped to the input with cross-view style transfer.

[Fig. 13 plot: "Effect of gradients from background pixels"; normalized image loss vs. iteration for the background-aware image loss and the baseline image loss, shown next to the input HMC image and the style transferred target.]

Fig. 13. Effect of background-aware rendering. We compare the image loss curves of a baseline (background pixels contribute no gradient) with our method, with a target "puffed cheeks" on all 9 views (only 1 is shown). While the baseline gets stuck at a higher loss, our method fully expands the cheek.

[Fig. 14 plot: "Evaluation of Early Use of Landmark Loss"; landmark loss and image loss vs. iteration for (1) also optimizing the landmark loss before iteration 2000 and (2) only optimizing the image loss.]

Fig. 14. Evaluating landmark loss as initialization. We compare two training schemes for Eq. (5): (1) use both the landmark loss and the image loss during the first 2000 iterations, and only use the image loss after that (red curves), and (2) only use the image loss (blue curves). The result shows the convergence basin is large enough that landmark initialization is not necessary. We also observe the contradiction between the two losses, as the red landmark loss curve goes back to a similar level as the blue one after iteration 2000. The learning rate is decreased by 10× at iterations 3500 and 7000, hence the loss drops in both methods.


[Figure 15 plot: "Effect of geometry and texture term" — latent-space, geometry, and texture losses vs. iteration, each shown for λ1 = λ2 = 100 and for λ1 = λ2 = 0.]

Fig. 15. Effect of the geometry and texture terms in Eq. (15). We compare errors in the latent space of z, in geometry, and in texture when training with and without the additional geometry and texture terms. The curves show a significant tradeoff between distributing network capacity to match the latent-space loss or the texture loss, demonstrating the necessity of non-zero λ1 and λ2. The loss values are normalized, so only relative values are meaningful.
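Since Eq. (15) itself is not reproduced in this excerpt, the sketch below only assumes it is a weighted sum of a latent-space term with geometry and texture terms, with λ1 and λ2 as the weights; the ablation in Fig. 15 then corresponds to λ1 = λ2 = 100 versus λ1 = λ2 = 0. All names are placeholders, not the authors' code.

```python
def combined_regression_loss(pred, target, lambda_geo=100.0, lambda_tex=100.0):
    """Hypothetical weighted sum in the spirit of Eq. (15) (exact form not shown here).

    pred / target are dicts of tensors holding the avatar latent code z, the
    decoded mesh geometry, and the decoded texture. Setting
    lambda_geo = lambda_tex = 0 reproduces the latent-only baseline in Fig. 15.
    """
    loss_z = ((pred["z"] - target["z"]) ** 2).mean()
    loss_geo = (pred["geometry"] - target["geometry"]).abs().mean()
    loss_tex = (pred["texture"] - target["texture"]).abs().mean()
    return loss_z + lambda_geo * loss_geo + lambda_tex * loss_tex
```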
6 DISCUSSION
In this work, we present a system that enables high-fidelity bidirectional communication in VR using photorealistic avatars. By leveraging a training headset with augmented cameras, we formulate a self-supervised learning problem that can generate sensor-to-avatar correspondences. These correspondences are then used to train a deep network that directly produces highly precise face parameters in realtime from images captured by a tracking headset with a minimal sensor design.

Although the application of this work has focused on VR headset designs, the same difficulties identified in this work also apply to AR headsets, where camera viewpoint difficulties are exacerbated by stricter requirements on minimal headset design. One direction of future work is to investigate the applicability of our method to these more extreme cases.


[Figure 16 layout: rows show peak expressions, range of motion, speech, and transitions from peak; the left panel compares [Olszewski et al. 16] with ours, and the right panel compares [Lombardi et al. 18], their method w/ training HMC, w/ headset pose, and ours.]

Fig. 16. Comparison with prior art. On the left, we compare our results with [Olszewski et al. 2016], which suffers from the subject's inconsistency in peak expressions and from poor generalization to data for which semantic labels cannot be acquired. On the right, we compare with [Lombardi et al. 2018], and also with their method augmented with our designs, including more input views (“w/ training HMC”) and matching the distribution of headset positions (“w/ headset pose”). Their results improve when augmented with our designs, but our method still captures nuances on the face significantly better.
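One straightforward way to realize the “w/ headset pose” augmentation referenced above is to perturb the nominal headset-to-face rigid transform when generating training renders, so that the training pose distribution resembles what the tracking headset actually sees. The sketch below is an illustration under that assumption only; the Gaussian model and the sampling ranges are placeholders, not values from the paper.

```python
import numpy as np

def sample_headset_pose(rng, rot_std_deg=2.0, trans_std_mm=2.0):
    """Sample a small rigid perturbation of the nominal headset-to-face pose.

    Rotation/translation magnitudes are placeholders; in practice they would
    be fit to the pose distribution measured on the tracking headset.
    """
    angles = np.deg2rad(rng.normal(0.0, rot_std_deg, size=3))  # roll, pitch, yaw
    cx, cy, cz = np.cos(angles)
    sx, sy, sz = np.sin(angles)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = rng.normal(0.0, trans_std_mm, size=3)
    return T  # 4x4 transform applied to the nominal camera extrinsics
```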

REFERENCES
Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. 2018. Recycle-GAN: Unsupervised Video Retargeting. In IEEE European Conference on Computer Vision (ECCV).
Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W. Sumner, and Markus Gross. 2011. High-quality Passive Facial Performance Capture Using Anchor Frames. ACM Transactions on Graphics (TOG) 30, 4, Article 75 (July 2011), 10 pages.
BinaryVR. 2019. Real-time Facial Tracking. https://www.binaryvr.com/vr.
Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. 187–194.
Chen Cao, Qiming Hou, and Kun Zhou. 2014. Displaced Dynamic Expression Regression for Real-time Facial Tracking and Animation. ACM Transactions on Graphics (TOG) 33, 4 (July 2014), 43:1–43:10.
Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. 1998. Active Appearance Models. In IEEE European Conference on Computer Vision (ECCV).
Dimensional Imaging. 2016. DI4D PRO System. http://www.di4d.com/systems/di4d-pro-system/.
Epic Games. 2017. Epic Games. https://www.epicgames.com.
Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, Kun Zhang, and Dacheng Tao. 2018. Geometry-Consistent Generative Adversarial Networks for One-Sided Unsupervised Domain Mapping. arXiv preprint arXiv:1706.00826 (2018).
Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. 2014. Driving High-Resolution Facial Scans with Video Performance Capture. ACM Transactions on Graphics (TOG) 34, 1, Article 8 (Dec. 2014), 8:1–8:14 pages.
Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image Style Transfer Using Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NIPS).
Adam W. Harley, Shih-En Wei, Jason Saragih, and Katerina Fragkiadaki. 2019. Image Disentanglement and Uncooperative Re-Entanglement for High-Fidelity Image-to-Image Translation. arXiv preprint arXiv:1901.03628 (2019).
Hellblade. 2018. Hellblade. https://www.hellblade.com/.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-image Translation with Conditional Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Neural 3D Mesh Renderer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. 2015. Deep Convolutional Inverse Graphics Network. In Advances in Neural Information Processing Systems (NIPS). 2539–2547.
Samuli Laine, Tero Karras, Timo Aila, Antti Herva, Shunsuke Saito, Ronald Yu, Hao Li, and Jaakko Lehtinen. 2017. Production-level Facial Performance Capture Using Deep Convolutional Neural Networks. In Proceedings of the ACM SIGGRAPH / Eurographics Symposium on Computer Animation. Article 10, 10 pages.
Hao Li, Laura Trutoiu, Kyle Olszewski, Lingyu Wei, Tristan Trutna, Pei-Lun Hsieh, Aaron Nicholls, and Chongyang Ma. 2015. Facial Performance Sensing Head-mounted Display. ACM Transactions on Graphics (TOG) 34, 4, Article 47 (July 2015), 47:1–47:9 pages.
Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep Appearance Models for Face Rendering. ACM Transactions on Graphics (TOG) 37, 4, Article 68 (July 2018), 13 pages.
Magic Leap. 2018. Magic Leap. https://www.magicleap.com/.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral Normalization for Generative Adversarial Networks. In International Conference on Learning Representations (ICLR).
Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. 2018. GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Vinod Nair, Josh Susskind, and Geoffrey E. Hinton. 2008. Analysis-by-Synthesis by Learning to Invert Generative Black Boxes. In Proceedings of the 18th International Conference on Artificial Neural Networks (ICANN), Part I. 971–981.
Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. 2016. High-fidelity Facial and Speech Animation for VR HMDs. ACM Transactions on Graphics (TOG) 35, 6, Article 221 (Nov. 2016), 14 pages.
Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. 2009. Face Alignment through Subspace Constrained Mean-shifts. In IEEE International Conference on Computer Vision (ICCV).
Mike Seymour, Chris Evans, and Kim Libreri. 2017. Meet Mike: Epic Avatars. In ACM SIGGRAPH 2017 VR Village. Article 12, 2 pages.
Ayush Tewari, Michael Zollöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez, and Christian Theobalt. 2017. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In IEEE International Conference on Computer Vision (ICCV).
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2Face: Real-time Face Capture and Reenactment of RGB Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2018. FaceVR: Real-Time Gaze-Aware Facial Reenactment in Virtual Reality. ACM Transactions on Graphics (TOG) 37, 2, Article 25 (June 2018), 25:1–25:15 pages.
Unreal Engine 4. 2018. Unreal Engine 4. https://www.unrealengine.com/.
Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional Pose Machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jing Xiao, Jinxiang Chai, and Takeo Kanade. 2006. A Closed-Form Solution to Non-Rigid Shape and Motion Recovery. International Journal of Computer Vision (IJCV) 67, 2 (April 2006), 233–246.
Xuehan Xiong and Fernando De la Torre. 2013. Supervised Descent Method and Its Applications to Face Alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ilker Yildirim, Winrich Freiwald, Tejas Kulkarni, and Joshua B. Tenenbaum. 2015. Efficient Analysis-by-synthesis in Vision: A Computational Framework, Behavioral Tests, and Comparison with Neural Representations. In Proceedings of 37th Annual Conference of the Cognitive Science Society.
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV).

