Art2Real: Unfolding the Reality of Artworks

via Semantically-Aware Image-to-Image Translation

Matteo Tomei Marcella Cornia Lorenzo Baraldi Rita Cucchiara


University of Modena and Reggio Emilia
{name.surname}@unimore.it

Abstract

The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain. This is partially due to the small amount of annotated artistic data, which is not even comparable to that of natural images captured by cameras. In this paper, we propose a semantic-aware architecture which can translate artworks to photo-realistic visualizations, thus reducing the gap between visual features of artistic and realistic data. Our architecture can generate natural images by retrieving and learning details from real photos through a similarity matching strategy which leverages a weakly-supervised semantic understanding of the scene. Experimental results show that the proposed technique leads to increased realism and to a reduction in domain shift, which improves the performance of pre-trained architectures for classification, detection, and segmentation. Code is publicly available at: https://github.com/aimagelab/art2real.

Figure 1: We present Art2Real, an architecture which can reduce the gap between the distributions of visual features from artistic and realistic images, by translating paintings to photo-realistic images.
1. Introduction

Our society has inherited a huge legacy of cultural artifacts from past generations: buildings, monuments, books, and exceptional works of art. While this heritage would benefit from algorithms which can automatically understand its content, computer vision techniques have been rarely adapted to work in this domain.

One of the reasons is that applying state of the art techniques to artworks is rather difficult, and often brings poor performance. This can be motivated by the fact that the visual appearance of artworks is different from that of photo-realistic images, due to the presence of brush strokes, the creativity of the artist and the specific artistic style at hand. As current vision pipelines exploit large datasets consisting of natural images, learned models are largely biased towards them. The result is a gap between high-level convolutional features of the two domains, which leads to a decrease in performance in the target tasks, such as classification, detection or segmentation.

This paper proposes a solution to the aforementioned problem that avoids the need for re-training neural architectures on large-scale datasets containing artistic images. In particular, we propose an architecture which can reduce the shift between the feature distributions from the two domains, by translating artworks to photo-realistic images which preserve the original content. A sample of this setting is depicted in Fig. 1.

As paired training data is not available for this task, we revert to an unpaired image-to-image translation setting [56], in which images can be translated between different domains while preserving some underlying characteristics. In our art-to-real scenario, the first domain is that of paintings while the second one is that of natural images. The shared characteristic is that they are two different visualizations of the same class of objects; for example, they both represent landscapes.
In the translation architecture that we propose, new photo-realistic images are obtained by retrieving and learning from existing details of natural images and exploiting a weakly-supervised semantic understanding of the artwork. To this aim, a number of memory banks of realistic patches is built from the set of photos, each containing patches from a single semantic class in a memory-efficient representation. By comparing generated and real images at the patch level, in a multi-scale manner, we can then drive the training of a generator network which learns to generate photo-realistic details, while preserving the semantics of the original painting. As performing a semantic understanding of the original painting would create a chicken-and-egg problem, in which unreliable data is used to drive the training and the generation, we propose a strategy to update the semantic masks during training, leveraging the partial convergence of a cycle-consistent framework.

We apply our model to a wide range of artworks which include paintings from different artists and styles, landscapes and portraits. Through experimental evaluation, we show that our architecture can improve the realism of translated images when compared to state of the art unpaired translation techniques. This is evaluated both qualitatively and quantitatively, by setting up a user study. Furthermore, we demonstrate that the proposed architecture can reduce the domain shift when applying pre-trained state of the art models to the generated images.

Contributions. To sum up, our contributions are as follows:
• We address the domain gap between real images and artworks, which prevents the understanding of data from the artistic domain. To this aim, we propose a network which can translate paintings to photo-realistic generated images.
• The proposed architecture is based on the construction of efficient memory banks, from which realistic details can be recovered at the patch level. Retrieved patches are employed to drive the training of a cycle-consistent framework and to increase the realism of generated images. This is done in a semantically aware manner, exploiting segmentation masks computed on artworks and generated images during training.
• We show, through experimental results in different settings, improved realism with respect to state of the art approaches for image translation, and an increase in the performance of pre-trained models on generated data.

2. Related work

Image-to-image translation. Generative adversarial networks have been applied to several conditional image generation problems, ranging from image inpainting [35, 53, 54, 51] and super-resolution [23] to video prediction [33, 47, 48, 28] and text to image synthesis [36, 37, 55, 50]. Recently, a line of work on image-to-image translation has emerged, in both paired [16, 40] and unpaired settings [56, 20, 29, 43]. Our task belongs to the second category, as the translation of artistic paintings to photo-realistic images cannot be solved by exploiting supervised methods.

Zhu et al. [56] proposed the Cycle-GAN framework, which learns a translation between domains by exploiting a cycle-consistency constraint that guarantees the consistency of generated images with respect to the original ones. On a similar line, Kim et al. [20] introduced a method for preserving the key attributes between the input and the translated image, while enforcing a cycle-consistency criterion. Liu et al. [29], instead, used a combination of generative adversarial networks, based on CoGAN [30], and variational auto-encoders. While all these methods have achieved successful results on a wide range of translation tasks, none of them has been specifically designed, nor applied, to recover photo-realism from artworks.

A different line of work is multi-domain image-to-image translation [5, 2, 52]: here, the same model can be used for translating images according to multiple attributes (e.g., hair color, gender or age). Other methods, instead, focus on diverse image-to-image translation, in which an image can be translated in multiple ways by encoding different style properties of the target distribution [57, 15, 24]. However, since these methods typically depend on domain-specific properties, they are not suitable for our setting, as realism is more important than diversity.

Neural style transfer. Another way of performing image-to-image translation is that of neural style transfer methods [7, 8, 18, 14, 39], in which a novel image is synthesized by combining the content of one image with the style of another, typically a painting. In this context, the seminal work by Gatys et al. [7, 8] proposed to jointly minimize a content loss, to preserve the original content, and a style reconstruction loss, to transfer the style of a target artistic image. The style component is encoded by exploiting the Gram matrix of activations coming from a pre-trained CNN. Subsequent methods have been proposed to address and improve different aspects of style transfer, including the reduction of the computational overhead [18, 25, 44], and the improvement of the generation quality [9, 4, 49, 17, 39] and diversity [26, 45]. Other works have concentrated on the combination of different styles [3], and on the generalization to previously unseen styles [27, 10, 41]. All these methods, while being effective at transferring artistic styles, show poor performance in the opposite direction.
Figure 2: Overview of our Art2Real approach. A painting is translated to a photo-realistic visualization by forcing a matching with patches from real photos. This is done in a semantically-aware manner, by building class-specific memory banks of real patches B^c, and pairing generated and real patches through affinity matrices A^c, according to their semantic classes. Segmentation maps are computed either from the original painting or the generated image as the training proceeds.

3. Proposed approach

Our goal is to obtain a photo-realistic representation of a painting. The proposed approach explicitly guarantees the realism of the generation and a semantic binding between the original artwork and the generated picture. To increase the realism, we build a network which can copy from the details of real images at the patch level. Further, to reinforce the semantic consistency before and after the translation, we make use of a semantic similarity constraint: each patch of the generated image is paired up with similar patches of the same semantic class extracted from a memory bank of realistic images. The training of the network aims at maximizing this similarity score, in order to reproduce realistic details and preserve the original scene. An overview of our model is presented in Fig. 2.

3.1. Patch memory banks

Given a semantic segmentation model, we define a pre-processing step with the aim of building the memory banks of patches which will drive the generation. Each memory bank B^c is tied to a specific semantic class c, in that it can contain only patches which belong to its semantic class. To define the set of classes, and semantically understand the content of an image, we adopt the weakly-supervised segmentation model from Hu et al. [13]: in this approach, a network is trained to predict semantic masks from a large set of categories, by leveraging the partial supervision given by detections. We also define an additional background memory bank, to store all patches which do not belong to any semantic class.

Following a sliding-window policy, we extract fixed-size RGB patches from the set of real images and put them in a specific memory bank B^c, according to the class label c of the mask in which they are located. Since a patch might contain pixels which belong to a second class label or to the background class, we store in B^c only patches containing at least 20% pixels from class c.

Therefore, we obtain a number of memory banks equal to the number of different semantic classes found in the dataset, plus the background class, where patches belonging to the same class are placed together (Fig. 3). Also, semantic information from generated images is needed: since images generated at the beginning of the training are less informative, we first extract segmentation masks from the original paintings. As soon as the model starts to generate meaningful images, we employ the segmentation masks obtained on generated images.

Figure 3: Memory banks building. A segmentation model [13] computes segmentation masks for each realistic image in the dataset; then, RGB patches belonging to the same semantic class are placed in the same memory bank.
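To make the pre-processing step concrete, the following is a minimal sketch (ours, not the released code) of how such class-specific banks could be filled. It assumes a hypothetical segment(image) callable returning a per-pixel integer label map, as the model of Hu et al. [13] would provide; the routing of unassigned patches to the background bank is our own interpretation of the text, and the patch size and stride are illustrative values.

```python
import numpy as np

def build_memory_banks(images, segment, patch_size=16, stride=6,
                       min_class_ratio=0.20, background_label=0):
    """Collect fixed-size RGB patches from real photos into class-specific banks.

    `images` is an iterable of HxWx3 uint8 arrays; `segment(img)` is assumed to
    return an HxW integer label map, with 0 marking background pixels.
    """
    banks = {}  # class label -> list of flattened RGB patches
    for img in images:
        labels = segment(img)
        h, w = labels.shape
        for y in range(0, h - patch_size + 1, stride):
            for x in range(0, w - patch_size + 1, stride):
                patch = img[y:y + patch_size, x:x + patch_size]
                patch_labels = labels[y:y + patch_size, x:x + patch_size]
                counts = np.bincount(patch_labels.ravel())
                assigned = False
                # A patch enters bank c only if at least 20% of its pixels
                # carry class label c; it may enter several banks.
                for c in np.nonzero(counts)[0]:
                    if c == background_label:
                        continue
                    if counts[c] / patch_labels.size >= min_class_ratio:
                        banks.setdefault(int(c), []).append(patch.reshape(-1))
                        assigned = True
                # Patches matching no semantic class go to the background bank.
                if not assigned:
                    banks.setdefault(background_label, []).append(patch.reshape(-1))
    return {c: np.stack(p).astype(np.float32) for c, p in banks.items()}
```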

3.2. Semantically-aware generation

The unpaired image-to-image translation model that we propose maps images belonging to a domain X (that of artworks) to images belonging to a different domain Y (that of natural images), preserving the overall content. Suppose we have a generated realistic image G(x) at each training step, produced by a mapping function G which starts from an input painting x. We adopt the previously obtained memory banks of realistic patches and the segmentation masks of the paintings in order to both enhance the realism of the generated details and keep the semantic content of the painting.

Pairing similar patches in a meaningful way. At each training step, G(x) is split into patches as well, maintaining the same stride and patch size used for the memory banks. Recalling that we have the masks for all the paintings, we denote a mask of the painting x with class label c as M_x^c. We retrieve all masks M_x of the painting x from which G(x) originates, and assign each generated patch to the class label c of the mask M_x^c in which it falls. If a patch belongs to different masks, it is also assigned to multiple classes. Then, generated patches assigned to a specific class c are paired with similar realistic patches in the memory bank B^c, i.e. the bank containing realistic patches with class label c. Given the realistic patches belonging to B^c, B^c = {b_j^c}, and the set of generated patches with class label c, K^c = {k_i^c}, we center both sets with respect to the mean of the patches in B^c, and we compute pairwise cosine distances as follows:

    d^c_{ij} = 1 - \frac{(k^c_i - \mu^c_b) \cdot (b^c_j - \mu^c_b)}{\| k^c_i - \mu^c_b \|_2 \, \| b^c_j - \mu^c_b \|_2}    (1)

where \mu^c_b = \frac{1}{N_c} \sum_j b^c_j, with N_c the number of patches in memory bank B^c. We compute a number of distance matrices equal to the number of semantic classes found in the original painting x. Pairwise distances are subsequently normalized as follows:

    \tilde{d}^c_{ij} = \frac{d^c_{ij}}{\min_l d^c_{il} + \epsilon}, \quad \text{where } \epsilon = 10^{-5}    (2)

and pairwise affinity matrices are computed by applying a row-wise softmax normalization:

    A^c_{ij} = \frac{\exp(1 - \tilde{d}^c_{ij}/h)}{\sum_l \exp(1 - \tilde{d}^c_{il}/h)} = \begin{cases} \approx 1 & \text{if } \tilde{d}^c_{ij} \ll \tilde{d}^c_{il} \ \forall\, l \neq j \\ \approx 0 & \text{otherwise} \end{cases}    (3)

where h > 0 is a bandwidth parameter. Thanks to the softmax normalization, each generated patch k_i^c will have a high affinity degree with the nearest real patch and with other non-negligible nearby patches. Moreover, affinities are computed only between generated and real patches belonging to the same semantic class.

Approximate affinity matrix. Computing the entire affinity matrix would require an intractable computational overhead, especially for classes with a memory bank containing millions of patches. In fact, the matrix A^c has as many rows as the number of patches of class c extracted from G(x), and as many columns as the number of patches contained in the memory bank B^c. To speed up the computation, we build an approximate nearest-neighbor index I^c for each memory bank. When the affinity matrix for a class c has to be computed, we conduct a k-NN search through I^c to get the k nearest samples of each generated patch k_i^c. In this way, A^c will be a sparse matrix with at most as many columns as k times the number of generated patches of class c. The softmax in Eq. 3 ensures that the approximated version of the affinity matrix is very close to the exact one if the k-NN searches through the indices are reliable. We adopt inverted indexes with exact post-verification, implemented in the Faiss library [19]. Patches are stored with their RGB values when memory banks have fewer than one million vectors; otherwise, we use a PCA pre-processing step to reduce their dimensionality, and scalar quantization to limit the memory requirements of the index.
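As a concrete illustration of this indexing step, the snippet below builds an inverted-file Faiss index with exact (flat) post-verification over one memory bank and retrieves the k nearest real patches of each generated patch, so that only these entries of A^c need to be computed. This is our sketch, not the paper's code: the number of inverted lists, k, and nprobe are arbitrary choices of ours, and the PCA/scalar-quantization variant used for very large banks is omitted.

```python
import faiss
import numpy as np

def build_bank_index(bank, nlist=256):
    """Inverted-file index with exact post-verification over one memory bank
    (a float32 array of shape [N, d] holding flattened RGB patches)."""
    d = bank.shape[1]
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
    index.train(bank)   # learns the coarse centroids of the inverted lists
    index.add(bank)
    return index

def nearest_real_patches(index, gen_patches, k=5, nprobe=16):
    """For each generated patch, return the ids of its k nearest bank patches,
    so that A^c can be filled only on these (sparse) columns."""
    index.nprobe = nprobe
    _, ids = index.search(np.ascontiguousarray(gen_patches, dtype=np.float32), k)
    return ids
```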
Maximizing the similarity. A contextual loss [34] for each semantic class in M_x aims to maximize the similarity between couples of patches with high affinity value:

    \mathcal{L}_{CX}(K^c, B^c) = -\log\left( \frac{1}{N^c_K} \sum_i \max_j A^c_{ij} \right)    (4)

where N^c_K is the cardinality of the set of generated patches with class label c. Our objective is the sum of the previously computed single-class contextual losses over the different classes found in M_x:

    \mathcal{L}_{CX}(K, B) = -\sum_c \log\left( \frac{1}{N^c_K} \sum_i \max_j A^c_{ij} \right)    (5)

where c assumes all the class label values of masks in M_x. Note that masks in M_x are not constant during training: at the beginning, they are computed on paintings, then they are regularly extracted from G(x).

Multi-scale variant. To enhance the realism of generated images, we adopt a multi-scale variant of the approach, which considers different sizes and strides in the patch extraction process. The set of memory banks is therefore replicated for each scale, and G(x) is split at multiple scales accordingly. Our loss function is given by the sum of the values from Eq. 5 computed at each scale, as follows:

    \mathcal{L}_{CX_{MS}}(K, B) = \sum_s \mathcal{L}^s_{CX}(K, B)    (6)

where each scale s implies a specific patch size and stride.
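For clarity, Eqs. 1-5 can be transcribed almost line by line in PyTorch. The snippet below is a dense, single-scale sketch written by us for illustration: it ignores the sparse k-NN approximation described above, and the bandwidth h is an arbitrary value rather than the one used in the paper.

```python
import torch
import torch.nn.functional as F

def contextual_loss_single_class(gen_patches, bank_patches, h=0.5, eps=1e-5):
    """Eqs. 1-4 for one class c: gen_patches is [N_K, d], bank_patches is [N_B, d]."""
    mu = bank_patches.mean(dim=0, keepdim=True)              # mean of the memory bank
    k = F.normalize(gen_patches - mu, dim=1)                  # centered generated patches
    b = F.normalize(bank_patches - mu, dim=1)                 # centered real patches
    d = 1.0 - k @ b.t()                                       # Eq. 1: pairwise cosine distances
    d_tilde = d / (d.min(dim=1, keepdim=True).values + eps)   # Eq. 2: row-wise normalization
    A = torch.softmax(1.0 - d_tilde / h, dim=1)               # Eq. 3: affinities A^c
    return -torch.log(A.max(dim=1).values.mean())             # Eq. 4

def contextual_loss(gen_by_class, banks_by_class, h=0.5):
    """Eq. 5: sum of the single-class losses over the classes found in the masks M_x."""
    return sum(contextual_loss_single_class(gen_by_class[c], banks_by_class[c], h)
               for c in gen_by_class)
```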
3.3. Unpaired image-to-image translation baseline

Our objective assumes the availability of a generated image G(x) which is, in our task, the representation of a painting in the photo-realistic domain. In our work, we adopt a cycle-consistent adversarial framework [56] between the domain of paintings from a specific artist X and the domain of realistic images Y. The data distributions are x ~ p_data(x) and y ~ p_data(y), while G : X → Y and F : Y → X are the mapping functions between the two domains. The two discriminators are denoted as D_Y and D_X. The full cycle-consistent adversarial loss [56] is the following:

    \mathcal{L}_{cca}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \mathcal{L}_{cyc}(G, F)    (7)

where the two adversarial losses are:

    \mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))]    (8)

    \mathcal{L}_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(F(y)))]    (9)

and the cycle consistency loss, which requires the original images x and y to be the same as the reconstructed ones, F(G(x)) and G(F(y)) respectively, is:

    \mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\| F(G(x)) - x \|] + \mathbb{E}_{y \sim p_{data}(y)}[\| G(F(y)) - y \|]    (10)
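For reference, Eqs. 7-10 are the standard cycle-consistent adversarial objective; a compact sketch is given below. It is ours for illustration only: it assumes G, F, D_X and D_Y are PyTorch modules whose discriminators output probabilities in (0, 1), uses the L1 norm for the cycle term, and omits the least-squares and image-buffer tricks that public Cycle-GAN implementations typically add.

```python
import torch

def cycle_consistent_adversarial_loss(G, F, D_X, D_Y, x, y, eps=1e-8):
    """Eq. 7 = Eq. 8 + Eq. 9 + Eq. 10 for a batch of paintings x and real photos y."""
    fake_y = G(x)   # painting -> photo-realistic image
    fake_x = F(y)   # photo -> painting-like image
    # Eqs. 8-9: adversarial terms for the two mapping directions.
    l_gan_g = torch.log(D_Y(y) + eps).mean() + torch.log(1.0 - D_Y(fake_y) + eps).mean()
    l_gan_f = torch.log(D_X(x) + eps).mean() + torch.log(1.0 - D_X(fake_x) + eps).mean()
    # Eq. 10: cycle consistency, F(G(x)) should reconstruct x and G(F(y)) should reconstruct y.
    l_cyc = (F(fake_y) - x).abs().mean() + (G(fake_x) - y).abs().mean()
    return l_gan_g + l_gan_f + l_cyc
```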
A learning rate of 0.0002 is kept constant for the first 100
3.4. Full objective epochs, making it linearly decay to zero over the next 200
Our full semantically-aware translation loss is given by epochs. An early stopping technique is used to reduce train-
the sum of the baseline objective, i.e. Eq. 7, and our patch- ing times. In particular, at each epoch the Fréchet Inception
level similarity loss, i.e. Eq. 6: Distance (FID) [12] is computed between our generated im-
ages and the set of real photos: if it does not decrease for
L(G, F, DX , DY , K, B) = Lcca (G, F, DX , DY ) 30 consecutive epochs, the training is stopped. We initialize
(11) the weights of the model from a Gaussian distribution with
+ λLCXM S (K, B)
0 mean and standard deviation 0.02.
where λ controls our multi-scale contextual loss weight
with respect to the baseline objective. Competitors. To compare our results with those from state
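Combining the pieces, Eq. 11 is a weighted sum of the baseline objective and the multi-scale contextual term; λ = 0.1 is the value reported in Section 4. The schematic function below simply composes the sketches given earlier (one set of class-grouped generated patches and one set of memory banks per scale), so it inherits all of their assumptions.

```python
def full_objective(G, F, D_X, D_Y, x, y,
                   gen_by_class_per_scale, banks_per_scale, lam=0.1):
    """Eq. 11: L = L_cca + lambda * L_CX_MS, with L_CX_MS the sum over scales (Eq. 6)."""
    l_cca = cycle_consistent_adversarial_loss(G, F, D_X, D_Y, x, y)
    l_cx_ms = sum(contextual_loss(gen_c, banks_c)            # Eq. 6: one term per scale s
                  for gen_c, banks_c in zip(gen_by_class_per_scale, banks_per_scale))
    return l_cca + lam * l_cx_ms
```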
of the art techniques, we train Cycle-GAN [56], UNIT [29]
4. Experimental results and DRIT [24] approaches on the previously described
datasets. The adopted code comes from the authors’ im-
Datasets. In order to evaluate our approach, different sets plementations and can be found in their GitHub reposito-
of images, both from artistic and realistic domains, are used. ries. The number of epochs and other training parameters
Our tests involve both sets of paintings from specific artists are those suggested by the authors, except for DRIT [24]:
and sets of artworks representing a given subject from dif- to enhance the quality of the results generated by this com-
ferent authors. We use paintings from Monet, Cezanne, Van petitor, after contacting the authors we employed spectral
Gogh, Ukiyo-e style and landscapes from different artists normalization and manually chose the best epoch through
along with real photos of landscapes, keeping an underly- visual inspection and by computing the FID [12] measure.
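The optimization recipe above (Adam at 2e-4, constant for 100 epochs and then linearly decayed, plus FID-based early stopping with a 30-epoch patience) can be expressed with standard PyTorch utilities. This is our own sketch rather than the released training script; fid_history is assumed to hold one FID value per completed epoch, and the Adam betas are left at the library defaults since the paper does not specify them.

```python
import torch

def make_optimizer_and_scheduler(params, total_epochs=300, constant_epochs=100):
    optimizer = torch.optim.Adam(params, lr=2e-4)
    decay_epochs = total_epochs - constant_epochs

    def lr_lambda(epoch):
        # Constant learning rate for the first 100 epochs, then linear decay to zero.
        if epoch < constant_epochs:
            return 1.0
        return max(0.0, 1.0 - (epoch - constant_epochs) / float(decay_epochs))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def should_stop_early(fid_history, patience=30):
    """Stop when the FID has not decreased for `patience` consecutive epochs."""
    if len(fid_history) <= patience:
        return False
    best_before_window = min(fid_history[:-patience])
    return min(fid_history[-patience:]) >= best_before_window
```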
Competitors. To compare our results with those from state of the art techniques, we train the Cycle-GAN [56], UNIT [29] and DRIT [24] approaches on the previously described datasets. The adopted code comes from the authors' implementations and can be found in their GitHub repositories. The number of epochs and the other training parameters are those suggested by the authors, except for DRIT [24]: to enhance the quality of the results generated by this competitor, after contacting the authors we employed spectral normalization and manually chose the best epoch through visual inspection and by computing the FID [12] measure. Moreover, since DRIT [24] is a diverse image-to-image translation framework, its performance depends on the choice of an attribute from the attribute space of the realistic domain. For fairness of comparison, we generate a single realistic image using a randomly sampled attribute. We also show quantitative results of applying the style transfer approach from Gatys et al. [7], with content images taken from the realistic datasets and style images randomly sampled from the paintings, for each set.
Method                     Monet   Cezanne   Van Gogh   Ukiyo-e   Landscapes   Portraits     Mean
Original paintings         69.14    169.43     159.82    177.52        59.07       72.95   117.99
Style-transferred reals    74.43    114.39     137.06    147.94        70.25       62.35   101.07
DRIT [24]                  68.32    109.36     108.92    117.07        59.84       44.33    84.64
UNIT [29]                  56.18     97.91      98.12     89.15        47.87       43.47    72.12
Cycle-GAN [56]             49.70     85.11      85.10     98.13        44.79       30.60    65.57
Art2Real                   44.71     68.00      78.60     80.48        35.03       34.03    56.81

Table 1: Evaluation in terms of Fréchet Inception Distance [12].

Figure 4: Distribution of ResNet-152 features extracted from landscape and portrait images. Each row shows the results of our method and competitors on a specific setting.

4.1. Visual quality evaluation

We evaluate the visual quality of our generated images using both automatic evaluation metrics and user studies.

Fréchet Inception Distance. To numerically assess the quality of our generated images, we employ the Fréchet Inception Distance [12]. It measures the difference between two Gaussians and is also known as the Wasserstein-2 distance [46]. The FID d between a Gaussian G_1 with mean and covariance (m_1, C_1) and a Gaussian G_2 with mean and covariance (m_2, C_2) is given by:

    d^2(G_1, G_2) = \| m_1 - m_2 \|_2^2 + \mathrm{Tr}\left( C_1 + C_2 - 2 (C_1 C_2)^{1/2} \right)    (12)

For our evaluation purposes, the two Gaussians are fitted on the Inception-v3 [42] activations of real and generated images, respectively. The lower the Fréchet Inception Distance between these Gaussians, the more the generated and real data distributions overlap, i.e. the realism of generated images increases when the FID decreases. Table 1 shows FID values for our model and a number of competitors. As can be observed, the proposed approach produces a lower FID in all settings, except for portraits, in which we rank second after Cycle-GAN. Results thus confirm the capabilities of our method in producing images which look realistic to pre-trained CNNs.
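Eq. 12 can be evaluated directly once the Inception-v3 activations of the real and generated sets are available; a plain NumPy/SciPy transcription is shown below. This is our illustrative version: collecting the activations, and the numerical safeguards that public FID implementations add before the matrix square root, are left out.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real, act_gen):
    """Eq. 12 on two sets of Inception-v3 activations, each of shape [N, d]."""
    m1, m2 = act_real.mean(axis=0), act_gen.mean(axis=0)
    c1 = np.cov(act_real, rowvar=False)
    c2 = np.cov(act_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(c1 @ c2, disp=False)   # (C1 C2)^(1/2)
    covmean = np.real(covmean)                       # drop tiny imaginary residues
    return float(np.sum((m1 - m2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))
```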
Human judgment. In order to evaluate the visual quality of our generated images, we conducted a user study on the Figure Eight crowd-sourcing platform. In particular, we assessed both the realism of our results and their coherence with the original painting. To this aim, we conducted two different evaluation processes, which are detailed as follows:
• In the Realism evaluation, we asked the user to select the most realistic image between the two shown, both obtained from the same painting, one from our method and the other from a competitor;
• In the Coherence evaluation, we presented the user with the original painting and two generated images which originate from it, asking them to select the one most faithful to the artwork. Again, the generated images come from our method and a competitor.

Figure 5: Qualitative results on portraits. Our method can preserve facial expressions and reduce the amount of artifacts with respect to Cycle-GAN [56], UNIT [29], and DRIT [24].

Each test involved our method and one competitor at a time, leading to six different tests, considering three competitors: Cycle-GAN [56], UNIT [29], and DRIT [24]. A set of 650 images was randomly sampled for each test, and each image pair was evaluated by three different users. To start the test, each user was asked to successfully evaluate eight example pairs where one of the two images was definitely better than the other. A total of 685 evaluators were involved in our tests. Results are presented in Table 2, showing that our generated images are always chosen more than 50% of the times.

            Cycle-GAN [56]   UNIT [29]   DRIT [24]
Realism              36.5%       27.9%       14.2%
Coherence            48.4%       25.5%        7.3%

Table 2: User study results. We report the percentage of times an image from a competitor was preferred against ours. Our method is always preferred more than 50% of the times.

4.2. Reducing the domain shift

We evaluate the capabilities of our model to reduce the domain shift between artistic and real data, by analyzing the performance of pre-trained convolutional models and visualizing the distributions of CNN features.

Entropy analysis. Pre-trained architectures show increased performance on images synthesized by our approach, in comparison with original paintings and with images generated by other approaches. We visualize this by computing the entropy of the output of state of the art architectures: the lower the entropy, the lower the uncertainty of the model about its result. We evaluate the entropy on classification, semantic segmentation, and detection tasks, adopting a ResNet-152 [11] trained on ImageNet [6], the model of Hu et al. [13], and a Faster R-CNN [38] trained on the Visual Genome [22, 1], respectively. Table 3 shows the average image entropy for classification, the average pixel entropy for segmentation, and the average bounding-box entropy for detection, computed on all the available artistic, realistic and generated images. Our approach is able to generate images which lower the entropy, on average, for each considered task with respect to paintings and to images generated by the competitors.

Method                    Classification   Segmentation   Detection
Real Photos                         3.99           0.63        2.03
Original paintings                  4.81           0.67        2.58
Style-transferred reals             5.39           0.70        2.89
DRIT [24]                           5.14           0.67        2.56
UNIT [29]                           4.88           0.69        2.54
Cycle-GAN [56]                      4.81           0.67        2.50
Art2Real                            4.50           0.66        2.42

Table 3: Mean entropy values for classification, segmentation, and detection of images generated through our method and through competitor methods.
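As a concrete example of how the values in Table 3 can be obtained, the classification entropy of an image is the Shannon entropy of the softmax output of the pre-trained network; the segmentation and detection entropies are averaged per pixel and per bounding box in the same way. A minimal sketch of ours follows; whether the paper measures entropy in nats or bits is not stated, so nats are used here.

```python
import torch

def mean_prediction_entropy(logits, eps=1e-12):
    """Average Shannon entropy (in nats) of softmax predictions.
    `logits` has shape [N, num_classes]: one row per image, pixel, or detection."""
    probs = torch.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + eps)).sum(dim=1)
    return entropy.mean().item()
```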

Figure 6: Qualitative results on landscape paintings. Results generated by our approach show increased realism and reduced blur when compared with those from Cycle-GAN [56], UNIT [29], and DRIT [24].

Feature distributions visualization. To further validate the domain shift reduction between real images and generated ones, we visualize the distributions of features extracted from a CNN. In particular, for each image, we extract a visual feature vector coming from the average pooling layer of a ResNet-152 [11], and we project it into a 2-dimensional space by using the t-SNE algorithm [32]. Fig. 4 shows the feature distributions on two different sets of paintings (i.e., landscapes and portraits), comparing our results with those of competitors. Each plot represents the distribution of visual features extracted from paintings belonging to a specific set, from the corresponding images generated by our model or by one of the competitors, and from the real photographs depicting landscapes or, in the case of portraits, human faces. As it can be seen, the distributions of our generated images are in general closer to the distributions of real images than to those of paintings, thus confirming the effectiveness of our model in the domain shift reduction.
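The plots of Fig. 4 can be reproduced with off-the-shelf components: average-pooled ResNet-152 features followed by a 2-D t-SNE projection. The sketch below, based on torchvision and scikit-learn, is our illustrative pipeline rather than the authors' plotting code; the preprocessing constants are the standard ImageNet ones and the perplexity is an arbitrary choice.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.manifold import TSNE

def extract_avgpool_features(pil_images):
    """Return an [N, 2048] tensor of ResNet-152 features taken right after the
    global average-pooling layer, as used for the plots in Fig. 4."""
    model = models.resnet152(pretrained=True).eval()
    backbone = torch.nn.Sequential(*list(model.children())[:-1])  # drop the final fc layer
    preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    with torch.no_grad():
        batch = torch.stack([preprocess(img) for img in pil_images])
        return backbone(batch).flatten(1)

def project_to_2d(features, perplexity=30):
    """Project the feature vectors to two dimensions with t-SNE [32]."""
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(features.numpy())
```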
4.3. Qualitative results

Besides showing numerical improvements with respect to state of the art approaches, we present some qualitative results coming from our method, compared to those from Cycle-GAN [56], UNIT [29], and DRIT [24]. We show examples of landscape and portrait translations in Fig. 5 and 6. Many other samples from all settings can be found in the Supplementary material. We observe increased realism in our generated images, due to more detailed elements and fewer blurred areas, especially in the landscape results. Portrait samples reveal that brush strokes disappear completely, leading to a photo-realistic visualization. Our results contain fewer artifacts and are more faithful to the paintings, more often preserving the original facial expression.

5. Conclusion

We have presented Art2Real, an approach to translate paintings to photo-realistic visualizations. Our research is motivated by the need of reducing the domain gap between artistic and real data, which prevents the application of recent techniques to art. The proposed approach generates realistic images by copying from sets of real images, in a semantically aware manner and through efficient memory banks. This is paired with an image-to-image translation architecture, which ultimately leads to the final result. Quantitative and qualitative evaluations, conducted on artworks of different artists and styles, have shown the effectiveness of our method in comparison with image-to-image translation algorithms. Finally, we also showed how generated images can enhance the performance of pre-trained architectures.
References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[2] Asha Anoosheh, Eirikur Agustsson, Radu Timofte, and Luc Van Gool. ComboGAN: Unrestrained scalability for image domain translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
[3] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. StyleBank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[4] Tian Qi Chen and Mark Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.
[5] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[7] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[8] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[9] Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[10] Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, and Jonathon Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. In Proceedings of the British Machine Vision Conference, 2017.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
[13] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to Segment Every Thing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[14] Xun Huang and Serge J Belongie. Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization. In Proceedings of the International Conference on Computer Vision, 2017.
[15] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal Unsupervised Image-to-Image Translation. In Proceedings of the European Conference on Computer Vision, 2018.
[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[17] Yongcheng Jing, Yang Liu, Yezhou Yang, Zunlei Feng, Yizhou Yu, Dacheng Tao, and Mingli Song. Stroke controllable fast style transfer with adaptive receptive fields. In Proceedings of the European Conference on Computer Vision, 2018.
[18] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, 2016.
[19] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
[20] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In Proceedings of the International Conference on Machine Learning, 2017.
[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[23] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[24] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Diverse Image-to-Image Translation via Disentangled Representations. In Proceedings of the European Conference on Computer Vision, 2018.
[25] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In Proceedings of the European Conference on Computer Vision, 2016.
[26] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Diversified texture synthesis with feed-forward networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[27] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, 2017.
[28] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual motion GAN for future-flow embedded video prediction. In Proceedings of the International Conference on Computer Vision, 2017.
[29] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, 2017.
[30] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, 2016.
[31] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision, 2015.
[32] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[33] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In Proceedings of the International Conference on Learning Representations, 2016.
[34] Roey Mechrez, Itamar Talmi, Firas Shama, and Lihi Zelnik-Manor. Learning to Maintain Natural Image Statistics. arXiv preprint arXiv:1803.04626, 2018.
[35] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the European Conference on Computer Vision, 2016.
[36] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, 2016.
[37] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, 2016.
[38] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
[39] Artsiom Sanakoyeu, Dmytro Kotovenko, Sabine Lang, and Björn Ommer. A Style-Aware Content Loss for Real-time HD Style Transfer. In Proceedings of the European Conference on Computer Vision, 2018.
[40] Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. Scribbler: Controlling deep image synthesis with sketch and color. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[41] Falong Shen, Shuicheng Yan, and Gang Zeng. Neural Style Transfer via Meta Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[42] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[43] Matteo Tomei, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. What was Monet seeing while painting? Translating artworks to photo-realistic images. In Proceedings of the European Conference on Computer Vision Workshops, 2018.
[44] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S Lempitsky. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. In Proceedings of the International Conference on Machine Learning, 2016.
[45] Dmitry Ulyanov, Andrea Vedaldi, and Victor S Lempitsky. Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[46] Leonid Nisonovich Vaserstein. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72, 1969.
[47] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In Proceedings of the International Conference on Learning Representations, 2017.
[48] Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial Hebert. The pose knows: Video forecasting by generating pose futures. In Proceedings of the International Conference on Computer Vision, 2017.
[49] Xin Wang, Geoffrey Oxholm, Da Zhang, and Yuan-Fang Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[50] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[51] Zhaoyi Yan, Xiaoming Li, Mu Li, Wangmeng Zuo, and Shiguang Shan. Shift-Net: Image inpainting via deep feature rearrangement. In Proceedings of the European Conference on Computer Vision, 2018.
[52] Xuewen Yang, Dongliang Xie, and Xin Wang. Crossing-Domain Generative Adversarial Networks for Unsupervised Multi-Domain Image-to-Image Translation. In ACM International Conference on Multimedia, 2018.
[53] Raymond A Yeh, Chen Chen, Teck-Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic Image Inpainting with Deep Generative Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[54] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[55] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the International Conference on Computer Vision, 2017.
[56] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision, 2017.
[57] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, 2017.
