Adversarial Open Domain Adaptation for Sketch-to-Photo Synthesis
Xiaoyu Xiang1 *, Ding Liu2 , Xiao Yang2 , Yiheng Zhu2 , Xiaohui Shen2 , Jan P. Allebach1
1 Purdue University, 2 ByteDance Inc.
{xiang43,allebach}@purdue.edu,
{liuding,yangxiao.0,yiheng.zhu,shenxiaohui}@bytedance.com
to learn from unpaired sketches and photos collected separately. Even so, the existing sketch datasets cannot cover all types of photos in the open domain [54]: the largest sketch dataset, Quick, Draw! [22], has 345 categories, while the full ImageNet [14] has as many as 21,841 class labels. Therefore, most categories lack even the corresponding freehand sketches needed to train a sketch-to-image translation model.

To resolve this challenging task, we propose an Adversarial Open Domain Adaptation (AODA) framework that for the first time learns to synthesize the absent freehand sketches and makes unsupervised open-domain adaptation possible, as illustrated in Figure 1. We propose to jointly learn a sketch-to-photo translation network and a photo-to-sketch translation network for mapping the open-domain photos into the sketch domain with GAN priors. With the bridge of photo-to-sketch generation, we can generalize the learned correspondence between in-domain freehand sketches and photos to open-domain categories. Still, there is an unignorable domain gap between synthesized sketches and real ones, which prevents the generator from generalizing the learned correspondence to real sketches and synthesizing realistic photos for open-domain classes.

To further mitigate its influence on the generator and improve the output quality of open-domain translation, we introduce a simple yet effective random-mixed sampling strategy that blindly treats a certain proportion of fake sketches as real ones for all categories. With the proposed framework and training strategy, our model is able to synthesize photo-realistic outputs even for sketches of unseen classes. We compare the proposed AODA to existing unpaired sketch-to-image generation approaches. Both qualitative and quantitative results show that our proposed method achieves significantly superior performance on both seen and unseen data.

The main contributions of this paper are three-fold: (1) We propose the adversarial open-domain adaptation (AODA) framework as the first attempt to solve the open-domain multi-class sketch-to-photo synthesis problem by learning to generate the missing freehand sketches. (2) We introduce an open-domain training strategy that treats certain fake sketches as real ones to reduce the generator's bias against synthesized sketches and leverage the generalization of adversarial domain adaptation, thus achieving more faithful generation for open-domain classes. (3) Our network provides, as a byproduct, a high-quality freehand sketch extractor for arbitrary photos. Extensive experiments and user studies on diverse datasets demonstrate that our model can faithfully synthesize realistic photos for different categories of open-domain freehand sketches. The source code and pre-trained models are available at https://github.com/Mukosame/AODA.

2. Related Work

Sketch-Based Image Synthesis. The goal of sketch-based image synthesis is to output a target image from a given sketch. Early works [8, 18, 9] regard freehand sketches as queries or constraints to retrieve image components and stitch them into a picture. In recent years, an increasing number of works adopt GAN-based models [21] to learn pixel-wise translation between sketches and photos directly. Due to the lack of real sketch data, [80, 39, 7] train their networks with pairs of photos and corresponding edge maps. However, freehand sketches are usually distorted in shape compared with the target photo. Even when depicting the same object, the sketches from different users vary in appearance due to differences in their drawing skills and levels of abstraction. To make the model applicable to freehand sketches, SketchyGAN [10] is trained with both sketches and augmented edge maps. ContextualGAN [48] turns the image generation problem into an image completion problem: the network learns the joint distribution of sketch and image pairs and acquires the result by iteratively traversing the manifold. iSketchNFill [20] uses simple outlines to represent freehand sketches and generates photos from partial strokes with two-stage generators. Gao et al. [19] apply two generators to synthesize the foreground and background respectively, and propose a novel GAN structure to encode the edge maps and corresponding photos into a shared latent space. The above works are supervised with paired data. Liu et al. [47] propose a two-stage model for unsupervised sketch-to-photo generation with reference images in a single class. Compared with these works, our problem setting is more challenging: we aim to learn multi-class generation without paired-data supervision from an incomplete and heavily unbalanced dataset.

Conditional Image Generation. Image generation can be controlled by class conditions [20, 19], reference images [48, 46, 47], or specific semantic features [30, 57, 81], etc. The pioneering work cGAN [50] combines the input noise with the condition for both the generator and the discriminator. To help the generator synthesize images based on the input label, AC-GAN [52] makes the discriminator also predict the class labels. SGAN [51] unifies the ideas of discriminator and classifier by including the fake images as a new class. In this paper, we adopt a photo classifier that is jointly trained with the generator and discriminator to supervise the sketch-to-photo generator's training.

3. Adversarial Open Domain Adaptation

First, we discuss the challenge of the open-domain generation problem and the limitations of previous methods in Section 3.1. Then we introduce our proposed solution, including our AODA framework and the proposed training strategy, in Section 3.2.
Figure 2: Results of photo synthesis from edge inputs (Edge → Output) and real sketch inputs (Real Sketch → Output), generated by a model trained with xDoG edges and photos from the SketchyCOCO dataset [19]. The left two columns show the xDoG inputs and their outputs, and the right two columns show the real freehand sketch inputs and the corresponding unsatisfactory outputs, which shows that a model simply trained with edges cannot rectify the distorted shapes of freehand sketches.

Figure 3: Results of photo synthesis from fake sketch inputs (Fake Sketch → Output) and real sketch inputs (Real Sketch → Output) on the Scribble [20] and QMUL-Sketch [76, 65, 47] datasets. The outputs are generated by a model trained with synthesized sketches, with the same setting as in [47], where the fake sketches are generated by a sketch extractor trained on the in-domain data. The left two columns show the fake sketch inputs and their outputs, and the right two columns show the real freehand sketch inputs and the corresponding unsatisfactory outputs. Comparing the outputs, we can see this training strategy makes the model fail to generalize to real sketches.

3.1. Challenge

Unlike previous sketch-to-photo synthesis works [10, 20] that can directly learn the mapping between the input sketch and its corresponding photo, in our setting the sketches of open-domain classes are missing during training. To enable the network to learn to synthesize photos from sketches of both in-domain and open-domain classes, there are two ways to solve this problem: (1) training with extracted edge maps, and (2) enriching the open-domain classes with sketches synthesized by a pre-trained photo-to-sketch extractor. We show the results of these two methods and discuss their limitations.

Edge Maps. Figure 2 shows the results of a model trained on edges extracted by xDoG [71]. While the model can generate vivid highlights, shadows, and fine details from the edge inputs, the images generated from actual freehand sketches are not photo-realistic, but look more like colored drawings. This is because edges and freehand sketches are very different: freehand sketches are human abstractions of an object, usually with more deformations. The connections between the target photos and the input sketches are looser than with edges. Due to this domain gap, sketch-to-photo generators trained on edge inputs usually cannot learn shape rectification, and thus fail to generalize to freehand sketches.

Synthesized Sketches. Another intuitive solution for open-domain generation is to enrich the training set of unseen classes M with sketches synthesized by a pre-trained photo-to-sketch generator [47]. Figure 3 shows the result from a model trained with pre-extracted sketches on the Scribble [20] and QMUL-Sketch [76, 65, 47] datasets, where the photo-to-sketch extractor is trained on the in-domain classes of the training set. From the left two columns in Figure 3, we can see that the model is able to generate photo-realistic outputs from synthesized sketches. However, it fails on real freehand sketches, as shown in the right two columns: even though it can generate the correct color and texture conditioned on the input label, it cannot understand the basic structure of real sketches (e.g., distinguish the object from the background). The reason is that, even though they are visually similar, the real and fake sketches are still distinguishable for the model. This strategy cannot guarantee that the model generalizes from the synthesized data to real freehand sketches, especially for multi-class generation. Thus, simply using synthesized sketches to replace the missing freehand sketches cannot ensure photo-realistic generation.

3.2. Our Method

To solve this problem, we propose to learn the photo-to-sketch and sketch-to-photo translations jointly and to narrow the domain gap between the synthesized and real sketches.

Figure 4: AODA framework overview. It has two generators, Gs: photo → sketch and Gp: sketch → photo conditioned on the input label, and two discriminators Ds and Dp for the sketch and photo domains, respectively. In addition, we use a photo classifier R to encourage Gp to generate photos indistinguishable from the real ones of the same class.

Framework. As shown in Figure 4, our framework mainly consists of the following parts: two generators, a photo-to-sketch generator Gs and a multi-class sketch-to-photo generator Gp that takes a sketch s and class label ηs as input; two discriminators Ds and Dp that encourage the generators to synthesize outputs indistinguishable from the sketch domain S and photo domain P, respectively; and a classifier R that predicts class labels for both real photos p and fake photos Gp(s, ηs) to ensure that the output is truly conditioned on the input label ηs. Our AODA framework is trained with unpaired sketch and photo data. Please see our supplementary materials for more details. All parts of our framework are trained jointly from scratch.

During the training process, Gs extracts the sketch Gs(p) from the given photo p. Then, the synthesized sketch Gs(p) and the real sketch s are sent to Gp along with their labels ηp and ηs, and turned into the reconstructed photo Gp(Gs(p), ηp) and the synthesized photo Gp(s, ηs), respectively. Note that we only send the sketch with its true label to ensure that Gp learns the correct shape rectification from the sketch domain to the image domain for each class. The reconstructed photo is supposed to look similar to the original photo, which is imposed by a pixel-wise consistency loss. We do not add such a consistency constraint on the sketch domain since we wish the synthesized sketches to be diverse. The generated photo is finally sent to the discriminator Dp to ensure that it is photo-realistic, and to the classifier R to ensure that it has the same perceptual features as the target class. In summary, the generator loss includes four parts: the adversarial loss of photo-to-sketch generation LGs, the adversarial loss of sketch-to-photo translation LGp, the pixel-wise consistency of photo reconstruction Lpix, and the classification loss for the synthesized photo Lη:

\label {eq:l_gan} \mathcal {L}_{GAN} = \lambda _s\mathcal {L}_{G_{s}}(G_{s}, D_s, p) + \lambda _p\mathcal {L}_{G_{p}}(G_{p},D_p,s,\eta _s) \\ + \lambda _{pix}\mathcal {L}_{pix}(G_{s},G_{p},p,\eta _p) + \lambda _{\eta }\mathcal {L}_{\eta }(R,G_{p},s,\eta _s). (1)

However, if we directly train the multi-class generator with the loss defined in Equation 1, the training objective for open-domain classes M degenerates to the following form due to the missing sketches s:

\label {eq:lgan_regress} \mathcal {L}_{GAN}^{\mathcal {M}} = \lambda _s\mathcal {L}_{G_{s}}(G_{s}, D_s, p) + \lambda _{pix}\mathcal {L}_{pix}(G_{s}, G_{p}, p, \eta _p), (2)

where ηp ∈ M. As a result, the sketch-to-photo generator Gp is solely supervised by the pixel-wise consistency. Since the commonly used L1 and L2 losses lead to the median and mean of pixels, respectively, this bias will make Gp generate blurry photos for the open-domain classes.

To solve this problem, we propose the random-mixed sampling strategy to minimize the domain gap between real and fake sketch inputs for the generator and to improve its output quality on the open-domain classes.

Random-mixed strategy. This strategy aims to "fool" the generator into treating fake sketches as real ones. Algorithm 1 describes the detailed steps of the random-mixed sampling and the modified optimization: Pool denotes the buffer that stores minibatches of sketch-label pairs. Querying the pool returns either the current minibatch or a previously stored one (and inserts the current minibatch into the pool) with a certain likelihood. U denotes uniform sampling in the given range, and t denotes a threshold that is set according to the ratio of open-domain classes to in-domain classes to match the photo data distribution.

Algorithm 1: Minibatch Random-Mixed Sampling and Updating
Input: training set D, where each minibatch contains a photo set p, a freehand sketch set s, the class label of the photo ηp, and the class label of the sketch ηs;
for number of training iterations do
    s_fake ← Gs(p);
    s_c ← s;
    η_c ← ηs;
    if t < u ∼ U(0, 1) then
        s_c, η_c ← pool.query(s_fake, ηp);
    end
    p_rec ← Gp(s_fake, ηp);
    p_fake ← Gp(s, ηs);
    Calculate L_GAN with (p, s_c, p_rec, η_c) and update Gs and Gp;
    Calculate L_Ds(s, s_fake) and L_Dp(p, p_fake), and update Ds and Dp;
    Calculate L_R(p, p_fake, ηp, ηs) and update the classifier.
end

One key operation of this strategy is to construct pseudo sketches for Gp by randomly mixing the synthesized sketches with real ones in a batch-wise manner. In this step, the pseudo sketches are treated as real ones by the generator. Thus, the open-domain classes' L^M_GAN becomes:

\mathcal {L}_{GAN}^{\mathcal {M}} = \lambda _s\mathcal {L}_{G_{s}}(G_{s}, D_s, p) + \lambda _p\mathcal {L}_{G_{p}}(G_{p},D_p,s_{fake},\eta _p) \\ + \lambda _{pix}\mathcal {L}_{pix}(G_{s},G_{p},p,\eta _p) + \lambda _{\eta }\mathcal {L}_{\eta }(R,G_{p},s_{fake},\eta _p), (3)

where ηp ∈ M. Another key of the strategy lies in the optimization: the sampling strategy is only applied to Gp. The classifier and discriminators are still updated with real/fake data to guarantee their discriminative power.

The random mixing operation is blind to in-domain and open-domain classes. As a result, the training sketches include both real and pseudo sketches from all categories. Including pseudo sketches from both the in-domain and open-domain classes further enforces the sketch-to-image generator to ignore the domain gap in the inputs and synthesize realistic photos from both real and fake sketches. Note that since Gs's parameters are continually updated during training, the pseudo sketches also change for each batch. Moreover, the pseudo sketch-label pairs are drawn from a history of generated sketches and their labels rather than the latest ones produced by Gs: we maintain a buffer that stores the 50 previously added minibatches of sketch-label pairs [64, 80].

Mixing real sketches with fake ones can be regarded as an online data augmentation technique for training Gp. Compared with augmentation using edges, Gs can learn the distortions of real freehand sketches by approaching the real data distribution [21, 34, 78], which enables Gp to learn shape rectification on the fly. Benefiting from the joint training mechanism, as training progresses, the sketches generated by Gs change epoch by epoch. The loose consistency constraint on sketch generation further increases the diversity of the sketch data in the open domain. Compared with using pre-extracted sketches, the open-domain buffer maintains a broad spectrum of sketches: from the very coarse ones generated in early epochs to the more human-like sketches in later epochs as Gs converges.

4. Experiments

4.1. Experiment Setup

Dataset. We train and evaluate the performance of sketch-to-photo synthesis methods on two datasets: Scribble [20] (10 classes) and SketchyCOCO [19] (14 object classes).

Scribble contains ten object classes with photos and simple outline sketches. Six of the ten classes have similar round outlines, which imposes more stringent requirements on the network: whether it can generate the correct structure and texture conditioned on the input class label. In the open-domain setting, we only have the sketches of four classes for training: pineapple, cookie, orange, and watermelon, which means that 60% of the classes are open-domain.

SketchyCOCO includes 14 object classes, where the sketches are collected from the Sketchy dataset [61], the TU-Berlin dataset [17], and the Quick, Draw! dataset [22]. The 14,081 photos of these object classes are segmented from the natural images of COCO-Stuff [5] under unconstrained conditions, making it more difficult for existing methods to map the freehand sketches to the photo domain. The two open-domain classes are sheep and giraffe.

Metrics. We quantitatively evaluate the generation results with three different metrics: 1) Fréchet Inception Distance (FID) [24], which measures the feature similarity between generated and real image sets. A low FID score means the generated images differ little from the real ones and thus have high fidelity; 2) Classification Accuracy (Acc) [2] of generated images with a pre-trained classifier, in the same manner as [20, 19]. Higher accuracy indicates better image realism; 3) User Preference Study (Human): we show participants a given sketch and the class label, and ask them to pick the photo with the best quality and realism from the generated results. We randomly sample 31 groups of images. For each evaluation, we shuffle the options and show them to 25 users. We collect 775 answers in total.
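The buffer query and the mixing decision of Algorithm 1 can be sketched in plain Python. The buffer size of 50 comes from the paper; the 50% replace-on-query probability follows the common image-pool heuristic of [64, 80] and is an assumption here, as are all function and variable names:

```python
import random

class SketchPool:
    """Buffer of previously generated (sketch, label) minibatches
    (size 50 in the paper). A hypothetical helper, not the paper's code."""
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.items = []

    def query(self, sketch, label):
        # While the buffer is filling, store the incoming pair and return it.
        if len(self.items) < self.capacity:
            self.items.append((sketch, label))
            return sketch, label
        # Otherwise, with probability 0.5 return a stored pair and swap
        # the incoming one into its slot; else pass the incoming pair through.
        if random.random() < 0.5:
            idx = random.randrange(len(self.items))
            old = self.items[idx]
            self.items[idx] = (sketch, label)
            return old
        return sketch, label

def sample_training_sketch(pool, s_real, eta_s, s_fake, eta_p, t, u=None):
    """Random-mixed sampling: when u ~ U(0,1) exceeds the threshold t,
    replace the real sketch batch with a pseudo batch drawn via the pool."""
    u = random.random() if u is None else u
    if t < u:   # 'if t < u ~ U(0, 1)' in Algorithm 1
        return pool.query(s_fake, eta_p)
    return s_real, eta_s
```

Passing `u` explicitly is only for illustration; in training it would be sampled each iteration, with `t` set from the open-domain/in-domain class ratio as described above.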
Method →                  CycleGAN [80]           conditional CycleGAN    EdgeGAN [19]            Ours
(full / in-domain / open-domain)
Scribble
  FID ↓                   279.5 / 252.7 / 355.9   213.6 / 210.9 / 253.6   259.7 / 256.3 / 298.5   209.5 / 204.6 / 252.8
  Acc (%) ↑               16.0 / 30.0 / 6.7       68.0 / 70.0 / 66.7      100.0 / 100.0 / 100.0   100.0 / 100.0 / 100.0
  Human (%) ↑             5.60 / 1.00 / 8.67      19.20 / 17.00 / 20.67   25.20 / 17.00 / 30.67   48.80 / 65.00 / 38.00
SketchyCOCO
  FID ↓                   201.7 / 218.7 / 237.2   124.3 / 138.7 / 171.6   169.7 / 177.8 / 221.0   114.8 / 128.4 / 139.2
  Acc (%) ↑               8.4 / 10.8 / 1.9        57.0 / 58.7 / 52.4      75.8 / 68.8 / 98.3      78.3 / 70.5 / 100.0
  Human (%) ↑             0.36 / 0.00 / 0.67      5.09 / 5.60 / 4.67      22.55 / 32.00 / 14.67   72.00 / 59.20 / 82.67
Table 1: Quantitative evaluation and user study on Scribble and SketchyCOCO datasets. We show the full testset results,
in-domain results, and open-domain results, respectively. Best results are shown in bold.
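For reference, the Fréchet distance underlying the FID numbers in Table 1 compares Gaussian fits to two feature sets. In the actual metric the features come from an Inception network [24]; that embedding step is omitted in this NumPy sketch, which only shows the distance itself (function name is ours):

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets
    (rows = samples). For FID, the rows would be Inception activations."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    diff = mu_a - mu_b
    # tr(sqrtm(cov_a @ cov_b)) equals the sum of square roots of the
    # (real, nonnegative) eigenvalues of cov_a @ cov_b
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * covmean_trace
```

With identical feature sets the distance is zero, which matches the intuition behind Table 1: lower FID means the generated set's feature statistics sit closer to the real set's.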
n_M        0      1      2      3      4      5      6
FID ↓      167.8  182.6  202.0  207.2  204.2  183.2  209.5
Acc (%) ↑  88.0   80.0   88.0   90.0   76.0   86.0   100.0
Figure 11: t-SNE visualization of photo embeddings (p_fake and p_real) from a model trained without any strategy and from a model trained with the random-mixed sampling strategy. Different colors refer to different categories. Our strategy makes the generator learn more separable embeddings for different categories, regardless of in-domain or open-domain data.
Adversarial Open Domain Adaptation for Sketch-to-Photo Synthesis
Supplementary Material
Xiaoyu Xiang1, Ding Liu2, Xiao Yang2, Yiheng Zhu2, Xiaohui Shen2, Jan P. Allebach1
1 Purdue University, 2 ByteDance Inc.
{xiang43,allebach}@purdue.edu, {liuding,yangxiao.0,yiheng.zhu,shenxiaohui}@bytedance.com
A. Experimental Details

In this section, we first illustrate the architectures of our framework, including the generators, discriminators, and classifier, in Section A.1. Then, we present the objective functions for training them in Section A.2. The training settings of each dataset and additional implementation details are described in Section A.3 and Section A.4.

A.1. Architecture

Note that our proposed solution is not limited to a certain network architecture. In this work, we select CycleGAN [80] as a baseline to illustrate the effectiveness of our proposed solution. Thus we only modify Gp into a multi-class generator and keep the remaining structures unchanged, as introduced below.

Photo-to-Sketch Generator Gs. We adopt the architecture of the photo-to-sketch generator from Johnson et al. [29]. It includes one convolution layer to map the RGB image to feature space, two downsampling layers, nine residual blocks, two upsampling layers, and one convolution layer that maps the features back to an RGB image. Instance normalization [67] is used in this network. This network is also adopted as the sketch extractor for the compared method in Section 3.1 of the main paper.

Multi-class Sketch-to-Photo Generator Gp. The overall structure of this network is similar to Gs: a feature-mapping convolution, two downsampling layers, a few residual blocks, two upsampling layers, and the RGB-mapping convolution. We make the following modifications to the residual blocks and upsampling layers for multi-class photo generation, as illustrated in Figure 13. To make the network capable of accepting class-label information, we change the normalization layers of the residual blocks into adaptive instance normalization (AdaIN) [25]. The sketch input serves as the content input for AdaIN, and the class label is the style input, ensuring that the network learns the correct textures and colors for each category. In addition, we use convolution and PixelShuffle layers [63], instead of the commonly used transposed convolution, to upsample the features. The sub-pixel convolution can alleviate checkerboard artifacts in the generated photos while reducing the number of parameters and computations [1].

Figure 13: The architecture of our multi-class sketch-to-photo generator. Layer by layer: input (B, 3, H, W) → 7×7 conv (3→64, stride 1), norm, ReLU → (B, 64, H, W) → 3×3 conv (64→128, stride 2), norm, ReLU → (B, 128, H/2, W/2) → 3×3 conv (128→256, stride 2), norm, ReLU → (B, 256, H/4, W/4) → AdaIN residual block ×9 (conv, AdaIN, ReLU, conv, AdaIN; the class label is the style input) with 3×3 conv (256→256, stride 1) → (B, 256, H/4, W/4) → 3×3 conv (256→512, stride 1), PixelShuffle(2), norm, ReLU → (B, 128, H/2, W/2) → 3×3 conv (128→256, stride 1), PixelShuffle(2), norm, ReLU → (B, 64, H, W) → 7×7 conv (64→3, stride 1), tanh → output (B, 3, H, W).

Discriminators. We use the PatchGAN [36, 34] classifier as the architecture for the two discriminators in our framework. It includes five convolutional layers and turns a 256 × 256 input image into an output tensor of size 30 × 30, where each value represents the prediction result for a 70 × 70 patch of the input image. The final prediction for the whole image is the average value over all patches.

Photo Classifier. We adopt the architecture of HRNet [70] for photo classification and change the output size of its last fully-connected (FC) layer according to the number of classes in our training data. This network takes a 256 × 256 image as input and outputs an n-dimensional vector as the prediction result. We choose HRNet because of its superior ability to maintain high-resolution representations throughout the whole process while fusing multi-resolution information at different stages of the network.
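The AdaIN residual block described above can be sketched as follows. For brevity, this NumPy version uses 1×1 convolutions (plain channel-mixing matrices) in place of the paper's 3×3 convolutions, and assumes the per-channel gamma/beta have already been predicted from the class-label embedding; all names are illustrative:

```python
import numpy as np

def adain(content, gamma, beta, eps=1e-5):
    """Adaptive instance normalization on a (C, H, W) feature map:
    per-channel whitening of the content, then an affine transform whose
    parameters (gamma, beta) come from the style (here: class-label) input."""
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return gamma[:, None, None] * normalized + beta[:, None, None]

def adain_residual_block(x, w1, w2, gamma, beta):
    """conv -> AdaIN -> ReLU -> conv -> AdaIN -> skip connection, with
    (C, C) channel-mixing matrices w1, w2 standing in for 3x3 convs."""
    h = np.einsum('oc,chw->ohw', w1, x)
    h = np.maximum(adain(h, gamma, beta), 0.0)   # ReLU
    h = np.einsum('oc,chw->ohw', w2, h)
    h = adain(h, gamma, beta)
    return x + h
```

After `adain`, each channel has mean ≈ beta and standard deviation ≈ gamma, which is how the class label imposes per-category color and texture statistics on the sketch-derived content features.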
A.2. Objective Function

The loss for training the generator is composed of four parts: the adversarial loss of photo-to-sketch generation LGs, the adversarial loss of sketch-to-photo translation LGp, the pixel-wise consistency of photo reconstruction Lpix, and the classification loss for the synthesized photo Lη:

\label {eq:l_gan} \mathcal {L}_{GAN} = \lambda _s\mathcal {L}_{G_s}(G_s, D_s, p) + \lambda _p\mathcal {L}_{G_p}(G_p,D_p,s,\eta _s) \\ + \lambda _{pix}\mathcal {L}_{pix}(G_s,G_p,p,\eta _p) + \lambda _{\eta }\mathcal {L}_{\eta }(R,G_p,s,\eta _s), (4)

where

\mathcal {L}_{G_s}(G_s, D_s,p) = -\mathbb {E}_{p\sim P_{data}(p)}[\log D_s(G_s(p))], (5)

\mathcal {L}_{G_p}(G_p, D_p,s,\eta _s) = -\mathbb {E}_{s\sim P_{data}(s)}[\log D_p(G_p(s, \eta _s))], (6)

\mathcal {L}_{pix}(G_s, G_p, p, \eta _p) = \mathbb {E}_{p\sim P_{data}(p)}[||G_p(G_s(p),\eta _p)-p||_1], (7)

\mathcal {L}_{\eta }(R,G_p,s,\eta _s) = \mathbb {E}[\log P(R(G_p(s,\eta _s))=\eta _s |G_p(s,\eta _s))]. (8)

Note that only the classification loss of the generated photo Gp(s, ηs) is used to optimize the generators.

Then we update the discriminators Ds and Dp with the following loss functions, respectively:

\mathcal {L}_{D_s}(G_s, D_s,p,s) = -\mathbb {E}_{s\sim P_{data}(s)}[\log D_s(s)] \\ +\mathbb {E}_{p\sim P_{data}(p)}[\log D_s(G_s(p))], (9)

\mathcal {L}_{D_p}(G_p, D_p, s, p,\eta _s) = -\mathbb {E}_{p\sim P_{data}(p)}[\log D_p(p)] \\ +\mathbb {E}_{s\sim P_{data}(s)}[\log D_p(G_p(s, \eta _s))]. (10)

Then we calculate the classification loss of both real and synthesized photos and optimize the classifier:

\mathcal {L}_R(R,G_p,s,p,\eta _s,\eta _p) = \mathbb {E}[\log P(R(p)=\eta _p|p)] \\ + \mathbb {E}[\log P(R(G_p (s, \eta _s))=\eta _s|G_p(s,\eta _s))]. (11)

Real images and their labels enable the classifier to learn the decision boundary for each class, and the synthesized images force the classifier to treat the fake images as real ones and provide discriminative outputs regardless of their domain gap. For this reason, the classifier needs to be trained jointly with the other parts of our framework. We adopt the binary cross-entropy loss for the discriminators and the focal loss [42] for classification. The pixel-wise loss for photo reconstruction is measured by the L1 distance.

A.3. Datasets

We train our model on three datasets: Scribble [20] (10 classes), QMUL-Sketch [76, 65, 47] (3 classes), and SketchyCOCO [19] (14 object classes). During the training stage, the sketches of certain classes are completely removed to meet the open-domain setting.

Scribble. This dataset contains ten classes of objects, including white-background photos and simple outline sketches. Six of the ten object classes have similar round outlines, which imposes more stringent requirements on the network: whether it can generate the correct structure and texture conditioned on the input class label. In our open-domain setting, we only have the sketches of four classes for training: pineapple (151 images), cookie (147 images), orange (146 images), and watermelon (146 images). We set the input image size to 256 × 256 and train all the compared networks for 200 epochs. We apply the Adam [33] optimizer with batch size 1; the learning rate is set to 2e-4 for the first 100 epochs and decreases linearly to zero over the second 100 epochs.

QMUL-Sketch. We construct it by combining three datasets: handbags [65] with 400 photos and sketches, ShoeV2 [76] with 2000 photos and 6648 sketches, and ChairV2 [47] with 400 photos and 1297 sketches. For the open-domain training setting, we completely remove the sketches of ChairV2. We train the networks for 400 epochs.

SketchyCOCO. This dataset includes 14 object classes, where the sketches are collected from the Sketchy dataset [61], the TU-Berlin dataset [17], and the Quick, Draw! dataset [22]. The 14,081 photos of these object classes are segmented from the natural images of COCO-Stuff [5] under unconstrained conditions, thereby making it more difficult for existing methods to map the freehand sketches to the photo domain. In our open-domain setting, we remove the sketches of two classes during training: sheep and giraffe. We use the EdgeGAN weights released by the authors. All the other networks are trained for 100 epochs.

A.4. Implementation Details

Our model is implemented in PyTorch [58, 59]. We train our networks with the standard Adam optimizer [33] using one NVIDIA V100 GPU. The batch size and initial learning rate are set to 1 and 2e-4 for all datasets. The epoch numbers are 200, 400, and 100 for Scribble [20], QMUL-Sketch [76, 65, 47], and SketchyCOCO [19], respectively. The learning rates are halved over the second half of the epochs. For the compared method EdgeGAN [19], we use the official implementation at https://github.com/sysu-imsl/EdgeGAN for data preprocessing and training. It is trained for 100 epochs on the Scribble and QMUL-Sketch datasets using one NVIDIA RTX 2080 GPU. The batch size is set to 1 due to the memory limitation.
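The focal loss used for the classifier can be sketched for a single sample as below. The focusing parameter gamma = 2 is the common default from [42], not a value stated in this paper, and the function name is ours:

```python
import numpy as np

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample given softmax probabilities `probs` and
    the integer class `target`: cross-entropy down-weighted by
    (1 - p_t)^gamma, so easy, well-classified samples contribute less."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * np.log(p_t)
```

With gamma = 0 this reduces to plain cross-entropy; with gamma > 0 the loss on confident correct predictions shrinks, which keeps the jointly trained classifier focused on the harder (often open-domain) samples.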
Metric        Method                  full    in-domain   open-domain
FID ↓         CycleGAN                97.9    87.7        151.7
              conditional CycleGAN    91.6    88.2        107.5
              EdgeGAN                 243.0   281.3       268.3
              Ours                    92.4    76.9        142.6
Acc (%) ↑     CycleGAN                72.6    64.7        78.2
              conditional CycleGAN    78.4    58.0        92.6
              EdgeGAN                 62.7    100.0       36.8
              Ours                    89.7    91.5        88.5
Human (%) ↑   CycleGAN                4.00    4.57        2.67
              conditional CycleGAN    21.20   18.86       26.67
              EdgeGAN                 0.00    0.00        0.00
              Ours                    74.80   76.57       70.67
B. Experimental Results
We first show more sketch-to-photo results on the QMUL-Sketch dataset in Section B.1 and briefly discuss these results. Finally, we show more sketch-to-photo synthesis results of our method in Section B.2.
Figure 15: More 256 × 256 results on the SketchyCOCO dataset.
Figure 16: More 256 × 256 results on the QMUL-Sketch dataset.

Figure 17: More 256 × 256 results on the Scribble dataset.

Figure 18: In-domain 256 × 256 results on the Scribble dataset.

[3] Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, and Yi-Zhe Song. Vectorization and rasterization: Self-supervised learning for sketch and handwriting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5672–5681, June 2021. 1
[4] Ayan Kumar Bhunia, Yongxin Yang, Timothy M Hospedales, Tao Xiang, and Yi-Zhe Song. Sketch less for more: On-the-fly fine-grained sketch-based image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9779–9788, 2020. 1
[5] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1209–1218, 2018. 1, 5, 11
[6] Jiaxin Chen and Yi Fang. Deep cross-modality adaptation via semantics preserving adversarial learning for sketch-based 3d shape retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 605–620, 2018. 1
[7] Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu. DeepFaceDrawing: Deep generation of face images from sketches. ACM Transactions on Graphics (TOG), 39(4):72–1, 2020. 1, 2
[8] Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. Sketch2Photo: Internet image montage. ACM Transactions on Graphics (TOG), 28(5):1–10, 2009. 2
[9] Tao Chen, Ping Tan, Li-Qian Ma, Ming-Ming Cheng, Ariel Shamir, and Shi-Min Hu. PoseShop: Human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics, 19(5):824–837, 2012. 2
[10] Wengling Chen and James Hays. SketchyGAN: Towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9416–9425, 2018. 1, 2, 3
[11] John Collomosse, Tu Bui, and Hailin Jin. LiveSketch: Query perturbations for guided sketch-based visual search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2887, 2019. 1
[12] Guoxian Dai, Jin Xie, Fan Zhu, and Yi Fang. Deep correlated metric learning for sketch-based 3d shape retrieval. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. 1
[13] Tali Dekel, Chuang Gan, Dilip Krishnan, Ce Liu, and William T Freeman. Sparse, smart contours to represent and edit images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3511–3520, 2018. 1
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image
… face and caricature modeling. ACM Transactions on Graphics (TOG), 36(4):1–12, 2017. 1
[24] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017. 5
[25] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017. 10
[26] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018. 1
[27] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adver-
database. In 2009 IEEE conference on computer vision and sarial networks. In Proceedings of the IEEE Conference
pattern recognition, pages 248–255. IEEE, 2009. 2 on Computer Vision and Pattern Recognition, pages 1125–
1134, 2017. 1
[15] Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, and Yi-
Zhe Song. Doodle to search: Practical zero-shot sketch- [28] Youngjoo Jo and Jongyoul Park. Sc-fegan: Face editing gen-
based image retrieval. In Proceedings of the IEEE Con- erative adversarial network with user’s sketch and color. In
ference on Computer Vision and Pattern Recognition, pages Proceedings of the IEEE International Conference on Com-
2179–2188, 2019. 1 puter Vision, pages 1745–1753, 2019. 1
[29] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual
[16] Anjan Dutta and Zeynep Akata. Semantically tied paired
losses for real-time style transfer and super-resolution. In
cycle consistency for zero-shot sketch-based image retrieval.
Proceedings of the European Conference on Computer Vi-
In Proceedings of the IEEE Conference on Computer Vision
sion (ECCV), pages 694–711, 2016. 10
and Pattern Recognition, pages 5089–5098, 2019. 1
[30] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image gener-
[17] Mathias Eitz, James Hays, and Marc Alexa. How do humans
ation from scene graphs. In Proceedings of the IEEE Con-
sketch objects? ACM Transactions on Graphics (TOG),
ference on Computer Vision and Pattern Recognition, pages
31(4):1–10, 2012. 1, 5, 11
1219–1228, 2018. 2
[18] Mathias Eitz, Ronald Richter, Kristian Hildebrand, Tamy [31] Moritz Kampelmuhler and Axel Pinz. Synthesizing human-
Boubekeur, and Marc Alexa. Photosketcher: interactive like sketches from natural images using a conditional convo-
sketch-based image synthesis. IEEE Computer Graphics and lutional decoder. In The IEEE Winter Conference on Appli-
Applications, 31(6):56–66, 2011. 2 cations of Computer Vision, pages 3203–3211, 2020. 7
[19] Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang [32] Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwanghee
Liu, and Changqing Zou. Sketchycoco: Image gener- Lee. U-gat-it: unsupervised generative attentional networks
ation from freehand scene sketches. In Proceedings of with adaptive layer-instance normalization for image-to-
the IEEE/CVF Conference on Computer Vision and Pattern image translation. arXiv preprint arXiv:1907.10830, 2019.
Recognition, pages 5174–5183, 2020. 1, 2, 3, 5, 6, 7, 8, 11, 1
12 [33] Diederik P Kingma and Jimmy Ba. Adam: A method for
[20] Arnab Ghosh, Richard Zhang, Puneet K Dokania, Oliver stochastic optimization. arXiv preprint arXiv:1412.6980,
Wang, Alexei A Efros, Philip HS Torr, and Eli Shechtman. 2014. 11
Interactive sketch & fill: Multiclass sketch-to-image transla- [34] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero,
tion. In Proceedings of the IEEE International Conference Andrew Cunningham, Alejandro Acosta, Andrew Aitken,
on Computer Vision, pages 1171–1180, 2019. 1, 2, 3, 5, 6, 8, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-
9, 11, 12 realistic single image super-resolution using a generative ad-
[21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing versarial network. In Proceedings of the IEEE Conference
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and on Computer Vision and Pattern Recognition, pages 4681–
Yoshua Bengio. Generative adversarial nets. In Advances in 4690, 2017. 5, 10
Neural Information Processing Systems, pages 2672–2680, [35] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh
2014. 1, 2, 5 Singh, and Ming-Hsuan Yang. Diverse image-to-image
[22] David Ha and Douglas Eck. A neural representation of translation via disentangled representations. In Proceed-
sketch drawings. arXiv preprint arXiv:1704.03477, 2017. ings of the European conference on computer vision (ECCV),
1, 2, 5, 11 pages 35–51, 2018. 1
[23] Xiaoguang Han, Chang Gao, and Yizhou Yu. Deeps- [36] Chuan Li and Michael Wand. Precomputed real-time texture
ketch2face: a deep learning based sketching system for 3d synthesis with markovian generative adversarial networks. In
17
Proceedings of the European Conference on Computer Vi- [50] Mehdi Mirza and Simon Osindero. Conditional generative
sion (ECCV), pages 702–716, 2016. 10 adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 2
[37] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M [51] Augustus Odena. Semi-supervised learning with genera-
Hospedales. Deeper, broader and artier domain generaliza- tive adversarial networks. arXiv preprint arXiv:1606.01583,
tion. In Proceedings of the IEEE International Conference 2016. 2
on Computer Vision, pages 5542–5550, 2017. 1 [52] Augustus Odena, Christopher Olah, and Jonathon Shlens.
[38] Mengtian Li, Zhe Lin, Radomir Mech, Ersin Yumer, and Conditional image synthesis with auxiliary classifier gans.
Deva Ramanan. Photo-sketching: Inferring contour draw- In International Conference on Machine Learning, pages
ings from images. In 2019 IEEE Winter Conference on Ap- 2642–2651, 2017. 2
plications of Computer Vision (WACV), pages 1403–1412. [53] Kyle Olszewski, Duygu Ceylan, Jun Xing, Jose Echevarria,
IEEE, 2019. 1 Zhili Chen, Weikai Chen, and Hao Li. Intuitive, interactive
[39] Yuhang Li, Xuejin Chen, Feng Wu, and Zheng-Jun Zha. beard and hair synthesis with generative models. In Proceed-
Linestofacephoto: Face photo generation from lines with ings of the IEEE/CVF Conference on Computer Vision and
conditional self-attention generative adversarial networks. In Pattern Recognition, pages 7446–7456, 2020. 1
Proceedings of the 27th ACM International Conference on [54] Pau Panareda Busto and Juergen Gall. Open set domain
Multimedia, pages 2323–2331, 2019. 1, 2 adaptation. In Proceedings of the IEEE International Con-
[40] Yuhang Li, Xuejin Chen, Binxin Yang, Zihan Chen, Zhihua ference on Computer Vision, pages 754–763, 2017. 2
Cheng, and Zheng-Jun Zha. Deepfacepencil: Creating face [55] Kaiyue Pang, Da Li, Jifei Song, Yi-Zhe Song, Tao Xi-
images from freehand sketches. In Proceedings of the 28th ang, and Timothy M Hospedales. Deep factorised inverse-
ACM International Conference on Multimedia, pages 991– sketching. In Proceedings of the European Conference on
999, 2020. 1 Computer Vision (ECCV), pages 36–52, 2018. 7
[41] Yi Li, Timothy M. Hospedales, Yi-Zhe Song, and Shaogang [56] Kaiyue Pang, Ke Li, Yongxin Yang, Honggang Zhang, Tim-
Gong. Fine-grained sketch-based image retrieval by match- othy M Hospedales, Tao Xiang, and Yi-Zhe Song. General-
ing deformable part models. In In British Machine Vision ising fine-grained sketch-based image retrieval. In Proceed-
Conference (BMVC), 2014. 1 ings of the IEEE Conference on Computer Vision and Pattern
[42] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Recognition, pages 677–686, 2019. 1
Piotr Dollár. Focal loss for dense object detection. In Pro- [57] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan
ceedings of the IEEE International Conference on Computer Zhu. Semantic image synthesis with spatially-adaptive nor-
Vision, pages 2980–2988, 2017. 11 malization. In Proceedings of the IEEE Conference on Com-
[43] Fang Liu, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu- puter Vision and Pattern Recognition, pages 2337–2346,
Kun Lai, Cuixia Ma, Yong-Jin Liu, and Hongan Wang. 2019. 2
Scenesketcher: Fine-grained image retrieval with scene [58] Adam Paszke, Sam Gross, Soumith Chintala, Gregory
sketches. 2020. 1 Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Al-
[44] Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, and Ling ban Desmaison, Luca Antiga, and Adam Lerer. Automatic
Shao. Deep sketch hashing: Fast free-hand sketch-based differentiation in pytorch. 2017. 11
image retrieval. In Proceedings of the IEEE Conference [59] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer,
on Computer Vision and Pattern Recognition, pages 2862– James Bradbury, Gregory Chanan, Trevor Killeen, Zeming
2871, 2017. 1 Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An
[45] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised imperative style, high-performance deep learning library. In
image-to-image translation networks. In Advances in Neural Advances in Neural Information Processing Systems, pages
Information Processing Systems, pages 700–708, 2017. 1 8026–8037, 2019. 11
[46] Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo [60] Tiziano Portenier, Qiyang Hu, Attila Szabo, Siavash Ar-
Aila, Jaakko Lehtinen, and Jan Kautz. Few-shot unsuper- jomand Bigdeli, Paolo Favaro, and Matthias Zwicker.
vised image-to-image translation. In Proceedings of the Faceshop: Deep sketch-based face image editing. arXiv
IEEE International Conference on Computer Vision, pages preprint arXiv:1804.08972, 2018. 1
10551–10560, 2019. 2 [61] Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James
[47] Runtao Liu, Qian Yu, and Stella Yu. Unsupervised sketch- Hays. The sketchy database: learning to retrieve badly drawn
to-photo synthesis. arXiv preprint arXiv:1909.08313, 2019. bunnies. ACM Transactions on Graphics (TOG), 35(4):1–12,
1, 2, 3, 11, 12 2016. 1, 5, 11
[48] Yongyi Lu, Shangzhe Wu, Yu-Wing Tai, and Chi-Keung [62] Yuefan Shen, Changgeng Zhang, Hongbo Fu, Kun Zhou, and
Tang. Image generation from sketch constraint using con- Youyi Zheng. Deepsketchhair: Deep sketch-based 3d hair
textual gan. In Proceedings of the European Conference on modeling. arXiv preprint arXiv:1908.07198, 2019. 1
Computer Vision (ECCV), pages 205–220, 2018. 1, 2 [63] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz,
[49] Zhaoliang Lun, Matheus Gadelha, Evangelos Kalogerakis, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan
Subhransu Maji, and Rui Wang. 3d shape reconstruction Wang. Real-time single image and video super-resolution
from sketches via multi-view convolutional networks. In using an efficient sub-pixel convolutional neural network. In
2017 International Conference on 3D Vision (3DV), pages Proceedings of the IEEE Conference on Computer Vision
67–77. IEEE, 2017. 1 and Pattern Recognition, pages 1874–1883, 2016. 10
18
[64] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua domain-migration hashing for sketch-to-image retrieval. In
Susskind, Wenda Wang, and Russell Webb. Learning Proceedings of the European Conference on Computer Vi-
from simulated and unsupervised images through adversarial sion (ECCV), pages 297–314, 2018. 1
training. In Proceedings of the IEEE Conference on Com- [78] Ruofan Zhou and Sabine Susstrunk. Kernel modeling super-
puter Vision and Pattern Recognition, pages 2107–2116, resolution on real low-resolution images. In Proceedings
2017. 5 of the IEEE International Conference on Computer Vision,
[65] Jifei Song, Qian Yu, Yi-Zhe Song, Tao Xiang, and Timo- pages 2433–2443, 2019. 5
thy M Hospedales. Deep spatial-semantic attention for fine- [79] Fan Zhu, Jin Xie, and Yi Fang. Learning cross-domain neural
grained sketch-based image retrieval. In Proceedings of the networks for sketch-based 3d shape retrieval. In Thirtieth
IEEE International Conference on Computer Vision, pages AAAI Conference on Artificial Intelligence, 2016. 1
5551–5560, 2017. 1, 3, 11 [80] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
[66] Song Tao and Jia Wang. Alleviation of gradient exploding in Efros. Unpaired image-to-image translation using cycle-
gans: Fake can be real. In Proceedings of the IEEE/CVF consistent adversarial networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, International Conference on Computer Vision, pages 2223–
pages 1191–1200, 2020. 6 2232, 2017. 1, 2, 5, 6, 7, 10, 12
[67] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. In- [81] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka.
stance normalization: The missing ingredient for fast styliza- Sean: Image synthesis with semantic region-adaptive nor-
tion. arXiv preprint arXiv:1607.08022, 2016. 10 malization. In Proceedings of the IEEE/CVF Conference
[68] Laurens van der Maaten and Geoffrey Hinton. Visualizing on Computer Vision and Pattern Recognition, pages 5104–
data using t-SNE. Journal of Machine Learning Research, 5113, 2020. 2
9(86):2579–2605, 2008. 8 [82] Changqing Zou, Qian Yu, Ruofei Du, Haoran Mo, Yi-Zhe
[69] Fang Wang, Le Kang, and Yi Li. Sketch-based 3d shape Song, Tao Xiang, Chengying Gao, Baoquan Chen, and Hao
retrieval using convolutional neural networks. In Proceed- Zhang. Sketchyscene: Richly-annotated scene sketches. In
ings of the IEEE Conference on Computer Vision and Pattern Proceedings of the European Conference on Computer Vi-
Recognition, pages 1875–1883, 2015. 1 sion (ECCV), pages 421–436, 2018. 1
[70] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang,
Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui
Tan, Xinggang Wang, et al. Deep high-resolution represen-
tation learning for visual recognition. IEEE Transactions on
Pattern Analysis and Machine intelligence, 2020. 10
[71] Holger Winnemöller, Jan Eric Kyprianidis, and Sven C
Olsen. Xdog: an extended difference-of-gaussians com-
pendium including advanced image stylization. Computers
& Graphics, 36(6):740–753, 2012. 3
[72] Jin Xie, Guoxian Dai, Fan Zhu, and Yi Fang. Learning
barycentric representations of 3d shapes for sketch-based
3d shape retrieval. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 5068–
5076, 2017. 1
[73] Shuai Yang, Zhangyang Wang, Jiaying Liu, and Zongming
Guo. Deep plastic surgery: Robust and controllable im-
age editing with human-drawn sketches. arXiv preprint
arXiv:2001.02890, 2020. 1
[74] Sasi Kiran Yelamarthi, Shiva Krishna Reddy, Ashish Mishra,
and Anurag Mittal. A zero-shot framework for sketch based
image retrieval. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 316–333, 2018. 1
[75] Ran Yi, Yong-Jin Liu, Yu-Kun Lai, and Paul L Rosin. Ap-
drawinggan: Generating artistic portrait drawings from face
photos with hierarchical gans. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 10743–10752, 2019. 7
[76] Qian Yu, Feng Liu, Yi-Zhe SonG, Tao Xiang, Timothy
Hospedales, and Chen Change Loy. Sketch me that shoe.
In Computer Vision and Pattern Recognition, 2016. 1, 3, 11
[77] Jingyi Zhang, Fumin Shen, Li Liu, Fan Zhu, Mengyang Yu,
Ling Shao, Heng Tao Shen, and Luc Van Gool. Generative
19