Professional Documents
Culture Documents
DSTN
DSTN
Kibeom Hong 1 Seogkyu Jeon1 Huan Yang3 Jianlong Fu3 Hyeran Byun1,2*
1
Department of Computer Science, Yonsei University
2
Graduate school of AI, Yonsei University
3
Microsoft Research
{cha2068,jone9312,hrbyun}@yonsei.ac.kr {huayan,jianf}@microsoft.com
Photo-realistic
Artistic
Content Reference (a) Ours (b) AdaIN [10] (c) WCT [15] (d) Avatar-Net [24] (e) SANet [21] (f) AAMS [30] (g) PhotoWCT [16] (h) WCT 2 [31]
Figure 1. Stylization results. Given input pairs (a fixed content and four references from photo and art domains), we compare our method
(a) with both artistic transfer methods (b-f) and photo-realistic methods (g-h). While previous works fail to generate plausible results with
references coming from the opposite domains, our model successfully conducts style transfer in both domains. (Best viewed in color.)
s ■■■ ■■■
s ■■■ ■■■
s ■■■ ■■■
c c c
model [25]. After that, thanks to a series of previous stud- we introduce the domain-aware skip connection to balance
ies, we can meet several stylized pictures of various painting between content preservation and texture stylization for the
styles of famous painters (e.g., Van Gogh). domain-aware universal style transfer. Unlike the conven-
Existing universal style transfer methods show the abil- tional skip connection which conveys intact structural de-
ity to deal with arbitrary reference images on either artis- tails, the proposed skip connection block adjusts the trans-
tic or photo-realistic domain. On one hand, WCT [15] and mission clarity of the high-frequency component from the
AdaIN [10] transform the features of content images to stylized feature maps according to the domain properties.
match second-order statistics of reference features. Fol- To obtain the domain property (i.e., domainness) from a
lowed by these approaches, several artistic transfer stud- given reference image, we design the domainness indica-
ies [24, 33, 3, 28, 9] endeavor to transfer the global style tor. Our novel indicator analyzes the characteristics of the
pattern from reference images. Meanwhile, photo-realistic domain by utilizing both the texture and structural feature
style transfer studies [19, 16, 31, 1] focus on preserving maps extracted from different levels of our encoder. In or-
original structures while transferring target styles. der to predict a continuous domain factor with the range
However, we observe the limitation that the meaning of of [0, 1], we propose to augment the intermediate space be-
“arbitrary” is restricted in a specific domain (i.e., either tween art and photo with mixed samples.
artistic or photo-realistic), and it comes from fundamental With the proposed domain-aware architecture, DSTNs
structural modifications for predefined target domains, as deliver semantic and structural information enabling artis-
shown in Figure 2. In specific, artistic style transfer models tic and photo-realistic style transfer, respectively. Conse-
have difficulty in maintaining clear details in the decoder quently, DSTNs generate impeccable stylized results for ar-
because there is no clue directly coming from the content bitrary style, regardless of the target domain (Figure 1 (a)).
image. As a result, structural distortions of the content im- In experiments, we train our decoder and the domain-
age occur when the artistic style transfer methods confront ness indicator with Microsoft COCO [17] and WikiArt [23]
the photo-realistic reference images (Figure 1 (b)-(f)). On datasets. Qualitatively, we show that DSTNs are capable of
the other hand, photo-realistic style transfer models heavily generating plausible stylized results on both domains. Be-
constrain the transformation of input references with skip sides, through quantitative proxy metrics and user study, we
connections. Thus, they lack the ability to express delicate demonstrate that our method outperforms previous methods
patterns (e.g., stroke pattern) of artistic references (Figure on both photo-realistic and artistic style transfer.
1 (g)-(h)). In summary, existing arbitrary style transfer mod- Our contributions are three-fold: 1) We propose the novel
els generate undesired outputs when they receive samples of end-to-end unified architecture, domain-aware style trans-
the other domain as references. fer networks (DSTN), for multi-domain style transfer. 2)
To overcome this limitation, we focus on capturing the With the proposed domainness indicator and domain-aware
domain characteristics from a given reference image and skip connections, we capture the domain characteristics and
adjust the degree of the stylization and the structural preser- adaptively balance between the content preservation and
vation adaptively. For this purpose, we propose Domain- the texture transformation. 3) DSTNs achieve the admirable
aware Style Transfer Networks (DSTN) which are a uni- performance given references of both photo-realistic and
fied architecture composed of an auto-encoder with domain- artistic domains, and outperform previous methods in terms
aware skip connections and the domainness indicator. First, of preservation and stylization.
Conv 1_1
Conv 1_2
Conv 2_1
Conv 2_2
Conv 3_1
Conv 3_2
Conv 3_3
Conv 3_4
Conv 4_1
-
Avg Pool
Avg Pool
Avg Pool
High
Content 𝑰𝑰𝑪𝑪
Feature Freq
Avg Pool
Up Sampling
2x
Photo-realistic style
𝒑𝒑
𝑰𝑰𝑺𝑺
Artistic style
𝑰𝑰𝒂𝒂𝑺𝑺
𝑓𝑓𝑐𝑐 𝑓𝑓𝑠𝑠 Domain-aware Domain-aware (b) Extraction of high frequency
Domainness Skip Connection 𝜶𝜶𝟐𝟐 Skip Connection 𝜶𝜶𝟑𝟑
WCT Indicator
Conv 2_2
Up Sampling 2x
Up Sampling 2x
Up Sampling 2x
Conv 1_1
Conv 1_2
Conv 2_1
Conv 3_1
Conv 3_2
Conv 3_3
Conv 3_4
Conv 4_1
� � 𝒇𝒇𝒘𝒘𝒘𝒘𝒘𝒘 + 𝜶𝜶
𝟏𝟏 − 𝜶𝜶 � � 𝒇𝒇𝒔𝒔𝒔𝒔
+ + +
WCT
WCT
WCT
Conv 4_1
Conv 1_2
Conv 2_2
Conv 3_4
Input
𝑰𝑰
… … … Pretrained VGG-19
(Encoder) Φ
As discussed in earlier sections, the skip-connection is a
fundamental difference that separates photo-realistic meth- Φ1 (𝐼𝐼) Φ2 (𝐼𝐼) Φ3 (𝐼𝐼)
(64, 4W, 4H) (128, 2W, 2H) (256, W, H)
ods from artistic ones. It is advantageous in terms of struc- 𝒇𝒇𝟏𝟏 𝒇𝒇𝟐𝟐 𝒇𝒇𝟑𝟑
tural preservation, but it constrains delicate artistic expres- 𝒢𝒢(𝑓𝑓1 ) 𝒢𝒢(𝑓𝑓2 ) 𝒢𝒢(𝑓𝑓3 )
sions. Based on this observation, we propose a domain-
aware skip connection to conduct domain-aware universal
𝜓𝜓1 𝜓𝜓2 𝜓𝜓3
style transfer on both photo-realistic and artistic domains.
The domain-aware skip connection transforms the content (100, 1, 1) (100, 1, 1) (100, 1, 1)
replicate replicate replicate
feature fc with the given reference feature fs . Thereafter, 1×1 Conv 1×1 Conv 1×1 Conv
we extract the high-frequency components of a stylized (100, W, H) AvgPool (100, W, H) AvgPool (100, W, H)
feature as shown in Figure 3 (b). We exploit this high- 𝒇𝒇′𝟏𝟏 𝒇𝒇′𝟐𝟐 𝒇𝒇′𝟑𝟑
Content & Reference (a) Ours (b) Avatar-Net [24] (c) AAMS [30] (d) Collaborative (e) PhotoWCT [16] (f) WCT 2 [31]
-Distillation [27]
Figure 5. Qualitative comparisons with state-of-the-art models. The blue box indicates photo-realistic reference images and the red box
indicates artistic ones. We depict the stylized results from existing artistic style transfer models (b-d) and photo-realistic ones (e-f). Previous
models produce unsatisfactory results when they receive images from different domains, not the target of each model. The results of (a)
demonstrate that our DSTNs produce both photo-realistic and artistic well regardless of the domain of given images.
in Figure 6. We observe that DSTNs express not only the 4.2. Quantitative results
texture of the reference image but also the domain charac-
teristics well. (More qualitative comparisons are included in Statistics. To measure both photorealism and the degree
our supplementary materials.) of style representation, we employ two proxy metrics for
0.45
0.00
Ideal
Collaborative Methods Preservationp Stylizationa Domainnesso
0.50 -Distillation [27] Ours (+ℒ𝒂𝒂𝒂𝒂𝒂𝒂 )
Avatar-Net [24]
AdaIN [10]
Avatar-Net [24] 0.76% 19.66% 16.54%
SANet [21]
AAMS [30] Ours AAMS [30] 5.30% 18.59% 20.51%
0.55 WCT [15]
LST [14] LST* Colla-Distil [27] 1.52% 26.07% 19.62%
Style loss ↓
(a) Content