Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis

Yankun Wu, Yuta Nakashima, Noa Garcia
Osaka University
yankun@is.ids.osaka-u.ac.jp, n-yuta@ids.osaka-u.ac.jp, noagarcia@ids.osaka-u.ac.jp

ABSTRACT

The duality of content and style is inherent to the nature of art. For humans, these two elements are clearly different: content refers to the objects and concepts in the piece of art, and style to the way it is expressed. This duality poses an important challenge for computer vision. The visual appearance of objects and concepts is modulated by the style, which may reflect the author's emotions, social trends, artistic movement, etc., and their deep comprehension undoubtedly requires handling both. A promising step towards a general paradigm for art analysis is to disentangle content and style, whereas relying on human annotations to cull a single aspect of artworks has limitations in learning semantic concepts and the visual appearance of paintings. We thus present GOYA, a method that distills the artistic knowledge captured in a recent generative model to disentangle content and style. Experiments show that synthetically generated images sufficiently serve as a proxy of the real distribution of artworks, allowing GOYA to separately represent the two elements of art while keeping more information than existing methods.

[Figure 1: An overview of our method GOYA. By using Stable Diffusion generated images, we disentangle content and style spaces from CLIP space, where content space represents semantic concepts and style space captures visual appearance.]
CCS CONCEPTS
• Computing methodologies → Image representations; • Applied computing → Fine arts.
KEYWORDS
art analysis, representation disentanglement, text-to-image generation

ACM Reference Format:
Yankun Wu, Yuta Nakashima, and Noa Garcia. 2023. Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis. In International Conference on Multimedia Retrieval (ICMR '23), June 12–15, 2023, Thessaloniki, Greece. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3591106.3592262

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICMR '23, June 12–15, 2023, Thessaloniki, Greece
© 2023 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-0178-8/23/06. https://doi.org/10.1145/3591106.3592262

1 INTRODUCTION

Content and style are two fundamental elements in the analysis of art. On the one hand, content describes the concepts depicted in the image, such as the objects, people, or locations. It addresses the question of what the artwork is about. On the other hand, style describes the visual appearance of the image: its color, composition, or shape, addressing the question of how the artwork looks. Through a unique combination of content and style, a piece of art looks as it is, making the disentanglement of these two elements an essential trait in the study of digital humanities.

Whereas for humans, content and style are easily distinguishable (we can often tell apart the topics depicted in a painting from their visual appearance without much trouble), the boundary is not so clear from a computer vision perspective. Traditionally, for analyzing content in artworks, computer vision relies on object recognition techniques [2, 15, 30]. However, even when artworks contain similar objects, the subject matter may still be different. Likewise, the automatic analysis of style is not without controversy. As there is not a formal definition of what visual appearance is, there is a degree of vagueness and subjectivity in the computation of style. Some methods [5, 24] rely on well-established attributes, such as author or artistic movement, as a proxy to classify style. While this definition may be enough for some applications, such as artist identification [79], it is not appropriate in others, e.g. style transfer [29] or image search [85]. In style transfer, for example, style is defined as the low-level features of an image (e.g. colors, brushstrokes, shapes). In a broader sense, however, style is not formed by a single image but by a set of artworks that share a common visual appearance [47].
On top of these challenges, most of the methods for art analysis are trained with full supervision, requiring each image in the dataset to be annotated with its corresponding content or style category. This categorization poses some additional problems. Firstly, although there are digitized collections of artworks that can be leveraged for supervised learning (e.g. WikiArt¹, The Met², Web Gallery of Art³), labels for new artworks are not straightforward to obtain, and often require experts to annotate them. Secondly, the annotated labels, which are often single words, reflect only some general traits of a set of artworks while ignoring the subtle properties of each image. For instance, given a painting with a genre label still life and an artistic movement label Expressionism, what scene does it depict and what does its visual appearance look like? We can infer some of the coarse attributes it may carry, e.g. inanimate subjects from still life and strong subjective emotions from Expressionism. However, some fine attributes such as depicted concepts, color composition, and brushstrokes still remain obscure. When training based on labels, it is difficult to learn the subtle content and style discrepancy in images. To overcome the limitations imposed by labels, some work [1, 4, 26] has relied on natural language descriptions instead of categorical classes. Although natural language can contribute to resolving the ambiguity and rigidness of labels, human experts still need to write descriptions for each artwork.

¹ https://www.wikiart.org/
² https://www.metmuseum.org/
³ https://www.wga.hu/

Moving away from human supervision, we investigate the generative power of a popular text-to-image model, Stable Diffusion [58], and propose to leverage the distilled knowledge as a prior to learn disentangled content and style embeddings of paintings. Given a text input, also known as a prompt, Stable Diffusion can generate a set of diverse synthetic images while maintaining content and style consistency. The subtle characteristics of content and style in the synthetically generated images can be controlled by well-defined prompts. Thus, we propose to use the generated images, with the help of the input prompts, to train a model that disentangles content and style in paintings without direct human annotations. Concurrent to our work, it has been shown that Stable Diffusion generated images can be useful for image classification tasks [64].

The intuition behind our method, named GOYA (disentanGlement of cOntent and stYle with generAtions), is that although there is no explicit boundary between different contents or styles, we can distinguish significant dissimilarities by comparison. Our simple yet effective model (Figure 1) first extracts joint content-style embeddings using the pre-trained CLIP image encoder [56], and then applies two independent content and style transformation networks to learn disentangled content and style embeddings. As mentioned before, the weights of the transformation networks are trained on the generated synthetic images with contrastive learning, reducing the need for human image-level annotations.

We conduct three tasks and an ablation study on a popular benchmark of paintings, the WikiArt dataset [74]. We show that GOYA, by distilling the knowledge from Stable Diffusion generated images, disentangles content and style better than models trained on real paintings. Moreover, experiments show that the resulting disentangled spaces are useful for downstream tasks such as similarity retrieval and art classification. In summary, our contributions are:

• We design a model to disentangle content and style information from pre-trained CLIP's latent space.
• We propose to train the disentanglement model with synthetically generated images, instead of real paintings, via Stable Diffusion and prompt design.
• We show that the information in Stable Diffusion generated images can be effectively distilled for art analysis, performing well on tasks such as art retrieval and art classification.

Our findings open the way for adopting generative models in digital humanities, not only for generation but also for analysis.

2 RELATED WORK

Art analysis. The use of computer vision techniques for art analysis has been an active research topic for decades, in particular in tasks such as attribute classification [20, 24, 53, 55, 75], object recognition [2, 15, 30, 68], or image retrieval [2, 15, 26, 85]. Fully-supervised tasks (e.g., genre or artist classification [25, 54, 75]) have obtained outstanding results by leveraging neural networks trained on annotated datasets [54, 55, 62, 72]. However, image annotations have some limitations. An important one is the categorization of styles. Multiple datasets [22, 37, 39, 54, 55, 62, 72, 80] provide style labels, which have been leveraged by abundant work [8, 13, 21, 48, 60, 63, 79] for style classification. This direction of work assumes style to be a static attribute, instead of dynamic and evolving [47]. A different interpretation is provided in style transfer [6, 7, 16, 29]. A model extracts the low-level representation of a stylized image (e.g. a painting) and applies it to a content image (e.g. a plain photograph), defining style by a single artwork, i.e. the color, shape, and brushstroke of the stylized image. To overcome the rigidness of labels in supervised learning and the narrowness of a single image in style transfer, we propose learning disentangled embeddings of content and style by similarity comparisons, leveraging the flexibility of a text-to-image generative model.

Representation disentanglement. Learning disentangled representations plays an essential role in various computer vision tasks such as style transfer [44, 82], image manipulation [69, 83], and image-to-image translation [23, 31, 36, 86]. The goal is to discover discrete factors of variation in data, thus improving the interpretability of representations and enabling a wide range of downstream applications. To disentangle attributes like azimuth, age, or gender, previous work has built on adversarial learning [10, 17, 40, 77] or variational autoencoders (VAE) [33, 43, 67], aiming to encourage discrete properties in a single latent space. For content and style disentanglement, several works apply generative models [23, 38, 44], diffusion models [46, 81], or an autoencoder architecture with contrastive learning [14, 44, 59]. In the art domain, ALADIN [59] concatenates the adaptive instance normalization (AdaIN) [35] feature into the style encoder to learn style embeddings for visual search. Kotovenko et al. [44] propose a fixpoint triplet loss and a disentanglement loss for better style transfer. However, these works lack semantic analysis of content embeddings in paintings. Recently, Vision Transformer (ViT) [19]-based models have shown the ability to obtain structure and appearance embeddings [3, 46, 78]. DiffuseIT [46] and Splice [78] learn content and style embeddings by utilizing the keys and the global [CLS] token of pre-trained DINO [3]. In our work, taking advantage of the generative model, our approach builds a simple framework to decompose the latent space into content and style spaces with contrastive learning, exploring the use of generated images for representation learning.
[Figure 2: Details of our proposed method GOYA for content and style disentanglement. Given a synthetic prompt containing content (first part of the prompt, in green) and style (second part of the prompt, in red) descriptions, we generate synthetic diffusion images. We compute CLIP embeddings with the frozen CLIP image encoder, and generate content and style disentangled embeddings with two dedicated encoders C and S, respectively. In the training stage, projectors h^C and h^S and style classifier R are used to train GOYA with contrastive learning. For content, contrastive learning pairs are chosen based on the text embedding of the content description in the prompt extracted by the frozen CLIP text encoder. For style, contrastive learning pairs are chosen based on the style description in the prompt.]

Text-to-image generation. Text-to-image models intend to perform synthetic image generation conditioned on a given text input. Catalyzed by datasets with massive text-image pairs that have emerged in recent years, many powerful text-to-image generation models have sprung up [18, 49, 50, 57, 58, 61, 76]. For instance, CogView [18] is trained on 30 million text-image pairs while DALL-E 2 [57] is trained on 650 million text-image pairs. One of their main challenges is achieving semantic coherence between guiding texts and generated images. This has been addressed by using pre-trained CLIP embeddings [56] to construct aligned text and image features in the latent space [45, 49, 88]. Another challenge is to obtain high-resolution synthetic images. GAN-based models [50, 73, 76, 88] have achieved good performance in improving the quality of generated images; however, they suffer from instability in training. Exploiting their superior training stability, works based on diffusion models [57, 58, 61] have recently become a popular tool for generating near-human quality images. Despite the rapid development of models for image generation, how to leverage the features of synthetic images remains an underexplored area of research. In this paper, we study the potential of generated images for enhancing representation learning.

3 PRELIMINARIES

3.1 Stable Diffusion
Diffusion models [34, 58] are generative methods trained in two stages: a forward process with a Markov chain to transform input data into noise, and a reverse process to reconstruct data from the noise, obtaining high-quality performance on image generation. To reduce the training cost and accelerate the inference process, Stable Diffusion [58] trains the diffusion process in the latent space instead of the pixel space. Given a text prompt as input condition, the text encoder transforms the prompt into a text embedding. Then, by feeding the embedding into the UNet through a cross-attention mechanism, the reverse diffusion process generates an image embedding in the latent space. Finally, the image embedding is fed to the decoder to generate a synthetic image.

In this work, we define symbols as follows: given a text prompt x = {x^C, x^S} as input, we can obtain the generated image y. The text x^C represents the content description and x^S denotes the style description, where {·} indicates a comma-separated string concatenation.

3.2 CLIP
CLIP [56] is a text-image matching model that aligns text and image embeddings in the same latent space. It shows high consistency between the visual concepts in the image and the semantic concepts in the corresponding text. The text encoder E_T and image encoder E_I of CLIP are trained with 400 million text-image pairs, showing outstanding performance on various text and image downstream tasks, such as zero-shot prediction [12, 87] and image manipulation [41, 45, 49]. Given the text x and an image y, the CLIP embeddings f from text and g from image, both in R^d, can be computed as:

    f = E_T(x),    (1)
    g = E_I(y).    (2)

To exploit the multi-modal CLIP space, we employ the pre-trained CLIP image encoder E_I to obtain CLIP image embeddings as the prerequisite for the subsequent disentanglement model. Moreover, during the training stage, the CLIP text embedding of a prompt is applied to acquire the semantic concepts of the generated image.
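As a concrete illustration of Eqs. (1) and (2), the short sketch below (ours, not the authors' released code) computes f and g with the openai/clip package and the ViT-B/32 weights used later by GOYA; the image path is a placeholder and the prompt is the example shown in Figure 2.

```python
# Minimal sketch of Eqs. (1)-(2): CLIP text embedding f = E_T(x) and image embedding g = E_I(y).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # frozen CLIP encoders E_T and E_I

prompt = "portrait of germaine raynal, Art Nouveau"          # x = {x^C, x^S}
image = preprocess(Image.open("generated.png")).unsqueeze(0).to(device)  # y (placeholder path)

with torch.no_grad():
    f = model.encode_text(clip.tokenize([prompt]).to(device))  # f = E_T(x), shape (1, 512)
    g = model.encode_image(image)                               # g = E_I(y), shape (1, 512)

# Pair selection later uses the cosine distance between text embeddings: D^T_ij = 1 - cos(f_i, f_j).
```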
4 GOYA

The goal of our task is to learn disentangled content and style embeddings of artworks in two different spaces. Unlike previous work, we leverage the knowledge of generated images rather than real paintings, unlocking generative models for representation analysis. Our idea is to borrow Stable Diffusion's capability to generate a wide variety of images, not only with diverse contents but also in various styles. Contrastive losses for content and style allow GOYA to learn the proximity of different artworks in the respective spaces with the consistency of generated images and text prompts.

Figure 2 shows an overview of GOYA. Given a mini-batch of N prompts {x_i}_{i=1}^N, where x_i = {x^C_i, x^S_i} with comma-connected content and style descriptions, we obtain diffusion generated images y_i using Stable Diffusion. We then compute CLIP image embeddings g_i by Eq. (2) and use a content encoder and a style encoder to obtain disentangled embeddings in two different spaces. As previous work has shown [27], content and style have different properties: content embeddings relate to higher layers in a deep neural network, while style embeddings respond to lower layers. We therefore design an asymmetric network architecture for extracting content and style, which is common in the art analysis domain [27, 44, 54, 59].

4.1 Content encoder
The content encoder C maps the CLIP image embedding g_i to the content embedding g^C_i as:

    g^C_i = C(g_i),    (3)

where C is a two-layer perceptron (MLP) with ReLU non-linearity. Following previous work [9], to make the content embedding g^C_i highly linear, at training time we add a non-linear projector h^C on top of the content encoder, which is a three-layer MLP with ReLU non-linearity.

4.2 Style encoder
The style encoder S also maps the CLIP image embedding g_i, but to the style embedding g^S_i as:

    g^S_i = S(g_i),    (4)

where S is a three-layer MLP with ReLU non-linearity. In particular, following [32], we apply a skip connection before the last ReLU non-linearity in S. Similar to the content encoder, a non-linear projector h^S with the same structure as h^C is added after S to facilitate contrastive learning.

4.3 Content contrastive loss
Unlike prior work [44], which defines content similarity only when style-transferred images are from the same source, we use a broader definition of content similarity. We propose a soft-positive selection strategy that defines pairs of images with similar content according to their semantic similarity. That is, two images with similar semantic concepts are defined as a positive pair, whereas images without semantic similarity are negative pairs.

To quantify semantic similarity between a pair of images, we exploit the CLIP latent space and compute text similarity between the associated texts. Given the content description x^C_i of the image y_i, we assume that the CLIP text embedding f^C_i = E_T(x^C_i) can be a proxy for the content of y_i. Therefore, given a pair of diffusion images (y_i, y_j) and a text similarity threshold ε_T, they are a positive pair if D^T_ij ≤ ε_T, where D^T_ij is the text similarity obtained by the cosine distance between the CLIP text embeddings f^C_i and f^C_j. The content contrastive loss is defined as:

    L^C_ij = 1[D^T_ij ≤ ε_T] (1 − D^C_ij) + 1[D^T_ij > ε_T] max(0, D^C_ij − ε_C),    (5)

where 1[·] is the indicator function that gives 1 when the condition is true and 0 otherwise. D^C_ij is the cosine distance between h^C(g^C_i) and h^C(g^C_j), which are the content embeddings of the images after projection. ε_C is the margin that constrains the minimum distance of negative pairs.

4.4 Style contrastive loss
The style contrastive loss is defined based on the style description x^S given in the input prompt. If a pair of images shares the same style class, they are considered a positive pair, which means that their style embeddings should be close in the style space. Otherwise, they are a negative pair, and they should be pushed away from each other. Given (y_i, y_j), the style contrastive loss is computed as:

    L^S_ij = 1[x^S_i = x^S_j] (1 − D^S_ij) + 1[x^S_i ≠ x^S_j] max(0, D^S_ij − ε_S),    (6)

where D^S_ij is the cosine distance between the style embeddings h^S(g^S_i) and h^S(g^S_j) after projection, and ε_S is the margin.

4.5 Style classification loss
To learn the general attributes of each style, we introduce a style classifier R to predict the style description (given as x^S_i) based on the embedding g^S_i of image y_i. The prediction w^S_i by the classifier is given by:

    w^S_i = R(g^S_i),    (7)

where R is a linear layer. For training, we use the softmax cross-entropy loss, denoted by L^SC_i. Note that the training of this classifier does not rely on human annotations, but on the synthetic prompts and the images generated by Stable Diffusion.

4.6 Total loss
In the training process, we compute the sum of the three losses. The overall loss function in a mini-batch is formulated as:

    L = λ_C Σ_ij L^C_ij + λ_S Σ_ij L^S_ij + λ_SC Σ_i L^SC_i,    (8)

where λ_C, λ_S, and λ_SC are parameters that control the contributions of the losses. We set λ_C = λ_S = λ_SC = 1. The summations over i and j are computed for all pairs of images in the mini-batch, and the summation over i is for all images in the mini-batch.
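The following PyTorch sketch illustrates how Eqs. (5)–(8) can be assembled over a mini-batch; the helper names, the handling of diagonal pairs, and the averaging over pairs are our assumptions rather than the released implementation.

```python
# Sketch (under our own assumptions) of the GOYA training objective in Eqs. (5)-(8).
import torch
import torch.nn.functional as F

def cosine_distance_matrix(a, b):
    """Pairwise cosine distance D_ij = 1 - cos(a_i, b_j)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return 1.0 - a @ b.t()

def goya_loss(hc, hs, f_content, style_ids, style_logits,
              eps_T=0.25, eps_C=0.5, eps_S=0.5):
    # hc, hs: projected content/style embeddings h^C(g^C), h^S(g^S), shape (N, d)
    # f_content: CLIP text embeddings f^C of the content descriptions, shape (N, 512)
    # style_ids: integer style class per prompt (LongTensor); style_logits: output of classifier R
    D_T = cosine_distance_matrix(f_content, f_content)   # text distance between prompts
    D_C = cosine_distance_matrix(hc, hc)                  # content embedding distance
    D_S = cosine_distance_matrix(hs, hs)                  # style embedding distance

    pos_c = (D_T <= eps_T).float()                        # soft-positive selection, Eq. (5)
    L_C = pos_c * (1 - D_C) + (1 - pos_c) * torch.clamp(D_C - eps_C, min=0)

    same_style = (style_ids[:, None] == style_ids[None, :]).float()   # Eq. (6)
    L_S = same_style * (1 - D_S) + (1 - same_style) * torch.clamp(D_S - eps_S, min=0)

    L_SC = F.cross_entropy(style_logits, style_ids)       # Eq. (7) with softmax cross-entropy

    # Eq. (8) with lambda_C = lambda_S = lambda_SC = 1; we average pair losses for stability,
    # and do not explicitly exclude i = j pairs (a simplification of the sums in the paper).
    return L_C.mean() + L_S.mean() + L_SC
```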
5 EVALUATION

We evaluate GOYA on three tasks: disentanglement (Section 5.1), classification (Section 5.2), and similarity retrieval (Section 5.3). We also conduct an ablation study in Section 5.4.

Evaluation data. To evaluate content and style in the classification task, we use the genre and style movement labels in art datasets, which can serve as substitutes for content and style even if they do not entirely satisfy our definition in this paper. In detail, the genre labels indicate the type of scene depicted in the paintings, such as "portrait" or "cityscape", while the style movement labels correspond to artistic movements such as "Impressionism" and "Expressionism". We use the WikiArt dataset [74] for evaluation, a popular artwork dataset with both genre and style movement annotations. In total, we get 81,445 paintings, 57,025 in the training set, 12,210 in the validation set, and 12,210 in the test set, with three types of labels: 23 artists, 10 genres, and 27 style movements. All evaluation results are computed on the test set.

Training data. Baselines reported on WikiArt are trained with the WikiArt training set. GOYA is trained with images generated by Stable Diffusion, which are described in the next paragraph. Additionally, the training dataset of Stable Diffusion, LAION-5B [66], has over five billion image-text pairs, which contain some paintings from the WikiArt test set. We examine other models trained on generated images, which are equally affected by this issue.

Image generation details. To generate images that look similar to human-made paintings, we rely on crafting prompts x = {x^C, x^S} as explained in Section 3.1. For simplicity, we choose titles of paintings as x^C and style movements as x^S, although any other definitions of content and style descriptions can be used. In total, there are 43,610 content descriptions x^C and 27 style descriptions x^S. For each x^C, we randomly choose five x^S to generate five prompts x. Then, each prompt generates five images with random seeds. Altogether, we obtain 218,050 prompts and 1,090,250 synthetic images. We split the generated images into 981,225 training and 109,025 validation images. We use Stable Diffusion v1.4⁴ and generate images of size 512 × 512 through 50 PLMS [51] sampling steps.

Figure 3 shows some examples of diffusion generated images by the designed prompts. We can observe that the depicted scene is consistent with the content description in the prompts. Images in the same column have the same x^C but different x^S, and show high agreement in content while carrying significant differences in style. Likewise, images in the same row have the same x^S but different x^C, and paint different scenes or objects while maintaining a similar style. However, some of the content descriptions are religious, such as the x^C in the third column, "our father who art in heaven." In such cases, it may be difficult to have an agreement on the semantic consistency of the generated images and the prompts.

[Figure 3: Examples of prompts and the corresponding generated diffusion images. The first part of the prompt (in blue) denotes the content description x^C, and the second part (in orange) is the style description x^S. Each column depicts the same content x^C while each row depicts one style x^S. Content prompts include "street in moret sur loing", "woman with a child on her lap", and "our father who art in heaven"; style prompts include Analytical Cubism, Ukiyo-e, and Realism.]

⁴ https://github.com/CompVis/stable-diffusion
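A hedged sketch of this generation setup is shown below, using the Hugging Face diffusers pipeline as a stand-in for the CompVis scripts; the title and movement lists and output paths are illustrative, and the pipeline's default PNDM scheduler (based on [51]) only approximates the PLMS sampling reported above.

```python
# Illustrative prompt construction x = {x^C, x^S} and image generation with Stable Diffusion v1.4.
import random
import torch
from diffusers import StableDiffusionPipeline

titles = ["street in moret sur loing", "woman with a child on her lap"]     # x^C (painting titles)
movements = ["Analytical Cubism", "Ukiyo-e", "Realism", "Baroque", "Expressionism"]  # x^S

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")

for title in titles:
    for style in random.sample(movements, k=min(5, len(movements))):   # five styles per title
        prompt = f"{title}, {style}"               # comma-separated concatenation of x^C and x^S
        for seed in range(5):                      # five images per prompt with different seeds
            generator = torch.Generator("cuda").manual_seed(seed)
            image = pipe(prompt, num_inference_steps=50, height=512, width=512,
                         generator=generator).images[0]
            image.save(f"{title}_{style}_{seed}.png".replace(" ", "_"))
```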
GOYA details. For the CLIP image and text encoders, we load the pre-trained weights of the CLIP-ViT-B/32 model.⁵ The margins for computing the contrastive losses are ε_C = ε_S = 0.5. In the indicator function for the content contrastive loss, the threshold ε_T is set to 0.25. We use the Adam optimizer [42] with base learning rate 0.0005 and decay rate 0.9. We train GOYA on 4 A6000 GPUs with Distributed Data Parallel in PyTorch.⁶ On each device, the batch size is set to 512. Before being fed into CLIP, images are resized to 224 × 224 pixels.

⁵ https://github.com/openai/CLIP
⁶ https://pytorch.org/

Table 1: Distance Correlation (DC) between content and style embeddings on the WikiArt test set. Labels indicates the results when using a one-hot vector embedding of the ground-truth labels. ResNet50 and CLIP are fine-tuned on WikiArt while DINO loads the pre-trained weights.

Model          Training params.  Training data  Emb. size content  Emb. size style  DC ↓
Labels         -                 -              27                 27                0.269
ResNet50 [32]  47M               WikiArt        2,048              2,048             0.635
CLIP [56]      302M              WikiArt        512                512               0.460
DINO [3]       -                 -              616,225            768               0.518
GOYA (Ours)    15M               Diffusion      2,048              2,048             0.367
Table 2: Genre and style movement accuracy on the WikiArt [74] dataset for different models.

Model                      Training data  Label     Num. train  Emb. size content  Emb. size style  Acc. genre  Acc. style
Pre-trained
Gram Matrix [27, 28]       -              -         -           4,096              4,096            61.81       40.79
ResNet50 [32]              -              -         -           2,048              2,048            67.85       43.15
CLIP [56]                  -              -         -           512                512              71.56       51.23
DINO [3]                   -              -         -           616,225            768              51.13       38.81
Trained on WikiArt
ResNet50 [32] (Genre)      WikiArt        Genre     57,025      2,048              2,048            79.13       43.17
ResNet50 [32] (Style)      WikiArt        Style     57,025      2,048              2,048            67.22       64.44
CLIP [56] (Genre)          WikiArt        Genre     57,025      512                512              80.43       34.98
CLIP [56] (Style)          WikiArt        Style     57,025      512                512              56.28       63.02
SimCLR [9]                 WikiArt        -         57,025      2,048              2,048            65.82       45.15
SimSiam [11]               WikiArt        -         57,025      2,048              2,048            51.65       31.24
Trained on Diffusion generated
ResNet50 [32] (Movement)   Diffusion      Movement  981,225     2,048              2,048            61.78       45.79
CLIP [56] (Movement)       Diffusion      Movement  981,225     512                512              52.65       43.58
SimCLR [9]                 Diffusion      -         981,225     2,048              2,048            33.82       20.88
GOYA (Ours)                Diffusion      -         981,225     2,048              2,048            69.70       50.90

5.1 Disentanglement evaluation
To measure content and style disentanglement quantitatively, we compute the Distance Correlation (DC) [52] between the content and style embeddings, which is specially designed for content and style disentanglement evaluation. Let G^C and G^S denote matrices containing all content and style embeddings in the WikiArt test set, i.e., G^C = (g^C_1 ··· g^C_N) and G^S = (g^S_1 ··· g^S_N). For an arbitrary pair (i, j) of embeddings, the distances p^C_ij and p^S_ij can be computed by:

    p^C_ij = ‖g^C_i − g^C_j‖,    p^S_ij = ‖g^S_i − g^S_j‖,    (9)

where ‖·‖ gives the Euclidean distance. Let p̄^C_i·, p̄^C_·j, and p̄^C denote the means over j, over i, and over both i and j, respectively. With these means, the distances can be doubly centered by

    q^C_ij = p^C_ij − p̄^C_i· − p̄^C_·j + p̄^C,    (10)

and likewise for q^S_ij. DC between G^C and G^S is given by:

    DC(G^C, G^S) = dCov(G^C, G^S) / √(dCov(G^C, G^C) dCov(G^S, G^S)),    (11)

where

    dCov(G^C, G^S) = (1/N) √(Σ_i Σ_j q^C_ij q^S_ij).    (12)

dCov(G^C, G^C) and dCov(G^S, G^S) are defined likewise. DC can be computed for arbitrary matrices with N columns. DC is in [0, 1], and a lower value means G^C and G^S are less correlated. We aim at a DC close to 0.

Baselines. To compute the lower bound DC on the WikiArt test set, we assign the one-hot vector of the ground-truth genre and style movement labels as the content and style embeddings, representing the uppermost disentanglement when the labels are 100% correct. Besides the lower bound, we evaluate DC on ResNet50 [32], CLIP [56], and DINO [3]. For ResNet50, embeddings are extracted before the last fully-connected layer. For CLIP, we use the embedding from the CLIP image encoder E_I. For pre-trained DINO, following Splice [78], content and style embeddings are extracted at the deepest layer from the self-similarity of keys in the attention module and from the [CLS] token, respectively.

Results. Results are reported in Table 1. GOYA shows the best disentanglement with the lowest DC, 0.367, and a large margin over the second-best disentangled embeddings from fine-tuned CLIP. With only about 1/3 of the training parameters of ResNet50 and 1/20 of those of CLIP, GOYA outperforms embeddings directly trained on WikiArt's real paintings while consuming fewer resources. Also, GOYA achieves better disentanglement capability than DINO, with much more compact embeddings, e.g. a content embedding 1/300 of the size. However, there is still a noticeable gap between GOYA and the lower bound based on labels, showing that there is room for improvement.
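For reference, the Distance Correlation of Eqs. (9)–(12) can be computed as in the NumPy sketch below; the helper names are ours and the example embeddings are random stand-ins rather than actual GOYA outputs.

```python
# Sketch of Distance Correlation (Eqs. (9)-(12)) between two embedding matrices.
import numpy as np

def doubly_centered_distances(G):
    """G: (N, d) embedding matrix -> doubly centered pairwise Euclidean distances q_ij."""
    diff = G[:, None, :] - G[None, :, :]
    p = np.linalg.norm(diff, axis=-1)                                                   # Eq. (9)
    return p - p.mean(axis=1, keepdims=True) - p.mean(axis=0, keepdims=True) + p.mean() # Eq. (10)

def dcov(qa, qb):
    n = qa.shape[0]
    return np.sqrt((qa * qb).sum()) / n                                                 # Eq. (12)

def distance_correlation(G_C, G_S):
    q_C, q_S = doubly_centered_distances(G_C), doubly_centered_distances(G_S)
    return dcov(q_C, q_S) / np.sqrt(dcov(q_C, q_C) * dcov(q_S, q_S))                    # Eq. (11)

# Example with random stand-in embeddings; in the paper G^C and G^S are the content and
# style embeddings of the WikiArt test set.
rng = np.random.default_rng(0)
print(distance_correlation(rng.normal(size=(100, 64)), rng.normal(size=(100, 64))))
```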
5.2 Classification evaluation
To evaluate the disentangled embeddings for art classification, following the protocol in [11], we train two independent classifiers with a single linear layer on top of the content and style embeddings.

Baselines. We compare GOYA against three types of baselines: pre-trained models, models trained on the WikiArt dataset, and models trained on diffusion generated images. As pre-trained models, we use the Gram matrix [27, 28], ResNet50 [32], CLIP [56], and DINO [3]. For models trained on WikiArt, other than fine-tuning ResNet50 and CLIP, we also apply two popular contrastive learning methods: SimCLR [9] and SimSiam [11]. For models trained on generated images, ResNet50 and CLIP are fine-tuned with the style movements in the prompts. SimCLR is trained without any annotations.
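A minimal sketch of this linear-probe protocol is given below, assuming frozen embeddings, full-batch updates, and a plain SGD optimizer (Appendix C reports LARS with a cosine schedule for the actual classifiers).

```python
# Linear probe: a single linear layer trained on top of frozen content or style embeddings.
import torch
import torch.nn as nn

def train_linear_probe(embeddings, labels, num_classes, epochs=90, lr=0.02):
    # embeddings: (N, d) frozen embeddings; labels: (N,) genre or style movement ids (LongTensor)
    probe = nn.Linear(embeddings.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                  # full-batch updates for brevity
        opt.zero_grad()
        loss = loss_fn(probe(embeddings), labels)
        loss.backward()
        opt.step()
    return probe

# e.g. a genre probe on content embeddings (10 WikiArt genres):
# genre_probe = train_linear_probe(content_emb, genre_labels, num_classes=10)
```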
[Figure 4: Retrieval results on the WikiArt test set based on cosine similarity. For each query, the top retrieved images in the content space and in the style space are shown; the similarity decreases from left to right. Copyrighted images are skipped.]

Results. Table 2 shows the classification results. Compared with the pre-trained baselines in the first four rows, GOYA is ahead of the Gram matrix, ResNet50, and DINO. Yet, it lags behind pre-trained CLIP by less than 1% in both genre and style movement accuracy. Compared with models trained on WikiArt, although not comparable to fine-tuned ResNet50 and CLIP on classification, GOYA achieves better disentanglement, as shown in Table 1. Also, GOYA performs better on classification than the contrastive learning models SimCLR and SimSiam.

When training on diffusion generated images, GOYA achieves the best classification performance compared to other models with different embedding sizes. After being fine-tuned on the style movements in the prompts, ResNet50 improves by 3% in style accuracy, showing the potential for analysis via synthetically generated images. However, CLIP decreases in both genre and style accuracy after fine-tuning on generated images. SimCLR shows a dramatic decrease when trained on generated images compared to training on WikiArt. As SimCLR focuses more on learning the intricacies of the image itself rather than the relations between images, it learns the distribution of the generated images, leading to poor performance on WikiArt. While training on the same dataset, GOYA sustains better capability on classification tasks while achieving high disentanglement.
5.3 Similarity retrieval
Next, we evaluate the visual retrieval performance of GOYA. Given a painting as a query, the five closest images are retrieved based on the cosine similarity of the embeddings in the content and style spaces, representing the most similar paintings in each space.

Results. Visual results are shown in Figure 4. Most of the paintings retrieved in the content space depict scenes similar to the query image. For instance, in the third query image, there is a woman with a headscarf bending over to scrub a pot, while all similar paintings in the content space show a woman leaning over to do manual labor such as washing, knitting, and chopping, independently of their visual style. It can be seen that the most similar content paintings depict various styles through different color compositions and tones. On the contrary, similar paintings in the style space are prone to carry similar styles but different content. The retrieved style images have color compositions or brushstrokes similar to the query image, but depict different scenes. For example, the fourth query image, which is one of the paintings in the "Rouen Cathedral" series by Monet, exhibits different visual appearances of the same object under varying light. The retrieved images in the style space also apply a different light condition to create a sense of space and display vivid color contrast. Not only that, but they also display similar color compositions and strokes, while painting different scenes.
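The retrieval step itself reduces to a cosine-similarity ranking in the chosen space, as in the short sketch below (variable names are illustrative).

```python
# Rank the WikiArt test set by cosine similarity to a query embedding in the content or style space.
import torch
import torch.nn.functional as F

def top_k_similar(query_emb, gallery_emb, k=5):
    # query_emb: (d,) embedding of the query painting; gallery_emb: (N, d) embeddings of the gallery
    sims = F.cosine_similarity(query_emb.unsqueeze(0), gallery_emb)  # (N,) similarities
    return sims.topk(k)  # values and indices of the k most similar paintings

# content_emb / style_emb would be GOYA embeddings of the test set; using either one
# retrieves neighbours in the corresponding space.
```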

5.4 Ablation study
We conduct an ablation study on the WikiArt test set to examine the effectiveness of the losses and the network structure in GOYA.

Losses. We compare the losses in GOYA against two other popular contrastive losses, the triplet loss [65] and the NTXent loss [71], both of which have shown their superiority in many contrastive learning methods. We also investigate applying the style classification loss on top of the above-mentioned contrastive losses. The selection criteria for positive and negative pairs are the same for all the losses.

The results in terms of accuracy (as the product of genre and style movement accuracies) and disentanglement (as DC) are shown in Figure 5. The NTXent loss achieves the highest accuracy but at the cost of undercutting disentanglement ability. In contrast, the triplet loss has almost the best disentanglement performance but is at a disadvantage in terms of classification performance. Compared to those two losses, only the contrastive loss in GOYA is able to sustain a balance between disentanglement and classification performance. Moreover, after adding the classification loss, GOYA gets a boost in classification without sacrificing disentanglement, achieving the best performance compared to the other loss settings.

[Figure 5: Loss comparison. The x-axis shows the product of genre and style accuracies (the higher the better) while the y-axis presents the disentanglement, DC (the lower the better). The purple line shows the trendline y = 0.0776 + 0.9295x. In general, better accuracy is obtained at the expense of a worse disentanglement. Only GOYA (Contrastive + Classifier loss) improves accuracy without damaging DC.]

Embedding size. We explore the effect of the embedding size with single-layer content and style encoders, with embedding sizes ranging from 256 to 2,048. Figure 6 shows that the accuracy of both genre and style improves up to 6% as the embedding size increases, but conversely, the DC becomes worse, from 0.750 to 0.814, indicating a trade-off between classification and disentanglement. Moreover, the classification performance on genre and style movement outperforms pre-trained CLIP (shown in Table 2) when the embedding size exceeds 512, suggesting that larger embedding sizes carry a stronger ability to distill knowledge from the pre-trained model. Inspired by this finding, we set the embedding size to 2,048.

[Figure 6: Disentanglement and classification evaluation with different embedding sizes when only one single layer is set in the content and style encoders.]

6 CONCLUSION
This work proposed GOYA, a method for disentangling content and style embeddings of paintings by training on synthetic images generated with Stable Diffusion. Exploiting the multi-modal CLIP latent space, we first extracted off-the-shelf embeddings to then learn similarities and dissimilarities in content and style with two encoders trained with contrastive learning. Evaluation on the WikiArt dataset included disentanglement, classification, and similarity retrieval. Despite relying only on synthetic images, results showed that GOYA achieves good disentanglement between content and style embeddings. We hope this work sheds light on the adoption of generative models in the analysis of digital humanities.
ACKNOWLEDGMENTS
This work is partly supported by JST FOREST Grant No. JPMJFR216O and JSPS KAKENHI Grant No. 23H00497 and No. 20K19822.

REFERENCES
[1] Zechen Bai, Yuta Nakashima, and Noa Garcia. 2021. Explain me the painting: Multi-topic knowledgeable art description generation. In ICCV. 5422–5432.
[2] Gustavo Carneiro, Nuno Pinho da Silva, Alessio Del Bue, and João Paulo Costeira. 2012. Artistic image classification: An analysis on the printart database. In ECCV. Springer, 143–157.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In ICCV. 9650–9660.
[4] Eva Cetinic. 2021. Towards generating and evaluating iconographic image captions of artworks. Journal of Imaging 7, 8 (2021), 123.
[5] Eva Cetinic, Tomislav Lipic, and Sonja Grgic. 2018. Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications 114 (2018), 107–118.
[6] Haibo Chen, Lei Zhao, Zhizhong Wang, Huiming Zhang, Zhiwen Zuo, Ailin Li, Wei Xing, and Dongming Lu. 2021. Dualast: Dual style-learning networks for artistic style transfer. In CVPR. 872–881.
[7] Haibo Chen, Lei Zhao, Huiming Zhang, Zhizhong Wang, Zhiwen Zuo, Ailin Li, Wei Xing, and Dongming Lu. 2021. Diverse image style transfer via invertible cross-space mapping. In ICCV. IEEE Computer Society, 14860–14869.
[8] Liyi Chen and Jufeng Yang. 2019. Recognizing the style of visual arts via adaptive cross-layer correlation. In ACM MM. 2459–2467.
[9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In ICML. PMLR, 1597–1607.
[10] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NeurIPS 29 (2016).
[11] Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In CVPR. 15750–15758.
[12] Ruizhe Cheng, Bichen Wu, Peizhao Zhang, Peter Vajda, and Joseph E Gonzalez. 2021. Data-efficient language-supervised zero-shot learning with self-distillation. In CVPR. 3119–3124.
[13] Wei-Ta Chu and Yi-Ling Wu. 2018. Image style classification based on learnt deep correlation features. Transactions on Multimedia 20, 9 (2018), 2491–2502.
[14] John Collomosse, Tu Bui, Michael J Wilber, Chen Fang, and Hailin Jin. 2017. Sketching with style: Visual search with sketches and aesthetic context. In ICCV. 2660–2668.
[15] Elliot J Crowley and Andrew Zisserman. 2014. The state of the art: Object retrieval in paintings using discriminative regions. In BMVC.
[16] Yingying Deng, Fan Tang, Weiming Dong, Wen Sun, Feiyue Huang, and Changsheng Xu. 2020. Arbitrary style transfer via multi-adaptation network. In ACM MM. 2719–2727.
[17] Emily L Denton et al. 2017. Unsupervised learning of disentangled representations from video. NeurIPS 30 (2017).
[18] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. 2021. CogView: Mastering text-to-image generation via transformers. NeurIPS 34 (2021), 19822–19835.
[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021).
[20] Cheikh Brahim El Vaigh, Noa Garcia, Benjamin Renoust, Chenhui Chu, Yuta Nakashima, and Hajime Nagahara. 2021. GCNBoost: Artwork classification by label propagation through a knowledge graph. In ICMR. 92–100.
[21] Ahmed Elgammal, Bingchen Liu, Diana Kim, Mohamed Elhoseiny, and Marian Mazzone. 2018. The shape of art history in the eyes of the machine. In AAAI, Vol. 32.
[22] Corneliu Florea, Răzvan Condorovici, Constantin Vertan, Raluca Butnaru, Laura Florea, and Ruxandra Vrânceanu. 2016. Pandora: Description of a painting database for art movement recognition with baselines and perspectives. In European Signal Processing Conference. 918–922.
[23] Aviv Gabbay and Yedid Hoshen. 2020. Improving style-content disentanglement in image-to-image translation. arXiv preprint arXiv:2007.04964 (2020).
[24] Noa Garcia, Benjamin Renoust, and Yuta Nakashima. 2019. Context-aware embeddings for automatic art analysis. In ICMR. 25–33.
[25] Noa Garcia, Benjamin Renoust, and Yuta Nakashima. 2020. ContextNet: representation and exploration for painting classification and retrieval in context. International Journal of Multimedia Information Retrieval 9, 1 (2020), 17–30.
[26] Noa Garcia and George Vogiatzis. 2018. How to read paintings: semantic art understanding with multi-modal retrieval. In ECCV Workshops. 0–0.
[27] LA Gatys, AS Ecker, and M Bethge. 2015. A Neural Algorithm of Artistic Style. Nature Communications (2015).
[28] Leon Gatys, Alexander S Ecker, and Matthias Bethge. 2015. Texture synthesis using convolutional neural networks. NeurIPS 28 (2015).
[29] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In CVPR. 2414–2423.
[30] Nicolas Gonthier, Yann Gousseau, Said Ladjal, and Olivier Bonfait. 2018. Weakly Supervised Object Detection in Artworks. In ECCV Workshops. 692–709.
[31] Abel Gonzalez-Garcia, Joost Van De Weijer, and Yoshua Bengio. 2018. Image-to-image translation for cross-domain disentanglement. NeurIPS 31 (2018).
[32] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. CVPR (2016), 770–778.
[33] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.
[34] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. NeurIPS 33 (2020), 6840–6851.
[35] Xun Huang and Serge Belongie. 2017. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. In ICCV.
[36] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal unsupervised image-to-image translation. In ECCV. 172–189.
[37] Sergey Karayev, Matthew Trentacoste, Helen Han, Aseem Agarwala, Trevor Darrell, Aaron Hertzmann, and Holger Winnemoeller. 2014. Recognizing image style. In BMVC.
[38] Hadi Kazemi, Seyed Mehdi Iranmanesh, and Nasser Nasrabadi. 2019. Style and content disentanglement in generative adversarial networks. In WACV. IEEE, 848–856.
[39] Selina J. Khan and Nanne van Noord. 2021. Stylistic Multi-Task Analysis of Ukiyo-e Woodblock Prints. In BMVC. 1–5.
[40] Valentin Khrulkov, Leyla Mirvakhabova, Ivan Oseledets, and Artem Babenko. 2022. Disentangled representations from non-disentangled models. ICLR (2022).
[41] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. 2022. DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation. In CVPR. 2426–2435.
[42] Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
[43] Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In ICLR.
[44] Dmytro Kotovenko, Artsiom Sanakoyeu, Sabine Lang, and Bjorn Ommer. 2019. Content and style disentanglement for artistic style transfer. In ICCV. 4422–4431.
[45] Gihyun Kwon and Jong Chul Ye. 2022. CLIPstyler: Image style transfer with a single text condition. In CVPR. 18062–18071.
[46] Gihyun Kwon and Jong Chul Ye. 2023. Diffusion-based image translation using disentangled style and content representation. ICLR (2023).
[47] Sabine Lang and Bjorn Ommer. 2018. Reflecting on how artworks are processed and analyzed by computer vision. In ECCV Workshops. 0–0.
[48] Adrian Lecoutre, Benjamin Negrevergne, and Florian Yger. 2017. Recognizing art style automatically in painting with deep learning. In Asian Conference on Machine Learning. PMLR, 327–342.
[49] Zhiheng Li, Martin Renqiang Min, Kai Li, and Chenliang Xu. 2022. StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis. In CVPR. 18197–18207.
[50] Wentong Liao, Kai Hu, Michael Ying Yang, and Bodo Rosenhahn. 2022. Text to image generation with semantic-spatial aware GAN. In CVPR. 18187–18196.
[51] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. 2022. Pseudo numerical methods for diffusion models on manifolds. ICLR (2022).
[52] Xiao Liu, Spyridon Thermos, Gabriele Valvano, Agisilaos Chartsias, Alison O'Neil, and Sotirios A Tsaftaris. 2021. Measuring the Biases and Effectiveness of Content-Style Disentanglement. BMVC (2021).
[53] Daiqian Ma, Feng Gao, Yan Bai, Yihang Lou, Shiqi Wang, Tiejun Huang, and Ling-Yu Duan. 2017. From part to whole: who is behind the painting?. In ACM MM. 1174–1182.
[54] Hui Mao, Ming Cheung, and James She. 2017. DeepArt: Learning joint representations of visual arts. In ACM MM. 1183–1191.
[55] Thomas Mensink and Jan Van Gemert. 2014. The rijksmuseum challenge: Museum-centered visual recognition. In ICMR. 451–454.
[56] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML. PMLR, 8748–8763.
[57] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
[58] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR. 10684–10695.
[59] Dan Ruta, Saeid Motiian, Baldo Faieta, Zhe Lin, Hailin Jin, Alex Filipkowski, Andrew Gilbert, and John Collomosse. 2021. ALADIN: all layer adaptive instance normalization for fine-grained style similarity. In ICCV. 11926–11935.
[60] Matthia Sabatelli, Mike Kestemont, Walter Daelemans, and Pierre Geurts. 2018. Deep transfer learning for art classification problems. In ECCV Workshops. 0–0.
[61] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS 35 (2022), 36479–36494.
[62] Babak Saleh and Ahmed Elgammal. 2016. Large-scale Classification of Fine-Art Paintings: Learning The Right Metric on The Right Feature. International Journal for Digital Art History 2 (2016).
[63] Catherine Sandoval, Elena Pirogova, and Margaret Lech. 2019. Two-stage deep learning approach to the classification of fine-art paintings. IEEE Access 7 (2019), 41770–41781.
[64] Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. 2023. Fake it till you make it: Learning transferable representations from synthetic ImageNet clones. In CVPR.
[65] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR. 815–823.
[66] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. NeurIPS (2022).
[67] Huajie Shao, Shuochao Yao, Dachun Sun, Aston Zhang, Shengzhong Liu, Dongxin Liu, Jun Wang, and Tarek Abdelzaher. 2020. ControlVAE: Controllable variational autoencoder. In ICML. PMLR, 8655–8664.
[68] Xi Shen, Alexei A Efros, and Mathieu Aubry. 2019. Discovering visual patterns in art collections with spatially-consistent feature learning. In CVPR. 9278–9287.
[69] Yichun Shi, Xiao Yang, Yangyue Wan, and Xiaohui Shen. 2022. SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing. In CVPR. 11254–11264.
[70] Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
[71] Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. NeurIPS 29 (2016).
[72] Gjorgji Strezoski and Marcel Worring. 2018. OmniArt: a large-scale artistic benchmark. TOMM 14, 4 (2018), 1–21.
[73] Hongchen Tan, Xiuping Liu, Meng Liu, Baocai Yin, and Xin Li. 2020. KT-GAN: knowledge-transfer generative adversarial network for text-to-image synthesis. Transactions on Image Processing 30 (2020), 1275–1290.
[74] Wei Ren Tan, Chee Seng Chan, Hernan Aguirre, and Kiyoshi Tanaka. 2019. Improved ArtGAN for Conditional Synthesis of Natural Image and Artwork. Transactions on Image Processing 28, 1 (2019), 394–409. https://doi.org/10.1109/TIP.2018.2866698
[75] Wei Ren Tan, Chee Seng Chan, Hernán E Aguirre, and Kiyoshi Tanaka. 2016. Ceci n'est pas une pipe: A deep convolutional network for fine-art paintings classification. In ICIP. IEEE, 3703–3707.
[76] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. 2022. DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis. In CVPR. 16515–16525.
[77] Luan Tran, Xi Yin, and Xiaoming Liu. 2017. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR. 1415–1424.
[78] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. 2022. Splicing ViT Features for Semantic Appearance Transfer. In CVPR. 10748–10757.
[79] Nanne Van Noord, Ella Hendriks, and Eric Postma. 2015. Toward discovery of the artist's style: Learning to recognize artists by their artworks. IEEE Signal Processing Magazine 32, 4 (2015), 46–54.
[80] Michael J Wilber, Chen Fang, Hailin Jin, Aaron Hertzmann, John Collomosse, and Serge Belongie. 2017. BAM! The behance artistic media dataset for recognition beyond photography. In ICCV. 1202–1211.
[81] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. 2022. Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models. arXiv preprint arXiv:2212.08698 (2022).
[82] Xin Xie, Yi Li, Huaibo Huang, Haiyan Fu, Wanwan Wang, and Yanqing Guo. 2022. Artistic Style Discovery With Independent Components. In CVPR. 19870–19879.
[83] Zipeng Xu, Tianwei Lin, Hao Tang, Fu Li, Dongliang He, Nicu Sebe, Radu Timofte, Luc Van Gool, and Errui Ding. 2022. Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model. In CVPR. 18229–18238.
[84] Yang You, Igor Gitman, and Boris Ginsburg. 2020. Large batch training of convolutional networks. ICLR (2020).
[85] Nikolaos-Antonios Ypsilantis, Noa Garcia, Guangxing Han, Sarah Ibrahimi, Nanne Van Noord, and Giorgos Tolias. 2021. The Met dataset: Instance-level recognition for artworks. In NeurIPS Datasets and Benchmarks Track.
[86] Xiaoming Yu, Yuanqi Chen, Shan Liu, Thomas Li, and Ge Li. 2019. Multi-mapping image-to-image translation via learning disentanglement. NeurIPS 32 (2019).
[87] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. 2022. PointCLIP: Point cloud understanding by CLIP. In CVPR. 8552–8562.
[88] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. 2022. Towards Language-Free Training for Text-to-Image Generation. In CVPR. 17907–17917.
A GOYA DETAILS
The details of the GOYA architecture are shown in Table 3.

Table 3: GOYA detailed architecture.

Components            Layer details
Content encoder C     Linear layer (512, 2048)
                      ReLU non-linearity
                      Linear layer (2048, 2048)
Style encoder S       Linear layer (512, 512)
                      ReLU non-linearity
                      Linear layer (512, 512)
                      ReLU non-linearity
                      Linear layer (512, 2048)
Projector h^C / h^S   Linear layer (2048, 2048)
                      ReLU non-linearity
                      Linear layer (2048, 64)
Style classifier R    Linear layer (2048, 27)
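Read as PyTorch modules, Table 3 corresponds roughly to the sketch below; the placement of the skip connection in the style encoder is our reading of Section 4.2, not released code.

```python
# Hedged PyTorch sketch of the modules listed in Table 3.
import torch.nn as nn

content_encoder = nn.Sequential(           # C: two-layer MLP, 512 -> 2048
    nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 2048))

class StyleEncoder(nn.Module):              # S: three-layer MLP with a skip connection
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 2048)
        self.relu = nn.ReLU()
    def forward(self, g):
        h = self.relu(self.fc1(g))
        h = self.relu(self.fc2(h) + g)      # assumed placement of the skip before the last ReLU
        return self.fc3(h)

projector = nn.Sequential(                  # h^C / h^S, used only during training
    nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 64))

style_classifier = nn.Linear(2048, 27)      # R: one linear layer over the 27 style movements
```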

B BASELINE DETAILS
Fine-tuning ResNet50 and CLIP. ResNet50 and CLIP are fine-tuned by adding a linear classifier for genre or style movement after the layer from which embeddings are extracted, and then training the entire model on top of the pre-trained checkpoint. The ground truth is the genre or style movement label in WikiArt, or the style movement in the prompt of the diffusion-generated images.

Embeddings of baselines. Gram matrix embeddings are computed from the layer conv5_1 of a pre-trained VGG19 [70]. For ResNet50 [32], CLIP [56], and DINO [3], the protocols for choosing the layer from which to extract embeddings and for fine-tuning are the same as in the disentanglement task.
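As a concrete illustration of the Gram matrix baseline, the sketch below extracts conv5_1 features from a torchvision VGG19 and computes the Gram matrix. The layer index and the normalization are assumptions based on common style-transfer practice, not the exact baseline code.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Sketch of the Gram matrix baseline: features from VGG19 conv5_1.
# Index 28 corresponds to conv5_1 in torchvision's VGG19 feature stack
# (our assumption; verify the indexing if reproducing the baseline).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def gram_embedding(image: Image.Image) -> torch.Tensor:
    x = preprocess(image).unsqueeze(0)     # (1, 3, 224, 224)
    with torch.no_grad():
        feat = vgg[:29](x)                 # features up to and including conv5_1
    b, c, h, w = feat.shape
    f = feat.view(c, h * w)
    gram = f @ f.t() / (c * h * w)         # (512, 512) Gram matrix
    return gram.flatten()                  # flattened style embedding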

C CLASSIFICATION EVALUATION DETAILS


Labels in classification evaluation. We use 10 genres and 27 style movements in the WikiArt [74] dataset for classification evaluation.
Genre labels include: abstract painting, cityscape, genre painting, illustration, landscape, nude painting, portrait, sketch and study, religious painting and still life.
Style movement labels include: Abstract Expressionism, Action painting, Analytical Cubism, Art Nouveau, Baroque, Color Field Painting, Contemporary Realism, Cubism, Early Renaissance, Expressionism, Fauvism, High Renaissance, Impressionism, Mannerism Late Renaissance, Minimalism, Naive Art Primitivism, New Realism, Northern Renaissance, Pointillism, Pop Art, Post Impressionism, Realism, Rococo, Romanticism, Symbolism, Synthetic Cubism and Ukiyo-e.

Classifier training details. The optimizer is LARS [84] with an initial learning rate of 0.02, a cosine decay schedule, and momentum of 0.9. The batch size is 4,096. We train each classifier for 90 epochs.
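This protocol is a standard linear probe on frozen embeddings. The sketch below is illustrative only: it substitutes SGD with momentum plus cosine annealing for LARS (which is not in core PyTorch), and the data wrapper and tensor names are assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative linear probe on frozen GOYA embeddings (not the official training script).
# embeddings: (N, 2048) float tensor; labels: (N,) long tensor over 27 style movements.
embeddings = torch.randn(10_000, 2048)
labels = torch.randint(0, 27, (10_000,))
loader = DataLoader(TensorDataset(embeddings, labels), batch_size=4096, shuffle=True)

classifier = nn.Linear(2048, 27)
criterion = nn.CrossEntropyLoss()
# The paper uses LARS; SGD with momentum 0.9 is a stand-in here.
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.02, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)

for epoch in range(90):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(classifier(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()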

Confusion matrix. Figure 7 shows the confusion matrix of the genre classification evaluation on the GOYA content space. The number in each cell is the proportion of images classified with the predicted label among all images carrying the true label; the darker the color, the more images are assigned to the predicted label. We observe that images from several genres are misclassified as genre painting: genre paintings usually depict a wide range of daily-life activities and therefore overlap semantically with images from other genres, such as illustration and nude painting. In addition, due to the high similarity of the depicted scenes, 28% of the cityscape images are misclassified as landscape.

Figure 8 shows the confusion matrix of the style movement classification evaluation on the GOYA style space. The boundary between some movements is not very clear, as some movements are sub-movements that represent different phases of one major movement, e.g., Synthetic Cubism within Cubism and Post Impressionism within Impressionism. Generative models may produce images close to the major movement even when the prompt specifies a sub-movement, leading GOYA to learn from inaccurate information. As a result, images from sub-movements tend to be predicted as the corresponding major movement: 82% of the images in Synthetic Cubism and 90% of the images in Analytical Cubism are classified as Cubism, and about one third of the images in Contemporary Realism and New Realism are incorrectly predicted as Realism.

Figure 7: Confusion matrix of genre classification evaluation on GOYA content space (rows: real genre, columns: predicted genre; heatmap omitted).
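The per-cell proportions described above correspond to a row-normalized confusion matrix. A minimal way to reproduce this normalization, using scikit-learn as an illustrative choice rather than the paper's tooling, is:

import numpy as np
from sklearn.metrics import confusion_matrix

# y_true / y_pred are placeholders for the real and predicted genre labels.
y_true = np.array([0, 0, 1, 2, 2, 2, 1, 0])
y_pred = np.array([0, 1, 1, 2, 2, 0, 1, 0])

# normalize="true" divides each row by the number of images with that true label,
# so each cell is the proportion shown in Figures 7 and 8.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm.round(2))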

Figure 8: Confusion matrix of style movement classification evaluation on GOYA style space (rows: real style movement, columns: predicted style movement; heatmap omitted).
D SIMILARITY RETRIEVAL COMPARISON
Here we show more similarity retrieval results (Figures 9 and 10) and compare them against CLIP [56]. In Figures 9 and 10, for each query image, the first two rows display the images retrieved from the GOYA content and style spaces, and the last row shows the images retrieved in the CLIP latent space. The results show that images retrieved in the CLIP latent space are similar in both content and style; images retrieved in the GOYA content space are consistent in the depicted scenes but vary in style; and images retrieved in the GOYA style space are similar in visual appearance but differ in content.
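The retrieval in all three spaces is a nearest-neighbor search under cosine similarity. A small sketch of that search follows; the array names and shapes are illustrative assumptions.

import numpy as np

def retrieve(query_vec: np.ndarray, gallery: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k gallery embeddings most similar to the query.

    query_vec: (d,) embedding of the query painting (content, style, or CLIP space).
    gallery:   (N, d) embeddings of the candidate paintings in the same space.
    """
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity against every gallery image
    return np.argsort(-sims)[:top_k]  # most similar first

# Illustrative usage with random embeddings standing in for GOYA content vectors.
gallery = np.random.randn(1000, 2048)
query = np.random.randn(2048)
print(retrieve(query, gallery, top_k=5))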

Figure 9: Retrieval results in GOYA content and style spaces and CLIP latent space based on cosine similarity. In each row, the similarity decreases from left to right. Copyrighted images are skipped.

Figure 10: Retrieval results in GOYA content and style spaces and CLIP latent space based on cosine similarity. In each row, the similarity decreases from left to right. Copyrighted images are skipped.
