2304.10278v1
to the objects and concepts in the piece of art, and style to the way it is expressed. This duality poses an important challenge for training: the dataset needs to be annotated with its corresponding content or style category. This categorization poses some additional problems. Firstly, although there are digitized collections of artworks that can be leveraged for supervised learning (e.g. WikiArt¹, The Met², Web Gallery of Art³), labels for new artworks are not straightforward to obtain and often require experts to annotate them. Secondly, the annotated labels, which are often single words, reflect only some general traits of a set of artworks while ignoring the subtle properties of each image. For instance, given a painting with the genre label still life and the artistic movement label Expressionism, what scene does it depict and what does its visual appearance look like? We can infer some of the coarse attributes it may carry, e.g. inanimate subjects from still life and strong subjective emotions from Expressionism. However, fine attributes such as the depicted concepts, color composition, and brushstrokes remain obscure. When training based on labels, it is difficult to learn the subtle content and style discrepancies between images. To overcome the limitations imposed by labels, some work [1, 4, 26] has relied on natural language descriptions instead of categorical classes. Although natural language can help resolve the ambiguity and rigidness of labels, human experts still need to write descriptions for each artwork.

Moving away from human supervision, we investigate the generative power of a popular text-to-image model, Stable Diffusion [58], and propose to leverage its distilled knowledge as a prior to learn disentangled content and style embeddings of paintings. Given a text input, also known as a prompt, Stable Diffusion can generate a set of diverse synthetic images while maintaining content and style consistency. The subtle characteristics of content and style in the synthetically generated images can be controlled by well-defined prompts. Thus, we propose to use the generated images, with the help of the input prompts, to train a model that disentangles content and style in paintings without direct human annotations. Concurrent to our work, it has been shown that Stable Diffusion generated images can be useful for image classification tasks [64].

The intuition behind our method, named GOYA (disentanGlement of cOntent and stYle with generAtions), is that although there is no explicit boundary between different contents or styles, we can distinguish significant dissimilarities by comparison. Our simple yet effective model (Figure 1) first extracts joint content-style embeddings using the pre-trained CLIP image encoder [56], and then applies two independent content and style transformation networks to learn disentangled content and style embeddings. As mentioned before, the weights of the transformation networks are trained on the generated synthetic images with contrastive learning, reducing the need for human image-level annotations.

We conduct three tasks and an ablation study on a popular benchmark of paintings, the WikiArt dataset [74]. We show that GOYA, by distilling the knowledge from Stable Diffusion generated images, disentangles content and style better than models trained on real paintings. Moreover, experiments show that the resulting disentangled spaces are useful for downstream tasks such as similarity retrieval and art classification. In summary, our contributions are:

• We design a model to disentangle content and style information from pre-trained CLIP's latent space.
• We propose to train the disentanglement model with synthetically generated images, instead of real paintings, via Stable Diffusion and prompt design.
• We show that the information in Stable Diffusion generated images can be effectively distilled for art analysis, performing well on tasks such as art retrieval and art classification.

Our findings open the way for adopting generative models in digital humanities, not only for generation but also for analysis.

2 RELATED WORK

Art analysis. The use of computer vision techniques for art analysis has been an active research topic for decades, in particular in tasks such as attribute classification [20, 24, 53, 55, 75], object recognition [2, 15, 30, 68], or image retrieval [2, 15, 26, 85]. Fully-supervised tasks (e.g., genre or artist classification [25, 54, 75]) have obtained outstanding results by leveraging neural networks trained on annotated datasets [54, 55, 62, 72]. However, image annotations have some limitations. An important one is the categorization of styles. Multiple datasets [22, 37, 39, 54, 55, 62, 72, 80] provide style labels, which have been leveraged by abundant work [8, 13, 21, 48, 60, 63, 79] for style classification. This direction of work assumes style to be a static attribute, instead of dynamic and evolving [47]. A different interpretation is provided in style transfer [6, 7, 16, 29]. A model extracts the low-level representation of a stylized image (e.g. a painting) and applies it to a content image (e.g. a plain photograph), defining style by a single artwork, i.e. the color, shape, and brushstrokes of the stylized image. To overcome the rigidness of labels in supervised learning and the narrowness of a single image in style transfer, we propose learning disentangled embeddings of content and style by similarity comparisons, leveraging the flexibility of a text-to-image generative model.

Representation disentanglement. Learning disentangled representations plays an essential role in various computer vision tasks such as style transfer [44, 82], image manipulation [69, 83], and image-to-image translation [23, 31, 36, 86]. The goal is to discover discrete factors of variation in data, thus improving the interpretability of representations and enabling a wide range of downstream applications. To disentangle attributes like azimuth, age, or gender, previous work has built on adversarial learning [10, 17, 40, 77] or variational autoencoders (VAE) [33, 43, 67], aiming to encourage discrete properties in a single latent space. For content and style disentanglement, several works apply generative models [23, 38, 44], diffusion models [46, 81], or an autoencoder architecture with contrastive learning [14, 44, 59]. In the art domain, ALADIN [59] concatenates the adaptive instance normalization (AdaIN) [35] feature into the style encoder to learn style embeddings for visual search. Kotovenko et al. [44] propose a fixpoint triplet loss and a disentanglement loss for performing better style transfer. However, these works lack semantic analysis of content embeddings in paintings. Recently, Vision Transformer (ViT) [19]-based models have shown the ability to obtain structure and appearance embeddings [3, 46, 78]. DiffuseIT [46] and Splice [78] learn content and style embeddings by utilizing the keys and the global [CLS] token of pre-trained DINO [3]. In our work, taking advantage of the generative model, our approach

¹ https://www.wikiart.org/
² https://www.metmuseum.org/
³ https://www.wga.hu/
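The prompt design sketched above pairs a content description with a style description in a single comma-separated prompt that drives Stable Diffusion. A minimal illustration of this composition step; the phrases and the `make_prompt` helper below are illustrative placeholders, not the paper's actual prompt templates.

```python
# Sketch of the prompt design: each synthetic prompt concatenates a content
# description x_C with a style (movement) description x_S, joined by a comma.
# The phrase lists here are hypothetical examples, not the paper's prompt set.
from itertools import product

contents = ["portrait of germaine raynal", "still life with flowers"]
styles = ["Art Nouveau", "Expressionism"]

def make_prompt(content: str, style: str) -> str:
    """Comma-separated concatenation x = {x_C, x_S}."""
    return f"{content}, {style}"

# Every content/style combination yields one training prompt; each prompt
# would then be fed to a text-to-image model to generate training images.
prompts = [make_prompt(c, s) for c, s in product(contents, styles)]
# prompts[0] == "portrait of germaine raynal, Art Nouveau"
```

Because the content and style halves are known by construction, every generated image arrives with its own weak annotations for free.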
Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis. ICMR '23, June 12–15, 2023, Thessaloniki, Greece.
Figure 2: Details of our proposed method GOYA for content and style disentanglement. Given a synthetic prompt containing content (first part of the prompt, in green) and style (second part of the prompt, in red) descriptions, we generate synthetic diffusion images. We compute CLIP embeddings with the frozen CLIP image encoder, and generate content and style disentangled embeddings with two dedicated encoders C and S, respectively. In the training stage, projectors ℎ𝐶 and ℎ𝑆 and style classifier R are used to train GOYA with contrastive learning. For content, contrastive learning pairs are chosen based on the text embedding of the content description in the prompt, extracted by the frozen CLIP text encoder. For style, contrastive learning pairs are chosen based on the style description in the prompt.
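The two dedicated encoders C and S mentioned in the caption can be sketched in NumPy as below. This is a minimal reading of the architecture (a two-layer MLP for content; a three-layer MLP with a skip connection for style), with placeholder layer sizes and random weights rather than trained parameters.

```python
import numpy as np

# Placeholder dimensionality for CLIP embeddings; the real value depends on
# the CLIP backbone (e.g. 512 for ViT-B/32). Weights are random, not trained.
d = 512
rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

def content_encoder(g, W1, W2):
    """Content encoder C: a two-layer MLP with ReLU."""
    return W2 @ relu(W1 @ g)

def style_encoder(g, W1, W2, W3):
    """Style encoder S: a three-layer MLP with a skip connection added
    before the last ReLU non-linearity (one plausible reading)."""
    h = relu(W1 @ g)
    return W3 @ relu(W2 @ h + h)  # skip connection before the last ReLU

W1c, W2c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W1s, W2s, W3s = (rng.normal(size=(d, d)) for _ in range(3))
g = rng.normal(size=d)                       # stands in for a CLIP image embedding g_i
g_content = content_encoder(g, W1c, W2c)     # content embedding g_i^C
g_style = style_encoder(g, W1s, W2s, W3s)    # style embedding g_i^S
```

In training, the CLIP encoder stays frozen; only these two small heads (plus the projectors and the style classifier) receive gradients.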
builds a simple framework to decompose the latent space into content and style spaces with contrastive learning, exploring the use of generated images for representation learning.

Text-to-image generation. Text-to-image models aim to perform synthetic image generation conditioned on a given text input. Catalyzed by the datasets with massive text-image pairs that have emerged in recent years, many powerful text-to-image generation models have sprung up [18, 49, 50, 57, 58, 61, 76]. For instance, CogView [18] is trained on 30 million text-image pairs while DALL-E 2 [57] is trained on 650 million text-image pairs. One of their main challenges is achieving semantic coherence between guiding texts and generated images. This has been addressed by using pre-trained CLIP embeddings [56] to construct aligned text and image features in the latent space [45, 49, 88]. Another challenge is to obtain high-resolution synthetic images. GAN-based models [50, 73, 76, 88] have achieved good performance in improving the quality of generated images; however, they suffer from instability in training. Exploiting their superior training stability, works based on diffusion models [57, 58, 61] have recently become a popular tool for generating near-human quality images. Despite the rapid development of models for image generation, how to leverage the features of synthetic images remains an underexplored area of research. In this paper, we study the potential of generated images for enhancing representation learning.

3 PRELIMINARIES

3.1 Stable Diffusion

Diffusion models [34, 58] are generative methods trained in two stages: a forward process with a Markov chain that transforms input data into noise, and a reversed process that reconstructs data from the noise, obtaining high-quality performance on image generation. To reduce the training cost and accelerate the inference process, Stable Diffusion [58] trains the diffusion process in the latent space instead of the pixel space. Given a text prompt as input condition, the text encoder transforms the prompt into a text embedding. Then, by feeding the embedding into the UNet through a cross-attention mechanism, the reversed diffusion process generates an image embedding in the latent space. Finally, the image embedding is fed to the decoder to generate a synthetic image.

In this work, we define symbols as follows: given a text prompt $x = \{x^C, x^S\}$ as input, we can obtain the generated image $y$. The text $x^C$ represents the content description and $x^S$ denotes the style description, where $\{\cdot\}$ indicates a comma-separated string concatenation.

3.2 CLIP

CLIP [56] is a text-image matching model that aligns text and image embeddings in the same latent space. It shows high consistency between the visual concepts in the image and the semantic concepts in the corresponding text. The text encoder $\mathcal{E}_T$ and image encoder $\mathcal{E}_I$ of CLIP are trained with 440 million text-image pairs, showing outstanding performance on various text and image downstream tasks, such as zero-shot prediction [12, 87] and image manipulation
[41, 45, 49]. Given the text $x$ and an image $y$, the CLIP embeddings $f$ from text and $g$ from image, both in $\mathbb{R}^d$, can be computed as:

$f = \mathcal{E}_T(x)$,  (1)

$g = \mathcal{E}_I(y)$.  (2)

To exploit the multi-modal CLIP space, we employ the pre-trained CLIP image encoder $\mathcal{E}_I$ to obtain CLIP image embeddings as the prerequisite for the subsequent disentanglement model. Moreover, during the training stage, the CLIP text embedding of a prompt is applied to acquire the semantic concepts of the generated image.

4 GOYA

The goal of our task is to learn disentangled content and style embeddings of artworks in two different spaces. Unlike previous work, we leverage the knowledge of generated images rather than real paintings, unlocking generative models for representation analysis. Our idea is to borrow Stable Diffusion's capability to generate a wide variety of images, not only with diverse contents but also in various styles. Contrastive losses for content and style allow GOYA to learn the proximity of different artworks in the respective spaces with the consistency of generated images and text prompts.

Figure 2 shows an overview of GOYA. Given a mini-batch of $N$ prompts $\{x_i\}_{i=0}^{N}$, where $x_i = \{x_i^C, x_i^S\}$ with comma-connected content and style descriptions, we obtain diffusion generated images $y_i$ using Stable Diffusion. We then compute CLIP image embeddings $g_i$ by Eq. (2) and use a content encoder and a style encoder to obtain disentangled embeddings in two different spaces. As previous work has shown [27], content and style have different properties: content embeddings relate to higher layers of a deep neural network, while style embeddings respond to lower layers. We therefore design an asymmetric network architecture for extracting content and style, which is common in the art analysis domain [27, 44, 54, 59].

4.1 Content encoder

The content encoder $C$ maps the CLIP image embedding $g_i$ to the content embedding $g_i^C$ as:

$g_i^C = C(g_i)$,  (3)

where $C$ is a two-layer perceptron (MLP) with ReLU non-linearity. Following previous work [9], to make the content embedding $g_i^C$ highly linear, at training time we add a non-linear projector $h^C$ on top of the content encoder, which is a three-layer MLP with ReLU non-linearity.

4.2 Style encoder

The style encoder $S$ also maps the CLIP image embedding $g_i$, but to the style embedding $g_i^S$:

$g_i^S = S(g_i)$.  (4)

$S$ is a three-layer MLP with ReLU non-linearity. In particular, following [32], we apply a skip connection before the last ReLU non-linearity in $S$. Similar to the content encoder, a non-linear projector $h^S$ with the same structure as $h^C$ is added after $S$ to facilitate contrastive learning.

4.3 Content contrastive loss

Unlike prior work [44], which defines content similarity only when style-transferred images come from the same source, we use a broader definition of content similarity. We propose a soft-positive selection strategy that defines pairs of images with similar content according to their semantic similarity. That is, two images with similar semantic concepts are defined as a positive pair, whereas images without semantic similarity are negative pairs.

To quantify the semantic similarity between a pair of images, we exploit the CLIP latent space and compute text similarity between the associated texts. Given the content description $x_i^C$ of the image $y_i$, we assume that the CLIP text embedding $f_i^C = \mathcal{E}_T(x_i^C)$ can be a proxy for the content of $y_i$. Therefore, given a pair of diffusion images $(y_i, y_j)$ and a text similarity threshold $\epsilon_T$, they are a positive pair if $D_{ij}^T \le \epsilon_T$, where $D_{ij}^T$ is the text similarity obtained by the cosine distance between the CLIP text embeddings $f_i^C$ and $f_j^C$. The content contrastive loss is defined as:

$L_{ij}^C = \mathbb{1}[D_{ij}^T \le \epsilon_T]\,(1 - D_{ij}^C) + \mathbb{1}[D_{ij}^T > \epsilon_T]\,\max(0, D_{ij}^C - \epsilon_C)$,  (5)

where $\mathbb{1}[\cdot]$ is the indicator function that gives 1 when the condition is true and 0 otherwise, $D_{ij}^C$ is the cosine distance between $h^C(g_i^C)$ and $h^C(g_j^C)$, the content embeddings of the images after projection, and $\epsilon_C$ is the margin that constrains the minimum distance of negative pairs.

4.4 Style contrastive loss

The style contrastive loss is defined based on the style description $x^S$ given in the input prompt. If a pair of images shares the same style class, they are considered a positive pair, which means that their style embeddings should be close in the style space. Otherwise, they are a negative pair and should be pushed away from each other. Given $(y_i, y_j)$, the style contrastive loss is computed as:

$L_{ij}^S = \mathbb{1}[x_i^S = x_j^S]\,(1 - D_{ij}^S) + \mathbb{1}[x_i^S \ne x_j^S]\,\max(0, D_{ij}^S - \epsilon_S)$,  (6)

where $D_{ij}^S$ is the cosine distance between the style embeddings $h^S(g_i^S)$ and $h^S(g_j^S)$ after projection, and $\epsilon_S$ is the margin.

4.5 Style classification loss

To learn the general attributes of each style, we introduce a style classifier $R$ to predict the style description (given as $x_i^S$) from the embedding $g_i^S$ of image $y_i$. The prediction $w_i^S$ by the classifier is given by:

$w_i^S = R(g_i^S)$,  (7)

where $R$ is a single linear layer. For training, we use the softmax cross-entropy loss, denoted by $L_i^{SC}$. Note that the training of this classifier does not rely on human annotations, but on the synthetic prompts and the images generated by Stable Diffusion.

4.6 Total loss

In the training process, we compute the sum of the three losses. The overall loss function in a mini-batch is formulated as:

$L = \lambda_C \sum_{ij} L_{ij}^C + \lambda_S \sum_{ij} L_{ij}^S + \lambda_{SC} \sum_i L_i^{SC}$,  (8)

where $\lambda_C$, $\lambda_S$, and $\lambda_{SC}$ are parameters that control the contributions of the losses. We set $\lambda_C = \lambda_S = \lambda_{SC} = 1$. The summations over $i$ and
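The pairwise losses of Eqs. (5) and (6) can be transcribed literally in NumPy. In this sketch, $D$ is the cosine distance exactly as defined in the text, and the thresholds, margins, and toy vectors are illustrative values, not the paper's trained hyperparameters.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def content_loss(f_i, f_j, hc_i, hc_j, eps_T=0.3, eps_C=0.5):
    """Eq. (5): soft positives selected via the CLIP text distance D^T_ij."""
    d_T = cosine_distance(f_i, f_j)    # distance between content-description embeddings
    d_C = cosine_distance(hc_i, hc_j)  # distance between projected content embeddings
    if d_T <= eps_T:                   # positive pair
        return 1.0 - d_C
    return max(0.0, d_C - eps_C)       # negative pair; eps_C is the margin

def style_loss(style_i, style_j, hs_i, hs_j, eps_S=0.5):
    """Eq. (6): a pair is positive iff both prompts share the style description."""
    d_S = cosine_distance(hs_i, hs_j)
    if style_i == style_j:
        return 1.0 - d_S
    return max(0.0, d_S - eps_S)

f = np.array([1.0, 0.0])
g = np.array([0.0, 1.0])
l_pos = content_loss(f, f, f, g)                    # identical texts -> positive branch
l_neg = content_loss(f, g, f, g)                    # dissimilar texts -> negative branch
l_sty_pos = style_loss("Baroque", "Baroque", f, g)  # same style description
l_sty_neg = style_loss("Baroque", "Realism", f, g)  # different style descriptions
```

Within a mini-batch, these per-pair terms are summed and combined with the style classification loss as in Eq. (8).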
over five billion image-text pairs, which contain some paintings
Table 2: Genre and style movement accuracy on the WikiArt [74] dataset for different models.

style embeddings, which is specially designed for content and style disentanglement evaluation. Let $G^C$ and $G^S$ denote matrices containing all content and style embeddings in the WikiArt test set, i.e., $G^C = (g_1^C \cdots g_N^C)$ and $G^S = (g_1^S \cdots g_N^S)$. For an arbitrary pair $(i, j)$ of embeddings, the distances $p_{ij}^C$ and $p_{ij}^S$ can be computed by:

$p_{ij}^C = \|g_i^C - g_j^C\|, \quad p_{ij}^S = \|g_i^S - g_j^S\|$,  (9)

where $\|\cdot\|$ gives the Euclidean distance. Let $\bar{p}_{i\cdot}^C$, $\bar{p}_{\cdot j}^C$, and $\bar{p}^C$ denote the means over $j$, over $i$, and over both $i$ and $j$, respectively. With these means, the distances can be doubly centered by

$q_{ij}^C = p_{ij}^C - \bar{p}_{i\cdot}^C - \bar{p}_{\cdot j}^C + \bar{p}^C$,  (10)

and likewise for $q_{ij}^S$. DC between $G^C$ and $G^S$ is given by:

$\mathrm{DC}(G^C, G^S) = \dfrac{\mathrm{dCov}(G^C, G^S)}{\sqrt{\mathrm{dCov}(G^C, G^C)\,\mathrm{dCov}(G^S, G^S)}}$,  (11)

where

$\mathrm{dCov}(G^C, G^S) = \sqrt{\dfrac{1}{N}\sum_i \sum_j q_{ij}^C\, q_{ij}^S}$.  (12)

$\mathrm{dCov}(G^C, G^C)$ and $\mathrm{dCov}(G^S, G^S)$ are defined likewise. DC can be computed for arbitrary matrices with $N$ columns. DC lies in $[0, 1]$, and a lower value means $G^C$ and $G^S$ are less correlated. We aim for a DC close to 0.

Baselines. To compute the lower bound of DC on the WikiArt test dataset, we assign the one-hot vectors of the ground-truth genre and style movement labels as the content and style embeddings, representing the uppermost disentanglement when the labels are 100% correct. Besides the lower bound, we evaluate DC on ResNet50 [32], CLIP [56], and DINO [3]. For ResNet50, embeddings are extracted before the last fully-connected layer. For CLIP, we use the embedding from the CLIP image encoder $\mathcal{E}_I$. For pre-trained DINO, following Splice [78], content and style embeddings are extracted at the deepest layer from the self-similarity of keys in the attention module and the [CLS] token, respectively.

Results. Results are reported in Table 1. GOYA shows the best disentanglement with the lowest DC, 0.367, by a large margin with respect to the second-best disentangled embeddings, from fine-tuned CLIP. With only about 1/3 of the training parameters of ResNet50 and 1/20 of those of CLIP, GOYA outperforms embeddings directly trained on WikiArt's real paintings while consuming fewer resources. Also, GOYA achieves better disentanglement capability than DINO, with much more compact embeddings, e.g. a content embedding 1/300 the size. However, there is still a considerable gap between GOYA and the lower bound based on labels, showing that there is room for improvement.

5.2 Classification evaluation

To evaluate the disentangled embeddings for art classification, following the protocol in [11], we train two independent classifiers, each a single linear layer, on top of the content and style embeddings.

Baselines. We compare GOYA against three types of baselines: pre-trained models, models trained on the WikiArt dataset, and models trained on diffusion generated images. As pre-trained models, we use the Gram matrix [27, 28], ResNet50 [32], CLIP [56], and DINO [3]. For models trained on WikiArt, other than fine-tuning ResNet50 and CLIP, we also apply two popular contrastive learning methods: SimCLR [9] and SimSiam [11]. For models trained on generated images, ResNet50 and CLIP are fine-tuned with the style movements in the prompts, while SimCLR is trained without any annotations.

Results. Table 2 shows the classification results. Compared with the pre-trained baselines in the first four rows, GOYA is ahead of the Gram matrix, ResNet50, and DINO. Yet, it lags behind pre-trained CLIP by less than 1% in both genre and style movement accuracy. Compared with models trained on WikiArt, although not comparable to fine-tuned ResNet50 and CLIP on classification, GOYA
Figure 4: Retrieval results on the WikiArt test set based on cosine similarity. The similarity decreases from left to right. Copyrighted images are skipped.
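Retrieval as in Figure 4 amounts to ranking gallery embeddings by cosine similarity to a query embedding. A sketch with random stand-in vectors in place of GOYA's content or style embeddings:

```python
import numpy as np

def retrieve(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k gallery rows most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                   # cosine similarity to every gallery item
    return np.argsort(-sims)[:k]   # descending similarity, left to right as in Figure 4

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 512))               # stand-ins for painting embeddings
query = gallery[42] + 0.01 * rng.normal(size=512)   # near-duplicate of item 42
top = retrieve(query, gallery, k=5)                 # item 42 should rank first
```

Running the same ranking once in the content space and once in the style space yields the two kinds of neighbors shown in the figure.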
achieves a better capability of disentanglement, as shown in Table 1. Also, GOYA performs better on classification than the contrastive learning models SimCLR and SimSiam.

When training on diffusion generated images, GOYA achieves the best classification performance compared to the other models with different embedding sizes. After being fine-tuned on the style movements in the prompts, ResNet50 increases by 3% in style accuracy, showing the potential for analysis via synthetically generated images. However, CLIP decreases in both genre and style accuracy after fine-tuning on generated images. SimCLR shows a dramatic drop when trained on generated images compared to WikiArt. As SimCLR focuses more on learning the intricacies of the image itself rather than the relations between images, it learns the distribution of the generated images, leading to poor performance on WikiArt. While training on the same dataset, GOYA sustains better capability on classification tasks while achieving high disentanglement.
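The DC metric used throughout Section 5.1 (Eqs. (9)–(12)) can be computed directly in NumPy. The random matrices below are placeholders for the content and style embedding sets; note that the 1/N normalization in Eq. (12) cancels in the ratio of Eq. (11).

```python
import numpy as np

def distance_correlation(GC: np.ndarray, GS: np.ndarray) -> float:
    """Distance correlation (DC) between two embedding sets, Eqs. (9)-(12).

    GC, GS: arrays of shape (N, d), one embedding per row.
    """
    def doubly_centered(G):
        # Eq. (9): pairwise Euclidean distances p_ij
        p = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=-1)
        # Eq. (10): subtract row and column means, add back the grand mean
        return p - p.mean(axis=1, keepdims=True) - p.mean(axis=0, keepdims=True) + p.mean()

    qc, qs = doubly_centered(GC), doubly_centered(GS)
    n = GC.shape[0]
    # Eq. (12); abs() guards against tiny negative cross-sums in the sample estimate
    dcov = lambda a, b: np.sqrt(abs((a * b).sum()) / n)
    # Eq. (11)
    return float(dcov(qc, qs) / np.sqrt(dcov(qc, qc) * dcov(qs, qs)))

rng = np.random.default_rng(0)
G1 = rng.normal(size=(50, 16))            # stand-in for content embeddings G^C
G2 = rng.normal(size=(50, 16))            # stand-in for style embeddings G^S
dc_self = distance_correlation(G1, G1)    # identical sets -> DC = 1
dc_cross = distance_correlation(G1, G2)   # independent sets -> DC closer to 0
```

Lower DC between the content and style sets indicates better disentanglement, which is the quantity reported in Table 1.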
ACKNOWLEDGMENTS

This work is partly supported by JST FOREST Grant No. JPMJFR216O and JSPS KAKENHI Grant No. 23H00497 and No. 20K19822.

REFERENCES
[1] Zechen Bai, Yuta Nakashima, and Noa Garcia. 2021. Explain me the painting: Multi-topic knowledgeable art description generation. In ICCV. 5422–5432.
[2] Gustavo Carneiro, Nuno Pinho da Silva, Alessio Del Bue, and João Paulo Costeira. 2012. Artistic image classification: An analysis on the printart database. In ECCV. Springer, 143–157.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In ICCV. 9650–9660.
[4] Eva Cetinic. 2021. Towards generating and evaluating iconographic image captions of artworks. Journal of Imaging 7, 8 (2021), 123.
[5] Eva Cetinic, Tomislav Lipic, and Sonja Grgic. 2018. Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications 114 (2018), 107–118.
[6] Haibo Chen, Lei Zhao, Zhizhong Wang, Huiming Zhang, Zhiwen Zuo, Ailin Li, Wei Xing, and Dongming Lu. 2021. DualAST: Dual style-learning networks for artistic style transfer. In CVPR. 872–881.
[7] Haibo Chen, Lei Zhao, Huiming Zhang, Zhizhong Wang, Zhiwen Zuo, Ailin Li, Wei Xing, and Dongming Lu. 2021. Diverse image style transfer via invertible cross-space mapping. In ICCV. IEEE Computer Society, 14860–14869.
[8] Liyi Chen and Jufeng Yang. 2019. Recognizing the style of visual arts via adaptive cross-layer correlation. In ACM MM. 2459–2467.
[9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In ICML. PMLR, 1597–1607.
[10] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NeurIPS 29 (2016).
[11] Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In CVPR. 15750–15758.
[12] Ruizhe Cheng, Bichen Wu, Peizhao Zhang, Peter Vajda, and Joseph E Gonzalez. 2021. Data-efficient language-supervised zero-shot learning with self-distillation. In CVPR. 3119–3124.
[13] Wei-Ta Chu and Yi-Ling Wu. 2018. Image style classification based on learnt deep correlation features. Transactions on Multimedia 20, 9 (2018), 2491–2502.
[14] John Collomosse, Tu Bui, Michael J Wilber, Chen Fang, and Hailin Jin. 2017. Sketching with style: Visual search with sketches and aesthetic context. In ICCV. 2660–2668.
[15] Elliot J Crowley and Andrew Zisserman. 2014. The state of the art: Object retrieval in paintings using discriminative regions. In BMVC.
[16] Yingying Deng, Fan Tang, Weiming Dong, Wen Sun, Feiyue Huang, and Changsheng Xu. 2020. Arbitrary style transfer via multi-adaptation network. In ACM MM. 2719–2727.
[17] Emily L Denton et al. 2017. Unsupervised learning of disentangled representations from video. NeurIPS 30 (2017).
[18] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. 2021. CogView: Mastering text-to-image generation via transformers. NeurIPS 34 (2021), 19822–19835.
[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021).
[20] Cheikh Brahim El Vaigh, Noa Garcia, Benjamin Renoust, Chenhui Chu, Yuta Nakashima, and Hajime Nagahara. 2021. GCNBoost: Artwork classification by label propagation through a knowledge graph. In ICMR. 92–100.
[21] Ahmed Elgammal, Bingchen Liu, Diana Kim, Mohamed Elhoseiny, and Marian Mazzone. 2018. The shape of art history in the eyes of the machine. In AAAI, Vol. 32.
[22] Corneliu Florea, Răzvan Condorovici, Constantin Vertan, Raluca Butnaru, Laura Florea, and Ruxandra Vrânceanu. 2016. Pandora: Description of a painting database for art movement recognition with baselines and perspectives. In European Signal Processing Conference. 918–922.
[23] Aviv Gabbay and Yedid Hoshen. 2020. Improving style-content disentanglement in image-to-image translation. arXiv preprint arXiv:2007.04964 (2020).
[24] Noa Garcia, Benjamin Renoust, and Yuta Nakashima. 2019. Context-aware embeddings for automatic art analysis. In ICMR. 25–33.
[25] Noa Garcia, Benjamin Renoust, and Yuta Nakashima. 2020. ContextNet: representation and exploration for painting classification and retrieval in context. International Journal of Multimedia Information Retrieval 9, 1 (2020), 17–30.
[26] Noa Garcia and George Vogiatzis. 2018. How to read paintings: semantic art understanding with multi-modal retrieval. In ECCV Workshops.
[27] LA Gatys, AS Ecker, and M Bethge. 2015. A Neural Algorithm of Artistic Style. Nature Communications (2015).
[28] Leon Gatys, Alexander S Ecker, and Matthias Bethge. 2015. Texture synthesis using convolutional neural networks. NeurIPS 28 (2015).
[29] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In CVPR. 2414–2423.
[30] Nicolas Gonthier, Yann Gousseau, Said Ladjal, and Olivier Bonfait. 2018. Weakly Supervised Object Detection in Artworks. In ECCV Workshops. 692–709.
[31] Abel Gonzalez-Garcia, Joost Van De Weijer, and Yoshua Bengio. 2018. Image-to-image translation for cross-domain disentanglement. NeurIPS 31 (2018).
[32] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. CVPR (2016), 770–778.
[33] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.
[34] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. NeurIPS 33 (2020), 6840–6851.
[35] Xun Huang and Serge Belongie. 2017. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. In ICCV.
[36] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal unsupervised image-to-image translation. In ECCV. 172–189.
[37] Sergey Karayev, Matthew Trentacoste, Helen Han, Aseem Agarwala, Trevor Darrell, Aaron Hertzmann, and Holger Winnemoeller. 2014. Recognizing image style. In BMVC.
[38] Hadi Kazemi, Seyed Mehdi Iranmanesh, and Nasser Nasrabadi. 2019. Style and content disentanglement in generative adversarial networks. In WACV. IEEE, 848–856.
[39] Selina J. Khan and Nanne van Noord. 2021. Stylistic Multi-Task Analysis of Ukiyo-e Woodblock Prints. In BMVC. 1–5.
[40] Valentin Khrulkov, Leyla Mirvakhabova, Ivan Oseledets, and Artem Babenko. 2022. Disentangled representations from non-disentangled models. ICLR (2022).
[41] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. 2022. DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation. In CVPR. 2426–2435.
[42] Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
[43] Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In ICLR.
[44] Dmytro Kotovenko, Artsiom Sanakoyeu, Sabine Lang, and Bjorn Ommer. 2019. Content and style disentanglement for artistic style transfer. In ICCV. 4422–4431.
[45] Gihyun Kwon and Jong Chul Ye. 2022. CLIPstyler: Image style transfer with a single text condition. In CVPR. 18062–18071.
[46] Gihyun Kwon and Jong Chul Ye. 2023. Diffusion-based image translation using disentangled style and content representation. ICLR (2023).
[47] Sabine Lang and Bjorn Ommer. 2018. Reflecting on how artworks are processed and analyzed by computer vision. In ECCV Workshops.
[48] Adrian Lecoutre, Benjamin Negrevergne, and Florian Yger. 2017. Recognizing art style automatically in painting with deep learning. In Asian Conference on Machine Learning. PMLR, 327–342.
[49] Zhiheng Li, Martin Renqiang Min, Kai Li, and Chenliang Xu. 2022. StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis. In CVPR. 18197–18207.
[50] Wentong Liao, Kai Hu, Michael Ying Yang, and Bodo Rosenhahn. 2022. Text to image generation with semantic-spatial aware GAN. In CVPR. 18187–18196.
[51] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. 2022. Pseudo numerical methods for diffusion models on manifolds. ICLR (2022).
[52] Xiao Liu, Spyridon Thermos, Gabriele Valvano, Agisilaos Chartsias, Alison O'Neil, and Sotirios A Tsaftaris. 2021. Measuring the Biases and Effectiveness of Content-Style Disentanglement. BMVC (2021).
[53] Daiqian Ma, Feng Gao, Yan Bai, Yihang Lou, Shiqi Wang, Tiejun Huang, and Ling-Yu Duan. 2017. From part to whole: who is behind the painting?. In ACM MM. 1174–1182.
[54] Hui Mao, Ming Cheung, and James She. 2017. DeepArt: Learning joint representations of visual arts. In ACM MM. 1183–1191.
[55] Thomas Mensink and Jan Van Gemert. 2014. The rijksmuseum challenge: Museum-centered visual recognition. In ICMR. 451–454.
[56] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML. PMLR, 8748–8763.
[57] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
[58] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR. 10684–10695.
[59] Dan Ruta, Saeid Motiian, Baldo Faieta, Zhe Lin, Hailin Jin, Alex Filipkowski, Andrew Gilbert, and John Collomosse. 2021. ALADIN: all layer adaptive instance normalization for fine-grained style similarity. In ICCV. 11926–11935.
[60] Matthia Sabatelli, Mike Kestemont, Walter Daelemans, and Pierre Geurts. 2018. Deep transfer learning for art classification problems. In ECCV Workshops.
[61] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS 35 (2022), 36479–36494.
[62] Babak Saleh and Ahmed Elgammal. 2016. Large-scale Classification of Fine-Art Paintings: Learning The Right Metric on The Right Feature. International Journal for Digital Art History 2 (2016).
[63] Catherine Sandoval, Elena Pirogova, and Margaret Lech. 2019. Two-stage deep learning approach to the classification of fine-art paintings. IEEE Access 7 (2019), 41770–41781.
[64] Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. 2023. Fake it till you make it: Learning transferable representations from synthetic ImageNet clones. In CVPR.
[65] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In CVPR. 815–823.
[66] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell
[87] … by CLIP. In CVPR. 8552–8562.
[88] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. 2022. Towards Language-Free Training for Text-to-Image Generation. In CVPR. 17907–17917.
Wortsman, et al. 2022. LAION-5B: An open large-scale dataset for training next
generation image-text models. NeurIPS (2022).
[67] Huajie Shao, Shuochao Yao, Dachun Sun, Aston Zhang, Shengzhong Liu, Dongxin
Liu, Jun Wang, and Tarek Abdelzaher. 2020. ControlVAE: Controllable variational
autoencoder. In ICML. PMLR, 8655–8664.
[68] Xi Shen, Alexei A Efros, and Mathieu Aubry. 2019. Discovering visual patterns
in art collections with spatially-consistent feature learning. In CVPR. 9278–9287.
[69] Yichun Shi, Xiao Yang, Yangyue Wan, and Xiaohui Shen. 2022. SemanticStyleGAN:
Learning Compositional Generative Priors for Controllable Image Synthesis and
Editing. In CVPR. 11254–11264.
[70] Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks
for large-scale image recognition. In ICLR.
[71] Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss
objective. NeurIPS 29 (2016).
[72] Gjorgji Strezoski and Marcel Worring. 2018. OmniArt: a large-scale artistic
benchmark. TOMM 14, 4 (2018), 1–21.
[73] Hongchen Tan, Xiuping Liu, Meng Liu, Baocai Yin, and Xin Li. 2020. KT-GAN:
knowledge-transfer generative adversarial network for text-to-image synthesis.
Transactions on Image Processing 30 (2020), 1275–1290.
[74] Wei Ren Tan, Chee Seng Chan, Hernan Aguirre, and Kiyoshi Tanaka. 2019.
Improved ArtGAN for Conditional Synthesis of Natural Image and Artwork.
Transactions on Image Processing 28, 1 (2019), 394–409. https://doi.org/10.1109/
TIP.2018.2866698
[75] Wei Ren Tan, Chee Seng Chan, Hernán E Aguirre, and Kiyoshi Tanaka. 2016.
Ceci n’est pas une pipe: A deep convolutional network for fine-art paintings
classification. In ICIP. IEEE, 3703–3707.
[76] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng
Xu. 2022. DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis.
In CVPR. 16515–16525.
[77] Luan Tran, Xi Yin, and Xiaoming Liu. 2017. Disentangled representation learning
GAN for pose-invariant face recognition. In CVPR. 1415–1424.
[78] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. 2022. Splicing ViT
Features for Semantic Appearance Transfer. In CVPR. 10748–10757.
[79] Nanne Van Noord, Ella Hendriks, and Eric Postma. 2015. Toward discovery of
the artist’s style: Learning to recognize artists by their artworks. IEEE Signal
Processing Magazine 32, 4 (2015), 46–54.
[80] Michael J Wilber, Chen Fang, Hailin Jin, Aaron Hertzmann, John Collomosse, and
Serge Belongie. 2017. BAM! The behance artistic media dataset for recognition
beyond photography. In ICCV. 1202–1211.
[81] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu,
Zhe Lin, Yang Zhang, and Shiyu Chang. 2022. Uncovering the Disentanglement
Capability in Text-to-Image Diffusion Models. arXiv preprint arXiv:2212.08698
(2022).
[82] Xin Xie, Yi Li, Huaibo Huang, Haiyan Fu, Wanwan Wang, and Yanqing Guo. 2022.
Artistic Style Discovery With Independent Components. In CVPR. 19870–19879.
[83] Zipeng Xu, Tianwei Lin, Hao Tang, Fu Li, Dongliang He, Nicu Sebe, Radu Timofte,
Luc Van Gool, and Errui Ding. 2022. Predict, Prevent, and Evaluate: Disentangled
Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language
Model. In CVPR. 18229–18238.
[84] Yang You, Igor Gitman, and Boris Ginsburg. 2020. Large batch training of convo-
lutional networks. ICLR (2020).
[85] Nikolaos-Antonios Ypsilantis, Noa Garcia, Guangxing Han, Sarah Ibrahimi,
Nanne Van Noord, and Giorgos Tolias. 2021. The Met dataset: Instance-level
recognition for artworks. In NeurIPS Datasets and Benchmarks Track.
[86] Xiaoming Yu, Yuanqi Chen, Shan Liu, Thomas Li, and Ge Li. 2019. Multi-mapping
image-to-image translation via learning disentanglement. NeurIPS 32 (2019).
[87] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. 2022. PointCLIP: Point cloud understanding by CLIP. In CVPR. 8552–8562.
[88] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. 2022. Towards Language-Free Training for Text-to-Image Generation. In CVPR. 17907–17917.
Not Only Generative Art: Stable Diffusion
for Content-Style Disentanglement in Art Analysis ICMR ’23, June 12–15, 2023, Thessaloniki, Greece
A GOYA DETAILS
The details of the GOYA architecture are shown in Table 3.

Table 3: GOYA detailed architecture.

Components            Layer details
Content encoder C     Linear layer (512, 2048)
                      ReLU non-linearity
                      Linear layer (2048, 2048)
Style encoder S       Linear layer (512, 512)
                      ReLU non-linearity
                      Linear layer (512, 512)
                      ReLU non-linearity
                      Linear layer (512, 2048)
Projector ℎ𝐶/ℎ𝑆       Linear layer (2048, 2048)
                      ReLU non-linearity
                      Linear layer (2048, 64)
Style classifier R    Linear layer (2048, 27)

Confusion matrix. Figure 7 shows the confusion matrix of the genre classification evaluation on GOYA content space. The number in each cell is the proportion of images classified with the predicted label among all images with the true label; the darker the color, the larger the proportion. We can observe that images from several genres are misclassified as genre painting: genre paintings usually depict a broad range of everyday activities and thus overlap semantically with images from other genres, such as illustration and nude painting. In addition, due to the high similarity of the depicted scenes, 28% of the cityscape images are misclassified as landscape.

Figure 8 shows the confusion matrix of style movement classification in GOYA style space. The boundary between some movements is not very clear, as some are sub-movements representing different phases of one major movement, e.g. Synthetic Cubism within Cubism and Post Impressionism within Impressionism. Generative models may produce images close to the major movement even when the prompt specifies a sub-movement, leading GOYA to learn from inaccurate information. Thus, images from sub-movements are prone to be predicted as the corresponding major movement. For example, 82% of the images in Synthetic Cubism and 90% of the images in Analytical Cubism are classified as Cubism. Similarly, about one third of the images in Contemporary Realism and New Realism are incorrectly predicted as Realism.
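As a rough illustration (not the authors' released code), the layer stack in Table 3 can be sketched in NumPy. Random weights stand in for learned parameters, and the use of two separately parameterized projectors with the shared ℎ𝐶/ℎ𝑆 architecture is an assumption:

```python
import numpy as np

def linear(in_dim, out_dim, rng):
    """One fully connected layer; random weights stand in for learned ones."""
    return rng.standard_normal((in_dim, out_dim)) * 0.02, np.zeros(out_dim)

def mlp(x, layers):
    """Apply (W, b) pairs with a ReLU non-linearity between consecutive layers."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU non-linearity
    return x

rng = np.random.default_rng(0)
clip_embeddings = rng.standard_normal((4, 512))  # a batch of 512-d CLIP embeddings

# Layer sizes follow Table 3.
content_encoder = [linear(512, 2048, rng), linear(2048, 2048, rng)]
style_encoder = [linear(512, 512, rng), linear(512, 512, rng), linear(512, 2048, rng)]
projector_c = [linear(2048, 2048, rng), linear(2048, 64, rng)]  # h_C
projector_s = [linear(2048, 2048, rng), linear(2048, 64, rng)]  # h_S (same shape, assumed separate weights)

c = mlp(clip_embeddings, content_encoder)  # content embeddings, shape (4, 2048)
s = mlp(clip_embeddings, style_encoder)    # style embeddings, shape (4, 2048)
z_c = mlp(c, projector_c)                  # projected content, shape (4, 64)
z_s = mlp(s, projector_s)                  # projected style, shape (4, 64)
```

Both branches take the same frozen CLIP embedding as input and diverge only through their learned weights; the 64-d projections are where a contrastive-style training loss would be applied.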
Figure 7: Confusion matrix of genre classification evaluation on GOYA content space. Rows are the real genre and columns the predicted genre, in the same order:

                     abst. cit.  genre illus. land. nude  port. relig. sketch still
abstract painting    0.72  0.00  0.09  0.00  0.07  0.01  0.05  0.01  0.02  0.02
cityscape            0.01  0.52  0.15  0.00  0.28  0.00  0.01  0.01  0.01  0.00
genre painting       0.01  0.02  0.68  0.00  0.10  0.01  0.09  0.06  0.03  0.00
illustration         0.04  0.02  0.53  0.04  0.06  0.00  0.07  0.19  0.05  0.00
landscape            0.01  0.03  0.06  0.00  0.90  0.00  0.00  0.00  0.01  0.00
nude painting        0.01  0.00  0.28  0.00  0.04  0.46  0.10  0.04  0.05  0.01
portrait             0.00  0.00  0.13  0.00  0.00  0.00  0.83  0.02  0.01  0.00
religious painting   0.02  0.01  0.16  0.00  0.04  0.00  0.07  0.66  0.03  0.00
sketch and study     0.03  0.02  0.24  0.00  0.18  0.03  0.19  0.03  0.28  0.00
still life           0.05  0.01  0.19  0.00  0.05  0.00  0.05  0.01  0.02  0.62
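The row normalization used in Figures 7 and 8 (each cell is the fraction of images with a given true label that receive a given predicted label) can be sketched as follows; the label arrays are illustrative:

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """Row-normalized confusion matrix: rows are true labels, columns predictions."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(true_labels, pred_labels):
        counts[t, p] += 1  # counts[i, j]: images with true label i predicted as j
    # Divide each row by the total number of images with that true label.
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)

cm = confusion_matrix([0, 0, 0, 1, 1], [0, 0, 1, 1, 1], n_classes=2)
# Row 0: two thirds of class-0 images keep label 0, one third flips to label 1.
```

Each row sums to 1 (for any class that appears), so the diagonal entries are exactly the per-class recall values discussed above.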
[Figure 8 appears here: a 27×27 confusion matrix with the real style movement on the vertical axis and the predicted style movement on the horizontal axis.]
Figure 8: Confusion matrix of style movement classification evaluation on GOYA style space.
Figure 9: Retrieval results in GOYA content and style spaces and CLIP latent space based on cosine similarity. In each row, the
similarity decreases from left to right. Copyrighted images are skipped.
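Retrieval in these spaces ranks gallery embeddings by cosine similarity to a query embedding; a minimal sketch, with toy 2-D vectors standing in for GOYA or CLIP embeddings:

```python
import numpy as np

def retrieve(query, gallery, k=5):
    """Return gallery indices and scores, sorted by decreasing cosine similarity."""
    # Cosine similarity is the dot product of L2-normalized vectors.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)[:k]  # best matches first
    return order, sims[order]

gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, sims = retrieve(np.array([1.0, 0.1]), gallery, k=3)
# idx orders the gallery so that similarity decreases from left to right.
```

Running the same query through the content space, the style space, and raw CLIP space yields the three different rankings compared in Figures 9 and 10.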
Figure 10: Retrieval results in GOYA content and style spaces and CLIP latent space based on cosine similarity. In each row,
the similarity decreases from left to right. Copyrighted images are skipped.