Fashionpedia-Taste: A Dataset towards Explaining Human Fashion Taste

Mengyun Shi¹  Serge Belongie²  Claire Cardie¹

¹Cornell University  ²University of Copenhagen

[Figure 1: side-by-side annotation examples from Subject A and Subject B across Task 1 (like/dislike judgement), Task 2 (attribute selection), Task 3 (human attention), and Task 4 (textual explanation), annotated with the preference bias of visual judgements between the two subjects.]

Figure 1: (a) Explainability & Reasoning: our Fashionpedia-Taste dataset challenges computer vision systems not only to predict whether a subject likes a fashion image (Task 1), but also to provide explanations from the following three perspectives: localized attributes (Task 2), human attention (Task 3), and caption (Task 4); (b) Visual preference bias: even when two subjects like the same dress, they can like it for totally different reasons (Task 2). Similarly, even when two subjects like the same area of the dress, they can like that area for different reasons (Tasks 3 & 4).

Abstract

Existing fashion datasets do not consider the multiple factors that cause a consumer to like or dislike a fashion image. Even when two consumers like the same fashion image, they could like it for totally different reasons. In this paper, we study why a consumer likes a certain fashion image. Towards this goal, we introduce an interpretability dataset, Fashionpedia-Taste, consisting of rich annotations that explain why a subject likes or dislikes a fashion image from the following 3 perspectives: 1) localized attributes; 2) human attention; 3) caption. Furthermore, subjects are asked to provide their personal attributes and fashion preferences, such as personality and preferred fashion brands. Our dataset makes it possible for researchers to build computational models that fully understand and interpret consumers' fashion taste from different humanistic perspectives and modalities. Fashionpedia-Taste is available at the Fashionpedia project page: fashionpedia.github.io/home/

1. Introduction

Why do users click 'like' on a fashion image shown on social platforms? They use 'like' to express their preference for fashion. In the context of fashion, apparel represents the second-highest e-commerce shopping category. Therefore, fully understanding users' fashion taste could play an important role in fashion e-commerce.

To increase the chance that a consumer buys fashion products, a variety of recommendation systems have been developed, and they have delivered decent results. However, even if a recommendation system makes a correct recommendation for a consumer, does that mean the model really understands why this user likes this image? The answer is not clear to us.
Furthermore, even when two consumers like the same image, they could like it for totally different reasons, as illustrated in Fig. 1. To our knowledge, no previous study has explored this problem.

In this paper, we introduce an explainable fashion taste dataset, Fashionpedia-Taste, which asks subjects to provide rationale explanations for why they like or dislike a fashion image from 3 perspectives: 1) localized attributes; 2) human attention; 3) caption. Additionally, we collect extra personal preference information from the subjects, such as preferred dress length, personality, favorite brands, and the fine-grained categories they consider while buying a dress, because this information might also correlate with a user's preference for a fashion image.

The aim of this work is to enable future studies, encourage more investigation into interpretability research on fashion taste, and narrow the gap between human and machine understanding of images. The contributions of this work are: 1) an explainable fashion taste dataset consisting of 10,000 expressions (like or dislike of an image), 52,962 selected attributes, 20,000 human attention annotations, and 20,000 captions explaining subjects' fashion taste, over 1,500 unique images and 100 unique subjects; 2) a new task that requires models not only to predict whether a subject likes or dislikes a fashion image, but also to explain the reasons from 3 different perspectives (localized attributes, human attention, and caption).
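To make the proposed task concrete, the following is a minimal sketch of what a model's per-image output could look like. It is not part of the released dataset or any official baseline; the type name, field names, and the predict signature are hypothetical.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TasteExplanation:
    """Hypothetical output for one (subject, image) pair in the proposed task."""
    likes: bool                     # Task 1: does this subject like the image?
    liked_attributes: List[str]     # Task 2: localized attributes driving the "like"
    disliked_attributes: List[str]  # Task 2: localized attributes driving the "dislike"
    attention_mask: np.ndarray      # Task 3: H x W binary map of the explained region
    caption: str                    # Task 4: free-text rationale for the attended region

def predict(image: np.ndarray, subject_id: str) -> TasteExplanation:
    """A model for this task would implement something with this shape."""
    raise NotImplementedError  # placeholder: no reference implementation is provided
```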
Task                        Dataset name
Recognition                 iMat [3], Deep [15], Clothing [29], F-MNIST [27], F-550k [11], F-128 [21], F-14 [23], Hipster [14]
Detection                   ModaNet [32], Deep2 [2], Main [19]
Data Mining                 Vintage [9], Chic [28], Ups [5], Latent [6], Geo [16], F-144k [20], F-200K [4], Street [17], Runway [25]
Retrieval                   DARN [10], WTBI [13], Zappos [30], Deep [15], Capsule [7], POG [1], VIBE [8], IQ [26]
Attribute Localization      Fashionpedia [12]
Explainability & Reasoning  Fashionpedia-Taste

Table 1: Compared to previous work, Fashionpedia-Taste is the only study that investigates human fashion taste from different humanistic perspectives and modalities.

2. Related Work

Fashion dataset  Most previous fashion datasets focus on recognition, detection, data mining, or retrieval tasks (Table 1). In the domain of interaction between users and fashion images, Fashion IQ [26] provides human-generated captions that distinguish similar pairs of garment images through natural language feedback. ViBE [8] introduces a dataset for understanding users' fashion preferences based on their specific body shapes. Fashionpedia-Ads studies the correlation between ads and fashion taste among users. Unlike Fashion IQ, ViBE, and Fashionpedia-Ads, our dataset focuses on explaining and reasoning about users' fashion preferences based on both visual and textual signals. Beyond the fashion domain, the most relevant work is VCR [31], which requires models to answer correctly and then provide a rationale justifying the answer. Unlike VCR, our dataset requires models to complete more complicated multi-stage reasoning through different modalities (Tasks 1/2/3/4) about subjects' fashion taste, as illustrated in Fig. 1.

3. Fashion Taste Annotation from Subjects

Subjects and Annotation pipeline  We recruited 100 female subjects from a U.S. university. Our user annotation process consists of two parts: 1) collect subjects' basic information (Sec. 3.1); 2) collect subjects' fashion taste for given dress length categories (Sec. 3.2). All subjects are required to complete both parts.

3.1. Basic Information Survey

This survey collects the subjects': 1) basic information (gender, ethnicity, age); 2) personality; 3) basic fashion preference (favorite fashion brands, fashion attributes, and categories); 4) favorite dress length, which determines the dress lengths of the images assigned to each subject in the survey described in Sec. 3.2.

Personality  Similar to [18], we use a 10-item multiple-choice questionnaire to measure subjects' personality. We collect personality data because we want to see whether there is a correlation between subjects' personality and their fashion taste.
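As a concrete illustration, here is a minimal scoring sketch assuming a TIPI-style instrument: ten items on a 1-7 scale, two items per Big Five trait, with roughly half of the items reverse-keyed. The item-to-trait mapping and reverse-keyed items below are assumptions for illustration, not the actual instrument used in this work.

```python
# Hypothetical scoring sketch for a TIPI-style 10-item personality questionnaire.
# The item-to-trait assignment and reverse-keyed items are assumptions, not the
# actual instrument used for Fashionpedia-Taste.
TRAIT_ITEMS = {
    "extraversion":        [(1, False), (6, True)],
    "agreeableness":       [(2, True),  (7, False)],
    "conscientiousness":   [(3, False), (8, True)],
    "emotional_stability": [(4, True),  (9, False)],
    "openness":            [(5, False), (10, True)],
}

def score_personality(responses):
    """responses maps item number (1-10) to a rating on a 1-7 scale."""
    scores = {}
    for trait, items in TRAIT_ITEMS.items():
        vals = []
        for item, reverse in items:
            rating = responses[item]
            vals.append(8 - rating if reverse else rating)  # reverse-key where needed
        scores[trait] = sum(vals) / len(vals)               # average the two items per trait
    return scores

print(score_personality({i: 4 for i in range(1, 11)}))  # all-neutral example
```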
Basic fashion preference  We collect users' fashion preferences (fashion categories, fashion attributes, and brands) because we are curious whether this self-reported fashion preference aligns with the preference measured in Sec. 3.2.

3.2. Fashion Taste Survey

Task Design  In the fashion taste survey, the subjects are given 100 dress images based on the favourite dress lengths reported in their basic information survey (Sec. 3.1). They are required to state whether they like these dresses and provide the reasons why they like or dislike them. For each given dress image, they need to complete the following 4 tasks (a sketch of one resulting annotation record follows the list):

• Task 1: judge whether they like or dislike a given dress.

• Task 2-Attribute selection: explain which aspects make them like and dislike a given dress.
• Task 3-Human attention: indicate (by drawing polygons) the regions of the dress that make them like and dislike a given dress.

• Task 4-Textual explanation: explain why the regions they drew in Task 3 make them like and dislike a given dress.
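For illustration, the record below is a minimal sketch of how the four tasks could be stored for one (subject, image) pair. The field names and layout are hypothetical, not the released annotation format; the example attribute and caption values are taken from Fig. 1.

```python
# Hypothetical annotation record for one (subject, image) pair covering Tasks 1-4.
# Field names are illustrative only, not the official Fashionpedia-Taste schema.
record = {
    "subject_id": "S042",
    "image_id": "dress_01337",
    "task1_like": True,                                    # Task 1: like / dislike
    "task2_liked_attributes": ["printed", "floral"],       # Task 2: liked attributes
    "task2_disliked_attributes": ["sleeveless", "oval neck"],
    "task3_liked_region": [(120, 80), (260, 80), (260, 300), (120, 300)],   # polygon (x, y)
    "task3_disliked_region": [(100, 20), (300, 20), (300, 75), (100, 75)],
    "task4_liked_caption": "The printed pattern looks very gorgeous.",
    "task4_disliked_caption": "The sleeves don't match the rest of the dress.",
}
```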
Why we design these 4 tasks  Tasks 2, 3, and 4 allow the subjects to explain their fashion taste from three different perspectives and modalities. Task 2 lets the subjects explain their fashion taste from the perspective of fine-grained attributes. However, Task 2 might miss information that can only be explained visually. Task 3 addresses this issue and simulates human gaze capture, allowing the subjects to explain their fashion taste visually. Furthermore, to fully understand the areas that subjects draw in Task 3, we use Task 4 to let the subjects further explain, in text, why they drew those areas.

Imbalanced likes and dislikes  We expect a large data imbalance if we only ask the subjects to explain what makes them like a dress. To address this issue, we asked users to explain both the aspects that make them like and the aspects that make them dislike a given image for Tasks 2, 3, and 4.

4. Dataset Analysis

4.1. User annotation analysis

Task1-Like / dislike  We collect 4,766 likes and 5,234 dislikes over 1,500 unique images and 100 unique users. The frequency of likes and dislikes selected by each user is shown in Fig. 2, indicating that most users have a fairly balanced like/dislike ratio. A balanced like/dislike ratio could potentially help train less biased models.
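A per-user tally like the one plotted in Fig. 2 can be reproduced with a few lines. This is a minimal sketch that assumes the hypothetical record format shown in Sec. 3.2; it is not an official loader.

```python
from collections import Counter

def like_dislike_counts(records):
    """Count Task 1 likes and dislikes per subject from an iterable of records."""
    counts = {}  # subject_id -> Counter({"like": n, "dislike": m})
    for r in records:
        c = counts.setdefault(r["subject_id"], Counter())
        c["like" if r["task1_like"] else "dislike"] += 1
    return counts

def like_ratio(counts):
    """Fraction of 'like' expressions per subject, e.g. to check balance as in Fig. 2."""
    return {s: c["like"] / (c["like"] + c["dislike"]) for s, c in counts.items()}
```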
Task2-Attribute selection  Table 2 breaks down the frequency of liked and disliked attributes selected by each user by dress length. The average number of liked attributes (3.9397) is nearly three times the average number of disliked attributes (1.4565). This indicates that most users tend to select the attributes they like rather than those they dislike when explaining why they like/dislike a dress.

Table 3 groups the annotated attributes into their corresponding fine-grained categories (super-categories). 'Silhouette' contains the highest percentage of liked attributes (21.9%), indicating that most users' fashion preference is determined by the shape of a dress. In contrast, 'textile finishing and manufacturing technique' (Tex fini, manu-tech) contains the highest percentage of disliked attributes (23.9%), suggesting that a user could dislike a dress because of this category even if she likes its silhouette.

Length Type    Selection type   # Total   # Average
all lengths    Liked            38397     3.9397
all lengths    Disliked         14565     1.4565
mini           Liked            7991      3.89
mini           Disliked         3117      1.52
above          Liked            8001      3.63
above          Disliked         3076      1.39
below          Liked            6312      3.60
below          Disliked         2561      1.46
midi           Liked            8463      4.03
midi           Disliked         3076      1.46
maxi           Liked            7630      4.01
maxi           Disliked         2735      1.43

Table 2: Task 2-The number of liked and disliked attributes annotated by the users, broken down by dress length.

Super-category         # Liked attribute   Freq.    # Disliked attribute   Freq.
Dress Style            2385                6.2 %    1132                   7.7 %
Silhouette             8419                21.9 %   1786                   12.2 %
Textile Pattern        3194                8.3 %    2031                   13.9 %
Tex fini, manu-tech    5656                14.7 %   3493                   23.9 %
None-Textile Type      49                  0.1 %    49                     0.33 %
Neckline Style         6611                17.2 %   2283                   15.6 %
Collar Style           387                 1 %      101                    0.7 %
Lapel Style            28                  0.1 %    4                      0.02 %
Sleeve Style           2179                5.6 %    726                    4.9 %
Sleeve length          1917                4.9 %    592                    4.1 %
Pocket Style           236                 0.6 %    84                     0.5 %
Opening Type           691                 1.8 %    453                    3.1 %
Waistline              4146                10.7 %   849                    5.8 %
Dress length           2499                6.5 %    982                    6.7 %

Table 3: Task 2-The number of attributes annotated by the users for each fine-grained attribute super-category.

Fig. 3 (a) & (b) show the distribution of liked and disliked attributes annotated by the users. The results show that 'printed' and 'normal waist' are the main reasons that cause some users to like a dress. However, these attributes can also be the factor that causes users to dislike a dress, because a user could like one type of printed pattern but dislike another. To better explain a user's fashion taste, a model therefore needs to understand not only the user's preference for a specific attribute, but also the pixel-level pattern in the area where this attribute is located. For this purpose, we asked the users to conduct Task 3.
Task3-Human attention  Table 4 shows the number of human attention annotations provided by the users for each dress length. The annotated attention regions are evenly distributed across the different dress lengths.
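Since the attention annotations are drawn as polygons (Task 3), a common way to consume them is to rasterize each polygon into a binary mask. The sketch below does this with Pillow and NumPy; the polygon format (a list of (x, y) vertices) follows the hypothetical record shown earlier, not a documented spec.

```python
import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(polygon, height, width):
    """Rasterize one polygon, given as [(x, y), ...] vertices, into an H x W boolean mask."""
    canvas = Image.new("L", (width, height), 0)   # blank single-channel image
    ImageDraw.Draw(canvas).polygon([tuple(p) for p in polygon], outline=1, fill=1)
    return np.array(canvas, dtype=bool)

# Example: the liked-region polygon from the hypothetical record in Sec. 3.2
mask = polygon_to_mask([(120, 80), (260, 80), (260, 300), (120, 300)], height=400, width=300)
print(mask.sum(), "pixels inside the attended region")
```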
Task4-Textual explanation for Task 3  Task 4 contains 11.3 words on average per user and 5.6 words on average per sub-task (Task 4.1 for the liked explanation and Task 4.2 for the disliked explanation).
[Figure 2 chart: "Counts of likes and dislikes by user id", a grouped bar chart of Count of Likes and Count of Dislikes per user (y-axis: Counts; x-axis: User id, 1-100).]

Figure 2: Task 1-Count of likes and dislikes by the user.

(a) Distribution of liked attributes. (b) Distribution of disliked attributes. (c) Word count statistics: Number of 1, 2, 3-grams.

Figure 3: Task 2-Distribution of liked and disliked attributes annotated by the subjects & Task 4-Word count statistics.

Length Type    # Total gaze
all lengths    20000
mini           4100
above          4400
below          3500
midi           4200
maxi           3800

Table 4: Task 3-The number of human attention annotations provided by the users for each dress length.

Word count statistics  We use SGRank from Textacy [24] to calculate word frequencies. Fig. 3 (c) shows the most frequent 1-, 2-, and 3-grams in our dataset. Waistline-related words (high, normal, and empire waist) and pattern-related words (floral, geometric, abstract) are high-frequency words used by the users to explain the attention regions that they drew for Task 3.
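As a rough illustration of this step, the sketch below extracts SGRank key terms and raw n-grams with textacy and spaCy. It assumes a recent textacy release (where key-term extraction lives under textacy.extract.keyterms) and an installed en_core_web_sm model, and it is not the exact analysis script used for the paper; the example captions come from Fig. 1.

```python
from collections import Counter

import textacy
from textacy.extract import keyterms, ngrams

# One caption per Task 4 explanation; these two examples come from Fig. 1.
captions = [
    "The printed pattern looks very gorgeous.",
    "The fit & flare design makes my waistline slimmer.",
]

doc = textacy.make_spacy_doc(" ".join(captions), lang="en_core_web_sm")

# SGRank key terms as (term, score) pairs
print(keyterms.sgrank(doc, topn=10))

# Frequency of 1-, 2-, and 3-grams, as plotted in Fig. 3 (c)
for n in (1, 2, 3):
    counts = Counter(ng.text.lower() for ng in ngrams(doc, n, filter_stops=True))
    print(n, counts.most_common(5))
```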
Linguistic statistics  We use part-of-speech (POS) tagging from spaCy [22] to tag the nouns, proper nouns, and adjectives in the captions annotated in Task 4. Table 5 shows the most frequent unique words by POS. We find that the most frequent common nouns are associated with high-level descriptions of a dress, such as neckline, pattern, and waistline. In contrast, the most frequent proper nouns relate to detailed descriptions of a dress, such as applique, bead, and peter pan collar. This shows the linguistic diversity of our dataset.
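The POS tally behind Table 5 can be approximated as follows; this is a minimal sketch using spaCy's standard pipeline (en_core_web_sm is an assumption about the model, not something the paper specifies). The example caption is adapted from Fig. 1.

```python
from collections import Counter, defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a POS tagger works

def pos_frequencies(captions, pos_tags=("NOUN", "PROPN", "ADJ")):
    """Count word frequencies per coarse POS tag across Task 4 captions."""
    freqs = defaultdict(Counter)
    for doc in nlp.pipe(captions):
        for tok in doc:
            if tok.pos_ in pos_tags and tok.is_alpha and not tok.is_stop:
                freqs[tok.pos_][tok.lemma_.lower()] += 1
    return freqs

freqs = pos_frequencies(["The oval neckline is boring and not flattering."])
for pos, counter in freqs.items():
    print(pos, counter.most_common(10))
```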
5. Conclusion

In this work, we studied the problem of human taste in fashion product images. We introduced an explainable fashion taste dataset, Fashionpedia-Taste, with the purpose of understanding fashion taste from 3 perspectives: 1) localized attributes; 2) human attention; 3) caption. The aim of this work is to enable future studies, encourage more investigation into interpretability research on fashion taste, and narrow the gap between human and machine understanding of images.
POS Type   Words
Noun       dress, neck, length, neckline, design, shape, pattern, waist, color, sleeve, curve, waistline, fabric, skirt
Propn      maxi, applique, bead, kimono, peter pan, pleat, tiered, halter, dolman, slit, tent, stripe, cutout, cheetah
Adj        elegant, beautiful, nice, cute, high, straight, perfect, floral, loose, graceful, sexy, fit, attractive, charming

Table 5: Task 4-Linguistic statistics: most frequent unique words by POS.

References
[1] Wen Chen, Pipei Huang, Jiaming Xu, Xin Guo, Cheng Guo, Fei Sun, Chao Li, Andreas Pfadler, Huan Zhao, and Binqiang Zhao. POG: Personalized outfit generation for fashion recommendation at Alibaba iFashion. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2662-2670, 2019.
[2] Yuying Ge, Ruimao Zhang, Xiaogang Wang, Xiaoou Tang, and Ping Luo. DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In CVPR, 2019.
[3] Sheng Guo, Weilin Huang, Xiao Zhang, Prasanna Srikhanta, Yin Cui, Yuan Li, Hartwig Adam, Matthew R. Scott, and Serge Belongie. The iMaterialist fashion attribute dataset. In ICCV Workshops, 2019.
[4] Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S. Davis. Automatic spatially-aware fashion concept discovery. In ICCV, 2017.
[5] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, 2016.
[6] Wei-Lin Hsiao and Kristen Grauman. Learning the latent "look": Unsupervised discovery of a style-coherent embedding from fashion images. In ICCV, 2017.
[7] Wei-Lin Hsiao and Kristen Grauman. Creating capsule wardrobes from fashion images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7161-7170, 2018.
[8] Wei-Lin Hsiao and Kristen Grauman. ViBE: Dressing for diverse body shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11059-11069, 2020.
[9] Wei-Lin Hsiao and Kristen Grauman. From culture to clothing: Discovering the world events behind a century of fashion images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1066-1075, 2021.
[10] Junshi Huang, Rogerio Feris, Qiang Chen, and Shuicheng Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In ICCV, 2015.
[11] Naoto Inoue, Edgar Simo-Serra, Toshihiko Yamasaki, and Hiroshi Ishikawa. Multi-label fashion image classification with minimal human supervision. In ICCV, 2017.
[12] Menglin Jia, Mengyun Shi, Mikhail Sirotenko, Yin Cui, Claire Cardie, Bharath Hariharan, Hartwig Adam, and Serge Belongie. Fashionpedia: Ontology, segmentation, and an attribute localization dataset. In European Conference on Computer Vision, pages 316-332. Springer, 2020.
[13] M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it: Matching street clothing photos in online shops. In ICCV, 2015.
[14] M. Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, and Tamara L. Berg. Hipster wars: Discovering elements of fashion styles. In ECCV, 2014.
[15] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 2016.
[16] Utkarsh Mall, Kevin Matzen, Bharath Hariharan, Noah Snavely, and Kavita Bala. GeoStyle: Discovering fashion trends and events. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 411-420, 2019.
[17] Kevin Matzen, Kavita Bala, and Noah Snavely. StreetStyle: Exploring world-wide clothing styles from millions of photos. arXiv preprint arXiv:1706.01869, 2017.
[18] Nils Murrugarra-Llerena and Adriana Kovashka. Cross-modality personalization for retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6429-6438, 2019.
[19] Antonio Rubio, LongLong Yu, Edgar Simo-Serra, and Francesc Moreno-Noguer. Multi-modal embedding for main product detection in fashion. In ICCV, 2017.
[20] Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, and Raquel Urtasun. Neuroaesthetics in fashion: Modeling the perception of fashionability. In CVPR, 2015.
[21] Edgar Simo-Serra and Hiroshi Ishikawa. Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction. In CVPR, 2016.
[22] spaCy. Industrial-strength natural language processing. https://spacy.io/, 2022.
[23] Moeko Takagi, Edgar Simo-Serra, Satoshi Iizuka, and Hiroshi Ishikawa. What makes a style: Experimental analysis of fashion prediction. In ICCV, 2017.
[24] Textacy. Textacy: NLP, before and after spaCy. https://textacy.readthedocs.io/en/latest/, 2022.
[25] S. Vittayakorn, K. Yamaguchi, A. C. Berg, and T. L. Berg. Runway to realway: Visual analysis of fashion. In WACV, 2015.
[26] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307-11317, 2021.
[27] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[28] Kota Yamaguchi, Tamara L. Berg, and Luis E. Ortiz. Chic or social: Visual popularity analysis in online fashion networks. In ACM MM, 2014.
[29] Kota Yamaguchi, M. Hadi Kiapour, Luis E. Ortiz, and Tamara L. Berg. Parsing clothing in fashion photographs. In CVPR, 2012.
[30] Aron Yu and Kristen Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In ICCV, 2017.
[31] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6720-6731, 2019.
[32] Shuai Zheng, Fan Yang, M. Hadi Kiapour, and Robinson Piramuthu. ModaNet: A large-scale street fashion dataset with polygon annotations. In ACM MM, 2018.
