Diversity in Fashion Recommendation Using Semantic Parsing
[Fig. 3: Average accuracy vs. number of retrieved images (k = 1, 2, ..., 50) for (a) in-shop retrieval and (b) consumer-to-shop retrieval. Top-20 accuracies, (a) / (b): WTBI 0.506 / 0.063, DARN 0.67 / 0.111, FashionNet 0.764 / 0.188, Ours 0.784 / 0.253.]

Texture Encoding Layer: The features extracted at each time step of the LSTM network are passed through a texture encoding layer, which learns the texture of the attended region. The encoding layer incorporates dictionary learning, feature pooling, and classifier learning. It assumes k codewords and assigns descriptors to them using soft assignment, so that the function is differentiable and can be trained along with the rest of the model using stochastic gradient descent with backpropagation, as proposed by Zhang et al. [23].
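To make the soft-assignment idea concrete, below is a minimal PyTorch sketch of the residual-encoding scheme from Deep TEN [23]: learnable codewords, residuals to each codeword, and a differentiable softmax assignment, so the dictionary trains end-to-end with SGD. The layer and variable names, the per-codeword smoothing factors, and the final L2 normalization are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureEncoding(nn.Module):
    """Sketch of a Deep TEN-style encoding layer [23]: learnable
    codewords with differentiable soft assignment, trainable by SGD."""
    def __init__(self, feat_dim: int, num_codewords: int):
        super().__init__()
        # K learnable codewords (the "dictionary") and per-codeword
        # smoothing factors controlling the softness of the assignment.
        self.codewords = nn.Parameter(torch.randn(num_codewords, feat_dim))
        self.scale = nn.Parameter(torch.ones(num_codewords))

    def forward(self, x):
        # x: (B, N, D) descriptors from the attended region.
        # Residuals r_ik = x_i - c_k -> (B, N, K, D)
        r = x.unsqueeze(2) - self.codewords.unsqueeze(0).unsqueeze(0)
        # Soft-assignment weights a_ik = softmax_k(-s_k * ||r_ik||^2)
        a = F.softmax(-self.scale * r.pow(2).sum(-1), dim=2)  # (B, N, K)
        # Aggregate residuals per codeword, then pool over descriptors.
        e = (a.unsqueeze(-1) * r).sum(1)          # (B, K, D)
        return F.normalize(e.flatten(1), dim=1)   # fixed-length texture feature
```

Note that for a 7x7 feature map flattened to N = 49 descriptors of dimension D, the layer yields a fixed K*D-dimensional feature regardless of region size, which is what lets per-part features be compared directly.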
[Fig. 4 grid: six query images with their retrieved results, annotated with clothing-item sets such as "Hat, Dress", "Jeans, Dress", "Skirt, Boots", "Shoes", "Belt", "Dress, Hat, Boots", and "Boots, Coat".]
Fig. 4: Semantically similar results for some of the query images from the Fashion144k dataset [5] using our method. Retrieved images for six query images are shown from left to right, top to bottom. For each query image, retrieved images having similar clothing items are shown together.
in the texture encoding layer. For each image, we assign the top-6 highest-ranked labels and compare them with the ground-truth labels. As evaluation metrics for the item recognition task, we use the class-agnostic average precision (AP_all) and the mean of the per-class average precisions (mAP). Table 1 shows the comparison.
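To make the evaluation concrete, the sketch below computes both metrics with scikit-learn. Treating AP_all as average precision over all (image, label) scores pooled together, and mAP as the mean of per-class average precisions, is our reading of the metric names; it is not the paper's released code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(scores, labels, top_k=6):
    """Sketch of the two metrics, assuming `scores` (n_images, n_classes)
    are model outputs and `labels` is the binary ground-truth matrix."""
    # Top-6 prediction: keep the 6 highest-ranked labels per image.
    top = np.argsort(-scores, axis=1)[:, :top_k]
    preds = np.zeros_like(labels)
    np.put_along_axis(preds, top, 1, axis=1)

    # AP_all: class-agnostic AP over all (image, label) pairs pooled together.
    ap_all = average_precision_score(labels.ravel(), scores.ravel())
    # mAP: AP computed per class (skipping empty classes), averaged over classes.
    m_ap = np.mean([average_precision_score(labels[:, c], scores[:, c])
                    for c in range(labels.shape[1]) if labels[:, c].any()])
    return ap_all, m_ap, preds
```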
We also tested our model on item retrieval, for both in-shop and consumer-to-shop retrieval tasks on the DeepFashion dataset [30]. For this task, we extract texture features from the model trained on the Fashion144k dataset [5] and use Euclidean distance to measure the similarity between two image features. Fig. 3 shows the top-k retrieval accuracy of all the compared methods with k ranging from 1 to 50; we also list the top-20 retrieval accuracy after the name of each method. For in-shop retrieval, our model achieves the best performance (0.784), surpassing the state-of-the-art FashionNet [30] accuracy (0.764). For consumer-to-shop retrieval, the corresponding numbers are 0.253 for our model and 0.188 for FashionNet [30].
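A minimal sketch of this retrieval protocol follows. We assume a query counts as a hit when any of its top-k nearest gallery features (by Euclidean distance) shares the query's item identity; the variable names and the hit criterion are our assumptions, since the paper does not spell out its matching rule.

```python
import numpy as np

def top_k_accuracy(query_feats, gallery_feats, query_ids, gallery_ids, k=20):
    """Rank gallery images by Euclidean distance to each query's texture
    feature; count a hit if any top-k result shares the query's identity."""
    hits = 0
    for q, qid in zip(query_feats, query_ids):
        # Euclidean distance from this query to every gallery feature.
        dist = np.linalg.norm(gallery_feats - q, axis=1)
        top_k = np.argsort(dist)[:k]
        hits += int(np.any(gallery_ids[top_k] == qid))
    return hits / len(query_feats)
```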
Finally, we test our model on the similar-item retrieval task. Fig. 4 shows the retrieval results for various query images. The recommended results show substantial variation in terms of the similarity of various parts.

6. CONCLUSION

We have presented a novel attention-based deep neural network which learns to attend to different semantic parts using weakly labeled data. Unlike state-of-the-art techniques that use global similarity features, we propose extracting fine-grained per-part features using texture cues; we use Deep TEN in our experiments. We have conducted experiments on three widely accepted tasks in clothing recognition and retrieval. We show that the proposed model outperforms previous state-of-the-art models on the DeepFashion dataset for both consumer-to-shop and in-shop image retrieval tasks. Our experiment on the recommendation task shows that the proposed model successfully recommends a wide variety of objects matching different parts of the query image.

7. REFERENCES
[1] Junshi Huang, Rogerio Feris, Qiang Chen, and Shuicheng Yan, “Cross-domain image retrieval with a dual attribute-aware ranking network,” in ICCV, 2015, pp. 1062–1070.

[2] Andreas Veit et al., “Learning visual clothing style with heterogeneous dyadic co-occurrences,” in ICCV, 2015, pp. 4642–4650.

[3] Xi Wang, Zhenfeng Sun, Wenqiang Zhang, Yu Zhou, and Yu-Gang Jiang, “Matching user photos to online products with robust deep features,” in ICMR, 2016, pp. 7–14.

[4] Jiang Wang et al., “Learning fine-grained image similarity with deep ranking,” in CVPR, 2014.

[5] E. Simo-Serra and H. Ishikawa, “Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction,” in CVPR, 2016, pp. 298–307.

[6] Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, and Raquel Urtasun, “A high performance CRF model for clothes parsing,” in ACCV, 2014, pp. 64–81.

[7] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg, “Parsing clothing in fashion photographs,” in CVPR, 2012, pp. 3570–3577.

[8] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg, “Retrieving similar styles to parse clothing,” in TPAMI, 2015, vol. 37, pp. 1028–1040.

[9] Wei Yang, Ping Luo, and Liang Lin, “Clothing co-parsing by joint image segmentation and labeling,” in CVPR, 2014, pp. 3182–3189.

[10] Huizhong Chen, Andrew Gallagher, and Bernd Girod, “Describing clothing by semantic attributes,” in ECCV, 2012, pp. 609–623.

[11] Kota Yamaguchi et al., “Mix and match: Joint model for clothing and attribute recognition,” in BMVC, 2015, vol. 1, p. 4.

[12] Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, and Raquel Urtasun, “Neuroaesthetics in fashion: Modeling the perception of fashionability,” in CVPR, 2015, vol. 2, p. 6.

[13] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan, “Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set,” in CVPR, 2012, pp. 3330–3337.

[14] Zhonghao Wang, Yujun Gu, Ya Zhang, Jun Zhou, and Xiao Gu, “Clothing retrieval with visual attention model,” in VCIP, 2017.

[15] Shang-Fu Chen et al., “Order-free RNN with visual attention for multi-label classification,” in AAAI, 2018.

[16] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin, “Multi-label image recognition by recurrently discovering attentional regions,” in ICCV, 2017.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.

[18] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in CoRR, 2014, vol. abs/1409.1556.

[19] K. S. Loke, “Automatic recognition of clothes pattern and motifs empowering online fashion shopping,” in ICCE-TW, 2017, pp. 375–376.

[20] Si Liu et al., “Hi, magic closet, tell me what to wear!,” in Proceedings of the 20th ACM International Conference on Multimedia, 2012, pp. 619–628.

[21] Mircea Cimpoi, Subhransu Maji, and Andrea Vedaldi, “Deep filter banks for texture recognition and segmentation,” in CVPR, 2015, pp. 3828–3836.

[22] Vincent Andrearczyk and Paul F. Whelan, “Using filter banks in convolutional neural networks for texture classification,” in Pattern Recognition Letters, 2016, vol. 84, pp. 63–69.

[23] Hang Zhang, Jia Xue, and Kristin Dana, “Deep TEN: Texture encoding network,” in CVPR, 2017.

[24] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in CVPR, 2010, pp. 3304–3311.

[25] F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” in CVPR, 2007, pp. 1–8.

[26] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu, “Spatial transformer networks,” in NIPS, 2015, pp. 2017–2025.

[27] B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan, “Diversified visual attention networks for fine-grained object classification,” in IEEE Transactions on Multimedia, 2017.

[28] Naoto Inoue, Edgar Simo-Serra, Toshihiko Yamasaki, and Hiroshi Ishikawa, “Multi-label fashion image classification with minimal human supervision,” in ICCVW, 2017, pp. 2261–2267.

[29] Andreas Veit et al., “Learning from noisy large-scale datasets with minimal supervision,” in CVPR, 2017.

[30] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “DeepFashion: Powering robust clothes recognition and retrieval with rich annotations,” in CVPR, 2016, pp. 1096–1104.