Essentials of Data Analytics: J Component Report
J Component Report
By
Utkarsh 18BEC1230
Hemant Pamnani 18BEC1241
BACHELOR OF TECHNOLOGY
IN
Submitted to
Dr. R. KARTHIK
MAY 2021
School of Electronics Engineering
DECLARATION BY THE CANDIDATE
We hereby declare that the Report entitled “Apparel Categorisation using Image
Classification Model” submitted by us to VIT Chennai is a record of bonafide work
undertaken by us under the supervision of Dr. R. Karthik, Senior Assistant
Professor, SENSE, VIT Chennai.
Chennai
05/06/2021.
ACKNOWLEDGEMENT
We wish to express our sincere thanks and deep sense of gratitude to our project
guide, Dr. R. Karthik, School of Electronics Engineering for his consistent
encouragement and valuable guidance offered to us in a pleasant manner throughout
the course of the project work.
We express our thanks to our Head of The Department Dr. Vetrivelan. P (for
B.Tech-ECE) for his support throughout the course of this project.
We also take this opportunity to thank all the faculty of the School for their
support and their wisdom imparted to us throughout the courses till date.
We thank our parents, family, and friends for bearing with us throughout the
course of our project and for the opportunity they provided us in undergoing this
course in such a prestigious institution.
BONAFIDE CERTIFICATE
Certified that this project report entitled “Apparel Categorisation using Image
Classification Model” is a bonafide work of Utkarsh (18BEC1230) and Hemant
Pamnani (18BEC1241), who carried out the “J”-Project work under my supervision and
guidance for CSE 3506 ESSENTIALS OF DATA ANALYTICS.
Dr. R. Karthik
I Abstract
5 Chapter 5 - Conclusion
6 Code
7 Reference
I. ABSTRACT
The fashion industry is always evolving, and it is important to keep up with the latest
trends. For instance, it often happens that we like a particular type of apparel or
clothing while watching a TV show. In such situations, one wants to know where
to buy a similar piece. With our prototype, we aim to lay the groundwork for
such a system, which would provide a set of similar apparel items available
for online purchase. This requires us to first be able to classify clothing with high
precision. This task has its own challenges because clothes are very often
deformed, folded in odd ways and not stretched out to reveal their actual shape.
If the picture contains only clothing with no person wearing it, it can be hard even for
a human to classify it accurately. Also, the pictures are not always taken from the
front, and this variation in angle adds significant difficulty.
We believe that with a good amount of data with many such variations, CNNs will
do a good job at learning the features most indicative of their respective classes.
Depending on the particular application of fashion classification, the most relevant
problems to solve will differ. We will focus on optimizing fashion classification for
the purposes of annotating images and discovering the most similar fashion items to
a fashion item in a query image. Some of the challenges for this task include: classes
of clothing can share similar characteristics (e.g. Handbag, Travelling Bag and School
Bag are all given the broader label ‘Bag’ because of their similar characteristics),
clothing can easily deform because of its material, certain types of clothing can be
small, and clothing types can look very different depending on aspect ratio and angle.
1.1 PROBLEM STATEMENT
More than 25% of total E-Commerce revenue is attributed to apparel and accessories.
A major problem E-Commerce companies face is categorizing these apparel items from images alone,
especially when the categories provided by the brands are quite inconsistent.
This poses an interesting computer vision problem which has caught the attention of several
deep learning researchers.
Thus, the aim of our project is to categorise apparel into the correct categories (Shoe,
Top, Dress etc.) using the image of the apparel.
● To categorise apparel into correctly defined categories (Shoe, Top,
Dress, Bag, Pullover, Trouser etc.) using only the basic image of the apparel.
● To analyse the various supporting algorithms used in building the model, such as 2D
convolution layers, 2D max-pooling, etc.
● To draw inferences from the metrics obtained and from multiple cross-checks against the
actual category list (with further optimization, if required).
The authors presented the effect of mini-batch size on training models. The model gave
93.33% accuracy on the test set with a mini-batch size of 10. A CNN for document image
classification is presented by Kang et al. (2014). The authors used a CNN with rectified
linear units, trained with dropout, and obtained better results than
traditional methods. The Tobacco litigation and NIST tax-form datasets are used for
experimentation. For the Tobacco dataset, which has 10 classes of images, 80% of the data
is used for training and 20% for validation. A median accuracy of 65.37% is achieved with
100 samples. A median accuracy of 100% is achieved over 100 partitions of training
and test data on the NIST tax-form dataset. A CNN is used for house number digit
classification by Sermanet et al. (2012). They improved the traditional CNN with
multi-stage features and the Lp pooling method, obtaining an accuracy of 94.85% on the
SVHN dataset, a 45.2% error improvement.
2.1 REQUIREMENTS
Tools Required:
● RStudio (R-4.1.0)
Dataset Employed:
● Baseline-I
As a baseline model, we employed a convolutional neural network to predict the style of
the clothing items. A batch of 128 inputs of size 28×28 is passed through a convolution layer of
32 filters of size 5×5, followed by batch normalization, ReLU activation to introduce
non-linearity, and max-pooling over a 2×2 region.
The output is then passed through the same series of layers once more. Next,
the result is passed into a fully connected (dense) layer of size 128, followed by
another round of batch normalization and ReLU activation. The last set of layers consists
of another dense layer of size 64 and a softmax layer, which converts the outputs to
probability scores for each class.
While training, the model weights are updated by backpropagation so as to reduce the
softmax loss at each iteration. To achieve good performance on the training set, the model
was trained for 13 epochs. While testing, a forward pass is performed on an input image
and the label with the highest score is predicted to be its clothing category.
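The softmax loss and the prediction rule described above can be illustrated with a minimal sketch. It is written in Python rather than the R used for this project, and the three class scores are hypothetical values chosen for illustration, not outputs of our trained model.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(scores, true_class):
    """Cross-entropy of the softmax output against the true class index."""
    return -math.log(softmax(scores)[true_class])

# Hypothetical scores for three classes (assumption, for illustration only).
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
predicted = probs.index(max(probs))  # label with the highest score
loss = softmax_loss(scores, true_class=0)
```

The loss is small when the true class receives a high probability and grows as that probability shrinks, which is what backpropagation reduces at each iteration.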
In summary, the architecture is: [conv - batch norm - ReLU - 2×2 max pool] × 2 - affine -
batch norm - ReLU - affine - softmax. A test accuracy of 90.57% was obtained and set as our
baseline for future experiments.
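To make the layer sizes concrete, the following Python sketch traces the tensor shape through the baseline architecture. It assumes unpadded ("valid") convolutions with stride 1, a single grayscale input channel and 32 filters in both convolution blocks; none of these details are stated above, so the resulting numbers are illustrative rather than a record of our actual model.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Spatial output size of non-overlapping max pooling."""
    return size // window

def trace_baseline(side=28, filters=32, kernel=5, pool=2, blocks=2):
    """List (layer name, output shape) pairs for the baseline network."""
    shapes = [("input", (side, side, 1))]
    for i in range(1, blocks + 1):
        side = conv_out(side, kernel)       # assumed: no padding, stride 1
        shapes.append((f"conv{i}", (side, side, filters)))
        side = pool_out(side, pool)         # 2x2 max pooling halves each side
        shapes.append((f"pool{i}", (side, side, filters)))
    shapes.append(("flatten", (side * side * filters,)))
    shapes.append(("dense1", (128,)))
    shapes.append(("dense2", (64,)))
    return shapes

for name, shape in trace_baseline():
    print(name, shape)
```

Under these assumptions the 28×28 input shrinks to 24×24 after the first convolution, 12×12 after pooling, 8×8 after the second convolution and 4×4 after the second pooling, so 512 values enter the first dense layer.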
● Baseline-II
For the baseline case, we only considered five attributes: gender, necktie, skin exposure,
wear scarf and collar. As mentioned before, for many of the images certain attributes were
not available because of the ambiguity among the human workers classifying them.
Hence we have a varying number of attributes for each of the images. In order to account
for the varying number of attributes in different images, we trained a different neural
network for each of the chosen attributes. For each of the networks, we only chose the
images having the corresponding attribute for the training set. For all the cases, we trained
a simple 3-layer CNN.
The CNN used was: conv - ReLU - 2×2 max pool - affine - ReLU - affine - softmax. We obtained
a mean accuracy of 88.26% for the attributes considered in the baseline case. This is not
particularly impressive, given that the labels were binary and the data set was
unbalanced in most of the categories: the accuracy we obtained was not much higher than
the accuracy of simply guessing the majority class for most of the selected attributes.
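The comparison against guessing can be sketched as follows. This is a Python illustration (the project itself was done in R), and the 85/15 class split is a hypothetical example of an unbalanced binary attribute, not a figure from the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical unbalanced attribute: 85 images labelled 0, 15 labelled 1.
labels = [0] * 85 + [1] * 15
always_zero = [0] * len(labels)

print(majority_baseline_accuracy(labels))  # guess accuracy for this split
print(accuracy(always_zero, labels))       # same value, with no learning at all
```

A classifier on such an attribute is only informative to the extent that its accuracy exceeds this guess accuracy, which is why a high raw accuracy on unbalanced binary labels can be misleading.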
3. MODULE DESCRIPTION
3.1 UNDERLYING MODULES
1. After loading the training and testing datasets, we unflatten the data
entries, i.e. transform each one into a matrix of pixel values, since an indexable
matrix is easier to interpret.
2. Next, we create helper functions for matrix
rotation and for plotting sample images (16 cherry-picked) from the matrix, both
from the training and the testing sets.
3. We then build, compile and
test/evaluate the neural network model. Data visualisation is also an important part of
this module.
4. Finally, we verify the correctness of the designed model by
analysing the classification metrics and model feasibility.
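The unflattening in step 1 can be sketched as below. This is a Python illustration rather than our R code, and it assumes each entry is a flat vector of 28 × 28 = 784 pixel values stored row by row; the storage order is an assumption.

```python
def unflatten(pixels, width):
    """Reshape a flat list of pixel values into rows of `width` pixels."""
    if len(pixels) % width != 0:
        raise ValueError("pixel count is not a multiple of the row width")
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

# Placeholder data: 784 consecutive integers standing in for pixel intensities.
flat = list(range(784))
image = unflatten(flat, 28)

print(len(image), len(image[0]))  # number of rows, number of columns
print(image[1][0])                # first pixel of the second row
```

Once in matrix form, a pixel can be addressed by its row and column, which is what the rotation and plotting helpers in step 2 rely on.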
Given images of apparels, we try to classify them into different classes (Example: shirt,
blouse, undergarment etc.). For this particular problem, we assume that the input images
are already cropped and one image contains only one clothing item. This input image is
then passed through the CNN to generate one of the labels as the output. For this part, the
input is 16 cherry-picked images from the training set belonging to one of the following
categories:
1. Blouses
2. Cloak
3. Coat
4. Jacket
5. Long Dress
6. Polo shirt or Sport shirt
7. Robe
8. Shirt
9. Short Dress
10. Suit, Suit of clothes
11. Sweater
12. Jersey, T-shirt
13. Undergarment, Upper body
14. Uniform
15. Vest, waist-coat
Each image is first converted into a contiguous array which stores the individual pixel values.
For example, if we choose a resolution of 64 × 64, the input will have shape 64 × 64 × 3,
where the 3 refers to the RGB pixel values. Matrix rotation is required because, without it,
the plotted image does not appear in the correct orientation; this is a well-established
workaround among R users for pixel-image data. The order of the rows and columns is
reversed and the matrix is transposed so that the image is displayed with the correct
orientation.
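A minimal Python sketch of such a rotation helper is given below. It assumes the correction needed is a 90-degree clockwise turn, implemented by reversing the row order and then transposing; the exact direction used in our R code is not stated here, so treat that as an assumption.

```python
def rotate_clockwise(matrix):
    """Rotate a matrix 90 degrees clockwise: reverse the rows, then transpose."""
    reversed_rows = matrix[::-1]
    return [list(column) for column in zip(*reversed_rows)]

# Tiny 2x2 stand-in for an image matrix.
tiny = [[1, 2],
        [3, 4]]
print(rotate_clockwise(tiny))
```

Applying the helper four times returns the original matrix, which is a quick way to check that the reversal and transposition are composed correctly.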
The results seemed promising as the performance increased by 6% on the validation set
and around 5% on the test set. Next, we decided to perform hyper-parameter search on the
same network. We could only do trial and error because the training takes around 4-5 hours
and so grid search seemed infeasible.
While doing the hyper-parameter tuning we realized that the training accuracy never went
over 90%. Hence we tried a deeper network with less dropout and pooling, and the Adadelta
optimizer. Optimizers are algorithms used to adjust the attributes of a neural
network, such as its weights and learning rate, in order to reduce the loss.
The reason for choosing Adadelta is that it
improves on its predecessor, Adagrad, by introducing a history window
that restricts the past gradients taken into consideration during training to a fixed
number. In this way, we generally do not encounter the issue of a vanishing learning rate.
As we are employing a ConvNet with several deep layers, it is essential that
the learning rate is maintained at each training iteration rather than decaying towards
zero.
We thought Adadelta might be useful because it quickly gets close to convergence,
even if it might then take a long time to converge fully. This proved to be correct: we
observed a higher training accuracy and a slight improvement in validation and test
accuracy.
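The Adadelta update can be sketched on a one-dimensional problem. The Python code below follows the published update rule, in which the history window is approximated by exponentially decaying averages of squared gradients and squared steps; the decay rate 0.95, the constant 1e-6 and the toy objective f(x) = x² are assumptions chosen for illustration, not the settings of our model.

```python
import math

def adadelta_minimise(grad, x0, steps=100, rho=0.95, eps=1e-6):
    """Run Adadelta on a scalar parameter and return its final value."""
    x = x0
    avg_sq_grad = 0.0  # decaying average of squared gradients
    avg_sq_step = 0.0  # decaying average of squared parameter updates
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # The step is scaled by the ratio of the two running averages,
        # so no global learning rate has to be chosen.
        step = -math.sqrt(avg_sq_step + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step * step
        x += step
    return x

# Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.
x_final = adadelta_minimise(lambda x: 2.0 * x, x0=1.0)
print(x_final)
```

Because old squared gradients decay out of the average instead of accumulating indefinitely, the denominator does not grow without bound, which is the property that avoids a vanishing learning rate.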
In the Clothing attribute dataset, each image had at most 23 clothing attribute labels. For
some of the images, certain attributes were not available because of the ambiguity among
the human workers classifying them. The images in the Clothing attribute dataset were
also in JPEG format.
For the classification, we only considered attributes having binary classes (most of them
being a yes/no classification). We have 23 such attributes, which included 11 colors, 5
clothing patterns, scarf and collar identification. To account for the varying image sizes,
we decided to resize them to 100 × 100 images for the baseline test.
We used 2 convolution layers (with batch normalization, ReLU and dropout layers included),
1 pooling layer and a fully-connected layer. We used a sigmoid activation
function instead of the softmax activation used in the baseline, as we found it gave
much better accuracy. This is to be expected, as we were dealing with multi-label
attributes. We employed cross-entropy loss and the Adadelta optimizer to train the CNN.
The cross-entropy metric between the labels and the predictions is
computed using an inbuilt function. It is usually employed when there are
multiple label classes (2 or more). It takes 4 optional arguments:
name, datatype, from_logits (by default, the output is assumed to
encode a probability distribution) and label smoothing. We performed
hyper-parameter tuning to improve the best model.
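The difference between sigmoid and softmax outputs for multi-label attributes can be sketched in Python as follows. The logits and targets are hypothetical, and the smoothing rule shown (pulling 0/1 targets towards 0.5) is one common convention that we assume here; it is not a description of the inbuilt function's internals.

```python
import math

def sigmoid(z):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def smooth(labels, amount):
    """Pull hard 0/1 targets towards 0.5 by `amount` (label smoothing)."""
    return [y * (1.0 - amount) + 0.5 * amount for y in labels]

def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean cross-entropy over independent binary attributes."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(labels)

# Hypothetical raw scores and targets for three attributes of one image.
logits = [2.2, -1.5, 0.3]
targets = [1, 0, 1]
probs = [sigmoid(z) for z in logits]

loss = binary_cross_entropy(targets, probs)
smoothed_loss = binary_cross_entropy(smooth(targets, 0.1), probs)
print(probs, loss, smoothed_loss)
```

Unlike softmax outputs, the sigmoid probabilities need not sum to 1, so several attributes can be predicted as present for the same image.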
After building and compiling the CNN model comes the training and
evaluation phase. We used model %>% fit() to train the
model, holding out a set proportion of the data for validation. The main
arguments passed are x_train, y_train, the number of epochs for which the
model is trained, batch_size (set earlier) and verbose, which controls how much
progress output is printed during training.
Here, verbose is set to 1 so that progress is reported for every epoch. In addition,
the validation set is supplied by passing the testing sets as a list. Roughly
one-sixth of the dataset is used for validation of the built CNN. The model loss and the
corresponding accuracy for each successive epoch are plotted, and the values
can be seen on the console after each epoch completes.
We have also visually verified that the model attaches
the correct labels to the corresponding apparel images, using the which.max()
function. It returns the position of the first maximum value in a numeric vector.
Here, the numeric vector contains the predicted probabilities of
every label for an apparel image, so the position of the highest value in that vector
is taken as the label to be assigned to that image.
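The behaviour of which.max() can be mirrored in a few lines of Python. Note that this sketch returns a 0-based position whereas R's which.max() is 1-based, and that the label names and probabilities are hypothetical examples.

```python
def which_max(values):
    """Return the 0-based position of the first maximum in `values`."""
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:  # strict comparison keeps the first maximum on ties
            best = i
    return best

# Hypothetical label names and predicted probabilities for one image.
labels = ["Top", "Trouser", "Bag"]
probs = [0.1, 0.7, 0.2]

print(labels[which_max(probs)])
```

When two labels tie for the highest probability, the earlier one is chosen, matching the "first maximum" rule described above.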
4.1 RESULT
From the above figure we can observe that a broader label is given to some articles.
For instance, Travelling Bag, Hand Bag and Clutches are given the broader label “Bag”,
indicating scope for future study of finer-grained classification.
The model is run for 13 epochs; the values of loss and accuracy are recorded in
Table 1.
Fig.3: Accuracy and loss for 13 epochs (also recorded in Table 1 given below)
Epoch   Loss     Accuracy
7       0.2199   0.9185
8       0.2189   0.9198
9       0.2248   0.9173
10      0.2165   0.9178
11      0.2143   0.9183
12      0.2120   0.9213
13      0.2019   0.9298
Fig.4: Plot of Loss and Accuracy of deployed CNN Model on Fashion MNIST Dataset
Fig.5: 16 cherry-picked images from the test data, successfully classified.
The few misclassifications can be due to the presence of a large number of similar
variants of a particular type of apparel, insufficient image resolution, or
information lost through missing data entries or discarded during the flattening
step and by dropout, which is applied to inhibit over-fitting. These misclassifications
can be viewed as the effect of noisy non-linear activations within the CNN. They can be
reduced by adding more layers, the most important of which
is the max-pooling layer. It acts as a noise suppressant, helping to denoise
the non-linear activations and to reduce dimensionality properly for the final
classification.
4.3 FUTURE WORK
● Broader Labelling: A broader label is given to some apparel items, which leaves
scope for future work on more explicit labelling. For example,
Travelling Bag, Hand Bag and Clutches are given the broader label “Bag”, indicating
scope for future study of finer-grained classification.
● For this project, we built on top of the default AlexNet architecture using the
ImageNet pre-trained weights as a starting point for our work. As a next step, we
will try modified architectures more tailored to practical fashion classification
applications. For example, we are in the process of implementing spatial pyramid
pooling layers, which have been shown to speed up the R-CNN method by 24-102x.
● We can employ further data augmentation techniques, such as rotating the image and
flipping it along the horizontal axis, to further perturb the input image.
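The augmentation idea in the last point can be sketched in Python as follows. We assume here that the flip is a left-to-right mirror and use a 180-degree turn as the simplest rotation; both choices are illustrative and are not part of the current implementation.

```python
def flip_horizontal(image):
    """Mirror an image left to right by reversing every row."""
    return [row[::-1] for row in image]

def rotate_180(image):
    """Rotate an image by 180 degrees: reverse the row order and every row."""
    return [row[::-1] for row in image[::-1]]

def augment(image):
    """Return the original image together with its perturbed copies."""
    return [image, flip_horizontal(image), rotate_180(image)]

# Tiny 2x3 stand-in for an image matrix.
tiny = [[1, 2, 3],
        [4, 5, 6]]
for variant in augment(tiny):
    print(variant)
```

Each variant keeps the original label, so the training set grows without any additional labelling effort.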
5. CONCLUSION
In summary, an accuracy of almost 93% indicates that, out of the 10,000
validation data entries we used, almost 9,300 apparel images are correctly
classified by the model we employed.
6. CODE
7. REFERENCE
[1] Liu, S., Song, Z., Liu, G., Xu, C., Lu, H., Yan, S.: Street-to-Shop: Cross-Scenario
Clothing Retrieval via Parts Alignment and Auxiliary Set. CVPR (2012).
[2] Yang, M., Yu, K.: Real-time clothing recognition in surveillance videos. In: 18th
IEEE International Conference on Image Processing (2011).
[4] Chen, Huizhong, Andrew Gallagher, and Bernd Girod. “Describing clothing by
semantic attributes.” Computer Vision - ECCV 2012. Springer Berlin Heidelberg,
2012. 609-623.
[5] Liu, Si, et al. “Fashion parsing with weak color-category labels.” Multimedia,
IEEE Transactions on 16.1 (2014): 253-265.
[6] Lorenzo-Navarro, Javier, et al. “Evaluation of LBP and HOG Descriptors for
Clothing Attribute Description.” Video Analytics for Audience Measurement.
Springer International Publishing, 2014. 53-65.
[7] Hara, Kota, Vignesh Jagadeesh, and Robinson Piramuthu. “Fashion Apparel
Detection: The Role of Deep Convolutional Neural Network and Pose-dependent
Priors.” arXiv preprint arXiv:1411.5319 (2014).
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. CVPR, 2014.