
UNSEEN CLASS DISCOVERY IN OPEN-WORLD CLASSIFICATION
A PROJECT REPORT

for

DATA MINING TECHNIQUES (ITE2006)

in

B.Tech – Information Technology and Engineering

by

JAHNAVI MISHRA (18BIT0243)

HAMSITHA C (18BIT0369)

AADITRI MITTAL (18BIT0409)

Under the Guidance of

Dr. SENTHILKUMAR N C

Associate Professor, SITE

School of Information Technology and Engineering

November, 2020

DECLARATION BY THE CANDIDATE

I/We hereby declare that the project report entitled “UNSEEN CLASS
DISCOVERY IN OPEN-WORLD CLASSIFICATION” submitted by me/us
to Vellore Institute of Technology University, Vellore in partial fulfillment of the
requirement for the award of the course Data Mining Techniques (ITE2006) is
a record of bonafide project work carried out by us under the guidance of Dr.
Senthilkumar N C. I/We further declare that the work reported in this project
has not been submitted and will not be submitted, either in part or in full, for the
award of any other course.

Hamsitha Challagundla

Place : Vellore Signature

Date : 30/10/2020

School of Information Technology & Engineering [SITE]

CERTIFICATE

This is to certify that the project report entitled “UNSEEN CLASS DISCOVERY IN
OPEN-WORLD CLASSIFICATION” submitted by HAMSITHA C(18BIT0369) to
Vellore Institute of Technology University, Vellore in partial fulfillment of the requirement
for the award of the course Data Mining Techniques (ITE2006) is a record of bonafide
work carried out by them under my guidance.

Dr. Senthilkumar N C
GUIDE
Associate Professor, SITE

UNSEEN CLASS DISCOVERY IN OPEN-WORLD
CLASSIFICATION
Abstract

This project addresses open-world classification, as opposed to the usual closed-world
setting. In general, a classifier assigns test samples to seen classes, that is,
classes that appeared in training. In the open-world setting the classifier must
additionally reject samples from unseen classes that did not appear in training.
The project therefore focuses on discovering the hidden classes among the rejected
samples (unseen class discovery). A joint open classification model is to be built
with a sub-model that classifies whether a pair of examples belongs to the same class
or to different classes. This sub-model can serve as a distance function for
clustering to discover the hidden classes of the rejected examples.

Keywords – Open world classification, Unseen class discovery, handwritten digits (MNIST),
Image data set.

I. INTRODUCTION

This project works in the open-world classification setting rather than the closed-world one.

The classifier not only needs to classify test examples into seen classes that have appeared in
training but also reject examples from unseen or novel classes that have not appeared in
training. The rejected examples may belong to several hidden classes, which calls for the
discovery of unseen classes in open-world classification (unseen class discovery). A joint
open classification model is to be built with a sub-model for classifying whether a pair of
examples belongs to the same class or to different classes. This sub-model can serve as a
distance function for clustering to discover the hidden classes of the rejected examples.
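
The idea of using a pairwise "same class or not" model as a distance function can be
illustrated with a small sketch. Everything here is an assumption made for illustration:
`same_class_probability` is a hypothetical stand-in for the learned sub-model, the
single-linkage merge rule and the threshold of 0.5 are arbitrary choices, and the data is
synthetic. The point is only that, once a pairwise score exists, the number of hidden
classes can be discovered by merging clusters until the closest pair is too far apart.

```python
import numpy as np


def same_class_probability(a, b):
    """Hypothetical stand-in for the learned pairwise sub-model.

    Returns a value in (0, 1]; here it simply decays with Euclidean distance.
    """
    return float(np.exp(-np.linalg.norm(a - b)))


def cluster_rejected(samples, threshold=0.5):
    """Single-linkage agglomerative clustering on distance = 1 - P(same class).

    Clusters are merged until the closest pair of clusters is further apart
    than `threshold`, so the number of clusters is discovered, not given.
    """
    n = len(samples)
    dist = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - same_class_probability(samples[i], samples[j])
            dist[i, j] = dist[j, i] = d

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are judged to be different classes
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


rng = np.random.default_rng(0)
# Two well-separated synthetic groups standing in for rejected examples.
rejected = np.vstack([
    rng.normal(0.0, 0.05, size=(5, 2)),
    rng.normal(5.0, 0.05, size=(5, 2)),
])
found = cluster_rejected(rejected, threshold=0.5)
print(len(found), "hidden classes found")
```

In the project itself the pairwise score would come from the trained sub-model rather than
from a fixed formula.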

II. BACKGROUND

In open-world object recognition, new objects may appear constantly, and a classifier built
from examples of old objects may incorrectly classify a new object as one of the old objects.
This situation calls for open-world classification, or simply open classification, which can
classify examples from the seen classes (those that appeared in training) and also
detect/reject examples from unseen classes (those that did not appear in training).

III. LITERATURE SURVEY

1) PAPER TITLE - Deep Image Category Discovery using a Transferred Similarity Function.

AUTHOR NAME - Yen-Chang Hsu, Zhaoyang Lv, Zsolt Kira.

JOURNAL NAME WITH MONTH AND YEAR - Georgia Institute of Technology, Georgia Tech
Research Institute, arXiv (December, 2016).

SURVEY – This paper utilizes prior knowledge to facilitate the discovery of image
categories. To understand what performance is required from the Similarity
Prediction Network (SPN) for the clustering network to perform well, the authors simulate
different similarity prediction performances by using ground truth labels. The
precision and recall of the similar and dissimilar pairs were controlled to explore the
clustering performance under different density and number of clusters.

2) PAPER TITLE - Online Open World Recognition.

AUTHOR NAME - Rocco De Rosa, Thomas Mensink, and Barbara Caputo.

JOURNAL NAME WITH MONTH AND YEAR - arxiv – (April, 2016).

SURVEY – This paper addresses the open world recognition problem and proposes three
extensions to its current formulation: online metric learning, incremental updating of
thresholds for novelty detection, and local learning through nearest ball classification.
The authors argue that, to properly model the dynamics of the challenging open world
recognition scenario, it is necessary to learn the metric and the novelty threshold online
as new instances and new classes arrive, rather than estimating them from an initial,
closed set of classes as done so far.

3) PAPER TITLE - Joint Unsupervised Learning of Deep Representations and Image Clusters.

AUTHOR NAME - Jianwei Yang, Devi Parikh, Dhruv Batra.

JOURNAL NAME WITH MONTH AND YEAR - IEEE CVPR (April, 2016).

SURVEY – This paper proposes an approach to jointly learn deep representations and image
clusters. The authors combine agglomerative clustering with CNNs and formulate them as a
recurrent process. A partially unrolling strategy divides the timesteps into multiple
periods. In each period, clusters are merged step by step during the forward pass and the
representation is learned in the backward pass, both guided by a single weighted
triplet-loss function.

4) PAPER TITLE - Neural Network-Based Clustering Using Pairwise Constraints.

AUTHOR NAME - Yen-Chang Hsu, Zsolt Kira.

JOURNAL NAME WITH MONTH AND YEAR - School of Electrical and Computer Engineering,
Georgia Institute of Technology, Atlanta, GA 30332, USA, arXiv (April, 2016).

SURVEY – The paper introduces a framework and constructs a cost function for training
neural networks to learn the underlying features while, at the same time, clustering the
data in the resulting feature space. The approach supports both supervised training with
full pairwise constraints and semi-supervised training with only partial constraints. It
achieves equal or slightly better results than when explicit labels are available and a
classification criterion is used. The approach is easy to implement for existing
classification networks and can be implemented efficiently.

5) PAPER TITLE - Machine-Learning Algorithms to Automate Morphological and Functional
Assessments in 2D Echocardiography.

AUTHOR NAME - Sukrit Narula, Khader Shameer, Alaa Mabrouk Salem Omar,
Joel T. Dudley and Partho P. Sengupta.

JOURNAL NAME WITH MONTH AND YEAR - Journal of the American
College of Cardiology (November, 2016).

SURVEY – This study investigated the diagnostic value of a machine-learning
framework that incorporates speckle tracking echocardiographic data for automated
discrimination of hypertrophic cardiomyopathy (HCM) from physiological
hypertrophy seen in athletes (ATH). The results suggested that machine-learning
algorithms can assist in the discrimination of physiological versus pathological
patterns of hypertrophic remodeling. As limitations, the comparisons and models were
driven by a relatively small sample size, and the model was only assessed using limited
space and time points in 2D echocardiographic images.

6) PAPER TITLE - Towards Open Set Deep Networks.

AUTHOR NAME - Abhijit Bendale and Terrance E. Boult.

JOURNAL NAME WITH MONTH AND YEAR - IEEE Conference on
Computer Vision and Pattern Recognition (June, 2016).

SURVEY – The paper presents a methodology to adapt deep networks for open set
recognition by introducing a new model layer, OpenMax, which estimates the probability
of an input being from an unknown class. A key element of estimating the unknown
probability is adapting Meta-Recognition concepts to the activation patterns in the
penultimate layer of the network. OpenMax allows rejection of "fooling" and unrelated
open set images presented to the system, and it greatly reduces the number of obvious
errors made by a deep network.

7) PAPER TITLE - Adam: A Method for Stochastic Optimization.

AUTHOR NAME - Kingma, D. P., & Ba, J.

JOURNAL NAME WITH MONTH AND YEAR - arXiv preprint
arXiv:1412.6980 (January, 2017).

SURVEY – The paper introduces Adam, an algorithm for first-order gradient-based
optimization of stochastic objective functions, based on adaptive estimates of
lower-order moments. The method is straightforward to implement, is computationally
efficient, has low memory requirements, is invariant to diagonal rescaling of the
gradients, and is well suited for problems that are large in terms of data and/or
parameters. The method is also appropriate for non-stationary objectives and problems
with very noisy and/or sparse gradients.

8) PAPER TITLE - Towards Open World Recognition.

AUTHOR NAME - Abhijit Bendale and Terrance Boult.

JOURNAL NAME WITH MONTH AND YEAR - IEEE Conference on
Computer Vision and Pattern Recognition (January, 2016).

SURVEY – The paper presents a protocol for the evaluation of open world recognition
systems. It introduces the Nearest Non-Outlier (NNO) algorithm, which evolves the model
efficiently, adding object categories incrementally while detecting outliers and managing
open space risk. Experiments on the ImageNet dataset with 1.2M+ images validate the
effectiveness of the method on large-scale visual recognition tasks. NNO consistently
yields superior results on open world recognition.

9) PAPER TITLE - iCaRL: Incremental Classifier and Representation Learning.

AUTHOR NAME - Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and
Christoph H. Lampert.

JOURNAL NAME WITH MONTH AND YEAR – arXiv preprint
arXiv:1611.07725 (April, 2017).

SURVEY – This work introduces iCaRL, a training strategy that allows learning in a
class-incremental way: only the training data for a small number of classes has to be
present at the same time, and new classes can be added progressively. Experiments on
CIFAR-100 and ImageNet ILSVRC 2012 data show that iCaRL can learn many classes
incrementally over a long period of time where other strategies quickly fail. iCaRL
clearly outperforms the other methods in this setting. Fixing the data representation
after having trained on the first batch (fixed repr.) performs worse than
distillation-based LwF.MC, except for ILSVRC-full. The same network, trained with all
data available, achieves 68.6% multi-class accuracy.

10) PAPER TITLE- Analysis of Data Mining Techniques for Heart Disease
Prediction.
AUTHOR NAME - Marjia Sultana, Afrin Haider and Mohammad Shorif Uddin.

JOURNAL NAME WITH MONTH AND YEAR - IEEE (September, 2016).

SURVEY – This paper addresses the prediction of heart disease from a set of input
attributes. The authors performed an experiment using diverse data mining techniques to
find a more accurate technique for heart disease prediction. Two data sets (collected
and UCI standard) are used separately for each data mining technique. The findings show
that, for heart disease prediction, the Bayes Net and SMO classifiers perform best among
the five investigated classifiers: Bayes Net, SMO, KStar, MLP and J48.

11) PAPER TITLE - Improving Fishing Pattern Detection from Satellite AIS
Using Data Mining and Machine Learning.

AUTHOR NAME - de Souza, E. N., Boerder, K., Matwin, S., & Worm, B.

JOURNAL NAME WITH MONTH AND YEAR - PloS one, (July, 2016)

SURVEY – The approaches developed in this paper detect and identify potential fishing
behavior for three main gear types with high accuracy and spatial resolution on a global
scale. Data were obtained under research licence from exactEarth. The data contained an
initial sample of 83 vessels operating in the North Pacific, corresponding to 217,860
data points collected in July 2013, and were used for algorithm development and training.
Using the First-Passage Time (FPT) and Utilization Distribution (UD) algorithms, the
authors obtained a median accuracy of 83%. One disadvantage of the method is the track
segmentation algorithm, which requires defining the number of segments beforehand.

12) PAPER TITLE - Semi-supervised Vocabulary-informed Learning.

AUTHOR NAME - Y Fu, L Sigal.

JOURNAL NAME WITH MONTH AND YEAR - IEEE (April, 2016).

SURVEY – This paper introduces the problem of semi-supervised vocabulary-informed
learning, which utilizes an open set semantic vocabulary to help train better
classifiers for observed and unobserved classes in supervised learning, ZSL and
open set image recognition settings. The authors formulated semi-supervised
vocabulary-informed learning in the maximum margin framework. More focus is needed on
studying the suitability of active learning, where an interaction module has to balance
the number of true label requests and the performance at any query rate.

13) PAPER TITLE - Deep image category discovery using a transferred similarity function.

AUTHOR NAME - YC Hsu, Z Lv, Z Kira.

JOURNAL NAME WITH MONTH AND YEAR - arXiv preprint (December, 2016).

SURVEY – To understand what performance is required from the Similarity Prediction
Network (SPN) for the clustering network to perform well, the authors simulated different
similarity prediction performances by using ground truth labels. The precision and recall
of the similar and dissimilar pairs were controlled to explore the clustering performance
under different density and number of clusters. As a proof of concept, the method, using
predicted similarities trained on Omniglot, shows 99% accuracy, which significantly
outperforms clustering-based approaches. The proposed clustering method is robust to
noisy similarity predictions due to the densely sampled similarities used for clustering.
In future work, the authors plan to extend the clustering method to datasets with a
larger number of categories.

14) PAPER TITLE - Incremental classifier and representation learning.

AUTHOR NAME - Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl,
Christoph H. Lampert.

JOURNAL NAME WITH MONTH AND YEAR - University of
Oxford/IST Austria, IST Austria, IEEE CVPR (November, 2016).

SURVEY – This paper introduces iCaRL, a strategy for class-incremental learning that
learns classifiers and a feature representation simultaneously. Switching off different
components of iCaRL (hybrid1, hybrid2, hybrid3 in the paper) leads to results mostly in
between iCaRL and LwF.MC, showing that all of iCaRL’s new components contribute to its
performance. Training with all data available achieves 68.6% multi-class accuracy.
iCaRL’s performance is still lower than what systems achieve when trained in a batch
setting, i.e. with all training examples of all classes available at the same time.

15) PAPER TITLE - Prediction of Breast Cancer Recurrence using data mining
techniques.
AUTHOR NAME - Uma Ojha, Savita Goel.

JOURNAL NAME WITH MONTH AND YEAR - IEEE (January, 2017).

SURVEY – In this paper, the WPBC dataset is used to find an efficient predictor
algorithm for the recurring or non-recurring nature of the disease. This might help
oncologists to differentiate a good prognosis (non-recurrent) from a bad one
(recurrent) and treat patients more effectively. The identified critical
parameters should be verified on a larger medical dataset to predict the
recurrence of the disease in future. C5.0 and SVM have shown 81% accuracy. EM
was found to be the most promising clustering algorithm, with an accuracy of 68%.

16) PAPER TITLE - EMNIST: Extending MNIST to handwritten letters.

AUTHOR NAME - Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre
van Schaik.

JOURNAL NAME WITH MONTH AND YEAR - arXiv preprint
arXiv:1702.05373 (February, 2017).

SURVEY – This paper introduces a variant of the full NIST dataset, called Extended
MNIST (EMNIST), which follows the same conversion paradigm used to create the MNIST
dataset. The result is a dataset that constitutes a more challenging classification task
involving letters and digits, and one that shares the same image structure and parameters
as the original MNIST task, allowing for direct compatibility with all existing
classifiers and systems. Benchmark results using an online ELM algorithm are presented,
along with a validation of the conversion process through the comparison of the
classification results on NIST digits and the MNIST digits.

17) PAPER TITLE - Analyzing student performance using educational data mining.

AUTHOR NAME - Asif, R., Merceron, A., Ali, S. A., & Haider, N. G.

JOURNAL NAME WITH MONTH AND YEAR - Elsevier (October, 2017).

SURVEY – This paper uses data mining methods to study the performance of
undergraduate students. The dataset consisted of 11,873 undergraduate students from
the Debre Markos University, Ethiopia, AIT datasets and CTU dataset. Using decision
trees and Bayesian network algorithms, an accuracy of 71.15% was obtained. To better
visualize how students evolve over the years, the authors made a heat map by sorting
the students of the cohort cluster-wise according to the results of progression. One
demerit is that student performance is predicted using marks only; no socio-economic
data was used. The second demerit is that the investigation of how students' academic
performance progresses over time is not described.

18) PAPER TITLE - DOC: Deep Open Classification of Text Documents

AUTHOR NAME - Lei Shu, Hu Xu, Bing Liu

JOURNAL NAME WITH MONTH AND YEAR - arXiv (September,2017)

SURVEY – This paper proposed a novel deep learning based method, called
DOC, for open text classification. Using the same text datasets and experiment
settings, the authors showed that DOC performs dramatically better than the
state-of-the-art methods from both the text and image classification domains. They also
believe that DOC is applicable to images. Demerits include very poor model performance
during testing.

19) PAPER TITLE - EMNIST: an extension of MNIST to handwritten letters

AUTHOR NAME - G Cohen, S Afshar, J Tapson

JOURNAL NAME WITH MONTH AND YEAR - IEEE (February, 2017)

SURVEY – This paper introduced the EMNIST datasets, a suite of six datasets
intended to provide a more challenging alternative to the MNIST dataset. The
characters of the NIST Special Database 19 were converted to a format that
matches that of the MNIST dataset, making it immediately compatible with any
network capable of working with the original MNIST dataset. Benchmark results
are provided which are consistent with previously published work using the NIST
Special Database 19.

20) PAPER TITLE - Learning and the Unknown: Surveying Steps Toward
Open World Recognition.

AUTHOR NAME - T. E. Boult, S. Cruz, A. R. Dhamija, M. Gunther, J. Henrydoss,
W. J. Scheirer.

JOURNAL NAME WITH MONTH AND YEAR - University of Colorado Colorado Springs,
Colorado Springs, AAAI (April, 2019).

SURVEY – While machine learning and deep networks are providing great
advances, anomaly or outlier detection alone is not sufficient because the proper
handling of unknowns involves balancing the risk of the unknown with the risk
from recognition errors. With almost all deep networks, unknowns map into the
same space as knowns and are not easily rejected as outliers. Open set problems
are often challenging because they must balance maintaining accuracy on the core
problem with handling the unknown unknowns.

21) PAPER TITLE- Multi-stage Deep Classifier Cascades for Open World
Recognition.
AUTHOR NAME - Xiaojie Guo, Amir Alipour-Fanid, Lingfei Wu, Hemant
Purohit, Xiang Chen, Kai Zeng, Liang Zhao.

JOURNAL NAME WITH MONTH AND YEAR - ACM International
Conference on Information and Knowledge Management (August, 2019).

SURVEY – In this paper, two real-world experiments comparing the proposed method (MDCC)
with several existing methods indicate its effectiveness and efficiency. Tests analysing
the model's threshold parameter indicated that stable performance was achieved when the
threshold was within a reasonable range. MDCC outperformed both LOD and S-Forest by
29.1% and 23.2%, respectively, in EN-Accuracy, and by 19% and 37.6%, respectively, in
F-score. A demerit is that it is difficult to utilize the relationships among instances
from the same class.

22) PAPER TITLE - Semantic Web Mining for Content-Based Online Shopping Recommender
System.

AUTHOR NAME - Ibukun Tolulope Afolabi, Opeyemi Samuel Makinde, Olufunke Oyejoke
Oladipupo.

JOURNAL NAME WITH MONTH AND YEAR - International Journal of
Intelligent Information Technologies (October, 2019).

SURVEY – The methodology is based on two major phases. The first phase is the
semantic preprocessing of textual data using the combination of a developed
ontology and an existing ontology. The second phase uses the Naïve Bayes
algorithm to make the recommendations. The output of the system is evaluated
using precision, recall and F-measure. The results showed that the semantic
preprocessing improved the recommendation accuracy of the recommender system by 5.2%
over the existing approach. A few demerits were observed in the research: the method
implicitly assumes that all the attributes are mutually independent, and it suffers from
data scarcity.

23) PAPER TITLE - Short-Term Load Forecasting Using XGBoost Based on Similar Days.

AUTHOR NAME - Liao, X., Cao, N., Li, M., & Kang, X.

JOURNAL NAME WITH MONTH AND YEAR - International Conference on
Intelligent Transportation, Big Data & Smart City (ICITBS), IEEE (March, 2019).

SURVEY – The paper observes that power load data is increasing exponentially and that
traditional forecasting models struggle to achieve high efficiency when dealing with
massive data. An XGBoost load forecasting model based on similar days is proposed. Real
charge data and temperature data from a certain area are used for prediction. The
XGBoost model uses a second-order Taylor expansion of the loss function and adds a
regularization term to control complexity and over-fitting.

24) PAPER TITLE - Supplier Prediction in Fashion Industry Using Data Mining
Technology

AUTHOR NAME - N Harale, S Thomassey, X Zeng

JOURNAL NAME WITH MONTH AND YEAR - IEEE (January,2019)

SURVEY – This paper deals with how to predict suppliers from previous customer order
data using data mining methods. The authors applied four machine learning classification
models, and the findings suggest that these models can be employed for decision-making
concerning supplier prediction. The study can contribute to the development of an
automated decision support system that is reliable and efficient for supplier
prediction. The previous customer order data were labelled, i.e. customer order
attributes were tagged with the supplier who fulfilled each order; therefore, a
supervised machine learning approach was used. On the training dataset, the KNN, RF and
NN classifiers give 100% accuracy, while the Naïve Bayes model gives 86% accuracy.

25) PAPER TITLE - Social image mining for fashion analysis and forecasting

AUTHOR NAME - Seema Wazarkar, Bettahally N. Keshavamurthy

JOURNAL NAME WITH MONTH AND YEAR - Elsevier (July, 2019)

SURVEY – This paper deals with research work involved in image mining for the
analysis of fashion trends and forecasting using fashion-related images collected
from the social media. A new Soft clustering technique, (Rough Mean Shift
Clustering) is proposed for grouping the social fashion images. This technique is
robust against uncertainty found in given images. Compared to existing Soft
Clustering Techniques, the technique used here is found to be more efficient.

26) PAPER TITLE - Short-Term Load Forecasting Using XGBoost Based on Similar Days.

AUTHOR NAME - Liao, X., Cao, N., Li, M., & Kang, X.

JOURNAL NAME WITH MONTH AND YEAR - International Conference on
Intelligent Transportation, Big Data & Smart City (ICITBS), IEEE (March, 2019).

SURVEY – The paper observes that power load data is increasing exponentially and that
traditional forecasting models struggle to achieve high efficiency when dealing with
massive data. An XGBoost load forecasting model based on similar days is proposed. Real
charge data and temperature data from a certain area are used for prediction. The
XGBoost model uses a second-order Taylor expansion of the loss function and adds a
regularization term to control complexity and over-fitting.

27) PAPER TITLE - Classification of Intrusion Detection Using Data Mining Techniques.

AUTHOR NAME - Sahani, R., Rout, C., Badajena, J. C., Jena, A. K., & Das, H.

JOURNAL NAME WITH YEAR - Computing, Analytics and Networking,
Springer, Singapore (April, 2018).

SURVEY – This paper focuses on identifying normal and attack data present in the
network with the help of the C4.5 algorithm, one of the decision tree techniques. It
also helps the IDS to identify the type of attacks present in a network. Experimentation
is performed on the KDD-99 dataset, which has a number of features and different classes
of normal and attack type data. The accuracy of the C4.5 algorithm is 99.79% and the
error rate is 0.51%. Testing the performance of this model over a large dataset and
handling the classification of unknown attacks in an automatic control system are still
challenges.

28) PAPER TITLE - Aspect-Level Sentiment Analysis on E-Commerce Data.

AUTHOR NAME - Satuluri Vanaja, Meena Belwal.

JOURNAL NAME WITH YEAR – International Conference on Inventive
Research in Computing Applications (ICIRCA) (January, 2018).

SURVEY – This work uses e-commerce web page traffic logs of Amazon customer review
data and focuses on finding aspect terms in each review, identifying the parts of
speech, and applying classification algorithms to score the positivity, negativity and
neutrality of each review. Decisions are based on the wording of the customer reviews.
Naïve Bayes and SVM are used to classify the results. The accuracy of Naïve Bayes is
90.423 and that of SVM is 83.423. The approach can only classify data by sentiment, and
it needs further modification to give more accurate results. In future, the authors can
focus on implementing the C4.5 algorithm and comparing it with Naïve Bayes.

29) PAPER TITLE - Sentiment Analysis of Movie Review Using Text Mining.

AUTHOR NAME – Thorat, A. M., & Priya, R. V.

JOURNAL NAME WITH MONTH AND YEAR - International Journal of
Pure and Applied Mathematics (July, 2018).

SURVEY – In this paper, feature-level sentiment analysis is done on movie
reviews. The proposed framework checks all the movie reviews and separates the
positive and negative words based on the recurrence of words in each review, using a
positive and negative word lexicon. The best accuracy found was 84.13% using the Naïve
Bayes classification algorithm. As demerits, big data should be brought into the picture
and noise should be dealt with efficiently.

30) PAPER TITLE - Unseen Class Discovery In Open-World Classification

AUTHOR NAME - L Shu, H Xu, B Liu

JOURNAL NAME WITH MONTH AND YEAR - arxiv – (January,2018)

SURVEY – This paper first proposed a joint model for performing open image
classification and for predicting whether two images are from the same class using
only the seen class training data. This latter capability enables the transfer of class
distance/similarity knowledge from seen classes to unseen classes, which is
exploited by a hierarchical clustering algorithm for discovering the number of
hidden classes in the rejected examples. Demerits include high complexity, long
training time.

IV. DATASET DESCRIPTION & SAMPLE DATA

MNIST:
MNIST is a well-known database of handwritten digits (10 classes), which has a training
set of 60,000 examples and a test set of 10,000 examples. We use 6 classes as the set of
seen classes and the remaining 4 classes as unseen classes (all randomly chosen). We use the
same validation classes from the following EMNIST dataset as the validation dataset for
MNIST.
Columns => pixels 1*1 to 28*28 = 784.
Rows => number of digits considered (training data) = 60000.
Rows => number of digits considered (testing data) = 10000.

DATASET LINK: https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
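
The seen/unseen split described above can be sketched as follows. This is only an
illustration under stated assumptions: the arrays are small synthetic stand-ins with the
same layout as MNIST (in practice they would come from `keras.datasets.mnist.load_data()`),
and the random seed and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the MNIST arrays (real training shape: 60000 x 28 x 28).
images = rng.integers(0, 256, size=(1000, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=1000)

# Randomly choose 6 of the 10 digit classes as "seen"; the other 4 are "unseen".
all_classes = np.arange(10)
seen_classes = np.sort(rng.choice(all_classes, size=6, replace=False))
unseen_classes = np.setdiff1d(all_classes, seen_classes)

# The classifier is trained only on examples whose label is a seen class.
seen_mask = np.isin(labels, seen_classes)
train_images, train_labels = images[seen_mask], labels[seen_mask]

# Examples of unseen classes are held out; at test time they should be rejected.
held_out_images = images[~seen_mask]

print("seen:", seen_classes, "unseen:", unseen_classes)
print(train_images.shape, held_out_images.shape)
```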

V. PROPOSED ALGORITHM WITH FLOWCHART

FLOWCHART

• Importing the MNIST dataset:

For this project, we will be using the MNIST dataset. It is available through keras, a deep
learning library. We are importing both train and test dataset.

• Pre-processing of the MNIST dataset

1. Images are stored as 2-dimensional NumPy arrays. However, the clustering and
classification algorithms provided by scikit-learn ingest 1-dimensional arrays; as a
result, we need to reshape each image. MNIST contains images that are 28 by 28 pixels,
so each image has a length of 784 once it is reshaped into a 1-dimensional array.

2. Scaling and normalization are also applied to the training and testing data values.

Then we will be performing two techniques on the pre-processed MNIST dataset.

▪ For image recognition we will be using Convolutional Neural Networks (CNNs).

1. We import the layers Dense, Conv2D, MaxPool2D, Flatten and Dropout and the
Sequential model from the tensorflow.keras deep learning library. Then the CNN model
is built.
2. The model is then evaluated: we obtain metrics such as loss, acc, val_loss and
val_acc and plot them. We also obtain the classification report.
3. Finally, prediction is performed and we get output in the form of digits. We also
identify unseen classes to justify the concept of using a CNN as an open
classification network.

At the end, the final output will be in the form of digits.
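
A minimal sketch of these steps is given here, under clearly stated assumptions. The layer
sizes, the dropout rate and the optimizer are illustrative guesses, not the configuration
used in this project. The rejection rule (treat a sample as unseen when the highest class
probability falls below a threshold) is one simple way to let a CNN reject unseen classes;
the threshold of 0.9 and the helper name `predict_with_rejection` are hypothetical.

```python
import numpy as np


def build_cnn(num_seen_classes=6):
    """Small CNN built from the layer types named above (sizes are assumptions).

    TensorFlow is imported inside the function so that the rejection helper
    below can be used on its own.
    """
    from tensorflow.keras import Input, Sequential
    from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPool2D

    model = Sequential([
        Input(shape=(28, 28, 1)),
        Conv2D(32, (3, 3), activation="relu"),
        MaxPool2D((2, 2)),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPool2D((2, 2)),
        Flatten(),
        Dropout(0.5),
        Dense(128, activation="relu"),
        Dense(num_seen_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def predict_with_rejection(probabilities, threshold=0.9, reject_label=-1):
    """Return the predicted class, or `reject_label` when confidence is low."""
    probabilities = np.asarray(probabilities)
    predicted = probabilities.argmax(axis=1)
    confident = probabilities.max(axis=1) >= threshold
    return np.where(confident, predicted, reject_label)


# Example class probabilities, as a trained model's predict() would return them.
probs = np.array([[0.97, 0.01, 0.02],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.93, 0.02]])
decisions = predict_with_rejection(probs, threshold=0.9)
print(decisions)  # the second sample is rejected as "unseen"
```

With a trained model, `probs` would be replaced by `model.predict(test_images)`; the
rejected samples are the ones passed on to the clustering stage.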

▪ Another approach is for digit recognition or imagery analysis.

1. First we perform PCA using scikit-learn for dimensionality reduction, as the dataset
is too big. The PCA components are used as initial centroids for K-means.
2. K-means clustering is applied; clusters are formed on the basis of similarity in
the features of the digits.
3. The model is then evaluated, cluster labels are formed, the cluster centroids are
visualized and image recognition is done.
4. The images we get carry the inferred labels.

At the end, the final output will be in the form of images.
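
K-means returns arbitrary cluster indices, so "forming cluster labels" needs a mapping from
each cluster to a digit. A common way to do this, assumed here for illustration, is a
majority vote over the true labels inside each cluster. The function name and the toy
arrays below are hypothetical.

```python
import numpy as np


def infer_cluster_labels(cluster_ids, true_labels, n_clusters):
    """Map each cluster index to the most frequent true digit inside it."""
    mapping = {}
    for c in range(n_clusters):
        members = true_labels[cluster_ids == c]
        if members.size:  # skip clusters that received no samples
            mapping[c] = int(np.bincount(members).argmax())
    return mapping


# Toy example: 8 samples assigned to 3 clusters.
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])
true_labels = np.array([7, 7, 1, 3, 3, 1, 1, 7])

mapping = infer_cluster_labels(cluster_ids, true_labels, n_clusters=3)
predicted = np.array([mapping[c] for c in cluster_ids])
accuracy = float((predicted == true_labels).mean())
print(mapping, accuracy)
```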

➢ Proposed Methodology:

• Importing the MNIST dataset:

For this project, we will be using the MNIST dataset. It is available through keras, a deep
learning library, and importing it from there saves time. It is also available
through the tensorflow library or for download at http://yann.lecun.com/exdb/mnist/.

We are importing both train and test dataset.

• Pre Processing of MNIST dataset:

▪ Images stored as NumPy arrays are 2-dimensional arrays. However, clustering and
classification algorithm provided by scikit-learn ingests 1-dimensional arrays; as a
result, we will need to reshape each image.

▪ Clustering and classification algorithms almost always use 1-dimensional data. For
example, if we are clustering a set of X, Y coordinates, each point would be passed to
the clustering algorithm as a 1-dimensional array with a length of two (example: [2,4]
or [-1, 4]). If we are using 3-dimensional data, the array would have a length of 3
(example: [2, 4, 1] or [-1, 4, 5]).

▪ MNIST contains images that are 28 by 28 pixels; as a result, they will have a length of
784 once we reshape them into a 1-dimensional array.

▪ Scaling and normalization are also applied to the training and testing data values.
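
The reshaping and scaling steps above can be sketched as follows. This is a minimal illustration, not the report's own code: the helper name flatten_and_scale is hypothetical, and random pixels stand in for the MNIST images so that the sketch runs without a download.

```python
import numpy as np

def flatten_and_scale(images):
    """Reshape (n, 28, 28) uint8 images to (n, 784) floats in [0, 1]."""
    images = np.asarray(images)
    flat = images.reshape(len(images), -1)   # 28 * 28 = 784 features per image
    return flat.astype("float32") / 255.0    # pixel values 0..255 -> 0..1

# Assumption: random pixels with the same shape and dtype as MNIST images.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

X = flatten_and_scale(fake_images)
print(X.shape)  # (5, 784)
```

With the real dataset, the same function would be applied to both the training and the testing images.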

We then apply two techniques to the pre-processed MNIST dataset.

➢ For digit recognition and classification or imagery analysis, we will be using PCA
followed by Kmeans clustering.

First, we perform principal component analysis (PCA) to reduce the dimensions of the
training and testing datasets and obtain the principal components.

That is, we load the dataset, split it into training and testing sets, reshape it, and
reduce its dimensions.

PCA is performed both with scikit-learn and through explicit dimensionality reduction to
obtain the principal components.

▪ Learning from a reduced set of features using PCA with 10 components

We find the 10 main components using PCA before clustering the data into 10 clusters using
K-means.

We then form cluster labels with the number of clusters set to 10.

The dimensions of the dataset are reduced, which is evident when we check the
shape of the new dataset.
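
A small sketch of this shape check, assuming scikit-learn's PCA class and using random data in place of the flattened MNIST rows (the array sizes are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: random rows stand in for flattened, scaled MNIST images.
rng = np.random.default_rng(0)
X = rng.random((200, 784))

pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # (200, 784) -> (200, 10)
print(pca.components_.shape)           # (10, 784): one 784-long vector per component
```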

▪ Using PCA components as centroid initialization for KMeans

We try to obtain 10 clusters with KMeans, clustering with the raw pixels as the feature space.
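
One way to seed K-means with the PCA components is to pass them as the init array, as sketched below. The data here is random and only illustrates the call; since the initial centroids are fixed, n_init is set to 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Assumption: random rows stand in for flattened MNIST images.
rng = np.random.default_rng(0)
X = rng.random((300, 784))

# The 10 principal components (shape (10, 784)) act as the 10 starting centroids.
pca = PCA(n_components=10).fit(X)
kmeans = KMeans(n_clusters=10, init=pca.components_, n_init=1)

# Clustering itself runs on the raw 784-dimensional pixel space.
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids[:20])
```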

The main algorithm for imagery analysis is the K-means clustering algorithm.

1. KMEANS CLUSTERING:

We will use a K-means algorithm to perform image classification. Clustering is not limited
to consumer information and population sciences; it can be used for imagery analysis as
well. Leveraging scikit-learn and the MNIST dataset, we investigate the use of K-means
clustering for computer vision.

After importing the dataset and performing the data preprocessing, we proceed as follows.

▪ Starting with the clustering

Due to the size of the MNIST dataset, we will use the mini-batch implementation of k-means
clustering provided by scikit-learn. This will dramatically reduce the amount of time it takes
to fit the algorithm to the data.

The MNIST dataset contains images of the integers 0 to 9. Because of this, let's start by
setting the number of clusters to 10, one for each digit.
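
A minimal sketch of the mini-batch fit, assuming scikit-learn's MiniBatchKMeans; the batch_size and n_init values are illustrative choices, and random rows stand in for the flattened images:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Assumption: random rows stand in for flattened, scaled MNIST images.
rng = np.random.default_rng(0)
X = rng.random((1000, 784)).astype("float32")

# One cluster per digit; each update step uses a small batch instead of all rows.
kmeans = MiniBatchKMeans(n_clusters=10, batch_size=256, n_init=3, random_state=0)
kmeans.fit(X)

print(kmeans.labels_[:20])            # cluster index of the first 20 samples
print(kmeans.cluster_centers_.shape)  # (10, 784)
```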

▪ Assigning Cluster Labels

K-means clustering is an unsupervised machine learning method; consequently, the labels
assigned by our KMeans algorithm refer to the cluster each array was assigned to, not the
actual target integer. To fix this, we will define a few functions that will predict which integer
corresponds to each cluster. We will assume K to be 10.
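
The report does not list these functions in the text; one plausible version uses a majority vote inside each cluster. The function names and the toy arrays below are hypothetical.

```python
import numpy as np

def infer_cluster_labels(cluster_ids, true_labels):
    """Map each cluster id to the most common true digit among its members."""
    mapping = {}
    for cluster in np.unique(cluster_ids):
        members = true_labels[cluster_ids == cluster]
        mapping[int(cluster)] = int(np.bincount(members).argmax())
    return mapping

def infer_data_labels(cluster_ids, mapping):
    """Replace each sample's cluster id with the digit inferred for that cluster."""
    return np.array([mapping[int(c)] for c in cluster_ids])

# Toy example: three clusters, eight samples.
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])
true_labels = np.array([7, 7, 1, 3, 3, 1, 1, 7])

mapping = infer_cluster_labels(cluster_ids, true_labels)
predicted = infer_data_labels(cluster_ids, mapping)
print(mapping)    # {0: 7, 1: 3, 2: 1}
print(predicted)  # [7 7 7 3 3 1 1 1]
```

Samples whose true digit differs from the majority digit of their cluster are the ones counted as errors when accuracy is computed.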

▪ Optimizing and Evaluating the Clustering Algorithm

With the functions defined above, we can now determine the accuracy of our algorithms.
Since we are using this clustering algorithm for classification, accuracy is ultimately the most
important metric; however, there are other metrics out there that can be applied directly to the
clusters themselves, regardless of the associated labels. Two of these metrics that we will use
are inertia and homogeneity.

Furthermore, earlier we made the assumption that K = 10 was the appropriate number of
clusters; however, this might not be the case. We will fit the K-means clustering algorithm
with several different values of K, then evaluate the performance using metrics such as
inertia, homogeneity and accuracy.
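
Such a sweep over K might look like the sketch below. It is an illustration under assumptions: synthetic blobs replace MNIST, the K values are arbitrary, and accuracy is computed after a majority-vote mapping from clusters to labels (the helper majority_vote_predict is hypothetical).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, homogeneity_score

# Assumption: 10 synthetic blobs stand in for the 10 digit classes.
X, y = make_blobs(n_samples=600, centers=10, n_features=20, random_state=0)

def majority_vote_predict(cluster_ids, y_true):
    """Label every sample with the most common true label of its cluster."""
    predicted = np.empty_like(y_true)
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        predicted[mask] = np.bincount(y_true[mask]).argmax()
    return predicted

results = {}
for k in (10, 16, 36):
    km = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
    pred = majority_vote_predict(km.labels_, y)
    results[k] = (km.inertia_, homogeneity_score(y, km.labels_), accuracy_score(y, pred))
    print(f"K={k}: inertia={results[k][0]:.1f}, "
          f"homogeneity={results[k][1]:.3f}, accuracy={results[k][2]:.3f}")
```

Inertia measures how tightly samples sit around their centroid, while homogeneity measures how far each cluster contains members of a single class.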

▪ Visualizing Cluster Centroids

The most representative point within each cluster is called the centroid. If we were dealing
with X,Y points, the centroid would simply be a point on the graph. However, since we are
using arrays of length 784, our centroid is also going to be an array of length 784. We can
reshape this array back into a 28 by 28 pixel image and plot it.

These graphs will display the most representative image for each cluster.
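
Turning the centroids back into images is a single reshape, as sketched here. The helper name centroids_to_images is hypothetical, and random values stand in for the fitted cluster_centers_ array.

```python
import numpy as np

def centroids_to_images(centroids, side=28):
    """Reshape (k, side*side) centroids into k grayscale images of side x side."""
    centroids = np.asarray(centroids, dtype="float64")
    images = centroids.reshape(len(centroids), side, side)
    # Rescale each centroid to 0..255 so it can be displayed as a grayscale image.
    lo = images.min(axis=(1, 2), keepdims=True)
    hi = images.max(axis=(1, 2), keepdims=True)
    scaled = (images - lo) / np.maximum(hi - lo, 1e-12)
    return (scaled * 255).astype(np.uint8)

# Assumption: random values in place of kmeans.cluster_centers_.
rng = np.random.default_rng(0)
fake_centroids = rng.random((10, 784))

images = centroids_to_images(fake_centroids)
print(images.shape)  # (10, 28, 28)
```

Each of the resulting arrays can then be passed to an image plotting routine, together with the digit inferred for its cluster.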

The final output will be the inferred cluster labels and it will be in the form of images.

➢ For image prediction we will be using Convolutional Neural Networks.

2. CONVOLUTIONAL NEURAL NETWORKS:

A Convolutional Neural Network (CNN) is a model that has recently gained a lot of
popularity because of its usefulness. CNNs are a variant of multilayer perceptrons
and need relatively little pre-processing compared to other image classification
algorithms. This means the network learns the filters that were hand-engineered in
traditional algorithms. CNNs are therefore well suited to image processing tasks.

▪ Preparing the output classes:

Since the output of the model can be any of the digits from 0 to 9, we need 10 classes
in the output. To produce the output for 10 classes, we use the keras.utils.to_categorical
function, which provides 10 columns. In each row, exactly one of these 10 columns is one
and the other 9 are zero; the position of the one denotes the class of the digit.
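
To make the encoding concrete, the sketch below reproduces the same one-hot layout in plain NumPy. It is an illustrative stand-in, not the keras function itself; the helper name one_hot is hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (n, num_classes) array with a single 1 per row at the label index."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

y = one_hot([5, 0, 9])
print(y.shape)  # (3, 10)
print(y[0])     # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```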

▪ Create the train data and test data.

Test data: used to test how well the model has been trained.
Train data: used to train the model.

▪ Performing normalization and scaling on the dataset.

While proceeding further, img_rows and img_cols are used as the image dimensions.
In the MNIST dataset, both are 28. We also need to check the data format, i.e.
‘channels_first’ or ‘channels_last’. For a CNN, we can normalize the data beforehand so
that the large values in the calculations are reduced to smaller ones. For example, we can
normalize the x_train and x_test data by dividing them by 255.
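
A sketch of this step in NumPy, under the assumption that the images arrive as an (n, 28, 28) array; the helper name prepare_for_cnn is hypothetical, and the data_format string would normally be read from the keras backend rather than passed by hand.

```python
import numpy as np

def prepare_for_cnn(images, data_format="channels_last"):
    """Scale pixels to [0, 1] and add the single grayscale channel axis."""
    images = np.asarray(images, dtype="float32") / 255.0
    n, img_rows, img_cols = images.shape
    if data_format == "channels_first":
        return images.reshape(n, 1, img_rows, img_cols)
    return images.reshape(n, img_rows, img_cols, 1)

# Assumption: random pixels in place of x_train.
rng = np.random.default_rng(0)
fake_x_train = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

print(prepare_for_cnn(fake_x_train).shape)                    # (4, 28, 28, 1)
print(prepare_for_cnn(fake_x_train, "channels_first").shape)  # (4, 1, 28, 28)
```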

▪ Building the CNN model:

This model consists of only one convolution layer followed by a pooling layer to
speed up computation while still retaining the meaningful information of the data.

The image was flattened and then fed into a dense layer followed by the output layer.

A Softmax activation function was used for the output layer as this is a multi-class
classification problem.

Explanation of the working of each layer in CNN model:

Layer1 is a Conv2D layer which convolves the image using 32 filters, each of size (3*3).
Layer2 is again a Conv2D layer which convolves the image using 64 filters, each of size (3*3).
Layer3 is a MaxPooling2D layer which picks the max value out of a matrix of size (3*3).
Layer4 is a Dropout layer with a rate of 0.5.
Layer5 flattens the output obtained from layer4, and this flattened output is passed to layer6.
Layer6 is a hidden layer of the neural network containing 250 neurons.
Layer7 is the output layer, with 10 neurons for the 10 output classes, and uses the softmax
function.
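
The effect of these layers on the tensor shape can be traced with simple arithmetic. The sketch below assumes a 28 x 28 x 1 input, no padding, a convolution stride of 1, and a pooling stride equal to the pool size (the Keras defaults); none of these settings is stated in the text, so the numbers are illustrative.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of a convolution or pooling window without padding."""
    return (size - kernel) // stride + 1

side, channels = 28, 1
params = {}

# Layer1: Conv2D, 32 filters of 3x3 (weights + one bias per filter).
params["conv1"] = 3 * 3 * channels * 32 + 32
side, channels = conv_out(side, 3), 32          # 26 x 26 x 32

# Layer2: Conv2D, 64 filters of 3x3.
params["conv2"] = 3 * 3 * channels * 64 + 64
side, channels = conv_out(side, 3), 64          # 24 x 24 x 64

# Layer3: MaxPooling2D with a 3x3 window (assumed stride 3); no parameters.
side = conv_out(side, 3, stride=3)              # 8 x 8 x 64

# Layer4: Dropout leaves the shape unchanged. Layer5: Flatten.
flat = side * side * channels                   # 4096

# Layer6: Dense with 250 neurons. Layer7: Dense with 10 neurons (softmax).
params["dense"] = flat * 250 + 250
params["output"] = 250 * 10 + 10

print(flat)                  # 4096
print(params)
print(sum(params.values()))  # 1045576
```

Under these assumptions, almost all trainable parameters sit in the first dense layer, which is why the pooling layer before it matters for computation time.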

We then perform data augmentation by shifting the pixels of the images in the training and
testing datasets. This is done to improve accuracy.
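
A shift-based augmentation could be sketched as follows. The exact shifts used in the project are not given in the text, so the one-pixel shifts in four directions and the helper names below are assumptions.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, filling vacated pixels with 0."""
    shifted = np.zeros_like(image)
    h, w = image.shape
    src = image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    top, left = max(0, dy), max(0, dx)
    shifted[top:top + src.shape[0], left:left + src.shape[1]] = src
    return shifted

def augment_with_shifts(images, labels):
    """Return the originals plus copies shifted one pixel in four directions."""
    shifts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    aug = [shift_image(img, dx, dy) for dx, dy in shifts for img in images]
    return np.stack(aug), np.tile(labels, len(shifts))

# Toy image with a single bright pixel at row 10, column 10.
image = np.zeros((28, 28), dtype=np.uint8)
image[10, 10] = 255

right = shift_image(image, dx=1, dy=0)
print(np.argwhere(right == 255))  # [[10 11]]

aug_images, aug_labels = augment_with_shifts(image[None], np.array([3]))
print(aug_images.shape, aug_labels)  # (5, 28, 28) [3 3 3 3 3]
```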

▪ Model Evaluation:

• The loss on the training and validation sets was plotted to ensure we are not
overfitting our model
• The accuracy on the training and validation data was also plotted
• Overall accuracy was obtained from a classification report and confusion matrix
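
The report and matrix can be produced with scikit-learn, as in this small sketch. The two label arrays are made up for illustration; in the project they would be the true test labels and the class with the highest predicted probability for each test image.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assumption: toy labels for three classes instead of the ten MNIST digits.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_true, y_pred, digits=3))  # precision, recall, F1
print(accuracy_score(y_true, y_pred))   # 0.75
```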

▪ Sample Prediction:

• Random samples from the testing set were selected and our model was used to predict
each sample target.
• The actual image is displayed followed by the prediction.
• We will also be identifying unseen classes to justify the concept of using CNN as an
open classification network.
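
The text does not state how a sample is flagged as belonging to an unseen class. One common option, shown here purely as an assumption, is to reject a sample when the highest softmax probability falls below a threshold; the threshold value 0.9 and the marker -1 are arbitrary choices.

```python
import numpy as np

UNSEEN = -1  # hypothetical marker for "rejected as an unseen class"

def softmax(logits):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    logits = np.asarray(logits, dtype="float64")
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def predict_open_set(probabilities, threshold=0.9):
    """Return the most likely class, or UNSEEN when the model is not confident."""
    probabilities = np.asarray(probabilities)
    best = probabilities.argmax(axis=1)
    confident = probabilities.max(axis=1) >= threshold
    return np.where(confident, best, UNSEEN)

# First row: one class clearly dominates. Second row: almost uniform scores.
logits = np.array([[9.0, 0.5, 0.1],
                   [1.0, 1.1, 0.9]])
probs = softmax(logits)
print(predict_open_set(probs))  # [ 0 -1]
```

Rejected samples can then be collected and clustered to look for the hidden classes among them.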

Finally, prediction is performed and we get the output in the form of digits.

VI. EXPERIMENTAL RESULTS

➢ IMPORTING LIBRARIES:

➢ IMPORTING THE DATASET:

➢ PCA – Using Scikitlearn

Output:

Hence the reduced shape is obtained.

➢ PCA – Using Dimensionality Reduction

Output:

So if we take 200 dimensions, 90% of the variance is explained.
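
The number of dimensions needed for a given share of variance can be read from the cumulative explained variance ratio. The sketch below uses synthetic data with decaying variance per feature, so its result is unrelated to the 200 dimensions reported for MNIST; the helper name components_for_variance is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, target=0.90):
    """Smallest number of components whose cumulative variance reaches target."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative)), cumulative

# Assumption: 50 synthetic features whose spread shrinks from 5.0 down to 0.1.
rng = np.random.default_rng(0)
scales = np.linspace(5.0, 0.1, 50)
X = rng.normal(size=(300, 50)) * scales

k, cumulative = components_for_variance(X, target=0.90)
print(k, round(float(cumulative[k - 1]), 3))
```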

➢ Exploratory Data Analysis:

Output:

➢ PREPROCESSING THE MNIST IMAGES:

The reshaped (converted from a 2D to a 1D array) and normalized dataset is obtained.
This process of reshaping the image array is called flattening.

K-MEANS CLUSTERING:

➢ Using PCA component as centroids init for KMeans

Initially, we set the number of clusters to 10.

The obtained array gives, for each data point, the number of the cluster (out of the 10
clusters) to which it was assigned.

➢ ASSIGNING CLUSTER LABELS:

Output:

The first array shows the predicted labels of the first 20 data points.
The second array shows the actual labels of the first 20 data points.
As we can see, the predicted and actual labels differ for many data points.
So the algorithm needs to be optimized for better results and then evaluated.

➢ OPTIMIZING AND EVALUATING CLUSTERING ALGORITHM:

Output:

Hence the accuracy obtained is 90.23%.

➢ VISUALIZING CLUSTER CENTROIDS:

Output:

CONVOLUTIONAL NEURAL NETWORKS:

This is how the digit looks when the array is reshaped from 2d to 1d.

The new dimensions of training and testing dataset.

➢ ONE HOT ENCODING:

The targets in y_train and y_valid are simply numerical values. We need to perform one-hot encoding for
our model as this is a multi-class classification problem.

➢ BUILDING THE MODEL:
This model consists of only one convolution layer followed by a pooling layer to speed
up computation while still retaining the meaningful information of the data.

The image was flattened and then fed into a dense layer followed by the output layer.

A softmax activation function was used for the output layer as this is a multi-class
classification problem.

➢ DATA AUGMENTATION:

A CNN that can robustly classify objects even when they are placed in different orientations
is said to have the property called invariance. More specifically, a CNN can be invariant to
translation, viewpoint, size or illumination. Data augmentation, which trains the network on
transformed copies of the images, is used to encourage this invariance.

Output:

➢ MODEL EVALUATION:

Calculating the loss and accuracy values for all 6 epochs for both the training and
validation datasets.

The loss on the training and validation sets was plotted to ensure we are not overfitting
our model.

The accuracy on the training and validation data was also plotted.

Generating Classification report and Confusion matrix.

Output:

Accuracy of the model is 99%.

➢ PREDICTION:

As we can see, the correct label is predicted, indicating that the model performs well on this sample.

VII. RESULTS AND DISCUSSION

The accuracy obtained for the K-means algorithm is 90.23% and the accuracy obtained for the
CNN is 99%. A classification report and confusion matrix are generated for the CNN to obtain
metrics such as precision, recall and F1 score.

The accuracies of the two algorithms used for image classification and recognition are
compared to determine which one is more accurate.

From the classification report and graph, we conclude that CNN is a
better classifier for images in the MNIST dataset than K-means clustering.

However, both CNN and K-means are effective compared to other algorithms. In a CNN, the
multiple layers are tightly connected, which gives greater accuracy than other neural
networks. Also, because the data is augmented, the CNN is less sensitive to the position and
orientation of the image.

In K-means, the clusters are tightly formed compared to other clustering algorithms, and
the multiple iterations result in better accuracy.

VIII. CONCLUSION AND FUTURE WORK

As learning is increasingly used in dynamic and open environments, it should no longer make
the closed-world assumption. Open-world learning is required which can not only classify
examples from the seen classes but also reject examples from unseen classes. What is also
important is to identify the hidden unseen classes from the reject examples, which will enable
the system to learn these new classes automatically rather than requiring manual labelling.
This project proposed a model for performing open image classification while making sure
that the classification is insensitive to the position and orientation of the image, so that
prediction is as accurate as possible. In spite of this, a few classes remain hidden or unseen.
A possible approach is to apply oversampling and undersampling to the classes so that the
model does not become biased toward predicting the majority class. Our experiments
demonstrated the effectiveness of the proposed approach.

IX. REFERENCES

1. Hsu, Yen-Chang, Zhaoyang Lv, and Zsolt Kira. "Deep image category discovery
using a transferred similarity function." arXiv preprint arXiv:1612.01253 (2016).
2. De Rosa, Rocco, Thomas Mensink, and Barbara Caputo. "Online open world
recognition." arXiv preprint arXiv:1604.02275 (2016).
3. Yang, Jianwei, Devi Parikh, and Dhruv Batra. "Joint unsupervised learning of deep
representations and image clusters." In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5147-5156. 2016.
4. Hsu, Yen-Chang, and Zsolt Kira. "Neural network-based clustering using pairwise
constraints." arXiv preprint arXiv:1511.06321 (2016).
5. Narula, Sukrit, Khader Shameer, Alaa Mabrouk Salem Omar, Joel T. Dudley, and
Partho P. Sengupta. "Machine-learning algorithms to automate morphological and
functional assessments in 2D echocardiography." Journal of the American College of
Cardiology 68, no. 21 (2016): 2287-2295.
6. Bendale, Abhijit, and Terrance E. Boult. "Towards open set deep networks."
In Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 1563-1572. 2016.
7. Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv
preprint arXiv:1412.6980 (2014).
8. Bendale, Abhijit, and Terrance Boult. "Towards open world recognition."
In Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 1893-1902. 2016.
9. Rebuffi, Sylvestre-Alvise, Alexander Kolesnikov, Georg Sperl, and Christoph H.
Lampert. "icarl: Incremental classifier and representation learning." In Proceedings of
the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001-2010.
2016.
10. Sultana, Marjia, Afrin Haider, and Mohammad Shorif Uddin. "Analysis of data
mining techniques for heart disease prediction." In 2016 3rd International Conference
on Electrical Engineering and Information Communication Technology (ICEEICT),
pp. 1-5. IEEE, 2016.

11. de Souza, Erico N., Kristina Boerder, Stan Matwin, and Boris Worm. "Improving
fishing pattern detection from satellite AIS using data mining and machine
learning." PloS one 11, no. 7 (2016): e0158248.
12. Fu, Yanwei, and Leonid Sigal. "Semi-supervised vocabulary-informed learning."
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 5337-5346. 2016.
13. Hsu, Yen-Chang, Zhaoyang Lv, and Zsolt Kira. "Deep image category discovery
using a transferred similarity function." arXiv preprint arXiv:1612.01253 (2016).
14. Rebuffi, Sylvestre-Alvise, Alexander Kolesnikov, Georg Sperl, and Christoph H.
Lampert. "icarl: Incremental classifier and representation learning." In Proceedings of
the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001-2010.
2016.
15. Ojha, Uma, and Savita Goel. "A study on prediction of breast cancer recurrence using
data mining techniques." In 2017 7th International Conference on Cloud Computing,
Data Science & Engineering-Confluence, pp. 527-530. IEEE, 2017.
16. Cohen, Gregory, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. "EMNIST:
Extending MNIST to handwritten letters." In 2017 International Joint Conference on
Neural Networks (IJCNN), pp. 2921-2926. IEEE, 2017.
17. Asif, Raheela, Agathe Merceron, Syed Abbas Ali, and Najmi Ghani Haider.
"Analyzing undergraduate students' performance using educational data
mining." Computers & Education 113 (2017): 177-194.
18. Shu, Lei, Hu Xu, and Bing Liu. "Doc: Deep open classification of text
documents." arXiv preprint arXiv:1709.08716 (2017).
19. Cohen, Gregory, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. "EMNIST:
Extending MNIST to handwritten letters." In 2017 International Joint Conference on
Neural Networks (IJCNN), pp. 2921-2926. IEEE, 2017.

20. Boult, Terrance E., Steve Cruz, Akshay Raj Dhamija, M. Gunther, James Henrydoss,
and Walter J. Scheirer. "Learning and the unknown: Surveying steps toward open
world recognition." In Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 33, pp. 9801-9807. 2019.
21. Guo, Xiaojie, Amir Alipour-Fanid, Lingfei Wu, Hemant Purohit, Xiang Chen, Kai
Zeng, and Liang Zhao. "Multi-stage Deep Classifier Cascades for Open World
Recognition." In Proceedings of the 28th ACM International Conference on
Information and Knowledge Management, pp. 179-188. 2019.
22. Afolabi, Ibukun Tolulope, Opeyemi Samuel Makinde, and Olufunke Oyejoke
Oladipupo. "Semantic Web mining for Content-Based Online Shopping
Recommender Systems." International Journal of Intelligent Information
Technologies (IJIIT) 15, no. 4 (2019): 41-56.
23. Abbasi, Raza Abid, Nadeem Javaid, Muhammad Nauman Javid Ghuman, Zahoor Ali
Khan, and Shujat Ur Rehman. "Short Term Load Forecasting Using XGBoost."
In Workshops of the International Conference on Advanced Information Networking
and Applications, pp. 1120-1131. Springer, Cham, 2019.
24. Harale, Nitin, Sebastien Thomassey, and Xianyi Zeng. "Supplier Prediction in
Fashion Industry Using Data Mining Technology." In 2019 International Conference
on Industrial Engineering and Systems Management (IESM), pp. 1-6. IEEE, 2019.
25. Wazarkar, Seema, and Bettahally N. Keshavamurthy. "Social image mining for
fashion analysis and forecasting." Applied Soft Computing 95 (2019): 106517.
26. Abbasi, Raza Abid, Nadeem Javaid, Muhammad Nauman Javid Ghuman, Zahoor Ali
Khan, and Shujat Ur Rehman. "Short Term Load Forecasting Using XGBoost."
In Workshops of the International Conference on Advanced Information Networking
and Applications, pp. 1120-1131. Springer, Cham, 2019.
27. Sahani, Roma, Chinmayee Rout, J. Chandrakanta Badajena, Ajay Kumar Jena, and
Himansu Das. "Classification of intrusion detection using data mining techniques."
In Progress in computing, analytics and networking, pp. 753-764. Springer,
Singapore, 2018.
28. Sahani, Roma, Chinmayee Rout, J. Chandrakanta Badajena, Ajay Kumar Jena, and
Himansu Das. "Classification of intrusion detection using data mining techniques."
In Progress in computing, analytics and networking, pp. 753-764. Springer,
Singapore, 2018.
29. Thorat, Akanksha Madan, and R. Vishnu Priya. "Sentiment analysis of movie review
using text mining." International Journal of Pure and Applied Mathematics 119, no.
16 (2018): 3561-3566.
30. Shu, Lei, Hu Xu, and Bing Liu. "Unseen class discovery in open-world
classification." arXiv preprint arXiv:1801.05609 (2018).
