
2021 2nd International Conference on Computing and Data Science (CDS)

Image Classification with Artificial Intelligence: Cats vs Dogs
Youngjun Lee
Beijing No. 80 High School
Beijing, China
youngjunlee@163.com
2021 2nd International Conference on Computing and Data Science (CDS) | 978-1-6654-0428-0/21/$31.00 ©2021 IEEE | DOI: 10.1109/CDS52072.2021.00081

Abstract— With the rise of techniques such as big data and artificial intelligence, we now have more advanced technology for solving image classification problems, and the accuracy of image classification has increased by a remarkable amount. To compare and analyze the classification performance of different machine learning and deep learning methods, this paper implements a support vector machine (SVM) and a convolutional neural network (CNN) to solve the classical Cats vs Dogs problem, and compares how different parameters affect the CNN. The lessons learned are summarized and presented in this paper.

Keywords— Support Vector Machine, Convolutional Neural Network, Artificial Intelligence, Image Classification

I. INTRODUCTION

Image classification problems are universal in our daily life. For instance, in the task of garbage classification, if we do not know which category a piece of trash belongs to, we just need to take a picture of it with our smartphone. The relevant app on the phone will automatically identify the information in the picture and classify its type. To realize image classification, computer programs can use handcrafted characteristics of pictures, such as edge information and the distribution of colors, to separate the different kinds of objects.

Before the rise of the smartphone and artificial intelligence, people already had the idea of classifying images and had built multiple tools for it. OpenCV [1], born in 1998, for instance, is one of the most popular open source tools in the computer vision area. It provides interfaces for different programming languages like Python, Matlab and Java, and has been widely applied in both academia and industry. OpenCV not only provides basic manipulations such as drawing, saving and converting pictures, but also provides basic functions for edge extraction and image classification. With the rise of smartphones, these basic picture manipulation functions have gradually moved from computers to smartphones, enabling people to use them anytime and anywhere. While the image classification tools represented by OpenCV provide classification functions, the realization of these functions is based on empirical parameter settings. As a result, it is difficult for them to adapt to complex situations and the increasing demands of big data.

After the rise of A.I. technology, researchers collected large numbers of images and contributed many large-scale open datasets [2-5]. With these datasets, researchers proposed many machine learning models and effective training algorithms. Machine learning models can learn suitable parameters from data automatically, with fewer manual operations [6, 7]. After the rise of deep learning, a wide variety of neural network models, especially convolutional neural networks, have shown impressive performance on the image classification problem [8-10].

In this paper, we take the classical Cats vs Dogs problem as an example, and implement and compare the SVM and different structures of CNN models on a real-world dataset.

The contributions of this paper are summarized as follows:
(1) We validate that the deep learning models represented by CNN outperform the shallow machine learning models represented by SVM in the problem of image classification;
(2) We give a comprehensive evaluation of the influence of different parameters.

The findings of this paper are as follows:
(1) The data augmentation technique is helpful in improving the deep learning model’s performance;
(2) In the choice of optimizers, Adam is better than both SGD and RMSprop;
(3) In the choice of the convolution kernel, the size of 3 is better than the size of 5;
(4) In the choice of the activation function, LeakyReLU is better than ReLU.

II. RELATED WORK

A. Digital Image Processing [11]

Since the computer came into being, images can be stored and processed by computers. For a single-channel (grayscale) image, each pixel is represented by a value in the range 0 to 255. Image processing software tools including OpenCV can perform basic operations such as drawing, copying, displaying and saving. They also support affine transformations like translation and rotation. In addition, blurred images can be recovered with noise filters, and edges can be extracted by detecting changes between nearby pixels.

These are traditional pre-processing techniques for pictures. After this processing, researchers apply machine learning models like SVM to classify images. The

appearance of deep learning techniques has largely simplified this process.

B. Deep Learning [12]

a. Deep Neural Network

Deep learning is a subset of machine learning that usually refers to multi-layer neural networks. In recent years, thanks to the appearance of big data sets and powerful hardware devices, deep learning has gradually become the main method of processing high-dimensional data like pictures, text and sound signals. Neural networks mainly consist of connected artificial neurons. Each neuron first computes a weighted sum of its inputs, then obtains its output by applying a nonlinear activation function, and this output serves as an input for the next layer of neurons. When a neural network contains multiple hidden layers, we call it a deep neural network (DNN). While adding hidden layers can greatly increase the learning capacity of neural networks, it can also increase the risk of overfitting. As a result, dropout and early stopping have been proposed as effective methods of alleviating the overfitting problem.

b. Convolutional Neural Network

The convolutional neural network (CNN) is a special type of neural network, mainly used in two-dimensional image processing or three-dimensional video processing. A CNN can extract effective feature representations from pictures by applying the basic operations of convolution and pooling, and then conduct recognition tasks based on these representations. By using groups of convolution kernels, a convolutional neural network can process multi-channel input, like color pictures. It can also extract different features as different output channels. In an early ImageNet image classification contest, the AlexNet model outperformed a large number of methods based on handcrafted features, demonstrating the superiority of CNNs for the first time.

While a CNN is powerful, it needs a large amount of data to learn useful features. As a result, when there is not enough data, one possible solution is to use data augmentation to increase the size of the training set. Another is to use transfer learning, which directly leverages weights pre-trained on ImageNet and then fine-tunes them.

III. DATASET DESCRIPTION

The dataset we used contains a total of 25,000 images of cats and dogs, of which 12,500 are cats and 12,500 are dogs. This is a balanced classification problem, because the number of images of cats and of dogs is the same. We use accuracy as the evaluation metric for our models. Accuracy is defined as the proportion of correctly classified images to the total number of images.

This data set is collected from the Internet, so it is not perfect. There may be inaccuracy or interference. For example, the picture may show a person holding a cat, or there may be occlusion. These situations pose greater challenges to the classification model.

In the following parts, we divide the data into a training set and a test set according to the ratio of 80%-20%. All models are trained on the training set and compared on the remaining test set.

IV. MODEL DESCRIPTION

This section introduces the various models we use, including machine learning models represented by SVM, and deep learning models represented by CNN.

A. Support Vector Machine

Image classification methods based on support vector machines usually consist of two steps: feature extraction, and modeling and classification. Since traditional machine learning methods usually rely on manual feature extraction, the choice of feature extraction method has a great impact on the classification performance of the model.

In this paper, in the feature extraction step, we first convert the input color image to HSV space, that is, we express each pixel as Hue, Saturation, and Value. The reason for this step is that the transformed coordinate system can express the characteristics of the image more effectively. Then the distributions of Hue, Saturation, and Value are each computed as a histogram quantized into 8 intervals and normalized to probability values. Each of the three distributions therefore consists of 8 values. In other words, each image is described by a vector containing 3 * 8 values, which is the input for modeling and classification.

In the modeling and classification step, we use a standard support vector machine model. Given the feature vector of each image, the main idea of this model is to find a hyperplane that separates the vector points of cats from those of dogs. The optimal hyperplane is chosen so that the points nearest to it (the support vectors) are as far away from it as possible, that is, the margin is maximized. We take the support vector machine as a representative machine learning model to compare with the deep learning models that follow.

B. Convolutional Neural Network

This paper uses various forms of convolutional neural networks for comparison.

a. CNN model

As introduced in the previous discussion, a CNN can automatically extract features through convolution and pooling, avoiding the process of manually extracting features. In order to verify the performance of the convolutional neural network, we first used a standard convolutional neural network model. The model we use contains 4 convolutional layers and 4 pooling layers. The size of the convolution kernel of each convolutional layer is 3×3, and the numbers of convolution kernels are 32, 64, 64 and 128 respectively. The window size of each pooling layer is 2×2. After the output of the last convolutional layer is flattened, it is connected to a fully connected layer with 128 neurons. The activation function used by this fully connected layer is the ReLU function. Finally, the class outputs are generated by an output layer with 2 neurons, and the activation function used by this output layer is the softmax function. The total number of parameters of this model is 3,453,634.

We show the CNN model we used in Figure 1.
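As a rough check on the CNN architecture described above, the following sketch traces the feature-map size and counts trainable parameters layer by layer. The helper name `count_cnn_parameters`, the 150×150 RGB input, unpadded convolutions with stride 1, and non-overlapping pooling are all assumptions made for illustration; the text does not state them.

```python
def count_cnn_parameters(input_hw=(150, 150), in_channels=3,
                         filters=(32, 64, 64, 128), kernel=3, pool=2,
                         dense_units=128, num_classes=2):
    """Count trainable parameters of a conv/pool stack plus two dense layers.

    Assumes unpadded ("valid") convolutions with stride 1 and non-overlapping
    pooling. These are illustrative assumptions, not details from the paper.
    """
    h, w = input_hw
    channels = in_channels
    total = 0
    for n_filters in filters:
        # Each filter has kernel*kernel*channels weights plus one bias.
        total += (kernel * kernel * channels + 1) * n_filters
        # A valid convolution shrinks each side by (kernel - 1).
        h, w = h - kernel + 1, w - kernel + 1
        # Pooling divides each side by the window size, rounding down.
        h, w = h // pool, w // pool
        channels = n_filters
    flat = h * w * channels                     # size after flattening
    total += (flat + 1) * dense_units           # fully connected layer
    total += (dense_units + 1) * num_classes    # softmax output layer
    return total, (h, w, channels)


if __name__ == "__main__":
    params, final_shape = count_cnn_parameters()
    print("final feature map:", final_shape)
    print("trainable parameters:", params)
```

The total depends strongly on the assumed input resolution and layer widths, so this sketch should not be expected to reproduce the reported figure of 3,453,634 unless those details match the actual implementation.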

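The feature extraction step described for the SVM (HSV conversion followed by three normalized 8-bin histograms) can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it relies on the standard-library `colorsys` module and NumPy as a stand-in for OpenCV, and the function name `hsv_histogram_features` is hypothetical.

```python
import colorsys

import numpy as np


def hsv_histogram_features(rgb_image, bins=8):
    """Return a 3 * bins vector of normalized H, S and V histograms.

    rgb_image: array of shape (height, width, 3) with values in 0..255.
    """
    pixels = np.asarray(rgb_image, dtype=np.float64).reshape(-1, 3) / 255.0
    # colorsys converts one pixel at a time and returns H, S, V in [0, 1].
    hsv = np.array([colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels])
    features = []
    for channel in range(3):
        counts, _ = np.histogram(hsv[:, channel], bins=bins, range=(0.0, 1.0))
        # Normalize the counts so each channel's histogram sums to 1.
        features.append(counts / counts.sum())
    return np.concatenate(features)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(16, 16, 3))  # stand-in for a photo
    vector = hsv_histogram_features(image)
    print(vector.shape)  # (24,)
```

Vectors of this form, one per image, could then be passed to an SVM classifier such as the one provided by scikit-learn.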
Figure 1. The CNN model used in this study.

b. Data augmentation

Data augmentation generates similar but different training samples by applying a series of random changes to the training images, thereby expanding the size of the training data set. Randomly changing the training samples can reduce the model's dependence on certain attributes, thereby improving its generalization ability. For example, we can crop the image in different ways so that the object of interest appears in different positions, which reduces the model's dependence on the position of the object. We can also adjust factors such as brightness and color to reduce the model's sensitivity to color. Therefore, we added data augmentation without changing the model, to see whether the accuracy of the model improves.

V. RESULTS AND ANALYSIS

A. Experiment Setting

We conducted a series of experiments in this paper. The experiments were run on a desktop computer with Windows 10. Python is used as the main programming language. OpenCV is used to manipulate the images. The SVM model is implemented with scikit-learn and the deep learning models are implemented with TensorFlow.

The parameters for training the convolutional neural networks are set as follows: the number of training epochs is 30, the batch size is 10, the loss function is cross entropy, and the final evaluation metric is accuracy.

Specifically, we conducted the following experiments. First, we compared the performance of the SVM and CNN models. Both models used the same original dataset. Then, for the CNN model, we compared the cases with and without data augmentation. After that, we kept data augmentation in place and changed different parameters in turn. The first change is the optimizer, replacing Adam with SGD and RMSprop. The second change is the kernel size, replacing 3 with 5. The third and last change is the activation function, replacing ReLU with LeakyReLU. We show these two functions in Figure 2, in which the parameter a is set to 0.1 empirically.

B. Result Analysis

We show the test accuracy for different models in Figure 3.

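[Figure 2: two plots of f(y) against y. Left, ReLU: f(y) = y for y > 0 and f(y) = 0 otherwise. Right, LeakyReLU: f(y) = y for y > 0 and f(y) = ay otherwise.]

Figure 2. ReLU and LeakyReLU.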
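The two activation functions in Figure 2 can be written out in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, and the default slope a = 0.1 follows the value stated in the text.

```python
import numpy as np


def relu(y):
    """ReLU: pass positive inputs through, clamp the rest to zero."""
    return np.maximum(y, 0.0)


def leaky_relu(y, a=0.1):
    """LeakyReLU: like ReLU, but non-positive inputs are scaled by a."""
    return np.where(y > 0, y, a * y)


if __name__ == "__main__":
    y = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu(y))
    print(leaky_relu(y))
```

The only difference is on the negative side: ReLU outputs zero there, while LeakyReLU keeps a small slope, so negative inputs still produce a nonzero gradient.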

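To make the data augmentation idea from the model description concrete, the sketch below applies a random crop and a random brightness change to an image array. The text does not specify which transformations or ranges were used in the experiments, so the function name `augment`, the crop size, and the brightness range are assumptions chosen purely for illustration.

```python
import numpy as np


def augment(image, crop_size, rng, max_brightness_shift=0.2):
    """Return a randomly cropped and brightness-adjusted copy of image.

    image: array of shape (height, width, 3) with values in 0..255.
    crop_size: (crop_height, crop_width), no larger than the image.
    """
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    # Pick the top-left corner so the crop window stays inside the image.
    top = rng.integers(0, height - crop_h + 1)
    left = rng.integers(0, width - crop_w + 1)
    crop = image[top:top + crop_h, left:left + crop_w].astype(np.float64)
    # Scale brightness by a random factor around 1.
    factor = 1.0 + rng.uniform(-max_brightness_shift, max_brightness_shift)
    return np.clip(crop * factor, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    photo = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    samples = [augment(photo, (48, 48), rng) for _ in range(4)]
    print([s.shape for s in samples])
```

Each call yields a slightly different view of the same photo, which is how augmentation enlarges the effective training set without collecting new images.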
Figure 3. The comparison of different models.

From Figure 3, we have the following findings:
(1) CNN performs better than SVM;
(2) The data augmentation technique is helpful in improving the deep learning model’s performance;
(3) In the choice of optimizers, Adam is better than both SGD and RMSprop;
(4) In the choice of the convolution kernel, the size of 3 is better than the size of 5;
(5) In the choice of the activation function, LeakyReLU is better than ReLU.

We also evaluated the change of accuracy and loss during the training process for different parameter settings. First, we found that when data augmentation was not used, the CNN model was prone to overfitting. The changes of accuracy and loss are shown in Figures 4 and 5, respectively. Then, after we added the data augmentation technique and replaced ReLU with LeakyReLU, we achieved the best result so far. For the best case, the changes of accuracy and loss are shown in Figures 6 and 7, respectively. The validation accuracy fluctuates considerably in Figure 6 because we only use 4000 images for validation.

Figure 4. The change of accuracy when data augmentation is not used.

Figure 5. The change of loss when data augmentation is not used.

Figure 6. The change of accuracy when data augmentation and LeakyReLU are used.

Figure 7. The change of loss when data augmentation and LeakyReLU are used.

VI. CONCLUSION

In this paper, we take the cat and dog classification problem as an example, evaluate the machine learning model represented by SVM and the deep learning model represented by CNN, and verify that CNN is superior to SVM in image classification. We also analyzed the influence of different parameters on the CNN model. Our results can provide a reference for the selection of corresponding parameters for similar problems.
REFERENCES
[1] Bradski G, Kaehler A. Learning OpenCV: Computer vision with the OpenCV library[M]. O'Reilly Media, Inc., 2008.
[2] Deng L. The MNIST database of handwritten digit images for machine learning research [best of the web][J]. IEEE Signal Processing Magazine, 2012, 29(6): 141-142.
[3] Cohen G, Afshar S, Tapson J, et al. EMNIST: Extending MNIST to handwritten letters[C]//2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017: 2921-2926.
[4] Jiang W. MNIST-MIX: A Multi-language Handwritten Digit Recognition Dataset[J]. IOPSciNotes, 2020, 1(025002).
[5] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009: 248-255.
[6] Kouropteva O, Okun O, Pietikäinen M. Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine[C]//ESANN. 2003: 229-234.
[7] Greeshma K V, Sreekumar K. Fashion-MNIST classification based on HOG feature descriptor using SVM[J]. International Journal of Innovative Technology and Exploring Engineering, 2019, 8: 960-962.
[8] Sharma N, Jain V, Mishra A. An analysis of convolutional neural networks for image classification[J]. Procedia Computer Science, 2018, 132: 377-384.
[9] Jiang W, Zhang L. Edge-siamnet and edge-triplenet: New deep learning models for handwritten numeral recognition[J]. IEICE Transactions on Information and Systems, 2020, 103(3): 720-723.
[10] Jiang W. Evaluation of deep learning models for Urdu handwritten characters recognition[C]//Journal of Physics: Conference Series. IOP Publishing, 2020, 1544(1): 012016.
[11] Gonzalez R C, Woods R E, Eddins S L. Digital image processing using MATLAB[M]. Pearson Education India, 2004.
[12] Goodfellow I, Bengio Y, Courville A, et al. Deep learning[M]. Cambridge: MIT Press, 2016.
