
CHAPTER 1

INTRODUCTION

In the last few years, there has been a lot of hype about artificial neural networks and
other machine learning methods, mostly due to the success of a series of applications
in the industry that use this technology. This in turn has spawned the creation of
relatively powerful open source libraries and services that make this technology
accessible to average developers. In order to use these libraries optimally,
programmers have to have a basic understanding of the underlying implementation
of the algorithms.

1.1 Types of Artificial Neural Networks


1.1.1 FeedForward ANN
Information flows in one direction only. A unit sends information to other units from
which it receives none, so there are no feedback loops. Feedforward networks are
used in pattern generation, recognition, and classification. They have fixed inputs and
outputs.

1.1.2 FeedBack ANN


Here, feedback loops are allowed. They are used in content addressable memories.

1.1.3 Working of ANNs


In the topology diagrams shown, each arrow represents a connection between two
neurons and indicates the pathway for the flow of information. Each connection has
a weight, a real number that controls the signal between the two neurons.

If the network generates a good or desired output, there is no need to adjust the
weights. However, if the network generates a poor or undesired output or an error,
then the system alters the weights in order to improve subsequent results.

1.2 Machine Learning in ANNs


ANNs are capable of learning, but they need to be trained. There are several
learning strategies:

1.2.1 Supervised Learning
It involves a teacher that is more knowledgeable than the ANN itself. The teacher
feeds the network example data for which the teacher already knows the answers.

Take pattern recognition as an example. The ANN comes up with guesses while
recognizing patterns. Then the teacher provides the ANN with the answers. The
network compares its guesses with the teacher's correct answers and
makes adjustments according to the errors.
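The guess, compare, adjust cycle of supervised learning can be sketched with a single linear unit. This is only a minimal illustration, not the network used in this project: the function name `train_single_weight`, the learning rate, and the toy data (targets equal to twice the input) are assumptions made for the example.

```python
def train_single_weight(examples, learning_rate=0.1, epochs=50):
    """Fit y = w * x to (x, target) pairs with a simple error-driven update."""
    w = 0.0  # initial guess for the weight
    for _ in range(epochs):
        for x, target in examples:
            guess = w * x                    # the network's guess
            error = target - guess           # compare with the teacher's answer
            w += learning_rate * error * x   # adjust the weight according to the error
    return w


# Hypothetical labelled data: the "teacher" knows that target = 2 * x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
learned_w = train_single_weight(examples)
print(round(learned_w, 3))
```

When the guess already matches the answer, the error is zero and the weight is left unchanged, which mirrors the behaviour described in Section 1.1.3.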

1.2.2 Unsupervised Learning
It is required when there is no example data set with known answers, for example,
when searching for a hidden pattern. In this case, clustering (dividing a set of
elements into groups according to some unknown pattern) is carried out based on
the existing data sets.
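As an illustration of clustering without known answers, the sketch below groups one-dimensional values with a basic k-means loop. The function name `k_means_1d`, the starting centers, and the sample values are all hypothetical; k-means is used here as one common clustering technique, not as the method of this project.

```python
def k_means_1d(values, centers, iterations=10):
    """Group 1-D values around the nearest of the given starting centers."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        # Assignment step: attach each value to its closest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, groups = k_means_1d(values, centers=[0.0, 5.0])
print(groups)
```

No labels are supplied at any point; the two groups emerge only from the distances between the values.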

1.2.3 Reinforcement Learning
This strategy is built on observation. The ANN makes a decision by observing its
environment. If the observation is negative, the network adjusts its weights to be
able to make a different, required decision the next time.

1.3 Applications of Neural Networks


Neural networks can perform tasks that are easy for a human but difficult for a machine:

Aerospace: Autopilot aircraft, aircraft fault detection.

Automotive: Automobile guidance systems.

Military: Weapon orientation and steering, target tracking, object
discrimination, facial recognition, signal/image identification.

Electronics: Code sequence prediction, IC chip layout, chip failure
analysis, machine vision, voice synthesis.

Financial: Real estate appraisal, loan advisor, mortgage screening,
corporate bond rating, portfolio trading program, corporate financial
analysis, currency value prediction, document readers, credit application
evaluators.

Industrial: Manufacturing process control, product design and analysis,
quality inspection systems, welding quality analysis, paper quality
prediction, chemical product design analysis, dynamic modeling of chemical
process systems, machine maintenance analysis, project bidding, planning,
and management.

Medical: Cancer cell analysis, EEG and ECG analysis, prosthetic design,
transplant time optimizer.

Speech: Speech recognition, speech classification, text-to-speech
conversion.

Telecommunications: Image and data compression, automated information
services, real-time spoken language translation.

1.4 Computer Vision


Computer Vision is a field of Artificial Intelligence and Computer Science that aims
at giving computers a visual understanding of the world. It is one of the main
components of machine understanding.
The goal of Computer Vision is to emulate human vision using digital images
through three main processing components, executed one after the other:
1. Image acquisition
2. Image processing
3. Image analysis and understanding

Just as our visual understanding of the world is reflected in our ability to make
decisions based on what we see, providing such a visual understanding to computers
would give them the same ability.

Fig. 1.1 Computer vision process

1.4.1 Image processing
The second component of Computer Vision is the low-level processing of images.
Algorithms are applied to the binary data acquired in the first step to infer low-level
information on parts of the image. This type of information is characterized by image
edges, point features or segments, for example. They are all the basic geometric
elements that build objects in images.
This second step usually involves advanced applied mathematics algorithms and
techniques.

Low-level image processing algorithms include:


1. Edge detection
2. Segmentation
3. Classification
4. Feature detection and matching
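To make the first item of this list concrete, the sketch below marks edges in a tiny grayscale image by taking the absolute difference between horizontally adjacent pixels. It is a deliberately simplified stand-in for real edge detectors such as Sobel or Canny; the function name `horizontal_edges` and the sample image are hypothetical.

```python
def horizontal_edges(image):
    """Return absolute differences between horizontally adjacent pixels."""
    return [
        [abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
        for row in image
    ]


# Hypothetical 3x4 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
edges = horizontal_edges(image)
print(edges)
```

Large values in the result mark the column where the intensity jumps, which is exactly the kind of low-level information that later steps build on.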

1.4.2 Image analysis and understanding


The last step of the Computer Vision pipeline is the actual analysis of the data, which
enables decision making.
High-level algorithms are applied, using both the image data and the low-level
information computed in previous steps.
Examples of high-level image analysis are:
1. 3D scene mapping
2. Object recognition
3. Object tracking

1.5 Computer Vision in Image Classification


Computer vision is a problem that has existed for a long time. In this project, we will
be focusing on the task of classification of computer images into different categories.
This specific part of computer vision has many diverse real world applications,
ranging from video games to self-driving cars. However, it has traditionally been
very difficult to do well because of the enormous number of factors
(camera angles, lighting, color balance, resolution, etc.) that go into creating an
image.

We will be focusing on using artificial neural networks for image classification.
While artificial neural networks are some of the oldest machine learning algorithms
in existence, they have not been widely used in the field of computer vision.
More recent improvements in the methods of training artificial neural networks have
made them worth looking into once again for the task of image classification.
Image recognition and classification is a problem that has been around for a long
time and has many real world applications. Police can use image recognition and
classification to help identify suspects in security footage. Banks can use it to help
sort out checks. More recently, Google has been using it in their self-driving car
program.
Traditionally, a lot of different machine learning algorithms have been utilized for
image classification, including template matching, support vector machines, k-NN,
and hidden Markov models.

1.5.1 Challenges
Here is a set of challenges in applying deep learning techniques to image
classification problems:

1) Image size: The majority of the image-related tasks where deep learning was
successfully applied used images of small size. Examples: MNIST (28x28
pixels), STL-10(96x96), Norb (108x108), Cifar-10 and Cifar-100 (32x32).

2) Model size and training time: One exception to the dataset list above is the
ImageNet dataset, which consists of high-resolution images of variable sizes.
The best results on this dataset, however, require significant computing
resources (such as 16 thousand cores running for three days). Texture
datasets commonly consist of higher-resolution images, so different
techniques need to be tested in order to classify the textures without
using too many computing resources.

1.6 Motivation
There is a considerable set of potential applications for image classification, as
briefly presented above. However, despite the reported success of classical image
classification techniques in many of these tasks, these problems are still not resolved
and are the subject of active research, with potential to increase classification rates.
A second motivation is that traditional machine learning techniques often require
human expertise and knowledge to hand-engineer features, for each particular
domain, to be used in classification and regression tasks. It can be considered that the
actual intelligence in such systems is therefore in the creation of such features,
instead of the machine learning algorithm that uses them. Therefore, using
techniques that do not rely on expert-defined feature extractors can make it easier to
develop effective machine learning models for novel datasets, without requiring the
testing and selection of a large set of possible feature extractors.

1.7 Problem Statement


The main objective of this research is to test whether or not deep learning models, in
particular, Artificial Neural Networks, can be successfully applied for image
classification problems.
Finally, the aim is to develop a program that automatically classifies images into
ten categories using an artificial neural network.

CHAPTER 2
LITERATURE SURVEY
Classifying objects is an easy task for humans, but it has proved to be a
complex problem for machines [1]. The rise of high-capacity computers, the
availability of high-quality and low-priced video cameras, and the increasing need for
automatic video analysis have generated interest in object classification algorithms.
A simple classification system consists of a camera fixed high above the zone of
interest, where images are captured and subsequently processed. Classification includes
image sensors, image pre-processing, object detection, object segmentation, feature
extraction, and object classification [2].
A classification system contains a database of predefined patterns that are
compared with the detected object in order to classify it into the proper category [3].
Image classification is an important and challenging task in various application domains,
including biomedical imaging, biometry, video surveillance, vehicle navigation,
industrial visual inspection, robot navigation, and remote sensing. The classification
process consists of the following steps [5]:

A. Pre-processing: atmospheric correction, noise removal, image transformation,
principal component analysis, etc.

B. Detection and extraction of an object: Detection determines the position and
other characteristics of a moving object in the image obtained from the camera.
Extraction estimates the trajectory of the detected object in the
image plane [4].

C. Training: Selection of the particular attribute that best describes the pattern.

D. Classification of the object: The object classification step categorizes detected
objects into predefined classes by using a suitable method that compares the image
patterns with the target patterns [6].

Image classification has made great progress over the past decades in the following
three areas [7]:
(1) development and use of advanced classification algorithms, such as sub-pixel,
per-field, and knowledge-based classification algorithms;
(2) use of multiple remote-sensing features, including spectral, spatial,
multitemporal, and multisensor information [8];
(3) incorporation of ancillary data into classification procedures, including such
data as topography, soil, road, and census data.
Accuracy assessment is an integral part in an image classification procedure.[9]
Accuracy assessment based on error matrix is the most commonly employed
approach for evaluating per-pixel classification, while fuzzy approaches are gaining
attention for assessing fuzzy classification results.[10] Uncertainty and error
propagation in the image-processing chain is an important factor influencing
classification accuracy. Identifying the weakest links in the chain and then
reducing the uncertainties are critical for improvement of classification accuracy.
[11] The study of uncertainty will be an important topic in the future research of
image classification. Spectral features are the most important information for image
classification.[15]
Standard training algorithms for the multi-layer perceptron use back-propagation to
evaluate the first derivatives of the error function with respect to the weights and
thresholds in the network. There are, however, several situations in which it is also of
interest to evaluate the second derivatives of the error measure. These derivatives
form the elements of the Hessian matrix. [12]
Second derivative information has been used to provide a fast procedure for
retraining a network following a small change in the training data. In this application it
is important that all elements of the Hessian matrix be evaluated accurately.
Approximations to the Hessian have been used to identify the least significant
weights as a basis for network pruning technique, as well as for improving the speed
of training algorithms.[13] The Hessian has also been used by MacKay for Bayesian
estimation of regularization parameters, as well as for calculation of error bars on the
network outputs and for assigning probabilities to different network solutions.[14]
[9] MacKay found that the approximation scheme of Le Cun was not sufficiently
accurate and therefore included off-diagonal terms in the approximation scheme.

CHAPTER 3
BACKGROUND

3.1 Neural Network Usage


Artificial neural networks were designed to model the structure of the
brain. They were first devised in 1943 by researchers Warren McCulloch and Walter
Pitts. Back then, the model was called threshold logic, and it branched into two
different approaches: one inspired more by biological processes and one focused on
artificial intelligence applications. Although artificial neural networks initially saw a
lot of research and development, their popularity soon declined and research slowed
because of technical limitations. Computers at the time did not have sufficient
computational power and took too long to train neural
networks. As a result, other machine learning techniques became more popular and
artificial neural networks were mostly neglected.
However, one important algorithm related to artificial neural networks was
developed during this time: backpropagation, discovered by Paul Werbos.
Backpropagation is a way of training artificial neural networks by attempting to
minimize the errors. This algorithm allowed scientists to train artificial neural
networks much more quickly.
Artificial neural networks became popular once again in the late 2000s when
companies like Google and Facebook showed the advantages of using machine
learning techniques on big datasets collected from everyday users. These algorithms are
nowadays mostly used for deep learning, which is an area of machine learning that
tries to model more complicated relationships, for example, nonlinear
relationships.

3.2 Neural Network working


Each artificial neural network consists of one or more hidden layers. Each hidden
layer consists of multiple nodes. Each node is linked to other
nodes using incoming and outgoing connections. Each of the connections can have a
different, adjustable weight. Data is passed through the hidden layers and the
output is eventually interpreted as different results.

Fig 3.1 Neural Network layers

In this example diagram, there are three input nodes, shown in red. Each input node
represents a parameter from the dataset being used. Ideally the data from the dataset
would be preprocessed and normalized before being put into the input nodes.
There is only one hidden layer in this example and it is represented by the nodes in
blue. This hidden layer has four nodes in it. Some artificial neural networks have
more than one hidden layer.
The output layer in this example diagram is shown in green and has two nodes. The
connections between all the nodes (represented by black arrows in this diagram) are
weighted differently during the training process.
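The layout in the example diagram (three input nodes, four hidden nodes, two output nodes) can be sketched as a forward pass in plain Python. The random weights, the zero biases, the sample input, and the choice of a ReLU activation in the hidden layer are assumptions for illustration; the diagram itself does not fix them.

```python
import random


def layer_output(inputs, weights, biases):
    """Weighted sum of the inputs, plus a bias, for every node in one layer."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + b
        for node_weights, b in zip(weights, biases)
    ]


def relu(values):
    """Replace negative values with zero (assumed hidden-layer activation)."""
    return [max(0.0, v) for v in values]


random.seed(0)
n_in, n_hidden, n_out = 3, 4, 2

# One row of weights per node; each row has one weight per incoming connection.
w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b_out = [0.0] * n_out

inputs = [0.5, -0.2, 0.1]          # hypothetical, already normalized input values
hidden = relu(layer_output(inputs, w_hidden, b_hidden))
outputs = layer_output(hidden, w_out, b_out)
print(hidden, outputs)
```

Training would then consist of adjusting the entries of `w_hidden` and `w_out`, which correspond to the arrows in the diagram.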

3.3 Image Classification and Recognition


Image recognition and classification is a problem that has been around for a long
time and has many real world applications. Police can use image recognition and
classification to help identify suspects in security footage. Banks can use it to help
sort out checks.

More recently, Google has been using it in their self-driving car program.
Traditionally, many different machine learning algorithms have been used for
image classification, including template matching, support vector machines, k-NN,
and hidden Markov models. Image classification remains one of the most
difficult problems in machine learning, even today.

3.4 Related Work


In academia, there is related work done by Professor Andrew Ng at Stanford
University, Professor Geoffrey Hinton at University of Toronto, Professor Yann
LeCun at New York University, and Professor Michael Jordan at UC Berkeley.
Much of their work deals with applying artificial neural networks or other machine
learning algorithms. The following is a sampling of a few papers from the large body of
work that is available. These papers are all fairly recent and are more geared towards
specific applications of machine learning and artificial neural networks.
There is also a lot of research involving machine learning and artificial neural
networks going on in industry, most specifically at Google and Facebook.

CHAPTER 4
PROBLEM DEFINITION & PROCESS INVOLVED

4.1 INTRODUCTION

Image classification is a critical task for both humans and computers. One of the
challenges lies in the large scale of the semantic space. In particular, humans can
recognize tens of thousands of object classes and scenes.
By performing image classification on the CIFAR-10 dataset, we find that: a) computational
issues become crucial in algorithm design; b) conventional wisdom from a couple
of hundred image categories on relative performance of different classifiers does not
necessarily hold when the number of categories increases; c) there is a surprisingly
strong relationship between the structure of WordNet (developed for studying
language) and the difficulty of visual categorization; d) classification can be
improved by exploiting the semantic hierarchy.

4.2 Deep Learning

The drawbacks of the existing methods are twofold: (a) lack of high accuracy and (b)
slow convergence rate. Since wrong identification leads to misleading results,
accuracy must be exceedingly high in classification and segmentation techniques.
Also, these techniques must possess a faster convergence rate which will make them
practically feasible for real-time applications. These problems can be overcome by
using artificial intelligence techniques and by performing suitable modifications on
the existing conventional algorithms.

4.3 To Recognize Shapes and Generate Images

This paper is from Geoffrey Hinton at the University of Toronto and the Canadian
Institute for Advanced Research and deals with the problem of training multilayer
neural networks [1]. It is an overview of the many different strategies that are used to
train multilayer neural networks today.
First it discusses five strategies for training neural networks: denial,
evolution, procrastination, calculus, and generative. Out of these five strategies, the
most significant ones are strategies four and five. Calculus includes the strategy of
backpropagation, which has been independently discovered by multiple researchers.
Generative includes the wake-sleep algorithm.

Fig. 4.1 Image classification and segmentation

4.4 Learning Multiple Layers of Representation


This is a paper by Geoffrey Hinton that deals with the problem of training multilayer
neural networks [2]. Training multilayer neural networks has been done mostly using
the backpropagation algorithm. Backpropagation is also the first computationally
feasible algorithm that can be used to train multiple layers. However,
backpropagation also has several limitations, including requiring labelled data and
becoming slow when used on neural networks with a large number of layers.
The backpropagation algorithm was originally introduced in the 1970s, but its
importance wasn't fully appreciated until a famous 1986 paper by David
Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper describes several
neural networks where backpropagation works far faster than earlier approaches to
learning, making it possible to use neural nets to solve problems which had
previously been insoluble. Today, the backpropagation algorithm is the workhorse of
learning in neural networks.
Backpropagation is a practical realization of the gradient descent algorithm in
multilayered neural networks. It calculates the gradient of the loss function with
respect to all the weights in the network by iteratively applying the multivariable
chain rule.
Applying the backpropagation algorithm to a neural network is a two-step process:
we first propagate the input values through the network and calculate the errors, and
then we backpropagate the errors through the network backwards to adjust the
connection weights in order to minimize the error. The algorithm calculates the
gradient of the loss function with respect to the weights between the hidden layer and
output layer nodes, and then it proceeds to calculate the gradient of the loss function
with respect to the weights between the input layer and hidden layer nodes. After
calculating the gradients, it subtracts them from the corresponding weight vectors to
get the new weights for the connections. This process is repeated until the network
produces the desired outputs.
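The two passes described above can be sketched in NumPy for a network with one hidden layer. This is an illustrative sketch under stated assumptions, not the implementation used in this project: the toy data, the layer sizes, the ReLU hidden activation, the softmax output, and the learning rate that scales the gradients before subtraction are all chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 4 samples, 3 input features, 2 classes.
x = rng.normal(size=(4, 3))
labels = np.array([0, 1, 1, 0])
targets = np.eye(2)[labels]                  # one-hot encoded labels

w1 = rng.normal(scale=0.5, size=(3, 5))      # input -> hidden weights
w2 = rng.normal(scale=0.5, size=(5, 2))      # hidden -> output weights
learning_rate = 0.1


def forward(x, w1, w2):
    y = np.maximum(0.0, x @ w1)              # hidden layer values (ReLU)
    z = y @ w2                               # output layer values
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)   # softmax over the classes
    return y, probs


def cross_entropy(probs, targets):
    return -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))


losses = []
for _ in range(200):
    # Forward pass: propagate the inputs and measure the error.
    y, probs = forward(x, w1, w2)
    losses.append(cross_entropy(probs, targets))

    # Backward pass: chain rule from the output layer back to the input layer.
    grad_z = (probs - targets) / len(x)      # d loss / d z for softmax + cross-entropy
    grad_w2 = y.T @ grad_z                   # gradient for hidden -> output weights
    grad_y = grad_z @ w2.T                   # error propagated back to the hidden layer
    grad_y[y <= 0] = 0.0                     # ReLU passes gradient only where it was active
    grad_w1 = x.T @ grad_y                   # gradient for input -> hidden weights

    # Gradient descent step: subtract the scaled gradients from the weights.
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print(losses[0], losses[-1])
```

The gradient for the hidden-to-output weights is computed first and the gradient for the input-to-hidden weights second, in the same order as in the description above.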

The whole process is better explained with an example. In the neural network
implemented in this project, the gradient of the loss function with respect to the
weights between the hidden layer and output layer nodes can be computed as
follows:

Equation 4.4.1 Calculation of gradient of loss function

where loss is the cross-entropy loss and z is a vector that holds the values
of the output layer nodes.

Going further down the line, the gradient of the loss function with respect to the
weights between the input layer and hidden layer nodes is calculated with the
following formula:
Equation 4.4.2 Calculation of gradient of loss function

where y is a vector that holds the values of the hidden layer nodes.
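Written out with the chain rule, the two gradients described above take the following general shape. The symbols $W_1$ (weights between the input and hidden layers) and $W_2$ (weights between the hidden and output layers) are names assumed here for illustration; loss, z, and y are as defined in the text. This is a sketch of the structure of the computation and may differ in notation from Equations 4.4.1 and 4.4.2.

```latex
\frac{\partial\,\text{loss}}{\partial W_2}
  = \frac{\partial\,\text{loss}}{\partial z}\,\frac{\partial z}{\partial W_2},
\qquad
\frac{\partial\,\text{loss}}{\partial W_1}
  = \frac{\partial\,\text{loss}}{\partial z}\,
    \frac{\partial z}{\partial y}\,
    \frac{\partial y}{\partial W_1}
```

The factor $\partial\,\text{loss}/\partial z$ appears in both expressions, which is why the output-layer gradient is computed first and then reused for the earlier layer.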

4.5 Google Deepmind


Google has been focusing much more on its machine learning and data science
departments. Almost every product at Google uses some sort of machine learning.
For example, Google Adsense uses data science to better target ads towards
customers and Picasa uses machine learning to recognize faces in images.
One of the more interesting Google products using machine learning and data
science is DeepMind. DeepMind Technologies was a tech startup based in London
that was acquired by Google near the beginning of 2014. Their goal is to combine
the best techniques from machine learning and systems neuroscience to build
powerful general purpose learning algorithms.
DeepMind has, in fact, trained a neural network to play video games, including
classics like Pong and Space Invaders.

CHAPTER 5

SYSTEM ANALYSIS AND DESIGN

5.1 Technology Used

OS: Windows

SOFTWARE: NumPy, SciPy, Cython, Python, TensorFlow

VERSION: PYTHON 3.5.3

5.1.1 NumPy

NumPy is the fundamental package for scientific computing with Python. It contains
among other things:

a powerful N-dimensional array object


sophisticated (broadcasting) functions

tools for integrating C/C++ and Fortran code

useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary data-types can be defined. This
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
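A short sketch of the N-dimensional array object, broadcasting, and the linear algebra support listed above. The sample values are hypothetical and stand in for pixel intensities.

```python
import numpy as np

# Hypothetical batch of 2 tiny "images", each with 3 pixel values in 0..255.
pixels = np.array([[0, 128, 255],
                   [64, 64, 64]], dtype=np.float64)

# Broadcasting: the scalar 255.0 is applied to every element of the array.
scaled = pixels / 255.0

# Broadcasting: the per-column mean (shape (3,)) is subtracted from each row.
centered = scaled - scaled.mean(axis=0)

# Linear algebra: matrix product of the (2, 3) array with its (3, 2) transpose.
gram = centered @ centered.T

print(scaled.shape, centered.shape, gram.shape)
```

None of these operations needs an explicit Python loop, which is the main reason NumPy is used for the array handling in projects of this kind.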

5.1.2 SciPy

SciPy is an open source Python library used for scientific computing and technical
computing. SciPy contains modules for optimization, linear algebra, integration,
interpolation, special functions, FFT, signal and image processing, ODE solvers and
other tasks common in science and engineering.
SciPy builds on the NumPy array object and is part of the NumPy stack which
includes tools like Matplotlib, pandas and SymPy. There is an expanding set of
scientific computing libraries that are being added to the NumPy stack every day.
This NumPy stack has similar users to other applications such as MATLAB, GNU
Octave, and Scilab. The NumPy stack is also sometimes referred to as the SciPy
stack.

5.1.3 Cython Compiler
Cython is an optimizing static compiler for both the Python programming language
and the extended Cython programming language (based on Pyrex). It makes writing
C extensions for Python as easy as Python itself.
The Cython language is a superset of the Python language that additionally supports
calling C functions and declaring C types on variables and class attributes. This
allows the compiler to generate very efficient C code from Cython code. The C code
is generated once and then compiles with all major C/C++ compilers in CPython 2.6,
2.7 (2.4+ with Cython 0.20.x) as well as 3.2 and all later versions. The Cython
developers regularly run integration tests against all supported CPython versions and
their latest in-development branches to make sure that the generated code stays
widely compatible and well adapted to each version. PyPy support is a work in
progress (on both sides) and is considered mostly usable since Cython 0.17. The
latest PyPy version is always recommended here.
All of this makes Cython the ideal language for wrapping external C libraries,
embedding CPython into existing applications, and for fast C modules that speed up
the execution of Python code.

5.1.4 Python

Python is a clear and powerful object-oriented programming language, comparable to


Perl, Ruby, Scheme, or Java.

Some of Python's notable features:

Uses an elegant syntax, making the programs you write easier to read.
Is an easy-to-use language that makes it simple to get your program working.
This makes Python ideal for prototype development and other ad-hoc
programming tasks, without compromising maintainability.

Comes with a large standard library that supports many common


programming tasks such as connecting to web servers, searching text with
regular expressions, reading and modifying files.

Python's interactive mode makes it easy to test short snippets of code. There's
also a bundled development environment called IDLE.

Is easily extended by adding new modules implemented in a compiled
language such as C or C++.

Can also be embedded into an application to provide a programmable


interface.

Runs anywhere, including Mac OS X, Windows, Linux, and Unix.


Is free software in two senses. It doesn't cost anything to download or use
Python, or to include it in your application. Python can also be freely
modified and re-distributed, because while the language is copyrighted it's
available under an open source license.

5.1.5 TensorFlow

TensorFlow is an open source software library for numerical computation using data
flow graphs. Nodes in the graph represent mathematical operations, while the graph
edges represent the multidimensional data arrays (tensors) communicated between
them. The flexible architecture allows you to deploy computation to one or more
CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow
was originally developed by researchers and engineers working on the Google Brain
Team within Google's Machine Intelligence research organization for the purposes of
conducting machine learning and deep neural networks research, but the system is
general enough to be applicable in a wide variety of other domains as well.

5.1.6 CIFAR-10 DataSet

CIFAR-10 is a popular computer vision dataset that is used by object recognition
algorithms. It is a labeled subset of the 80 Million Tiny Images dataset and was collected
by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The dataset is made up of
60,000 32x32 color images that are organized in 10 classes, each of which consists of
6,000 images. These 60,000 images are divided into 50,000 training images and
10,000 test images.

The pictures are stored in memory as six 10,000 x 3,072 NumPy arrays where each
row represents a single image. The 3,072 values in each row are logically divided into
3 chunks of 1,024 values, representing the red, green, and blue channel values,
respectively. Each channel is stored in row-major order, which means that the i-th
block of 32 values within a chunk holds that channel's values for the i-th row
of the actual image.

The labels are stored as six lists of 10,000 elements, where each element is an integer
between 0 and 9. These numbers map to the class names, so that 0 maps to the
airplanes class, 1 maps to automobiles etc. The i-th element in the labels list is the
label for the i-th picture in the numpy array that was described in the previous
paragraph.
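The row layout described above can be turned back into a 32x32 color image with a reshape and a transpose. The sketch below uses random values as a small stand-in for one batch (100 rows instead of 10,000), so no dataset files are needed; the function name `row_to_image` is hypothetical.

```python
import numpy as np

# Stand-in for part of one batch: rows of 3,072 values (random data, not real CIFAR-10).
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(100, 3072), dtype=np.uint8)


def row_to_image(row):
    """Convert one 3,072-value row into a (height, width, channel) image."""
    channels = row.reshape(3, 32, 32)     # red, green, blue planes, each row-major
    return channels.transpose(1, 2, 0)    # move the channel axis to the end


image = row_to_image(batch[0])
print(image.shape)
```

After the conversion, `image[r, c]` holds the red, green, and blue values of the pixel in row `r` and column `c`, which is the layout most plotting and image libraries expect.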

Here are the 10 different classes in the dataset and some examples of the pictures that
belong to these classes:

Fig 5.1 Example of the CIFAR 10 Dataset

5.2 DESIGN

Fig. 5.2 Design to determine accuracy

CHAPTER 6

IMPLEMENTATION

6.1 NEURAL NETWORKS

Neural networks are very loosely based on how biological brains work. They consist
of a number of artificial neurons, each of which processes multiple incoming signals
and returns a single output signal. The output signal can then be used as an input
signal for other neurons.

Let's take a look at an individual neuron:

Fig 6.1 An individual neuron

We have a vector of input values and a vector of weights. The weights are the
neuron's internal parameters. Both the input vector and the weights vector contain the
same number of values, so we can use them to calculate a weighted sum.

$$\text{WeightedSum} = \text{input}_1 w_1 + \text{input}_2 w_2 + \dots$$

As long as the result of the weighted sum is a positive value, the neuron's output is
this value. But if the weighted sum is a negative value, we ignore that negative value
and the neuron generates an output of 0 instead. This operation is called a Rectified
Linear Unit (ReLU).

Fig 6.2 Rectified Linear Unit, which is defined by f(x) = max (0,x)

The reason for using a ReLU is that it creates a nonlinearity. The neuron's output
is no longer strictly a linear combination (= weighted sum) of its inputs.
We'll see why this is useful when we stop looking at individual neurons and instead
look at the whole network.
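A single neuron of this kind, a weighted sum followed by a ReLU, fits in a few lines of Python. The function name `neuron_output` and the sample inputs and weights are hypothetical.

```python
def neuron_output(inputs, weights):
    """Weighted sum of the inputs followed by a ReLU, f(x) = max(0, x)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return max(0.0, weighted_sum)


positive_case = neuron_output([1.0, 2.0], [0.5, 0.25])   # weighted sum is 1.0
negative_case = neuron_output([1.0, 2.0], [0.5, -1.0])   # weighted sum is -1.5
print(positive_case, negative_case)
```

In the first call the positive weighted sum is passed through unchanged; in the second call the negative weighted sum is replaced by 0.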

The neurons in artificial neural networks are usually not connected randomly to each
other. Most of the time they are arranged in layers:

Fig 6.3 Architecture and functioning of a neural network

The input image's pixel values are the inputs for the network's first layer of neurons.
The output of the neurons in layer 1 is the input for the neurons of layer 2.

This is the reason why having a nonlinearity is so important. Without the ReLU at
each layer, we would only have a sequence of weighted sums. And stacked weighted
sums can be merged into a single weighted sum, so the multiple layers would give us
no improvement over a single-layer network. Introducing the ReLU nonlinearity
solves this problem, as each additional layer really adds something to the network.
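The claim that stacked weighted sums merge into a single weighted sum can be checked numerically. In the sketch below the small weight matrices and the input are hypothetical; two purely linear layers give the same result as one layer whose weights are the product of the two matrices, while inserting a ReLU between them changes the result.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(vector):
    return [max(0, v) for v in vector]


w1 = [[1, -1], [-1, 1]]   # hypothetical layer-1 weights
w2 = [[1, 1]]             # hypothetical layer-2 weights
x = [2, 1]

stacked = matvec(w2, matvec(w1, x))           # two linear layers applied in sequence
merged = matvec(matmul(w2, w1), x)            # one layer with the merged weights
with_relu = matvec(w2, relu(matvec(w1, x)))   # same layers with a ReLU in between
print(stacked, merged, with_relu)
```

Here `stacked` and `merged` are identical, so the second linear layer added nothing, whereas `with_relu` differs because the ReLU removed the negative intermediate value before the second weighted sum.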

The network's final layer's outputs are the values we are interested in: the scores
for the image categories. In this network architecture each neuron is connected to all
neurons of the previous layer, so this kind of network is called a fully
connected network.

6.2 Activation function


Each node in the neural network takes many inputs and produces a single output
based on the weighted sum of these inputs.
A relatively simple model that achieves this functionality, the perceptron, was
invented in 1957 by Frank Rosenblatt. The perceptron sums together all the
inputs multiplied by their corresponding weights, compares the result to some
threshold and outputs a discrete value based on this comparison. For example, if the
sum is greater than or equal to the threshold, the node outputs 1; otherwise the node
emits 0. Mathematically, it can be formalized as:

Equation 6.2.1 Calculation of sum of inputs and corresponding weights
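A standard way to write this rule, assuming the threshold is a fixed constant and using the names from the description above, is:

```latex
\text{output} =
\begin{cases}
1 & \text{if } \sum_i w_i x_i \ge \text{threshold} \\
0 & \text{if } \sum_i w_i x_i < \text{threshold}
\end{cases}
```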

where w and x are the weight and input vectors, respectively.


By moving the threshold to the left side in both equations and replacing it with the
bias term b, which is equal to the negative threshold (b = -threshold), we get a new
and more commonly used representation of the perceptron rule:

Equation 6.2.2. Perceptron rule
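With the bias term b = -threshold, a form consistent with the description above would be:

```latex
\text{output} =
\begin{cases}
1 & \text{if } w \cdot x + b \ge 0 \\
0 & \text{if } w \cdot x + b < 0
\end{cases}
```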

where w and x are the weight and input vectors, respectively, and b is the bias term.
Although in theory perceptrons can be used to compute any function, they are almost
never used in practice, mostly due to their discrete nature: a small change in the
weights might cause a perceptron to produce a drastically different output, which is
not desirable for learning. Accordingly, in this neural network implementation, a
sigmoid activation function is used instead of the perceptron. It is quite
similar to the perceptron, but gets rid of this shortcoming. The formula for the sigmoid
function is the following:

Equation 6.2.3 Sigmoid function
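In its usual form, with z as an assumed name for the real-valued input, the sigmoid (logistic) function is:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}
```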


When representing it as a node in the neural network with weights, inputs and biases,
the function takes the following form:

Equation 6.2.4. Sigmoid function with weights, inputs and biases
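Substituting the weighted sum plus bias for the input, and keeping the symbols w, x and b used for the perceptron, a node's output can be written as:

```latex
\sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}
```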

The intuition behind the function is that it takes a real valued input and outputs a
value between 0 and 1. The bigger the input value, the closer the output is to 1, and
vice versa. In that respect, the sigmoid function is very similar to the perceptron,
because for most of the inputs, it produces an output that is really close to either 1
or 0, as can be seen from the following graph:

Fig. 6.4. The sigmoid activation function.
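The difference between the perceptron's discrete output and the sigmoid's smooth output can be seen in a small experiment. The following is a minimal sketch in plain Python; the function names and the example weights are assumptions made for illustration.

```python
import math


def perceptron(weights, inputs, bias):
    """Step rule: 1 if w . x + b >= 0, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^(-z)); the result lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(weights, inputs, bias):
    """Same weighted sum as the perceptron, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


inputs = [1.0, 1.0]
weights_a = [0.5, -0.49]   # weighted sum slightly above 0
weights_b = [0.5, -0.51]   # weighted sum slightly below 0

# A tiny change in one weight flips the perceptron's output from 1 to 0 ...
print(perceptron(weights_a, inputs, 0.0), perceptron(weights_b, inputs, 0.0))
# ... while the sigmoid neuron's output changes only slightly.
print(sigmoid_neuron(weights_a, inputs, 0.0), sigmoid_neuron(weights_b, inputs, 0.0))
```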
6.3 Loss function
The task of the neural network implemented in this project is to learn to classify
images. In order to learn something, a learner must be given some feedback about
its current performance. The job of the loss function is precisely that: to evaluate the
accuracy of the neural network. As the name suggests, if the loss is low, the neural
network is doing a good job at classifying the images, and the loss will be high if the
network is not guessing the right classes.
In order to calculate the loss for a specific guess, the neural network's output must
first be interpreted as class scores. This is the job of the score function, which takes
the values from the output layer nodes and calculates the probability that a given
input represents a specific class. The score function used in this project is called the
softmax function and is given by the following formula:

Equation.6.3.1.Softmax function
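Using p_i as an assumed name for the probability of class i, the softmax function is usually written as:

```latex
p_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j}}
```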

where z is a vector of output nodes and group denotes the set of indexes of every
node in the output layer.
To illustrate the workings of the score function, an example is helpful. Let's
say we have a neural network that has the job of classifying data into 3 different classes.
For this, we need 3 nodes in the output layer, each voting for a different class (the first
neuron represents the first class, the second neuron represents the second class, etc.). The
3 nodes in the output layer could have the values 3, 4 and 5, so z = (3, 4, 5). After
calculating the softmax probability distribution, we get p = (0.090, 0.245, 0.665). Since
the third node has the highest score, we say that the network is classifying the input
to be from class 3.
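The three-class example can be reproduced with a short softmax function in plain Python. This is a sketch for illustration; the function name `softmax` and the shift by the maximum score (a common trick to avoid overflow) are choices made here, not taken from the project code.

```python
import math


def softmax(z):
    """Turns a list of scores into probabilities that sum to 1."""
    largest = max(z)
    exps = [math.exp(v - largest) for v in z]  # shifting by the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


p = softmax([3.0, 4.0, 5.0])
predicted_class = p.index(max(p)) + 1   # classes are numbered from 1 in this example
print([round(v, 3) for v in p], predicted_class)
```

For z = (3, 4, 5) the probabilities come out at roughly 0.090, 0.245 and 0.665, so the third class is predicted.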
After calculating the probability score for each class, the cross-entropy loss function
can be used to calculate the network's total loss:

$$C = -\sum_i t_i \log p_i$$

Equation 6.3.2. Calculation of loss function


where t is a vector representing the correct class and p is the probability distribution
that was described above. The target vector t has the element 0 at every position,
except at the index that corresponds to the integer representing the correct class,
where the value is 1. For example, with 10 classes indexed from 0, if the correct class
is 3, the target vector is (0, 0, 0, 1, 0, 0, 0, 0, 0, 0).
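To make the target vector and the loss concrete, the sketch below builds a one-hot target and evaluates the standard cross-entropy, C = -sum(t_i * log(p_i)), in plain Python. The helper names `one_hot` and `cross_entropy` and the probability values are assumptions for illustration.

```python
import math


def one_hot(correct_class, num_classes):
    """Target vector: 1 at the index of the correct class, 0 elsewhere."""
    return [1.0 if i == correct_class else 0.0 for i in range(num_classes)]


def cross_entropy(t, p):
    """C = -sum_i t_i * log(p_i); positions with t_i = 0 contribute nothing."""
    return -sum(ti * math.log(pi) for ti, pi in zip(t, p) if ti > 0)


t = one_hot(3, 10)
print(t)   # 1.0 at index 3, 0.0 everywhere else

p = [0.09, 0.245, 0.665]                      # example probabilities for 3 classes
loss_good = cross_entropy(one_hot(2, 3), p)   # correct class was given probability 0.665
loss_bad = cross_entropy(one_hot(0, 3), p)    # correct class was given probability 0.09
print(loss_good, loss_bad)
```

The loss is small when the network assigns a high probability to the correct class and large when it assigns a low one.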

6.4 Gradient descent


In the previous section we defined a loss function, which essentially tells us how
accurately the neural network is able to classify data. Now, we need a way to actually
act on this information and somehow minimize the loss of the network during
training by changing the weights between different nodes. That is done with the
gradient descent algorithm.

Gradient descent works by finding the gradient of the loss function with respect to
the weights of the network, telling us the direction where the function increases the
most. In order to minimize the loss, it is required to subtract a fraction of the gradient
from the corresponding weight vector. To understand the algorithm intuitively, an
analogy can be used.

One can imagine being placed blindfolded at the top of a mountain and given the task
of finding a way down. The strategy a blindfolded person would use would be to feel the
ground just in front of him and take a small step in the direction where the ground
descends most steeply. Doing this iteratively, the person is guaranteed to arrive at a
location where the ground just around him is not descending. Since the mountain does
not slope downwards at every location, but rather has very uneven
terrain, it is not guaranteed that the place where the person feels the ground to be
horizontal in every direction is actually the lowest place on the mountain. When it is
not, it is called a local minimum, which can be visualized with the following picture:

Figure 6.5. A 3-D graph representing the gradient descent algorithm.

As can be seen from the graph, the hikers move along the paths (depicted as red
lines) that continuously take them down the mountain while it is descending, but fail
to reach the absolute bottom once the ground becomes flat.
In order to get out of local minima, several methods can be used, most notably
adjusting the learning rate, which is intuitively the length of the steps that the hiker
takes in a specific direction. When the learning rate is low, the hiker takes very small
steps and can be almost certain that he is descending, but it takes a lot of time to
reach the bottom of the mountain. On the other hand, when the learning rate is high
and he takes long steps, he might reach the bottom faster, but with a relatively high
chance of going in the wrong direction. For example, if the hiker feels that the
ground is descending just in front of him, but then takes a very big step in that direction,
he might end up at a location that is actually higher than where he was previously.
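The effect of the learning rate can be tried out on a toy problem. The sketch below runs plain gradient descent on the one-dimensional loss f(x) = x**2, whose minimum is at x = 0. The function name `gradient_descent`, the toy loss and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly subtracts a fraction of the gradient from the current position."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


def grad(x):
    """Gradient of the toy loss f(x) = x**2."""
    return 2.0 * x


small_steps = gradient_descent(grad, start=1.0, learning_rate=0.01, steps=20)
good_steps = gradient_descent(grad, start=1.0, learning_rate=0.1, steps=20)
too_large = gradient_descent(grad, start=1.0, learning_rate=1.1, steps=20)
print(small_steps, good_steps, too_large)
```

With the small rate the position creeps towards 0, with the moderate rate it gets much closer in the same number of steps, and with the overly large rate every step overshoots the minimum and the position moves further away from it.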

Another straightforward method that is used to improve gradient descent and to help
it get out of local minima is called the momentum update, which adds a fraction of
the previous weight update to the current one. The intuition behind the method
is very simple and can be easily derived from its name: the faster the hiker is
descending (i.e. the steeper the hill), the more momentum he has due to inertia when
he makes his next step. Hence the step will be longer if he is descending fast, and
shorter when the descent is not as drastic.
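A momentum update can be sketched as follows, again on the toy loss f(x) = x**2. The function name `momentum_descent` and the momentum value 0.5 are assumptions for illustration; setting the momentum to 0 gives plain gradient descent.

```python
def momentum_descent(gradient, start, learning_rate, momentum, steps):
    """Gradient descent where each update adds a fraction of the previous update."""
    x = start
    update = 0.0
    for _ in range(steps):
        update = momentum * update - learning_rate * gradient(x)
        x = x + update
    return x


def grad(x):
    """Gradient of the toy loss f(x) = x**2."""
    return 2.0 * x


plain = momentum_descent(grad, start=1.0, learning_rate=0.01, momentum=0.0, steps=20)
with_momentum = momentum_descent(grad, start=1.0, learning_rate=0.01, momentum=0.5, steps=20)
print(plain, with_momentum)
```

On this toy problem the run with momentum ends closer to the minimum after the same number of steps, because consecutive updates in the same direction build on each other.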

Softmax Classifier:
The softmax classifier is used to predict one class at a time. The Softmax Regression
model first computes a score $s_k(x)$ for each class k, then estimates the probability of
each class by applying the softmax function (also called the normalized exponential)
to the scores.

Equation 6.4.1 Calculation of class scores


Each class has its own dedicated parameter vector $\theta_k$. All these vectors are typically
stored as rows in a parameter matrix $\Theta$.

The softmax function is as follows:

Equation 6.4.2 Softmax function

K is the number of classes.

s(x) is a vector containing the scores of each class for the instance x.

$\sigma(s(x))_k$ is the estimated probability that the instance x belongs to class k,
given the scores of each class for that instance.
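As a sketch of how these pieces fit together, the NumPy code below assumes the usual linear score s_k(x) = theta_k . x, with the parameter vectors stored as rows of a matrix, and applies the softmax to the scores. The function name and the small parameter matrix are hypothetical values chosen for illustration.

```python
import numpy as np


def softmax_regression_predict(theta, x):
    """Scores s_k(x) = theta_k . x for every class k, then softmax over the scores."""
    scores = theta @ x                      # one score per row (class) of theta
    exps = np.exp(scores - scores.max())    # shift by the max for numerical stability
    probabilities = exps / exps.sum()
    return probabilities, int(np.argmax(probabilities))


# Hypothetical parameter matrix: K = 3 classes (rows), 2 input features (columns).
theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
x = np.array([2.0, 1.0])

probabilities, predicted = softmax_regression_predict(theta, x)
print(probabilities, predicted)   # scores are (2, 1, 3), so index 2 wins
```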

6.5 Backpropagation
Backpropagation is a practical realization of the gradient descent algorithm in
multilayered neural networks. It calculates the gradient of the loss function with
respect to all the weights in the network by iteratively applying the multivariable
chain rule.
Applying the backpropagation algorithm to a neural network is a two-pass process:
we first propagate the input values forward through the network and calculate the
errors, and then we propagate the errors backwards through the network to adjust the
connection weights in order to minimize the error. The algorithm calculates the
gradient of the loss function with respect to the weights between the hidden layer and
output layer nodes, and then it proceeds to calculate the gradient of the loss function
with respect to the weights between the input layer and hidden layer nodes. After
calculating the gradients, it subtracts a fraction of them (set by the learning rate)
from the corresponding weight vectors to get the new weights for the connections.
This process is repeated until the network produces the desired outputs.

Fig. 6.6 Backpropagation with Softmax classifier
The whole process is better explained with an example. In the neural network
implemented in this project, the gradient of the loss function with respect to the
weights between the hidden layer and output layer nodes can be computed as
follows:

Equation 6.5.1. Calculation of gradient of loss function

where loss is the cross-entropy loss described above and z is a vector that holds the
values of the output layer nodes.
Going further down the line, the gradient of the loss function with respect to the
weights between the input layer and hidden layer nodes is calculated with the
following formula:

Equation 6.5.2 Calculation of gradient of loss function w.r.t weights

where y is a vector that holds the values of the hidden layer nodes.
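The two gradient calculations can be sketched with NumPy for a single training example. The code below is an illustration under stated assumptions, not the project's formulas: it assumes a ReLU hidden layer, no bias terms, a softmax output with cross-entropy loss, and randomly chosen weights. A finite-difference check is included to compare the analytic gradients with numerical ones.

```python
import numpy as np


def forward(x, w1, w2, label):
    """Forward pass: returns hidden values y, probabilities p and the loss."""
    y = np.maximum(0.0, x @ w1)          # hidden layer values (ReLU)
    z = y @ w2                           # output layer values
    p = np.exp(z - z.max())
    p /= p.sum()                         # softmax probabilities
    return y, p, -np.log(p[label])       # cross-entropy loss


def backward(x, w1, w2, label):
    """Backward pass: gradients of the loss w.r.t. w1 and w2 via the chain rule."""
    y, p, _ = forward(x, w1, w2, label)
    dz = p.copy()
    dz[label] -= 1.0                     # d loss / d z = p - t
    grad_w2 = np.outer(y, dz)            # hidden-to-output weights
    dy = w2 @ dz                         # error propagated back to the hidden layer
    dy[y <= 0.0] = 0.0                   # ReLU passes gradient only where it was active
    grad_w1 = np.outer(x, dy)            # input-to-hidden weights
    return grad_w1, grad_w2


rng = np.random.default_rng(0)
x = rng.normal(size=4)
w1 = rng.normal(size=(4, 5))
w2 = rng.normal(size=(5, 3))
label = 1
grad_w1, grad_w2 = backward(x, w1, w2, label)


def numeric_grad(w, idx, eps=1e-6):
    """Central finite difference of the loss w.r.t. one weight (w is w1 or w2)."""
    old = w[idx]
    w[idx] = old + eps
    up = forward(x, w1, w2, label)[2]
    w[idx] = old - eps
    down = forward(x, w1, w2, label)[2]
    w[idx] = old
    return (up - down) / (2 * eps)


checks = [(w1, grad_w1, (0, 0)), (w2, grad_w2, (0, 0))]
max_error = max(abs(numeric_grad(w, idx) - g[idx]) for w, g, idx in checks)
print(max_error)   # should be close to zero if the analytic gradients are right
```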

6.6 The Code

The code is split into two files: two_layer_neural_network.py, which defines the

model, and run_neural_network.py, which runs the model.

6.6.1 Two-Layer Fully Connected Neural Network

two_layer_neural_network.py contains the following functions:

inference() gets us from input data to class scores.

loss() calculates the loss value from class scores.

training() performs a single training step.

evaluation() calculates the accuracy of the network.

6.6.1.1 inference()

inference() describes the forward pass through the network: how are the class
scores calculated, starting from the input images?

####################### inference ###############################

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.variable_scope, tf.contrib)

def inference(images, image_pixels, hidden_units, classes, reg_constant=0):
    '''Builds the forward pass from input images to class scores (logits).'''

    # Layer 1
    with tf.variable_scope('Layer1'):
        # Define the variables
        weights = tf.get_variable(
            name='weights',
            shape=[image_pixels, hidden_units],
            initializer=tf.truncated_normal_initializer(
                stddev=1.0 / np.sqrt(float(image_pixels))),
            regularizer=tf.contrib.layers.l2_regularizer(reg_constant))
        biases = tf.Variable(tf.zeros([hidden_units]), name='biases')

        # The layer 1 output
        hidden = tf.nn.relu(tf.matmul(images, weights) + biases)

    # Layer 2
    with tf.variable_scope('Layer2'):
        # Define the variables
        weights = tf.get_variable('weights', [hidden_units, classes],
            initializer=tf.truncated_normal_initializer(
                stddev=1.0 / np.sqrt(float(hidden_units))),
            regularizer=tf.contrib.layers.l2_regularizer(reg_constant))
        biases = tf.Variable(tf.zeros([classes]), name='biases')

        # The layer 2 output (no ReLU on the final layer)
        logits = tf.matmul(hidden, weights) + biases

    # Return the class scores so the caller can build loss and accuracy on them
    return logits

##############################################################

6.6.1.2 Calculating the Loss: loss()

First we calculate the cross-entropy between logits (the model's output)
and labels (the correct labels from the training dataset).

Our goal for regularization is to arrive at a simple model without any
unnecessary complications. There are different ways to achieve this, and the
option we are choosing is called L2-regularization. L2-regularization adds
the sum of the squares of all the weights in the network to the loss function.
This corresponds to a heavy penalty if the model is using big weights and a
small penalty if the model is using small weights.

That's why we used the regularizer parameter when defining the weights and
assigned an l2_regularizer to it. This tells TensorFlow to keep track of the
L2-regularization terms (and weigh them by the parameter reg_constant) for this
variable.

All regularization terms are added to a collection
called tf.GraphKeys.REGULARIZATION_LOSSES, which the loss function
accesses. We then add the sum of all regularization losses to the previously
calculated cross-entropy to arrive at the total loss of our model.

######################## loss #############################

def loss(logits, labels):
    '''Calculates the loss from logits and labels.'''

    with tf.name_scope('Loss'):
        # Operation to determine the cross entropy between logits and labels
        cross_entropy = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                logits=logits, labels=labels, name='cross_entropy'))

        # Operation for the loss function: cross-entropy plus all L2 terms
        loss = cross_entropy + tf.add_n(tf.get_collection(
            tf.GraphKeys.REGULARIZATION_LOSSES))

        # Add a scalar summary for the loss
        tf.summary.scalar('loss', loss)

    return loss

##############################################################
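To get a feeling for what the regularization terms contribute, the NumPy sketch below computes an L2 penalty by hand. It assumes the penalty is reg_constant times half the sum of the squared weights, which is how tf.nn.l2_loss (used by l2_regularizer) defines it. The helper name `l2_penalty` and all numbers are made up for illustration.

```python
import numpy as np


def l2_penalty(weight_matrices, reg_constant):
    """reg_constant * sum(w ** 2) / 2, summed over all weight matrices."""
    return reg_constant * sum(np.sum(w ** 2) / 2.0 for w in weight_matrices)


small_weights = [np.full((2, 2), 0.1)]    # four weights of 0.1
large_weights = [np.full((2, 2), 10.0)]   # four weights of 10.0

cross_entropy = 1.5                       # made-up cross-entropy value
total_small = cross_entropy + l2_penalty(small_weights, reg_constant=0.1)
total_large = cross_entropy + l2_penalty(large_weights, reg_constant=0.1)
print(total_small, total_large)           # the large weights are penalized far more
```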

6.6.1.3 Optimizing the Variables: training()

global_step is a scalar variable which keeps track of how many training iterations
have already been performed. When repeatedly running the model in our training
loop, we already know this value: it's the iteration variable of the loop. The reason
we're adding this value directly to the TensorFlow graph is that we want to be able to
take snapshots of the model, and these snapshots should include information about
how many training steps have already been performed.

The definition of the gradient descent optimizer is simple. We provide the learning
rate and tell the optimizer which variable it is supposed to minimize. In addition, the
optimizer automatically increments the global_step parameter with every iteration.

########################## training #############################

def training(loss, learning_rate):
    '''Sets up the training operation.'''

    # Create a variable to track the global step
    global_step = tf.Variable(0, name='global_step', trainable=False)

    # Create a gradient descent optimizer
    # (which also increments the global step counter)
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
        loss, global_step=global_step)

    return train_step

#############################################################

6.6.1.4 Measuring Performance: evaluation()

We compare the model's predictions with the true labels and calculate how often
the prediction is correct. We're also interested in how the accuracy evolves
over time, so we're adding a summary operation which keeps track of the value
of accuracy.

########################### evaluation ########################

def evaluation(logits, labels):
    '''Evaluates the quality of the logits at predicting the label.'''

    with tf.name_scope('Accuracy'):
        # Operation comparing prediction with true label
        correct_prediction = tf.equal(tf.argmax(logits, 1), labels)

        # Operation calculating the accuracy of the predictions
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

        # Summary operation for the accuracy
        tf.summary.scalar('train_accuracy', accuracy)

    return accuracy

##########################################################
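The same accuracy calculation can be followed step by step outside of TensorFlow. The NumPy sketch below mirrors the argmax, compare and mean steps on a handful of made-up scores; the function name `accuracy` and the data are assumptions for illustration.

```python
import numpy as np


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == labels))


logits = np.array([[0.1, 2.0, 0.3],    # highest score: class 1
                   [1.5, 0.2, 0.1],    # highest score: class 0
                   [0.0, 0.1, 3.0],    # highest score: class 2
                   [0.9, 0.8, 0.7]])   # highest score: class 0
labels = np.array([1, 0, 2, 2])

print(accuracy(logits, labels))        # 3 of the 4 predictions are correct
```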

Returning to inference(), the forward pass through the network: how are the class
scores calculated, starting from the input images?

Each neuron takes all values from the previous layer as input and generates a single

output value. Each neuron in the hidden layer therefore has image_pixels inputs and

the layer as a whole generates hidden_units outputs. These are then fed into

the classes neurons of the output layer which generate classes output values, one

score per class.

reg_constant is the regularization constant. TensorFlow allows us to add

regularization to our network very easily by handling most of the calculations

automatically.

Fig 6.7. Regularization in process

We use tf.get_variable(), which allows us to add regularization. weights is a matrix
with dimensions of image_pixels by hidden_units (input vector size x output vector
size). The initializer parameter describes the weight variable's initial values. Up to
now, we've initialized our variables to 0, but this wouldn't work here. Think about
the neurons in a single layer. They all receive exactly the same input values. If they
all had the same internal parameters as well, they would all make the same
calculation and all output the same value. To avoid this, we need to randomize their
initial weights. We use an initialization scheme which usually works well: the
weights are initialized to normally distributed values. We drop values which are more
than 2 standard deviations from the mean, and the standard deviation is set to the
inverse of the square root of the number of input pixels.
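The initialization scheme can be imitated with NumPy to see what values it produces. The sketch below redraws every value that falls more than 2 standard deviations from the mean; the helper name `truncated_normal` and the choice of 120 hidden units are assumptions for illustration, while 3072 corresponds to 32 x 32 pixels with 3 color channels.

```python
import numpy as np


def truncated_normal(shape, stddev, rng):
    """Normal samples with mean 0; values beyond 2 standard deviations are redrawn."""
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2.0 * stddev
    while out_of_range.any():
        values[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(values) > 2.0 * stddev
    return values


image_pixels, hidden_units = 3072, 120   # 120 hidden units is an arbitrary choice
stddev = 1.0 / np.sqrt(float(image_pixels))
weights = truncated_normal((image_pixels, hidden_units), stddev,
                           np.random.default_rng(0))

print(stddev, np.abs(weights).max())     # no weight exceeds 2 * stddev in magnitude
```

Because the weights differ from neuron to neuron, the neurons of a layer no longer compute identical outputs for the same input.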

To create the first layer's output we multiply the images matrix and
the weights matrix with each other and add the bias variable.
Then we apply tf.nn.relu(), the ReLU function, to arrive at the hidden layer's output.

Layer 2 is very similar to layer 1. The number of inputs is hidden_units, the number
of outputs is classes. Therefore the dimensions of the weights matrix
are [hidden_units, classes]. Since this is the final layer of our network, there's no
need for a ReLU anymore. We arrive at the class scores (logits) by multiplying the input
(hidden) and weights with each other and adding the bias.

To sum it up, the inference() function as a whole takes in input images and returns
class scores. That's all a trained classifier needs to do, but in order to arrive at a
trained classifier, we first need to measure how good those class scores are.

6.6.2 Running the Neural Network


Let's pretend we have 100 training images and a batch size of 10. If every batch is
picked at random from the training set, this means that after 10 iterations each image
will have been picked once on average. But in fact some images will have been picked
multiple times while some images haven't been part of any batch so far. As long as you
repeat this often enough, it's not that terrible that randomness causes some images to
be part of the training batches somewhat more often than others.

But we want to improve the sampling process. What we do is we first shuffle the 100
images of the training dataset. The first 10 images of the shuffled data are our first
batch, the next 10 images are our second batch and so forth. After 10 batches we're
at the end of our dataset and the process starts again. We shuffle the data another time
and run through it from front to back. This guarantees that no image is picked
more often than any other while still ensuring that the order in which the images are
returned is random.

In order to achieve this, the gen_batch() function in data_helpers returns a
Python generator, which returns the next batch each time it is evaluated.
We're using Python's built-in zip() function to generate a list of tuples of the
form [(image1, label1), (image2, label2), ...], which is then passed to our generator
function.

next(batches) returns the next batch of data. Since it's still in the form of [(imageA,
labelA), (imageB, labelB), ...], we need to unzip it first to separate images from
labels, before filling feed_dict, the dictionary containing the TensorFlow
placeholders, with a single batch of training data.
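The shuffle-then-slice scheme can be sketched as a small Python generator. The body below is an assumption about how such a function might look, since the project's data_helpers implementation is not shown; only the name gen_batch and its three arguments are taken from the call in the training script. The dummy data stands in for real images and labels.

```python
import random


def gen_batch(data, batch_size, num_iter):
    """Yields num_iter batches; reshuffles whenever the data has been used up."""
    data = list(data)
    index = len(data)                    # forces a shuffle before the first batch
    for _ in range(num_iter):
        if index + batch_size > len(data):
            random.shuffle(data)
            index = 0
        yield data[index:index + batch_size]
        index += batch_size


# 100 dummy (image, label) pairs standing in for the training set.
examples = list(zip(range(100), range(100)))
batches = gen_batch(examples, batch_size=10, num_iter=20)

# The first 10 batches together contain every example exactly once.
first_epoch = [pair for _ in range(10) for pair in next(batches)]

# Unzipping a batch separates the images from the labels.
images_batch, labels_batch = zip(*next(batches))
print(len(first_epoch), len(images_batch), len(labels_batch))
```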

Every 100 iterations the model's current accuracy is evaluated and printed to the
screen. In addition, the summary operation is run and its results are added to
the summary_writer, which is responsible for writing the summaries to disk.

After the training is finished, the final model is evaluated on the test set (remember,

the test set contains data that the model has not seen so far, allowing us to judge how

well the model is able to generalize to new data).

######################### run neural network ####################

import os

import tensorflow as tf  # TensorFlow 1.x API

import data_helpers
import two_layer_neural_network

# FLAGS, IMAGE_PIXELS, CLASSES and logdir are assumed to be defined
# earlier in run_neural_network.py; their definitions are not shown here.

# Load CIFAR-10 data
data_sets = data_helpers.load_data()

# Define input placeholders
images_placeholder = tf.placeholder(tf.float32, shape=[None, IMAGE_PIXELS],
    name='images')
labels_placeholder = tf.placeholder(tf.int64, shape=[None], name='image-labels')

# Operation for the classifier's result
logits = two_layer_neural_network.inference(images_placeholder, IMAGE_PIXELS,
    FLAGS.hidden1, CLASSES, reg_constant=FLAGS.reg_constant)

# Define summary operation for the 'logits' variable
tf.summary.histogram('logits', logits)

# Operation for the loss function
loss = two_layer_neural_network.loss(logits, labels_placeholder)

# Operation for the training step
train_step = two_layer_neural_network.training(loss, FLAGS.learning_rate)

# Operation calculating the accuracy of our predictions
accuracy = two_layer_neural_network.evaluation(logits, labels_placeholder)

# Container for run metadata that is attached to the summaries
run_metadata = tf.RunMetadata()

# Define saver to save model state at checkpoints
saver = tf.train.Saver()

with tf.Session() as sess:
    # Merge summary data for TensorBoard, create summary writer, initialize variables
    summary = tf.summary.merge_all()
    summary_writer = tf.summary.FileWriter(logdir, sess.graph)
    sess.run(tf.global_variables_initializer())

    # Generate input data batches
    zipped_data = zip(data_sets['images_train'], data_sets['labels_train'])
    batches = data_helpers.gen_batch(list(zipped_data), FLAGS.batch_size,
        FLAGS.max_steps)

    for i in range(FLAGS.max_steps):
        # Get next input data batch
        batch = next(batches)
        images_batch, labels_batch = zip(*batch)
        feed_dict = {
            images_placeholder: images_batch,
            labels_placeholder: labels_batch}

        # Periodically print out the model's current accuracy
        if i % 100 == 0:
            train_accuracy = sess.run(accuracy, feed_dict=feed_dict)
            print('Step {:d}, training accuracy {:g}'.format(i, train_accuracy))
            summary_str = sess.run(summary, feed_dict=feed_dict)
            summary_writer.add_summary(summary_str, i)
            summary_writer.add_run_metadata(run_metadata, 'step%d' % i)

        # Perform a single training step
        sess.run([train_step, loss], feed_dict=feed_dict)

        # Periodically save checkpoint
        if (i + 1) % 1000 == 0:
            checkpoint_file = os.path.join(FLAGS.train_dir, 'checkpoint')
            saver.save(sess, checkpoint_file, global_step=i)
            print('Saved checkpoint')

    # After finishing the training, evaluate on the test set
    test_accuracy = sess.run(accuracy, feed_dict={
        images_placeholder: data_sets['images_test'],
        labels_placeholder: data_sets['labels_test']})
    print('Test accuracy {:g}'.format(test_accuracy))

CHAPTER 7

RESULTS

We can see that the training accuracy starts at a level we would expect from guessing
randomly (10 classes -> 10% chance of picking the correct one). Over roughly the first
1000 iterations the accuracy increases to around 50% and fluctuates around that
value for the next 1000 iterations. The test accuracy of 46% is not much lower than
the training accuracy. This indicates that our model is not significantly over-fitted.

7.1 Training the dataset:

Fig 7.1 Data training in Command prompt

7.2 Accuracy and Loss graph:

The following graph is displayed in TensorBoard, TensorFlow's visualization tool.
The accuracy of the classification gradually increases.

Fig 7.2 Training Accuracy graph in Tensorflow


The value of the loss function gradually decreases.

Fig. 7.3 Loss decreasing after training

7.3 Histograms Generation

Fig 7.4 Logits Histograms

Fig. 7.5 3D representation of 3072 nodes

CHAPTER 8

CONCLUSION & FUTURE SCOPE

The task of image classification has relevant applications in a wide range of domains,
and although it has been successful in many areas, it is still an object of active research
with potential to increase classification rates. Herein, we have used artificial neural
networks for image classification. Our technique classifies new samples with the help
of the patterns it learned from the training set. However, the accuracy of this project
still needs to be improved. The neural network can be improved in a number of ways,
such as refactoring the code to make it more efficient and adding more hidden layers
so that the network can deal with more complex images. The objective, to use a machine
learning algorithm to learn a classification model from a training dataset, has been
achieved.

Take humans, for instance: we receive roughly eighty percent of the
information about our environment through vision. Simply put, this field has
great scope in machine vision, especially in autonomous robotics, artificial
intelligence, satellite imaging, tracking, and the list continues.

One good thing about computer vision is that, when it is paired with machine learning,
it has applications in many diverse fields, from medical to military. But
it demands high computing power and parallel processing, which current-generation
processors are not able to provide well. New technologies have improved
the performance, but we are not there yet. So, simply put, computer vision and
especially machine learning are still at an early, budding stage and there's
a lot to come.

Future work can explore the optimization of the hyperparameters of the network
(such as the number of layers, the number of neurons per layer, etc.), since these
hyperparameters were fixed during the experiments in the present work. The strategy to
extract patches during testing can also be further explored, experimenting with
different strategies such as random patch extraction.

CHAPTER-9
REFERENCES

[1] P. Kamavisdar, S. Saluja, and S. Agrawal, A Survey on Image Classification


Approaches and Techniques, IJCST, vol. 2, no. 1, pp. 10051009, 2013.

[2] D. L. & Q. Weng, A survey of image classification methods and techniques for
improving classification performance, Int. J. Remote Sens., pp 90-94, 2007.

[3] Mokhairi Makhtar, Engku Fadzli, and Shazwani Kamarudin, The Contribution
of Feature Selection and Morphological Operation For On- Line Business System s
Image Classification, World Applied Science Journal, Vol 1. pp. 22-29, 2014.

[4] Christopher M. Bishop, Exact Calculation of the Hessian Matrix for the
Multilayer Perceptron, Published in Neural Computation, Vol. No. 4 pp. 122-125,
1992

[5] P. Sahoo, S. Soltani, and a. K. Wong, A survey of thresholding techniques,


Published in Computer Vision, Graph. Image Process., vol. 41, no. 2, pp. 233260,
Feb. 1988.

[6] A. C. Berg, T. L. Berg, J. Malik, and U. C.Berkeley, Shape Matching and Object
Recognition using Low Distortion Correspondences, International Journal of
Computer Trends and Technology, pp: 343-349, 2005.

[7] J. F. Nunes, P. M. Moreira, and J. M. R. S.Tavares, Shape Based Image Retrieval


and Classification,In IJCTT, pp. 433438, 2010.

[8] Bishop C. M. Curvature-Driven Smoothing in Feed-forward Networks. In


Proceedings of the International Neural Network Conference, Paris, Vol 2, p749,
1990.

50
[9] L. P. Ricotta, S. Ragazzini and G. Martinelli Learning of Word Stress in a
Suboptimal Second Order Back-propagation Neural Network. In Proceedings IEEE
International Conference on Neural Networks, San Diego, Vol. 1, pp. 355, 2007

[10] Y Le Cun, S. J. Denker and A. S. Solla. Optimal Brain Damage. In Advances


in Neural Information Processing Systems, Vol. 2, pp. 74-82, 1990

[11] A. A. M. Al-kubati, J. A. M. Saif, and M. A.A. Taher, Evaluation of Canny and


Otsu Image Segmentation, Published in IJCST, pp. 2426, 2012.

[12] L. Fei-Fei, R. Fergus, and P. Perona, Learning Generative Visual Models from
Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object
Categories, Conf. of Comput. Vis. Pattern Recognit. Work., pp. 178178, 2004.

[13] Geoffrey E. Hinton. Learning Multiple Layers of Representation. Published


International Journal Of Computer Science and Communication, Vol.1, No.2, pp.
141-144, October 2007.

[14] P.Santhi and V. Murali Bhaskaran,Improving the Performance of Data Mining


Algorithms IJCST, vol. 2, no. 3, pp. 152157, 2011.

[15] S. Banerji, A. Sinha, and C. Liu, New image descriptors based on color,
texture, shape, and wavelets for object and scene image classification,Published in
IEEE Neurocomputing, vol. 117, pp. 173185, Oct. 2013.
