Machine Learning

TABLE OF CONTENTS
1. INTRODUCTION
2. WHY DIGIT RECOGNITION
3. DATA DESCRIPTION
4. LITERATURE SURVEY
5. OBJECTIVE OF PROJECT
6. PROBLEM STATEMENT
7. METHODOLOGY
8. IMPLEMENTATION
9. RESULT
10. CONCLUSION
11. REFERENCES
INTRODUCTION
• Handwriting recognition is the problem of automatically interpreting intelligible handwritten input. It is of great
interest to the pattern recognition research community because it applies to many fields, enabling more
convenient input devices and more efficient data organization and processing. As one of the fundamental problems
in designing practical recognition systems, the recognition of handwritten digits is an active research field.
Immediate applications of digit recognition techniques include postal mail sorting, automatic address
reading and mail routing, and bank check processing.
• A major problem in handwriting recognition is the huge variability and distortion of patterns. Elastic models based
on local observations and dynamic programming, such as hidden Markov models (HMMs), are not efficient at absorbing
this variability: their view of the pattern is local, they cannot cope with length variability, and they are very
sensitive to distortions.
• A Support Vector Machine (SVM) can instead be used to estimate global correlations and classify the pattern. SVM is an
alternative to neural networks (NN), and in handwriting recognition it can give better recognition results.
WHY DIGIT RECOGNITION?
• As brief background on handwritten digit processing with machine learning, let’s note why being able to
process this kind of data is interesting. Handwritten digits are a common part of everyday life. One of the
first uses that comes to mind is zip codes.
• A zip code consists of 5 digits (sometimes more, depending on whether the trailing digits are included) and is
one of the most important parts of a letter for it to be delivered to the correct location. Many years ago, the
postman would read the zip code manually for delivery. This type of work is now automated using optical
character recognition (OCR), similar to the type of solution we’ll be implementing in this project.
OBJECTIVE OF PROJECT
• The objective of this project is to develop a machine learning program
that can recognize handwritten digits from pictures. Such a program
can play a vital role in postal automation services, especially in
countries like India where multiple languages and scripts are present.
PROBLEM STATEMENT
• Given isolated grayscale digit images taken from the MNIST database, the objectives are:
• 1) To recognize handwritten digits correctly
• 2) To improve the accuracy of detection

METHODOLOGY
• 1. The proposed method uses the MNIST dataset and three different classifiers for handwritten digit recognition.
• 2. The classifiers are trained on the training data. Training data is used by the learning algorithm, usually in a
supervised learning model, to increase accuracy.
• 3. The label (answer) is provided for each row in the dataset, so the algorithm can learn which data corresponds to which
handwritten digit.
• 4. However, in order to really know how well the program is doing, we need to run it on data that it has never seen before.
That is where the cross-validation set comes in.
• 5. We split the dataset into two parts: 70% remains as the training data and the remaining 30% serves as the
cross-validation data. We provide the training portion to the learning algorithm, along with the answers.
• 6. After training has completed, we run the trained model on the cross-validation data to see just how accurate the
solution really is. Since we have the digit labels (answers) for both the training and cross-validation sets, we can calculate
an accuracy percentage.
• 7. Using the above technique, we can compare different learning algorithms and find the best algorithm for the MNIST dataset.
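• The hold-out procedure in steps 5 and 6 can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the helper names holdout_split and accuracy, the fixed seed and the toy data are assumptions made for the example, while the 30% held-out fraction follows the 70/30 split used in the implementation.

```python
import random


def holdout_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle the sample indices and hold out `test_fraction` of the samples."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def accuracy(actual, predicted):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Toy data (illustrative only): 10 samples with 2 features each.
rows = [[i, i + 1] for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = holdout_split(rows, labels)
print(len(x_train), len(x_test))             # sizes of the two portions
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 predictions match
```

Running every classifier on the same held-out portion is what makes the accuracy figures comparable.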
CLASSIFIERS USED IN PROJECT FOR DIGIT RECOGNITION
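• 1. GAUSSIAN NAÏVE BAYES: Naive Bayes methods are a set of supervised learning algorithms based on
applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features
given the class label. GaussianNB implements the Gaussian Naive Bayes algorithm for classification, in which the
likelihood of each feature is assumed to be Gaussian.
• Syntax to implement Gaussian Naive Bayes in scikit-learn:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
clf = gnb.fit(x_train, x_label)  # x_train: training features, x_label: training labels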
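• To make the “naive” independence assumption concrete, here is a from-scratch sketch of what a Gaussian Naive Bayes classifier computes. It is not scikit-learn's implementation: the function names, the variance_floor parameter and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels, variance_floor=1e-9):
    """Estimate a class prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # variance_floor (an assumption of this sketch) avoids division by zero
        # for features that are constant within a class, e.g. blank border pixels.
        variances = [
            sum((v - m) ** 2 for v in col) / n + variance_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class maximizing log prior + sum of per-feature log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: the per-feature log densities simply add up.
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (illustrative only): two well-separated classes with 2 features.
rows = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [0.1, 0.1]), predict_gaussian_nb(model, [5.0, 5.0]))
```

Because each pixel is modelled separately, the classifier ignores correlations between neighbouring pixels, which is one plausible reason it trails the other two classifiers on digit images.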
KNN
• 2. k-NEAREST NEIGHBORS (KNN): As a nonparametric approach, the k-nearest neighbor classifier uses all the training
patterns as prototypes. The classification accuracy is influenced by the number of nearest neighbors k. We thus try
different values of k (k = 1, 3, 5, 7, 9) and obtain the test error rate for each classifier. The training error rate is
obtained by 10-fold cross-validation. As shown in the figure, the highest accuracy is mostly given by k = 3. We hence use
the 3-NN classifier.
• Syntax to implement KNN in scikit-learn:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
clf = neigh.fit(x_train, x_label)
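• The idea of using all training patterns as prototypes can be sketched without any library. The function below is a hypothetical illustration, not scikit-learn's KNeighborsClassifier; it assumes Euclidean distance, a simple majority vote and Python 3.8+ for math.dist.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training rows."""
    # Every training row acts as a prototype: measure its distance to the query.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data (illustrative only): two clusters of 2-feature samples.
train_rows = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_labels = [0, 0, 0, 1, 1, 1]

for k in (1, 3, 5):
    print(k, knn_predict(train_rows, train_labels, [0.5, 0.5], k=k))
```

An odd k, as in the values tried above, avoids ties when there are only two candidate classes; with ten digit classes ties remain possible and an implementation has to break them somehow.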
CLASSIFIER USED IN PROJECT
• 3. SUPPORT VECTOR MACHINE: Support vector machines (SVMs) are a set of supervised learning methods used for
classification, regression and outlier detection.
• The advantages of support vector machines are:

• Effective in high dimensional spaces.
• Still effective in cases where number of dimensions is greater than the number of samples.
• Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
• Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also
possible to specify custom kernels.
• The disadvantages of support vector machines include:

• If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions
and the regularization term is crucial.
• SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
• Syntax to implement SVM in scikit-learn:

from sklearn import svm

clf = svm.SVC()
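• As an illustration of what a kernel function is, the sketch below computes the RBF (Gaussian) kernel, which is the default kernel of svm.SVC in scikit-learn. The function name rbf_kernel and the sample points are assumptions made for this example; the value gamma = 0.0005 is the one used in the SVM implementation in this document.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


a = [0.0, 0.0]
b = [3.0, 4.0]  # squared Euclidean distance from a is 3^2 + 4^2 = 25

print(rbf_kernel(a, a, gamma=0.0005))  # identical points give similarity 1.0
print(rbf_kernel(a, b, gamma=0.0005))  # small gamma: similarity decays slowly
print(rbf_kernel(a, b, gamma=0.5))     # large gamma: similarity decays quickly
```

A small gamma lets distant training points still influence the decision function, giving a smoother boundary; a large gamma makes each support vector act only locally, which raises the risk of over-fitting.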
MNIST DATASET
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9
1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
MNIST
• The MNIST database contains 60,000 training images (plus 10,000 test images) of digits ranging from 0 to 9. Each digit is
size-normalized and centered in a gray-level image of size 28 x 28, giving 784 pixels in total as the features.
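• The relation between a 28 x 28 image and the label, pixel0, pixel1, ... columns shown in the tables can be sketched as follows. The helper names and the sample image are hypothetical, and row-major pixel ordering (left to right, then top to bottom) is an assumption of this sketch.

```python
def image_to_row(label, image):
    """Flatten a 2-D image (a list of pixel rows) into [label, pixel0, pixel1, ...]."""
    return [label] + [pixel for pixel_row in image for pixel in pixel_row]


def csv_header(n_pixels):
    """Column names in the same layout as the tables: label, pixel0, pixel1, ..."""
    return ["label"] + [f"pixel{i}" for i in range(n_pixels)]


# Hypothetical image: blank 28 x 28 canvas with one bright pixel at row 1, column 2.
image = [[0] * 28 for _ in range(28)]
image[1][2] = 255

row = image_to_row(7, image)
header = csv_header(28 * 28)

print(len(row))                 # 1 label + 784 pixel features
print(header[:3], header[-1])   # first and last column names
print(header[row.index(255)])   # column that holds the bright pixel
```

Since digits are centered in the image, the first and last pixel columns are border pixels and are almost always 0, which is why the sample rows in the tables contain only zeros.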
TRAINING DATASET
• Example: train.csv
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5

3 0 0 0 0 0 0
5 0 0 0 0 0 0
3 0 0 0 0 0 0
8 0 0 0 0 0 0
9 0 0 0 0 0 0
1 0 0 0 0 0 0
3 0 0 0 0 0 0
3 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
0 0 0 0 0 0 0
7 0 0 0 0 0 0
5 0 0 0 0 0 0
8 0 0 0 0 0 0
6 0 0 0 0 0 0
2 0 0 0 0 0 0
0 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
6 0 0 0 0 0 0
9 0 0 0 0 0 0
TRAINING DATASET
• 70% of the MNIST dataset is used for training.

TESTING DATA
• Example: test.csv
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5
8 0 0 0 0 0 0
9 0 0 0 0 0 0
1 0 0 0 0 0 0
3 0 0 0 0 0 0
3 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
0 0 0 0 0 0 0
7 0 0 0 0 0 0
5 0 0 0 0 0 0
8 0 0 0 0 0 0
6 0 0 0 0 0 0
2 0 0 0 0 0 0
0 0 0 0 0 0 0
TESTING DATA
• The remaining 30% of the MNIST dataset is used for cross-validation (prediction).

IMPLEMENTATION USING GAUSSIAN NAÏVE BAYES
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Note: load_digits() loads scikit-learn's bundled 8 x 8 digits dataset
# (64 features per image), not the 28 x 28 MNIST images described earlier.
digit = datasets.load_digits()
labels = digit.target

# Hold out 30% of the samples for testing; the remaining 70% is used for training.
x_train, x_test, x_label, xtest_label = train_test_split(digit.data, labels, test_size=0.3)

gnb = GaussianNB()
clf = gnb.fit(x_train, x_label)

# Predict the held-out digits and compare with their true labels.
actual_label = xtest_label
p = clf.predict(x_test)
score = accuracy_score(actual_label, p)
print("accuracy", score * 100, "%")
• Output: an accuracy of 81.85% is achieved.
IMPLEMENTATION USING KNN
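from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Reuses x_train, x_test, x_label and xtest_label from the Gaussian Naive Bayes split.
neigh = KNeighborsClassifier(n_neighbors=5)
clf = neigh.fit(x_train, x_label)
p = clf.predict(x_test)
print("accuracy", accuracy_score(xtest_label, p) * 100, "%")
• Output: an accuracy of 98% is achieved.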
IMPLEMENTATION USING SVM
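from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Reuses x_train, x_test, x_label and xtest_label from the Gaussian Naive Bayes split.
clf = SVC(gamma=0.0005)
clf.fit(x_train, x_label)
p = clf.predict(x_test)
print("accuracy", accuracy_score(xtest_label, p) * 100, "%")
• Output: an accuracy of 99% is achieved.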
RESULT
• 1. Handwritten digit recognition using Gaussian Naive Bayes: an accuracy of 81.85% is achieved.
• 2. Handwritten digit recognition using KNN: an accuracy of 98% is achieved.
• 3. Handwritten digit recognition using SVM: an accuracy of 99% is achieved.

CONCLUSION
• 1. Handwriting recognition is a very large research area within pattern recognition and image processing
because of its wide applicability.
• 2. SVM is a well-established method for handwriting recognition that can provide very good
accuracy for general systems.
• 3. In this project we learnt how SVM can be applied to digit recognition. We have seen that a
proper set of training data, good image processing techniques and oriented features can
provide a high accuracy of 99% in digit recognition using SVM.
