
Improving Naïve Bayes Performance in Single Image Pap Smear Using Weighted Principal Component Analysis (WPCA)
Yumi Novita Dewi
STMIK Nusa Mandiri Jakarta
Jakarta, Indonesia

Dwiza Riana
STMIK Nusa Mandiri Jakarta
Jakarta, Indonesia

Teddy Mantoro
Faculty of Science and Technology, Sampoerna University
Jakarta, Indonesia

Abstract: The accuracy of single-image Pap smear classification into seven classes is still unsatisfactory, even though determining the class of a cell in a single Pap smear image is very important for deciding whether the cell is normal or not. Pap smear image classification with good accuracy will greatly assist the automatic detection of cells. This study aims to determine whether weighted Principal Component Analysis (PCA) can improve the performance of the Naïve Bayes algorithm in classifying cell images in the Herlev dataset. Accuracy is checked for two-class and seven-class classification. The method used in this research consists of several stages: preprocessing, knowledge rule, evaluation, and performance report. The results indicate that the weighted PCA method improves accuracy for the seven-class classification, while it does not provide better accuracy for the two-class classification.

Keywords: Pap smear images; Classification; Naïve Bayes; Weighted - Principal Component Analysis (PCA).

I. INTRODUCTION
Cervical cancer is the fourth most common cancer affecting women. There were about 530,000 new cases in 2012, representing about 7.5% of all female cancer deaths [1]. It is estimated that more than 270,000 deaths occur due to cervical cancer every year. Cervical cancer attacks the cervix and is caused by the Human Papillomavirus (HPV). The infection is not directly felt by the sufferer, so most patients only learn of the cancer when it has reached an advanced stage.

This can lead to death if medical treatment is delayed. The lack of awareness of early examination and the limited availability of experts who can perform Pap smear diagnoses lead to an ever-increasing number of deaths.

The first research on early detection of cervical cancer was performed by the scientist George Papanicolaou in 1930. Papanicolaou invented the mechanism of pre-cervical cancer diagnosis, so the early detection test for cervical cancer became known as the Pap smear [2].

The cytologist performs the test by taking samples of cells from the cervix and placing them on a glass slide. The slide is then stained to facilitate diagnosis under the microscope, which yields class category information for the cervical cells [3].

A data set resulting from Pap tests is available [4] and has been used in many studies. The Herlev data set was introduced by Jantzen, Norup, Dounias, and Bjerregaard [5]. In many studies it is used to evaluate algorithms and various classification techniques [6], [7], [8].

In the previous study [5], 20 features were extracted from the segmentation result of a Pap smear cell image and then used as inputs to recognize a single Pap smear cell image. The Herlev data set has 917 images that serve as classification objects. A single-cell Pap smear image belongs to one of seven class categories. Three are normal cell classes (normal superficial, normal intermediate, and normal columnar), while the other four are abnormal cell classes (mild (light) dysplasia, moderate dysplasia, severe dysplasia, and carcinoma in situ) [5].

Research on single-cell Pap smear image classification has become an active focus of discussion [3]. Direct classification is also intended to distinguish normal from abnormal cells. Researchers continue to work on this classification to learn more about the development of cervical cancer using data mining. Data mining is the process of extracting knowledge from large amounts of data stored in databases or other data repositories.
Several previous studies have addressed Pap smear image classification. Research using the same dataset but with different features, methods, and algorithms has used Artificial Neural Network (ANN), Euclidean Distance (ED), and Support Vector Machine (SVM) [9], and ConvNet [10], but this research only classified cells as normal or abnormal. Classification of seven classes on the Herlev data, on the other hand, remains difficult. The Naïve Bayes classification algorithm without feature selection and ensemble was used in [11]. A later study [12] used WEKA to apply Naïve Bayes classification, Nearest Neighbor filters, and Naïve Bayes transfer. The results of that research show that the Naïve Bayes method gives significant results.
Research on the integration of bagging and greedy forward selection for Pap smear classification using Naïve Bayes has been carried out on the Herlev dataset with two classification categories: classification of single Pap smear cells into two classes and into seven classes [13]. The highest accuracy obtained was 92.15% for the two-class classification (normal and abnormal). However, the seven-class classification of single Pap smear cells only reached an accuracy of 63.25%.

In other words, the Naïve Bayes model with Bagging and Greedy Forward Selection is still unable to achieve the best accuracy for seven-class classification [13], although the accuracy can still be improved. To improve it, other algorithm models need to be integrated so that classification on the single-cell Pap smear image dataset can produce the best accuracy [14]. Because the single-cell Pap smear dataset is categorized as a dataset with large-scale dimensions, techniques for processing the training data and selecting attributes are needed to improve accuracy. This research reduces the dataset features using PCA with weighting and measures the classification accuracy for two classes and seven classes.

This paper is organized as follows. Section 2 describes Pap smears. Section 3 discusses the materials and methods used in the study. Section 4 presents the results and discussion of the accuracy of two-class and seven-class classification using a weighted PCA process with the Naïve Bayes algorithm. The last section contains the conclusions and further research plans.

Generally, a single Pap smear cell image consists of three regions: the nucleus, the cytoplasm that surrounds the nucleus, and the background, which is not part of the cell. The specimen cells are mostly taken from the columnar epithelium and squamous epithelium regions, as shown in Figure 1.

Normal conditions change when there are pre-cancerous cells, so-called dysplastic cells, in the cervix. The genetic information of these cells changes, so the cells no longer divide into the developmental layers as they should [15].

Fig. 1. Columnar Epithelium and Squamous Epithelium Region

Based on the diagnosis, there are two types of pre-cancerous cells: dysplasia, which means irregular development, and carcinoma in situ [15]. Dysplastic cells can be divided into three types: mild, moderate, and severe dysplasia. Thus, every cell in the cervix can be grouped into one of seven cell classes. Table I shows the characteristics of a single cell of each class.

TABLE I. SINGLE CELL CHARACTERISTICS OF PAP SMEAR [7]

| No | Class Name | Characteristics |
|----|------------|-----------------|
| 1. | Normal Superficial | The cells are oval; the nucleus is very small; the ratio of nucleus area to cytoplasm area is very small. |
| 2. | Normal Intermediate | The cells are round; the nucleus is large; the ratio of nucleus area to cytoplasm area is small. |
| 3. | Normal Columnar | The cells are shaped like columns; the nucleus is large; the ratio of nucleus area to cytoplasm area is medium. |
| 4. | Mild (Light) Dysplasia | The nucleus is large and light-colored; the ratio of nucleus area to cytoplasm area is medium. |
| 5. | Moderate Dysplasia | The nucleus is large and dark; the cytoplasm is dark; the ratio of nucleus area to cytoplasm area is large. |
| 6. | Severe Dysplasia | The nucleus is large, dark, and irregular in shape; the cytoplasm is dark; the ratio of nucleus area to cytoplasm area is large. |
| 7. | Carcinoma In Situ | The nucleus is large, dark, and irregular in shape; the ratio of nucleus area to cytoplasm area is large. |
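The relation between the seven classes of Table I and the two-class (normal versus abnormal) task can be written down as a small lookup. The following Python sketch is only an illustration: the label strings, the dictionary name `SEVEN_TO_TWO`, and the helper `to_two_class` are assumptions made here and are not taken from the Herlev data files.

```python
# Illustrative mapping from the seven cell classes of Table I to the
# two-class task. Label spellings and names are assumptions for this sketch.
SEVEN_TO_TWO = {
    "normal superficial": "normal",
    "normal intermediate": "normal",
    "normal columnar": "normal",
    "mild dysplasia": "abnormal",
    "moderate dysplasia": "abnormal",
    "severe dysplasia": "abnormal",
    "carcinoma in situ": "abnormal",
}


def to_two_class(label: str) -> str:
    """Collapse a seven-class label into 'normal' or 'abnormal'."""
    return SEVEN_TO_TWO[label.strip().lower()]


if __name__ == "__main__":
    for name in SEVEN_TO_TWO:
        print(f"{name:22s} -> {to_two_class(name)}")
```

Keeping the mapping in one place means the two-class dataset can be derived from the seven-class labels rather than maintained separately.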
In 2003, Erik Martin classified Pap smear cells into seven classes, comprising three normal classes and four abnormal classes [4]. In 2005, Jantzen et al. proposed twenty features that can be extracted from the segmentation result of a Pap smear cell image; these twenty features represent a considerable number of features [5].

III. MATERIAL AND METHOD
The proposed method design is shown in Figure 2. The proposed model takes as input the single Pap smear cell images of the Herlev data [5]. This dataset is numerical: it contains 917 records with 20 feature attributes, divided into 7 classes that cover normal and abnormal cells.

In the proposed method, preprocessing prepares the data as two datasets. The first dataset has 2 classes, normal and abnormal, while the second has 7 classes: superficial squamous, intermediate squamous, columnar, mild dysplasia, moderate dysplasia, severe dysplasia, and carcinoma in situ. The next process derives the rule, or knowledge rule, from each dataset.

Fig. 2. Proposed Algorithm Model Scheme Design

A. Weighted Principal Component Analysis

PCA is one of the most widely used feature extraction (variable reduction) methods [16]. PCA is very useful when the data has a large number of variables that are correlated with one another. The PCA calculations are based on the eigenvalues and eigenvectors, which express the distribution of the data in a dataset [17].

The purpose of PCA is to reduce the number of existing variables without losing the information contained in the original data [18]. Using PCA, the initial n variables are reduced to k new variables (principal components), with k smaller than n, and using only the k principal components retains nearly the same information as the n variables [19]. The variables resulting from the reduction are called principal components, or factors. Besides being fewer in number, the new variables formed by PCA are no longer correlated with one another [16].

The next step is to normalize the attribute weights, which reflect the relevance of each attribute to the class attribute values [20]. The Weighted PCA process is shown in Figure 3.

Fig. 3. Examination Step of Naïve Bayes with Weighted - PCA

In weighting, the accuracy value is obtained from the confusion matrix with the formula:

$$\text{Accuracy} = \frac{\text{Amount of True Data}}{\text{Amount of Data}} \times 100\%$$

B. Naïve Bayes Algorithm

Naïve Bayes is one of the most popular and widely used classification methods [13, 21]. It is a classification method that can predict the probability of membership of a class. Naïve Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. For numerical attributes, classification is performed using the following formulas:

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{1}$$

$$\mu = \frac{\sum_{i=1}^{n} x_i}{n} \tag{2}$$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i-\mu)^2}{n-1}} \tag{3}$$

Calculations with numerical data in the Naïve Bayes algorithm use the mean μ and the standard deviation σ, following equations (1), (2), and (3). All parts are tested, with the dataset divided into training data and testing data.

Each method and algorithm has its own characteristics, and so does the Naïve Bayes algorithm. Naïve Bayes classification is based on probability theory and views all features of the data as evidence [14]. This gives Naïve Bayes the following characteristics:
1. The Naïve Bayes method is robust to isolated data, which is usually data with different characteristics (outliers). Naïve Bayes can also handle incorrect attribute values by ignoring the affected training data during model building and prediction.
2. It is robust to irrelevant attributes.
3. Correlated attributes can degrade the performance of Naïve Bayes classification because the assumption of attribute independence no longer holds.

This study proposes using the weighted PCA method with the Naïve Bayes algorithm to improve the accuracy of seven-class classification of single Pap smear images. Table II compares the Naïve Bayes algorithm with the Weighted-PCA model.

TABLE II. COMPARISON NAÏVE BAYES WITH WEIGHTED-PCA

| Class | Model | Accuracy Result |
|-------|-------|-----------------|
| 2 classes | Naïve Bayes | 90.42% |
| 2 classes | Naïve Bayes + Weighted-PCA | 67.45% |
| 7 classes | Naïve Bayes | 55.73% |
| 7 classes | Naïve Bayes + Weighted-PCA | 87.24% |

Table II compares the accuracy of the Naïve Bayes algorithm alone and with the Weighted-Principal Component Analysis (PCA) model. With Naïve Bayes alone, the seven-class accuracy is lower than that of the other models. However, when Naïve Bayes is combined with the Weighted-PCA model for two classes, the accuracy decreases to 67.45%.

Figure 4 charts the comparison of the Naïve Bayes algorithm with the Weighted-PCA model; the highest value is the accuracy of the Naïve Bayes algorithm for two classes.

Fig. 4. Comparison graph of NB algorithm and W-PCA

In Figure 4, the Naïve Bayes algorithm stands out in accuracy for the single Pap smear cell dataset with two classes because Naïve Bayes can quickly and accurately handle datasets of small dimensions, while for the seven-class classification Naïve Bayes with weighted PCA shows an increase in accuracy compared to the classification that uses only the Naïve Bayes algorithm. Overall, the comparison shows that combining Naïve Bayes with Weighted-PCA gives the best accuracy for the seven-class data.

Furthermore, a t-test is conducted to see whether the improvement obtained is significant.

0,939 0,040 -
0,959 - -

The t-test of the Naïve Bayes algorithm with the Weighted-PCA model on the seven-class and two-class datasets gave a value of 0.040. This is smaller than α = 0.05, which shows that the two models differ significantly on both single-cell Pap smear class settings (Table III).

The solution to this problem is to combine the Naïve Bayes algorithm with the Weighted-PCA model for datasets with large-scale dimensions, so that the accuracy value can be increased to 86.59%.

V. CONCLUSION

The Naïve Bayes algorithm stands out in accuracy for the two-class dataset because Naïve Bayes can quickly and accurately handle datasets with small-scale dimensions, while on the dataset with large-scale dimensions its accuracy decreases. The solution to this problem is to combine the Naïve Bayes algorithm with the weighted PCA model for datasets with large-scale dimensions, so that the accuracy values become 67.45% and 87.28%.

The t-test results, obtained with the RapidMiner implementation of the Naïve Bayes algorithm with the Weighted-PCA model, gave a value of 0.023. This is smaller than α = 0.05, which shows that the models differ significantly on both single-cell Pap smear class settings. This research is a preliminary study on datasets with classification features from Pap smear images. Further research will integrate sample bootstrapping with weighted PCA to classify single Pap smear images into seven classes so that better accuracy can be obtained.

This study used the data from: Pap smear Benchmark Data for Pattern Classification, J. Jantzen, J. Norup, G. Dounias, and B. Bjerregaard, University Dept. of Pathology, Herlev Ringvej 75, DK-2730 Herlev, Denmark.
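Equations (1) to (3) and the accuracy formula of Section III can be illustrated with a small, self-contained Python sketch. It is not the RapidMiner process used in the study and it leaves out the weighted PCA step. The function names, the made-up two-feature toy data, and the use of class priors with log-probabilities are all assumptions made for this illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature (mean, std) for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        stats = []
        for col in zip(*rows):
            mu = sum(col) / n                                  # Eq. (2)
            var = sum((v - mu) ** 2 for v in col) / (n - 1)    # inside the root of Eq. (3)
            stats.append((mu, math.sqrt(var)))                 # Eq. (3)
        model[label] = (n / len(X), stats)
    return model


def log_gaussian(x, mu, sigma):
    """Natural log of the Gaussian density g(x, mu, sigma) of Eq. (1)."""
    return -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)


def predict(model, row):
    """Return the class with the highest log prior plus summed log densities."""
    scores = {}
    for label, (prior, stats) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, mu, sigma) for x, (mu, sigma) in zip(row, stats)
        )
    return max(scores, key=scores.get)


def accuracy(y_true, y_pred):
    """Accuracy = (amount of true data) / (amount of data) x 100%."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) * 100.0


# Toy data, made up for illustration only (not Herlev measurements).
X_train = [[1.0, 0.10], [1.2, 0.12], [0.9, 0.08],
           [3.0, 0.50], [3.2, 0.55], [2.8, 0.45]]
y_train = ["normal"] * 3 + ["abnormal"] * 3
X_test = [[1.1, 0.11], [3.1, 0.52]]
y_test = ["normal", "abnormal"]

model = fit_gaussian_nb(X_train, y_train)
y_pred = [predict(model, row) for row in X_test]
print(y_pred, accuracy(y_test, y_pred))
```

Summing log densities instead of multiplying the densities of Eq. (1) directly is a common design choice that avoids numerical underflow when many features are used, such as the 20 features of the Herlev data. The sketch assumes every feature has non-zero variance within each class.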
REFERENCES

[1] WHO, "Human Papillomavirus (HPV) and Cervical Cancer", June 2016.
[2] B. W. Stewart and C. P. Wild, World Cancer Report 2014, World Health Organization, 2014.
[3] Kale, A., & Aksoy, S. "Segmentation of Cervical Cell Images", IEEE International Conference on Pattern Recognition (ICPR), 2010.
[4] E. Martin, "Pap-Smear Classification," Technical University of Denmark - DTU, 2003.
[5] Jantzen, J. N., Dounias, G., and Bjerregaard, B. "Pap-smear Benchmark Data For Pattern Classification", Technical University of Denmark, Denmark, 2005.
[6] D. Riana, D. Ekashanti, O. Dewi, D. H. Widyantoro, and T. L. R. Mengko, "Segmentation and Area Measurement in Abnormal Pap Smear Images Using Color Canals Modification with Canny Edge Detection," in International Conference on Women's Health in Science & Engineering, 2012, pp. 1–4.
[7] D. Riana, "Hierarchical Decision Approach Berdasarkan Importance Performance Analysis Untuk Klasifikasi Citra Tunggal Pap Smear Menggunakan Fitur Kuantitatif dan Kualitatif". Depok: Fakultas Ilmu Komputer Program Magister Ilmu Komputer Universitas Indonesia, 2010.
[8] D. Riana, D. H. Widyantoro, and T. L. Mengko, "Extraction and classification texture of inflammatory cells and nuclei in normal pap smear images," Proc. - 2015 4th Int. Conf. Instrumentation, Commun. Inf. Technol. Biomed. Eng. ICICI-BME 2015, pp. 65–69, 2016.
[9] S. A and S. Jereesh, "Automated Cervical Cancer Detection through RGVF segmentation and SVM Classification," pp. 663–669, 2015.
[10] L. Zhang, L. Lu, I. Nogues, R. Summers, S. Liu, and J. Yao, "DeepPap: Deep Convolutional Networks for Cervical Cell Classification," IEEE J. Biomed. Heal. Informatics, vol. XX, no. c, pp. 1–1, 2017.
[11] Y. E. Kurniawati and A. E. Permanasari, "Comparative Study on Data Mining Classification Methods for Cervical Cancer Prediction Using Pap Smear Results," 2016.
[12] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Inf. Softw. Technol., vol. 54, no. 3, pp. 248–256, 2012.
[13] D. Riana, A. N. Hidayanto, and Fitriyani, "Integration of Bagging and greedy forward selection on image pap smear classification using Naïve Bayes", Proc. 2017 5th International Conference on Cyber and IT Service Management (CITSM), 2017.
[14] J. Han, M. Kamber, and J. Pei, "Data Mining: Concepts and Techniques". 2012.
[15] Shweta Kharya. "Using Data Mining Tecniques for Diagnosis and Prognosis of Cancer Disease". Chhatisgarh, India: Bhilai Institute of Technology. 2012.
[16] Da Costa, J. F. P., Alonso, H. and Roque, L. (2011). "A Weighted Principal Component Analysis and Its Application to Gene Expression Data". IEEE/ACM Transaction on Computational Biology and Bioinformatics. Vol. 8 No. 1. January 2011.
[17] Tri Agus Setiawan, Romi SW., & Abdul S. "Integrasi Metode Sample Bootstrapping dan Weighted Principal Component Analysis untuk Meningkatkan Performa k Nearest Neighbor pada Dataset Besar", ISSN 2356-3982. Journal of Intelligent Systems, Vol. 1, No. 2, December 2015. 320–328. doi:10.1016/j.apenergy.2014.08.110. 2014
[18] Susetyoko, Ronny and Purwantini, Elly. "Teknik Reduksi Dimensi Menggunakan Komponen Utama Data Partisi Pada Pengklasifikasian Data Berdimensi Tinggi dengan Ukuran Sampel Kecil". 2011.
[19] F. Gorunescu, Data mining: Concepts, Model and Techniques. Springer, Heidelberg. 2011.
[20] Da Costa, J. F. P., Alonso, H. and Roque, L. "A Weighted Principal Component Analysis and Its Application to Gene Expression Data". IEEE/ACM Transaction on Computational Biology and Bioinformatics. Vol. 8 No. 1. January 2011.
[21] Alfisahrin S. N. N. and Mantoro T. "Data Mining Techniques For Optimatisation of Liver Disease Clasification", 2nd International Conference on Advanced Computer Science Applications and Technologies (ACSAT 2013), pp. 379-384, Kuching, Malaysia, 22-24 December 2013.
