Dewi 2017
Teddy Mantoro
Faculty of Science and Technology, Sampoerna University
Jakarta, Indonesia
teddy@ieee.org
Abstract— The accuracy of single-image Pap smear classification into seven classes remains unsatisfactory, even though determining the class of a cell in a single Pap smear image is essential for deciding whether the cell is normal. A classifier with good accuracy would greatly assist automatic cell detection. This study examines whether weighted Principal Component Analysis (PCA) can improve the performance of the Naïve Bayes algorithm in classifying cell images from the Herlev dataset. Accuracy is evaluated for two-class and seven-class classification. The method consists of several stages: preprocessing, knowledge rule, evaluation, and performance report. The results indicate that weighted PCA improves accuracy for seven-class classification, whereas it does not improve accuracy for two-class classification.

Keywords—Pap smear images; Classification; Naïve Bayes; Weighted Principal Component Analysis (PCA).

I. INTRODUCTION

Cervical cancer is the fourth most common cancer affecting women. There were about 530,000 new cases in 2012, and the disease accounts for about 7.5% of all female cancer deaths worldwide [1]. It is estimated that more than 270,000 deaths occur due to cervical cancer every year. Cervical cancer develops in the cervix and is caused by the Human Papilloma Virus (HPV). The infection is not directly felt by the sufferer, so most patients learn of the cancer only when it has reached an advanced stage.

Delayed medical treatment can be fatal. The lack of awareness of early examination and the limited availability of experts who can perform Pap smear diagnoses lead to an ever-increasing number of deaths.

The first research on early detection of cervical cancer was performed in 1930 by the scientist George Papanicolaou, who invented a mechanism for diagnosing pre-cancerous cervical conditions; this early-detection test became known as the Pap smear [2].

A cytologist performs the test by taking cell samples from the cervix and placing them on a glass slide. The sample is then stained to facilitate diagnosis under the microscope, which yields the class category of the cervical cells [3].

A dataset of Pap test results is available [4] and has been used in many studies. The Herlev dataset was introduced by Jantzen, Norup, Dounias, and Bjerregaard [5], and many studies have used it to test algorithms and various classification techniques [6], [7], [8].

In the previous study [5], 20 features were extracted from the segmentation result of a Pap smear cell image and then used as inputs to recognize a single Pap smear cell image. The Herlev dataset has 917 images that serve as classification objects. A single-cell Pap smear image belongs to one of seven class categories. Three of them are normal cell classes (normal superficial, normal intermediate, and normal columnar), while the other four are abnormal cell classes (mild (light) dysplasia, moderate dysplasia, severe dysplasia, and carcinoma in situ) [5].

Research on single-cell Pap smear image classification has since become an active topic [3]. Direct classification is also used to distinguish normal from abnormal cells. Researchers continue to study classification in order to learn more about the development of cervical cancer using data mining, the process of extracting knowledge from large databases or other data repositories.
Several previous studies have addressed Pap smear image classification. Studies on the same dataset, but with different features, methods, and algorithms, used an Artificial Neural Network (ANN), Euclidean Distance (ED), and a Support Vector Machine (SVM) [9], and a ConvNet [10]; however, these studies only distinguished normal from abnormal cells. Seven-class classification on the Herlev data remains difficult. The Naïve Bayes classification algorithm was used without feature selection or ensemble methods in [11]. A later study [12] used WEKA to apply Naïve Bayes classification, Nearest Neighbor filters, and Naïve Bayes transfers. Its results show that the Naïve Bayes method produces significant results.
The integration of bagging and greedy forward selection with Naïve Bayes for Pap smear classification has been tested on the Herlev dataset under two classification schemes: two-class and seven-class classification of single-cell Pap smear images [13]. The highest accuracy obtained was 92.15% for two-class classification (normal versus abnormal). However, seven-class classification of single-cell Pap smear images reached an accuracy of only 63.25%.

In other words, the Naïve Bayes model with Bagging and Greedy Forward Selection still does not achieve high accuracy on seven-class classification [13], although the accuracy can still be improved. Improving it requires integrating other algorithm models so that classification on the single-cell Pap smear image dataset can produce the best accuracy [14].

Because the single-cell Pap smear dataset is a high-dimensional dataset, techniques for processing the training data and selecting attributes are needed in order to improve accuracy. This research reduces the dataset features using PCA with weighting and measures the resulting classification accuracy for two classes and seven classes.

This paper is organized as follows. Section 2 describes Pap smears. Section 3 discusses the materials and methods used in the study. Section 4 presents the results and discusses the accuracy of two-class and seven-class classification using a weighted PCA process with the Naïve Bayes algorithm. The last section presents conclusions and plans for further research.

II. PAP SMEAR

Generally, a single Pap smear cell image consists of three regions: the nucleus, the cytoplasm that surrounds the nucleus, and the background, which is not part of the cell. The specimen cells are mostly taken from the columnar epithelium and squamous epithelium regions, as shown in Figure 1.

Fig. 1. Columnar Epithelium and Squamous Epithelium Region

Normal conditions change when pre-cancerous cells, also called dysplastic cells, are present in the cervix. The genetic information of these cells changes, so the cells no longer divide into the developmental layers as they should [15].

Based on the diagnosis, there are two types of pre-cancerous cells: dysplasia, which means irregular development, and carcinoma in situ [15]. Dysplasia can be further divided into three types: mild, moderate, and severe. Thus, every cell in the cervix can be grouped into one of seven cell classes. Table I shows the characteristics of a single cell of each class.

TABLE I. SINGLE CELL CHARACTERISTICS OF PAP SMEAR [7]

| No | Class Name | Characteristics |
|----|------------|-----------------|
| 1. | Normal Superficial | The cells are oval. The nucleus is very small. The nucleus-to-cytoplasm area ratio is very small. |
| 2. | Normal Intermediate | The cells are round. The nucleus is large. The nucleus-to-cytoplasm area ratio is small. |
| 3. | Normal Columnar | The cells are shaped like columns. The nucleus is large. The nucleus-to-cytoplasm area ratio is medium. |
| 4. | Mild (Light) Dysplasia | The nucleus is large and light-colored. The nucleus-to-cytoplasm area ratio is medium. |
| 5. | Moderate Dysplasia | The nucleus is large and dark. The cytoplasm is dark. The nucleus-to-cytoplasm area ratio is large. |
| 6. | Severe Dysplasia | The nucleus is large, dark, and irregular in shape. The cytoplasm is dark. The nucleus-to-cytoplasm area ratio is large. |
| 7. | Carcinoma In Situ | The nucleus is large, dark, and irregular in shape. The nucleus-to-cytoplasm area ratio is large. |
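The grouping of the seven cell classes into normal and abnormal categories can be sketched as follows. This is a minimal illustration, not the authors' code: the label strings and the function name `to_two_class` are hypothetical, and only the class names themselves come from the text.

```python
# Hypothetical label strings; the class names follow the seven classes listed in the text.
NORMAL_CLASSES = (
    "normal superficial",
    "normal intermediate",
    "normal columnar",
)
ABNORMAL_CLASSES = (
    "mild dysplasia",
    "moderate dysplasia",
    "severe dysplasia",
    "carcinoma in situ",
)
SEVEN_CLASSES = NORMAL_CLASSES + ABNORMAL_CLASSES


def to_two_class(label: str) -> str:
    """Map a seven-class label to the two-class scheme (normal / abnormal)."""
    key = label.strip().lower()
    if key in NORMAL_CLASSES:
        return "normal"
    if key in ABNORMAL_CLASSES:
        return "abnormal"
    raise ValueError(f"unknown class label: {label!r}")
```

With a mapping of this kind, the same feature table can be reused for both the two-class and the seven-class experiments by swapping only the label column.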
After 2003, Erik Martin classified Pap smear cells into seven classes, comprising three normal classes and four abnormal classes [4]. In 2005, Jantzen et al. proposed twenty features that can be extracted from the segmentation result of a Pap smear cell image; twenty features is a considerable number [5].

III. MATERIAL AND METHOD

The design of the proposed method is shown in Figure 2. The proposed model takes single-cell Pap smear images as input, namely the Herlev data [5]. This dataset is numerical: it has 20 feature attributes and 917 samples divided into 7 classes, which cover the normal and abnormal categories.

In the proposed method, preprocessing prepares the data in two groups. The first dataset has 2 classes (normal and abnormal), while the second has 7 classes: superficial squamous, intermediate squamous, columnar, mild dysplasia, moderate dysplasia, severe dysplasia, and carcinoma in situ. The next process derives the rule, or knowledge rule, from each data group.

Fig. 2. Proposed Algorithm Model Scheme Design

The purpose of PCA is to reduce the number of variables while preserving the information contained in the original data [18]. With PCA, an initial set of n variables is reduced to k new variables (principal components), with k smaller than n, and using only the k principal components can yield results comparable to those obtained with all n variables [19]. Each variable resulting from the reduction is called a principal component, or a factor. The new variables formed by PCA are not only fewer in number but also uncorrelated with one another [16].

The next step is to normalize the attribute weights, which reflect the relevance of the attributes with respect to the class attribute values [20]. The Weighted PCA process is:

Where:

$$\mu = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (2)$$

$$\sigma = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1} \qquad (3)$$
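The mean in Eq. (2), the dispersion term in Eq. (3), and the reduction from n variables to k principal components can be illustrated with a short sketch. It rests on several assumptions: the denominator of Eq. (3) is read as n - 1, the square root of that quantity is taken to obtain a standard deviation for z-score normalization, and plain (unweighted) PCA is computed by power iteration. The weighting step of the paper is not reproduced here, and all function names are hypothetical.

```python
import math
import random


def mean(values):
    """Eq. (2): arithmetic mean of one feature column."""
    return sum(values) / len(values)


def sample_std(values):
    """Square root of the Eq. (3) quantity (an assumption of this sketch)."""
    mu = mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / (len(values) - 1))


def standardize(rows):
    """Z-score every column of a row-major feature table."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [sample_std(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of a row-major feature table."""
    n, d = len(rows), len(rows[0])
    mus = [mean(c) for c in zip(*rows)]
    return [
        [sum((r[i] - mus[i]) * (r[j] - mus[j]) for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_components(cov, k, iterations=500, seed=0):
    """Leading k eigenvectors of a symmetric matrix via power iteration with deflation."""
    d = len(cov)
    rng = random.Random(seed)
    work = [row[:] for row in cov]
    components = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(work[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflate: remove the component just found before searching for the next one.
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components


def project(rows, components):
    """Express each row in terms of the k principal components."""
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components] for row in rows]
```

Applied to a standardized feature table, `project(z, top_components(covariance(z), k))` returns k columns per sample; those columns are mutually uncorrelated, which matches the property of principal components described in the text.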
ACKNOWLEDGMENT
This study used data from: Pap Smear Benchmark Data for Pattern Classification, J. Jantzen, J. Norup, G. Dounias, and B. Bjerregaard, University Dept. of Pathology, Herlev Ringvej 75, DK-2730 Herlev, Denmark.
Fig. 4. Comparison graph of NB algorithm and W-PCA