
CALIFORNIA STATE UNIVERSITY SAN MARCOS

PROJECT SIGNATURE PAGE

PROJECT SUBMITTED IN PARTIAL FULFILLMENT


OF THE REQUIREMENTS FOR THE DEGREE

MASTER OF SCIENCE

IN

COMPUTER SCIENCE

PROJECT TITLE: BREAST CANCER PREDICTION USING MACHINE LEARNING

AUTHOR: SANJANA BALASUBRAMANIAN

DATE OF SUCCESSFUL DEFENSE: APRIL 22nd, 2021

THE PROJECT HAS BEEN ACCEPTED BY THE PROJECT COMMITTEE IN


PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF
SCIENCE IN COMPUTER SCIENCE.

Dr. Ahmad Hadaegh
PROJECT COMMITTEE CHAIR SIGNATURE DATE

Dr. Yanyan Li
PROJECT COMMITTEE MEMBER SIGNATURE DATE

Name of Committee Member


PROJECT COMMITTEE MEMBER SIGNATURE DATE
Sanjana Balasubramanian

Department of Computer Science and Information Systems

Breast Cancer Prediction Using Machine Learning

Advisor: Dr. Ahmad R. Hadaegh

Date: April 22nd, 2021

List of the Figures

Fig No.      Figure Name                                                                                Page No.

Figure 1     System Architecture of Breast Cancer Prediction and Tracking system                           13

Figure 2     Proposed Breast Cancer Prediction and Tracking flow diagram                                   14

Figure 4.1   Haar scaling function and wavelet (left) and their frequency content (right)                  23

Figure 4.2   Scaling function and wavelet (left) and their frequency content (right)                       24

Figure 4.3   Daubechies 20 scaling function and wavelet (left) and their frequency content (right)         24

Figure 5     Input image transformation to Gray scale image                                                31

Figure 6     Output displaying that the inputted image has no cancerous cells and is normal                32

Figure 7     Output displaying that the inputted image has cancerous cells and is in Benign stage          33

Figure 8     Output displaying that the inputted image has cancerous cells and is in Malignant stage       33

Figure 9     Boundary detected Image: (a) Benign stage from Figure 7 and (b) Malignant stage from Figure 8  34

Table of Contents

List of the Figures ..................................................................................................................................... 2

Abstract ..................................................................................................................................................... 5

Chapter 1: Introduction ............................................................................................................................. 6

Chapter 2: Related Work .......................................................................................................................... 9

Chapter 3: Methodology ......................................................................................................................... 12

Chapter 4: Implementation...................................................................................................................... 18

4.1 Clustering based on the FCM algorithm ....................................................................................... 19

4.2 Feature extraction based on PCA, GLCM .................................................................................... 20

4.3 Classification based on KNN ........................................................................................................ 29

Chapter 5: Analysis of the Results .......................................................................................................... 32

Chapter 6: Conclusion and Future work ................................................................................................. 36

References .............................................................................................................................................. 38

Acknowledgment

The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mention of the people who made it possible. The constant guidance and encouragement provided by these people crowned my efforts with success and glory, and I take this opportunity to express my gratitude to one and all.

I am grateful to the management of my institute, California State University San Marcos, whose ideals and inspiration provided me with the facilities that made this project a success.

I am indebted, with a deep sense of gratitude, for the constant inspiration, encouragement, timely guidance, and valuable suggestions given to me by my advisor Dr. Ahmad Hadaegh and co-advisor Dr. Yanyan Li, Department of Computer Science, California State University, San Marcos.

Abstract

Today, more than 1.15 million cases of breast cancer are diagnosed worldwide each year. At present, only a small number of accurate prognostic and predictive factors are used clinically for managing patients with breast cancer. Early detection of this fatal disease is very important because it helps decrease the mortality rate and increase the survival period of breast cancer patients. This project uses mammography, the main test used for screening and early diagnosis; its analysis and processing are key to improving breast cancer prognosis. To detect breast cancer in a mammogram, image segmentation is performed with the Fuzzy C-means (FCM) technique. Features are then extracted from the segmented regions and used to train the system, and the trained images are finally classified into the different mammogram classes by an efficient classifier. Texture features are extracted using feature extraction techniques such as the Multi-level Discrete Wavelet Transform, Principal Component Analysis (PCA), and the Gray-Level Co-occurrence Matrix (GLCM). Morphological operators are used to distinguish masses and microcalcifications from the background tissue, and the KNN algorithm is used for classification. The boundaries of the tumor-affected region in the mammogram are marked and displayed to the doctor, along with the area of the tumor.

Chapter 1: Introduction

Breast cancer has become the most frequent health issue among women, especially women in middle age. Early detection of breast cancer can help women cure this disease, and the death rate can be reduced [1]. In the present-day scenario, mammograms are used to observe breast cancer and are known to be the most effective screening technique. In this project, the detection of cancer cells is carried out using machine learning techniques.

Image processing is a method of converting an image into digital form and performing operations on it in order to obtain an enhanced image or to extract useful information from it. It is a type of signal processing in which the input is an image and the output may be an image or characteristics/features associated with that image [2]. An image processing system usually treats images as two-dimensional signals and applies established signal processing methods to them. Image processing basically includes the following three steps:

• Importing the image with an optical scanner or by digital photography.

• Analyzing and manipulating the image, which includes data compression, image enhancement, and spotting patterns that are not visible to the human eye, as in satellite photographs.

• Output, the last stage, in which the result can be an altered image or a report based on the image analysis.

Digital processing techniques help in the manipulation of digital images by computers. A digital image is composed of a finite number of elements, each of which has a particular value at a particular location. These elements are referred to as picture elements, image elements, or pixels; "pixel" is the term most widely used to denote the elements of a digital image. Raw data from imaging sensors (for example, from a satellite platform) contains deficiencies, and to overcome such flaws and recover the original information it has to undergo various phases of processing. The three general phases that all types of data go through when using digital techniques are pre-processing, enhancement and display, and information extraction.

The first step is to analyze the images and represent them. The method of image representation addresses two fundamental problems: the ineffectiveness of capturing textural information, and the weak discriminative capacity of features, both of which result in low retrieval performance. In content-based image retrieval, similarity estimation is part of the primary task and strongly affects retrieval accuracy and retrieval time. The project aims to answer questions such as "Which similarity measure is appropriate for a particular feature type, and how can the similarity computation be reduced?" and "Which texture features are representative and discriminative enough to describe the mammogram of a given query?" In this project we use techniques such as Fuzzy C-means clustering and the KNN classifier together with feature extraction.

Once the mammographic image is collected, FCM clustering is used for image segmentation. In this system each data point belongs to several clusters with varying degrees of membership, based on an objective function. The segmented area is rigorously examined using the Multi-level Discrete Wavelet Transform to obtain edge details, which are then used as features. PCA is then applied to this data for analysis, along with GLCM. After the analysis, 13 features are extracted in the proposed framework and their pixel values, in matrix form, are stored in the database.

Machine learning is an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, while updating outputs as new data becomes available. The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.

Supervised learning algorithms are given both the input and the desired output, in addition to feedback about the accuracy of their predictions during training. A classification problem is one in which the output variable is a category or a group. Using the KNN classifier, which relies primarily on the shape of the cancer cells in the image, the system classifies the image as Benign, Malignant, or Normal once the features have been extracted and the classifier fully trained. By performing appropriate morphological operations, the system computes relevant region properties, such as size and Euler number, and displays the detected boundary image along with the tumor area.

Chapter 2: Related Work

[1] Siyabend Turgut et al., “Microarray Breast Cancer Data Classification Using Machine

Learning Methods” [IEEE 2018]

The paper uses microarray breast cancer data to classify patients using machine learning methods. In the first case, eight different machine learning algorithms were applied to the dataset and the classification results were noted. In the second case, two feature selection methods, Recursive Feature Elimination (RFE) and Randomized Logistic Regression (RLR), were applied to the microarray breast cancer dataset, with 50 features chosen as the stopping criterion. The same eight machine learning algorithms were then applied to the modified dataset, and the classification results were compared with each other and with the results of the first case. The methods applied are SVM, KNN, MLP, Decision Trees, Random Forest, Logistic Regression, AdaBoost, and Gradient Boosting Machines. After applying the two feature selection methods, SVM gave the best results. MLP was applied with different numbers of layers and neurons to examine their effect on classification accuracy [3].

[2] Varalatchoumy M et al., “Four Novel Approaches for Detection of Region of Interest in

Mammograms - A Comparative Study” [ICISS 2017]

The paper compares four novel approaches used for detection of the Region of Interest (ROI) in mammographic images, based on both database and real-time images. In Approach I, histogram equalization and dynamic thresholding techniques were used for preprocessing, and the ROI was partitioned from the preprocessed image using particle swarm optimization and k-means clustering. In Approach II, preprocessing was done using morphological operations such as erosion followed by dilation, and a modified watershed segmentation was used to identify the ROI. Approach III uses histogram equalization for preprocessing and an advanced level set approach for segmentation. Approach IV, considered the most efficient approach, uses different morphological operations and contrast-limited adaptive histogram equalization for image preprocessing, together with a very novel algorithm developed for detection of the ROI. Approaches I and II were applicable only to Mammographic Image Analysis Society (MIAS) database images, while Approaches III and IV were applicable to both MIAS and real-time hospital images. The graphs presented in the comparative study clearly show that the approach using the novel ROI-detection algorithm is the most efficient, accurate, and reliable, and can be used by radiologists to detect tumors in MRM images [4].

[3] Ammu P K et al., “Review on Feature Selection Techniques of DNA

Microarray Data” [IJCA 2013]

This paper reviews a few major feature selection techniques employed on microarray data and points out the merits and demerits of the various approaches. Feature selection from DNA microarray data is one of the most important procedures in bioinformatics. Biogeography Based Optimization (BBO) is an optimization algorithm that works on the basis of the migration of species between different habitats and the process of mutation. Particle Swarm Optimization (PSO) is an algorithm that works on the basis of the movement of particles in a search space. Redundancy-based feature selection approaches can be used to remove redundant genes from the selected genes, as the resulting gene set can achieve a better representation of the target class. The paper also describes a two-stage hybrid filter-wrapper method in which, in the first stage, a subset of the original feature set is obtained by applying information gain as the filtering criterion, and in the second stage a genetic algorithm is applied to the set of filtered genes. Finally, it discusses gene selection based on the dependency of features, where features are classified as independent, half-dependent, and dependent. Independent features do not depend on any other features, half-dependent features are more relevant in correlation with other features, and dependent features are fully dependent on other features [5].

[4] Bing Nan Li et al., "Integrating spatial fuzzy clustering with level set methods for automated

medical image segmentation” [ELSEVIER 2010]

A new Fuzzy Level Set algorithm is proposed in this paper to facilitate automated medical image

segmentation. It can directly evolve from the initial segmentation by spatial fuzzy clustering where

centroid and the scope of each subclass are estimated adaptively in order to minimize a pre-defined

cost function. The controlling parameters of Level Set evolution are also estimated from the results

of fuzzy clustering. The level set methods utilize dynamic variational boundaries for image

segmentation. The new Fuzzy Level Set algorithm automates the initialization and parameter

configuration of the level set segmentation, using spatial fuzzy clustering. It employs a Fuzzy-C

means (FCM) with spatial restrictions to determine the approximate contours of interest in a

medical image. Moreover, the Fuzzy Level Set algorithm is enhanced with locally regularized

evolution. Such improvements facilitate level set manipulation and lead to more robust

segmentation. Performance evaluation of the proposed algorithm was carried out on medical images from different modalities, and the results confirm its effectiveness for image segmentation [6].

Chapter 3: Methodology

Figure 1 shows the system architecture of the breast cancer prediction and tracking system. The system mainly consists of four processes. Once the image is acquired, the system converts it into a gray scale image. By applying suitable image segmentation techniques, the system aims at extracting the meaningful objects lying in the image. Clustering is one of the powerful image segmentation techniques and involves grouping data points into clusters; the system implements clustering using Fuzzy C-means (FCM) [7]. The segmented region is completely analyzed using the Multi-level Discrete Wavelet Transform and Principal Component Analysis (PCA), along with Gray Level Co-occurrence Matrix (GLCM) features [8]. In total, 13 features are extracted by the system and their pixel values, in the form of a matrix, are stored in the database (the db.mat file); some of the features extracted are mean, variance, entropy, etc. The image then undergoes classification with respect to the dataset in the db.mat file and is classified as Benign, Malignant, or Normal. The system also performs morphological operations and calculates region properties of the image such as Area, Eccentricity, and Euler number. If the image has cancer cells, the tumor area is computed and displayed by the system along with the boundary-detected image.

Figure 1: System Architecture of Breast Cancer Prediction and Tracking system

The physician uploads the patient's mammogram to the system, where it is subjected to image segmentation. The image is pre-processed using image processing techniques, and extraction methods are applied to extract the necessary features. The extracted features are given to the classifier model, and the test image is then classified with respect to the training data present in the database [9]. The test image is marked as either cancerous or non-cancerous. If the test image is marked as cancerous, the tumor region is measured, and the findings are shown to the doctor along with the detected boundary image. These steps are shown below in Figure 2.

Figure 2. Proposed breast cancer prediction and Tracking flow diagram

Fuzzy C-means is a clustering method in which data points are associated with several clusters with varying membership degrees. FCM is one of the segmentation techniques applicable to gray level images. In the proposed method, the initial locations of the cluster centers and the membership degree of every data point are determined with the aid of Fuzzy C-Means [10]. Classification after feature extraction relies on the shapes of the cancer cells in the image, and the method makes use of all features contained in the database for classification purposes. The system analyzes similarity measures and uses clustering together with feature extraction techniques such as the Multi-level Discrete Wavelet Transform, Principal Component Analysis, and GLCM to reduce the data set and improve the similarity measurement.

The next step is feature extraction. Dimensionality reduction can be a helpful step in the visualization and analysis of high-dimensional datasets, while maintaining as much variation as possible [11]. This technique is used to simplify the classification and predict a better output. The direction of maximum variance must be found in order to segregate the data into multiple clusters. For example, if we choose 3 different centroids and project the data onto the first principal components of the high-dimensional space, we can see how the data are distributed among the clusters, and this becomes apparent in a 2D diagram. This reduces overfitting of the data and avoids clusters that are too closely spaced or overlapping.

Likewise, the larger the number of explanatory variables permitted in a regression study, the greater the risk that the model will be overfitted, causing the output to fail to generalize to the rest of the datasets. One approach, particularly when there are strong associations between the potential variables, is to reduce them to a few principal components and then run the regression against those components; this is principal component regression (PCR).

PCA is a method that uses a linear transformation to convert a group of observed correlated variables (entities each of which takes on different numerical values) into principal components. PCA is a tool used mainly for exploratory data analysis and predictive model building [5]. Reduction of the dimensionality can also be useful when the explanatory variables in a dataset are noisy. Using PCA, the system can concentrate most of the signal into the first few principal components and then capture the features while keeping the reduced dimensionality; the later components, which tend to be dominated by noise, can be discarded without great loss of features.

After the PCA extraction, the next step is segmentation. Segmentation is used to improve the delineation of the image for visual examination; its purpose is to examine the uncertain areas and help classify the irregularities between cancerous and non-cancerous features. Fuzzy C-means is used to decompose the tumor-affected area by adjusting the number of clusters and initializing estimates of the cluster centers, which are intended to mark the mean location of each cluster [12]. The membership values of a pixel represent the probabilities that it belongs to a particular cluster: pixels with high membership values lie close to the centroids of their clusters, while pixels with low membership values lie far from the detected centroids. Convergence of the membership function can be checked by the change in the cluster centers at two successive iterations.

Feature extraction is performed by splitting the dataset into two categories: a training set and a testing set. The training set contains the larger amount of data and the testing set the smaller amount. The data are sampled randomly to ensure that the small testing set and the large training set are comparable.

The Multi-Level Wavelet Transform strategy combined with PCA is a quantitative procedure that applies an orthogonal transformation to convert a set of observations of correlated variables into the key factors, also called the principal components, which are highlighted for the segmented region. The GLCM works with the testing sets to obtain the features for classification, with the aim of finding diagnostic faults in the examination [13].

The dataset is then processed with exploratory statistical tools that sample a small portion of the data, intended to show what the whole looks like, to guarantee that the testing and training sets are comparable; these sets are subsequently grouped via different classification procedures. For example, the KNN (k-nearest neighbor) algorithm is used here for classification, whereas a support vector machine establishes classification based on decision boundaries [14]. These decision boundaries are produced by a decision plane: a plane used to arrange and separate the different classes. The result is the sorting of the tumor outcome as normal, critical, or early stage, or an output expressing doubt as to whether it is cancerous.

Chapter 4: Implementation

Breast cancer does not always show symptoms, which makes it harder to take proper precautions in time. Early detection and proper classification are therefore the only way to reduce cancer fatalities, and this is a major task in the medical field. The fundamental problems are the ineffectiveness of capturing textural information and the low retrieval performance caused by the poor discriminative capability of the features.

Mammography is used for early-stage detection, diagnosis, and screening; the key elements here are processing and analysis for better prognosis results. Image segmentation is performed using the FCM technique. Certain features are then extracted from these segmented images and used for training. Finally, the trained images are classified by an efficient and accurate classifier.

In the FCM (Fuzzy C-means) algorithm, every data point belongs to multiple clusters with varying degrees of membership, based on an objective function. The multi-level Discrete Wavelet Transform is then used to analyze the segmented region. Textural features such as the pixel values of the cancer cells are stored in the database in the form of a matrix.

Once feature extraction is done and the system is trained, it classifies the image as either Benign (initial stage), Malignant (harmful), or Normal. For this it uses the KNN algorithm, which classifies the image depending on the shape of the cancer cells [15]. After performing certain morphological operations, the system provides region properties such as Area, Euler number, etc. and shows the detected boundary along with the tumor area. Several techniques, including the Multi-level Discrete Wavelet Transform, PCA, and GLCM, are used to extract the textural features. The boundary of the tumor is marked properly and shown to the doctor.
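To make this end-to-end flow concrete, a minimal illustrative Python sketch that strings the stages together (gray scale conversion, texture feature computation, and KNN classification against a stored feature database) is given below. The file names, the assumed db.mat layout (a feature matrix X and a label vector y), the particular texture properties, and the choice of k = 3 are assumptions made for illustration only, not the project's actual code or configuration.

import numpy as np
from scipy.io import loadmat
from skimage import io, color, img_as_float, feature
from sklearn.neighbors import KNeighborsClassifier

def to_gray(path):
    # Load the mammogram and reduce it to a 2D gray scale image in [0, 1].
    img = img_as_float(io.imread(path))
    return color.rgb2gray(img) if img.ndim == 3 else img

def texture_features(gray):
    # A few GLCM-based texture features plus simple statistics (an illustrative
    # subset of the 13 features described in the text).
    q = np.uint8(gray * 255)
    glcm = feature.graycomatrix(q, distances=[1], angles=[0], levels=256,
                                symmetric=True, normed=True)
    return np.array([
        feature.graycoprops(glcm, "contrast")[0, 0],
        feature.graycoprops(glcm, "homogeneity")[0, 0],
        feature.graycoprops(glcm, "energy")[0, 0],
        gray.mean(),
        gray.var(),
    ])

# Assumed database layout: X holds one feature row per training image and
# y holds the class labels (0 = Normal, 1 = Benign, 2 = Malignant).
db = loadmat("db.mat")
knn = KNeighborsClassifier(n_neighbors=3).fit(db["X"], db["y"].ravel())

test = texture_features(to_gray("mammogram.png")).reshape(1, -1)
print(["Normal", "Benign", "Malignant"][int(knn.predict(test)[0])])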

4.1 Clustering based on the FCM algorithm

Fuzzy c-means (FCM) is a clustering method that allows each data point to belong to

multiple clusters with varying degrees of membership.

FCM is based on the minimization of the following objective function:

J_m = Σ_{i=1}^{D} Σ_{j=1}^{N} μ_ij^m ||x_i − c_j||²

where

• D is the number of data points.

• N is the number of clusters.

• m is the fuzzy partition matrix exponent for controlling the degree of fuzzy overlap, with m > 1.

Fuzzy overlap refers to how fuzzy the boundaries between clusters are, that is the number

of data points that have significant membership in more than one cluster.

• xi is the ith data point.

• cj is the center of the jth cluster.

• μij is the degree of membership of xi in the jth cluster. For a given data point, xi, the sum of the

membership values for all clusters is one.

FCM performs the following steps during clustering:

1. Randomly initialize the cluster membership values, μij.

2. Calculate the cluster centers:

c_j = ( Σ_{i=1}^{D} μ_ij^m x_i ) / ( Σ_{i=1}^{D} μ_ij^m )

3. Update μij according to the following:

μ_ij = 1 / Σ_{k=1}^{N} ( ||x_i − c_j|| / ||x_i − c_k|| )^(2/(m−1))

4. Calculate the objective function, J_m.

5. Repeat steps 2–4 until Jm improves by less than a specified minimum threshold or until

after a specified maximum number of iterations.
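A minimal NumPy sketch of these steps (random initialization, center update, membership update, convergence check) is shown below. It is an illustrative implementation of the update rules above, with an example number of clusters, and not the exact code used in the project.

import numpy as np

def fcm(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # Fuzzy C-means: X is (D, n_features); returns cluster centers and memberships.
    rng = np.random.default_rng(seed)
    D = X.shape[0]
    # Step 1: randomly initialize memberships so each row sums to one.
    U = rng.random((D, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: cluster centers as membership-weighted means.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 3: update memberships from the distances to the centers.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        # Steps 4-5: stop when the memberships (and hence J_m) change very little.
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Example: cluster gray-level pixel intensities into 3 fuzzy clusters.
pixels = np.random.rand(1000, 1)        # stand-in for flattened mammogram intensities
centers, memberships = fcm(pixels, n_clusters=3)
labels = memberships.argmax(axis=1)     # hard assignment for a segmentation mask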

4.2 Feature extraction based on PCA, GLCM

4.2.1 Dimensionality reduction:

Such dimensionality reduction can be a very useful step for visualizing and processing

high-dimensional datasets, while still retaining as much of the variance in the dataset as possible.

For example, selecting L = 2 and keeping only the first two principal components finds the two-

dimensional plane through the high-dimensional dataset in which the data is most spread out, so if

the data contains clusters these too may be most spread out, and therefore most visible to be plotted

out in a two-dimensional diagram; whereas if two directions through the data (or two of the original

variables) are chosen at random, the clusters may be much less spread apart from each other, and

may in fact be much more likely to substantially overlay each other, making them

indistinguishable.

Similarly, in regression analysis, the larger the number of explanatory variables allowed,

the greater is the chance of overfitting the model, producing conclusions that fail to generalize to

other datasets. One approach, especially when there are strong correlations between different

possible explanatory variables, is to reduce them to a few principal components and then run the

regression against them, a method called principal component regression.

Dimensionality reduction may also be appropriate when the variables in a dataset are noisy.

If each column of the dataset contains independent, identically distributed Gaussian noise, then the columns of the score matrix T will also contain similarly distributed Gaussian noise (such a distribution is invariant under the effects of the loading matrix W, which can be thought of as a high-dimensional rotation of the coordinate axes). However, with more of the total variance concentrated in the first few principal components compared to the same noise variance, the proportionate effect of the noise is smaller: the first few components achieve a higher signal-to-noise ratio. PCA thus can have the effect of concentrating much of the signal into the first few principal components, which can usefully be captured by dimensionality reduction, while the later principal components may be dominated by noise and so disposed of without great loss.

4.2.2 Wavelet Transform:

The wavelet transform is similar to the Fourier transform (or, more closely, to the windowed Fourier transform), but with a completely different merit function. The main difference is this: the Fourier transform decomposes the signal into sines and cosines, i.e. functions localized in Fourier space; in contrast, the wavelet transform uses functions that are localized in both real and Fourier space. Generally, the wavelet transform can be expressed by the following equation:

F(a, b) = ∫ f(x) ψ*_(a,b)(x) dx

where the * is the complex conjugate symbol and ψ is some function. This function can be chosen arbitrarily provided that it obeys certain rules.

As can be seen, the wavelet transform is in fact an infinite set of various transforms, depending on the merit function used for its computation. This is the main reason why we encounter the term "wavelet transform" in very different situations and applications. There are also many ways to categorize the types of wavelet transforms; here we show only the division based on wavelet orthogonality [16]. We can use orthogonal wavelets for the discrete wavelet transform and non-orthogonal wavelets for the continuous wavelet transform. These two transforms have the following properties:

1. The discrete wavelet transform returns a data vector of the same length as the input. Usually, even in this vector many entries are almost zero. This corresponds to the fact that it decomposes the signal into a set of wavelets (functions) that are orthogonal to their translations and scalings. We therefore decompose such a signal into a number of wavelet coefficients that is the same as or smaller than the number of signal data points. Such a wavelet spectrum is very good for signal processing and compression, for example, since we get no redundant information here.

2. The continuous wavelet transform, by contrast, returns an array one dimension larger than the input data. For 1D data we obtain an image of the time-frequency plane. We can easily see the evolution of the signal frequencies during the duration of the signal and compare the spectrum with other signals' spectra. Since a non-orthogonal set of wavelets is used, the data are highly correlated, so there is a large redundancy. This helps to present the results in a more easily interpretable form.

4.2.3 Discrete Wavelet Transform (DWT):

In numerical analysis and functional analysis, a discrete wavelet transform (DWT) is any

wavelet transform for which the wavelets are discretely sampled. As with other wavelet

transforms, a key advantage it has over Fourier transforms is temporal resolution: it captures both

frequency and location information.

The discrete wavelet transform (DWT) is an implementation of the wavelet transform using a discrete set of wavelet scales and translations obeying some defined rules. In other words, this transform decomposes the signal into a mutually orthogonal set of wavelets, which is the main difference from the continuous wavelet transform (CWT), or its implementation for discrete time series, sometimes called the discrete-time continuous wavelet transform (DTCWT).

The wavelet can be constructed from a scaling function which describes its scaling properties. The restriction that the scaling function must be orthogonal to its discrete translations implies some mathematical conditions on it, e.g. the dilation equation

φ(x) = Σ_k a_k φ(Sx − k)

where S is a scaling factor (usually chosen as 2). Moreover, the area under the function must be normalized and the scaling function must be orthogonal to its integer translations, i.e.

∫ φ(x) dx = 1,   ∫ φ(x) φ(x + l) dx = δ_{0,l}

In the following figures, some wavelet scaling functions and wavelets are plotted. The best known family of orthonormal wavelets is the Daubechies family; these wavelets are usually denoted by the number of nonzero coefficients a_k.

Figure 4.1: Haar scaling function and wavelet (left) and their frequency content (right).

Figure 4.2: Scaling function and wavelet (left) and their frequency content (right).

Figure 4.3: Daubechies 20 scaling function and wavelet (left) and their frequency content (right).

There are several types of implementation of the DWT algorithm. The oldest and best known is the Mallat (pyramidal) algorithm. In this algorithm two filters, a smoothing and a non-smoothing one, are constructed from the wavelet coefficients, and those filters are used recurrently to obtain data for all the scales. If the total number of data points is D = 2^N and the signal length is L, first D/2 coefficients at scale L/2^(N−1) are computed, then (D/2)/2 coefficients at scale L/2^(N−2), and so on, up to finally obtaining 2 coefficients at scale L/2. The result of this algorithm is an array of the same length as the input, where the data are usually sorted from the largest scales to the smallest ones.

The discrete wavelet transform can be used for easy and fast de-noising of a noisy signal. If we keep only a limited number of the largest coefficients of the discrete wavelet transform spectrum and perform an inverse transform (with the same wavelet basis), we obtain a more or less de-noised signal. There are several ways to choose which coefficients to keep. Within Gwyddion, universal thresholding, scale-adaptive thresholding, and scale- and space-adaptive thresholding are

implemented. For threshold determination within these methods we first determine the noise

variance guess given by

σ̂ = median(|Y_ij|) / 0.6745

where Y_ij corresponds to the coefficients of the highest-scale sub-band of the decomposition (where most of the noise is assumed to be present). Alternatively, the noise variance can be obtained in an independent way, for example from the AFM signal variance while not scanning. For the highest-frequency sub-band (universal thresholding), or for each sub-band (scale-adaptive thresholding), or for each pixel neighborhood within a sub-band (scale- and space-adaptive thresholding), the variance is computed as

σ̂_Y² = (1/n²) Σ_{i,j=1}^{n} Y_ij²

The threshold value is finally computed as

T = σ̂² / σ̂_X

where

σ̂_X = sqrt( max(σ̂_Y² − σ̂², 0) )

When the threshold for a given scale is known, we can either remove all the coefficients smaller than the threshold value (hard thresholding) or lower the absolute value of these coefficients by the threshold value (soft thresholding).
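As an illustration of this kind of wavelet thresholding, the sketch below de-noises a 1D signal with PyWavelets using the median-based noise estimate and soft thresholding. The choice of wavelet ('db4'), the decomposition level, and the use of the universal threshold rule are illustrative assumptions, not the settings used in this project.

import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    # De-noise a 1D signal by soft-thresholding its detail coefficients.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Robust noise estimate from the finest-scale detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # Universal threshold; other rules (e.g. scale-adaptive) could be used instead.
    threshold = sigma * np.sqrt(2 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]

# Example: a noisy sine wave.
t = np.linspace(0, 1, 1024)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)
clean = wavelet_denoise(noisy)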

4.2.4 Principal component analysis (PCA):

It is a statistical procedure that uses an orthogonal transformation to convert a set of

observations of possibly correlated variables (entities each of which takes on various numerical

values) into a set of values of linearly uncorrelated variables called principal components. If there

are n observations with p variables, then the number of distinct principal components is min (n-

1,p). This transformation is defined in such a way that the first principal component has the largest

possible variance (that is, accounts for as much of the variability in the data as possible), and each

succeeding component in turn has the highest variance possible under the constraint that it is

orthogonal to the preceding components [17]. The resulting vectors (each being a linear

combination of the variables and containing n observations) are an uncorrelated orthogonal basis

set. PCA is sensitive to the relative scaling of the original variables.

PCA is mostly used as a tool in exploratory data analysis and for making predictive

models. It is often used to visualize genetic distance and relatedness between populations. PCA

can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular

value decomposition of a data matrix, usually after a normalization step of the initial data. The normalization of each attribute consists of mean centering (subtracting each variable's measured mean from its data values so that the empirical mean, i.e. the average, is zero) and, possibly, normalizing each variable's variance to make it equal to 1; see Z-scores. The results of a PCA are usually

discussed in terms of component scores, sometimes called factor scores (the transformed variable

values corresponding to a particular data point), and loadings (the weight by which each

standardized original variable should be multiplied to get the component score). If component

scores are standardized to unit variance, loadings must contain the data variance in them (and that

is the magnitude of eigenvalues). If component scores are not standardized (therefore they contain

the data variance) then loadings must be unit-scaled, ("normalized") and these weights are called

eigenvectors; they are the cosines of orthogonal rotation of variables into principal components or

back.

PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its

operation can be thought of as revealing the internal structure of the data in a way that best explains

the variance in the data. If a multivariate dataset is visualized as a set of coordinates in a high-

dimensional data space (1 axis per variable), PCA can supply the user with a lower dimensional

picture, a projection of this object when viewed from its most informative viewpoint. This is done by using only the first few principal components, so that the dimensionality of the transformed data is reduced.

PCA is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.

PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate

systems that optimally describe the cross-covariance between two datasets while PCA defines a

new orthogonal coordinate system that optimally describes variance in a single dataset.

PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of

the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the

variance along that axis is also small, and by omitting that axis and its corresponding principal

component from our representation of the dataset, we lose only a commensurately small amount

of information.

To find the axes of the ellipsoid, we must first subtract the mean of each variable from the

dataset to center the data around the origin. Then, we compute the covariance matrix of the data

and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we

must normalize each of the orthogonal eigenvectors to become unit vectors. This procedure is

sensitive to the scaling of the data, and there is no consensus as to how to best scale the data to

obtain optimal results.
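A small NumPy sketch of this procedure (mean-centering, covariance, eigendecomposition, projection) is shown below; it is illustrative only and not the project's implementation, and the random feature matrix stands in for the extracted mammogram features.

import numpy as np

def pca(X, n_components):
    # Project X (samples x variables) onto its first n_components principal axes.
    X_centered = X - X.mean(axis=0)            # subtract each variable's mean
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # sort axes by decreasing variance
    components = eigvecs[:, order[:n_components]]
    scores = X_centered @ components           # principal component scores
    return scores, components, eigvals[order]

# Example: reduce 13 extracted texture features to 2 components for visualization.
X = np.random.rand(100, 13)
scores, components, variances = pca(X, n_components=2)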

4.2.5 Grey Level Co-occurrence Matrix (GLCM):

The approach that we are implementing uses the GLCM, i.e. the grey level co-occurrence matrix. A GLCM is a matrix where the number of rows and columns is equal to the number of gray levels, G, in the image. The matrix element P(i, j | Δx, Δy) is the relative frequency with which two pixels, separated by a pixel distance (Δx, Δy), occur within a given neighborhood, one with intensity i and the other with intensity j. The matrix element P(i, j | d, θ) contains the second-order statistical probability values for transitions between gray levels i and j at a particular displacement distance d and a particular angle θ. Using a large number of intensity levels G implies storing a lot of temporary data, i.e. a G × G matrix for each combination of (Δx, Δy) or (d, θ). Due to their large dimensionality, GLCMs are very sensitive to the size of the texture samples on which they are estimated. Hence, the number of gray levels is often reduced.

GLCM algorithm

1. Count the co-occurring pixel pairs in the image in which the data is saved.

2. Store the counts in the matrix P[i, j].

3. Check the similarity between pixels in the matrix by applying a histogram technique.

4. Calculate the contrast factor from the matrix:

Contrast = Σ_{i,j} (i − j)² g(i, j)

5. The elements of g need to be normalized by dividing them by the total number of pixel pairs.

6. Image segmentation: an image segmentation technique is applied which segments the image based on its properties. Acquiring the image and splitting it into several regions (or segments), which are then used individually for extracting the image features, is termed segmentation. In this work, a region-based k-means segmentation technique is applied for the image segmentation.
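For illustration, a small NumPy routine that builds a GLCM for a single displacement (Δx, Δy) and computes the contrast feature from it might look like the sketch below. The quantization to 8 gray levels, the single offset, and the random 8-bit test image are simplifying assumptions.

import numpy as np

def glcm(image, dx=1, dy=0, levels=8):
    # Gray level co-occurrence matrix for one displacement (dx, dy) on an 8-bit image.
    q = (image.astype(int) * levels) // 256          # quantize to `levels` gray levels
    P = np.zeros((levels, levels), dtype=float)
    rows, cols = q.shape
    for r in range(rows - dy):
        for c in range(cols - dx):
            P[q[r, c], q[r + dy, c + dx]] += 1       # count the pair (i, j) at this offset
    return P / P.sum()                               # normalize to joint probabilities

def contrast(P):
    i, j = np.indices(P.shape)
    return np.sum((i - j) ** 2 * P)                  # contrast = sum over i,j of (i - j)^2 * P(i, j)

img = (np.random.rand(64, 64) * 255).astype(np.uint8)   # stand-in for a gray scale patch
print("contrast:", contrast(glcm(img)))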

4.3 Classification based on KNN

For segmentation, the approach that we are implementing uses the k-means clustering algorithm, one of the popular methods used for segmentation. In k-means we divide the image into various clusters, i.e. we divide a set of data into a specific number of groups: the data are classified into k disjoint sets. A set of k centroids is determined, and each point is assigned to the centroid from which it has the least distance [18]. There are several ways of defining the distance to the nearest centroid; one such method is the Euclidean distance. The Euclidean distance is calculated for each data point in turn, and the data point is assigned to the cluster whose centroid is at the minimum distance. The points assigned to a cluster are then averaged to obtain a new centroid. This is how k-means clustering works.

INPUT: Dataset

OUTPUT: Clustered data

Start ()

1. Read the dataset; the dataset has "r" rows and "m" columns.

2. For (i = 0; i < r; i++)        // selection of the centroid points
       For (j = 0; j < m; j++)
           Select k = data(i, j);
       End
   End

3. Calculation of the Euclidean distance:
   For (i = 0; i < r; i++)
       For (j = 0; j < m; j++)
           A(i) = data(i);
           B(j) = data(j);
           Distance = sqrt((A(i+1) - A(i))^2 + (B(j+1) - B(j))^2);
       End
   End

4. Normalization:
   For (k = 0; k < data; k++)
       Swap k(i+1) and k(i);
   End

End
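Since the goal of this section is KNN-based classification of the extracted feature vectors, a minimal scikit-learn sketch is shown below. The random feature matrix, the labels, the 70/30 split, and the choice of k = 3 are illustrative assumptions rather than the project's actual configuration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the 13 extracted features per mammogram and their class labels
# (0 = Normal, 1 = Benign, 2 = Malignant).
X = np.random.rand(200, 13)
y = np.random.randint(0, 3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))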

The comparison below shows the benefits and certain limitations of the different approaches [19].

1) Recursive Feature Elimination [20]

Pros: This technique fits a model while recursively removing the weakest features until a specified number of features remains. By doing so, the stronger features are kept for better prediction.

Cons: The process of removing features can require considerable computational time.

2) Deep Learning [21]

Pros: It offers a multi-layered arrangement of artificial neurons, through which the error rate is minimal.

Cons: The main drawback is that it requires a large amount of data to train and can take more processing time. It can also consume a large amount of memory in the system.

3) Particle Swarm Optimization [22]

Pros: This technique is easy to implement and has few parameters. It processes quickly, and the particles do not overlap.

Cons: With this type of feature selection it can be hard to define the initial parameters, and if there is too much scattering the solution tends to fail.

In the proposed system there are multiple levels of feature extraction, and classification is performed with KNN. In this way the clustering is accurate, and the classification becomes easier and more efficient. Large datasets are not required; even with fewer datasets the system can predict the output efficiently.

Chapter 5: Analysis of the Results

This section describes the screens of the "Breast cancer prediction and tracking using machine learning methods" system; the figures for each module are shown below. The initial stage is to pre-process the image and convert it to a gray scale image, which helps the system perform segmentation in the subsequent steps.

Figure 5 shows the transformation of the input image to gray scale. The system converts an image of any dimension to a 2D image: for example, if the image is 3D (color), it is automatically converted to a 2D image, and if it is already 2D, the gray scale image remains 2D.

Figure 5. Input image transformation to Gray scale image.
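A minimal sketch of this conversion step (a 3-channel color array reduced to a 2D gray scale array) is shown below. The luminance weights follow the common ITU-R 601 convention and are an assumption about the exact conversion used; the file name is hypothetical.

import numpy as np
from skimage import io

def to_grayscale(image):
    # Return a 2D gray scale array from either a 2D or a 3D (color) image.
    if image.ndim == 2:                          # already gray scale
        return image.astype(float)
    weights = np.array([0.299, 0.587, 0.114])    # ITU-R 601 luminance weights
    return image[..., :3].astype(float) @ weights

gray = to_grayscale(io.imread("mammogram.png"))
print(gray.shape)                                # (rows, cols)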

The following stage is to segment the image. In the segmented image all the noise is removed, and normal cells are distinguished from suspected ones. To do this, we first convert the gray image to binary form and apply the FCM algorithm; FCM assigns each data point to every cluster with a different membership grade. In this project the FCM clustered segmented image has 3 bins (Benign, Malignant, and Normal).

Once the segmentation is done, we extract the features. Feature extraction is achieved by obtaining the edge features from the multi-level wavelet transform, PCA, and GLCM. Using KNN, the dataset is fully trained and categorized. Once the features are extracted, the classification of the images occurs; here we have 3 classes (Benign, Malignant, and Normal). Figure 6 shows an input image that is normal, whereas Figures 7 and 8 show the affected cancerous region. Figure 9 shows the boundary detection in the given image; the boundary is detected only if the image has cancerous cells.
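The boundary marking and region properties (area, eccentricity, Euler number) described above can be sketched with scikit-image as shown below. The input file names, the binarization, and the minimum object size are illustrative assumptions, not the project's exact processing.

import numpy as np
from skimage import io, measure, morphology, segmentation

mask = io.imread("segmented_tumor_mask.png") > 0             # binary mask from the segmentation step
mask = morphology.remove_small_objects(mask, min_size=64)    # morphological clean-up

labels = measure.label(mask)
for region in measure.regionprops(labels):
    print("area (pixels):", region.area,
          " eccentricity:", round(region.eccentricity, 3),
          " Euler number:", region.euler_number)

# Mark the detected boundaries on the gray scale image for display to the doctor.
gray = io.imread("mammogram_gray.png")
outlined = segmentation.mark_boundaries(gray, labels, color=(1, 0, 0))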

Figure 6. Output displaying that the inputted image has no cancerous cells and is normal.

Figures 7 and 8 show input images that have cancerous cells and display the tumor area in terms of both mm² and pixels. Figure 7 also shows that the patient is in the Benign stage, while Figure 8 shows that the patient is in the Malignant stage.

Figure 7. Output displaying that the inputted image has cancerous cells and is in Benign stage.

Figure 8. Output displaying that the inputted image has cancerous cells and is in Malignant stage.

Figure 9 displays the detected boundary images for the benign and malignant cases: (a) shows the affected boundaries of the Benign stage image from Figure 7, and (b) shows the affected boundaries of the Malignant stage image from Figure 8.

(a) (b)

Figure 9. Boundary detected Image: (a) Benign stage from Figure 7 and (b) Malignant stage from Figure 8.

In this project the accuracy is calculated using the formula below. Let x be the true labels of the testing data and p the corresponding predicted labels. Here we take 70 samples for testing and count the correctly classified samples against the total number of test samples.

Accuracy = (number of test samples where p equals x) / (total number of test samples) × 100.

Using this formula the accuracy is computed and shows good efficiency. The combination of the Multi-Level Wavelet Transform strategy with PCA, with 13 features extracted and then classified, gives an average accuracy of nearly 92%.
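As a small sketch, this accuracy computation over hypothetical true and predicted label arrays looks like the following:

import numpy as np

# Hypothetical true labels (x) and predicted labels (p) for 70 test samples.
x = np.random.randint(0, 3, size=70)
p = np.random.randint(0, 3, size=70)

accuracy = np.sum(p == x) / x.size * 100
print(f"accuracy: {accuracy:.1f}%")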

Chapter 6: Conclusion and Future work

Breast cancer is one of the diseases that causes the highest number of deaths every year. At present, only a few accurate prognostic and predictive factors are used clinically for managing patients with breast cancer. Here, by making use of clustering with a level set approach, high accuracy can be achieved in the detection of affected cell shapes, with exact marking of the detected contours. The proposed system helps to enhance the performance of mammogram retrieval by selecting optimal features.

Fuzzy C-means (FCM) clustering has been used for image segmentation. Each data point belongs to multiple clusters with varying degrees of membership, based on the objective function. The segmented region is completely analyzed using the Multi-level Discrete Wavelet Transform and Principal Component Analysis (PCA), along with Gray Level Co-occurrence Matrix (GLCM) features. In total, 13 features are extracted and their pixel values, in the form of a matrix, are stored in the database. After the features are extracted and the system completely trained, it classifies the image as Benign, Malignant, or Normal using the KNN classifier technique, which depends mainly on the shape of the cancer cells in the image.

By performing suitable morphological operations, the system computes suitable region properties such as Area, Euler number, etc., and displays the boundary-detected image along with the tumor area. These techniques improve accuracy in tracking breast cancer cells. The correctness of the data classification is assessed with respect to the efficiency and effectiveness of each algorithm in terms of accuracy, precision, sensitivity, and specificity. Hence the design provides high accuracy and maximum efficiency in the prediction and tracking of breast cancer. The combination of the Multi-Level Wavelet Transform strategy with PCA, with 13 features extracted and then classified, gives an average accuracy of nearly 92%.

As a future improvement, the system can add more features such as recommendation of

medicines/treatments based on the severity of the patient. This prediction and recommendation

system can help doctors to diagnose and cure the disease more efficiently.

References

1. DeSantis C, Siegel R, Bandi P, Jemal A. Breast cancer statistics, 2011. CA Cancer J Clin. 2011;61(6):409-418. doi:10.3322/caac.20134.
2. Y. Lu, J.-Y. Li, Y.-T. Su, and A.-A. Liu, ‘‘A review of breast cancer detection in medical images,’’ in Proc. IEEE Vis.
Commun. Image Process. (VCIP), Dec. 2018, pp. 1–4.
3. Turgut, Siyabend et al. "Microarray breast cancer data classification using machine learning methods." 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT) (2018): 1-3.
4. M. Ravishankar and M. Varalatchoumy, "Four novel approaches for detection of region of interest in mammograms — A
comparative study," 2017 International Conference on Intelligent Sustainable Systems (ICISS), Palladam, 2017, pp. 261-265,
doi: 10.1109/ISS1.2017.8389410.
5. Ammu P K and Preeja V. Review on Feature Selection Techniques of DNA Microarray Data. International Journal of Computer Applications 61(12):39-44, January 2013.
6. Bing Nan Li, Chee Kong Chui, Stephen Chang, S.H. Ong,Integrating spatial fuzzy clustering with level set methods for
automated medical image segmentation, Computers in Biology and Medicine, Volume 41, Issue 1, 2011, Pages 1-10, ISSN
0010-4825.
7. Reddy, V. Anji, and Badal Soni. "Breast Cancer Identification and Diagnosis Techniques." Machine Learning for Intelligent
Decision Science. Springer, Singapore, 2020. 49-70.
8. C. Yanyun, Q. Jianlin, G. Xiang, C. Jianping, J. Dan and C. Li, "Advances in Research of Fuzzy C-Means Clustering
Algorithm," 2011 International Conference on Network Computing and Information Security, Guilin, 2011, pp. 28-31, doi:
10.1109/NCIS.2011.104.
9. V. Pali, S. Goswami and L. P. Bhaiya, "An Extensive Survey on Feature Extraction Techniques for Facial Image Processing,"
2014 International Conference on Computational Intelligence and Communication Networks, 2014, pp. 142-148, doi:
10.1109/CICN.2014.43.
10. Jolliffe IT, Cadima J. Principal Component Analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci.
2016;374(2065):20150202. doi:10.1098/rsta.2015.0202.
11. Mishra, Sidharth & Sarkar, Uttam & Taraphder, Subhash & Datta, Sanjoy & Swain, Devi & Saikhom, Reshma & Panda,
Sasmita & Laishram, Menalsh. (2017). Principal Component Analysis. International Journal of Livestock Research. 1.
10.5455/ijlr.20170415115235.
12. Kalist, V. & Packyanathan, Ganesan & Joseph, Maria & B.S, Sathish & Murugesan, R.. (2020). Image Quality Analysis Based
on Texture Feature Extraction Using Second-Order Statistical Approach. 10.1007/978-981-15-3172-9_48.
13. Chattopadhyay, P., Konar, P. Feature Extraction using Wavelet Transform for Multi-class Fault Detection of Induction
Motor. J. Inst. Eng. India Ser. B 95, 73–81 (2014).
14. S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," in IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989, doi: 10.1109/34.192463.
15. S. Singh, D. Srivastava and S. Agarwal, "GLCM and its application in pattern recognition," 2017 5th International Symposium
on Computational and Business Intelligence (ISCBI), Dubai, 2017, pp. 20-25, doi: 10.1109/ISCBI.2017.8053537.
16. H. Wang.: Nearest Neighbours without k: A Classification Formalism based on Probability, technical report, Faculty of
Informatics, University of Ulster, N.Ireland, UK (2002)
17. G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, “KNN Model-Based Approch in Classification,” Lecture Notes in Computer
Science, pp 986-996, 2003.
18. Y.-S. Sun, Z. Zhao, Z.-N. Yang, F. Xu, H.-J. Lu, Z.-Y. Zhu, W. Shi, J. Jiang, P.-P. Yao, and H.-P. Zhu, ‘‘Risk factors and
preventions of breast cancer,’’ Int. J. Biol. Sci., vol. 13, no. 11, p. 1387, 2017.
19. Islam, M.M., Haque, M.R., Iqbal, H. et al. Breast Cancer Prediction: A Comparative Study Using Machine Learning
Techniques. SN COMPUT. SCI. 1, 290 (2020).

20. X. Zeng, Y. -W. Chen and C. Tao, "Feature Selection Using Recursive Feature Elimination for Handwritten Digit
Recognition," 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing,
2009, pp. 1205-1208, doi: 10.1109/IIH-MSP.2009.145.
21. Shen, L., Margolies, L.R., Rothstein, J.H. et al. Deep Learning to Improve Breast Cancer Detection on Screening
Mammography. Sci Rep 9, 12495 (2019).
22. B. Xue, M. Zhang and W. N. Browne, "Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective
Approach," in IEEE Transactions on Cybernetics, vol. 43, no. 6, pp. 1656-1671, Dec. 2013, doi:
10.1109/TSMCB.2012.2227469.

