
CALIFORNIA STATE UNIVERSITY SAN MARCOS

PROJECT SIGNATURE PAGE

PROJECT SUBMITTED IN PARTIAL FULFILLMENT


OF THE REQUIREMENTS FOR THE DEGREE

MASTER OF SCIENCE

IN

COMPUTER SCIENCE

PROJECT TITLE: BREAST CANCER PREDICTION USING MACHINE LEARNING

AUTHOR: SANJANA BALASUBRAMANIAN

DATE OF SUCCESSFUL DEFENSE: APRIL 22nd, 2021

THE PROJECT HAS BEEN ACCEPTED BY THE PROJECT COMMITTEE IN


PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF
SCIENCE IN COMPUTER SCIENCE.

Dr. Ahmad Hadaegh
PROJECT COMMITTEE CHAIR SIGNATURE DATE

Dr. Yanyan Li
PROJECT COMMITTEE MEMBER SIGNATURE DATE

Name of Committee Member


PROJECT COMMITTEE MEMBER SIGNATURE DATE
Sanjana Balasubramanian

Department of Computer Science and Information Systems

Breast Cancer Prediction Using Machine Learning

Advisor: Dr. Ahmad R. Hadaegh

Date: April 22nd, 2021

List of the Figures

Fig No.      Figure Name                                                                                Page No.

Figure 1     System Architecture of Breast Cancer Prediction and Tracking system                           13

Figure 2     Proposed Breast Cancer Prediction and Tracking flow diagram                                   14

Figure 4.1   Haar scaling function and wavelet (left) and their frequency content (right)                  23

Figure 4.2   Scaling function and wavelet (left) and their frequency content (right)                       24

Figure 4.3   Daubechies 20 scaling function and wavelet (left) and their frequency content (right)         24

Figure 5     Input image transformation to Gray scale image                                                31

Figure 6     Output displaying that the inputted image has no cancerous cells and is normal                32

Figure 7     Output displaying that the inputted image has cancerous cells and is in Benign stage          33

Figure 8     Output displaying that the inputted image has cancerous cells and is in Malignant stage       33

Figure 9     Boundary detected Image: (a) Benign stage from Figure 7 and (b) Malignant stage from Figure 8  34

Table of Contents

List of the Figures ..................................................................................................................................... 2

Abstract ..................................................................................................................................................... 5

Chapter 1: Introduction ............................................................................................................................. 6

Chapter 2: Related Work .......................................................................................................................... 9

Chapter 3: Methodology ......................................................................................................................... 12

Chapter 4: Implementation...................................................................................................................... 18

4.1 Clustering based on the FCM algorithm ....................................................................................... 19

4.2 Feature extraction based on PCA, GLCM .................................................................................... 20

4.3 Classification based on KNN ........................................................................................................ 29

Chapter 5: Analysis of the Results .......................................................................................................... 32

Chapter 6: Conclusion and Future work ................................................................................................. 36

References .............................................................................................................................................. 38

Acknowledgment

The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mention of the people who made it possible. The constant guidance and encouragement provided by these people crowned my efforts with success and glory, and I take this opportunity to express my gratitude to one and all.

I am grateful to the management of my institute, California State University San Marcos, whose ideals and inspiration provided me with the facilities that made this project a success.

I am indebted, with a deep sense of gratitude, for the constant inspiration, encouragement, timely guidance, and valuable suggestions given to me by my advisor Dr. Ahmad Hadaegh and co-advisor Dr. Yanyan Li, Department of Computer Science, California State University, San Marcos.

Abstract

Today, more than 1.15 million cases of breast cancer are diagnosed worldwide each year. At present, only a small number of accurate prognostic and predictive factors are used clinically for managing patients with breast cancer. Early detection of this fatal disease is very important because it helps decrease the mortality rate and increase the survival period of breast cancer patients. This project uses mammography, the main test used for screening and early diagnosis; its analysis and processing are key to improving breast cancer prognosis. To detect breast cancer in a mammogram, image segmentation is performed with the Fuzzy C-means (FCM) technique. Features are then extracted from the segmented regions and used to train the system, and the trained images are finally classified into the different mammogram classes by an efficient classifier. Texture features are extracted using feature extraction techniques such as the Multi-level Discrete Wavelet Transform, Principal Component Analysis (PCA), and the Gray-Level Co-occurrence Matrix (GLCM). Morphological operators are used to distinguish masses and microcalcifications from the background tissue, and the KNN algorithm is used for classification. The boundaries of the tumor-affected region in the mammogram are marked and displayed to the doctor, along with the area of the tumor.

Chapter 1: Introduction

Breast cancer has become the most frequent health issue among women, especially women in middle age. Early detection of breast cancer can help women cure this disease, and the death rate can be reduced [1]. In the present-day scenario, mammograms are used to observe breast cancer and are known to be the most effective screening technique. In this project, the detection of cancer cells is carried out using machine learning techniques.

Image processing is a method of converting an image into digital form and performing operations on it in order to obtain an enhanced image or to extract useful information from it. It is a type of signal processing in which the input is an image and the output may be an image or characteristics/features associated with that image [2]. An image processing system usually treats images as two-dimensional signals and applies established signal processing methods to them. Image processing basically includes the following three steps:

• Importing the image with an optical scanner or by digital photography.

• Analyzing and manipulating the image, which includes data compression, image enhancement, and spotting patterns that are not visible to the human eye, as in satellite photographs.

• Output, the last stage, in which the result can be an altered image or a report based on the image analysis.

Digital processing techniques help in the manipulation of digital images by computers. A digital image is composed of a finite number of elements, each of which has a particular value at a particular location. These elements are referred to as picture elements, image elements, or pixels; "pixel" is the term most widely used to denote the elements of a digital image. Raw data from imaging sensors (for example, from a satellite platform) contains deficiencies, and to overcome such flaws and recover the original information it has to undergo various phases of processing. The three general phases that all types of data go through when using digital techniques are pre-processing, enhancement and display, and information extraction.

The first step is to analyze the images and represent them. The method of image representation addresses two fundamental problems: the ineffectiveness of capturing textural information, and the weak discriminative capacity of features, both of which result in low retrieval performance. In content-based image retrieval, similarity estimation is part of the primary task and strongly affects retrieval accuracy and retrieval time. The project aims to answer questions such as "Which similarity measure is appropriate for a particular feature type, and how can the similarity computation be reduced?" and "Which texture features are representative and discriminative enough to describe the mammogram of a given query?" In this project we use techniques such as Fuzzy C-means clustering and the KNN classifier together with feature extraction.

Once the mammographic image is collected, FCM clustering is used for image segmentation. In this system each data point belongs to several clusters with varying degrees of membership, based on an objective function. The segmented area is rigorously examined using the Multi-level Discrete Wavelet Transform to obtain edge details, which are then used as features. PCA is then applied to this data for analysis, along with GLCM. After the analysis, 13 features are extracted in the proposed framework and their pixel values, in matrix form, are stored in the database.

Machine learning is an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, while updating outputs as new data becomes available. The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.

Supervised learning algorithms are given both the input and the desired output, in addition to feedback about the accuracy of their predictions during training. A classification problem is one in which the output variable is a category or a group. Using the KNN classifier, which relies primarily on the shape of the cancer cells in the image, the system classifies the image as Benign, Malignant, or Normal once the features have been extracted and the classifier fully trained. By performing appropriate morphological operations, the system computes relevant region properties, such as size and Euler number, and displays the detected boundary image along with the tumor area.

Chapter 2: Related Work

[1] Siyabend Turgut et al., “Microarray Breast Cancer Data Classification Using Machine

Learning Methods” [IEEE 2018]

The paper uses microarray breast cancer data to classify patients using machine learning methods. In the first case, eight different machine learning algorithms were applied to the dataset and the classification results were noted. In the second case, two feature selection methods, Recursive Feature Elimination (RFE) and Randomized Logistic Regression (RLR), were applied to the microarray breast cancer dataset, with 50 features chosen as the stopping criterion. The same eight machine learning algorithms were then applied to the modified dataset, and the classification results were compared with each other and with the results of the first case. The methods applied are SVM, KNN, MLP, Decision Trees, Random Forest, Logistic Regression, AdaBoost, and Gradient Boosting Machines. After applying the two feature selection methods, SVM gave the best results. MLP was applied with different numbers of layers and neurons to examine their effect on classification accuracy [3].

[2] Varalatchoumy M et al., “Four Novel Approaches for Detection of Region of Interest in

Mammograms - A Comparative Study” [ICISS 2017]

The paper compares four novel approaches used for detection of the Region of Interest (ROI) in mammographic images, based on both database and real-time images. In Approach I, histogram equalization and dynamic thresholding techniques were used for preprocessing, and the ROI was partitioned from the preprocessed image using particle swarm optimization and k-means clustering. In Approach II, preprocessing was done using morphological operations such as erosion followed by dilation, and a modified watershed segmentation was used to identify the ROI. Approach III uses histogram equalization for preprocessing and an advanced level set approach for segmentation. Approach IV, considered the most efficient approach, uses different morphological operations and contrast-limited adaptive histogram equalization for image preprocessing, together with a very novel algorithm developed for detection of the ROI. Approaches I and II were applicable only to Mammographic Image Analysis Society (MIAS) database images, while Approaches III and IV were applicable to both MIAS and real-time hospital images. The graphs presented in the comparative study clearly show that the approach using the novel ROI-detection algorithm is the most efficient, accurate, and reliable, and can be used by radiologists to detect tumors in MRM images [4].

[3] Ammu P K et al., “Review on Feature Selection Techniques of DNA

Microarray Data” [IJCA 2013]

This paper reviews a few major feature selection techniques employed on microarray data and points out the merits and demerits of the various approaches. Feature selection from DNA microarray data is one of the most important procedures in bioinformatics. Biogeography Based Optimization (BBO) is an optimization algorithm that works on the basis of the migration of species between different habitats and the process of mutation. Particle Swarm Optimization (PSO) is an algorithm that works on the basis of the movement of particles in a search space. Redundancy-based feature selection approaches can be used to remove redundant genes from the selected genes, as the resulting gene set can achieve a better representation of the target class. The paper also describes a two-stage hybrid filter-wrapper method in which, in the first stage, a subset of the original feature set is obtained by applying information gain as the filtering criterion, and in the second stage a genetic algorithm is applied to the set of filtered genes. Finally, it discusses gene selection based on the dependency of features, where features are classified as independent, half-dependent, and dependent. Independent features do not depend on any other features, half-dependent features are more relevant in correlation with other features, and dependent features are fully dependent on other features [5].

[4] Bing Nan Li et al., "Integrating spatial fuzzy clustering with level set methods for automated

medical image segmentation” [ELSEVIER 2010]

A new Fuzzy Level Set algorithm is proposed in this paper to facilitate automated medical image

segmentation. It can directly evolve from the initial segmentation by spatial fuzzy clustering where

centroid and the scope of each subclass are estimated adaptively in order to minimize a pre-defined

cost function. The controlling parameters of Level Set evolution are also estimated from the results

of fuzzy clustering. The level set methods utilize dynamic variational boundaries for image

segmentation. The new Fuzzy Level Set algorithm automates the initialization and parameter

configuration of the level set segmentation, using spatial fuzzy clustering. It employs a Fuzzy-C

means (FCM) with spatial restrictions to determine the approximate contours of interest in a

medical image. Moreover, the Fuzzy Level Set algorithm is enhanced with locally regularized

evolution. Such improvements facilitate level set manipulation and lead to more robust

segmentation. Performance evaluation of the proposed algorithm was carried out on medical images from different modalities, and the results confirm its effectiveness for image segmentation [6].

Chapter 3: Methodology

Figure 1 shows the system architecture of the breast cancer prediction and tracking system. The system mainly consists of four processes. Once the image is acquired, the system converts it into a gray scale image. By applying suitable image segmentation techniques, the system aims at extracting the meaningful objects lying in the image. Clustering is one of the powerful image segmentation techniques and involves grouping data points into clusters; the system implements clustering using Fuzzy C-means (FCM) [7]. The segmented region is completely analyzed using the Multi-level Discrete Wavelet Transform and Principal Component Analysis (PCA), along with Gray Level Co-occurrence Matrix (GLCM) features [8]. In total, 13 features are extracted by the system and their pixel values, in the form of a matrix, are stored in the database (the db.mat file); some of the features extracted are mean, variance, entropy, etc. The image then undergoes classification with respect to the dataset in the db.mat file and is classified as Benign, Malignant, or Normal. The system also performs morphological operations and calculates region properties of the image such as Area, Eccentricity, and Euler number. If the image has cancer cells, the tumor area is computed and displayed by the system along with the boundary-detected image.

Figure 1: System Architecture of Breast Cancer Prediction and Tracking system

The physician uploads the patient's mammogram to the system, where it is subjected to image segmentation. The image is pre-processed using image processing techniques, and extraction methods are applied to extract the necessary features. The extracted features are given to the classifier model, and the test image is then classified with respect to the training data present in the database [9]. The test image is marked as either cancerous or non-cancerous. If the test image is marked as cancerous, the tumor region is measured, and the findings are shown to the doctor along with the detected boundary image. These steps are shown below in Figure 2.

Figure 2. Proposed breast cancer prediction and Tracking flow diagram

Fuzzy C-means is a clustering method in which data points are associated with several clusters with varying membership degrees. FCM is one of the segmentation techniques applicable to gray level images. In the proposed method, the initial locations of the cluster centers and the membership degree of every data point are determined with the aid of Fuzzy C-Means [10]. Classification after feature extraction relies on the shapes of the cancer cells in the image, and the method makes use of all features contained in the database for classification purposes. The system analyzes similarity measures and uses clustering together with feature extraction techniques such as the Multi-level Discrete Wavelet Transform, Principal Component Analysis, and GLCM to reduce the data set and improve the similarity measurement.

The next step is feature extraction. Dimensionality reduction can be a helpful step in the visualization and analysis of high-dimensional datasets, while maintaining as much variation as possible [11]. This technique is used to simplify the classification and predict a better output. The direction of maximum variance must be found in order to segregate the data into multiple clusters. For example, if we choose 3 different centroids and project the data onto the first principal components of the high-dimensional space, we can see how the data are distributed among the clusters, and this becomes apparent in a 2D diagram. This reduces overfitting of the data and avoids clusters that are too closely spaced or overlapping.

Likewise, the larger the number of explanatory variables permitted in a regression study, the greater the risk that the model will be overfitted, causing the output to fail to generalize to the rest of the datasets. One approach, particularly when there are strong associations between the potential variables, is to reduce them to a few principal components and then run the regression against those components; this is principal component regression (PCR).

PCA is a method that uses a linear transformation to convert a group of observed correlated variables (entities each of which takes on different numerical values) into principal components. PCA is a tool used mainly for exploratory data analysis and predictive model building [5]. Reduction of the dimensionality can also be useful when the explanatory variables in a dataset are noisy. Using PCA, the system can concentrate most of the signal into the first few principal components and then capture the features while keeping the reduced dimensionality; the later components, which tend to be dominated by noise, can be discarded without great loss of features.

After the PCA extraction, the next step is segmentation. Segmentation is used to improve the delineation of the image for visual examination; its purpose is to examine the uncertain areas and help classify the irregularities between cancerous and non-cancerous features. Fuzzy C-means is used to decompose the tumor-affected area by adjusting the number of clusters and initializing estimates of the cluster centers, which are intended to mark the mean location of each cluster [12]. The membership values of a pixel represent the probabilities that it belongs to a particular cluster: pixels with high membership values lie close to the centroids of their clusters, while pixels with low membership values lie far from the detected centroids. Convergence of the membership function can be checked by the change in the cluster centers at two successive iterations.

Feature extraction is performed by splitting the dataset into two categories: a training set and a testing set. The training set contains the larger amount of data and the testing set the smaller amount. The data are sampled randomly to ensure that the small testing set and the large training set are comparable.

The Multi-Level Wavelet Transform strategy combined with PCA is a quantitative procedure that applies an orthogonal transformation to convert a set of observations of correlated variables into the key factors, also called the principal components, which are highlighted for the segmented region. The GLCM works with the testing sets to obtain the features for classification, with the aim of finding diagnostic faults in the examination [13].

The dataset is then processed with exploratory statistical tools that sample a small portion of the data, intended to show what the whole looks like, to guarantee that the testing and training sets are comparable; these sets are subsequently grouped via different classification procedures. For example, the KNN (k-nearest neighbor) algorithm is used here for classification, whereas a support vector machine establishes classification based on decision boundaries [14]. These decision boundaries are produced by a decision plane: a plane used to arrange and separate the different classes. The result is the sorting of the tumor outcome as normal, critical, or early stage, or an output expressing doubt as to whether it is cancerous.

Chapter 4: Implementation

Breast cancer does not always show symptoms, which makes it harder to take proper precautions in time. Early detection and proper classification are therefore the only way to reduce cancer fatalities, and this is a major task in the medical field. The fundamental problems are the ineffectiveness of capturing textural information and the low retrieval performance caused by the poor discriminative capability of the features.

Mammography is used for early-stage detection, diagnosis, and screening; the key elements here are processing and analysis for better prognosis results. Image segmentation is performed using the FCM technique. Certain features are then extracted from these segmented images and used for training. Finally, the trained images are classified by an efficient and accurate classifier.

In the FCM (Fuzzy C-means) algorithm, every data point belongs to multiple clusters with varying degrees of membership, based on an objective function. The multi-level Discrete Wavelet Transform is then used to analyze the segmented region. Textural features such as the pixel values of the cancer cells are stored in the database in the form of a matrix.

Once feature extraction is done and the system is trained, it classifies the image as either Benign (initial stage), Malignant (harmful), or Normal. For this it uses the KNN algorithm, which classifies the image depending on the shape of the cancer cells [15]. After performing certain morphological operations, the system provides region properties such as Area, Euler number, etc. and shows the detected boundary along with the tumor area. Several techniques, including the Multi-level Discrete Wavelet Transform, PCA, and GLCM, are used to extract the textural features. The boundary of the tumor is marked properly and shown to the doctor.
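To make this end-to-end flow concrete, a minimal illustrative Python sketch that strings the stages together (gray scale conversion, texture feature computation, and KNN classification against a stored feature database) is given below. The file names, the assumed db.mat layout (a feature matrix X and a label vector y), the particular texture properties, and the choice of k = 3 are assumptions made for illustration only, not the project's actual code or configuration.

import numpy as np
from scipy.io import loadmat
from skimage import io, color, img_as_float, feature
from sklearn.neighbors import KNeighborsClassifier

def to_gray(path):
    # Load the mammogram and reduce it to a 2D gray scale image in [0, 1].
    img = img_as_float(io.imread(path))
    return color.rgb2gray(img) if img.ndim == 3 else img

def texture_features(gray):
    # A few GLCM-based texture features plus simple statistics (an illustrative
    # subset of the 13 features described in the text).
    q = np.uint8(gray * 255)
    glcm = feature.graycomatrix(q, distances=[1], angles=[0], levels=256,
                                symmetric=True, normed=True)
    return np.array([
        feature.graycoprops(glcm, "contrast")[0, 0],
        feature.graycoprops(glcm, "homogeneity")[0, 0],
        feature.graycoprops(glcm, "energy")[0, 0],
        gray.mean(),
        gray.var(),
    ])

# Assumed database layout: X holds one feature row per training image and
# y holds the class labels (0 = Normal, 1 = Benign, 2 = Malignant).
db = loadmat("db.mat")
knn = KNeighborsClassifier(n_neighbors=3).fit(db["X"], db["y"].ravel())

test = texture_features(to_gray("mammogram.png")).reshape(1, -1)
print(["Normal", "Benign", "Malignant"][int(knn.predict(test)[0])])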

4.1 Clustering based on the FCM algorithm

Fuzzy c-means (FCM) is a clustering method that allows each data point to belong to

multiple clusters with varying degrees of membership.

FCM is based on the minimization of the following objective function:

J_m = Σ_{i=1}^{D} Σ_{j=1}^{N} μ_ij^m ||x_i − c_j||²

where

• D is the number of data points.

• N is the number of clusters.

• m is the fuzzy partition matrix exponent for controlling the degree of fuzzy overlap, with m > 1.

Fuzzy overlap refers to how fuzzy the boundaries between clusters are, that is the number

of data points that have significant membership in more than one cluster.

• xi is the ith data point.

• cj is the center of the jth cluster.

• μij is the degree of membership of xi in the jth cluster. For a given data point, xi, the sum of the

membership values for all clusters is one.

FCM performs the following steps during clustering:

1. Randomly initialize the cluster membership values, μij.

2. Calculate the cluster centers:

c_j = ( Σ_{i=1}^{D} μ_ij^m x_i ) / ( Σ_{i=1}^{D} μ_ij^m )

3. Update μij according to the following:

μ_ij = 1 / Σ_{k=1}^{N} ( ||x_i − c_j|| / ||x_i − c_k|| )^(2/(m−1))

4. Calculate the objective function, J_m.

5. Repeat steps 2–4 until Jm improves by less than a specified minimum threshold or until

after a specified maximum number of iterations.
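A minimal NumPy sketch of these steps (random initialization, center update, membership update, convergence check) is shown below. It is an illustrative implementation of the update rules above, with an example number of clusters, and not the exact code used in the project.

import numpy as np

def fcm(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # Fuzzy C-means: X is (D, n_features); returns cluster centers and memberships.
    rng = np.random.default_rng(seed)
    D = X.shape[0]
    # Step 1: randomly initialize memberships so each row sums to one.
    U = rng.random((D, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: cluster centers as membership-weighted means.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 3: update memberships from the distances to the centers.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        # Steps 4-5: stop when the memberships (and hence J_m) change very little.
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Example: cluster gray-level pixel intensities into 3 fuzzy clusters.
pixels = np.random.rand(1000, 1)        # stand-in for flattened mammogram intensities
centers, memberships = fcm(pixels, n_clusters=3)
labels = memberships.argmax(axis=1)     # hard assignment for a segmentation mask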

4.2 Feature extraction based on PCA, GLCM

4.2.1 Dimensionality reduction:

Such dimensionality reduction can be a very useful step for visualizing and processing

high-dimensional datasets, while still retaining as much of the variance in the dataset as possible.

For example, selecting L = 2 and keeping only the first two principal components finds the two-

dimensional plane through the high-dimensional dataset in which the data is most spread out, so if

the data contains clusters these too may be most spread out, and therefore most visible to be plotted

out in a two-dimensional diagram; whereas if two directions through the data (or two of the original

variables) are chosen at random, the clusters may be much less spread apart from each other, and

may in fact be much more likely to substantially overlay each other, making them

indistinguishable.

Similarly, in regression analysis, the larger the number of explanatory variables allowed,

the greater is the chance of overfitting the model, producing conclusions that fail to generalize to

other datasets. One approach, especially when there are strong correlations between different

possible explanatory variables, is to reduce them to a few principal components and then run the

regression against them, a method called principal component regression.

Dimensionality reduction may also be appropriate when the variables in a dataset are noisy.

If each column of the dataset contains independent, identically distributed Gaussian noise, then the columns of the score matrix T will also contain similarly distributed Gaussian noise (such a distribution is invariant under the effects of the loading matrix W, which can be thought of as a high-dimensional rotation of the coordinate axes). However, with more of the total variance concentrated in the first few principal components compared to the same noise variance, the proportionate effect of the noise is smaller: the first few components achieve a higher signal-to-noise ratio. PCA thus can have the effect of concentrating much of the signal into the first few principal components, which can usefully be captured by dimensionality reduction, while the later principal components may be dominated by noise and so disposed of without great loss.

4.2.2 Wavelet Transform:

The wavelet transform is similar to the Fourier transform (or, more closely, to the windowed Fourier transform), but with a completely different merit function. The main difference is this: the Fourier transform decomposes the signal into sines and cosines, i.e. functions localized in Fourier space; in contrast, the wavelet transform uses functions that are localized in both real and Fourier space. Generally, the wavelet transform can be expressed by the following equation:

F(a, b) = ∫ f(x) ψ*_(a,b)(x) dx

where the * is the complex conjugate symbol and ψ is some function. This function can be chosen arbitrarily provided that it obeys certain rules.

As can be seen, the wavelet transform is in fact an infinite set of various transforms, depending on the merit function used for its computation. This is the main reason why we encounter the term "wavelet transform" in very different situations and applications. There are also many ways to categorize the types of wavelet transforms; here we show only the division based on wavelet orthogonality [16]. We can use orthogonal wavelets for the discrete wavelet transform and non-orthogonal wavelets for the continuous wavelet transform. These two transforms have the following properties:

1. The discrete wavelet transform returns a data vector of the same length as the input. Usually, even in this vector many entries are almost zero. This corresponds to the fact that it decomposes the signal into a set of wavelets (functions) that are orthogonal to their translations and scalings. We therefore decompose such a signal into a number of wavelet coefficients that is the same as or smaller than the number of signal data points. Such a wavelet spectrum is very good for signal processing and compression, for example, since we get no redundant information here.

2. The continuous wavelet transform, by contrast, returns an array one dimension larger than the input data. For 1D data we obtain an image of the time-frequency plane. We can easily see the evolution of the signal frequencies during the duration of the signal and compare the spectrum with other signals' spectra. Since a non-orthogonal set of wavelets is used, the data are highly correlated, so there is a large redundancy. This helps to present the results in a more easily interpretable form.

4.2.3 Discrete Wavelet Transform (DWT):

In numerical analysis and functional analysis, a discrete wavelet transform (DWT) is any

wavelet transform for which the wavelets are discretely sampled. As with other wavelet

transforms, a key advantage it has over Fourier transforms is temporal resolution: it captures both

frequency and location information.

The discrete wavelet transform (DWT) is an implementation of the wavelet transform using a discrete set of wavelet scales and translations obeying some defined rules. In other words, this transform decomposes the signal into a mutually orthogonal set of wavelets, which is the main difference from the continuous wavelet transform (CWT), or its implementation for discrete time series, sometimes called the discrete-time continuous wavelet transform (DTCWT).

The wavelet can be constructed from a scaling function which describes its scaling properties. The restriction that the scaling function must be orthogonal to its discrete translations implies some mathematical conditions on it, e.g. the dilation equation

φ(x) = Σ_k a_k φ(Sx − k)

where S is a scaling factor (usually chosen as 2). Moreover, the area under the function must be normalized and the scaling function must be orthogonal to its integer translations, i.e.

∫ φ(x) dx = 1,   ∫ φ(x) φ(x + l) dx = δ_{0,l}

In the following figures, some wavelet scaling functions and wavelets are plotted. The best known family of orthonormal wavelets is the Daubechies family; these wavelets are usually denoted by the number of nonzero coefficients a_k.

Figure 4.1: Haar scaling function and wavelet (left) and their frequency content (right).

Figure 4.2: Scaling function and wavelet (left) and their frequency content (right).

Figure 4.3: Daubechies 20 scaling function and wavelet (left) and their frequency content (right).

There are several types of implementation of the DWT algorithm. The oldest and best known is the Mallat (pyramidal) algorithm. In this algorithm two filters, a smoothing and a non-smoothing one, are constructed from the wavelet coefficients, and those filters are used recurrently to obtain data for all the scales. If the total number of data points is D = 2^N and the signal length is L, first D/2 coefficients at scale L/2^(N−1) are computed, then (D/2)/2 coefficients at scale L/2^(N−2), and so on, up to finally obtaining 2 coefficients at scale L/2. The result of this algorithm is an array of the same length as the input, where the data are usually sorted from the largest scales to the smallest ones.

The discrete wavelet transform can be used for easy and fast de-noising of a noisy signal. If we keep only a limited number of the largest coefficients of the discrete wavelet transform spectrum and perform an inverse transform (with the same wavelet basis), we obtain a more or less de-noised signal. There are several ways to choose which coefficients to keep. Within Gwyddion, universal thresholding, scale-adaptive thresholding, and scale- and space-adaptive thresholding are

implemented. For threshold determination within these methods we first determine the noise

variance guess given by

σ̂ = median(|Y_ij|) / 0.6745

where Y_ij corresponds to the coefficients of the highest-scale sub-band of the decomposition (where most of the noise is assumed to be present). Alternatively, the noise variance can be obtained in an independent way, for example from the AFM signal variance while not scanning. For the highest-frequency sub-band (universal thresholding), or for each sub-band (scale-adaptive thresholding), or for each pixel neighborhood within a sub-band (scale- and space-adaptive thresholding), the variance is computed as

σ̂_Y² = (1/n²) Σ_{i,j=1}^{n} Y_ij²

The threshold value is finally computed as

T = σ̂² / σ̂_X

where

σ̂_X = sqrt( max(σ̂_Y² − σ̂², 0) )

When the threshold for a given scale is known, we can either remove all the coefficients smaller than the threshold value (hard thresholding) or lower the absolute value of these coefficients by the threshold value (soft thresholding).
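As an illustration of this kind of wavelet thresholding, the sketch below de-noises a 1D signal with PyWavelets using the median-based noise estimate and soft thresholding. The choice of wavelet ('db4'), the decomposition level, and the use of the universal threshold rule are illustrative assumptions, not the settings used in this project.

import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    # De-noise a 1D signal by soft-thresholding its detail coefficients.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Robust noise estimate from the finest-scale detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # Universal threshold; other rules (e.g. scale-adaptive) could be used instead.
    threshold = sigma * np.sqrt(2 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]

# Example: a noisy sine wave.
t = np.linspace(0, 1, 1024)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)
clean = wavelet_denoise(noisy)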

4.2.4 Principal component analysis (PCA):

It is a statistical procedure that uses an orthogonal transformation to convert a set of

observations of possibly correlated variables (entities each of which takes on various numerical

values) into a set of values of linearly uncorrelated variables called principal components. If there

are n observations with p variables, then the number of distinct principal components is min (n-

1,p). This transformation is defined in such a way that the first principal component has the largest

possible variance (that is, accounts for as much of the variability in the data as possible), and each

succeeding component in turn has the highest variance possible under the constraint that it is

orthogonal to the preceding components [17]. The resulting vectors (each being a linear

combination of the variables and containing n observations) are an uncorrelated orthogonal basis

set. PCA is sensitive to the relative scaling of the original variables.

PCA is mostly used as a tool in exploratory data analysis and for making predictive

models. It is often used to visualize genetic distance and relatedness between populations. PCA

can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular

value decomposition of a data matrix, usually after a normalization step of the initial data. The normalization of each attribute consists of mean centering (subtracting each variable's measured mean from its data values so that the empirical mean, i.e. the average, is zero) and, possibly, normalizing each variable's variance to make it equal to 1; see Z-scores. The results of a PCA are usually

discussed in terms of component scores, sometimes called factor scores (the transformed variable

values corresponding to a particular data point), and loadings (the weight by which each

standardized original variable should be multiplied to get the component score). If component

scores are standardized to unit variance, loadings must contain the data variance in them (and that

is the magnitude of eigenvalues). If component scores are not standardized (therefore they contain

the data variance) then loadings must be unit-scaled, ("normalized") and these weights are called

eigenvectors; they are the cosines of orthogonal rotation of variables into principal components or

back.

PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its

operation can be thought of as revealing the internal structure of the data in a way that best explains

the variance in the data. If a multivariate dataset is visualized as a set of coordinates in a high-

dimensional data space (1 axis per variable), PCA can supply the user with a lower dimensional

picture, a projection of this object when viewed from its most informative viewpoint. This is done by using only the first few principal components, so that the dimensionality of the transformed data is reduced.

PCA is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.

PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate

systems that optimally describe the cross-covariance between two datasets while PCA defines a

new orthogonal coordinate system that optimally describes variance in a single dataset.

PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of

the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the

variance along that axis is also small, and by omitting that axis and its corresponding principal

component from our representation of the dataset, we lose only a commensurately small amount

of information.

To find the axes of the ellipsoid, we must first subtract the mean of each variable from the

dataset to center the data around the origin. Then, we compute the covariance matrix of the data

and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we

must normalize each of the orthogonal eigenvectors to become unit vectors. This procedure is

sensitive to the scaling of the data, and there is no consensus as to how to best scale the data to

obtain optimal results.
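A small NumPy sketch of this procedure (mean-centering, covariance, eigendecomposition, projection) is shown below; it is illustrative only and not the project's implementation, and the random feature matrix stands in for the extracted mammogram features.

import numpy as np

def pca(X, n_components):
    # Project X (samples x variables) onto its first n_components principal axes.
    X_centered = X - X.mean(axis=0)            # subtract each variable's mean
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # sort axes by decreasing variance
    components = eigvecs[:, order[:n_components]]
    scores = X_centered @ components           # principal component scores
    return scores, components, eigvals[order]

# Example: reduce 13 extracted texture features to 2 components for visualization.
X = np.random.rand(100, 13)
scores, components, variances = pca(X, n_components=2)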

4.2.5 Grey Level Co-occurrence Matrix (GLCM):

The approach that we are implementing uses the GLCM, i.e. the grey level co-occurrence matrix. A GLCM is a matrix where the number of rows and columns is equal to the number of gray levels, G, in the image. The matrix element P(i, j | Δx, Δy) is the relative frequency with which two pixels, separated by a pixel distance (Δx, Δy), occur within a given neighborhood, one with intensity i and the other with intensity j. The matrix element P(i, j | d, θ) contains the second-order statistical probability values for transitions between gray levels i and j at a particular displacement distance d and a particular angle θ. Using a large number of intensity levels G implies storing a lot of temporary data, i.e. a G × G matrix for each combination of (Δx, Δy) or (d, θ). Due to their large dimensionality, GLCMs are very sensitive to the size of the texture samples on which they are estimated. Hence, the number of gray levels is often reduced.

GLCM algorithm

1. Count the co-occurring pixel pairs in the image in which the data is saved.

2. Store the counts in the matrix P[i, j].

3. Check the similarity between pixels in the matrix by applying a histogram technique.

4. Calculate the contrast factor from the matrix:

Contrast = Σ_{i,j} (i − j)² g(i, j)

5. The elements of g need to be normalized by dividing them by the total number of pixel pairs.

6. Image segmentation: an image segmentation technique is applied which segments the image based on its properties. Acquiring the image and splitting it into several regions (or segments), which are then used individually for extracting the image features, is termed segmentation. In this work, a region-based k-means segmentation technique is applied for the image segmentation.
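For illustration, a small NumPy routine that builds a GLCM for a single displacement (Δx, Δy) and computes the contrast feature from it might look like the sketch below. The quantization to 8 gray levels, the single offset, and the random 8-bit test image are simplifying assumptions.

import numpy as np

def glcm(image, dx=1, dy=0, levels=8):
    # Gray level co-occurrence matrix for one displacement (dx, dy) on an 8-bit image.
    q = (image.astype(int) * levels) // 256          # quantize to `levels` gray levels
    P = np.zeros((levels, levels), dtype=float)
    rows, cols = q.shape
    for r in range(rows - dy):
        for c in range(cols - dx):
            P[q[r, c], q[r + dy, c + dx]] += 1       # count the pair (i, j) at this offset
    return P / P.sum()                               # normalize to joint probabilities

def contrast(P):
    i, j = np.indices(P.shape)
    return np.sum((i - j) ** 2 * P)                  # contrast = sum over i,j of (i - j)^2 * P(i, j)

img = (np.random.rand(64, 64) * 255).astype(np.uint8)   # stand-in for a gray scale patch
print("contrast:", contrast(glcm(img)))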

4.3 Classification based on KNN

For segmentation, the approach that we are implementing uses the k-means clustering algorithm, one of the popular methods used for segmentation. In k-means we divide the image into various clusters, i.e. we divide a set of data into a specific number of groups: the data are classified into k disjoint sets. A set of k centroids is determined, and each point is assigned to the centroid from which it has the least distance [18]. There are several ways of defining the distance to the nearest centroid; one such method is the Euclidean distance. The Euclidean distance is calculated for each data point in turn, and the data point is assigned to the cluster whose centroid is at the minimum distance. The points assigned to a cluster are then averaged to obtain a new centroid. This is how k-means clustering works.

INPUT: Dataset

OUTPUT: Clustered data

Start ()

1. Read the dataset; the dataset has "r" rows and "m" columns.

2. For (i = 0; i < r; i++)        // selection of the centroid points
       For (j = 0; j < m; j++)
           Select k = data(i, j);
       End
   End

3. Calculation of the Euclidean distance:
   For (i = 0; i < r; i++)
       For (j = 0; j < m; j++)
           A(i) = data(i);
           B(j) = data(j);
           Distance = sqrt((A(i+1) - A(i))^2 + (B(j+1) - B(j))^2);
       End
   End

4. Normalization:
   For (k = 0; k < data; k++)
       Swap k(i+1) and k(i);
   End

End
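Since the goal of this section is KNN-based classification of the extracted feature vectors, a minimal scikit-learn sketch is shown below. The random feature matrix, the labels, the 70/30 split, and the choice of k = 3 are illustrative assumptions rather than the project's actual configuration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the 13 extracted features per mammogram and their class labels
# (0 = Normal, 1 = Benign, 2 = Malignant).
X = np.random.rand(200, 13)
y = np.random.randint(0, 3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))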

The comparison below shows the benefits and certain limitations of the different approaches [19].

1) Recursive Feature Elimination [20]

Pros: This technique fits a model while recursively removing the weakest features until a specified number of features remains. By doing so, the stronger features are kept for better prediction.

Cons: The process of removing features can require considerable computational time.

2) Deep Learning [21]

Pros: It offers a multi-layered arrangement of artificial neurons, through which the error rate is minimal.

Cons: The main drawback is that it requires a large amount of data to train and can take more processing time. It can also consume a large amount of memory in the system.

3) Particle Swarm Optimization [22]

Pros: This technique is easy to implement and has few parameters. It processes quickly, and the particles do not overlap.

Cons: With this type of feature selection it can be hard to define the initial parameters, and if there is too much scattering the solution tends to fail.

In the proposed system there are multiple levels of feature extraction, and classification is performed with KNN. In this way the clustering is accurate, and the classification becomes easier and more efficient. Large datasets are not required; even with fewer datasets the system can predict the output efficiently.

Chapter 5: Analysis of the Results

This section describes the screens of the "Breast cancer prediction and tracking using machine learning methods" system; the figures for each module are shown below. The initial stage is to pre-process the image and convert it to a gray scale image, which helps the system perform segmentation in the subsequent steps.

Figure 5 shows the transformation of the input image to gray scale. The system converts an image of any dimension to a 2D image: for example, if the image is 3D (color), it is automatically converted to a 2D image, and if it is already 2D, the gray scale image remains 2D.

Figure 5. Input image transformation to Gray scale image.
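A minimal sketch of this conversion step (a 3-channel color array reduced to a 2D gray scale array) is shown below. The luminance weights follow the common ITU-R 601 convention and are an assumption about the exact conversion used; the file name is hypothetical.

import numpy as np
from skimage import io

def to_grayscale(image):
    # Return a 2D gray scale array from either a 2D or a 3D (color) image.
    if image.ndim == 2:                          # already gray scale
        return image.astype(float)
    weights = np.array([0.299, 0.587, 0.114])    # ITU-R 601 luminance weights
    return image[..., :3].astype(float) @ weights

gray = to_grayscale(io.imread("mammogram.png"))
print(gray.shape)                                # (rows, cols)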

The following stage is to segment the image. In the segmented image all the noise is removed, and normal cells are distinguished from suspected ones. To do this, we first convert the gray image to binary form and apply the FCM algorithm; FCM assigns each data point to every cluster with a different membership grade. In this project the FCM clustered segmented image has 3 bins (Benign, Malignant, and Normal).

Once the segmentation is done, we extract the features. Feature extraction is achieved by obtaining the edge features from the multi-level wavelet transform, PCA, and GLCM. Using KNN, the dataset is fully trained and categorized. Once the features are extracted, the classification of the images occurs; here we have 3 classes (Benign, Malignant, and Normal). Figure 6 shows an input image that is normal, whereas Figures 7 and 8 show the affected cancerous region. Figure 9 shows the boundary detection in the given image; the boundary is detected only if the image has cancerous cells.
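The boundary marking and region properties (area, eccentricity, Euler number) described above can be sketched with scikit-image as shown below. The input file names, the binarization, and the minimum object size are illustrative assumptions, not the project's exact processing.

import numpy as np
from skimage import io, measure, morphology, segmentation

mask = io.imread("segmented_tumor_mask.png") > 0             # binary mask from the segmentation step
mask = morphology.remove_small_objects(mask, min_size=64)    # morphological clean-up

labels = measure.label(mask)
for region in measure.regionprops(labels):
    print("area (pixels):", region.area,
          " eccentricity:", round(region.eccentricity, 3),
          " Euler number:", region.euler_number)

# Mark the detected boundaries on the gray scale image for display to the doctor.
gray = io.imread("mammogram_gray.png")
outlined = segmentation.mark_boundaries(gray, labels, color=(1, 0, 0))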

Figure 6. Output displaying that the inputted image has no cancerous cells and is normal.

Figures 7 and 8 show input images that have cancerous cells and display the tumor area in terms of both mm² and pixels. Figure 7 also shows that the patient is in the Benign stage, while Figure 8 shows that the patient is in the Malignant stage.

Figure 7. Output displaying that the inputted image has cancerous cells and is in Benign stage.

Figure 8. Output displaying that the inputted image has cancerous cells and is in Malignant stage.

Figure 9 displays the detected boundary images for the benign and malignant cases: (a) shows the affected boundaries of the Benign stage image from Figure 7, and (b) shows the affected boundaries of the Malignant stage image from Figure 8.

(a) (b)

Figure 9. Boundary detected Image: (a) Benign stage from Figure 7 and (b) Malignant stage from Figure 8.

In this project the accuracy is calculated using the formula below. Let x be the true labels of the testing data and p the corresponding predicted labels. Here we take 70 samples for testing and count the correctly classified samples against the total number of test samples.

Accuracy = (number of test samples where p equals x) / (total number of test samples) × 100.

Using this formula the accuracy is computed and shows good efficiency. The combination of the Multi-Level Wavelet Transform strategy with PCA, with 13 features extracted and then classified, gives an average accuracy of nearly 92%.
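As a small sketch, this accuracy computation over hypothetical true and predicted label arrays looks like the following:

import numpy as np

# Hypothetical true labels (x) and predicted labels (p) for 70 test samples.
x = np.random.randint(0, 3, size=70)
p = np.random.randint(0, 3, size=70)

accuracy = np.sum(p == x) / x.size * 100
print(f"accuracy: {accuracy:.1f}%")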

Chapter 6: Conclusion and Future work

Breast cancer is one of the diseases that causes the highest number of deaths every year. At present, only a few accurate prognostic and predictive factors are used clinically for managing patients with breast cancer. Here, by making use of clustering with a level set approach, high accuracy can be achieved in the detection of affected cell shapes, with exact marking of the detected contours. The proposed system helps to enhance the performance of mammogram retrieval by selecting optimal features.

Fuzzy C-means (FCM) clustering has been used for image segmentation. Each data point belongs to multiple clusters with varying degrees of membership, based on the objective function. The segmented region is completely analyzed using the Multi-level Discrete Wavelet Transform and Principal Component Analysis (PCA), along with Gray Level Co-occurrence Matrix (GLCM) features. In total, 13 features are extracted and their pixel values, in the form of a matrix, are stored in the database. After the features are extracted and the system completely trained, it classifies the image as Benign, Malignant, or Normal using the KNN classifier technique, which depends mainly on the shape of the cancer cells in the image.

By performing suitable morphological operations, the system computes suitable region properties such as Area, Euler number, etc., and displays the boundary-detected image along with the tumor area. These techniques improve accuracy in tracking breast cancer cells. The correctness of the data classification is assessed with respect to the efficiency and effectiveness of each algorithm in terms of accuracy, precision, sensitivity, and specificity. Hence the design provides high accuracy and maximum efficiency in the prediction and tracking of breast cancer. The combination of the Multi-Level Wavelet Transform strategy with PCA, with 13 features extracted and then classified, gives an average accuracy of nearly 92%.

As a future improvement, the system can add more features such as recommendation of

medicines/treatments based on the severity of the patient. This prediction and recommendation

system can help doctors to diagnose and cure the disease more efficiently.

References

1. DeSantis C, Siegel R, Bandi P, Jemal A. Breast cancer statistics, 2011. CA Cancer J Clin. 2011;61(6):409-418. doi:10.3322/caac.20134.
2. Y. Lu, J.-Y. Li, Y.-T. Su, and A.-A. Liu, ‘‘A review of breast cancer detection in medical images,’’ in Proc. IEEE Vis.
Commun. Image Process. (VCIP), Dec. 2018, pp. 1–4.
3. Turgut, Siyabend et al. "Microarray breast cancer data classification using machine learning methods." 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT) (2018): 1-3.
4. M. Ravishankar and M. Varalatchoumy, "Four novel approaches for detection of region of interest in mammograms — A
comparative study," 2017 International Conference on Intelligent Sustainable Systems (ICISS), Palladam, 2017, pp. 261-265,
doi: 10.1109/ISS1.2017.8389410.
5. Ammu P K and Preeja V. Review on Feature Selection Techniques of DNA Microarray Data. International Journal of Computer Applications 61(12):39-44, January 2013.
6. Bing Nan Li, Chee Kong Chui, Stephen Chang, S.H. Ong,Integrating spatial fuzzy clustering with level set methods for
automated medical image segmentation, Computers in Biology and Medicine, Volume 41, Issue 1, 2011, Pages 1-10, ISSN
0010-4825.
7. Reddy, V. Anji, and Badal Soni. "Breast Cancer Identification and Diagnosis Techniques." Machine Learning for Intelligent
Decision Science. Springer, Singapore, 2020. 49-70.
8. C. Yanyun, Q. Jianlin, G. Xiang, C. Jianping, J. Dan and C. Li, "Advances in Research of Fuzzy C-Means Clustering
Algorithm," 2011 International Conference on Network Computing and Information Security, Guilin, 2011, pp. 28-31, doi:
10.1109/NCIS.2011.104.
9. V. Pali, S. Goswami and L. P. Bhaiya, "An Extensive Survey on Feature Extraction Techniques for Facial Image Processing,"
2014 International Conference on Computational Intelligence and Communication Networks, 2014, pp. 142-148, doi:
10.1109/CICN.2014.43.
10. Jolliffe IT, Cadima J. Principal Component Analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci.
2016;374(2065):20150202. doi:10.1098/rsta.2015.0202.
11. Mishra, Sidharth & Sarkar, Uttam & Taraphder, Subhash & Datta, Sanjoy & Swain, Devi & Saikhom, Reshma & Panda,
Sasmita & Laishram, Menalsh. (2017). Principal Component Analysis. International Journal of Livestock Research. 1.
10.5455/ijlr.20170415115235.
12. Kalist, V. & Packyanathan, Ganesan & Joseph, Maria & B.S, Sathish & Murugesan, R.. (2020). Image Quality Analysis Based
on Texture Feature Extraction Using Second-Order Statistical Approach. 10.1007/978-981-15-3172-9_48.
13. Chattopadhyay, P., Konar, P. Feature Extraction using Wavelet Transform for Multi-class Fault Detection of Induction
Motor. J. Inst. Eng. India Ser. B 95, 73–81 (2014).
14. S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," in IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989, doi: 10.1109/34.192463.
15. S. Singh, D. Srivastava and S. Agarwal, "GLCM and its application in pattern recognition," 2017 5th International Symposium
on Computational and Business Intelligence (ISCBI), Dubai, 2017, pp. 20-25, doi: 10.1109/ISCBI.2017.8053537.
16. H. Wang.: Nearest Neighbours without k: A Classification Formalism based on Probability, technical report, Faculty of
Informatics, University of Ulster, N.Ireland, UK (2002)
17. G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, “KNN Model-Based Approch in Classification,” Lecture Notes in Computer
Science, pp 986-996, 2003.
18. Y.-S. Sun, Z. Zhao, Z.-N. Yang, F. Xu, H.-J. Lu, Z.-Y. Zhu, W. Shi, J. Jiang, P.-P. Yao, and H.-P. Zhu, ‘‘Risk factors and
preventions of breast cancer,’’ Int. J. Biol. Sci., vol. 13, no. 11, p. 1387, 2017.
19. Islam, M.M., Haque, M.R., Iqbal, H. et al. Breast Cancer Prediction: A Comparative Study Using Machine Learning
Techniques. SN COMPUT. SCI. 1, 290 (2020).

20. X. Zeng, Y. -W. Chen and C. Tao, "Feature Selection Using Recursive Feature Elimination for Handwritten Digit
Recognition," 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing,
2009, pp. 1205-1208, doi: 10.1109/IIH-MSP.2009.145.
21. Shen, L., Margolies, L.R., Rothstein, J.H. et al. Deep Learning to Improve Breast Cancer Detection on Screening
Mammography. Sci Rep 9, 12495 (2019).
22. B. Xue, M. Zhang and W. N. Browne, "Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective
Approach," in IEEE Transactions on Cybernetics, vol. 43, no. 6, pp. 1656-1671, Dec. 2013, doi:
10.1109/TSMCB.2012.2227469.

