308387_Brain and Breast
Apr 11, 2022 11:19 AM GMT+5:30 Apr 11, 2022 11:21 AM GMT+5:30
Cancer Cell Detection using Machine Learning
Ankit Kumar Singh
Department of Computer Science and Engineering
Galgotias College of Engineering and Technology
Greater Noida, 201310, Uttar Pradesh, India
ankitsinghchs07@gmail.com

Ms. TanuShree
Department of Computer Science and Engineering
Galgotias College of Engineering and Technology
Greater Noida, 201310, Uttar Pradesh, India
tanu.shree@galgotiacollege.edu

Nitin Kumar Yadav
Department of Computer Science and Engineering
Galgotias College of Engineering and Technology
Greater Noida, 201310, Uttar Pradesh, India
nk157798@gmail.com

Akash Pandey
Department of Computer Science and Engineering
Galgotias College of Engineering and Technology
Greater Noida, 201310, Uttar Pradesh, India
Abstract— Early detection of cancer is required to provide proper and personalized treatment to the patient and to reduce the risk of death due to cancer. Detection of cancerous cells at later stages leads to more suffering and potentially increases the chances of death. Researchers have been developing various machine learning solutions that produce encouraging results. In this paper, we explore the techniques and technologies already in practice for detecting cancer cells in their early stages, as well as work presently under way in the industry. The main objective of this project is to develop a machine learning algorithm that requires minimal human intervention.

Keywords— cancer, machine learning, dataset, algorithm

I. INTRODUCTION

Cancer is the world's second biggest killer disease after heart disease and stroke. Cancer is a group of diseases characterized by cells inside the body that grow abnormally and without control. It spreads quickly to other parts of the body. Cancer damages the human body gradually as cells start growing uncontrollably and form lumps of tissue called tumors. Not all tumors are cancerous; some do not spread in the body. Tumors may grow and interact with other parts of the body, such as the nervous system, digestive system or circulatory system. The affected parts of the body release hormones that cause changes in the body. Cancerous cells can spread to other cells and destroy the surrounding tissue, causing other tumors to develop. Tumors are of two types: malignant and benign. Malignant tumors are more dangerous and can be life-threatening. Benign tumors usually do not cause much damage, but they can become dangerous if they grow large, and they may become malignant over time.

Cancer has various symptoms, such as tumors, abnormal bleeding and excessive weight loss. To provide appropriate treatment to patients, symptoms must be studied properly, and an automatic prediction system is required that classifies a tumor as benign or malignant. There are about 100 different types of cancer affecting the human body.

Once the disease is discovered, the next task is to determine the stage of the cancer. The stage can be determined from various factors such as thickness, depth of penetration, and the extent to which the melanoma or infection has spread. Patients are treated according to the stage determined. In recent years, due to the increased use of cosmetics and exposure to pollution and radiation, cancer has become a common disease.

II. RELATED WORK

In paper [2], the author studied syndrome clusters in breast cancer drawn from both social media and research study data using K-medoid clustering. The author also developed an improved K-medoid clustering method that raises clustering performance by reassigning some of the syndromes with negative average silhouette width (ASW) to other clusters after the initial k-medoid clustering.

In paper [7], the author explored many features and classifiers for selecting genes from noisy microarray data. Two datasets were used, brain cancer and breast cancer, with 72 and 47 samples respectively. Pearson's and Spearman's correlation coefficients, Euclidean distance, information gain, mutual information and signal-to-noise ratio were used for feature selection. For classification, they used MLP, kNN, SVM and SOM. Experiments were run on all the given datasets, and the best accuracy reported was 97.1% on the Leukemia dataset with all the classifiers.

In paper [8], the author explored PSO for the prediction of patient survival using gene expression data. PSO reduces the dimensionality and is combined with a probabilistic NN (PNN). The experimental results of PSO/PNN on the B-cell
Lymphoma dataset of 240 samples were effective, reaching up to 80% accuracy in predicting survival.

In paper [11], the author proposed a novel approach based on feature selection to classify high-dimensional cancer microarray data. This approach combines a filtering technique, signal-to-noise ratio (SNR), with PSO for optimization. They found that PSO gives better results when used along with SVM, k-NN and PNN. The datasets described are Brain Cancer with 72 instances and 7129 genes, Breast Cancer with 62 instances and 2000 genes, and DLBCL with 77 instances and 6817 genes. PSO combined with the other classifiers gave 100% accuracy in the case of Breast Cancer.

In paper [12], the author sought to extract differentially expressed (DE) genes between early and advanced cases of multiple cancer types using RNA sequencing data. The importance of these genes is further examined by developing predictive models using K-nearest neighbour and linear discriminant analysis classifiers. The outcome states that an analysis across cancer types can be highly comparable to standard analyses of individual cancers in identifying biologically relevant DE genes and can assist in developing powerful predictive models for cancer prediction.

Microarray gene expression data normally contains an enormous number of genes compared with the small number of samples available. It is therefore a challenging task to identify a small subset of informative genes from microarray gene expression data such that the discriminating genes alone can be used to classify the cancer subtypes accurately. To this end, paper [24] proposed a computationally efficient yet accurate gene identification strategy. First, the t-test is used to reduce the dimensionality of the dataset; after that, the proposed particle swarm optimization based approach is used to find informative genes. This strategy was applied to the small round blue cell tumor (SRBCT) data to classify its four subtypes, namely neuroblastoma, non-Hodgkin lymphoma, rhabdomyosarcoma and Ewing sarcoma.

In paper [18], the author studied many classification methods and feature selection methods for expressed genes in microarray data. They evaluated the efficiency of various classification methods such as SVM, Radial Basis Function, Multi-Layer Perceptron, DT and RF. 9-fold cross-validation was applied to calculate the accuracy of the classifiers, including K-means. The efficiency of the feature selection methods SVM-RFE, Chi-Squared and Correlation-based Feature Selection (CFS) was also measured. The authors obtained the best result with the SVM-RFE feature selection method, with 100% accuracy in identifying the significant genes.

In paper [21], the author applied data mining techniques on a large scale to discover valuable knowledge. Rough set theory was used to find data dependencies and reduce the feature set contained in the dataset. Hybrid Particle Genetic Swarm Optimization was used to optimize the selected features of brain cancer at different stages. A multiclass SVM was adopted to classify normal cases and the different stages of brain cancer using the optimized feature set. The brain cancer dataset comprised 12042 genes with 493 instances. The multiclass SVM, ANN and Naïve Bayes classifiers achieved accuracies of 96%, 93% and 90% respectively.

Paper [25] compared different machine learning algorithms, SVM, C4.5, NB and kNN, on the WBCD dataset, which has 699 instances and 11 integer-valued attributes. Among all the algorithms, SVM gave the highest accuracy, 97.13%, and the lowest error rate. The experiments were conducted in the WEKA data mining tool.

III. METHODS

The following steps are performed for the early detection of cancerous cells in human beings.

A. DATASET
The datasets were collected from existing databases, which already hold cancer cell data from which cancer cells can be identified. Many repositories, such as Kaggle, store such data.

B. PRE-PROCESSING
Pre-processing is the process of improving the quality of the datasets that will be used for training and testing the model. It mainly includes thresholding, filtering and log transformation.

C. CLASSIFICATION TECHNIQUES
Different types of techniques are used to detect cancerous cells. A few of them are listed below:
i) Stochastic Gradient Descent (SGD)
ii) Support Vector Machines (SVM - Linear Kernel)
iii) Support Vector Machines (SVM - Gaussian Kernel)
iv) Convolutional Neural Network (CNN)
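The two SVM variants in the list above differ only in the kernel function used to compare two samples. The following minimal Python sketch shows the standard linear and Gaussian (RBF) kernel definitions. The function names, the sample vectors and the value gamma = 0.5 are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two hypothetical feature vectors (assumed values, for illustration only).
u = [1.0, 2.0]
v = [2.0, 0.0]

k_lin = linear_kernel(u, v)     # 1*2 + 2*0
k_rbf = gaussian_kernel(u, v)   # exp(-0.5 * (1 + 4))
```

The linear kernel yields a straight decision boundary in the original feature space, while the Gaussian kernel measures similarity by distance and so allows curved boundaries.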
D. Classification Techniques used
i) Stochastic Gradient Descent (SGD) - SGD forms the basis of training neural networks. It is an iterative algorithm that starts from a random point on a function and travels down its slope in steps until it reaches the lowest point of that function. This algorithm is useful in cases where the optimal point cannot be found by setting the slope of the function to zero and solving directly.
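The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the training code used in this work: it fits a straight line to four noise-free points with a squared-error loss, and the function name, learning rate, epoch count and data are all assumptions chosen for the example.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    # Start from a random point, as in the description above.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)              # visit the samples in random order
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            w -= lr * err * xs[i]       # step down the slope with respect to w
            b -= lr * err               # step down the slope with respect to b
    return w, b


# Hypothetical data generated from y = 2x + 1 (assumed, for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = sgd_linear_fit(xs, ys)
```

Each update uses a single sample rather than the whole dataset, which is what distinguishes stochastic gradient descent from ordinary (batch) gradient descent.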
Fig. 2. Difference between malignant and benign cancers

iv) Convolutional Neural Network (CNN) - A CNN is a type of artificial neural network used in image recognition and processing. Its applications include image and video recognition, recommendation systems and natural language processing. A CNN uses a multilayer perceptron design that has been adapted for reduced processing requirements. It consists of an input layer, an output layer, and hidden layers that include multiple convolutional layers, pooling layers, fully connected layers and normalization layers.
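To make the layer types concrete, the sketch below applies one convolution, a ReLU activation and a 2x2 max-pooling step to a tiny image in plain Python. It is an illustration under assumed inputs only: the 5x5 image, the 2x2 edge-detecting kernel and the function names are hypothetical, and a real CNN learns its kernels during training.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the operation in a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
kernel = [[-1, 1],
          [-1, 1]]   # responds to dark-to-bright vertical edges

features = max_pool2x2(relu(conv2d(image, kernel)))
```

The pooled feature map is non-zero only in the left column, where the edge lies, which shows how convolution localizes a pattern and pooling shrinks the representation.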
Fig. 3. Images of a brain with brain cancer

The above figures depict the different types of datasets to be taken into account when predicting the different types of cancer, namely brain cancer and breast cancer. The models are thoroughly trained on the datasets and then tested in order to remove any error still persisting in the model that might interfere with the final results. The accuracy of the different algorithms is also measured in order to get a clear picture of which algorithm is better.

ii) Pre-processing - The main task of pre-processing is to enhance the input image and put it into a form suited to either a human or a machine vision system. Pre-processing helps to improve parameters of MR images such as SNR, removing noise artifacts, smoothing inner regions and preserving edges. To improve the SNR and the clarity of the raw MR images, we apply adaptive contrast enhancement based on a modified sigmoid function.

iii) Feature Extraction - Feature extraction is the process of collecting higher-level information about an image, such as shape, texture, colour and contrast. Texture analysis is an important parameter of human visual perception and of machine learning algorithms. It is used effectively to improve the accuracy of a diagnosis system by selecting prominent features. One of the most widely used image analysis applications is the Grey Level Co-occurrence Matrix (GLCM) and its texture features.
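As a minimal sketch of the GLCM idea, the Python code below counts how often each pair of grey levels occurs side by side in a small image and derives two common texture features, contrast and energy. The 4x4 image, the four grey levels, the horizontal offset and the function names are assumptions for illustration and do not come from the datasets used here.

```python
def glcm(image, levels, dr=0, dc=1):
    """Count pairs (a, b) where grey level b lies at offset (dr, dc) from a."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def normalize(counts):
    """Turn raw co-occurrence counts into joint probabilities."""
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Weight each pair by its squared grey-level difference."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def energy(p):
    """Sum of squared probabilities; high for uniform textures."""
    return sum(v * v for row in p for v in row)


# Hypothetical 4x4 image with grey levels 0..3.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
counts = glcm(image, levels=4)   # horizontal neighbour, one pixel to the right
p = normalize(counts)
```

Values such as `contrast(p)` and `energy(p)` can then be used as inputs to a classifier, alongside the other extracted features.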
iv) Classification - Classifying medical images is a difficult task in the automated detection of tumors. Classification determines whether or not an image contains a tumor. Several classifiers can be applied for this purpose.

B. For Breast Cancer

i) The Dataset - The machine learning algorithms were trained to detect breast cancer using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The dataset consists of features calculated from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe the characteristics of the cell nuclei. The dataset features are as follows: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

ii) Dataset Pre-processing - To avoid inappropriate assignment of relevance, the dataset was standardized as:
z = (X - μ)/σ
where X is the feature to be standardized, μ is the mean value of the feature, and σ is the standard deviation of the feature.

iii) Machine Learning Algorithms - For the detection of breast cancer, three different algorithms are used: stochastic gradient descent with an accuracy of 95.53%, SVM with a linear kernel with an accuracy of 96.49%, and SVM with a Gaussian kernel with an accuracy of 97.56%.

Results

Fig. 5. Histogram plotted for the different class labels used

Discussions

In this paper we have proposed a machine learning model and used different machine learning algorithms for the diagnosis and detection of breast and brain cancers. For pre-processing of the dataset, we used the standardization method. The dataset is extracted automatically from websites such as Kaggle. We then implemented our ML algorithms and achieved 94.56% accuracy.

V. CONCLUSION

From this survey we conclude that most automatic cancer prediction systems are based on machine learning concepts, including classification and clustering algorithms. This paper presents an extensive review of various machine learning classification techniques for the prediction of cancer, along with the standard datasets used for several varieties of cancer, such as brain cancer and breast cancer. A detailed list of the results obtained by many researchers using various computational intelligence techniques has been compiled. The most successful approach is SVM, alone or in combination with other techniques, which gave up to 99% accuracy on a small number of training samples; this is not a reliable indicator of performance on large datasets. However, there is room for improvement in predicting cancer at an early stage, and many datasets are available for further exploration. There are huge numbers of cancer types present with unknown and invariable functions.
VI. ACKNOWLEDGEMENT