
e-ISSN: 2582-5208

International Research Journal of Modernization in Engineering Technology and Science


( Peer-Reviewed, Open Access, Fully Refereed International Journal )
Volume:03/Issue:08/August-2021 Impact Factor- 5.354 www.irjmets.com

CLASSIFICATION OF CANCER BY GENE EXPRESSION USING NEURAL NETWORK
Samriti Wadhwa*1, Dr. Sukhvinder Kaur*2
*1Student, Electronics And Communications, Swami Devi Dayal Institute Of Engineering And
Technology, Panchkula, Haryana, India.
*2HOD, Electronics And Communications, Swami Devi Dayal Institute Of Engineering And Technology,
Panchkula, Haryana, India.
ABSTRACT
The classification of different types of tumors is of great importance in cancer diagnosis and drug discovery.
Gene expression data is believed to hold the keys to fundamental problems in cancer diagnosis. The recent
advent of DNA microarray technology has made it possible to monitor thousands of gene expression levels
rapidly. With this large quantity of gene expression data, scientists have started to explore cancer
classification using gene expression datasets. To gain a thorough understanding of cancer classification, it is
necessary to take a closer look at the problem, the proposed solutions, and the related issues together. In this
work, I present an approach to leukemia classification that applies deep learning, implemented with Google
TensorFlow, to gene expression data.
Keywords: Cancer Classification, AI, Neural Network, Microarray, Binary Classification, Deep Learning, Gene
Expression Data, Machine Learning.
I. INTRODUCTION
Various machine learning and statistical classification methods have been applied to cancer classification,
but several issues make it a nontrivial task. Gene expression data differs from the data these methods had
previously dealt with. First, it has very high dimensionality, typically comprising thousands to tens of
thousands of genes. Second, publicly accessible datasets are very small, usually with fewer than 100 samples.
Third, most genes are irrelevant to distinguishing cancer types. Existing classification methods were not
designed to handle this sort of data efficiently and effectively. Some researchers have therefore proposed
performing gene selection before cancer classification[1].
Biological Background Information
DNA is a double-stranded polymer composed of four basic molecular units called nucleotides. Each nucleotide
comprises a phosphate group, a deoxyribose sugar, and one of four nitrogenous bases: adenine (A),
guanine (G), cytosine (C), and thymine (T). The two strands of the double helix are held together by
hydrogen bonds between the bases, which pair A with T and C with G. Each strand in the DNA double helix is
therefore the chemical complement (a “mirror image”) of the other. DNA acts as a template for making copies
of itself and also as a blueprint for a molecule called RNA (ribonucleic acid). The genome provides the
template for the synthesis of a variety of RNA molecules. The main types of RNA are messenger RNA (mRNA),
transfer RNA (tRNA), and ribosomal RNA (rRNA)[2].
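The base-pairing rule above can be illustrated with a minimal sketch. The function names and the example sequence are made up for illustration and are not part of the study.

```python
# Watson-Crick pairing as stated in the text: A with T, C with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the complementary strand written in the opposite direction."""
    return complement(strand)[::-1]

example = "ATGCCG"  # invented sequence
print(complement(example))          # TACGGC
print(reverse_complement(example))  # CGGCAT
```

Applying complement twice returns the original strand, which is what makes each strand a usable template for the other.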
Gene Expression and DNA Microarray
The process of transcribing a gene’s DNA sequence into RNA, which for protein-coding genes is then translated
into protein, is called gene expression. This process is depicted in Figure 1. A DNA microarray (also commonly
known as a DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface.
Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or
to genotype multiple regions of a genome.
PCA
The principal components of a set of points in real coordinate space are a sequence of p unit vectors, where
the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i-1
vectors. A best-fitting line is defined as one that minimizes the average squared distance from the points to
the line. These directions form an orthonormal basis in which the individual dimensions of the data are
linearly uncorrelated.

Principal component analysis (PCA) is the process of computing the principal components and using them to
perform a change of basis on the data, sometimes keeping only the first few principal components and
discarding the rest[3].
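As a minimal sketch of this definition, the first principal direction can be estimated by power iteration on the sample covariance matrix. The function name, the iteration count, and the toy points are assumptions made for illustration; a library routine would normally be used in practice.

```python
import math

def first_principal_component(rows, iterations=200):
    """Estimate the first principal direction of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize to unit length.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data along the estimated direction.
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance

points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]  # invented, roughly y = 2x
direction, variance = first_principal_component(points)
print(direction, variance)
```

For these toy points the estimated direction has a slope close to 2, which is the line the points lie along, and almost all of the total variance falls on that single component.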

Figure 1. Gene Expression

Artificial Neural Network


Artificial neural networks (ANNs), usually simply called neural networks (NNs), are computing systems
loosely inspired by the biological neural networks that constitute animal brains. An ANN is based on a
collection of connected units or nodes called artificial neurons, which loosely model the neurons in a
biological brain. Each connection, like a synapse in a biological brain, can transmit a signal to other neurons.
An artificial neuron that receives a signal processes it and can then signal the neurons connected to it. The
"signal" at a connection is a real number, and the output of each neuron is computed by some non-linear
function of the sum of its inputs. The connections are called edges. Neurons and edges typically have a weight
that is adjusted as learning proceeds. The weight increases or decreases the strength of the signal at a
connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that
threshold. Typically, neurons are grouped into layers. Different layers may perform different transformations
on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer),
possibly after traversing the layers multiple times[4].
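The behaviour of a single artificial neuron described above can be sketched as follows. The input signals, weights, and function names are invented for illustration only.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through a non-linearity."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def step(z, threshold=0.0):
    """Fire (1.0) only if the aggregate signal crosses the threshold."""
    return 1.0 if z > threshold else 0.0

def sigmoid(z):
    """Smooth alternative to the hard threshold."""
    return 1.0 / (1.0 + math.exp(-z))

signal = [0.5, -1.0, 2.0]    # invented input signals
weights = [0.4, 0.3, 0.1]    # invented connection weights
print(neuron(signal, weights, bias=0.0, activation=step))     # 1.0
print(neuron(signal, weights, bias=0.0, activation=sigmoid))
```

Learning amounts to adjusting the weights and the bias so that the neuron's outputs move closer to the desired targets.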
II. METHODOLOGY
Figure 2 shows the workflow of the proposed methodology. We use the leukemia data (Golub et al., 1999) for
classification[11]. The analytical work on the data can be divided into the following steps:
1. Data cleaning and pre-processing.
2. Feature extraction and selection.
3. Building a neural network on top of the gene expression data.
4. Classifying each sample with the deep neural network to determine the cancer type.
FEATURE(GENE) SELECTION PROCESS
Feature selection is a useful preprocessing method in data mining; it is usually used to reduce the
dimensionality of the data and improve classification accuracy. Gene expression datasets have distinctive
properties that set them apart from the data previously used for classification[12]. Most publicly accessible
gene expression data has the following properties:
o High dimensionality: from thousands to tens of thousands of genes.
o Very small dataset size: fewer than 100 patients.
o Most genes are not relevant to cancer classification.


Figure 2. Proposed Methodology


Principal component analysis (PCA) is the method of computing the principal components and using them to
perform a change of basis on the data, sometimes keeping only the first few principal components and
discarding the rest. For the problem of cancer classification, I believe that considerable attention should be
paid to gene selection, and that it should be made an essential preprocessing step. A scalar is defined as a
quantity that is invariant under an orthogonal transformation. I split the data in a 70:30 ratio, keeping 70%
for training and 30% for testing.
Rectified linear Activation Function(RELU)
The rectified linear activation function, or ReLU for short, is a piecewise linear function that outputs the
input directly if it is positive and outputs zero otherwise. It has become the default activation function for
many types of neural networks because a model that uses it is easier to train and often achieves better
performance. Figure 3 shows the graph of this function[15].
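A minimal sketch of the ReLU definition follows. The helper names are illustrative, and the slope at exactly zero is set to 0 by convention.

```python
def relu(x: float) -> float:
    """Return x if it is positive, otherwise zero."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 chosen at x = 0)."""
    return 1.0 if x > 0 else 0.0

print([relu(x) for x in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]
```

Because the slope is exactly 1 for every positive input, gradients passing through active units are not shrunk, which is one commonly cited reason such models are easier to train.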

Figure 3. Rectified Linear activation function


Sigmoid Activation Function
One of the numerous activation functions is the sigmoid function, which is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid function produces outputs in the range (0, 1), which makes it well suited to binary classification
problems in which we need the probability that a data point belongs to a particular class. The sigmoid function
is differentiable at every point, and its derivative is $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, where
$\sigma(x)$ is the sigmoid value, as illustrated in Figure 4. Since this expression involves only the sigmoid
function itself, the value computed in the forward pass can be reused to speed up backpropagation.
The sigmoid function suffers from “vanishing gradients”: it flattens out at both ends, which causes very small
weight updates during backpropagation. This can make the neural network stop learning and get stuck. That is
why the sigmoid function is increasingly being replaced by other non-linear functions such as ReLU[16].
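The following sketch evaluates the sigmoid and its derivative at a few points to make the vanishing-gradient effect concrete. The function names and sample inputs are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, with outputs in the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative expressed through the forward value: s * (1 - s)."""
    s = sigmoid(x)  # the forward value is computed once and reused
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(x, round(sigmoid(x), 5), round(sigmoid_grad(x), 5))
```

The derivative peaks at 0.25 for x = 0 and falls below 0.0001 by x = 10, so weight updates that pass through a saturated sigmoid unit become very small.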

Figure 4. Sigmoid Function

DEEP NEURAL NETWORK


A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and
output layers[13]. There are various types of neural networks, but they always comprise the same components:
neurons, synapses, weights, biases, and functions. These components function in a way loosely analogous to
the human brain and can be trained like any other ML algorithm. Figure 5 shows the design of a typical DNN.

Figure 5. Deep Neural Network


DIRECTED ACYCLIC GRAPH (DAG)
A directed acyclic graph (DAG) is a directed graph with no directed cycles. It consists of vertices and edges
(also called arcs), with every edge directed from one vertex to another, such that following those directions
never forms a closed loop. A directed graph is a DAG if and only if it can be topologically ordered, that is, its
vertices can be arranged in a linear order that is consistent with all edge directions. Figure 6 shows a
complex DAG. DAGs have many scientific and computational applications, ranging from biology to sociology to
computation[14].
A typical DAG has four kinds of elements:
o Nodes - places to store the data.
o Directed edges - arrows that point in one direction.
o Ancestral (root) nodes - nodes with no parents.
o Leaves - nodes with no children.
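The topological-ordering property can be sketched with Kahn's algorithm, which repeatedly removes nodes that have no remaining parents. The function name and the layer names below are hypothetical.

```python
from collections import deque

def topological_order(edges):
    """Return a topological ordering of the graph, or None if it has a cycle."""
    nodes = {n for edge in edges for n in edge}
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # Start from the nodes with no parents.
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # If some nodes were never released, the graph contains a cycle.
    return order if len(order) == len(nodes) else None

layers = [("input", "hidden1"), ("hidden1", "hidden2"), ("hidden2", "output")]
print(topological_order(layers))                    # ['input', 'hidden1', 'hidden2', 'output']
print(topological_order([("a", "b"), ("b", "a")]))  # None
```

A feed-forward network is a DAG in this sense: its layers can always be evaluated in such an order, from the input to the output.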

Figure 6. Directed Acyclic Graph

III. MODELING AND ANALYSIS


Exploratory Data Analysis
The raw data is far from ready for modeling, as it stores the genes (features) in 7129 rows. The data therefore
has to be transformed before a model can use it. The ratio of the two classes, ALL:AML, is 65:35, which
indicates a class imbalance.
Data Understanding
The numbered columns in the data refer to the 72 patients in the study, cross-indexed to the ground truth in
actual.csv. Given all 7129 gene values for patient 1, one should be able to predict Acute Myeloid Leukemia
(AML) or Acute Lymphoblastic Leukemia (ALL). It is important to note that these values have been scaled to be
comparable.
The call column can take three possible values: Present, Absent, and Marginal. Each is a determination of
whether that gene is present in the sample in the preceding column.
Proposed Algorithm
Based on the understanding of the data gained from the EDA described in the previous section, I created the
data processing pipeline shown in Figure 7.

Figure 7. Proposed Algorithm

Data Preparation

The following steps were performed on the raw dataset to make it ready for feature extraction.
Data Cleaning
I tidied up the dataset by removing the call columns and removing duplicates.
Data Preprocessing
Transpose
First, we need to engineer the features (genes), so the first step is to transpose the data. After transposing,
the features are in columns instead of rows.
Reordering and reshaping
Neither the training nor the testing row indexes are in numeric order, so I reorder them so that the labels
line up with the corresponding data.
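The cleaning, transposing, and reordering steps can be sketched on a toy table. The column names and every value below are invented; the real files have 7129 gene rows and may use different column headers.

```python
# Toy table in the layout the text describes: one row per gene, one numbered
# column per patient, each followed by a "call" column. Values are invented.
header = ["gene", "3", "call", "1", "call", "2", "call"]
rows = [
    ["gene_a", 120, "P", -40, "A", 15, "M"],
    ["gene_b", 300, "P", 210, "P", -5, "A"],
]

# 1. Cleaning: keep only the gene name and the numbered patient columns.
keep = [i for i, name in enumerate(header) if name != "call"]
patients = [header[i] for i in keep[1:]]
genes = [row[0] for row in rows]
values = [[row[i] for i in keep[1:]] for row in rows]

# 2. Transpose: patients become rows, genes become columns.
by_patient = dict(zip(patients, zip(*values)))

# 3. Reorder: sort patient ids numerically so rows line up with the labels.
ordered_ids = sorted(by_patient, key=int)
table = [list(by_patient[pid]) for pid in ordered_ids]

print(genes)        # ['gene_a', 'gene_b']
print(ordered_ids)  # ['1', '2', '3']
print(table)        # [[-40, 210], [15, -5], [120, 300]]
```

Sorting with key=int matters because the patient ids are strings; a plain string sort would place "10" before "2".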
Standardizing Features
The test dataset must be scaled identically to the training dataset. This standardization can be achieved by
scaling the data with the StandardScaler class in the scikit-learn library.
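The requirement that the test data reuse the training scaling can be sketched without any library. The helper names and numbers are illustrative; this mirrors what a standard scaler does (per-feature mean and population standard deviation learned from the training rows) but is not the scikit-learn implementation.

```python
from statistics import mean, pstdev

def fit_scaler(train_rows):
    """Learn per-feature mean and standard deviation from the training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant features
    return means, stds

def transform(rows, means, stds):
    """Apply previously learned statistics to any set of rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # invented values
test = [[2.0, 40.0]]
means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # reuses the training statistics
print(test_scaled)
```

Fitting the statistics on the training rows alone keeps information from the test set out of the preprocessing step.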

Figure 8. Gaussian Distribution of Original and Scaled data


The scaled data is standardized and more evenly distributed, which makes it a good fit for feature selection
and data modeling. Figure 8 shows the Gaussian distribution of the original and scaled datasets.
Feature Extraction
After standardizing the features, the next step is to extract the features for modeling from the 7129 features
available. Because models rarely work well with so many features, I need to reduce them to a manageable
number by using a dimensionality reduction technique called Principal Component Analysis (PCA).
Principal Component Analysis(PCA)
95% of the variance is explained by 32 principal components. We cannot plot something in 32 dimensions, so
Figure 9 shows what the PCA looks like when only the top three components are kept.
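Choosing the number of components from a variance target can be sketched as below. The eigenvalues are invented for illustration and are not the values obtained on the leukemia data.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose variance share reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)

spectrum = [5.0, 3.0, 1.0, 0.6, 0.4]  # invented eigenvalues of a covariance matrix
print(components_for_variance(spectrum))  # 4
```

With these toy eigenvalues the first three components explain 90% of the variance and the first four explain 96%, so four components are kept for a 95% target.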
Classification
After PCA we have 32 principal components, or features, in our dataset. Next, I need to train a neural network
for binary classification based on these 32 features.
Here I use two layers with the rectified linear activation function for faster learning and then feed their
output to a layer with a sigmoid activation function, which classifies the cancer type for each patient. The
layers are stacked in a sequential model.
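The forward pass of such a network can be sketched in plain Python. The hidden-layer widths, the random weights, and the helper names are assumptions made for illustration (the layer sizes are not stated here), and this is not the TensorFlow model used in the study.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each output unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
N_FEATURES = 32             # principal components, as in the text
HIDDEN_1, HIDDEN_2 = 16, 8  # hypothetical widths
w1, b1 = init_layer(N_FEATURES, HIDDEN_1, rng)
w2, b2 = init_layer(HIDDEN_1, HIDDEN_2, rng)
w3, b3 = init_layer(HIDDEN_2, 1, rng)

def predict(features):
    """Two ReLU layers followed by a single sigmoid output unit."""
    h1 = dense(features, w1, b1, relu)
    h2 = dense(h1, w2, b2, relu)
    return dense(h2, w3, b3, sigmoid)[0]  # probability of the positive class

patient = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]  # invented input
probability = predict(patient)
label = 1 if probability >= 0.5 else 0
print(probability, label)
```

Training would adjust the weights by backpropagation. With the untrained weights above, the output stays close to 0.5.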


Figure 9. Top 3 Feature directions

Figure 10 shows the plot of the directed acyclic graph(DAG) with the input and output at each node/layer of
this deep neural network.

Figure 10. DAG of our DNN

IV. RESULTS AND DISCUSSION


To train the neural network classification model described in the section above, I used the PCA-transformed
dataset with a batch size of 8 and trained for 200 epochs. After 200 epochs, the training accuracy was 85.29%.
Once trained on the training dataset, the model was used to predict classes on the test dataset, which gave a
testing accuracy of 85.3%.
After testing, I created the neural network confusion matrix to describe the performance of my binary
classification model.

Figure 11. Final NN Confusion Matrix

Figure 11 shows the confusion matrix. Of the 20 patients in the test dataset with ALL, 18 were correctly
identified by my model while 2 were incorrectly tagged as AML. Of the 14 patients in the test dataset with
AML, 11 were correctly identified (true positives) while 3 were incorrectly tagged as ALL.
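The reported test accuracy can be checked directly from these counts. The sketch below assumes that the 3 misclassified AML patients were labelled ALL; the per-class metrics are derived quantities that the text does not report.

```python
# Counts read from the confusion matrix described in the text.
all_correct, all_as_aml = 18, 2  # 20 test patients with ALL
aml_correct, aml_as_all = 11, 3  # 14 test patients with AML (assumed tagged ALL)

total = all_correct + all_as_aml + aml_correct + aml_as_all
accuracy = (all_correct + aml_correct) / total
recall_all = all_correct / (all_correct + all_as_aml)
recall_aml = aml_correct / (aml_correct + aml_as_all)
precision_aml = aml_correct / (aml_correct + all_as_aml)

print(f"accuracy        = {accuracy:.4f}")       # 0.8529
print(f"recall (ALL)    = {recall_all:.4f}")     # 0.9000
print(f"recall (AML)    = {recall_aml:.4f}")     # 0.7857
print(f"precision (AML) = {precision_aml:.4f}")  # 0.8462
```

The overall accuracy of 29/34 matches the reported 85.3%, while the lower recall for AML shows that the minority class is the harder one for this model.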
We can compare this result with similar studies done on the same Leukemia data set. Table 1 shows this
comparison.
Table 1. Comparison with similar studies

V. CONCLUSION
Neural networks are a rapidly developing technology in the field of Artificial Intelligence. Here I classified
the type of cancer based on pre-calculated gene expression with over 85% accuracy. This model and modeling
technique could be useful in the accurate and early detection of acute myeloid leukemia (AML) and acute
lymphoblastic leukemia (ALL) for clinical purposes. The results satisfy my objective of classifying cancer
from gene expression data using a neural network classification model built with Google's TensorFlow. When we
compare the results with older approaches such as SVM, some of those mature strategies outperform the neural
network. The NN's performance is limited by the size of the dataset: here I had data for just 72 patients in
total. There is therefore considerable scope for improvement in the NN's accuracy and stability.
VI. REFERENCES
[1] A. Azuaje. Interpretation of genome expression patterns: computational challenges and opportunities.
IEEE Engineering in Medicine and Biology, 2000.
[2] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with
gene expression profiles. In Proc. of the Fourth Annual Int. Conf. on Computational Molecular Biology,
2000.
[3] JOLLIFFE, I.T., 2002. Principal Component Analysis, second edition, New York: Springer-Verlag New
York, Inc.
[4] McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The
Bulletin of Mathematical Biophysics, 5(4), 115–133.
[5] Empirical Assessment of Classification Accuracy of Local SVM Nicola Segata and Enrico Blanzieri March
21, 2008
[6] Lee, Yoonkyung & Lee, Cheol-Koo. (2003). Classification of Multiple Cancer Types by Multicategory
Support Vector Machines Using Gene Expression Data. Bioinformatics. 19. 1132-1139.
10.1093/bioinformatics/btg102.
[7] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR,
Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science. 1999 Oct 15;286(5439):531-7. DOI:
10.1126/science.286.5439.531. PMID: 10521349.
[8] Khan, J. and Wei, J. S. and Ringner, M. and Saal, L. H. and Ladanyi, M. and Westermann, F. and Berthold,
F. and Schwab, M. and Antonescu, C. R. and Peterson, C. and Meltzer, P. S. (2001). Classification and
diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature
Medicine, 7, 673–679.
[9] Rifkin, Ryan & Mukherjee, Sayan & Tamayo, Pablo & Ramaswamy, Sridhar & Yeang, Chen-Hsiang &
Angelo, Michael & Reich, Michael & Poggio, Tomaso & Lander, Eric & Golub, Todd & Mesirov, Jill.
(2003). An Analytical Method for Multiclass Molecular Cancer Classification. Society for Industrial and
Applied Mathematics. 45. 706-723. 10.1137/S0036144502411986.
[10] Jia Lv, Qinke Peng, Xiao Chen, Zhi Sun, A multi-objective heuristic algorithm for gene expression
microarray data classification, Expert Systems with Applications, volume 59, 2016, Pages 13-19, ISSN
0957-4174, https://doi.org/10.1016/j.eswa.2016.04.020.
[11] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR,
Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science. 1999 Oct 15;286(5439):531-7. DOI:
10.1126/science.286.5439.531. PMID: 10521349.
[12] Zrimec, J., Börlin, C.S., Buric, F. et al. Deep learning suggests that gene expression is encoded in all parts
of a co-evolving interacting gene regulatory structure. Nat Commun 11, 6141 (2020).
https://doi.org/10.1038/s41467-020-19921-4
[13] Bengio, Y. (2016). Deep Learning. MIT Press.
[14] Thulasiraman, K.; Swamy, M. N. S. (1992), "5.7 Acyclic Directed Graphs", Graphs: Theory and
Algorithms, John Wiley and Son, p. 118, ISBN 978-0-471-51356-8.
[15] Agarap, Abien Fred. (2018). Deep Learning using Rectified Linear Units (ReLU).
[16] Sridhar Narayan, The generalized sigmoid activation function: Competitive supervised learning,
Information Sciences, Volume 99, issues 1–2, 1997, Pages 69-82, ISSN 0020-0255,
https://doi.org/10.1016/S0020-0255(96)00200-9.
