Progress in Biophysics and Molecular Biology 177 (2023) 1–13

Contents lists available at ScienceDirect

Progress in Biophysics and Molecular Biology


journal homepage: www.elsevier.com/locate/pbiomolbio

A survey on gene expression data analysis using deep learning methods for
cancer diagnosis
Ravindran U a, Gunavathi C b,*
a School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
b School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India

ARTICLE INFO

Keywords:
Gene expression data
Data augmentation
Deep learning methods

ABSTRACT

Gene expression data is biological data from which meaningful hidden information can be extracted. This gene information is used for disease diagnosis, especially in cancer treatment, based on the variations in gene expression levels. DNA microarray is an efficient method for gene expression classification and for the prediction of specific types of cancer. Due to the abundance of computing power, deep learning (DL) has become a widespread technique in the healthcare sector. Gene expression datasets have a limited number of samples but a large number of features. Data augmentation is needed for gene expression datasets to overcome the dimensionality problem in gene data. It is a technique for generating synthetic samples to increase the diversity of the data. Deep learning methods are designed to learn and extract features from raw input data in the form of multidimensional arrays. This paper reviews the existing research on deep learning techniques such as Feed Forward Neural Network (FFN), Convolutional Neural Network (CNN), Autoencoder (AE) and Recurrent Neural Network (RNN) for the classification and prediction of cancer and its types through gene expression data analysis.

1. Introduction

Bioinformatics is an interdisciplinary platform for solving problems in the healthcare domain using biological data by applying computational techniques (Yoo et al., 2012). The most popular biological data in the healthcare domain is genomics data. Gene expression data comes from a part of the chromosome in a human cell. There are three stages involved in expressing the gene information. The first stage is the isolation phase, in which messenger ribonucleic acid (mRNA) is isolated from the human cell. The second stage is the reverse transcription phase, in which the mRNA undergoes reverse transcription-polymerase chain reaction (RT-PCR) to produce complementary deoxyribonucleic acid (cDNA) fragments. The third stage is the hybridization phase, in which the cDNA is labeled with fluorescent tags and hybridized with the probes on the array to express the gene information (https://www.yourgenome.org/facts/what-is-gene-expression.). The gene expression level can be measured using two different technologies: RNA Sequencing (RNA-seq) and Microarray/DNA Chip (https://sapac.illumina.com/science/technology/next-generation-sequencing/microarray-rna-seq-comparison.html?langsel=/in/.) (https://www.youtube.com/watch?v=2c3t3tDEmsU.). RNA-seq is a sequencing technique that allows for the analysis of transcriptomes and the detection of splice variants and novel sequences (Mantione et al., 2014). Microarray technology is a technique that allows the monitoring of thousands of gene expression levels simultaneously. Gene expression data is mostly used for cancer classification for better prediction accuracy (Peng et al., 2003) (Simon, 2009).

Cancer is one of the leading causes of death worldwide, and roughly 10 million cancer deaths occurred in 2020. In general, one out of every six deaths in the world is due to cancer (Sung et al., 2021). Omics data analysis is used to identify the relationships between the biological molecules involved in a human organism. The aim of investigating omics datasets based on genome analysis is to classify tumor and non-tumor samples and to segregate the types of cancer (Kristensen et al., 2014). Analyzing genomics data is a challenging task because the dataset consists of a limited number of samples while the number of features is large (Kong and Yu, 2018). Therefore, data augmentation is often employed to increase the training sample size and to minimize overfitting. An effective method is needed for handling omics datasets to overcome problems like non-reproducible results, overfitting, and noisy data. To handle the data dimensionality problem in the dataset and obtain useful information, two methods are applied: feature selection

* Corresponding author.
E-mail address: gunavathi.cm@vit.ac.in (C. Gunavathi).

https://doi.org/10.1016/j.pbiomolbio.2022.08.004
Received 27 May 2022; Received in revised form 9 August 2022; Accepted 12 August 2022
Available online 19 August 2022
0079-6107/© 2022 Elsevier Ltd. All rights reserved.
U. Ravindran and C. Gunavathi Progress in Biophysics and Molecular Biology 177 (2023) 1–13

Fig. 2. Architecture of CNN.

Fig. 1. Architecture of FFN.

and feature extraction. The goal of the two techniques is to increase the accuracy of classification and reduce the computational cost. In feature selection, the basic principle is to select the important features from the original dataset based on some criteria. It is a preprocessing method that removes redundant and irrelevant features from the dataset (Shukla et al., 2018) (Shi et al., 2018). In feature extraction, the basic principle is to generate new features from the existing dataset based on some criteria. It is a preprocessing method in which a new, lower-dimensional feature space is derived from the existing feature space. Some of the feature extraction approaches are Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA) and Autoencoder (Tan et al., 2019) (Guyon et al., 2002) (Díaz-Uriarte and Alvarez de Andrés, 2006).
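To make the distinction concrete, the following is a minimal sketch, not taken from any of the surveyed papers, that contrasts the two ideas on a synthetic expression matrix. Feature selection keeps a subset of the original gene columns (here, those with the highest variance, which is only one possible criterion), while feature extraction builds new features with PCA computed through an SVD. The matrix shape, the variance criterion and the function names `select_top_variance` and `pca_extract` are illustrative assumptions.

```python
import numpy as np

def select_top_variance(X, k):
    """Feature selection: keep the k original columns with the largest variance."""
    idx = np.argsort(X.var(axis=0))[::-1][:k]
    return X[:, idx], idx

def pca_extract(X, k):
    """Feature extraction: project centered data onto the top-k principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
# Synthetic stand-in for a gene expression matrix: few samples, many features.
X = rng.normal(size=(20, 500))

X_sel, kept = select_top_variance(X, 10)  # 10 of the original genes
X_pca = pca_extract(X, 10)                # 10 new, derived features
```

Both outputs have 10 columns, but `X_sel` still refers to identifiable genes, whereas the columns of `X_pca` are linear combinations of all genes and are mutually uncorrelated.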
It is difficult to analyze data in an omics dataset because it is complex, high-dimensional, and heterogeneous. Deep learning is a suitable technique for handling omics datasets and obtaining better prediction results. In a deep learning algorithm, the features are learned automatically from the original input data (Hinton and Salakhutdinov, 2006). Deep learning methods include Feed Forward Neural Network (FFN), Convolutional Neural Network (CNN), Autoencoder (AE), and Recurrent Neural Network (RNN). These techniques are employed in real-time genomics applications such as genomics and sequence analysis, medical image processing, biomarker discovery and patient categorization (Martorell-Marugán et al., 2019).

Feed Forward Neural Network (FFN) is a hierarchical neural network structure. It consists of an input layer that takes an n-dimensional vector, L-1 hidden layers with n neurons each, and one output layer with k classes. It processes the inputs with weights and biases and is optimized with backpropagation (https://youtu.be/AASR9rOzhhA). Fig. 1 shows the architecture of Feed Forward Neural Network.

The output function of FFN:

$$f(x) = h_L(x) = O(a_L(x)) \tag{1}$$

where $h_i$ is the activation of layer $i$:

$$h_i(x) = g(a_i(x)) \tag{2}$$

and $a_i$ is the pre-activation of layer $i$:

$$a_i(x) = b_i + W_i h_{i-1}(x) \tag{3}$$

$W_i$ is the weight and $b_i$ is the bias.

Convolutional Neural Network (CNN) model is used to learn multiple layers of kernel filters along with the weights of the classifier. It consists of a convolution layer for extracting the features, a pooling layer for reducing the data size and a fully connected layer to classify the data (https://youtu.be/PmZp5VtMwLE). Fig. 2 shows the architecture of Convolutional Neural Network.

The output width of the convolution layer:

$$W = \frac{W_i - f + 2p}{s} + 1 \tag{4}$$

The output height of the convolution layer:

$$H = \frac{H_i - f + 2p}{s} + 1 \tag{5}$$

The output depth of the convolution layer:

$$D = k \tag{6}$$

where,

k is the number of filters,
s is the stride,
p is the padding,
f is the filter size and
$W_i$ and $H_i$ are the width and height of the previous layer.

Fig. 3. Architecture of AE.

Autoencoder (AE) is a special type of Feed Forward Neural Network with an encoder and a decoder. The encoder encodes the input into hidden information and the decoder reconstructs the input from the hidden information. This model is trained to minimize the loss function so that the reconstructed input is close to the original input (https://youtu.be/wPz3MPl5jvY). Fig. 3 shows the architecture of Autoencoder.

The output function of AE:

$$X_i = f(W * H + C) \tag{7}$$
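The FFN and CNN layer equations given earlier in this section, Eqs. (1)-(3) and Eqs. (4)-(6), can be checked with a short numerical sketch. The code below is illustrative only: the choice of `tanh` for the hidden activation g, softmax for the output function O, integer (floor) division in the size formula, and all layer sizes are assumptions that the text does not specify.

```python
import numpy as np

def ffn_forward(x, weights, biases, g=np.tanh):
    """Eqs. (1)-(3): a_i = b_i + W_i h_{i-1}, h_i = g(a_i), f(x) = O(a_L)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = g(b + W @ h)                      # hidden layers
    a_L = biases[-1] + weights[-1] @ h        # pre-activation of the output layer
    e = np.exp(a_L - a_L.max())               # O assumed to be a softmax
    return e / e.sum()

def conv_output_shape(w_in, h_in, f, p, s, k):
    """Eqs. (4)-(6): output width, height and depth of a convolution layer."""
    w_out = (w_in - f + 2 * p) // s + 1
    h_out = (h_in - f + 2 * p) // s + 1
    return w_out, h_out, k

rng = np.random.default_rng(0)
sizes = [4, 3, 3, 2]  # hypothetical: 4 inputs, two hidden layers of 3, 2 classes
weights = [rng.normal(size=(o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
probs = ffn_forward(rng.normal(size=4), weights, biases)

# A 32x32 input, 5x5 filter, padding 2, stride 1, 16 filters keeps the spatial size.
shape = conv_output_shape(32, 32, f=5, p=2, s=1, k=16)
```

With these settings `shape` is `(32, 32, 16)`, since (32 - 5 + 2*2)/1 + 1 = 32, and `probs` is a length-2 vector of class probabilities that sums to 1.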


The hidden function of AE:

$$H = g(W x_i + b) \tag{8}$$

where,

$x_i$ is the original input,
b is the bias,
C is a constant and
$X_i$ is the reconstructed input.

Recurrent Neural Network (RNN) has a recurrent connection used for handling sequential information. The state vector is the intermediate representation in the RNN. It combines the current input values with the previous values stored in the state vector. The output is based on the current input and past inputs (https://youtu.be/ym5qG-3kJ10). Fig. 4 shows the architecture of Recurrent Neural Network.

Fig. 4. Architecture of RNN

The output function of RNN:

$$y_i = f(V s_i + c) \tag{9}$$

The state function of RNN:

$$s_i = f(U x_i + W s_{i-1} + b) \tag{10}$$

where,

W, U and V are parameters shared across time steps $i$,
b is the bias and c is a constant.

2. Feed forward neural network based gene expression data analysis methods

A multistage ensemble classification was proposed to classify gene expression cancer data. In the first stage, the most informative genes are extracted from the first hidden neural network, and these informative genes are then taken as input for the second neural network. This model shows better classification accuracy for leukemia and colon datasets (Singh et al., 2019). Deep Cancer subtype Classification (Deep CC) is a supervised classification framework for cancer classification and prediction. This framework was used to categorize the gene expression for 13 independent datasets and the confusion matrix was calculated. The results showed better performance than Random Forest (RF), Support Vector Machine (SVM), and Gradient Boosting Machine (GBM). The proposed model learns the biological characteristics of different molecular subtypes from patient samples and can be used for single prediction (Gao et al., 2019).

A feed-forward neural network was proposed to discover biomarkers in a breast cancer dataset. The authors used the lasso penalty to estimate the parameters and Youden's J index to compare models. The simulation results were reported to be more robust if the Youden's J index value decreases, because this produces better sensitivity and specificity values compared to meta logistic regression and meta SVM methods (Lim et al., 2019). A Deep Neural Network model was proposed to classify normal and tumor samples of gene data and to learn the important characteristics of gene data. The data were downloaded from GEO and TCGA, and the model achieved good classification accuracy on the training and independent testing datasets (Ahn et al., 2018). A Deep Flexible Neural Forest (DFNForest) model was suggested to solve many binary classification problems and to extend the flexible neural tree (FNT), increasing the depth of the tree without adding additional parameters. The Fisher ratio and rough set methods were employed to reduce the dimensionality of the gene data. The model achieved better results for the classification of cancer subtypes such as breast, glioblastoma and lung (Xu et al., 2019).

A preprocessing technique was applied to handle noisy and outlier data before training the model. The proposed model was called "improved DNN," and it included Dropout in the DNN to avoid overfitting. The model achieved better classification accuracy compared to DNN, CNN and RNN (Ahmed and Brifcani, 2019). A Deep Neural Network (DNN) model was proposed for feature selection in gene data and for getting better classification results. Prior to classification, gene ranking is done to achieve high classification performance and to improve the prediction accuracy. This model achieved a better accuracy value compared to MLP, Naive Bayes and decision trees (Chandrasekar et al., 2020). A Grouping Genetic Algorithm (GGA) along with an Extreme Learning Machine (ELM) was proposed to solve the grouping problem for unbalanced gene expression datasets and to classify the data. It achieved good generalization performance with extreme speed for classification applications in multi-label datasets, dimensionality reduction problems and regression applications. The proposed model achieved better accuracy in classifying the 49 gene samples (García-Díaz et al., 2020).

Genome Deep Learning (GDL) was proposed to learn the relationship between traits and genome variations for cancer identification in deep neural networks. To train this model, healthy samples were downloaded from the International Genome Sample Resource (IGSR) and tumor samples were downloaded from the TCGA repository. This model achieved better classification accuracy, sensitivity and specificity values. To predict with better accuracy, more factors need to be considered alongside genome variations, such as age, sex and proteome data (Sun et al., 2019). A modified generator GAN (MG-GAN) model was proposed to generate synthetic data using multivariate noise with a Gaussian distribution. The proposed model uses latent space to specify the permissible region. DNN was used to improve classification accuracy by optimizing the loss function (Chaudhari et al., 2020). An improved KNN was proposed to generate synthetic samples based on each target end point. In the proposed model, the Euclidean distance, a counting quotient filter and the mean value of the k nearest neighbors were calculated. DNN was used for the classification of sample data, and it achieved a better recall value and also ensured the sensitivity (Chaudhari et al., 2021). Conditional GAN was proposed to give sharp and accurate predictions for gene inference in the target profiles. The proposed model combined ℓ1-norm loss and adversarial loss functions to produce better regression results, and Mean Absolute Error and Concordance Correlation values were computed for the GEO and GTEx datasets (Wang et al., 2018).

The Wasserstein generative adversarial network (WGAN) was a deep learning approach used for cancer diagnosis on imbalanced datasets. This model was a progress indicator and it was extremely useful for the uniqueness of data. The proposed model predicted the gene data for stomach, breast and lung tissues (Xiao et al., 2021). A multilayer FFN was proposed to learn the existing non-linearity between the input space and the output space. For feature selection of RNA Sequence data, the


Table 1
Feed Forward Neural Network (FFN) based gene expression data analysis methods.

| Paper Ref. | Dataset | Method | Technique | Performance Parameter: Value Obtained |
|---|---|---|---|---|
| Singh et al. (2019) | Benchmark Dataset: Leukemia and colon cancer | FFN - Multistage Ensemble | Classification | Accuracy (Leukemia): 97.1%; Accuracy (Colon): 90.3% |
| Gao et al. (2019) | MsigDB, The Cancer Genome Atlas (TCGA): Colon and Breast Cancer | Deep CC - Deep Cancer Subtype Classification | Classification | Accuracy: >90%; Sensitivity: >95%; Specificity: >95% |
| Lim et al. (2019) | The Cancer Genome Atlas (TCGA): Breast Cancer Data | FFN | Prediction | Learning Rate: 0.9; Sensitivity: 0.7770; Specificity: 0.5229; Youden's J index: 0.3 |
| Ahn et al. (2018) | 1. GEO: Gene Expression Omnibus; 2. TCGA: The Cancer Genome Atlas; 3. TARGET: Therapeutically Applicable Research To Generate Effective Treatments; 4. GTEx: Genotype-Tissue Expression | FFN | Classification | Training Dataset: 0.997; Independent Test data: 0.979 |
| Xu et al. (2019) | The Cancer Genome Atlas (TCGA) | DFNForest - Deep flexible neural forest | Classification | Accuracy (BRCA): 0.936; Accuracy (GBM): 0.842; Accuracy (Lung): 0.880 |
| Ahmed and Brifcani (2019) | 1. Kent Ridge Bio-medical Dataset: Leukemia and colon; 2. Benchmark Dataset: DLBCL and Prostate | FFN with Dropout | Classification | Accuracy (DLBCL): 98.4%; Accuracy (Prostate): 93.2%; Accuracy (Leukemia): 99%; Accuracy (Colon): 91.4% |
| Chandrasekar et al. (2020) | NCBI Repository | FFN | Classification | Accuracy: 72.5% |
| García-Díaz et al. (2020) | NCBI Repository | FFN - GGA with ELM | Classification | Accuracy: 98.81% |
| Sun et al. (2019) | 1. IGSR - The International Genome Sample Resource; 2. TCGA - The Cancer Genome Atlas | FFN - GDL | Prediction | Accuracy: >97%; Sensitivity: >98%; Specificity: >97% |
| Chaudhari et al. (2020) | NCBI repository | MG-GAN: Modified Generator GAN | Classification | Accuracy (Lung): 88.1%; Precision (Lung): 92.1%; Accuracy (Prostate): 93.6%; Precision (Prostate): 91%; Accuracy (Leukemia): 88.4%; Precision (Leukemia): 90.47%; Accuracy (Breast): 90.3%; Precision (Breast): 89%; Accuracy (Colon): 91.7%; Precision (Colon): 91.5% |
| Chaudhari et al. (2021) | NCBI repository | FNN & Improved KNN | Classification | Accuracy (Lung): 76.2%; Recall (Lung): 72%; Accuracy (Prostate): 81.4%; Recall (Prostate): 74.8%; Accuracy (Leukemia): 74.9%; Recall (Leukemia): 68%; Accuracy (Breast): 77.6%; Recall (Breast): 72%; Accuracy (Colon): 76.8%; Recall (Colon): 73% |
| Wang et al. (2018) | 1. GEO; 2. GTEx | FNN - Conditional Generative Adversarial Network | Prediction | Mean Absolute Error (MAE), GEO data: 0.2897±0.0890; Concordance correlation (CC), GEO data: 0.8785±0.0894; Mean Absolute Error (MAE), GTEx data: 0.4215±0.1264; Concordance correlation (CC), GTEx data: 0.7475±0.2070 |
| Xiao et al. (2021) | The Cancer Genome Atlas (TCGA): BRCA, STAD and LUAD | FNN - Wasserstein generative adversarial network (WGAN) | Prediction | BRCA: Accuracy 98.33%, Precision 100%, Recall 96.67%, F1 98.31%, AUC 0.98; STAD: Accuracy 96.67%, Precision 100%, Recall 93.33%, F1 96.55%, AUC 0.97; LUAD: Accuracy 96.67%, Precision 100%, Recall 93.33%, F1 95.55%, AUC 0.96 |
| Urda et al. (2017) | The Cancer Genome Atlas (TCGA) - BRCA, COAD and KIPAN | LASSO, DeepNeti, DeepNetii | Prediction | AUC (BRCA): LASSO 0.65, DeepNeti 0.62, DeepNetii 0.65; AUC (COAD): LASSO 0.57, DeepNeti 0.58, DeepNetii 0.57; AUC (KIPAN): LASSO 0.77, DeepNeti 0.72, DeepNetii 0.75 |
| Park et al. (2021) | The Cancer Genome Atlas (TCGA) - BRCA, LAML, LIHC, LUAD, PAAD, STAD and LGG | FFN - GAN | Prediction | AUC: 4% improvement; AUC (pancreatic adenocarcinoma): 27.9% improvement |
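As a reading aid for the metrics reported in Table 1, the sketch below computes sensitivity, specificity, accuracy and Youden's J index (J = sensitivity + specificity - 1) from confusion-matrix counts. The counts passed to the hypothetical helper `binary_metrics` are made up for illustration. Only the last line uses values from the table (the Lim et al., 2019 row), where it suggests that the reported sensitivity and specificity are consistent with the reported J of about 0.3.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    youden_j = sensitivity + specificity - 1.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden_j": youden_j,
    }

# Hypothetical counts, not taken from any surveyed paper.
m = binary_metrics(tp=40, fn=10, tn=45, fp=5)

# Values from the Lim et al. (2019) row of Table 1.
j_lim = 0.7770 + 0.5229 - 1.0
```

For the hypothetical counts, sensitivity is 0.8, specificity is 0.9 and J is 0.7; for the Table 1 values, `j_lim` is 0.2999, which rounds to 0.3.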

LASSO method was used to adjust the gene expression data. Two dimensionality reduction techniques, DeepNeti and DeepNetii, were used to retrieve the most important genes (Urda et al., 2017). A deep learning-based method with three steps was proposed. First, the GAN method was used to learn the prognostic specific networks for omics data. Second, scoring was done for each patient data set in the network. Third, a deep feed-forward neural network was used to predict the patient's prognosis (Park et al., 2021). The different methods used for gene expression data analysis using FFN are given in Table 1.

3. Convolutional neural network based gene expression data analysis methods

Several CNN models were proposed to classify unstructured gene expression and to distinguish between tumor and normal samples. 1-Dimensional-CNN was used to process the vector input, 2-Dimensional-Vanilla-CNN was used to extract the local features from 2-Dimensional images and 2-Dimensional-Hybrid-CNN processes the 2D input with a 1D kernel. These proposed models achieved an improvement in prediction accuracy on The Cancer Genome Atlas (TCGA) across 33 tumor classes and one normal class. The 1D-CNN produced the best accuracy value among five subtypes in predicting the biomarkers of breast cancer (Mostavi et al., 2020). A lightweight CNN model was proposed to classify gene datasets for identifying breast cancer. The data was downloaded from the Pan Cancer Atlas and then preprocessed to remove the outlier samples and converted into 2D images. A normalization process was applied to ensure the correct gene expression level and avoid biases. The proposed model achieved better classification accuracy (Elbashir et al., 2019). The DeepGx method was proposed to solve the classification problem of cancer data in the gene expression dataset. The proposed method constructs two-dimensional images from the one-dimensional RNA-Sequence information. The CNN model was used for classification to find each gene for each type and it improved the classification accuracy (Joseph et al., 2019).

The Deep Learning Method was used for gene expression level prediction in molecular biology. The stated drawback is that the current structure has to be eliminated in order to analyze the evolutionary dependencies. The proposed model has two approaches. The first approach was gene family guided splitting, which uses different gene families for training and testing, and the second approach was ortholog contrasts, which compares evolutionarily related (orthologous) genes during the training process (Washburn et al., 2019). A multi-layer genome perceptron model was developed to classify and group similar annotations in genomics data. This model was used to predict breast cancer from variations in gene expression level between normal and tumor samples, with an improvement in Estrogen Receptor (ER) status prediction (Jha et al., 2018).

The CNN Model was proposed to identify the four subtypes of leukemia from microscopic images. The data were downloaded from the ALL-IDB and ASH Image Bank, and data augmentation was then applied to generate synthetic data for training. The CNN model achieved high accuracy for normal and leukemia samples compared to machine learning algorithms and also achieved better accuracy for subtypes in multiclass classification (Ahmed et al., 2019). The CNN Model was proposed to train on bone marrow images and to diagnose the Acute Lymphoblastic Leukemia (ALL) subtypes. The proposed model achieved better classification accuracy compared to other machine learning classifiers like SVM, KNN, and Naive Bayes (Rehman et al., 2018). In convolution methods, the drawback is that they fail to differentiate the various tumor types. The CNN model was proposed to train on 2-dimensional images built from the high-dimensional RNA sequence data and to identify the specific tumor types. The data was downloaded from The Pan-Cancer Atlas, which has abundant information on different tumor types for classification. The model achieved a better accuracy rate (Lyu and Haque, 2018).

A hierarchical graph convolution network was proposed to construct the gene interaction graph and the similarity graph. The two graphs are used to gather the neighborhood information of gene data. The proposed model learns a better gene representation and performs the downstream task with better performance. It achieved better classification performance when the training percentage was 70 (Tan et al., 2021). To classify various tumor types on RNA-seq data, a novel CNN model was proposed, along with binary particle swarm optimization with decision tree (BPSO-DT). It consists of three phases: data preprocessing, data augmentation and data classification. The data preprocessing phase involves using BPSO-DT to select the best features of the RNA sequence and convert them into 2-D images. The data augmentation phase consists of generating synthetic data to improve classification accuracy. The DeepCNN architecture phase extracts the features from the images. This model achieved better classification accuracy for the BRCA, KIRC, LUAD, LUSC and UCEC tumor types (Khalifa et al., 2020).

For data augmentation, Self-attention Progressive Growing of GANs (SPGGANs) was proposed to generate skin lesion images based on aggregated information in compact feature locations. The CNN model was used for classification of the images to detect melanoma, and the results achieved an improvement in sensitivity (Abdelhalim et al., 2021). The DeepCNN architecture was proposed for the classification of multi-grade brain tumors. In this model, data augmentation techniques are first applied to generate additional data and avoid insufficient data problems. Then the CNN model is trained on non-augmented and augmented data. The augmented data shows better classification accuracy results for the Radiopaedia dataset and the brain tumor dataset (Sajjad et al., 2019).

The Gene Regulatory Interaction Prediction (Grip) along with the Deep Learning (DL) method was proposed to predict the gene regulatory networks (GRNs) for Drosophila embryonic gene expression images. It achieved an improvement in accuracy and F1 score compared with stability-driven nonnegative matrix factorization (staNMF) (Yang et al., 2019a). The CNN model was proposed for microarray gene data to classify seven different cancer datasets. This model achieved better classification accuracy for brain and breast cancer data (Zeebaree


Table 2
Convolutional Neural Network (CNN) based gene expression data analysis methods.

| Paper Ref. | Dataset | Method | Technique | Performance Parameter: Value Obtained |
|---|---|---|---|---|
| Mostavi et al. (2020) | The Cancer Genome Atlas (TCGA) | 1D-CNN, 2D Vanilla-CNN, 2D Hybrid-CNN | Prediction | Accuracy: 93.9–95.0%; Accuracy (Breast cancer): 88.42% |
| Elbashir et al. (2019) | The Cancer Genome Atlas (TCGA): Breast Cancer | Light Weight CNN | Classification | Accuracy: 98.76%; Sensitivity: 91.43%; Specificity: 100%; Precision: 100%; F-measure: 0.955 |
| Joseph et al. (2019) | 1. TCGA; 2. GEO; 3. GTEx | DeepGx | Classification | Accuracy: 95.65% |
| Washburn et al. (2019) | NCBI Repository | CNN | Prediction | ROC curve: range from 0.75 to 0.94 |
| Jha et al. (2018) | The Cancer Genome Atlas (TCGA): Breast Cancer | GCNN | Prediction | GSE20685: AUC 91.7%, P value 0.0231; GSE25055: AUC 94.5%, P value 0.44; GSE22219: AUC 94.5%, P value 0.001; GSE12276: AUC 91%, P value 0.011; GSE7390: AUC 92.4%, P value 0.05; GSE24450: AUC 92.4%, P value 0.029 |
| Ahmed et al. (2019) | ALL-IDB and ASH Image Bank: Leukemia data | CNN | Classification | Accuracy: 88.25%; Accuracy (Multi Class): 81.75% |
| Rehman et al. (2018) | Amreek Clinical Laboratory: Leukemia data | CNN | Classification | Accuracy: 97.78% |
| Lyu and Haque (2018) | The Cancer Genome Atlas (TCGA) | CNN | Classification | Accuracy: 95.59%; Precision: 95.54%; Recall: 95.59%; F-Score: 95.43% |
| Tan et al. (2021) | UCSC Xena Repository | CNN | Classification | Accuracy (Training Percentage 70): 92.6% |
| Khalifa et al. (2020) | The Cancer Genome Atlas (TCGA): BRCA, KIRC, LUAD, LUSC and UCEC | CNN | Classification | Accuracy (BRCA): 98.3%; Accuracy (KIRC): 98.2%; Accuracy (LUAD): 84.8%; Accuracy (LUSC): 97.7%; Accuracy (UCEC): 96.4% |
| Abdelhalim et al. (2021) | HAM10000 - annotated skin lesion images | GAN, CNN (DL) | Classification | Sensitivity Improvement: 2.5%; Sensitivity Improvement (for melanoma class): 8.6% |
| Sajjad et al. (2019) | Radiopaedia dataset: 121 MR images | CNN | Classification | Accuracy (for Radiopaedia dataset): 90.67%; Sensitivity (for brain tumor dataset): 88.41%; Specificity (for brain tumor dataset): 96.12% |
| Yang et al. (2019a) | Benchmark dataset: ISH images | GripDL | Prediction | Accuracy and F1 Score: 14% improvement |
| Zeebaree et al. (2018) | Benchmark dataset: Brain, Breast, Colon, Leukemia, stomach, prostate and lung datasets | CNN | Classification | Accuracy (Brain): 97.62%; Accuracy (Breast): 97.69%; Accuracy (Colon): 64.52%; Accuracy (Leukemia): 100%; Accuracy (Lung): 72.09% |
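Several rows of Table 2 report an F-measure next to precision and recall (recall being the same quantity as sensitivity). As a small worked check, and assuming the F-measure here is the usual harmonic mean F1 = 2PR / (P + R), the sketch below recomputes the value for the Elbashir et al. (2019) row. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Elbashir et al. (2019) row of Table 2: precision 100%, sensitivity 91.43%.
f_elbashir = f_measure(1.00, 0.9143)
```

The result is about 0.955, matching the F-measure listed in the table for that row. Rows that do not reproduce exactly this way may use a different averaging scheme (for example over classes), which the table does not state.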

et al., 2018). The different methods used for gene expression data analysis using CNN are given in Table 2.

4. Autoencoder based gene expression data analysis methods

Multilayer Perceptron (MLP) with Stacked Denoising Autoencoder (SAE) was a regression-based prediction model to measure gene expression profiles and genotypes. SAE was used for feature selection with SNP genotypes as the input, and the MLP was trained with backpropagation. To avoid over-fitting and to improve performance on high-dimensional data, dropout was used during processing (Xie et al., 2017). The Deep Subspace Fusion model was a clustering technique proposed to


Table 3
Autoencoder (AE) based gene expression data analysis methods.
Paper Ref. Dataset Method Performance Parameter Values Obtained
Technique

Xie et al. Yeast data set Stacked Denoising Autoencoder Prediction Mean Square Error 0.2890
(2017) (MSE)
Correlation value of 0.3082
MSE
Yang et al. The Cancer Genome Deep Subspace Fusion Clustering (DSFC) Prediction Cox Log-Rank P- 7.02e-6 (No. of
(2019b) Atlas (TCGA): GBM,BIC, KRCCC, LSCC, COAD value (GBM) Cluster = 3)
and OV Cox Log-Rank P- 8.95e-4(No. of
value (BIC) Cluster = 4)
Cox Log-Rank P- 1.46e-3(No. of
value (KRCCC) Cluster = 3)
Cox Log-Rank P- 8.12e-4(No. of
value (LSCC) Cluster = 3)
Cox Log-Rank P- 7.95e-6(No. of
value (COAD) Cluster = 2)
Cox Log-Rank P- 4.03e-5 (No. of
value (OV) Cluster = 3)
Wu et al. SEQC neuroblastoma dataset – GEO database HetEnc Prediction Cross validation in 0.854
(2019) RNA Seq
Cross validation in 0.825
Microarray
Danaee et al. The Cancer Genome Stacked Denoising Autoencoder Classification Accuracy(ANN) 96.95%
(2017) Atlas (TCGA): Breast samples Sensitivity(ANN) 98.73%
Specificity(ANN) 95.29%
Precision(ANN) 95.42%
F-measure(ANN) 0.970
Accuracy(SVM) 98.04%
Sensitivity(SVM) 97.21%
Specificity(SVM) 99.11%
Precision(SVM) 99.17%
F-measure(SVM) 0.981
Accuracy (SVM- 98.26%
RBF)
Sensitivity(SVM- 97.61%
RBF)
Specificity(SVM- 99.11%
RBF)
Precision(SVM- 99.17%
RBF)
F-measure(SVM- 0.983
RBF)
Table 3 (continued)

| Paper Ref. | Dataset | Method | Technique | Performance Parameter | Values Obtained |
|---|---|---|---|---|---|
| Guan et al. (2018) | KEGG PATHWAY and PubMed Central | Stacked Denoising Autoencoder Multi-Label Learning (SdaMLL) | Classification | Coverage precision | 0.577 |
| | | | | Ranking loss | 0.286 |
| | | | | Coverage | 2.436 |
| Jiang et al. (2021) | Blood pressure data from the MIMIC-II database | Autoencoder | Prediction | Accuracy | 88.14% |
| Xie et al. (2016) | Yeast data set | Stacked Denoising Auto-encoder | Prediction | Average MSE | 0.3082 |
| Liu et al. (2017) | IBC dataset: Colon, Leukemia and Breast | Sample Expansion Method with Autoencoder and 1DCNN | Classification | Accuracy (Breast), SESAE | 87.33% |
| | | | | Accuracy (Leukemia), SESAE | 49.79% |
| | | | | Accuracy (Colon), SESAE | 84.42% |
| | | | | Accuracy (Breast), SE1DCNN | 95.33% |
| | | | | Accuracy (Leukemia), SE1DCNN | 57.87% |
| | | | | Accuracy (Colon), SE1DCNN | 84.90% |
| Karim et al. (2020) | The Cancer Genome Atlas (TCGA) | Convolutional LSTM, Convolutional autoencoder and Model averaging ensemble (MAE) | Prediction | Accuracy (CLSTM) | 74.25% |
| | | | | Accuracy (CAE) | 78.32% |
| | | | | MAE | 2.5% improvement |
| Pirmoradi et al. (2020) | NCBI Repository (GEO database): GSE67047, GSE13117, GSE16619, GSE34678 and GSE9222 | Self-organizing auto-encoder (SOAE) | Classification | Accuracy (Thyroid Cancer) | 100% |
| | | | | F-measure (Thyroid Cancer) | 100% |
| | | | | Specificity (Thyroid Cancer) | 100% |
| | | | | Sensitivity (Thyroid Cancer) | 100% |
| | | | | Accuracy (Mental Retardation) | 94.4% |
| | | | | F-measure (Mental Retardation) | 93.1% |
| | | | | Specificity (Mental Retardation) | 93.2% |
| | | | | Sensitivity (Mental Retardation) | 96.4% |
| | | | | Accuracy (Breast Cancer) | 100% |
| | | | | F-measure (Breast Cancer) | 100% |
| | | | | Specificity (Breast Cancer) | 100% |
| | | | | Sensitivity (Breast Cancer) | 100% |
| | | | | Accuracy (Colorectal Cancer) | 96% |
| | | | | F-measure (Colorectal Cancer) | 96.3% |
| | | | | Specificity (Colorectal Cancer) | 91.7% |
| | | | | Sensitivity (Colorectal Cancer) | 100% |
| | | | | Accuracy (Autism) | 99.1% |
| | | | | F-measure (Autism) | 99.2% |
| | | | | Specificity (Autism) | 100% |
| | | | | Sensitivity (Autism) | 98.5% |
| Zhang et al. (2020) | NCBI Repository (GEO database): GSE2990, GSE3494, GSE9195, GSE17705 and GSE17907 | Linear Discriminant Analysis (LDA) and Autoencoder (AE) | Prediction | Accuracy | 98.27% |
| Seal et al. (2020) | The Cancer Genome Atlas (TCGA): Liver data | Deep Denoising Autoencoder (DDAE) | Classification | Accuracy | 95.1% |
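
The accuracy, F-measure, specificity and sensitivity values reported in Table 3 are all functions of the four counts of a binary confusion matrix. As a reminder of how these quantities relate to one another, the following is a minimal Python sketch based on the standard textbook definitions. The function name `binary_metrics` and the example counts are illustrative assumptions; they are not taken from any of the surveyed papers, which may compute or average their metrics differently (for example, per class or per fold).

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive (e.g. tumor) samples predicted correctly/incorrectly.
    tn/fp: negative (e.g. normal) samples predicted correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Sensitivity (recall): fraction of true positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that were detected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Precision: fraction of positive predictions that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F-measure: harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only for illustration.
metrics = binary_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy pools both classes, it can stay high on imbalanced gene expression datasets even when sensitivity or specificity is poor, which is one reason the surveyed papers report several of these metrics side by side.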

differentiate the gene expression similarity between patient samples to achieve effective results. It adopts a DNN to expose each type of feature. Similarity Network Fusion (SNF) was applied to measure the similarity between patients using Euclidean distances, and spectral clustering was performed to obtain the final result (Yang et al., 2019b). The HetEnc model was proposed to handle platform-independent gene datasets for both supervised and unsupervised networks. For extracting the features, they used three encoding networks: Autoencoder (AE), CombNET and CrossNET. A six-layer DNN was used for predicting the output of each targeted endpoint (Wu et al., 2019).

A stacked denoising autoencoder was proposed to obtain meaningful information from gene data in the TCGA dataset. ANN, SVM, and SVM–RBF supervised classification models were used to evaluate the value of the extracted features for breast cancer cell detection (Danaee et al., 2017). A Stacked Denoising Autoencoder (SDAE) with Multi-Label Learning (MLL) architecture was proposed for gene pathway prediction. The SDAE was used to remove noise from the original vector in the feature matrix. The output of the stacked denoising autoencoder was then fed into BP-MLL for pathway prediction. The result achieved a good coverage precision value compared to traditional multi-label algorithms (Guan et al., 2018). A hybrid deep learning model was proposed for acute hypotensive episode (AHE) detection. For processing the AHE signal, a single decomposition method was used, and for feature extraction, an autoencoder was used. The result achieved good classification accuracy for patient samples (Jiang et al., 2021). A Multi-Layer Perceptron (MLP) with a Stacked Denoising Autoencoder (SAE) was proposed to retrieve the meaningful features and predict the gene value of an SNP genotype. To improve the model, backpropagation was applied to the multilayer perceptron, and dropout was added to prevent overfitting. The proposed model outperformed both the model without dropout and traditional methods such as Random Forest and Lasso (Xie et al., 2016).

For gene expression data, the Sample Expansion method was proposed for effective classification of tumor samples. To generate the synthetic samples, a denoising autoencoder was used to corrupt the input data. For the classification of data samples, a 1D-CNN was used. These models achieved better accuracy values (Liu et al., 2017). Before cancer occurs, copy number variations (CNVs) are utilized to define genetic features. An ensemble method was proposed to predict the cancer types. Convolutional LSTM and convolutional autoencoder models were proposed for two different sparse representations that were developed to train and capture important features. A snapshot ensemble method was used to make a single prediction with better accuracy (Karim et al., 2020).

To analyze Single Nucleotide Polymorphism (SNP) genomics data and classify the patient samples effectively, a deep autoencoder model was proposed. For preprocessing the data, the mean encoding method was used to convert the SNP data into numeric values, and a filter method was applied to remove redundant and irrelevant features. The proposed model, a Self-organizing auto-encoder (SOAE), was used to classify the samples and achieved better classification accuracy for different disease datasets (Pirmoradi et al., 2020). To classify the different features in the gene expression profile with a low-intensity ratio, the authors proposed an Autoencoder (AE) along with a linear discriminant analysis approach. This proposed model was used on different independent breast cancer datasets to achieve better prediction accuracy (Zhang et al., 2020). The Deep Denoising Auto-encoder (DDAE) was a predictive model used to handle omics data from various platforms, such as epigenetic and genetic data. The DDAE was applied to train on the gene data and extract the important features. A Multi-layer Perceptron (MLP) was used for backpropagation. These models achieved better classification accuracy for gene expression datasets (Seal et al., 2020). The different methods used for gene expression data analysis using AE are given in Table 3.

5. Recurrent neural network based gene expression data analysis methods

Recurrent network and convolutional network architectures were


Table 4
Recurrent Neural Network (RNN) based gene expression data analysis methods.

| Paper Ref. | Dataset | Method | Technique | Performance Parameter | Values Obtained |
|---|---|---|---|---|---|
| Zheng et al. (2020) | miRBase and NCBI Repository | RNN | Prediction | Accuracy | 90% |
| | | | | Specificity | 90% |
| Immaculate Mercy and Chidambaram (2020) | NCBI Repository | LSTM and KNN | Prediction | Accuracy | 0.945 |
| | | | | Running Time | 0.14 |
| Kishan et al. (2019) | BioGRID repository | Gene Network Embedding (GNE), one-hot representation | Prediction | AUROC (Yeast) | 0.710 |
| | | | | AUPR (Yeast) | 0.983 |
| | | | | AUROC (E. coli) | 0.653 |
| | | | | AUPR (E. coli) | 0.658 |
| Bychkov et al. (2018) | The Hospital District of Helsinki and Uusimaa | CNN, LSTM | Prediction | Hazard ratio | 2.3 |
| | | | | AUC | 0.69 |
| Zhao et al. (2019) | Benchmark Dataset (scRNA): 1. Data1: primitive endoderm (PrE) cells; 2. Data2: mouse embryonic fibroblast (MEF) cells; 3. Data3: definitive endoderm (DE) cells | RNN | Classification | AUROC (Data1) | 0.620 |
| | | | | AUROC (Data2) | 0.587 |
| | | | | AUROC (Data3) | 0.578 |
| Liu and Liu (2020) | Benchmark Dataset: DREAM3 and DREAM4 | MALASSORNN-GRN | Classification | AUROC (node = 10, Density = 20%) | 0.6351 |
| | | | | AUROC (node = 10, Density = 40%) | 0.7188 |
| Majji et al. (2021) | GEMLeR repository: AP_Colon_Kidney data, AP_Breast_Ovary data, AP_Breast_Colon data, AP_Breast_Kidney data | JayaALO-based DeepRNN | Classification | Accuracy | 95.97% |
| | | | | Sensitivity | 95.95% |
| | | | | Specificity | 96.96% |
| Chowdhury et al. (2019) | Benchmark Dataset: Colon and Leukemia | GRU, LSTM, RNN and bidirectional LSTM | Classification | F1 score | 11% and 15% (DT and NB Classifiers) |
| Şahín and Diri (2019) | Gene expression profile (GEP) datasets: COLON, LUNG, PROSTATE, SRBCT, LYMPHOMA and LEUKEMIA | LSTM based AIRS | Classification | Average Accuracy | 92.8% |
| Suresh et al. (2021) | NCBI Repository | LSTM | Classification | Accuracy | 86.36% |
| Aher and Jena (2021) | 1. UCI Health Repository: Lung cancer; 2. Computer Age Statistical Inference: Leukemia cancer; 3. Bioinformatics Lab: SBRCT cancer | RCO-RNN | Classification | Accuracy (Leukemia) | 94.5% |
| | | | | Sensitivity (Leukemia) | 94.5% |
| | | | | Specificity (Leukemia) | 94.7% |
| | | | | PPV (Leukemia) | 94.5% |
| | | | | NPV (Leukemia) | 94.5% |
| | | | | Accuracy (SBRCT) | 94.0% |
| | | | | Sensitivity (SBRCT) | 94.0% |
| | | | | Specificity (SBRCT) | 94.0% |
| | | | | PPV (SBRCT) | 94.0% |
| | | | | NPV (SBRCT) | 94.0% |
| | | | | Accuracy (Lung) | 95.0% |
| | | | | Sensitivity (Lung) | 95.0% |
| | | | | Specificity (Lung) | 95.0% |
| | | | | PPV (Lung) | 95.0% |
| | | | | NPV (Lung) | 95.0% |
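
Several rows of Table 4 report the area under the ROC curve (AUROC) rather than accuracy. AUROC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so a value near 0.5 corresponds to chance-level ranking. The short Python sketch below computes it directly from that pairwise definition. The function name `auroc` and the toy labels and scores are assumptions made for illustration only; the surveyed papers do not state which implementation they used.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels.
    scores: iterable of predicted scores (higher means more likely positive).
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three positive and three negative samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(f"AUROC = {auroc(labels, scores):.3f}")
```

The pairwise loop costs time proportional to the number of positive samples times the number of negative samples, which is acceptable for the small sample sizes typical of gene expression benchmarks; a rank-based formulation would be preferable for large datasets.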

applied to automatically learn characteristics from RNA-seq and detect human miRNA. The RNN model consists of three LSTM layers to remember or ignore past information. The model achieved good prediction accuracy and sensitivity values for the human pre-miRNA dataset (Zheng et al., 2020). Long short-term memory (LSTM), an RNN approach, was proposed for the classification of gene expression data. LSTM was combined with K-Nearest Neighbour (KNN) to achieve better classification performance. For feature selection, principal component analysis and the chi-square test were used. The proposed model showed better accuracy and also reduced the running time compared to existing traditional methods (Immaculate Mercy and Chidambaram, 2020). The Gene Network Embedding (GNE) framework was used to combine gene expression data with the gene interaction network. The proposed model was used to learn the embedded representation and to predict gene interactions from heterogeneous datasets. According to the findings, deep embeddings yielded substantially more accurate predictions (Kishan et al., 2019).

An ensemble method was used for classifying and predicting tumor and normal samples. CNN and RNN architectures were used. The appropriate features from the input vector were extracted using a convolutional network. LSTM was used to predict the risk score for the patient (Bychkov et al., 2018). An RNN model was proposed to train on the gene data and to identify the transcriptional target factor with better accuracy. The gene regulatory network construction problem was transformed into binary classification, and the AUROC was measured for three different datasets (Zhao et al., 2019). Optimizing gene regulatory networks (GRNs) is a challenging problem because the relationships between genes are sparse in the dataset. To overcome this problem, the MALASSORNN-GRN model was proposed. The memetic algorithm was applied to learn the parameters of the recurrent network and LASSO was used for reconstructing the GRNs (Liu and Liu, 2020).

The Jaya Ant Lion optimization model was proposed for gene expression data preprocessing. First, data normalization was used to remove the redundant data. Second, data transformation was used to generate the patterns, and third, factorization was used for feature reduction. Finally, a recurrent neural network model was employed for the classification of the cancer (Majji et al., 2021). Several RNN frameworks were proposed for feature selection of micro array gene


Fig. 5. Accuracy value obtained from different Deep Learning Methods: a: Accuracy value obtained from Feed forward Neural Network (FFN) b: Accuracy value obtained from Convolutional neural network (CNN) c: Accuracy value obtained from Autoencoder (AE) d: Accuracy value obtained from Recurrent Neural Network (RNN).

Fig. 6. Accuracy value obtained from different types of cancer: a: Accuracy value obtained from Breast cancer b: Accuracy value obtained from Lung cancer c: Accuracy value obtained from Leukemia cancer d: Accuracy value obtained from Colon cancer.

expression datasets. These frameworks were RNN, LSTM, bidirectional LSTM and GRU, and they were mostly applied to choose the most important characteristics. Traditional machine learning algorithms were used for the classification tasks. The results achieved an improvement in F1-score (Chowdhury et al., 2019).

The Long Short Term Memory (LSTM) with the artificial immune recognition system (AIRS) algorithm was used to effectively find the most significant features in the gene expression data. AIRS was used to remember the long sequences and LSTM was used to map the sequential features. This algorithm achieved better accuracy for different microarray gene expression datasets (Şahín and Diri, 2019). Genomic sequencing was primarily used in cancer treatment to identify infected individuals who had risk factors and complaints in their human cells. An LSTM with an optimization approach was proposed for predicting cancer disease using genome sequencing. For optimization, an Adaptive Bat Sonar Algorithm was used. This approach achieved a higher accuracy when compared to hybrid classifiers (Suresh et al., 2021). A Rider Chicken Optimization with Recurrent Neural Network (RCO-RNN) algorithm was proposed to identify the tumor using gene data. For selecting the gene features, entropy was used to reduce the dimensionality. The proposed algorithm was used to train on the gene features and classify the data. The results derived from the confusion matrix showed better performance (Aher and Jena, 2021). The different methods used for gene expression data analysis using RNN are given in Table 4.

6. Discussion

In this review paper, we have analyzed the existing research works which utilized deep learning methods for cancer diagnosis using gene expression datasets. In the existing literature, the majority of gene data is downloaded from the TCGA, GEO and NCBI repositories. In many of the existing papers, the accuracy (Acc), sensitivity (Sen), specificity (Spe), precision (Prec), and F1 score values are calculated. Fig. 5 shows the accuracy values obtained from different deep learning methods. Fig. 5a shows the accuracy values obtained from the feed-forward neural network. The maximum accuracy of 99.7% was achieved using FFN. Fig. 5b shows the accuracy values obtained from the convolutional neural network. The maximum accuracy of 98.76% was achieved using CNN. Fig. 5c shows the accuracy values obtained from the autoencoder. The maximum accuracy of 98.27% was achieved using AE. Fig. 5d shows the accuracy values obtained from the recurrent neural network. The maximum accuracy of 95.97% was achieved using RNN.

Fig. 6 shows the accuracy values obtained from different types of cancer datasets using deep learning methods. Fig. 6a shows the accuracy values obtained from the breast cancer dataset. The maximum accuracy achieved was 100%. Fig. 6b shows the accuracy values obtained from the lung cancer dataset. The maximum accuracy achieved was 97.7%. Fig. 6c shows the accuracy values obtained from the leukemia cancer dataset. The maximum accuracy achieved was 100%. Fig. 6d shows the accuracy values obtained from the colon cancer dataset. The maximum accuracy achieved was 91.7%.

Fig. 7 shows the average value obtained for Sensitivity, Specificity, Precision, and F-Measure using deep learning methods. Fig. 7a shows the sensitivity values obtained and the maximum sensitivity achieved was 98.98%. Fig. 7b shows the specificity values obtained and the maximum specificity achieved was 100%. Fig. 7c shows the precision values obtained and the maximum precision achieved was 100%. Fig. 7d shows the F-measure values obtained and the maximum F-measure achieved was 98.98%.

Fig. 7. Sensitivity, Specificity, Precision, F-Measure average values retrieved from existing deep learning methods: a: Sensitivity value retrieved from Deep Learning Methods b: Specificity value retrieved from Deep Learning Methods c: Precision value retrieved from Deep Learning Methods d: F-Measure value retrieved from Deep Learning Methods.

7. Conclusion

Microarray or RNA sequence gene expression data is especially used for cancer disease diagnosis and prediction. The curse of dimensionality problem occurs in gene expression datasets when classifying and predicting cancer samples and their types. Data augmentation techniques are applied in most of the papers to solve the dimensionality problem for gene expression datasets. Data preprocessing was applied in most of the papers to remove redundant and noisy data. For extracting the relevant features from the gene expression dataset, feature selection and feature extraction methods were used. Deep learning algorithms were used to classify and predict cancer disease and its types, and the results were compared with traditional algorithms. This survey paper has mainly focused on analyzing gene expression datasets with feed-forward neural network, convolutional neural network, autoencoder and recurrent neural network based deep learning methods.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence

the work reported in this paper.

References

Abdelhalim, I.S.A., Mohamed, M.F., Mahdy, Y.B., 2021. Data augmentation for skin lesion using self-attention based progressive generative adversarial network. Expert Syst. Appl. 165, 113922.
Aher, C.N., Jena, A.K., 2021. Rider-chicken optimization dependent recurrent neural network for cancer detection and classification using gene expression data. Comput. Methods Biomech. Biomed. Eng.: Imaging & Visualization 9 (2), 174–191.
Ahmed, O., Brifcani, A., 2019, April. Gene expression classification based on deep learning. In: 2019 4th Scientific International Conference Najaf (SICN). IEEE, pp. 145–149.
Ahmed, N., Yigit, A., Isik, Z., Alpkocak, A., 2019. Identification of leukemia subtypes from microscopic images using convolutional neural network. Diagnostics 9 (3), 104.
Ahn, T., Goo, T., Lee, C.H., Kim, S., Han, K., Park, S., Park, T., 2018, December. Deep learning-based identification of cancer or normal tissue using gene expression data. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, pp. 1748–1752.
Bychkov, D., Linder, N., Turkki, R., Nordling, S., Kovanen, P.E., Verrill, C., et al., 2018. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci. Rep. 8 (1), 1–11.
Chandrasekar, V., Sureshkumar, V., Kumar, T.S., Shanmugapriya, S., 2020. Disease prediction based on micro array classification using deep learning techniques. Microprocess. Microsyst. 77, 103189.
Chaudhari, P., Agrawal, H., Kotecha, K., 2020. Data augmentation using MG-GAN for improved cancer classification on gene expression data. Soft Comput. 24 (15), 11381–11391.
Chaudhari, P., Agarwal, H., Bhateja, V., 2021. Data augmentation for cancer classification in oncogenomics: an improved KNN based approach. Evolut. Intelligence 14 (2), 489–498.
Chowdhury, S., Dong, X., Li, X., 2019, December. Recurrent neural network based feature selection for high dimensional and low sample size micro-array data. In: 2019 IEEE International Conference on Big Data (Big Data). IEEE, pp. 4823–4828.
Danaee, P., Ghaeini, R., Hendrix, D.A., 2017. A deep learning approach for cancer detection and relevant gene identification. In: Pacific Symposium on Biocomputing 2017, pp. 219–229.
Díaz-Uriarte, R., Alvarez de Andrés, S., 2006. Gene selection and classification of microarray data using random forest. BMC Bioinf. 7 (1), 1–13.
Elbashir, M.K., Ezz, M., Mohammed, M., Saloum, S.S., 2019. Lightweight convolutional neural network for breast cancer classification using RNA-seq gene expression data. IEEE Access 7, 185338–185348.
Gao, F., Wang, W., Tan, M., Zhu, L., Zhang, Y., Fessler, E., et al., 2019. DeepCC: a novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 8 (9), 1–12.
García-Díaz, P., Sánchez-Berriel, I., Martínez-Rojas, J.A., Diez-Pascual, A.M., 2020. Unsupervised feature selection algorithm for multiclass cancer classification of gene expression RNA-Seq data. Genomics 112 (2), 1916–1925.
Guan, R., Wang, X., Yang, M.Q., Zhang, Y., Zhou, F., Yang, C., Liang, Y., 2018. Multi-label deep learning for gene function annotation in cancer pathways. Sci. Rep. 8 (1), 1–9.
Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn. 46 (1), 389–422.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313 (5786), 504–507.
https://sapac.illumina.com/science/technology/next-generation-sequencing/microarray-rna-seq-comparison.html?langsel=/in/.
https://www.yourgenome.org/facts/what-is-gene-expression.
https://www.youtube.com/watch?v=2c3t3tDEmsU.
https://youtu.be/AASR9rOzhhA.
https://youtu.be/PmZp5VtMwLE.
https://youtu.be/wPz3MPl5jvY.
https://youtu.be/ym5qG-3kJ10.
Immaculate Mercy, A., Chidambaram, M., 2020. Deep learning classifier for gene expression datasets using a hybrid LSTM network. Int. J. Innovative Technol. Explor. Eng. 9 (4), 1081–1089.
Jha, A., Verma, G., Khan, Y., Mehmood, Q., Rebholz-Schuhmann, D., Sahay, R., 2018, December. Deep convolution neural network model to predict relapse in breast cancer. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, pp. 351–358.
Jiang, D., Tu, G., Jin, D., Wu, K., Liu, C., Zheng, L., Zhou, T., 2021. A hybrid intelligent model for acute hypotensive episode prediction with large-scale data. Inf. Sci. 546, 787–802.
Joseph, M., Devaraj, M., Leung, C.K., 2019, August. DeepGx: deep learning using gene expression for cancer classification. In: 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, pp. 913–920.
Karim, M.R., Rahman, A., Jares, J.B., Decker, S., Beyan, O., 2020. A snapshot neural ensemble method for cancer-type prediction based on copy number variations. Neural Comput. Appl. 32 (19), 15281–15299.
Khalifa, N.E.M., Taha, M.H.N., Ali, D.E., Slowik, A., Hassanien, A.E., 2020. Artificial intelligence technique for gene expression by tumor RNA-Seq data: a novel optimized deep learning approach. IEEE Access 8, 22874–22883.
Kishan, K.C., Li, R., Cui, F., Yu, Q., Haake, A.R., 2019. GNE: a deep learning framework for gene network inference by aggregating biological information. BMC Syst. Biol. 13 (2), 1–14.
Kong, Y., Yu, T., 2018. A deep neural network model using random forest to extract feature representation for gene expression data classification. Sci. Rep. 8 (1), 1–9.
Kristensen, V.N., Lingjærde, O.C., Russnes, H.G., Vollan, H.K.M., Frigessi, A., Børresen-Dale, A.L., 2014. Principles and methods of integrative genomic analyses in cancer. Nat. Rev. Cancer 14 (5), 299–313.
Lim, J., Bang, S., Kim, J., Park, C., Cho, J., Kim, S., 2019. Integrative deep learning for identifying differentially expressed (DE) biomarkers. Comput. Math. Methods Med. 2019.
Liu, L., Liu, J., 2020. Reconstructing gene regulatory networks via memetic algorithm and LASSO based on recurrent neural networks. Soft Comput. 24 (6), 4205–4221.
Liu, J., Wang, X., Cheng, Y., Zhang, L., 2017. Tumor gene expression data classification via sample expansion-based deep learning. Oncotarget 8 (65), 109646.
Lyu, B., Haque, A., 2018, August. Deep learning based tumor type classification using gene expression data. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 89–96.
Majji, R., Nalinipriya, G., Vidyadhari, C., Cristin, R., 2021. Jaya Ant lion optimization-driven Deep recurrent neural network for cancer classification using gene expression data. Med. Biol. Eng. Comput. 59 (5), 1005–1021.
Mantione, K.J., Kream, R.M., Kuzelova, H., Ptacek, R., Raboch, J., Samuel, J.M., Stefano, G.B., 2014. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med. Sci. Monitor Basic Res. 20, 138.
Martorell-Marugán, J., Tabik, S., Benhammou, Y., del Val, C., Zwir, I., Herrera, F., Carmona-Sáez, P., 2019. Deep Learning in Omics Data Analysis and Precision Medicine. Exon Publications, pp. 37–53.
Mostavi, M., Chiu, Y.C., Huang, Y., Chen, Y., 2020. Convolutional neural network models for cancer type prediction based on gene expression. BMC Med. Genom. 13 (5), 1–13.
Park, C., Oh, I., Choi, J., Ko, S., Ahn, J., 2021. Improved prediction of cancer outcome using graph-embedded generative adversarial networks. IEEE Access 9, 20076–20088.
Peng, S., Xu, Q., Ling, X.B., Peng, X., Du, W., Chen, L., 2003. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett. 555 (2), 358–362.
Pirmoradi, S., Teshnehlab, M., Zarghami, N., Sharifi, A., 2020. A self-organizing deep auto-encoder approach for classification of complex diseases using snp genomics data. Appl. Soft Comput. 97, 106718.
Rehman, A., Abbas, N., Saba, T., Rahman, S.I.U., Mehmood, Z., Kolivand, H., 2018. Classification of acute lymphoblastic leukemia using deep learning. Microsc. Res. Tech. 81 (11), 1310–1317.
Şahín, C.B., Diri, B., 2019. Robust feature selection with LSTM recurrent neural networks for artificial immune recognition system. IEEE Access 7, 24165–24178.
Sajjad, M., Khan, S., Muhammad, K., Wu, W., Ullah, A., Baik, S.W., 2019. Multi-grade brain tumor classification using deep CNN with extensive data augmentation. J. Computat. Sci. 30, 174–182.
Seal, D.B., Das, V., Goswami, S., De, R.K., 2020. Estimating gene expression from DNA methylation and copy number variation: a deep learning regression model for multi-omics integration. Genomics 112 (4), 2833–2841.
Shi, H., Li, H., Zhang, D., Cheng, C., Cao, X., 2018. An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification. Comput. Network. 132, 81–98.
Shukla, A.K., Singh, P., Vardhan, M., 2018. A hybrid gene selection method for microarray recognition. Biocybern. Biomed. Eng. 38 (4), 975–991.
Simon, R., 2009. Analysis of DNA microarray expression data. Best Pract. Res. Clin. Haematol. 22 (2), 271–282.
Singh, V.P., Kalita, D.J., Tripathi, D., 2019, March. Classifying gene expression data of cancer using multistage ensemble of neural networks. In: Proceedings of 2nd International Conference on Advanced Computing and Software Engineering (ICACSE).
Sun, Y., Zhu, S., Ma, K., Liu, W., Yue, Y., Hu, G., et al., 2019. Identification of 12 cancer types through genome deep learning. Sci. Rep. 9 (1), 1–9.
Sung, H., Ferlay, J., Siegel, R.L., Laversanne, M., Soerjomataram, I., Jemal, A., Bray, F., 2021. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA A Cancer J. Clin. 71 (3), 209–249.
Suresh, A., Nair, R.R., Neeba, E.A., Kumar, S.A., 2021. Recurrent neural network for genome sequencing for personalized cancer treatment in precision healthcare. Neural Process. Lett. 1–10.
Tan, X., Deng, L., Yang, Y., Qu, Q., Wen, L., 2019. Optimized regularized linear discriminant analysis for feature extraction in face recognition. Evolut. Intelligence 12 (1), 73–82.
Tan, K., Huang, W., Liu, X., Hu, J., Dong, S., 2021. A hierarchical graph convolution network for representation learning of gene expression data. IEEE J. Biomed. Health Inform. 25 (8), 3219–3229.
Urda, D., Montes-Torres, J., Moreno, F., Franco, L., Jerez, J.M., 2017, June. Deep learning to analyze RNA-seq gene expression data. In: International Work-Conference on Artificial Neural Networks. Springer, Cham, pp. 50–59.
Wang, X., Ghasedi Dizaji, K., Huang, H., 2018. Conditional generative adversarial network for gene expression inference. Bioinformatics 34 (17), i603–i611.
Washburn, J.D., Mejia-Guerra, M.K., Ramstein, G., Kremling, K.A., Valluru, R., Buckler, E.S., Wang, H., 2019. Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence. Proc. Natl. Acad. Sci. USA 116 (12), 5542–5549.
Wu, L., Liu, X., Xu, J., 2019. HetEnc: a deep learning predictive model for multi-type biological dataset. BMC Genom. 20 (1), 1–10.

Xiao, Y., Wu, J., Lin, Z., 2021. Cancer diagnosis using generative adversarial networks based on deep learning from imbalanced data. Comput. Biol. Med., 104540.
Xie, R., Quitadamo, A., Cheng, J., Shi, X., 2016, December. A predictive model of gene expression using a deep learning framework. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, pp. 676–681.
Xie, R., Wen, J., Quitadamo, A., Cheng, J., Shi, X., 2017. A deep auto-encoder model for gene expression prediction. BMC Genom. 18 (9), 39–49.
Xu, J., Wu, P., Chen, Y., Meng, Q., Dawood, H., Khan, M.M., 2019. A novel deep flexible neural forest model for classification of cancer subtypes based on gene expression data. IEEE Access 7, 22086–22095.
Yang, Y., Fang, Q., Shen, H.B., 2019a. Predicting gene regulatory interactions based on spatial gene expression data and deep learning. PLoS Comput. Biol. 15 (9), e1007324.
Yang, B., Zhang, Y., Pang, S., Shang, X., Zhao, X., Han, M., 2019b. Integrating multi-omic data with deep subspace fusion clustering for cancer subtype prediction. IEEE ACM Trans. Comput. Biol. Bioinf. 18 (1), 216–226.
Yoo, I., Alafaireet, P., Marinov, M., Pena-Hernandez, K., Gopidi, R., Chang, J.F., Hua, L., 2012. Data mining in healthcare and biomedicine: a survey of the literature. J. Med. Syst. 36 (4), 2431–2448.
Zeebaree, D.Q., Haron, H., Abdulazeez, A.M., 2018, October. Gene selection and classification of microarray data using convolutional neural network. In: 2018 International Conference on Advanced Science and Engineering (ICOASE). IEEE, pp. 145–150.
Zhang, X., He, D., Zheng, Y., Huo, H., Li, S., Chai, R., Liu, T., 2020. Deep learning based analysis of breast cancer using advanced ensemble classifier and linear discriminant analysis. IEEE Access 8, 120208–120217.
Zhao, Y., Joshi, P., Shin, D.G., 2019, November. Recurrent neural network for gene regulation network construction on time series expression data. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, pp. 610–615.
Zheng, X., Fu, X., Wang, K., Wang, M., 2020. Deep neural networks for human microRNA precursor detection. BMC Bioinf. 21 (1), 1–7.
