

IEEE REVIEWS IN BIOMEDICAL ENGINEERING, VOL. 11, 2018

Why Deep Learning Is Changing the Way to Approach NGS Data Processing: A Review

Fabrizio Celesti, Antonio Celesti, Member, IEEE, Jiafu Wan, Member, IEEE, and Massimo Villari, Member, IEEE

Manuscript received July 5, 2017; revised December 29, 2017; accepted March 31, 2018. Date of publication April 12, 2018; date of current version July 24, 2018. This work was supported by the Italian Healthcare Ministry funded project “Do severe acquired brain injury patients benefit from telerehabilitation? A cost-effectiveness analysis study,” under Grant GR-2016-02361306. (Corresponding author: Antonio Celesti.)
F. Celesti is with the Department of Biomedical and Dental Sciences and Morphological and Functional Images, University of Messina, Messina 98125, Italy (e-mail: fabrizio.celesti@studenti.unime.it).
A. Celesti is with the MIFT Department, University of Messina, Messina 98166, Italy, with the Alma Digit Research Laboratory, Messina 98166, Italy, and also with the BIG DATA Laboratory, National Interuniversity Consortium for Informatics, Rome 00185, Italy (e-mail: acelesti@unime.it).
J. Wan is with the School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510630, China (e-mail: mejwan@scut.edu.cn).
M. Villari is with the MIFT Department, University of Messina, Messina 98166, Italy, with the IRCCS Centro Neurolesi “Bonino Pulejo,” Messina 98124, Italy, and with the Alma Digit Research Laboratory, Messina 98166, Italy (e-mail: mvillari@irccsme.it).
Digital Object Identifier 10.1109/RBME.2018.2825987

Abstract: Nowadays, big data analytics in genomics is an emerging research topic. In fact, the large amount of genomics data originated by emerging next-generation sequencing (NGS) techniques requires increasingly fast and sophisticated algorithms. In this context, deep learning is re-emerging as a possible approach to speed up the DNA sequencing process. In this review, we specifically discuss such a trend. In particular, starting from an analysis of the interest of the Internet community in both NGS and deep learning, we present a taxonomic analysis highlighting the major software solutions based on deep learning algorithms available for each specific NGS application field. We discuss future challenges in the perspective of cloud computing services aimed at deep learning based solutions for NGS.

Index Terms: Big data, biotechnology, deep learning, genomics, next-generation sequencing (NGS).

I. INTRODUCTION

Nowadays, big data analytics in genomics is an emerging research topic. Knowing the nucleotide sequence of the genome of an organism is extremely important in various biotechnological research fields. DNA sequencing is a molecular biology process that determines the exact nucleotide sequence of a DNA molecule, which is made up of four alternating nucleotides: adenine (A), thymine (T), guanine (G), and cytosine (C). Information obtained from the DNA sequencing process is used for basic biological research and in other specific fields, such as comparative genomics, metagenomics, biological systematics, medical diagnosis, early detection of cancer, single nucleotide polymorphism (SNP) research, regulation of gene expression, forensic biology, and many others. In recent years, modern high-performance sequencing methods have been developed. In this context, next-generation sequencing (NGS) denotes a number of different modern DNA sequencing techniques that are applied, for example, to genome sequencing, genome resequencing, transcriptome profiling (RNA-Seq), DNA-protein interactions (ChIP-sequencing), epigenome characterization, etc. NGS solutions allow us to speed up DNA sequencing tasks. Even though short fragments of nucleic acids can be analyzed more efficiently, the possibility of carrying out a large number of parallel sequencing tasks produces a huge amount of genomics data that needs to be stored and processed in a short time. Therefore, the genomics data produced by NGS techniques are an example of the well-known “big data” problem [1]–[3]. NGS allows researchers to perform numerous parallel sequencing processes in order to obtain a large number of sequences in a short time and at low cost, compared to traditional Sanger sequencing based on chain-termination methods.

With the advent of NGS, one of the major challenges in bioinformatics is to efficiently transform genomics big data into valuable knowledge. On one hand, a key issue is the complexity of the errors found in NGS data; on the other hand, processing the huge amount of NGS data requires a considerable execution time, which makes the older algorithms based on statistical machinery no longer efficient enough. In order to address such issues, both industrial and scientific communities are looking at deep learning solutions. Deep learning is a research field of machine learning and artificial intelligence that is based on different levels of representation, corresponding to a hierarchy of factors in which high-level concepts are defined on the basis of low-level ones. Although it was introduced by Ivakhnenko and Lapa in 1965 [4], it is currently re-emerging due to its potential to solve many big data problems in different application fields, including genomics. In fact, it allows researchers to perform DNA sequencing tasks more efficiently and faster than in the past.

Up to now, several surveys have been proposed regarding the application of deep learning in bioinformatics. In this context, a survey on deep learning in medical image analysis was proposed in [5], whereas a survey of recent advances in deep learning

1937-3333 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: Pázmány Péter Catholic University. Downloaded on September 27,2023 at 13:41:55 UTC from IEEE Xplore. Restrictions apply.
CELESTI et al.: WHY DEEP LEARNING IS CHANGING THE WAY TO APPROACH NGS DATA PROCESSING: A REVIEW 69

techniques for electronic health record analysis was proposed in [6]. However, a survey specifically focusing on the adoption of deep learning based software tools in the NGS domain is still missing. In this review, differently from existing surveys, we aim to fill this gap in the literature. In particular, starting from a Google Trends analysis of the interest of the Internet community in both NGS and deep learning and from an analysis of scientific works available in the literature, we present a taxonomic analysis of the current state of the art on deep learning based solutions for NGS big data analytics. For the sake of simplicity, in the rest of the paper, we will refer to this software family with the term “deep learning NGS” (DLN). In particular, besides highlighting the countries where such a trend is more evident, we highlight the major NGS application fields in which DLN software tools are available.

The remainder of the review is organized as follows. In Section II, we motivate why deep learning is so important in the field of NGS. An analysis of the interest of the Internet community in such topics is provided in Section III. A taxonomic analysis of DLN software tools is discussed in Section IV. A discussion on the current state of the art and future challenges in the perspective of DLN cloud-based services is provided in Section V. Section VI concludes the paper.

II. WHY DOES NGS REQUIRE DEEP LEARNING?

In this section, we motivate why deep learning is so important in the field of NGS in order to lead the reader toward a taxonomic analysis of existing related works and initiatives available in the literature.

DNA sequencing is the process of determining the precise order of A, T, G, and C nucleotides within a DNA molecule. The advent of DNA sequencing methods has greatly accelerated biological and medical research and discoveries. Knowledge of DNA sequences is fundamental for basic biological research and in various applied fields, including genome analysis and SNP research (e.g., diagnostic approaches and prevention, structure analysis of mutant proteins), comparative genomics (e.g., metagenomics, ribosomal RNA (rRNA) classification, infectious disease diagnostics), regulation of gene expression (e.g., role of messenger RNA (mRNA) and noncoding RNA), forensic biology, biological systematics, early detection of cancer, virology, etc. Modern DNA sequencing technologies have allowed researchers to efficiently analyze the genomes of various living species, including human, animal, plant, and microbial organisms. Several new methods for DNA sequencing were developed in the 1990s and have been implemented in commercial DNA sequencers since 2000.

NGS, also known as high-throughput sequencing, is a catch-all term used to describe different modern DNA sequencing technologies, including Illumina (Solexa) sequencing, Roche 454 sequencing, Ion Torrent (Proton/PGM) sequencing, and SOLiD sequencing. NGS technologies allow researchers to sequence DNA much more quickly and cheaply than the older Sanger sequencing technology and, as such, have revolutionized the study of genomics and molecular biology.

With the advent of NGS, one of the major challenges in bioinformatics has been to efficiently transform genomics big data into valuable knowledge. On one hand, the key issue of NGS is the complexity of the errors that are generated and that have to be managed during genomics big data processing. For example, current error rates can vary roughly from <1% for germline SNPs to >25% for somatic INDELs. On the other hand, the processing of genomics big data coming from NGS requires a considerable execution time, which makes the older algorithms based on statistical machinery no longer efficient enough. In this regard, both industrial and scientific communities have looked at deep learning solutions that make it possible to perform NGS data processing tasks faster than in the past.

Nowadays, deep learning is one of the strategic technology trends reported by Gartner in 2017 [7]. It represents a class of machine-learning (ML) algorithms that use different abstraction levels to give a meaning to data, in which each level takes its input from the previous one [8]. Currently, it is used to address problems in various application fields, including, e.g., complex networks [9], finance [10], cancer prediction [11], satellite image processing [12], image classification [13], social network data analysis [14], real-time video streaming [15], [16], cloud management [17], etc. It was born from artificial neural network (ANN) research in order to improve the poor performance of back-propagation algorithms on networks with many hidden layers [18]. Algorithms can be either supervised or unsupervised. Commonly, deep learning algorithms are used to solve signal processing, graphical modeling, and pattern recognition problems. Typical deep learning methods include the following.

1) Deep neural network (DNN): It is an ANN with multiple hidden layers between the input and output layers, which can model complex nonlinear relationships. Architectures generate compositional models where the object is expressed as a layered composition of primitives.
2) Convolutional neural network (CNN): It is a class of deep, feed-forward ANNs that has successfully been applied to analyze images. CNNs use a variation of multilayer perceptrons (MLPs) designed to require minimal preprocessing. They are also known as shift invariant (or space invariant) ANNs, based on their shared-weight architecture and translation invariance characteristics.
3) Recurrent neural network (RNN): It is a class of ANN in which connections between units form a directed cycle that allows us to analyze dynamic temporal behavior. Unlike feed-forward ANNs, RNNs can use their internal memory to process arbitrary sequences of inputs.
4) Autoencoder (AE): It is an ANN used for unsupervised learning of efficient codings. Four main variants exist as follows:
a) Sparse autoencoder (SA): By imposing sparsity on the hidden units during the training process (while having a larger number of hidden units than inputs), an AE can learn useful structures in the input data. This allows sparse representations of inputs.


b) Denoising autoencoder (DA): It takes a partially corrupted input and is trained to recover the original undistorted one.
c) Variational autoencoder (VA): It inherits the AE architecture, but makes strong assumptions concerning the distribution of latent variables.
d) Contractive autoencoder (CA): It adds an explicit regularizer to its objective function that forces the model to learn a function that is robust to slight variations of input values.

Apart from the aforementioned methods, several variants have been proposed so far.

Fig. 1. Google Trends. Interest during the time period ranging from the week of December 18–25, 2011 to the week of December 4–10, 2016.

TABLE I
GOOGLE TRENDS: INTEREST OF THE INTERNET COMMUNITY IN NGS AND DEEP LEARNING DURING THE TIME PERIOD RANGING FROM THE WEEK OF DECEMBER 18–25, 2011 TO THE WEEK OF DECEMBER 4–10, 2016
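The shared-weight and translation-invariance properties attributed to CNNs in the list of methods can be illustrated on a DNA sequence with a minimal Python sketch. It is not taken from any of the reviewed tools: the function names (`one_hot`, `conv1d`, `motif_kernel`, `motif_score`), the "TATA" motif, and the hand-set filter weights are all assumptions made for illustration, whereas a real CNN would learn its filter weights from training data.

```python
# Minimal sketch (hypothetical names, hand-set weights): one shared filter
# slides along a one-hot encoded DNA sequence, and global max pooling makes
# the result independent of where the motif occurs.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of four values in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def motif_kernel(motif):
    """Build a filter with +1 for the expected base and -1 for the others."""
    return [[1.0 if base == b else -1.0 for b in BASES] for base in motif]


def conv1d(encoded, kernel):
    """Apply the same kernel weights at every position (weight sharing)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores


def motif_score(seq, motif):
    """Convolution, ReLU activation, then global max pooling."""
    activations = [max(0.0, s) for s in conv1d(one_hot(seq), motif_kernel(motif))]
    return max(activations)


score_early = motif_score("TATAGGCCGGCC", "TATA")   # motif at the start
score_late = motif_score("GGCCGGCCTATA", "TATA")    # motif at the end
score_absent = motif_score("GGCCGGCCGGCC", "TATA")  # motif not present

print(score_early, score_late, score_absent)
```

In this sketch the two sequences that contain the motif receive the same score regardless of its position, while the sequence without it scores zero, which is the behavior that the shift-invariant description refers to.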
Research on deep learning has demonstrated success in various application fields including, among others, healthcare and biotechnology. Considering applications of deep learning in healthcare, a hidden Markov model (HMM) enriched by a feed-forward neural network was applied to speech recognition in [19]. Moreover, interest in deep learning for speech recognition was also shown by Microsoft in [20]. A CNN-based method for determining an author’s personality by studying the text was reported in [21]. Other relevant applications of deep learning in healthcare regarded mammography [22] and the analysis of molecules for discovering new drugs [23]. Regarding genetics and genomics problems, AE methods were used to predict gene ontology annotations and gene-function relationships. In general, deep learning methods are very useful in the following conditions:
1) there are genomics big data (e.g., >50 k samples and >1M ideals);
2) there are high-quality input data and labels for training;
3) the mapping from data to labels is unknown but definitely exists; and
4) there are problems that are hard to solve with classical statistical and ML approaches, in which the analysis of previous high-quality results can improve the training task.

Considering the latter, the deep learning approach fits SNP and INDEL calling from NGS big data well. In particular, figuring out the true genome sequence from NGS big data is a computational and statistical challenge if we consider that actual sequencer output includes roughly 1 billion basepair long DNA reads (i.e., 30× coverage) and that a true genome sequence includes 3 billion bases in 23 contiguous chunks (i.e., chromosomes). Moreover, a complex error process makes it difficult to call variants accurately in NGS big data.

In the following, we will provide a comprehensive taxonomic analysis of deep learning based solutions in the NGS domain.

Fig. 2. Google Trends. Distribution on map of the interest in deep learning in the considered time period.

III. INTEREST OF THE INTERNET COMMUNITY IN NGS AND DEEP LEARNING

In recent years, we have observed a constant interest of the Internet community in NGS and an increasing interest in deep learning. Fig. 1 shows the interest of the Internet community in NGS and deep learning according to Google Trends during the time period ranging from the week of December 18–25, 2011 to the week of December 4–10, 2016. As shown in Table I, numbers represent the search interest in the terms NGS and deep learning relative to the highest point of the graph for a given region and period. The value 100 indicates the greatest term search frequency, 50 indicates half of that frequency, whereas 0 indicates a term search frequency of less than 1% of the greatest search frequency. We can observe that even though the interest in NGS has been constant since 2011, in recent years there has been a meaningful increase in the interest in deep learning. Figs. 2 and 3 depict the interest of Google users, respectively, in deep learning and NGS, considering different geographical areas. Considering both maps, we can notice that there are peaks of users interested


in both NGS and deep learning in the United States, Italy, United Kingdom, Germany, and India.

According to Google Trends, then, the Internet community interested in both NGS and deep learning, as shown in Figs. 2 and 3, is concentrated in the United States, United Kingdom, Italy, Germany, and India.

Fig. 3. Google Trends. Distribution on map of the interest in NGS in the considered time period.

IV. TAXONOMIC ANALYSIS OF SCIENTIFIC INITIATIVES

The huge amount of information obtained by means of NGS technologies must be stored and analyzed in a proper manner. Therefore, many software solutions are available for these purposes. In particular, here we focus on deep learning based software solutions for big NGS data analytics (which we call DLN) [24]. In order to analyze the state of the art, we considered the PubMed, Web of Science, and Scopus databases, which include papers coming from over 5000 of the major publishers spread all over the world [25]. In particular, in our search, we combined the “deep learning” and “NGS” terms. Although at the time of writing of this review we found more than 28 000 records, only a few of them specifically focused on DLN software solutions.

From a taxonomic analysis of the state of the art, the major NGS application fields in which deep learning has been adopted so far include the following:
1) regulation of gene expression (including epigenetic modifications, interactions between proteins and regulatory sequences, and prediction of splicing variants of mRNA);
2) genome analysis and SNP research (including SNPs in coding and noncoding regions of the genome, and protein structure prediction); and
3) early detection of cancer (including research and monitoring of biomarkers) [26]–[28].

Regulation of gene expression underlies the correct progress of all biological functions, which are finely tuned. In this regard, an important mechanism involves epigenetic modifications (such as DNA methylation and histone modifications), which affect gene expression without changing the nucleotide sequence of the genome. In addition, the correct interaction of DNA with specific proteins, such as transcription factors (TFs), is also important, as it is responsible for the transcription of DNA into RNA in a sequence-specific manner. Alternative splicing, responsible for the synthesis of various mRNAs from a single gene, is also an important regulatory mechanism. Furthermore, mutations in specific regulatory sequences underlie alterations of these cellular processes that could compromise cell viability.

Genome analysis and SNP research aim to identify polymorphisms in the genome (in particular, SNPs that could be associated with diseases). Accordingly, it is possible to develop new diagnostic approaches and increase resources in disease prevention. In addition, the knowledge of polymorphisms in the coding regions of DNA allows the prediction of the conformational structures of the encoded proteins.

Early detection of cancer is extremely important to fight diseases caused by abnormal cell proliferation due to genetic anomalies. Furthermore, it is also important for the administration of the most appropriate medical treatments.

Fig. 4 shows a taxonomic tree of the major DLN software solutions currently available over the Internet. In particular, nodes represent the NGS application fields, whereas leaves represent the adopted DLN software solutions. Furthermore, Table II summarizes, for each NGS application field, the major recent DLN software solutions, bibliographic references, titles of scientific works, adopted deep learning methods, and countries of origin.

A. Epigenetic Modifications

Deepmethyl [29] is a stacked denoising autoencoder (SDA) based piece of software aimed at predicting the methylation state of DNA CpG dinucleotides by adopting features inferred from a three-dimensional genome topology (based on Hi-C) and patterns of DNA sequences. DeepChrome [30] is a CNN-based framework able to classify gene expression using histone modification data as input. It enables automatic extraction of complex interactions. In addition, in order to simultaneously display the combinatorial interactions among histone modifications, it adopts a novel optimization based technique that generates feature pattern maps.

B. Proteins and Regulatory Sequences and DNA Binding Proteins

Basset [31] is an open source software package based on CNN that is able to learn the functional activity of DNA sequences by analyzing genomics data. Basset was tested on a compendium of accessible genomic sites mapped in 164 cell types by DNA sequencing in order to demonstrate greater predictive accuracy compared to other methods. Deep Motif (DeMo) [32] is a dashboard toolkit that provides a suite of visualization strategies in order to extract motifs, or sequence patterns, from DNN models for TF binding classification. DanQ [33] is a hybrid framework able to predict noncoding function directly from sequences. It combines CNNs and a bidirectional long short-term memory network (BLSTM), which is a variant of the RNN that combines the outputs of two RNNs, one processing the sequence from left to right, the other one from right to left. PEDLA [34] is a framework based on DNN and HMM, which can directly learn an enhancer predictor from a huge amount of heterogeneous genomics data and is able to apply generalization strategies with the purpose to be mostly


consistent across various cell types and/or tissues. DECRES [35] is a supervised deep learning solution for the identification of enhancer and promoter regions in the human genome. It combines the following basic and derived deep learning methods: CNN, DA, CA, SDA, stacked contractive autoencoder (SCA), multiclass logistic/softmax regression (MCL/SR), MLPs, restricted Boltzmann machine (RBM), deep belief network (DBN), and stacked restricted Boltzmann machine (SRBM). Flexible Integration of Data with Deep LEarning (FIDDLE) [36] is an open source, flexible, integrative, data-agnostic framework that is able to learn a unified representation by analyzing multiple data types in order to infer another data type. DEEP [37], [38] is a predictive framework using the CNN approach that is able to streamline the analysis of enhancer properties in various cellular conditions. By using such a solution, it is possible to train many individual classification models that can be combined to classify DNA regions as enhancers or nonenhancers. DEEP uses features that are deduced from histone modification marks or attributes coming from different sequence characteristics. DeepBind [39] is a stand-alone software tool based on CNN that is fully automatic and is able to handle millions of sequences for each experiment. The specificities that are determined by DeepBind are displayed as a weighted position matrix or as a “mutation map,” which indicates how variations affect binding within a specific sequence.

Fig. 4. Taxonomic tree reporting the major DLN software tools.

C. Prediction of Splicing Variants of mRNA

DeepSplice [40] is a CNN-based splice junction classification tool employing deep CNNs that:
1) offers better performance compared to other methods for predicting splice sites;
2) offers high computational efficiency; and
3) can be applied so as to pick out self-defined training data.

D. SNPs in Coding and Noncoding Regions of Genome

DeepSea [41] is a framework based on CNN algorithms that directly learns a regulatory sequence code from large-scale chromatin-profiling data and enables the prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. Diet networks [42] is a software tool based on the concept of CNN reparameterization aimed at solving the overfitting problem that arises when the number of input features can be orders of magnitude larger than the number of training examples. In particular, by means of a neural network parameterization, it is able to considerably reduce the number of free parameters. It is based on the idea that it can first learn and provide a distributed representation for each input feature (e.g., for each position in the genome where variations are observed), and then learn by means of another neural network how to map each distributed feature representation into a vector of parameters of a classifier neural network, whose weights link each value of the feature to a specific hidden unit.

DANN [43] is a software tool aimed at annotating genetic variants, especially noncoding variants, for the purpose of identifying pathogenic ones. It uses the same feature set and training data as combined annotation-dependent depletion to train a DNN that can capture nonlinear relationships among features efficiently when there is a large number of samples and features.

E. Protein Structure Prediction

Given large enough protein families and using a global statistical inference approach, it is possible to obtain enough accuracy in protein residue contact predictions to predict the structure of many proteins. However, these approaches do not consider the fact that the contacts in a protein are neither randomly nor independently distributed, but actually follow precise rules governed by the structure of the protein, and thus, are interdependent. Considering such a concept, PconsC2 [44] is a multilayer


feed-forward (MLFF) stack of random decision forest learners aimed at identifying proteinlike contact patterns in order to improve contact predictions.

TABLE II
RELATED WORK TITLE, AUTHORS, COUNTRY, AND DEEP LEARNING BASED SOLUTION FOR EACH NGS SUBFIELD

F. Research and Monitoring of Biomarkers

Cancer detection from gene expression data is a challenge due to the high dimensionality and complexity of the data considered. After decades of research, there is still uncertainty in the clinical diagnosis of cancer and the identification of tumor-specific markers. The stacked denoising AE (SDAE) solution [45] adopts a deep learning approach aimed at cancer detection from gene expression data, identifying genes that are critical for the diagnosis of breast cancer.

V. DISCUSSION AND FUTURE CHALLENGES

From both Google Trends and the taxonomic analysis of academic initiatives, it is evident that deep learning for the processing of NGS big data is an emerging research topic.

Fig. 5 highlights that the highest percentage of DLN software tools is available in the United States, confirming the interest of the Internet community of this country in deep learning and NGS. From an academic point of view, other relevant initiatives


were developed in Canada, China, Sweden, and Saudi Arabia. Fig. 6, instead, highlights that 50% of the analyzed deep learning tools are aimed at proteins and regulatory sequences, which is therefore the NGS subfield currently covered the most, followed by epigenetic modifications (12.5%). Other NGS fields that currently are not covered by deep learning based solutions include comparative genomics (e.g., metagenomics, rRNA classification, and infectious disease diagnostics), forensic biology, biological systematics, and virology.

Fig. 5. Percentage of deep learning based NGS tools according to country.

Fig. 6. Percentage of deep learning based tools according to NGS subarea.

Regarding deep learning methods, those most used by DLN software tools are CNN, DNN, HMM, and SDA. The major one is CNN, which is used in the DeepChrome, Basset, FIDDLE, DEEP, DeepBind, DeepSplice, DeepSea, and Diet software tools, whereas variants are used in the DeMo and DanQ software tools. The DNN method is used in DANN and, in combination with HMM, in PEDLA. SDA is used in Deepmethyl and SDAE. Finally, MLFF, derived from CNN, is used in PconsC2. DECRES deserves particular consideration, as it combines different deep learning methods including CNN, DA, CA, SDA, SCA, MCL/SR, MLP, RBM, DBN, and SRBM.

Cloud computing is emerging as a promising solution for fast on-demand data processing in distributed systems, thanks to its theoretically infinite availability of storage and computation resources. The adoption of such a paradigm opens up new business opportunities [46], besides bringing different advantages in terms of service discovery [47] and security [48]. The advent of this paradigm is changing the way information systems are conceived in different application fields, including biotechnology. From our analysis, even though cloud computing based NGS solutions [49] exist, it is evident that DLN software tools are currently provided in the form of stand-alone applications and not in an “as a service” form. DLN software tools could benefit from cloud service levels including Platform as a Service (PaaS) [50] and Software as a Service (SaaS) [51]: the first provides a global application development and deployment environment to software developers [cloud providers typically offer development toolkits through an application program interface (API)]; the second provides applications by means of web interfaces. Currently, there are many frameworks that are used to run parallel algorithms in cloud computing systems. An example is Apache Hadoop, which is used to run distributed parallel processing clusters on top of a scalable virtual infrastructure. In this regard, we underline that DNA sequencing and NGS data processing are typical problems that can be solved by means of parallel deep learning algorithms.

Future challenges regard interoperability, that is, the ability of information systems and software applications to communicate with each other and exchange data [52], among heterogeneous bioinformatics cloud providers aimed at providing DLN in the form of PaaS and SaaS. Unfortunately, most existing NGS library preparation devices, sequencing instruments, and software tools have not been designed to work in a clinical networked environment, even though an interoperable world-wide ecosystem of cooperating bioinformatics clouds would bring benefits for the whole research community. In our opinion, evolved federated PaaS aimed at DLN could push the development of new mash-up DLN SaaS. However, in order to achieve such a scenario, the research community has to define the following:
1) common standard communication interfaces to enable the communication between different federated bioinformatics cloud providers;
2) a common standard DNA sequencing data format;
3) a common PaaS development and deployment environment for DLN SaaS solutions exploiting the advantages of cloud computing in terms of system scalability, data processing, and security; and
4) common standard APIs for the development of DLN SaaS solutions.

VI. CONCLUSION

Deep learning for big data analytics in genomics is an emerging research topic. In this review, we specifically focused on deep learning adopted for NGS. Although deep learning was introduced in 1965, it is currently re-emerging to solve big data problems in genomics. In particular, first, by means of the Google Trends tool, we analyzed the interest of the Internet community in NGS and deep learning topics in a period from December 2011 to December 2016, highlighting how the interest in the latter is constantly growing. Second, we presented a taxonomy in order to analyze the current state of the art of DLN software solutions. From our study, we realized that deep learning software tools are at an early stage. Many researchers are beginning to look at deep learning, especially in the context of regulatory sequences and DNA binding proteins, but there are not yet many solutions focusing on other genomics research fields. Moreover, from our analysis, it is evident that deep learning is currently adopted in NGS stand-alone applications and not in cloud computing PaaS and SaaS. In our opinion, the research community could benefit from the advantages offered


by future cloud computing services based on deep learning algorithms in terms of both computational resource scalability and DNA fragments data sharing. With this review, we hope we succeeded in stimulating the biotechnology community toward the development of advanced DLN software tools.

REFERENCES

[1] M. Fazio, M. Paone, A. Puliafito, and M. Villari, “Huge amount of heterogeneous sensed data needs the cloud,” in Proc. Int. Multi-Conf. Syst., Signals, Devices, 2012, pp. 1–6.
[2] M. Fazio, A. Celesti, A. Puliafito, and M. Villari, “Big data storage in the cloud for smart environment monitoring,” Procedia Comput. Sci., vol. 52, no. 1, 2015, pp. 500–506.
[3] M. Fazio, A. Celesti, M. Villari, and A. Puliafito, “The need of a hybrid storage approach for IoT in PaaS cloud federation,” in Proc. IEEE 28th Int. Conf. Adv. Inf. Netw. Appl. Workshops, 2014, pp. 779–784.
[4] A. G. Ivakhnenko and V. G. Lapa, Cybernetic Predicting Devices. New York, NY, USA: CCM Information Corporation, 1965.
[5] G. Litjens et al., “A survey on deep learning in medical image analysis,” Med. Image Anal., vol. 42, pp. 60–88, 2017.
[6] B. Shickel, P. Tighe, A. Bihorac, and P. Rashidi, “Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis,” IEEE J. Biomed. Health Informat., to be published.
[7] K. Panetta, “Gartner’s top 10 strategic technology trends for 2017: Artificial intelligence, machine learning, and smart things promise an intelligent future,” Gartner, Oct. 2016. [Online]. Available: http://www.gartner.com/smarterwithgartner/gartners-top-10-technology-trends-2017/
[8] L. Deng and D. Yu, “Deep learning: Methods and applications,” Tech. Rep., May 2014. [Online]. Available: https://www.microsoft.com/en-us/research/publication/deep-learning-methods-and-applications/
[9] D. C. Mocanu, E. Mocanu, P. H. Nguyen, M. Gibescu, and A. Liotta, “A topological insight into restricted Boltzmann machines,” Mach. Learn., vol. 104, no. 2, pp. 243–270, Sep. 2016.
[10] S. Sohangir, D. Wang, A. Pomeranets, and T. Khoshgoftaar, “Big data: Deep learning for financial sentiment analysis,” J. Big Data, vol. 5, no. 1, pp. 1–25, 2018.
[11] D. Bychkov et al., “Deep learning based tissue analysis predicts outcome in colorectal cancer,” Sci. Rep., vol. 8, no. 1, 2018, Art. no. 3359.
[12] M. Zou and Y. Zhong, “Transfer learning for classification of optical satellite image,” Sens. Imag., vol. 19, no. 1, 2018.
[13] Y. Zhou, Q. Hu, and Y. Wang, “Deep super-class learning for long-tail distributed image classification,” Pattern Recognit., vol. 80, pp. 118–128, 2018.
[14] M. Bastos and D. Mercea, “Parametrizing Brexit: Mapping twitter political space to parliamentary constituencies,” Inf. Commun. Soc., vol. 21, no. 7, pp. 921–939, 2018.
[15] M. T. Vega, D. C. Mocanu, and A. Liotta, “Unsupervised deep learning for real-time assessment of video streaming services,” Multimedia Tools Appl., vol. 76, no. 21, pp. 22303–22327, Nov. 2017.
[16] M. T. Vega, D. C. Mocanu, J. Famaey, S. Stavrou, and A. Liotta, “Deep learning for quality assessment in live video streaming,” IEEE Signal Process. Lett., vol. 24, no. 6, pp. 736–740, Jun. 2017.
[17] Y. Zhang, J. Yao, and H. Guan, “Intelligent cloud resource management with deep reinforcement learning,” IEEE Cloud Comput., vol. 4, no. 6, pp. 60–69, Nov./Dec. 2018.
[18] D. Yu and L. Deng, “Deep learning and its applications to signal and information processing [exploratory DSP],” IEEE Signal Process. Mag., vol. 28, no. 1, pp. 145–154, Jan. 2011.
[19] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[20] “Recent advances in deep learning for speech research at Microsoft,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2013. [Online]. Available: https://www.microsoft.com/en-us/research/publication/recent-advances-in-deep-learning-for-speech-research-at-microsoft/
[21] N. Majumder, S. Poria, A. Gelbukh, and E. Cambria, “Deep learning-based document modeling for personality detection from text,” IEEE Intell. Syst., vol. 32, no. 2, pp. 74–79, Mar. 2017.
[22] A. S. Becker, M. Marcon, S. Ghafoor, M. C. Wurnig, T. Frauenfelder, and A. Boss, “Deep learning in mammography,” Investigative Radiol., vol. 52, no. 7, pp. 434–440, Feb. 2017.
[23] John Markoff, “Scientists see promise in deep-learning programs,” New York Times. 2012. [Online]. Available: http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html. Accessed: Nov. 2012.
[24] F. Celesti et al., “Big data analytics in genomics: The point on deep learning solutions,” in Proc. IEEE Symp. Comput. Commun., 2017, pp. 306–309.
[25] SCOPUS, Scopus Content at-a-glance. 2017. [Online]. Available: https://www.elsevier.com/solutions/scopus/content. Accessed: Oct. 20, 2017.
[26] D. Ravì et al., “Deep learning for health informatics,” IEEE J. Biomed. Health Informat., vol. 21, no. 1, pp. 4–21, Jan. 2017.
[27] C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle, “Deep learning for computational biology,” Mol. Syst. Biol., vol. 12, no. 7, 2016, Art. no. 878. [Online]. Available: http://dx.doi.org/10.15252/msb.20156651
[28] M. K. K. Leung, A. Delong, B. Alipanahi, and B. J. Frey, “Machine learning in genomic medicine: A review of computational problems and data sets,” Proc. IEEE, vol. 104, no. 1, pp. 176–197, Jan. 2016.
[29] Y. Wang et al., “Predicting DNA methylation state of CpG dinucleotide using genome topological features and deep networks,” Sci. Rep., vol. 6, pp. 1–15, 2016.
[30] R. Singh, J. Lanchantin, G. Robins, and Y. Qi, “DeepChrome: Deep-learning for predicting gene expression from histone modifications,” Bioinformatics, vol. 32, no. 17, pp. i639–i648, 2016.
[31] D. Kelley, J. Snoek, and J. Rinn, “Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks,” Genome Res., vol. 26, no. 7, pp. 990–999, 2016.
[32] J. Lanchantin, R. Singh, B. Wang, and Y. Qi, “Deep Motif dashboard: Visualizing and understanding genomic sequences using deep neural networks,” Pacific Symp. Biocomput., Waimea, HI, USA, Jan. 2016, vol. 22, pp. 254–265.
[33] D. Quang and X. Xie, “DanQ: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences,” Nucleic Acids Res., vol. 44, no. 11, 2016.
[34] F. Liu, H. Li, C. Ren, X. Bo, and W. Shu, “PEDLA: Predicting enhancers with a deep learning-based algorithmic framework,” Sci. Rep., vol. 6, 2016, Art. no. 28517.
[35] Y. Li, W. Shi, and W. W. Wasserman, “Genome-wide prediction of cis-regulatory regions using supervised deep learning methods,” Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA. [Online]. Available: https://www.biorxiv.org/content/early/2016/02/28/041616
[36] U. Eser and L. S. Churchman, “FIDDLE: An integrative deep learning framework for functional genomic data inference,” Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA. [Online]. Available: http://biorxiv.org/content/early/2016/10/17/081380
[37] D. Kleftogiannis, P. Kalnis, and V. Bajic, “Deep: A general computational framework for predicting enhancers,” Nucleic Acids Res., vol. 43, no. 1, pp. 1–6, 2015.
[38] X. Min, N. Chen, T. Chen, and R. Jiang, “DeepEnhancer: Predicting enhancers by convolutional neural networks,” Proc. IEEE Int. Conf. BioInformat. Biomed., 2016, pp. 637–644.
[39] B. Alipanahi, A. Delong, M. Weirauch, and B. Frey, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning,” Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
[40] Y. Zhang, X. Liu, J. N. MacLeod, and J. Liu, “DeepSplice: Deep classification of novel splice junctions revealed by RNA-Seq,” in Proc. IEEE Int. Conf. BioInformat. Biomed., 2016, pp. 330–333.
[41] J. Zhou and O. Troyanskaya, “Predicting effects of noncoding variants with deep learning-based sequence model,” Nature Methods, vol. 12, no. 10, pp. 931–934, 2015.
[42] A. Romero et al., “Diet networks: Thin parameters for fat genomic,” Int. Conf. Learning Representations 2017 (ICLR2017), Toulon, France, Apr. 2017.
[43] D. Quang, Y. Chen, and X. Xie, “DANN: A deep learning approach for annotating the pathogenicity of genetic variants,” Bioinformatics, vol. 31, no. 5, pp. 761–763, 2015.
[44] M. Skwark, D. Raimondi, M. Michel, and A. Elofsson, “Improved contact predictions using the recognition of protein like contact patterns,” PLoS Comput. Biol., vol. 10, no. 11, 2014, Art. no. e1003889.
[45] P. Danaee, R. Ghaeini, and D. Hendrix, “A deep learning approach for cancer detection and relevant gene identification,” Proc. 22nd Pac. Symp. Biocomput., 2017, pp. 219–229.
[46] G. Di Modica and O. Tomarchio, “Matching the business perspectives of providers and customers in future cloud markets,” Cluster Comput., vol. 18, no. 1, pp. 457–475, 2015.


[47] G. Di Modica, O. Tomarchio, and L. Vita, “Resource and service discovery in SOA: A P2P oriented semantic approach,” Int. J. Appl. Math. Comput. Sci., vol. 21, no. 2, pp. 285–294, 2011.
[48] G. Di Modica and O. Tomarchio, “Semantic security policy matching in service oriented architectures,” in Proc. IEEE World Congr. Serv., Jul. 2011, pp. 399–405.
[49] A. Celesti, M. Fazio, F. Celesti, G. Sannino, S. Campo, and M. Villari, “New trends in biotechnology: The point on NGS cloud computing solutions,” Proc. IEEE Symp. Comput. Commun., 2016, pp. 267–270.
[50] A. Celesti, N. Peditto, F. Verboso, M. Villari, and A. Puliafito, “Draco PaaS: A distributed resilient adaptable cloud oriented platform,” in Proc. IEEE 27th Int. Parallel Distrib. Process. Symp. Workshops PhD Forum, 2013, pp. 1490–1497.
[51] A. Celesti, D. Mulfari, M. Fazio, A. Puliafito, and M. Villari, “Evaluating alternative DAAS solutions in private and public openstack clouds,” Softw. Pract. Experience, vol. 47, no. 9, pp. 1185–1200, 2017.
[52] A. Celesti, F. Celesti, M. Fazio, P. Bramanti, and M. Villari, “Are next-generation sequencing tools ready for the cloud?” Trends Biotechnology, vol. 35, pp. 486–489, 2017.

Fabrizio Celesti received the master’s degree in biotechnology from the University of Messina, Messina, Italy, in 2017. He is currently working toward the second-level master postgraduate course in advanced medical biotechnology for diagnostic in laboratory at the Department of Biomedical and Dental Sciences and Morphofunctional Images, Messina University. Since 2016, he has collaborated with the Department of Engineering on research activities on big genomics data analytics. Since 2016, he has been a Technical Program Committee member of the IEEE ICT solutions for eHealth (ICTS4eHealth). His main research interests include ehealth, genomics, and big data analytics.

Antonio Celesti (M’16) received the master’s degree in computer science in 2008, and the Ph.D. in advanced technologies for information engineering in 2012, from the University of Messina, Messina, Italy. He is currently an Adjunct Professor in databases with the University of Messina. He has been a collaborator in many national and international projects, such as the EU FP7 Project RESERVOIR (Resources and Services Virtualization Without Barriers), from 2008 to 2011, and the EU FP7 Project VISION CLOUD (Cloud Virtualized Storage Services Foundation for the Future Internet), from 2010 to 2013. From 2012 to 2014, he was an Assistant Researcher with the University of Messina working on the POR FSE 2007–2013 Project SIMONE, focusing on energy management and cloud computing. Since 2012, he has been responsible for the Digital Iconographic Atlas of Numismatics in Antiquity. Since 2014, he has been an Assistant Researcher with the University of Messina working on the EU FP7 project FrontierCities (European Cities Driving the Future Internet). From 2015 to 2016, he was a collaborator in the EU Horizon 2020 Project BEACON, focusing on federated cloud networking. Since 2017, he has been a Principal Investigator of the Horizon 2020 Project FrontierCities2 subgrant CASMOB and responsible for the University of Messina with the Project entitled “Do severe acquired brain injury patients benefit from Telerehabilitation? A cost-effectiveness analysis study,” funded by the Italian healthcare industry. He has co-authored many scientific publications on distributed systems and cloud computing. His main research interests include distributed systems and cloud computing (with particular regard to federation, storage, security, and energy efficiency), and assistive technology.

Jiafu Wan (M’11) has been a Professor with the School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou, China, since September 2015. He has directed 16 research projects, including the National Key Research and Development Project, the National Natural Science Foundation of China, the High-level Talent Project of Guangdong Province, and the Natural Science Foundation of Guangdong Province. His research interests include cyber-physical systems, intelligent manufacturing, big data analytics, Industry 4.0, smart factory, cloud robotics, and Internet of vehicles. Prof. Wan is an Associate Editor for the IEEE ACCESS (SCI) and on the Editorial Board of PLoS One (SCI), and he is a Managing Editor for the International Journal of Autonomous and Adaptive Communications Systems (Ei Compendex) and International Journal of Applied Research and Technology (Ei Compendex). He has been a Leading Guest Editor for several SCI-indexed journals, such as IEEE SYSTEMS JOURNAL, IEEE ACCESS, Computer Networks (Elsevier), Mobile Networks & Applications (Springer), Computers and Electrical Engineering (Elsevier), and Microprocessors and Microsystems (Elsevier). He is a senior member of both CMES and CCF.

Massimo Villari (M’08) received the Laurea degree in 1999 in electronic engineering, and the Ph.D. in advanced technologies for information engineering in 2002, from the University of Messina, Messina, Italy. He is currently an Associate Professor in computer engineering with the University of Messina. He is actively working as an IT Security and Distributed Systems Analyst in cloud computing, virtualization, and storage. For the EU Project RESERVOIR, he led IT security activities, and for the EU Project VISION-CLOUD, he covered the role of architectural designer for the University of Messina. He is currently working on the EU Project FrontierCities, the Accelerator of FIWARE on Smart Cities - Smart Mobility, where he is responsible for information and communications technology. He is strongly involved in EU Future Internet initiatives, specifically cloud computing and security in distributed systems. He has been a co-author of more than 130 scientific publications and patents on cloud computing (cloud federation), distributed systems, wireless networks, network security, cloud security, and cloud and IoT. He was a General Chair of ESOCC 2015 and IEEE-ISCC 2016. Since 2011, he has been a fellow of IARIA, recognized as a cloud computing expert, and has also been involved in the activities of the FIArch, the EU Working Group on Future Internet Architecture. In 2014, he was recognized by an independent assessment (IEEE TRANSACTIONS ON CLOUD COMPUTING, April 2014) as one of the worldwide active scientific researchers, top 27 classification, in the cloud computing area. He has been a General Chair of EAI-CN4IoT. He is an Editor-in-Chief of the EAI-endorsed Transactions on Smart Cities. He is currently the scientist responsible for research activities on Cloud and eHealth agreed upon between the University of Messina and IRCCS Centro Neurolesi “Bonino Pulejo” in eHealth: http://healthycloud.irccsme.it/.
