Data Science For Genomics 1St Edition Amit Kumar Tyagi Full Chapter
Edited by
Ajith Abraham
Director,
Machine Intelligence Research Labs,
United States
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
ISBN: 978-0-323-98352-5
Contents
9. Effective dimensionality reduction model with machine learning classification for microarray gene expression data
Yakub Kayode Saheed
1. Introduction
2. Related work
3. Materials and methods
  3.1 Feature selection
  3.2 Principal component analysis
  3.3 Logistic regression
  3.4 Extremely randomized trees classifier
  3.5 Ridge classifier
  3.6 Adaboost
  3.7 Linear discriminant analysis
  3.8 Random forest
  3.9 Gradient boosting machine
  3.10 K-nearest neighbors
  3.11 Data set used for analysis
4. Results and discussion
  4.1 Experimental analysis on 10-fold cross-validation
  4.2 Experimental analysis on eightfold cross-validation
  4.3 Comparison of our findings with some earlier studies
5. Conclusion and future work
References
10. Analysis the structural, electronic and effect of light on PIN photodiode achievement through SILVACO software: a case study
Samaher Al-Janabi, Ihab Al-Janabi and Noora Al-Janabi
1. Introduction
  1.1 Photodiode
  1.2 Effect of light on the I–V characteristics of photodiodes
  1.3 I–V characteristics of a photodiode
  1.4 Types of photodiodes
  1.5 Modes of operation of a photodiode
  1.6 Effect of temperature on I–V char of photodiodes
  1.7 Signal-to-noise ratio in a photodiode
  1.8 Responsivity of a photodiode
  1.9 Responsivity versus wavelength
2. PIN photodiode
  2.1 Operation of PIN photodiode
  2.2 Key PIN diode characteristics
  2.3 PIN diodes uses and advantages
  2.4 PIN photodiode applications
3. Results and simulations
  3.1 Effect of light on a PIN photodiode
  3.2 Procedure to design and observe the effect of light
  3.3 V–I characteristic of a PIN photodiode
4. Conclusion
Appendix (Silvaco Code)
  Effect of light on the characteristics of pin diode code
  Effect of light on the characteristics of SDD diode code
References
11. One step to enhancement the performance of XGBoost through GSK for prediction ethanol, ethylene, ammonia, acetaldehyde, acetone, and toluene
Samaher Al-Janabi, Hadeer Majed and Saif Mahmood
1. Introduction
2. Related work
3. Main tools
  3.1 Internet of Things (IoTs)
  3.2 Optimization techniques
  3.3 Prediction techniques
4. Result of implementation
  4.1 Description of dataset
  4.2 Result of preprocessing
  4.3 Checking missing values
5. Conclusions
References
12. A predictive model for classifying colorectal cancer using principal component analysis
Micheal Olaolu Arowolo, Happiness Eric Aigbogun, Precious Eniola Michael, Marion Olubunmi Adebiyi and Amit Kumar Tyagi
1. Introduction
2. Related works
3. Methodology
  3.1 Experimental dataset
  3.2 Dimensionality reduction tool
  3.3 Classification
  3.4 Research tool
  3.5 Performance evaluation metrics
4. Results and discussions
5. Conclusion
References
Contributors
Girish Kumar Adari, School of Electronics Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, India
Marion Olubunmi Adebiyi, Department of Computer Science, Landmark University, Omu-Aran, Kwara State, Nigeria
Olayinka R. Adelegan, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria
Happiness Eric Aigbogun, Department of Computer Science, Landmark University, Omu-Aran, Kwara State, Nigeria
Samaher Al-Janabi, Department of Computer Science, Faculty of Science for Women (SCIW), University of Babylon, Babylon, Iraq
Ihab Al-Janabi, Babylon Electricity Distribution Branch, Electricity Distribution Company for the Middle, Ministry of Electricity, Babylon, Iraq
Noora Al-Janabi, Department of Laser physics, Faculty of Science for Women (SCIW), University of Babylon, Babylon, Iraq
J. Andrew John, School of Electronics Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, India
Micheal Olaolu Arowolo, Department of Computer Science, Landmark University, Omu-Aran, Kwara State, Nigeria; Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
Olalekan J. Awujoola, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria
Abidemi E. Awujoola, Department of Biotechnology, Nigerian Defence Academy, Kaduna, Nigeria
Vedant Vikrom Borah, Department of Bio-Sciences, Assam Don Bosco University, Guwahati, Assam, India
Rene Barbie Browne, Department of Biochemistry, Assam Don Bosco University, Guwahati, Assam, India
A. Chaitanya Kumar, School of Electronics Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, India
Aswani Kumar Cherukuri, School of IT and Engineering, Vellore Institute of Technology, Katpadi, Vellore, India
Abhishek Das, Department of Electronics and Communication Engineering, ITER, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
H.R. Deekshetha, School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, India
Sri Sai Deepthi Bhrugubanda, Mamatha Medical College, Khammam, Telangana, India
Prakriti Dwivedi, Research & Business Analytics-PGDM, Welingkar Institute of Management Development and Research, Mumbai, India
Monika Gulia, Department of Pharmacy, School of Medical and Allied Sciences, GD Goenka University, Gurugram, Haryana, India
Sumeet Gupta, Department of Pharmacy, MM College of Pharmacy, Ambala, Haryana, India
Akbar Hossain, School of Engineering, Auckland University of Technology, Auckland, New Zealand
Vikas Jhawat, Department of Pharmacy, School of Medical and Allied Sciences, GD Goenka University, Gurugram, Haryana, India
V. Kakulapati, Sreenidhi Institute of Science and Technology, Yamnampet, Ghatkesar, Hyderabad, Telangana, India
Firuz Kamalov, Department of Electrical Engineering, Canadian University Dubai, Dubai, UAE
Akbar Ali Khan, Research & Business Analytics-PGDM, Welingkar Institute of Management Development and Research, Mumbai, India
Saif Mahmood, Department of Computer Science, Faculty of Science for Women (SCIW), University of Babylon, Babylon, Iraq
Hadeer Majed, Department of Computer Science, Faculty of Science for Women (SCIW), University of Babylon, Babylon, Iraq
Precious Eniola Michael, Department of Computer Science, Landmark University, Omu-Aran, Kwara State, Nigeria
Mihir Narayan Mohanty, Department of Electronics and Communication Engineering, ITER, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
Saumendra Kumar Mohapatra, School of Information Technology, SRM University Sikkim, Gangtok, Sikkim, India
Sareeta Mugde, Research and Business Analytic, Welingkar Institute of Management Development and Research, Mumbai, India
Sriman Naini, Rosenheim Technical University of Applied Science, Rosenheim, Germany
Anroop Nair, Department of Pharmaceutical Sciences, College of Clinical Pharmacy, King Faisal University, Al-Ahsa, Al-Ahsa Oasis, Kingdom of Saudi Arabia
Philip O. Odion, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria
Francisca N. Ogwueleka, Department of Computer Science, Nigerian Defence Academy, Kaduna, Nigeria
Raj Kumar Pegu, Department of Botany, Assam Don Bosco University, Guwahati, Assam, India
Maheswari Raja, Centre for Smart Grid Technologies, SCOPE, Vellore Institute of Technology, Chennai, Tamil Nadu, India; School of Computer Science Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, India
Nagarajan Ramalingam, Department of Electrical and Electronics Engineering, Gnanamani College of Technology, Namakkal, Tamilnadu, India
S. Mahender Reddy, Otto-Friedrich University of Bamberg, IsoSySc, Bamberg, Germany
Jayanti Datta Roy, Department of Bio-Sciences, Assam Don Bosco University, Guwahati, Assam, India
Yakub Kayode Saheed, School of Information Technology and Computing, American University of Nigeria, Yola, Adamawa, Nigeria
S.A. Sajidha, School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, India
M. Shamila, Department of Computer Science and Engineering (AIML), Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, Telangana, India
Garima Sharma, IIC & Research, Welingkar Institute of Management Development and Research, Mumbai, India
Gulshan Soni, Department of Computer Science and Engineering, School of Engineering, O.P. Jindal University, Raigarh, Chhattisgarh, India
Hana Sulieman, Department of Mathematics and Statistics, American University of Sharjah, Sharjah, UAE
Kannadhasan Suriyan, Department of Electronics and Communication Engineering, Study World College of Engineering, Coimbatore, Tamilnadu, India
Fadi Thabtah, School of Digital Technologies, Manukau Institute of Technology, Otara, Auckland, New Zealand
Prasannavenkatesan Theerthagiri, Department of Computer Science and Engineering, GITAM School of Technology, GITAM Deemed to be University, Bengaluru, India
Amit Kumar Tyagi, Department of Fashion Technology, National Institute of Fashion Technology, New Delhi, India; Vellore Institute of Technology, Chennai, Tamil Nadu, India
P. Vijaya, Modern College of Business and Science, Bowshar, Sultanate of Oman; Department of Mathematics and Computer Science, Modern College of Business and Science, Bowshar, Sultanate of Oman
K. Vinuthna, Department of Computer Science and Engineering, Neil Gogte Institute of Technology, Hyderabad, Telangana, India
Jai Narain Vishwakarma, School of Life Sciences, Assam Don Bosco University, Guwahati, Assam, India
Preface
Genomics is a branch of genetics; the term was coined by Tom Roderick in 1986. Genetics is the study of a single gene, whereas
genomics refers to the study of a group of genes called genomes. Genomes can be considered as an instruction manual for
human life. Originally, the analysis of these genomic data was very costly. However, due to advancements in technology,
the sequencing cost has been reduced incredibly, so that we are now able to include genomic analysis in daily medical
routines. The more we explore our genomes, the better we are able to take medical decisions and cure diseases.
Genomic data do not only include the individual’s data but also their family and ancestors’ data. Any leakage of this
type of information could cause very serious issues. Therefore, it is very much necessary to protect this data from reaching
the outside world. Privacy laws such as GINA (Genetic Information Non-discrimination Act), HIPAA (Health Insurance
Portability and Accountability Act of 1996), and GDPR (General Data Protection Regulation) help users to protect their
privacy by restricting the sharing of patients’ sensitive information. However, we really need to focus more on privacy
issues in an era of such rapid developments in the healthcare sectors.
The main categories of privacy in healthcare include data privacy, location privacy, identity privacy, and genomic
privacy. Existing tools are insufficient to handle genomic data because of the large size of the datasets. This book focuses
on genomic data sources, analysis tools, and the importance of privacy preservation. We cover tools such as TensorFlow
and BioWeka, privacy laws, HIPAA, and technologies like the Internet of Things (IoT), IoT-based cloud, cloud computing,
edge computing, and blockchain technology.
On the other side, data science is a broad field encompassing some of the fastest-growing subjects in interdisciplinary
statistics, mathematics, and computer science. It encompasses the process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
Data analysis has multiple facets and approaches, including diverse techniques under a variety of names, in different
business, science, and social science domains. This book starts with a basic introduction to data science, machine learning,
deep learning, data analysis, and visualization techniques, etc. Further, we also include genomes, genomics, genetics,
transcriptomes, and proteomes as basic concepts of modern molecular biology. The book concludes with future
perspectives with respect to genomic privacy.
The book also provides numerous practical case studies (including self-assessments) using real-world data throughout;
supports understanding through hands-on experience of solving data science problems in healthcare (for genomics);
describes techniques and tools for statistical analysis, machine learning, graph analysis, and parallel programming; reviews a
range of applications of data science, including recommender systems and sentiment analysis of text data; and provides
supplementary code resources and data at an associated website. In summary, this book on Data Science for Genomics
addresses the needs of a broad spectrum of scientists and students who use quantitative methods in their daily research.
Acknowledgment
First of all, we would like to extend our gratitude to our family members, friends, and supervisors, who stood as advisors
in the completion of this book. We would also like to thank our almighty “God,” who enabled us to write this book. We
also thank Elsevier Publishers (who provided their continuous support during the COVID-19 pandemic) and the
colleagues with whom we have worked, both within and outside of our college/university positions, who have
provided continuous support toward our completing this book on Data Science for Genomics.
We would also like to thank our respected colleagues, Prof. G. Aghila, Prof. Siva Sathya, Prof. Sir N. Sreenath, and
Prof. Aswani Kumar Cherukuri, for their valuable input and for helping us complete this book.
-Dr. Amit Kumar Tyagi
-Dr. Ajith Abraham
Chapter 1
1. Introduction
Load forecasting is the procedure of predicting future electricity demand from historical data so that electric utilities can manage generation against demand. Load forecasting is now an essential task in a smart grid. A smart grid is an electrical grid that uses computers, digital technologies, or other advanced technologies to monitor the grid in real time, balance generation and demand, and act on particular information (such as the behavior of electric utilities or consumers) to improve efficiency, reliability, sustainability, and economics [1]. Load forecasting therefore plays an important role in smart grid applications. Forecasting in a smart grid takes several forms: load forecasting, price forecasting, solar-based electricity generation forecasting, and wind-based electricity generation forecasting. Load forecasting is classified into four categories [2–4]: (i) very short-term load forecasting, (ii) short-term load forecasting, (iii) mid-term load forecasting, and (iv) long-term load forecasting. This paper focuses on short-term load forecasting. As the demand for electricity increases, very short-term and short-term load forecasting help provide additional security, reliability, and protection to smart grids. They are also useful for energy efficiency, electricity pricing, market design, demand-side management, matching generation and demand, and unit commitment [5]. Machine learning methods can predict the electrical load accurately to meet the needs of smart grids.
Long short-term memory (LSTM) and recurrent neural network (RNN) models are used in many papers for load forecasting, and these methods are often hybridized to improve the predictions. Earlier work that applies RNN and LSTM methods to load forecasting is reviewed below. In paper [6], the author applied an LSTM RNN to nonresidential energy consumption forecasting. The real-time energy consumption data are from South China and contain multiple sequences of energy consumption data for 48 nonresidential consumers. The data are measured in kilowatts and were collected from advanced metering infrastructure (AMI) at a sampling interval of 15 min. Prediction accuracy was evaluated using the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE). In paper [7], the author applied an RNN-LSTM neural network to long-term load forecasting. Real-time ISO New England load data were used for a 5-year load prediction, and MAPE was used to evaluate the accuracy of the forecasts. Year-wise and season-wise MAPE values were calculated; the majority are below 5%, and none exceeds 8%.
In paper [8], the author notes that multiple-sequence LSTM has become an attractive approach for load prediction because of the increasing volume and variety of smart meters, automation systems, and other sources in smart grids. For energy load forecasting, multisequence LSTM, LSTM-genetic algorithm (GA), LSTM-particle swarm optimization (PSO), random forest, support vector machine (SVM), artificial neural network (ANN), and extra tree regressor methods are used and compared using RMSE and MAE. The load data were obtained from Réseau de Transport d’Électricité (RTE), the French electricity transmission network operator. In paper [9], the author used LSTM for power demand forecasting and compared the LSTM prediction with gradient boosted trees (GBT) and support vector regression (SVR). The LSTM gives a better prediction than GBT and SVR, decreasing MSE by 21.80% and 28.57%, respectively. Time series features, weather features, and calendar features are considered for forecasting. The power data were provided by the University of Massachusetts. Model accuracy is evaluated using MSE and MAPE.
In paper [10], electricity consumption is predicted for residential and commercial buildings using a deep recurrent neural network (RNN) model. Residential electricity consumption data from Austin, Texas, are used for mid-term to long-term forecasting, and for commercial buildings, electricity consumption data from Salt Lake City, Utah, are used for prediction. For commercial buildings the RNN performs better than a multilayered perceptron model. In paper [11], the author used the LSTM method for power load forecasting with the EUNITE real power load data. Next-hour and next-half-day predictions were made using a single-point LSTM forecasting model and a multiple-point LSTM forecasting model. Model accuracy was calculated using MAPE. The single-point LSTM forecasting model performs better than the multiple-point LSTM forecasting model.
In paper [12], the author applied an RNN to next 24-h load prediction and compared the RNN prediction with that of a back-propagation neural network. In paper [13], the author used deep RNN (DRNN), DRNN-gated recurrent unit (GRU), DRNN-LSTM, multilayer perceptron (MLP), autoregressive integrated moving average (ARIMA), SVM, and MLR methods for load demand forecasting, using residential load data from Austin, Texas, USA. The methods were evaluated using MAE, RMSE, and MAPE. In paper [14], the author used RNN, LSTM, and GBT for wind power forecasting. The wind power output was forecast from wind velocity data from Kolkata, India. The accuracy of the methods was calculated using MAE, MAPE, MSE, and RMSE.
In paper [15], the author used LSTM for short-term load forecasting. Here, 24-h, 48-h, 7-day, and 30-day ahead predictions were made and compared with the actual load. The LSTM accuracy was tested using RMSE and MAPE. In paper [16], the author made a long-term energy consumption prediction using LSTM on real-time industrial data. The LSTM result was compared with the ARMA, ARFIMA, and BPNN prediction results; of these, the LSTM performed best. MAE, MAPE, MSE, and RMSE were used to evaluate the accuracy of the methods.
The contribution of this paper is to forecast the load accurately using well-established machine learning methods. In this paper, real-time load datasets from two different zones are used for prediction. The first load dataset is from Paschim Gujarat Vij Company Ltd. (PGVCL), India, and the second is from NYISO, USA. For both datasets, the machine learning methods RNN and LSTM are applied for load prediction. The accuracy of the forecasted load is calculated using the root mean squared error and the mean absolute percentage error. Further, the machine learning results are compared with the prediction results of time series models, with the aim of achieving a better prediction than the time series models. In most cases the machine learning methods perform well. The time series model results are taken from Ref. [17]; this paper is an extension of that work.
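The two accuracy measures named above can be sketched in a few lines of Python. This is only an illustration of the standard RMSE and MAPE definitions: the function names and the four load values are hypothetical and are not taken from the PGVCL or NYISO datasets.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the load (e.g., MW)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; assumes no actual value is zero."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# Hypothetical hourly loads in MW, invented purely for illustration.
actual = [4000.0, 4200.0, 3900.0, 4100.0]
predicted = [3900.0, 4300.0, 3950.0, 4000.0]

print(f"RMSE = {rmse(actual, predicted):.2f} MW")
print(f"MAPE = {mape(actual, predicted):.2f} %")
```

RMSE is expressed in the unit of the load, which is why the results later in the chapter quote a measured error in MW next to a percentage MAPE.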
The rest of the paper is organized as follows. Section 2 explains the applied machine learning methods, i.e., RNN and LSTM. Section 3 shows the prediction results of the applied machine learning methods for both load datasets. Section 4 briefly concludes the paper.
2. Methodology
2.1 RNN
The RNN was introduced to process sequence data and to recognize patterns in a sequence. It was developed because a feed-forward network either fails to predict the next value in a sequence or predicts it poorly. A feed-forward network is rarely used for sequence prediction because its new output has no relation to the previous output. Let us now see how the RNN solves this problem. Fig. 1.1 illustrates the generalized representation of the RNN, in which a loop carries information from the previous time step to the next time step. For a better understanding, Fig. 1.2 shows the unrolled form of the generalized RNN of Fig. 1.1 [18].
In Fig. 1.2, the input at “t-1” is fed to the network, which gives the output at “t-1.” At the next time step, “t,” the input at time “t” is given to the network along with the information from the previous time step, “t-1,” and together they give the output at “t.” Similarly, the output at “t+1” depends on two inputs: the new input at “t+1” that we feed to the network, and the information coming from the previous time step, “t.” The process continues in the same way [19]. Fig. 1.3 shows the mathematical structure of the RNN. From Fig. 1.3, two generalized equations can be written as follows:
$$h_t = g_h(W_i x_t + W_R h_{t-1} + b_h) \tag{1.1}$$
$$y_t = g_y(W_y h_t + b_y) \tag{1.2}$$
Genomics and neural networks in electrical load forecasting with computational intelligence Chapter | 1 3
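[Figs. 1.1 to 1.3: the generalized RNN with its input and output, the unrolled RNN, and the mathematical structure of the RNN with inputs x0, x1, x2, hidden states h0, h1, h2, outputs y0, y1, y2, and the input, hidden layer, and output weights.]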
where $W_i$ is the input weight matrix, $W_y$ is the output weight matrix, $W_R$ is the hidden layer weight matrix, $g_h$ and $g_y$ are activation functions, and $b_h$ and $b_y$ are the biases. Eqs. (1.1) and (1.2) are used to calculate the values $h_0, h_1, h_2, \ldots$ and $y_0, y_1, y_2, \ldots$, respectively, as shown in Fig. 1.3.
For calculating $h_0$ and $y_0$, let us consider time “t” equal to zero (i.e., $t = 0$), where the input is $x_0$. Substituting $t = 0$ and the input $x_0$ into Eqs. (1.1) and (1.2), we get
$$h_0 = g_h(W_i x_0 + W_R h_{-1} + b_h) \tag{1.3}$$
But in Eq. (1.3) the term $W_R h_{-1}$ cannot be applied because time can never be negative, so Eq. (1.3) can be rewritten as
$$h_0 = g_h(W_i x_0 + b_h) \tag{1.4}$$
$$y_0 = g_y(W_y h_0 + b_y) \tag{1.5}$$
From Eqs. (1.4) and (1.5), we can calculate $h_0$ and $y_0$. Now, let us consider $t = 1$ and the input $x_1$ for calculating $h_1$ and $y_1$. Substituting $t = 1$ and the input into Eqs. (1.1) and (1.2), we get
$$h_1 = g_h(W_i x_1 + W_R h_0 + b_h) \tag{1.6}$$
$$y_1 = g_y(W_y h_1 + b_y) \tag{1.7}$$
From Eqs. (1.6) and (1.7), we can find $h_1$ and $y_1$. Similarly, for the input $x_2$ at $t = 2$, we can calculate the values of $h_2$ and $y_2$. Substituting values into Eqs. (1.1) and (1.2), we get
$$h_2 = g_h(W_i x_2 + W_R h_1 + b_h) \tag{1.8}$$
$$y_2 = g_y(W_y h_2 + b_y) \tag{1.9}$$
From Eqs. (1.8) and (1.9), we can calculate $h_2$ and $y_2$. The calculation continues in the same way up to “n” periods of time. This is how the RNN works mathematically. This method is explained with reference to various sources [11,18,19].
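The recursion in Eqs. (1.1) and (1.2) can be sketched as a short pure-Python loop. This is a minimal illustration, not the implementation used in the chapter: it assumes tanh for g_h and the identity for g_y, it represents the missing hidden state before t = 0 by zeros (which reproduces Eq. (1.4)), and the weight values and the helper names `matvec` and `rnn_forward` are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_i, W_R, W_y, b_h, b_y):
    """Unroll Eqs. (1.1) and (1.2) over the input sequence xs.

    Assumes g_h = tanh and g_y = identity (illustrative choices only).
    The hidden state before t = 0 is taken as zeros, so the W_R term
    vanishes at t = 0, as in Eq. (1.4).
    """
    h = [0.0] * len(b_h)
    hs, ys = [], []
    for x in xs:
        pre = [a + r + b for a, r, b in zip(matvec(W_i, x), matvec(W_R, h), b_h)]
        h = [math.tanh(p) for p in pre]                    # Eq. (1.1)
        y = [o + b for o, b in zip(matvec(W_y, h), b_y)]   # Eq. (1.2)
        hs.append(h)
        ys.append(y)
    return hs, ys

# Hypothetical weights: 1 input feature, 2 hidden units, 1 output.
W_i = [[0.5], [-0.3]]
W_R = [[0.1, 0.2], [0.0, 0.4]]
W_y = [[1.0, -1.0]]
b_h = [0.0, 0.1]
b_y = [0.0]

xs = [[1.0], [0.5], [-1.0]]   # x0, x1, x2
hs, ys = rnn_forward(xs, W_i, W_R, W_y, b_h, b_y)
for t, (h, y) in enumerate(zip(hs, ys)):
    print(f"t={t}  h={h}  y={y}")
```

Each pass through the loop reuses the same three weight matrices, which is the weight sharing that the unrolled diagram in Fig. 1.3 depicts.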
[Fig. 1.4: the LSTM cell, showing the cell state (ct–1 to ct), the hidden state (ht–1 to ht), the input xt, and the forget, input, and output gates built from σ and tanh activations.]
($h_t$). The hidden state is passed from one cell to the next in a chain. An RNN cell contains only a tanh activation, whereas, as Fig. 1.4 shows, the LSTM has a more complex internal cell. In Fig. 1.4, σ denotes the sigmoid activation.
To understand the mathematics behind the LSTM and how its hidden state is calculated, the forget gate, input gate, cell state, and output gate are shown separately in Fig. 1.5A–D, respectively. Before turning to the equations, let us look at the function of the tanh and sigmoid activation layers. The values flowing through the LSTM network are regulated by the tanh activation, which squashes values to between -1 and 1. The sigmoid activation works similarly, except that it squashes values to between 0 and 1. In the vector that comes out of the sigmoid activation, values closer to 0 are forgotten, and values closer to 1 are kept in the network, i.e., in the cell state.
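The output ranges of the two activations can be checked numerically. The short sketch below uses only the standard definitions of tanh and the logistic sigmoid; the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# tanh stays within (-1, 1); sigmoid stays within (0, 1).
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:+.1f}  tanh={math.tanh(z):+.4f}  sigmoid={sigmoid(z):.4f}")
```

Because the sigmoid output lies between 0 and 1, multiplying a value by it acts as a soft switch, which is how the gates described next use it.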
The forget gate is the first step in the LSTM. This gate decides which information should be kept in or removed from the cell state. From Fig. 1.5A, the mathematical representation of the forget gate is expressed as
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{1.10}$$
In Eq. (1.10), $\sigma$ is the sigmoid activation, $W_f$ is the weight, $h_{t-1}$ is the output from the previous time step, $x_t$ is the new input, and $b_f$ is the bias. As Fig. 1.5A and Eq. (1.10) show, to calculate $f_t$ the previous output (previous hidden state) $h_{t-1}$ and the new input $x_t$ are combined and multiplied by the weight; the bias is then added, and the result is passed through the sigmoid activation. The sigmoid activation squashes the values to between 0 and 1; values nearer to 0 are discarded, and values nearer to 1 are kept.
The next step is the input gate, which updates the values of the cell state. To update the cell state, the previous output ($h_{t-1}$) and the present input are passed through the sigmoid activation, which converts the values to between 0 and 1; from this we know which values should be updated. The output of the sigmoid activation is $i_t$. Further, the previous output and the present input are passed through the tanh activation, which squashes values to between -1 and 1 to regulate the network [22]. The output of the tanh activation is $\tilde{C}_t$. From Fig. 1.5B, the mathematical representation of the input gate is expressed as
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{1.11}$$
$$\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \tag{1.12}$$
The next step is to update the old cell state, $c_{t-1}$, into the new cell state, $c_t$. First, the old cell state is multiplied by $f_t$, whose values lie between 0 and 1, so the old cell state values that are multiplied by 0 become 0, i.e., are dropped. Next, the sigmoid activation output ($i_t$) and the tanh activation output ($\tilde{C}_t$) are multiplied; here the sigmoid activation decides what to keep or remove, since $i_t$ has values between 0 and 1. The two products are then added pointwise to give the new cell state, as shown in Fig. 1.5C. The mathematical equation is written as
$$c_t = c_{t-1} \odot f_t + i_t \odot \tilde{C}_t \tag{1.13}$$
FIGURE 1.5 Various gates and cell states are split from LSTM cell to understand the mathematics behind it: (A) forget gate, (B) input gate, (C) cell state, and (D) output gate.
The last step is the output gate, in which the hidden state ($h_t$) is calculated and passed forward to the next time step (next cell). The hidden state is used for prediction, and it carries information about previous inputs. To find the hidden state, the previous hidden state ($h_{t-1}$) and the present input are first passed through the sigmoid activation to get $o_t$. The new cell state ($c_t$) is then passed through the tanh activation. Finally, the tanh activation output and the sigmoid activation output, $o_t$, are multiplied to get the new hidden state $h_t$, as shown in Fig. 1.5D. The mathematical equations are written as
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{1.14}$$
$$h_t = o_t \odot \tanh(c_t) \tag{1.15}$$
Further, the hidden state $h_t$ and the new cell state $c_t$ are carried over to the next time step. This method is explained with reference to various sources [13,23,24].
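Putting Eqs. (1.10) to (1.15) together, one LSTM time step can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the model trained in the chapter: the function names `affine` and `lstm_step`, the dictionary layout of the parameters, and all weight values are hypothetical, and each weight matrix acts on the concatenation [h_{t-1}, x_t].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W v + b for a matrix W (list of rows), a vector v, and a bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. (1.10) to (1.15)."""
    z = h_prev + x_t                                                # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]       # forget gate, Eq. (1.10)
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]       # input gate, Eq. (1.11)
    c_hat = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]   # candidate, Eq. (1.12)
    c_t = [f * c + i * g
           for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]          # cell state, Eq. (1.13)
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]       # output gate, Eq. (1.14)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]              # hidden state, Eq. (1.15)
    return h_t, c_t

# Hypothetical parameters: 2 hidden units, 1 input feature, so each W is 2 x 3.
params = {
    "W_f": [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]], "b_f": [0.0, 0.0],
    "W_i": [[0.2, 0.0, -0.3], [0.1, 0.1, 0.4]], "b_i": [0.0, 0.0],
    "W_c": [[-0.2, 0.3, 0.5], [0.4, 0.0, -0.1]], "b_c": [0.0, 0.0],
    "W_o": [[0.3, -0.2, 0.1], [0.0, 0.2, 0.2]], "b_o": [0.0, 0.0],
}

h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x, h, c, params)   # h and c are carried to the next time step
    print("h =", h, " c =", c)
```

The loop carries both the hidden state and the cell state forward, which matches the statement above that both are passed to the next time step.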
3. Experiment evaluation
3.1 Testing methods effectiveness for PGVCL data
For the PGVCL load dataset the short-term load forecasting was carried out; i.e., day-ahead and week-ahead predictions
were made using RNN and LSTM. The actual observed data provided by PGVCL is from April 1, 2015 to March 31, 2019
(approximately 4 years), and the time horizon is hourly; i.e., each point was observed at each hour in a day. Fig. 1.6 shows
the real-time observed load by PGVCL [25].
For day-ahead prediction, the effectiveness of the methods was checked for March 31, 2019 (24 h). The historical load data from April 1, 2015 to March 30, 2019 form the training set, and the data for March 31, 2019 form the testing set. The prediction for March 31, 2019 is made using the training set. Likewise, for week-ahead prediction, the effectiveness of the methods was checked for March 25, 2019 to March 31, 2019 (each hour in 1 week). The historical load data from April 1, 2015 to March 24, 2019 form the training set, and the data for March 25, 2019 to March 31, 2019 form the testing set. The prediction for March 25, 2019 to March 31, 2019 is made using the training set. Fig. 1.7 illustrates the comparison between the actual PGVCL load data and the load predicted by RNN and LSTM for day ahead.
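The day-ahead and week-ahead splits described above can be sketched on the hourly timestamps alone. The sketch assumes one observation for every hour from April 1, 2015, 00:00 to March 31, 2019, 23:00 with no gaps, which the source does not state explicitly; `split_last` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Assumption: a complete hourly series with no missing observations.
start = datetime(2015, 4, 1, 0)
end = datetime(2019, 3, 31, 23)
n_hours = (end - start) // timedelta(hours=1) + 1
timestamps = [start + timedelta(hours=k) for k in range(n_hours)]

def split_last(series, horizon_hours):
    """Hold out the last `horizon_hours` points as the testing set."""
    return series[:-horizon_hours], series[-horizon_hours:]

train_day, test_day = split_last(timestamps, 24)          # day ahead
train_week, test_week = split_last(timestamps, 7 * 24)    # week ahead

print("day ahead :", len(train_day), "training points,", len(test_day), "testing points")
print("week ahead:", len(train_week), "training points,", len(test_week), "testing points")
print("testing week starts at", test_week[0])
```

Holding out the final block of the series, rather than sampling at random, keeps the testing period strictly after the training period, as the split described in the text requires.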
The load predicted by RNN and LSTM is also compared with the predictions of time series models, as shown in Table 1.1. The time series model results are taken from Ref. [17]. In this paper, we tried to achieve a better prediction with RNN and LSTM and to examine how well the machine learning methods work on the PGVCL load data. Per Table 1.1, the AR (25) model gives a better day-ahead prediction than the machine learning methods (i.e., RNN and LSTM): the AR (25) model gives a prediction result with approximately 98% accuracy (1.92% MAPE) and a measured error of 95.78 MW, while the RNN gives a prediction result with approximately 97% accuracy (2.77% MAPE) and a measured error of 148.83 MW, and the LSTM gives a prediction result with approximately 97% accuracy (2.85% MAPE) and a measured error of 153.38 MW.
Fig. 1.8 illustrates the comparison between the actual PGVCL load data and the load predicted by RNN and LSTM for week
ahead. This predicted load by RNN and LSTM is also compared with the time series models
prediction, as shown in Table 1.2. For the week-ahead case, too, we tried to achieve a better prediction with RNN and LSTM than with the time series models, and we examined how well machine learning methods work on PGVCL load data for weekly prediction. Per Table 1.2, the RNN gives a better week-ahead prediction than the time series models. The RNN achieves approximately 97% accuracy (2.74% MAPE) with 147.94 MW measured error, and the LSTM also works well, achieving approximately 97% accuracy (2.77% MAPE) with 148.35 MW measured error. Both RNN and LSTM predict better than the time series models for the week-ahead case.
prediction result with 3.4 MAPE, i.e., approximately 97% accurate, while the RNN and LSTM give 3.67 and 3.80 MAPE,
respectively. Both RNN and LSTM have the same accuracy as NYISO per MAPE, but the RNN and LSTM work better
than NYISO prediction per RMSE and as shown in the prediction graph in Fig. 1.10.
4. Conclusion
In this paper we have used two machine learning methods, RNN and LSTM, for electrical load forecasting. Both methods are explained in Section 2 on the basis of various sources. The forecasts made by RNN and LSTM are further compared with the predictions of time series models. Overall, the machine learning methods perform better for long sequence predictions. For day-ahead PGVCL load data, the time series model performs better than RNN and LSTM, while for week-ahead PGVCL load data the machine learning methods predict better than the time series model. For week-ahead NYISO load data, the NYISO prediction is better than the machine learning methods in terms of MAPE, but the machine learning methods are better than the NYISO prediction in terms of RMSE.
References
[1] C. Kuster, Y. Rezgui, M. Mourshed, Electrical load forecasting models: a critical systematic review, Sustainable Cities and Society 35 (August 2017) 257–270.
[2] T. Hong, S. Fan, Probabilistic electric load forecasting: a tutorial review, International Journal of Forecasting 32 (3) (July–September 2016) 914–938.
[3] R. Patel, M.R. Patel, R.V. Patel, A review: introduction and understanding of load forecasting, Journal of Applied Science and Computations (JASC) IV (IV) (June 2019) 1449–1457.
[4] M.R. Patel, R. Patel, D. Dabhi, J. Patel, Long term electrical load forecasting considering temperature effect using multi-layer perceptron neural network and k-nearest neighbor algorithms, International Journal of Research in Electronics and Computer Engineering (IJRECE) 7 (2) (April–June 2019) 823–827.
[5] K. Yan, W. Li, Z. Ji, M. Qi, Y. Du, A hybrid LSTM neural network for energy consumption forecasting of individual households, IEEE Access 7 (2019) 157633–157642.
[6] R. Jiao, T. Zhang, Y. Jiang, H. He, Short-term non-residential load forecasting based on multiple sequences LSTM recurrent neural network, IEEE Access 6 (2018) 59438–59448.
[7] R.K. Agrawal, F. Muchahary, M.M. Tripathi, Long term load forecasting with hourly predictions based on long-short-term-memory networks, in: 2018 IEEE Texas Power and Energy Conference (TPEC), 2018, pp. 1–6. College Station, TX.
[8] S. Bouktif, A. Fiaz, A. Ouni, M.A. Serhani, Multi-sequence LSTM-RNN deep learning and metaheuristics for electric load forecasting, Energies 13 (2020) 391.
[9] C. Yao, C. Xu, D. Mashima, V.L.L. Thing, Y. Wu, PowerLSTM: power demand forecasting using long short-term memory neural network, in: International Conference on Advanced Data Mining and Applications (ADMA), 2017, pp. 727–740.
[10] A. Rahman, V. Srikumar, A.D. Smith, Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks, Applied Energy 212 (February 2018) 372–385.
[11] D. Tang, C. Li, X. Ji, Z. Chen, F. Di, Power load forecasting using a refined LSTM, in: 11th International Conference on Machine Learning and Computing (ICMLC ’19), 2019, pp. 104–108. NY-USA.
[12] V. Mansouri, M.E. Akbari, Efficient short-term electricity load forecasting using recurrent neural networks, Journal of Artificial Intelligence in Electrical Engineering 3 (9) (June 2014) 46–54.
[13] L. Wen, K. Zhou, S. Yang, Load demand forecasting of residential buildings using a deep learning model, Electric Power Systems Research 179 (February 2020) 106073.
[14] T. Srivastava, Vedanshu, M.M. Tripathi, Predictive analysis of RNN, GBM and LSTM network for short-term wind power forecasting, Journal of Statistics and Management Systems 23 (1) (February 2020) 33–47.
[15] S. Muzaffar, A. Afshari, Short-term load forecasts using LSTM networks, Energy Procedia 158 (February 2019) 2922–2927.
[16] J.Q. Wang, Y. Du, J. Wang, LSTM based long-term energy consumption prediction with periodicity, Energy 16 (February 2020) 117197.
[17] M.R. Patel, R.B. Patel, D.N.A. Patel, Electrical energy demand forecasting using time series approach, International Journal of Advanced Science and Technology 29 (3s) (March 2020) 594–604.
[18] S. Bouktif, A. Fiaz, A. Ouni, M.A. Serhani, Optimal deep learning LSTM model for electric load forecasting using feature selection and genetic algorithm: comparison with machine learning approaches, Energies 11 (7) (June 2018) 1636.
[19] T. Prasannavenkatesan, Forecasting hyponatremia in hospitalized patients using multilayer perceptron and multivariate linear regression techniques, Concurrency and Computation: Practice and Experience 33 (16) (2021) e6248.
[20] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (November 1997) 1735–1780.
[21] M. Chai, F. Xia, S. Hao, D. Peng, C. Cui, W. Liu, PV power prediction based on LSTM with adaptive hyperparameter adjustment, IEEE Access 7 (2019) 115473–115486.
[22] P. Theerthagiri, I. Jeena Jacob, A. Usha Ruby, V. Yendapalli, Prediction of COVID-19 possibilities using K-nearest neighbour classification algorithm, International Journal of Current Research and Review 13 (06) (2021) 156.
[23] S. Motepe, A.N. Hasan, R. Stopforth, Improving load forecasting process for a power distribution network using hybrid AI and deep learning algorithms, IEEE Access 7 (2019) 82584–82598.
[24] Y. Ma, Q. Zhang, J. Ding, Q. Wang, J. Ma, Short term load forecasting based on iForest-LSTM, in: 2019 14th IEEE Conference on Industrial Electronics and Applications (ICIEA), Xi’an, China, 2019, pp. 2278–2282.
[25] http://www.pgvcl.com.
[26] T. Prasannavenkatesan, Probable forecasting of epidemic COVID-19 in using COCUDE model, EAI Endorsed Transactions on Pervasive Health and Technology 7 (26) (2021) e3.
[27] http://www.energyonline.com/Data/GenericData.aspx?DataId=14&NYISO ISO_Load_Forecast.
Chapter 2
1. Introduction
Cancer is now one of the major causes of death worldwide. Microarray data analysis is one of the most advanced technologies in cancer diagnosis, and it helps physicians detect the disease at an early stage. The technology records gene expressions that provide useful information about complex biologic processes. It not only gives a new approach to biologic phenomena but also helps in analyzing the activities of the genes present in the human body. Because of this, it helps physicians provide an effective and reliable diagnosis and prognosis. Modern health care is well equipped in terms of analysis, diagnosis, and research. This motivates researchers to work in diversified fields, applying recent engineering techniques wherever necessary. Day by day, the types of disease increase, as does the number of patients due to population growth. In most cases the mortality rate also increases, and this produces data on a huge scale. These data describe patients, diseases, and causes of mortality, along with their characteristics. This gives rise to medical data mining, which helps researchers in analysis, diagnosis, prevention, and cure [1]. The modern lifestyle generates many different types of diseases that pass from generation to generation. Gene analysis is therefore extremely important for the present generation and the next. In this work, a deep learning approach is used for microarray data classification. Early identification is one of the most important factors in effective cancer diagnosis. Deoxyribonucleic acid (DNA) microarrays are used to detect cancer arising from gene sequence disorder. Machine learning algorithms are capable of handling high-dimensional data [2]. Typically, such data contain many features for each sample, and a usual classification task is to distinguish healthy from cancerous patients based on their gene expression profile.
Because of the large number of features and the small sample size, microarray data classification is a challenging task for machine learning researchers. To deal with the large feature set, numerous feature selection algorithms have been adopted in the literature, including wrapper and embedded methods [3]. Along with these feature selection algorithms, different machine learning classifiers have been adopted for classifying microarray data. These include support vector machine (SVM) [4], multilayer perceptron [4], fuzzy neural network [5], k-nearest neighbor [6], naive Bayes [7,8], decision tree [9,10], and radial basis function network [11–13]. Most of these works follow a two-step process of feature selection and classification. In some cases, authors have also adopted clustering techniques [14,15] to obtain the relevant gene patterns. For selecting the optimal gene subset, researchers have adopted numerous optimization algorithms, including genetic algorithm [16–18], heuristic algorithm [10], artificial bee colony [7], particle swarm optimization [19], harmony search [20], and memetic algorithm [21]. The major drawbacks of these microarray data analysis models are as follows. Most of these algorithms fail to provide good results on high-dimensional data. Overfitting also occurs because of the traditional architecture of the classifier. Because of the various uncertainties present during data generation, conventional machine learning models fail to provide a reliable classification model. Finally, computational time is high because of the number of steps. To avoid these issues, deep learning-based approaches were adopted for classifying high-dimensional microarray data [22–24]. The performance of the deep learning models was further improved by providing optimized features to the classifier [25,26], and the deep learning models were observed to perform better than traditional machine learning-based classifiers on high-dimensional data [27,28]. Most deep neural network classifiers are, however, sensitive to the specifics of the training data, and they may find a different set of weights each time they are trained. This problem is generally referred to as a neural network having high variance, and it increases the chance of misclassification. Statistical, computational, and representational problems are three major issues found in single classification models. Instead of training a single model, combining multiple models can be a successful way to deal with these issues. The combined model is known as an ensemble model, and the literature shows that it performs better than a single model [29,30].
Microarray data analysis and classification can be considered an effective approach to cancer diagnosis. Researchers have adopted different machine learning-based classifiers to develop automated gene expression data classification models, and it has been observed that the traditional techniques fail to provide satisfactory accuracy. To overcome this issue, various ensemble models have been used for classifying such imbalanced data. The authors of [31] used the RotBoost technique for classifying genetic data. They compared the performance of their model across five different feature selection algorithms and found that RotBoost with independent component analysis features provided a better result. To assess the efficiency of the ensemble model, they also compared its accuracy with nonensemble models such as SVM and decision tree. A novel ensemble classifier was presented in [29] for classifying five different types of microarray data. The authors used an ensemble SVM classifier, which provided better results than the standard, conventional SVM. In [32], the author proposed a novel optimized ensemble model that enhances performance by maximizing behavioral diversity. The model was verified on three different types of microarray data: breast, colon, and prostate cancer data. For selecting suitable features, an ensemble model that combines multiple filters to select relevant features from the original data has also been used [33]. The main objective of that approach was to improve classification accuracy by reducing the dimensionality of the input data, and the results show that ensemble filters gave better accuracy than single filters. An ensemble of multiple neural networks was used for classifying three different types of gene expression data in [34]. Random forest is one of the most popular ensemble models for classifying different data sets. The authors of [15] used this popular bagging-based ensemble model for classifying three different types of cancer data sets. They compared the accuracy obtained with a clustering approach and observed that random forest combined with clustering performed better. For selecting important genetic features, a double-stage model was proposed in [18]. In the first stage, the authors developed an ensemble filter by taking the union and intersection of the top three feature selection algorithms. A genetic algorithm was then used to obtain more relevant features. For classification, three different classifiers (multilayer perceptron, SVM, and k-nearest neighbor) were applied to five types of cancer data sets: colon, lung, leukemia, small-blue-round-cell tumor, and prostate.
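One way to picture the first-stage ensemble filter described for [18] is as set operations over the genes that each filter ranks highest. The sketch below is only an illustration under stated assumptions: the combination is read as set union and set intersection, and the function names, gene names, scores, and the cutoff `k` are all hypothetical rather than taken from [18].

```python
def top_k(scores, k):
    """Return the k feature names with the highest scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def ensemble_filter(score_tables, k):
    """Combine the top-k features of several filters.

    Returns (union, intersection): features picked by at least one
    filter, and features picked by every filter.
    """
    tops = [top_k(scores, k) for scores in score_tables]
    return set().union(*tops), set.intersection(*tops)


# Hypothetical relevance scores from three filter methods.
filter_a = {"g1": 0.9, "g2": 0.8, "g3": 0.1, "g4": 0.4}
filter_b = {"g1": 0.7, "g2": 0.2, "g3": 0.6, "g4": 0.5}
filter_c = {"g1": 0.8, "g2": 0.6, "g3": 0.3, "g4": 0.1}

union, intersection = ensemble_filter([filter_a, filter_b, filter_c], k=2)
```

The union keeps any gene that at least one filter considers relevant, while the intersection keeps only genes on which all filters agree. A second-stage search such as a genetic algorithm could then start from either set.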
In ensemble techniques, multiple submodels each contribute to a combined prediction. This approach is generally referred to as model averaging. Most ensemble models fall into three categories: bagging, boosting, and stacking. The usual ensemble learning models consist of bagging and boosting in a random subspace of the data [35]. The main objective of an ensemble model is to derive a new classifier by combining various base classifiers so that it performs better than any constituent classifier. Fig. 2.1 shows the outline of an ensemble technique, in which the final output of the model is decided by voting. Numerous works in the literature use different ensemble models, and these are discussed in the next subsection.
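The voting step can be sketched in a few lines. This is a minimal illustration of hard (majority) voting, assuming each base classifier has already produced one label per sample. The function name, the classifier outputs, and the labels are hypothetical, and the exact voting rule behind Fig. 2.1 is not specified in the text.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions is a list with one list of labels per base classifier;
    all lists must have the same length.
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers on four samples.
clf_a = ["cancer", "healthy", "cancer", "healthy"]
clf_b = ["cancer", "cancer", "healthy", "healthy"]
clf_c = ["healthy", "cancer", "cancer", "healthy"]

final = majority_vote([clf_a, clf_b, clf_c])
```

With an odd number of classifiers and two classes there are no ties. In other settings a tie-breaking rule, or soft voting over predicted probabilities, would need to be chosen explicitly.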
Second Family—Gobiidæ.
Body elongate, naked or scaly. Teeth generally small, sometimes
with canines. The spinous dorsal fin, or portion of the dorsal fin, is
the less developed, and composed of flexible spines; anal similarly
developed as the soft dorsal. Sometimes the ventrals are united into
a disk. Gill-opening more or less narrow, the gill-membranes being
attached to the isthmus.
Small carnivorous littoral fishes, many of which have become
acclimatised in fresh water. They are very abundant with regard to
species as well as individuals, and found on or near the coasts of all
temperate and tropical regions. Geologically they appear first in the
chalk.
Gobius.—Body scaly. Two dorsal fins, the anterior generally with
six flexible spines. Ventral fins united, forming a disk which is not
attached to the abdomen. Gill-opening vertical, moderately wide.
Fig. 220.—Gobius lentiginosus, from New Zealand.
The “Gobies” are distributed over all temperate and tropical
coasts, and abundant, especially on the latter. Nearly three hundred
species have been described. They live especially on rocky coasts,
attaching themselves firmly with their ventrals to a rock in almost any
position, and thus withstanding the force of the waves. Many of the
species seem to delight in darting from place to place in the rush of
water which breaks upon the shore. Others live in quiet brackish
water, and not a few have become entirely acclimatised in fresh
water, especially lakes. The males of some species construct nests
for the eggs, which they jealously watch, and defend even for some
time after the young are hatched. Several species are found on the
British coast: G. niger, paganellus, auratus, minutus, ruthensparri.
Fossil species of this genus have been found at Monte Bolca.
A very small Goby, Latrunculus pellucidus, common in some
localities of the British Islands and other parts of Europe, is
distinguished by its transparent body, wide mouth, and uniserial
dentition. According to R. Collett it offers some very remarkable
peculiarities. It lives one year only, being the first instance of an
annual vertebrate. It spawns in June and July, the eggs are hatched
in August, and the fishes attain their full growth in the months from
October to December. In this stage the sexes are quite alike, both
having very small teeth and feeble jaws. In April the males lose the
small teeth, which are replaced by very long and strong teeth, the
jaws themselves becoming stronger. The teeth of the females remain
unchanged. In July and August all the adults die off, and in
September only the fry are to be found.
There are several other genera, closely allied to Gobius, as
Euctenogobius, Lophiogobius, Doliichthys, Apocryptes, Evorthodus,
Gobiosoma and Gobiodon (with scaleless body) Triænophorichthys.
Sicydium.—Body covered with ctenoid scales of rather small size.
Cleft of the mouth nearly horizontal, with the upper jaw prominent; lips
very thick; the lower lip generally with a series of minute horny teeth.
A series of numerous small teeth in upper jaw, implanted in the gum,
and generally movable; the lower jaw with a series of conical widely-
set teeth. Two dorsal fins, the anterior with six flexible spines. Ventral
fins united, and forming a short disk, more or less adherent to the
abdomen.
Small freshwater fishes inhabiting the rivers and rivulets of the
islands of the tropical Indo-Pacific. About twelve species are known;
one occurs in the West Indies. Lentipes from the Sandwich Islands is
allied to Sicydium.
Periophthalmus.—Body covered with ctenoid scales of small or
moderate size. Cleft of the mouth nearly horizontal, with the upper jaw
somewhat longer. Eyes very close together, immediately below the
upper profile, prominent, but retractile, with a well-developed outer
eyelid. Teeth conical, vertical in both jaws. Two dorsal fins, the anterior
with flexible spines; caudal fin with the lower margin oblique; base of
the pectoral fin free, with strong muscles. Ventral fins more or less
coalesced. Gill-openings narrow.
The fishes of this genus, and the closely-allied Boleophthalmus,
are exceedingly common on the coasts of the tropical Indo-Pacific,
especially on parts covered with mud or fucus. During ebb they leave
the water and hunt for small crustaceans, and other small animals
disporting themselves on the ground which is left uncovered by the
receding water. With the aid of their strong pectoral and ventral fins
and their tail, they hop freely over the ground, and escape danger by
rapid leaps. The peculiar construction of their eyes, which are very
movable, and can be thrust far out of their sockets, enables them to
see in the air as well as in the water; when the eyes are retracted
they are protected by a membranous eyelid. These fishes are absent
in the eastern parts of the Pacific and on the American side of the
Atlantic; but singularly enough one species reappears on the West
African coast. About seven species are known (including
Boleophthalmus), P. koelreuteri being one of the most common
fishes of the Indian Ocean.
First Family—Cepolidæ.
Body very elongate, compressed, covered with very small cycloid
scales; eyes rather large, lateral. Teeth of moderate size. No bony
stay for the angle of the præoperculum. One very long dorsal fin,
which, like the anal, is composed of soft rays. Ventrals thoracic,
composed of one spine and five rays. Gill-opening wide. Caudal
vertebræ exceedingly numerous.
The “Band-fishes” (Cepola) are small marine fishes, belonging
principally to the fauna of the northern temperate zone; in the Indian
Ocean the genus extends southwards to Pinang. The European
species (C. rubescens) is found in isolated examples on the British
coast, but is less scarce in some years than in others. These fishes
are of a nearly uniform red colour.
Second Family—Trichonotidæ.
Body elongate, sub-cylindrical, covered with cycloid scales of
moderate size. Eyes directed upwards. Teeth in villiform bands. No
bony stay for the angle of the præoperculum. One long dorsal fin,
with simple articulated rays, and without a spinous portion; anal long.
Ventrals jugular, with one spine and five rays. Gill-opening very wide.
The number of caudal vertebræ much exceeding that of the
abdominal.
Small marine fishes, belonging to two genera only, Trichonotus
(setigerus) from Indian Seas, with some of the anterior dorsal rays
prolonged into filaments; and Hemerocoetes (acanthorhynchus) from
New Zealand, and sometimes found far out at sea on the surface.
Third Family—Heterolepidotidæ.
Body oblong, compressed, scaly; eyes lateral; cleft of the mouth
lateral; dentition feeble. The angle of the præoperculum connected
by a bony stay with the infraorbital ring. Dorsal long, with the spinous
and soft portions equally developed; anal elongate. Ventrals
thoracic, with one spine and five rays.
Fig. 222.—Scale from the
lateral line of Hemerocœtes
acanthorhynchus, with lacerated
margin.
Sixth Family—Mastacembelidæ.
Body elongate, eel-like, covered with very small scales. Mandible
long, but little moveable. Dorsal fin very long, the anterior portion
composed of numerous short isolated spines; anal fin with spines
anteriorly. Ventrals none. The humeral arch is not suspended from
the skull. Gill-openings reduced to a slit at the lower part of the side
of the head.
Freshwater-fishes characteristic of and almost confined to the
Indian region. The structure of the mouth and of the branchial
apparatus, the separation of the humeral arch from the skull, the
absence of ventral fins, the anatomy of the abdominal organs,
affords ample proof that these fishes are Acanthopterygian eels.
Their upper jaw terminates in a pointed moveable appendage, which
is concave and transversely striated inferiorly in Rhynchobdella, and
without transverse striæ in Mastacembelus: the only two genera of
this family. Thirteen species are known, of which Rh. aculeata, M.
pancalus and M. armatus are extremely common, the latter attaining
to a length of two feet. Outlying species are M. aleppensis from
Mesopotamia and Syria, and M. cryptacanthus, M. marchei, and M.
niger, from West Africa.
First Family—Sphyrænidæ.
Body elongate, sub-cylindrical, covered with small cycloid scales;
lateral line continuous. Cleft of the mouth wide, armed with strong
teeth. Eye lateral, of moderate size. Vertebræ twenty-four.
This family consists of one genus only, Sphyræna, generally
called “Barracudas,” large voracious fishes from the tropical and sub-
tropical seas, which prefer the vicinity of the coast to the open sea.
They attain to a length of eight feet, and a weight of forty pounds;
individuals of this large size are dangerous to bathers. They are
generally used as food, but sometimes (especially in the West
Indies) their flesh assumes poisonous qualities, from having fed on
smaller poisonous fishes. Seventeen species.
The Barracudas existed in the tertiary epoch, their remains being
frequently found at Monte Bolca. Some other fossil genera have
been associated with them, but as they are known from jaws and
teeth or vertebræ only, their position in the system cannot be exactly
determined; thus Sphyrænodus and Hypsodon from the chalk of
Lewes, and the London clay of Sheppey. The American Portheus is
allied to Hypsodon. Another remarkable genus from the chalk,
Saurocephalus, has been also referred to this family.[44]
Second Family—Atherinidæ.
Body more or less elongate, sub-cylindrical, covered with scales
of moderate size; lateral line indistinct. Cleft of the mouth of
moderate width, with the dentition feeble. Eye lateral, large or of
moderate size. Gill-openings wide. Vertebræ very numerous.
Small carnivorous fishes inhabiting the seas of the temperate and
tropical zones; many enter fresh water, and some have been entirely
acclimatised in it. This family seems to have been represented in the
Monte Bolca formation by Mesogaster.
Atherina.—Teeth very small; scales cycloid. The first dorsal is
short and entirely separated from the second. Snout obtuse, with the
cleft of the mouth straight, oblique, extending to or beyond the anterior
margin of the eye.
The Atherines are littoral fishes, living in large shoals, which habit
has been retained by the species acclimatised in fresh water. They
rarely exceed a length of six inches, but are nevertheless esteemed
as food. From their general resemblance to the real Smelt they are
often thus misnamed, but may always be readily recognised by their
small first spinous dorsal fin. The young, for some time after they are
hatched, cling together in dense masses, and in numbers almost
incredible. The inhabitants of the Mediterranean coast of France call
these newly hatched Atherines “Nonnat” (unborn). Some thirty
species are known, of which A. presbyter and A. boyeri occur on the
British coast.
Atherinichthys, distinguished from Atherina in having the snout
more or less produced; and the cleft of the mouth generally does not
extend to the orbit.
These Atherines are especially abundant on the coasts and in the
fresh waters of Australia and South America. Of the twenty species
known, several attain a length of eighteen inches and a weight of
more than a pound. All are highly esteemed as food; but the most
celebrated is the “Pesce Rey” of Chile (A. laticlavia).
Tetragonurus.—Body rather elongate, covered with strongly
keeled and striated scales. The first dorsal fin is composed of
numerous feeble spines, and continued on to the second. Lower jaw
elevated, with convex dental margin, and armed with compressed,
triangular, rather small teeth, in a single series.
This very remarkable fish is more frequently met with in the
Mediterranean than in the Atlantic, but generally scarce. Nothing is
known of its habits; when young it is one of the fishes which
accompany Medusæ, and, therefore, it must be regarded as a
pelagic form. Probably, at a later period of its life, it descends to
greater depths, coming to the surface at night only. It grows to a
length of eighteen inches.
Third Family—Mugilidæ.
Body more or less oblong and compressed, covered with cycloid
scales of moderate size; lateral line none. Cleft of the mouth narrow
or of moderate width, without or with feeble teeth. Eye lateral, of
moderate size. Gill-opening wide. The anterior dorsal fin composed
of four stiff spines. Vertebræ twenty-four.
The “Grey Mullets” inhabit in numerous species and in great
numbers the coasts of the temperate and tropical zones. They
frequent brackish waters, in which they find an abundance of food
which consists chiefly of the organic substances mixed with mud or
sand; in order to prevent larger bodies from passing into the
stomach, or substances from passing through the gill-openings,
these fishes have the organs of the pharynx modified into a filtering
apparatus. They take in a quantity of sand or mud, and, after having
worked it for some time between the pharyngeal bones, they eject
the roughest and indigestible portion of it. The upper pharyngeals
have a rather irregular form; they are slightly arched, the convexity
being directed towards the pharyngeal cavity, tapering anteriorly and
broad posteriorly. They are coated with a thick soft membrane, which
reaches far beyond the margin of the bone, at least on its interior
posterior portion; this membrane is studded all over with minute
horny cilia. The pharyngeal bone rests upon a large fatty mass,
giving it a considerable degree of elasticity. There is a very large
venous sinus between the anterior portion of the pharyngeal and the
basal portion of the branchial arches. Another mass of fat, of
elliptical form, occupies the middle of the roof of the pharynx,
between the two pharyngeal bones. Each branchial arch is provided
on each side, in its whole length, with a series of closely-set gill-
rakers, which are laterally bent downwards, each series closely
fitting into the series of the adjoining arch; they constitute together a
sieve, admirably adapted to permit a transit for the water, retaining at
the same time every other substance in the cavity of the pharynx.
The lower pharyngeal bones are elongate, crescent-shaped, and
broader posteriorly than anteriorly. Their inner surface is concave,
corresponding to the convexity of the upper pharyngeals, and
provided with a single series of lamellæ, similar to those of the
branchial arches, but reaching across the bone from one margin to
the other.
The intestinal tract shows no less peculiarities. The lower portion
of the œsophagus is provided with numerous long thread-like
papillæ, and continued into the oblong-ovoid membranaceous cœcal
portion of the stomach, the mucosa of which forms several
longitudinal folds. The second portion of the stomach reminds one of
the stomach of birds; it communicates laterally with the other portion,
is globular, and surrounded by an exceedingly strong muscle. This
muscle is not divided into two as in birds, but of great thickness in
the whole circumference of the stomach, all the muscular fasciculi
being circularly arranged. The internal cavity of this stomach is rather
small, and coated with a tough epithelium, longitudinal folds running
from the entrance opening to the pyloric, which is situated opposite
to the other. A low circular valve forms a pylorus. There are five
rather short pyloric appendages. The intestines make a great
number of circumvolutions, and are seven feet long in a specimen
thirteen inches in length.
Fig. 229.—Mugil proboscideus.
Some seventy species of Grey Mullets are known, the majority of
which attain to a weight of about four pounds, but there are many
which grow to ten and twelve pounds. All are eaten, and some even
esteemed, especially when taken out of fresh water. If attention were
paid to their cultivation, great profits could be made by fry being
transferred into suitable backwaters on the shore, in which they
rapidly grow to a marketable size. Several species are more or less
abundant on the British coasts, as Mugil octo-radiatus (Fig. 105, p.
254), M. capito, M. auratus (Fig. 106, p. 254), and M. septentrionalis
(Fig. 107, p. 254), which, with the aid of the accompanying figures,
and by counting the rays of the anal fin, may be readily distinguished
—M. octo-radiatus having eight, and M. capito and M. auratus nine
soft rays. A species inhabiting fresh waters of Central America (M.
proboscideus) has the snout pointed and fleshy, thus approaching
certain other freshwater and littoral Mullets, which, on account of a
modification of the structure of the mouth, have been formed into a
distinct genus, Agonostoma. Myxus comprises Mullets with teeth
more distinct than in the typical species.
This genus existed in the tertiary epoch, remains of a species
having been found in the gypsum of Aix, in Provence.
First Family—Gastrosteidæ.
Body elongate, compressed. Cleft of the mouth oblique; villiform
teeth in the jaws. Opercular bones not armed; infraorbitals covering
the cheek; parts of the skeleton forming incomplete external mails.
Scales none, but generally large scutes along the side. Isolated
spines in front of the soft dorsal fin. Ventral fins abdominal, joined to
the pubic bone, composed of a spine and a small ray.
Branchiostegals three.