
Received January 15, 2021, accepted January 30, 2021, date of publication February 3, 2021, date of current version February 12, 2021.

Digital Object Identifier 10.1109/ACCESS.2021.3056724

Convolutional Neural Network Utilizing Error-Correcting Output Codes Support Vector Machine for Classification of Non-Severe Traumatic Brain Injury From Electroencephalogram Signal
CHI QIN LAI 1, (Member, IEEE), HAIDI IBRAHIM 1, (Senior Member, IEEE),
JAFRI MALIN ABDULLAH 2,3, AZLINDA AZMAN 4,
AND MOHD ZAID ABDULLAH 1, (Member, IEEE)
1 School of Electrical and Electronic Engineering, Engineering Campus, Universiti Sains Malaysia, Nibong Tebal 14300, Malaysia
2 Brain and Behaviour Cluster, School of Medical Sciences, Health Campus, Universiti Sains Malaysia, Kota Bharu 16150, Malaysia
3 Department of Neurosciences, School of Medical Sciences, Health Campus, Universiti Sains Malaysia, Kota Bharu 16150, Malaysia
4 School of Social Sciences, Universiti Sains Malaysia, Penang 11800, Malaysia

Corresponding author: Haidi Ibrahim (haidi_ibrahim@ieee.org)


This work was supported by the Ministry of Higher Education (MoHE), Malaysia, through the Trans-Disciplinary Research
Grant Scheme (TRGS), under Grant 203/PELECT/6768002.

ABSTRACT A sudden blow or jolt to the human brain, called traumatic brain injury (TBI), is one of the most common injuries recorded in health insurance claims. Generally, computed tomography (CT) or magnetic resonance imaging (MRI) is required to identify the trauma's severity. Unfortunately, CT and MRI equipment are bulky, expensive, and not always available, limiting their use in TBI detection. Therefore, as an alternative, this study presents a novel classification architecture that can distinguish non-severe TBI patients from healthy subjects by using resting-state electroencephalogram (EEG) signals as the input. The proposed architecture employs a convolutional neural network (CNN) and an error-correcting output codes support vector machine (ECOC-SVM) to perform automated feature extraction and multi-class classification, avoiding complex feature selection and extraction steps. The proposed architecture attained a high classification accuracy of 99.76% and can potentially be used as a classification approach to prevent healthcare insurance fraud. The proposed method is compared to existing studies in the literature, and the outcome of the comparisons indicates that the proposed method outperforms the benchmarked methods, presenting the highest classification accuracy and precision.

INDEX TERMS Accuracy, batch normalization, convolutional neural networks, electroencephalography, data preparation.

I. INTRODUCTION
Health insurance is an insurance policy covering part of or the entire risk of an individual incurring medical expenses. It provides coverage for payments of benefits as a result of sickness or injury. However, many individuals misuse health insurance by committing health care fraud. Such a scenario has caused a loss of tens of billions of dollars annually to the insurance business [2]. Therefore, health care fraud detection plays an important role. According to the National Health Care Anti-Fraud Association, health insurance fraud happens when an individual or entity defrauds an insurer or insurance company, resulting in unauthorized benefits to him or her or to accomplices [1]. To investigate brain injury claims, the related insurance companies have to receive medical images of the brain from the hospital, and subsequently, these images are analyzed by medical staff. Processing each of the neuroimaging results manually by medical professionals requires a large workforce.

The associate editor coordinating the review of this manuscript and approving it for publication was Ludovico Minati.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
24946 VOLUME 9, 2021
C. Q. Lai et al.: CNN Utilizing ECOC-SVM for Classification of Non-Severe TBI From EEG Signal

Generally, TBI can be divided into three categories: severe TBI, moderate TBI, and mild TBI. Severe TBI claims are more straightforward to verify because patients who suffer from severe TBI are often unconscious [6]. On the other hand, identifying mild or moderate TBI without medical imaging analysis is challenging. Mild TBI is among the most common neurologic conditions, and it is the least severe end of the continuum. Unfortunately, biomarkers of the pathophysiologic effects of mild TBI are still not established for clinical use. Thus, neuroimaging technologies are needed for a reasonable justification of mild TBI. Following a mild TBI, patients often experience acute short-term symptoms such as headache, memory loss, irritability, sensitivity to light, loss of concentration, and fatigue [7]. However, the majority of these symptoms resolve in two to three weeks. Usually, neuroimaging, such as a CT scan, is required to justify mild TBI.

Because the severity of moderate TBI is not permanent, evaluating this type of TBI continues to be a challenging task. Patients with moderate TBI might encounter an acute-phase course where both inter-cranial and intra-cranial trauma may induce secondary brain injury, elevating the severity of the TBI [8]. On the other hand, a study has shown that patients with moderate TBI have fared less well; approximately 60% of the patients show positive recovery [9]. Nonetheless, screening every submitted claim of mild and moderate TBI patients using neuroimaging is not practical because CT or MRI scans are relatively expensive. Besides, these scanners are not always available in hospitals [10].

Electroencephalogram (EEG) has been suggested as an alternative to detect non-severe TBI [11]. Studies have found that TBI biomarkers can be obtained by analyzing the quantitative EEG (qEEG) of the signal's frequency bands (i.e., the alpha, beta, theta, and gamma bands). These studies observed a significant reduction in mean alpha band frequency and an increase in theta band activity in TBI patients, but not in healthy subjects [12]–[15]. Therefore, these findings indicate that EEG contains biomarkers for TBI.

A. RELATED WORK
As EEG has a high temporal resolution and can reflect brain activity, some experimental works have been carried out by researchers to detect TBI from EEG [16], [17]. In the work by Fisher et al. [18], EEG with implanted electrodes for cortical somatosensory evoked potentials (SSEPs) is used to detect and track, in real-time, the neural electrophysiological abnormalities following head injury in an animal model. It was found that, initially, the amplitude of the signal increased over time but dropped significantly after one hour of monitoring. Significant changes were observed in the low-frequency components, together with an increase of EEG entropy prior to 30 minutes post-injury. Their experimental results suggested that cortical SSEPs could be used to detect and monitor TBI.

On the other hand, McBride et al. [19] explored the visual evoked potential EEG of TBI patients. In this study, TBI patients were required to perform memory tasks during EEG recording. Event-related Tsallis entropies were extracted as features to train a support vector machine (SVM) to classify healthy subjects and TBI patients. Their results indicated the potential of EEG for early-stage TBI detection. A review has been carried out by Rapp et al. [20] on the applications of EEG in detecting TBI. From the literature, it is found that active paradigms (i.e., task-related or stimulation-exposure paradigms) are often employed during EEG recording [21], [22].

Although active paradigms have shown promising results in detecting TBI, they require extensive setup time for the EEG recording. These active paradigms are necessary to assess the sensory pathways' functionality and the responses of the human brain post-injury [23]. In contrast, task-free paradigms do not require subjects to be responsive to stimulant tasks, making them a more accessible and better option for TBI detection.

Two similar approaches, which were used to classify severe or mild TBI from healthy samples, are reviewed. In the work of den Brink et al. [24], a Naive Bayes classifier was used to classify EEG signals into two groups: severe TBI patients and healthy samples. The classifier was trained on average power features from the beta band of each electrode and on the EEG connectivity of the delta, theta, and gamma bands extracted from resting-state EEG. First, the signal was pre-processed by applying a notch filter to remove the electrical line noise, followed by a low-pass filter at 100 Hz. Next, a high-pass filter with a 0.5 Hz cutoff is applied. After that, the segments containing artifacts in the signal are manually removed. The resulting signal of each subject is divided into two-second segments. Subsequently, the features were extracted from the resultant segments. The three bands' connectivity is obtained by computing the correlation between the log-transformed orthogonalized amplitude envelopes of the delta, theta, and gamma bands [24]. This approach presented high classification accuracy. Even so, the performance of their classifier heavily relies on the quality of the extracted features. Thus, extensive exploration has to be done to select informative features to ensure effective classifier learning.

McNerney et al. [25] used resting-state EEG and adaptive boosting (AdaBoost) to classify mild TBI. First, a bandpass filter with cutoff frequencies from 0.1 Hz to 100 Hz is applied to the input raw EEG. Subsequently, artifacts and spikes are manually marked and removed from the signal. After that, the power spectral densities (PSD) of the delta, theta, alpha, and gamma bands are extracted as features. The PSD is calculated for the signals of channel AF7 to FpZ and AF8 to FpZ. The base-10 logarithm of the average PSD for each frequency band is used to train an AdaBoost classifier. AdaBoost is a powerful method that creates a highly accurate classifier by combining several weak and inaccurate classifiers, creating a classification model cascade. It is simple and requires little parameter tweaking to achieve high classification results, despite its sensitivity to noisy data and outliers. Thus, to remove external noises, pre-processing becomes an unavoidable stage in the work of McNerney et al. [25].
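As an illustration of the band-limited PSD features described above, the sketch below computes the base-10 logarithm of the average Welch PSD per band for a single channel. It is only a sketch: the exact band edges, PSD estimator settings, and channel pairs used in [25] are not reproduced in this excerpt, so the cutoffs below are the conventional ones and should be treated as assumptions.

```python
import numpy as np
from scipy.signal import welch

# Conventional band edges in Hz (assumed; the reviewed papers do not list exact cutoffs here).
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "gamma": (30, 80)}

def log_band_psd(signal, fs):
    """Base-10 logarithm of the average PSD in each band, the feature type
    described for the AdaBoost classifier of [25]."""
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 2 * fs))
    features = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        features[name] = float(np.log10(psd[mask].mean()))
    return features

# Example: one second of a synthetic 10 Hz (alpha-range) oscillation plus noise.
rng = np.random.default_rng(0)
fs = 1000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(fs)
feats = log_band_psd(x, fs)
# The alpha band should dominate for this test signal.
assert max(feats, key=feats.get) == "alpha"
```

Such per-band features are exactly the kind of hand-crafted input that the proposed CNN-based approach is designed to avoid.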


EEG recordings are usually contaminated with unwanted elements such as noises and artifacts. Pre-processing is crucial to remove these undesirable elements from the signal [26]. Feature extraction and selection are also time-consuming and extensive. Furthermore, extracting features from resting-state EEG can be even more challenging, as it contains less information than EEG recorded with external stimulants. It is therefore preferable for the machine to search and learn from the data itself, especially when resting-state EEG is used.

The convolutional neural network (CNN) is one of the common methods used in developments requiring classification [27]. CNN can overcome the complex design of feature extraction and feature selection. CNN is a machine learning method inspired by biological systems [28], initially proposed for image classification tasks [29]. Due to its great potential in analyzing images pixel by pixel, it is also applicable to EEG analysis [30]–[35]. The data points of EEG can be arranged in matrix form, similar to a matrix of pixels. CNN's topology is based on the multilayer perceptron (MLP), combining input layers, hidden layers, and output layers. The hidden layers include the convolutional layers and the conventional back-propagation neural network dense layers. The convolutional layers are made up of convolutional kernels. These kernels carry learnable parameters that require multiple iterations of learning and validation to determine the optimum values empirically [36]. The convolutional layers play the role of extracting important features from the input matrix through the weighted learnable kernels [37]. Each forward pass of the input matrix computes a feature map. The convolutional layers learn to activate the feature maps when the patterns of interest are detected in the input. Activated feature maps are then down-sampled using the pooling layer and fed forward to the next layer. A fully connected layer (also known as a dense layer) is trained using the feature map. The updates of the learnable parameters in the CNN architecture rely on back-propagation [29] and gradient descent [38].

This article presents an EEG-based non-severe TBI classification model. Using EEG, the model can classify mild TBI, moderate TBI, and healthy subjects, and it is proposed as a classification approach to evaluate the brain's condition to prevent healthcare insurance fraud. CNN utilizing the error-correcting output codes support vector machine (ECOC-SVM) is selected as the machine learning approach to overcome the complex stages of feature extraction and selection.

This article is divided into seven sections. Section I introduces the paper. Then, Section II explains how the data is acquired for this research. In Section III, the methodology of preparing the EEG signal into a compatible input for the proposed architecture is discussed. Then, Section IV gives an overview of the proposed CNN architecture for mild and moderate TBI classification. In Section V, the performance measures and training procedures used to evaluate each experiment are presented. Next, Section VI presents the experiments conducted to design a novel classification architecture. Lastly, Section VII concludes this article.

II. DATA ACQUISITION
The dataset used in this study is provided by Hospital Universiti Sains Malaysia, Kelantan, Malaysia. Ethical approval has been obtained from Universiti Sains Malaysia (USM), with reference number USM/JEPeM/15110486. A total of 36 resting-state eyes-closed EEG recordings were collected from 36 subjects. Age matching between TBI patients and healthy volunteers is used in this work. The age range of all subjects in this study is between 18 and 65 years old. The volunteers that contributed to this dataset are 12 mild TBI patients (mean age = 30.95 ± 11.51), 12 moderate TBI patients (mean age = 34.39 ± 15.79), and 12 healthy persons (mean age = 35 ± 13.24).

In this study, all volunteers are male. Besides, all of the TBI patients sustained nonsurgical mild or moderate TBI according to the Glasgow Coma Scale (GCS), corresponding to a score between 14-15 and 9-12, respectively. All of them suffered an initial hit involving the left frontal-temporal-parietal lobe, as diagnosed by a brain CT scan. EEG had to be recorded from the patients within a time range of a minimum of four weeks to a maximum of ten weeks post-injury (mean time = 7 ± 2 weeks); otherwise, the respective patient was excluded from this study. Each subject must be in a seated position during the recording with eyes closed to obtain the resting-state EEG data. There are no tasks or activities performed during the data acquisition (i.e., task-free EEG recording). EEG is continuously recorded for 350 seconds.

FIGURE 1. Arrangement of EEG channels on the WaveGuard EEG cap.

The EEG signals were continuously recorded using 64 electrodes mounted on a 64-channel WaveGuard EEG cap. The channels' placement is based on the international 10-10 EEG electrode system, which allows a high number of electrode placements. The placement of the electrodes is shown in Figure 1. Therefore, the electrical activities from the scalp will be recorded at 64 sites. However, the CPz channel recording


FIGURE 2. Overview of pre-processing procedure.

is excluded, leaving only 63 useful channels, because the CPz channel was used as an electrooculography (EOG) channel in this study. The ground electrode is located 10% anterior to Fz, linked earlobes served as the reference, and electrode impedances are below five kOhm. EEG signals are recorded using a programmable DC-coupled broadband SynAmps amplifier. The EEG signals are amplified (gain 2500, accuracy 0.033/bit) with a recording range of ±55 mV in the DC to 70-Hz frequency range. The EEG signals are digitized using a sampling frequency Fs of 1000 Hz and 16-bit analog-to-digital converters. The digital EEG signal d of channel i at discrete data point n, which is di[n], is obtained from the analog EEG signal a at the corresponding channel. This signal can be expressed as:

di[n] = ai(nT) = ai(n/Fs)    (1)

The conversion of the analog EEG signal to the digital EEG signal took place by taking samples (i.e., sampling) at each sampling time interval T (i.e., T = one millisecond) of the analog EEG signal [54].

III. DATA PREPARATION AND PRE-PROCESSING
The recorded EEG signals were pre-processed to eliminate unwanted elements that would affect the training of the proposed architecture (i.e., electrical line noises and artifacts). The pre-processing steps used in this study are shown in Figure 2. First, because the electrical line frequency in Malaysia is 50 Hz, a 50 Hz notch filter is applied to remove electrical line noises from the EEG. Next, the resultant signal undergoes a bandpass filter of 0.1 Hz to 100 Hz. It was suggested that the frequency analysis of TBI be limited to a frequency band between 0.1 Hz and 100 Hz, which includes several useful sub-bands (i.e., the delta, theta, alpha, beta, and gamma bands) [20]. From the literature, it can be seen that a bandpass filter of 0.1 Hz to 100 Hz is commonly used in works related to TBI [39]. As physiology is best understood for these frequency bands, using a bandpass filter of 0.1 Hz to 100 Hz enables TBI analysis to be carried out, focusing on the useful frequency range.

The resultant signal then undergoes visual inspection for artifacts. Segments containing artifacts were removed manually from the recording, and the part preceding each removed segment is concatenated to the part following it. The EEG filtering and artifact removal processes were placed before signal downsampling to ensure that there is no aliasing caused by high frequencies. It was recommended to place a lowpass filter to remove high frequencies before downsampling; however, this is optional, as in the state-of-the-art a lowpass filter was not necessarily used before downsampling [24].

Subsequently, the signal is downsampled from 1000 Hz to 100 Hz (i.e., using a downsampling factor D of 10). Downsampling is commonly used in EEG processing tasks, as it can reduce the number of data time points and save computational power [24], [40], [41]. Besides, downsampling can free up memory due to fewer time points, making this method portable and less costly to implement. The downsampled signal xi[n], which is obtained from di[n] in Section II, is defined as:

xi[n] = di[Dn] = di[10n]    (2)

where D is the downsampling factor. The downsampling operates by decimating the signal by D, retaining only every D-th sample and eliminating the rest.

Next, the first 60 seconds of the recording are discarded, as they are usually contaminated by artifacts because subjects are less calm at the early phase of the recording. From the literature, most of the studies used 60 seconds of recordings. This finding indicates that 60 seconds of recording is sufficient for obtaining reliable diagnosis results using qEEG features [39], [42]. Furthermore, the more discriminating characteristics of the EEG are present close to the beginning of the recording [43]. Therefore, the next 60 seconds of the recording are extracted and then divided into 60 segments of one second.

As the input to the proposed method, the EEG is arranged in the form of a matrix of the channels' amplitudes versus time. The arrangement of the channels refers to the default arrangement given by the WaveGuard EEG cap. Because each segment is one second long, the matrix size of the EEG is N × (Fs/D), where N is the number of channels, Fs is the sampling frequency, and D is the downsampling factor used in Equation (2). In this research, the matrix size is 63 × 100 because a sampling rate of 1000 Hz and a downsampling factor D of 10 are used, and the number of channels is 63. As only 60 seconds of the EEG data are used, each EEG recording produces 60 segments stored as matrices. The components of the matrix M of segment k are stored from the EEG data points using the formula:

Mk[i, n] = xi[100(k − 1) + n + 6000]    (3)

where k = 1, 2, 3, . . . , 60; i is the channel of the sampling point (i.e., for this case, i = 1, 2, 3, . . . , 63); n is the sampling point (i.e., for this case, n = 1, 2, 3, . . . , 100); and xi[n] is the


FIGURE 3. Proposed CNN ECOC-SVM Voting Ensembles Architecture.

amplitude of the sampling point of channel i at point n. Note that from Equation (3), the segmentation starts from the 6001-th time point, as the first 60 seconds were omitted. Therefore, there are a total of 2160 EEG matrices (i.e., 60 one-second segments × 12 patients × 3 classes).

TABLE 1. ECOC SVM coding design.

IV. OVERVIEW OF PROPOSED CNN ECOC-SVM VOTING ENSEMBLES ARCHITECTURE
The proposed architecture is divided into two parts. The first part of the architecture performs feature extraction, while the second part performs classification, as shown in Figure 3. The activations from each fully connected layer are used to train three ECOC-SVM classifiers, respectively (i.e., the feature vectors from Layers 3, 4, and 5).

Error-correcting output codes (ECOC) are often used together with SVM to perform multi-class classification, as SVM alone only performs binary classification. ECOC classification requires (i) a coding design that determines the classes on which the binary learners (i.e., SVMs) train, and (ii) a decoding scheme that defines how the predictions of all the binary classifiers are aggregated into the final prediction. The coding design used in this study is a one-versus-one scheme, also known as an exhaustive matrix scheme. The coding design is presented in Table 1. In this work, 1 is the notation for the positive class, -1 is for the negative class, and 0 is for ignoring the class. For example, SVM 1 treats the healthy subjects as the positive class and the mild TBI subjects as the negative class, whereas the moderate TBI class is omitted. The other SVMs are trained similarly.

Each binary classifier outputs a label when making a prediction, generating an output code vector. This output vector is compared with each codeword in the matrix, and the class whose codeword has the closest distance to the output vector is selected as the predicted class. The process of combining the outputs of the individual binary classifiers is known as decoding. In this analysis, the Hamming distance, which counts the number of bits that differ, is used as the decoding method to find the minimum distance between the prediction vector and the codewords. The trained ECOC-SVM classifiers are combined in parallel into a voting ensemble, where the classification decision is based on the highest vote among the classifiers. The overall architecture is presented in Figure 3.

The input to the CNN is a 63 × 100 matrix. In the convolution layer, there are six 3 × 3 filters, sliding with a stride length of one. The activation function used for the convolutional layer is the rectified linear unit (ReLU). Batch normalization is placed after the convolutional layer. The convolution layer's output size will be the same as the input matrix, as zero padding is performed before the convolution. Six 63 × 100 feature maps are produced and directed to a 2 × 2 average pooling layer with a stride length of two. The average pooling layer downsamples the feature maps, resulting in six 31 × 50 feature maps. These six feature maps are flattened into a long feature vector of size 9300 (i.e., 6 × 31 × 50) and then directed into the fully connected (FC) layers. Referring to Figure 3, Layers 3 and 4 each contain 128 neurons, producing feature vectors of 128 activations, respectively, whereas Layer 5 has a feature vector with three activations. The features from Layers 3, 4, and 5 are used to train three ECOC-SVM classifiers, respectively. The ECOC-SVM classifiers are subsequently used to form a majority voting ensemble algorithm.
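The dimension bookkeeping described above can be checked with a few lines of plain Python. This is only shape arithmetic under the stated design (same-padded 3 × 3 convolution, non-overlapping 2 × 2 average pooling), not an implementation of the trained network:

```python
def conv_same(h, w, n_filters):
    """'Same' zero-padded convolution keeps the spatial size."""
    return h, w, n_filters

def avg_pool(h, w, c, size=2, stride=2):
    """Pooling window shrinks each spatial dimension."""
    return (h - size) // stride + 1, (w - size) // stride + 1, c

# 63 channels x 100 time points per one-second segment (Section III).
h, w, c = conv_same(63, 100, 6)   # six 3x3 filters with zero padding
h, w, c = avg_pool(h, w, c)       # 2x2 average pooling
flat = h * w * c                  # flattened feature vector

assert (h, w, c) == (31, 50, 6)
assert flat == 9300               # matches the 6 x 31 x 50 vector in the text
```

The arithmetic confirms that the reported 31 × 50 maps and the 9300-element flattened vector are consistent with a pooling stride of two.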

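The one-versus-one coding and Hamming decoding described above can be sketched as follows. Table 1 itself is not reproduced in this excerpt, so the learner ordering and coding matrix below are assumptions consistent with the description (e.g., SVM 1: healthy positive, mild TBI negative, moderate TBI ignored), and signed ±1 predictions are assumed for the binary SVM outputs:

```python
import numpy as np

CLASSES = ["healthy", "mild TBI", "moderate TBI"]
# One-versus-one coding matrix (rows: classes, columns: binary SVMs).
# 1 = positive class, -1 = negative class, 0 = class ignored by that learner.
CODE = np.array([[ 1,  1,  0],
                 [-1,  0,  1],
                 [ 0, -1, -1]])

def hamming_decode(preds):
    """Pick the class whose codeword disagrees with the fewest binary
    predictions, ignoring the zero (omitted-class) entries."""
    preds = np.asarray(preds)
    dists = [(row != preds)[row != 0].sum() for row in CODE]
    return CLASSES[int(np.argmin(dists))]

def majority_vote(labels):
    """Ensemble decision: the most frequent label among the ECOC-SVM outputs."""
    values, counts = np.unique(labels, return_counts=True)
    return str(values[np.argmax(counts)])

# Binary outputs that match the assumed 'mild TBI' codeword:
assert hamming_decode([-1, 1, 1]) == "mild TBI"
# Three ECOC-SVM classifiers voting (one trained per FC layer in the text):
assert majority_vote(["mild TBI", "mild TBI", "healthy"]) == "mild TBI"
```

For three classes, the one-versus-one design yields exactly three binary learners, so each class's codeword has one ignored entry.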
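The segmentation rule of Equation (3) in Section III can likewise be checked numerically. The array below is a synthetic stand-in for one downsampled recording, not real EEG data; its values encode their own time indices so the slicing is verifiable:

```python
import numpy as np

# 63 channels x 35000 time points (350 s at 100 Hz after downsampling).
n_channels, n_points = 63, 35000
x = np.tile(np.arange(n_points), (n_channels, 1))

def segment(x, n_segments=60, seg_len=100, offset=6000):
    """Equation (3): drop the first 60 s (6000 points at 100 Hz) and cut
    the next 60 s into one-second 63 x 100 matrices."""
    return np.stack([x[:, offset + k * seg_len : offset + (k + 1) * seg_len]
                     for k in range(n_segments)])

segments = segment(x)
assert segments.shape == (60, 63, 100)
# The first segment starts at the 6001-th sample, as stated in the text.
assert segments[0, 0, 0] == 6000
```

With 60 such matrices per subject and 12 subjects in each of the three classes, this reproduces the 2160 EEG matrices reported above.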

TABLE 2. Parameters and values.

The six parameters chosen for the proposed architecture are shown in Table 2. A learning rate of 0.001 is selected and remains constant throughout the training of the CNN. L2 normalization is used to perform batch normalization of the convolutional layer. The mini-batch size for every iteration is set at 16. The training iterations per epoch are fixed at 30. L2 regularization, with a factor of 0.0005, is utilized to avoid overfitting. The optimizer used for the back-propagation in CNN training is adaptive moment estimation (ADAM).

V. TRAINING PROCEDURE AND PERFORMANCE MEASURE
In bioinformatics applications, a small dataset often becomes an issue due to unforeseen restrictions, such as the limited number of patients. A small dataset causes the evaluation of classifiers to be optimistically biased and inaccurate in estimating their performance. Data augmentation can be done to increase the data, as is commonly seen in image classification. However, augmentation of mild or moderate TBI patients' EEG can increase the classification error, as random noises may be added during augmentation.

The bootstrap method [44] is used in this work to overcome small dataset issues in the evaluation of the proposed architecture. The bootstrap method is a resampling approach that generates bootstrap sample sets used to quantify the uncertainties associated with a machine learning method despite using a small dataset. It can evaluate a proposed machine learning architecture on a small dataset by providing a percentile confidence interval of the performance measures (i.e., classification accuracy and precision). In short, the bootstrap resampling approach allows the user to mimic the process of obtaining a new dataset to estimate the performance of the proposed architecture without generating new samples. The bootstrapping concept is explained in three steps. First, a random sample is selected from the original dataset. Next, the random sample is added to the new dataset and then returned to the original dataset. These two steps repeat until the bootstrap sample set reaches the fixed number of samples. In the machine learning approach, the size of the bootstrap sample set generated is equal to the size of the original dataset [45]. Therefore, certain samples may be represented repeatedly, while others may not be selected at all [45]. Bootstrapping is a useful approach, as the trained machine learning model's prediction results on the bootstrap sample sets often present a Gaussian distribution. In our work, bootstrap resampling is performed on all 36 subjects before segmentation is done, and the resampling sample size is set to 36.

Efron suggested that 250 iterations can give useful percentile intervals [44]. Therefore, to obtain the optimal parameters and design of the proposed architecture, 250 iterations of the resampled bootstrap sample set are used. To achieve a more ambitious measure of confidence intervals, Efron suggested a minimum of 1000 iterations of the resampled bootstrap sample set [44]. Thus, in the evaluation of the final proposed architecture, 2000 iterations of bootstrap resampling are performed. 3-fold cross-validation is performed on each bootstrap sample. From the cross-validation, two quantitative evaluations are recorded for each generated bootstrap sample set (i.e., accuracy and precision). Box plots were plotted for the evaluation of the results.

VI. EXPERIMENTS, RESULTS AND DISCUSSIONS
Experiments have been done using a simple hill-climbing approach to determine the optimum parameters and architecture. The search stops when the performance shows a downtrend, and the parameter with the best performance is selected. Figure 4 shows the flow of the experiments that were conducted to design the proposed architecture. There are no clear guidelines for the design and parameter setup of a CNN architecture. For this study, the first two parameters determined are the learning rate and the mini-batch size. This decision is to ensure effective learning before fine-tuning the other parameters. If the optimum learning rate and mini-batch size were not obtained beforehand, the architecture's learning would be affected. Thus, the experiments' flow began with the search for the optimum learning rate and mini-batch size, subsequently followed by fine-tuning the design of the architecture. As the experiments' input to determine the optimum architecture, raw EEG data underwent the downsampling and segmentation procedures mentioned in Equations (2) and (3), respectively. The final architecture was evaluated with both raw and pre-processed EEG to explore the potential of the CNN in extracting quality features from the raw EEG signal.

There are nine subsections in the following section. Section VI-A discusses the experiments conducted to select the optimum learning rate. Section VI-B presents the selection of the mini-batch size for the training of the CNN. Next, Section VI-C discusses the experiments conducted to determine the optimal number of convolutional layers. Experiments have been done to determine the suitable type of pooling layer and are discussed in Section VI-D. Section VI-E presents the experiments conducted to select the optimum number of filters and the filter size used in the convolutional layer. Section VI-F presents the experiments to determine the optimum architecture of the fully connected layer. Two optimizers that are commonly used in the state-of-the-art are stochastic gradient descent (SGD) and ADAM. Experiments have been conducted to determine the best optimizer. These experiments are presented in Section VI-G together with their results. In Section VI-H, the experiments conducted to construct and evaluate the final proposed architecture are


presented and discussed in detail. Pre-processing has always been one of the crucial stages in determining the input quality to the CNN. The CNN carries the potential to extract features by selectively ignoring the unwanted elements. Experiments are conducted using pre-processed data and downsampled raw data on the final architecture later in Section VI-H. Finally, the proposed architecture is compared to similar works in the literature, as well as our previous studies, in Section VI-I.

FIGURE 4. Flowchart of the experiments to determine the proposed architecture.

A. SELECTION OF OPTIMUM LEARNING RATE
The learning rate is an important parameter that determines the update step of the learnable weights for backpropagation learning [46]. When the learning rate is too large, the gradient descent can recklessly increase rather than decrease the training error. The learning rate experiments were conducted using an initial CNN with one convolutional layer with six 5 × 5 filters, one 2 × 2 max-pooling layer, a mini-batch size of 32, and the ADAM optimizer. The learning rates used are 0.1, 0.01, 0.001, and 0.0001, respectively.

The accuracy and precision for each learning rate are shown in Figure 5. From Figure 5(a), the results show that the architecture's accuracy increased (i.e., based on the median accuracy) as the learning rate decreased from 0.1 to 0.001 and dropped when the learning rate fell to 0.0001. From Figure 5(b), in terms of the distribution of the obtained precision values, the learning rate of 0.001 also shows the best performance.

The training times for the CNN using different learning rates were also recorded and are shown in Figure 6. It can be seen that the training time increased when the learning rate decreased. By tolerating some learning time, the learning rate of 0.001 showed the best performance; thus, it will be used as one of the parameters in designing the proposed architecture.

B. SELECTION OF OPTIMUM MINI BATCH SIZE
In CNN learning, the training set is divided into a number of mini-batches, each consisting of a small number of training samples. The mini-batch size controls the accuracy of the estimate of the error gradient during CNN training. The error gradient is used to update the CNN model weights, and the process repeats.

A study has shown that using a mini-batch size that is too large can cause significant degradation in the quality of the trained CNN model due to a lack of generalization ability and convergence to a sharp minimum [48]. Thus, the optimum mini-batch size has to be found to ensure a better convergence rate and better stability of the CNN training [46].

A mini-batch size of 32 has been introduced as the default value based on the recommendations of some studies [47], [49]. However, to explore the performance of smaller mini-batch sizes, 8 is selected as the starting point rather than 32. Experiments were conducted using mini-batch sizes of 8, 16, 32, 64, and 128 to determine the optimum value, using a CNN architecture with one convolutional layer of six 5 × 5 filters, one 2 × 2 max-pooling layer, the learning rate of 0.001, and the ADAM optimizer. Figure 7(a) and (b) show their accuracy and precision, respectively. From this figure, it is shown that the mini-batch size of 16 gave the best performance in terms of accuracy and precision.

C. SELECTION OF OPTIMUM NUMBER OF CONVOLUTION LAYERS
ing error. On the other hand, using a learning rate that is too The convolution layer of a CNN network performs unsu-
small can cause slow training and might cause invariable high pervised feature extraction through convolution operation.
training errors. Therefore, determining the optimum learning Studies have shown CNN can capture the spatial and tem-
rate is crucial to optimize the search for the minimum point poral dependencies without neglecting the pixel dependen-
of loss in backpropagation learning. cies throughout and has shown high performance in image
The current study shows that a good learning rate can be applications. Thus, the convolution layer advances the feature
estimated by initiating a larger learning rate and decreas- extraction stage from EEG as it can extract the correlation
ing it at each iteration using 0.1 as a starting point [47]. between channels of the EEG by sliding across the input using
Experiments are carried out by varying the learning rate filters.
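This sliding-filter feature extraction, together with the zero-padding used to preserve edge features, can be sketched in a few lines. The following is a minimal pure-Python illustration, not the paper's implementation; the toy channels × data-points matrix and the identity kernel are assumptions for demonstration only:

```python
# Minimal sketch of a 'same' 2-D convolution over an EEG-like matrix
# (channels x data points). Matrix values and the 3x3 kernel are illustrative.

def conv2d_same(x, k):
    """Slide kernel k over x, zero-padding so the output keeps the input size."""
    n, m = len(x), len(x[0])
    kn, km = len(k), len(k[0])
    pn, pm = kn // 2, km // 2  # implicit zero-padding width on each side
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for a in range(kn):
                for b in range(km):
                    r, c = i + a - pn, j + b - pm
                    if 0 <= r < n and 0 <= c < m:  # outside the matrix -> zero
                        s += x[r][c] * k[a][b]
            out[i][j] = s
    return out

eeg = [[1.0, 2.0, 3.0, 4.0],
       [5.0, 6.0, 7.0, 8.0],
       [9.0, 10.0, 11.0, 12.0]]              # 3 channels x 4 data points
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # passes the input through unchanged

fmap = conv2d_same(eeg, identity)
```

With 'same' zero-padding, the output feature map keeps the 3 × 4 shape of the input, so features at the edges of the EEG matrix are not discarded.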

FIGURE 5. a) Accuracy for various learning rates. b) Precision for various learning rates.

FIGURE 6. Training time for different learning rates.

The EEG is arranged in two dimensions (i.e., channels × data points), similar to an image. Performing the convolution operation on the signal can extract data-point dependencies as features. Zero-padding is performed before the convolution operation to preserve features existing at the edges of the EEG matrix. This approach helps to preserve the original input size, which can reduce the information lost during feature extraction.

Experiments have been conducted by increasing the number of convolutional layers (i.e., from one to three), each with six 5 × 5 filters, in a CNN architecture with one 2 × 2 max-pooling layer, a mini-batch size of 16, a learning rate of 0.001, and the ADAM optimizer. The accuracy and precision for each CNN architecture with a different number of convolutional layers are shown in Figure 8.

As shown in Figure 8, the CNN architecture with one convolutional layer outperformed the rest of the compared multi-convolutional-layer CNNs. In further comparison, it can be seen that there is a performance dip when the number of convolutional layers is increased to two.

Adding more convolution layers beyond a certain threshold can lead to the extraction of irregularities in the data, causing performance degradation. Therefore, one convolutional layer is sufficient to extract feature maps containing informative features. From the experiment outcome, a CNN with one convolutional layer is selected.

D. SELECTION OF POOLING TYPE
The pooling layer is commonly placed after the convolutional layer to reduce the dimension of the feature map, which helps to reduce the computational load and processing time. The pooling layer works by sliding an n × n kernel through the feature map resulting from the previous convolutional layer, where n is the size of the kernel. Two types of pooling layers are commonly used (i.e., max pooling and average pooling). In the max-pooling operation, the kernel takes the maximum value within the n × n window as the output. On the other hand, the average-pooling operation calculates the average value within the kernel and stores it in the output matrix.

To investigate the suitable type of pooling, two experiments were conducted separately, using max pooling and average pooling with a 2 × 2 kernel size, on a CNN architecture made up of one convolutional layer with six 5 × 5 filters, a 0.001 learning rate, and a mini-batch size of 16. In Figure 9, the architecture with an average-pooling layer performed slightly better than the one with the max-pooling layer, indicating that average pooling is the better option in this EEG application.

The max-pooling operation works by obtaining extreme values. However, in the EEG application, extreme input values usually represent noise, artifacts, and peaks. These unwanted elements cause inaccurate classification. On the other hand, average pooling takes into account all the values in the kernel, thus retaining most of the information.

E. SELECTION OF OPTIMUM FILTER SIZE AND NUMBER OF FILTERS
Two parameters determine the convolution process: the number of filters in the convolutional layer and the filter size.
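The max-versus-average pooling contrast described in Section D can be sketched as follows; this is a minimal pure-Python illustration with made-up feature-map values, where the spike of 100.0 stands in for an EEG artifact:

```python
# Sketch of 2x2 max pooling vs. average pooling (stride 2) on a toy
# feature map. The value 100.0 mimics an extreme artifact/peak; all
# numbers here are illustrative, not measured data.

def pool2x2(fm, reduce_fn):
    """Apply reduce_fn over non-overlapping 2x2 windows of fm."""
    out = []
    for i in range(0, len(fm) - 1, 2):
        row = []
        for j in range(0, len(fm[0]) - 1, 2):
            window = [fm[i][j], fm[i][j + 1], fm[i + 1][j], fm[i + 1][j + 1]]
            row.append(reduce_fn(window))
        out.append(row)
    return out

fm = [[1.0, 2.0, 100.0, 2.0],   # 100.0: an extreme value (artifact/peak)
      [3.0, 4.0, 2.0, 4.0]]

max_pooled = pool2x2(fm, max)                        # takes the extreme value
avg_pooled = pool2x2(fm, lambda w: sum(w) / len(w))  # keeps all contributions
```

Max pooling propagates the artifact value unchanged (100.0), whereas average pooling dilutes it (27.0) while still retaining a contribution from every value in the window.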

FIGURE 7. a) Accuracy for various mini-batch sizes. b) Precision for various mini-batch sizes.

FIGURE 8. a) Accuracy for various numbers of convolutional layers. b) Precision for various numbers of convolutional layers.

The number of filters in the convolutional layer determines the number of resultant feature maps, while the filter size is the parameter that determines the quality of the feature maps (i.e., a smaller filter size extracts finer features). Experiments were conducted using different filter sizes with different numbers of filters in the convolutional layer, on a CNN architecture made up of one convolutional layer, one 2 × 2 average-pooling layer, and one FC layer with three neurons, with a mini-batch size of 16, a learning rate of 0.001, and an ADAM optimizer, which is the combined outcome of the previous experiments.

Three filter sizes were evaluated (i.e., 3 × 3, 5 × 5, and 7 × 7). The experiments conducted using 3 × 3 filters are shown in Figure 10. The performance of the architecture improved when the number of 3 × 3 filters was increased from one filter to four filters. However, further increasing the number of 3 × 3 filters to five showed a degradation in the performance measures. Experiments were then conducted using six and seven 3 × 3 filters, respectively. It was shown that six 3 × 3 filters performed better than the architecture with four 3 × 3 filters. On the other hand, seven 3 × 3 filters caused degradation.

The experiments were repeated using 5 × 5 filters, and the results are presented in Figure 11. When the number of 5 × 5 filters was increased to two, the performance improved. Experiments were further conducted using three to eight 5 × 5 filters, respectively. As the number of filters was increased to four, the performance of the architecture dropped. However, there was a slight improvement when the number of filters increased from five to seven. When eight 5 × 5 filters were used, the architecture's performance worsened.

FIGURE 9. a) Accuracy for different pooling types. b) Precision for different pooling types.

FIGURE 10. a) Accuracy for different numbers of 3 × 3 filters. b) Precision for different numbers of 3 × 3 filters.

Experiments were also conducted using the same CNN architecture with 7 × 7 filters, and the results are presented in Figure 12. Incrementing the number of 7 × 7 filters to two improved the performance of the architecture. Nevertheless, the architectures using three and four 7 × 7 filters deflated the results.

From the results, the CNN architecture with six 3 × 3 filters performed the best. As the number of filters increased above a threshold, the performance of the architecture worsened. This result happens because the extra feature maps extracted did not supply useful features to the learning of the architecture. Extra non-informative features can confuse the learning of the architecture and cause a dip in performance.

This result suggests that the six feature maps extracted by the six 3 × 3 filters presented finer and more informative features that benefit the learning of the architecture. On the other hand, the under-performance of the architectures with 7 × 7 filters indicates that this filter size is too large, such that it misses too many fine features. Following the outcome of the conducted experiments, six 3 × 3 filters are used for the convolutional layer in the proposed architecture.
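The filter-size and filter-count experiments above amount to a small grid search over two hyperparameters. The sketch below assumes a hypothetical evaluate() callback and placeholder scores; the numbers are illustrative only, not the paper's measured accuracy or precision values:

```python
# Sketch of the filter-size / filter-count sweep framed as a grid search.
# The evaluate() scores below are placeholders, not the paper's results.

def grid_search(filter_sizes, filter_counts, evaluate):
    """Try every (size, count) pair and keep the best-scoring configuration."""
    best_cfg, best_score = None, float("-inf")
    for size in filter_sizes:
        for count in filter_counts:
            score = evaluate(size, count)  # e.g., bootstrap-validated accuracy
            if score > best_score:
                best_cfg, best_score = (size, count), score
    return best_cfg, best_score

# Placeholder score table peaking at six 3x3 filters, mirroring the reported
# trend; unlisted configurations fall back to a baseline of 0.5.
toy_scores = {(3, 6): 0.70, (3, 4): 0.66, (5, 2): 0.62, (7, 2): 0.60}
cfg, score = grid_search([3, 5, 7], [2, 4, 6],
                         lambda s, c: toy_scores.get((s, c), 0.5))
```

In practice, evaluate() would train the candidate CNN and return a bootstrap-validated performance measure for that configuration.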

FIGURE 11. a) Accuracy for different numbers of 5 × 5 filters. b) Precision for different numbers of 5 × 5 filters.

FIGURE 12. a) Accuracy for different numbers of 7 × 7 filters. b) Precision for different numbers of 7 × 7 filters.

F. SELECTION OF OPTIMUM FULLY CONNECTED LAYER ARCHITECTURE
The role of the fully connected (FC) layer is to take the output from the last pooling layer and then perform the classification. There are no guidelines for the architecture of the FC layers of a CNN (i.e., the number of layers and the number of neurons in each layer). Nine possible FC layer architectures were tested with a CNN using the best parameters determined so far (i.e., one convolutional layer with six 3 × 3 filters, one 2 × 2 average-pooling layer, a mini-batch size of 16, a learning rate of 0.001, and the ADAM optimizer).

The results of this experiment are presented in Figure 13. Firstly, one FC layer with three neurons was tested with the defined CNN architecture. Next, an additional FC layer with 32 neurons was added above the first FC layer. The experiment was then repeated by changing the number of neurons in the second FC layer to 64, 128, and 256 neurons, respectively.

FIGURE 13. a) Accuracy for various FC architectures for the CNN. b) Precision for various FC architectures for the CNN.

From Figure 13, the results indicate that increasing the number of neurons on the second FC layer from 32 to 128 improved the CNN architecture's performance. However, a further increment of the neurons in the second FC layer degraded the architecture's performance.

The next experiment was conducted using three FC layers (i.e., two FC layers with 128 neurons and one FC layer with three neurons). It was shown that the three-FC-layer architecture performed worse than the architecture with one 128-neuron FC layer and one three-neuron FC layer. Subsequently, a CNN architecture with three 128-neuron FC layers and one three-neuron FC layer was tested, and it was found that this setup did not improve on the performance of the previous architecture.

On further investigation, a CNN architecture with four 128-neuron FC layers and one three-neuron FC layer was tested. The results worsened further compared with the architecture with three 128-neuron FC layers and one three-neuron FC layer; the performance did not improve but dropped in comparison with the previous architecture.

The outcome of all the experiments conducted in this section indicates that the CNN architecture with one 128-neuron FC layer and one three-neuron FC layer performed the best. Architectures with FC layers consisting of too few neurons (i.e., 32 and 64) caused a low learning capability of the network and resulted in underfitting. The lower number of neurons in the FC layer failed to detect and learn from the flattened feature map. This scenario can also be seen in the architecture with one FC layer of three neurons.

On the other hand, using too many neurons in the FC layers results in overfitting. Overfitting occurs when the architecture takes up too much information processing capacity; thus, the limited amount of information in the training set is insufficient to fit all the neurons. An indication of overfitting can be seen in the architecture with one 256-neuron FC layer and one three-neuron FC layer: its performance degraded compared with the architecture with one 128-neuron FC layer and one three-neuron FC layer.

Besides, the CNN architectures with more than three FC layers are also evidence of overfitting, where the performance reduced further compared with the CNN architecture with three FC layers. In summary, the two best-performing architectures, used in the subsequent experiments to facilitate the design of the proposed architecture, are (i) the architecture with one 128-neuron FC layer and one three-neuron FC layer; and (ii) the architecture with two 128-neuron FC layers and one three-neuron FC layer.

G. SELECTION OF OPTIMIZER FOR BACKPROPAGATION
Two optimizers are evaluated in this study (i.e., stochastic gradient descent (SGD) with momentum and adaptive moment estimation (ADAM)). Both of these optimizers were evaluated using a CNN architecture made up of one convolutional layer with six 3 × 3 filters, one 2 × 2 average-pooling layer, one 128-neuron FC layer, and one three-neuron FC layer, with a mini-batch size of 16 and a learning rate of 0.001. Based on the results in Figure 14, ADAM performed better than SGD on all of the performance measures. Both optimizers presented a low standard deviation and a small range of confidence intervals, indicating stable performance.

The original SGD without momentum oscillates along the steepest-descent path towards the optimum, making it harder for the architecture to settle into a local minimum. Adding a momentum term to the weight update can overcome this issue by accumulating momentum in the direction of consistent gradients and discarding the momentum if the gradients are in opposite directions [51]. SGD with momentum shows comparable performance and converges faster than the original SGD, as bigger steps are taken in the same direction following the momentum.

However, in this study, it was observed that SGD with momentum presents lower performance than ADAM. SGD with momentum solves the issue of oscillation, but it can easily overshoot a local minimum due to the large steps taken following the momentum. This can result in swinging back and forth around the local minimum due to the overflow of momentum, so that the local minimum is hardly ever reached.

VOLUME 9, 2021 24957


C. Q. Lai et al.: CNN Utilizing ECOC-SVM for Classification of Non-Severe TBI From EEG Signal

FIGURE 14. a) Accuracy for Different Optimizer b) Precision for Different Optimizer.

On the other hand, ADAM is an optimizer that is a combi- respectively. The performance suggested SoftMax classifier
nation of SGD with momentum and root mean square propa- at the output of the last FC layer did not perform well in clas-
gation (RMSProp). Therefore, ADAM carries the advantage sification. Hence, it becomes a motivation to propose SVM
of momentum solving the problem of noisy oscillation and to replace the SoftMax classifier. The error-correcting output
also the strong side of RMSProp that changes the step size by coding (ECOC) algorithm is introduced to combined with
adapting to the gradient. SVM to perform multi-class classification. SVM is a robust
Based on the experimental results, advantages carried by and powerful binary classifier due to its ability to perform
ADAM are useful to enable efficient architecture learning. class separation and the kernel space facilities. Combining
Computing a unique learning rate for each weight and bias SVM with the ECOC algorithm can efficiently handle the
becomes a more compatible method for this study, enhancing multi-class problem by utilizing the binary set of ECOC with
the architecture’s learning. Hence, ADAM is selected as the suitable coding rules to achieve a non-linear classification
optimizer for the proposed architecture. while reducing the trained models’ bias and variance.
To make full use of the ECOC-SVM, activations from
H. CONSTRUCTION OF PROPOSED ARCHITECTURE the fully connected layers are used as feature vectors. Three
1) EVALUATION OF POTENTIAL ARCHITECTURES experiments were first conducted using architecture 1. First,
From all the conducted experiments, two potential architec- Architecture 1 is trained according to the defined training
tures are selected as the proposed architecture. Both of the procedure. Activations from each of the fully connected lay-
potential architectures are summarized in Table 3. The main ers in the architecture (i.e., from Table 3, Layer 3, and
difference between Architecture 1 and 2 is the number of 4) are extracted as features, respectively. The results of the
FC layer. Architecture 1 has two FC layers with 128 and three experiments are presented in Figure 15. Two experi-
3 neurons, respectively, while Architecture 2 has three FC ments were conducted using features from Layer 3 and Layer
layers with two of them with 128 neurons and one with 4 to train two ECOC-SVM separately. In contrast, the third
3 neurons, respectively. experiment was conducted using the concatenation of fea-
tures from Layer 3 and Layer 4 to train an ECOC-SVM
TABLE 3. Potential CNN Architectures.
classifier.
From Figure 15, each trained ECOC-SVM showed
improvement compared to the architecture 1 using SoftMax
as the classification tool. ECOC-SVM classifier trained using
feature vector from Layer 3, Layer 4, and a combination of
features from Layer 3 and 4 achieved classification accuracy
above 68%. The features extracted from each FC layer is the
linear combination of the feature vector from the previous
However, both architectures can only achieve performance layer and trainable weights plus a bias term. Therefore, each
with the classification accuracy of 54.16% and 53.43%, EEG input will be assigned to a unique weight and bias

24958 VOLUME 9, 2021


C. Q. Lai et al.: CNN Utilizing ECOC-SVM for Classification of Non-Severe TBI From EEG Signal

FIGURE 15. a) Accuracy for the Performance of Different ECOC SVM Models Using Features From Different
Layer of Architecture 1 b) Precision for the Performance of Different ECOC SVM Models Using Features From
Different Layer of Architecture 1.

via backpropagation that strongly correlates to its respective ensemble architecture under-performed with the classifica-
classes. tion accuracy of 65.42% and the precision of 66.78%. This
The feature vector from Layer 3 performed slightly lower outcome is an indication that the quality of the trained
than others because the input to the layer is the flattened ECOC-SVM models is not good enough to result in robust
feature map from the last average pooling layer (i.e., Layer 2), voting, causing inaccurate classification decisions.
which is a low-level feature. As the features passed through All ECOC-SVM classifiers presented in Figure 15 and
the rest of the FC layers, the low-level features will eventually Figure 16 are able to reach classification accuracy that are
build up into higher-level features, describing the input EEG above 65%. This result indicates that ECOC-SVM classifier
better. Each neuron in the FC layer received inputs from every that learns from FC layers’ activations performs better than
neuron of the previous layer. Thus, each neuron contained the CNN Architecture 1 and 2.
information from the previous feature vector, in which the
end product is the summation of all low-level features. From 2) EFFECT OF PRE-PROCESSING
Figure 15, it can be seen that feature vector extracted from Note that the input EEG to all conducted experiments
Layer 4 at a higher level can provide sufficient information previously did not undergo any pre-processing. Therefore,
for a robust classifier training. Among three experiments to investigate the effect of pre-processing on the archi-
conducted, feature fusion of Layer 3 and 4 present the highest tecture’s performance, the experiments are repeated using
performance with the classification accuracy of 69.91% and pre-processed EEG data. The pre-processing procedure is
the precision of 68.12%. Feature vectors result from the described in Section III. Results from the experiment are
combination carried distinct information from both Layer presented in Figure 17.
3 and 4, resulting in better training of an ECOC-SVM model. From Figure 17, all the ECOC-SVM trained using the
However, feature vector from the respective FC layer may pre-processed data improved drastically compared to those
contain repeated information passed to the next FC layer. trained with raw EEG data. All the performance measures
Therefore, there is a chance of information redundancy, increased to 99% and above, indicating efficient training
which may confuse the training of architecture. using pre-processed data. The pre-processing procedure
Hence, an experiment was conducted using architecture 2. used in this study efficiently removed unwanted elements
Features were extracted from the FC layers (i.e., Layer 3, 4, (i.e., noises, artifacts, extreme values), which may confuse
and 5) and trained three ECOC-SVM, respectively. A major- the learning process of the architecture.
ity voting ensembles algorithm was developed using the Figure 17 shows that the CNN ECOC-SVM majority
trained ECOC-SVM classifiers. For the classification predic- voting ensembles architecture gives the best performance,
tion output decision, the majority decision from the respective presenting a high classification accuracy of 99.79% and
ECOC-SVM will be the final output. The results from the the precision of 99.73%. The CNN ECOC-SVM major-
experiment are presented in Figure 16. Compared to the ity voting ensembles architecture also presents a low stan-
feature fusion used in CNN architecture 1, the ECOC-SVM dard deviation for the 250 bootstrap resampling runs,

VOLUME 9, 2021 24959


C. Q. Lai et al.: CNN Utilizing ECOC-SVM for Classification of Non-Severe TBI From EEG Signal

FIGURE 16. a) Accuracy for the Performance of ECOC-SVM Majority Voting Ensembles Using Features From
Layer 3, 4 and 5 of Architecture 2 b) Precision for the Performance of ECOC-SVM Majority Voting Ensembles
Using Features From Layer 3, 4 and 5 of Architecture 2.

FIGURE 17. a) Accuracy for the Performance of Different ECOC-SVM Using Features of Pre-processing EEG From
FC layers of Architecture 1 and ECOC-SVM Majority Ensembles From FC layers of Architecture 2 b) Precision for
the Performance of Different ECOC-SVM Using Features of Pre-processing EEG From FC layers of Architecture
1 and ECOC-SVM Majority Ensembles From FC layers of Architecture 2.

indicating this architecture has stable performance. Undergo- aiding the weaker classifiers to produce better prediction
ing pre-processing, a clean EEG data can provide informa- performance. Stabilization of the architecture can be seen by
tive features without confusing the learning process, hence highlighting the high precision (i.e., 99.73%).
developed quality ECOC-SVM classifiers that can make
accurate votings and decision-making. 3) EVALUATION OF FINAL ARCHITECTURE
The advantage of CNN ECOC-SVM majority voting By referring to the results of the conducted experiments,
ensembles architecture is that each of the ECOC-SVM clas- the final architecture proposed is a CNN ECOC-SVM
sifiers is combined to cover different regions of competence majority voting ensembles by ensemble three ECOC-SVM
(i.e., correcting wrong predictions of each other), which indi- classifiers trained by activations from the FC layers of
cates that this architecture has stable performance. Stronger Architecture 2 respectively. The proposed architecture’s
ECOC-SVM classifiers performed well in error correction, optimum parameters were determined via experiments

24960 VOLUME 9, 2021


C. Q. Lai et al.: CNN Utilizing ECOC-SVM for Classification of Non-Severe TBI From EEG Signal

FIGURE 18. a) Accuracy for the Performance of 250 and 2000 Bootstrap Resampling for proposed CNN
ECOC-SVM Voting Ensembles Architecture 2 b) Precision for the Performance of 250 and 2000 Bootstrap
Resampling for proposed CNN ECOC-SVM Voting Ensembles Architecture.

(i.e., the learning rate of 0.001, 16 mini-batch sizes, and methods and the proposed CNN ECOC-SVM are presented
ADAM optimizer). For the proposed architecture to perform in Table 4.
at its best, pre-processing has to be done on the input EEG
signal, as described in Section III. The proposed architecture TABLE 4. Accuracy and Precision for the Performance existing works and
shows high-performance measures and stability. To further proposed CNN ECOC-SVM Voting Ensembles Architecture.
assure the high performance of the proposed architecture,
the experiment is repeated using 2000 bootstrap resampling
to obtain an ambitious measure of CI for all the perfor-
mance measures with a pre-processed EEG signal. The results
are presented in Figure 18, together with the experiment
done using 250 bootstrap resampling. There are small decre-
ments in performance measures of the architecture conducted
using 2000 bootstrap resampling (i.e., the classification accu- Having established the features extracted from the fre-
racy of 99.76% and the precision of 99.72%). Again, these quency bands can provide important information dur-
results strengthen the robust performance of this proposed ing the training on the classifier, den Brink et al. [24],
architecture. McNerney et al. [25] and our previous works performed fea-
ture extractions relying on the frequency bands. On the other
I. COMPARISON OF THE PROPOSED METHOD WITH hand, for this paper’s proposed method, the pre-processed
EXISTING WORKS signal did not undergo any feature extraction. The EEG is
The proposed method is compared with two similar arranged in matrix form and fed to the input of the CNN topol-
existing methods [24], [25], and two of our previous ogy. The convolution layers performed feature extraction to
works [52], [53]. The first method for comparison is the obtain distinct features from the input. The convolution layers
work by den Brink et al. [24], which uses task-free EEG and made up of learnable kernels are aimed at extracting local
Naive Bayes classifier for TBI classification. The second features from the input. The feature extraction takes place in
method compared was proposed by McNerney et al. [25] the convolution layers started by extracting low-level features
that uses the AdaBoost classifier. From our previous and subsequently progressed to extract higher-level features.
works [52], [53], the same pre-processing procedure pre- Results showed that the proposed architecture outper-
sented in Section III was used to pre-process the data. Alpha formed the other two methods with the classification accuracy
band power and theta power spectral density (PSD) were of 99.76% and the precision of 99.72%. Naive Bayes presents
extracted to train the SVM classifier. For a fair compari- a comparable performance (i.e., the classification accuracy
son, the same dataset, training procedure, and performance of 97.01%). However, to ensure such high performance,
measure are used (i.e., as presented in Section V) for all pre-processing and feature extraction have to be performed to
the compared methods. The performance of each of the ensure quality features can be extracted. On the other hand,

VOLUME 9, 2021 24961


C. Q. Lai et al.: CNN Utilizing ECOC-SVM for Classification of Non-Severe TBI From EEG Signal

the Adaboost classifier can only present the classification the proposed method was shown to outperform drastically
accuracy of 62.68 %. similar works in the literature, as well as our previous studies.
Naive Bayes made assumptions that each feature is inde-
pendent of each other, ignoring the dependency between EEG REFERENCES
channels. It caused the correlations between channels to be [1] P. Ortega, C. Figueroa, and G. Ruz, ‘‘A medical claim fraud/abuse detection
neglected, which can cause information lost in classifier train- system based on data mining: A case study in chile,’’ in Proc. Int. Conf.
Data Mining, 2006, pp. 224–231.
ing. Thus, the proposed method that makes use of CNN can [2] M. L. Lassey, W. R. Lassey, and M. J. Jinks, Health Care Systems Around
overcome the shortcoming of Naive Bayes. The AdaBoost the World: Characteristics, Issues, Reforms. Upper Saddle River, NJ, USA:
classifier is a machine learning method that requires less Prentice-Hall, 1996.
[3] A. Maas, D. Menon, P. Adelson, N. Andelic, M. Bell, A. Belli, P. Bragge,
tweaking of parameters and is easy to use. However, it is
A. Brazinova, A. Buerki, R. Chesnut, G. Citerio, M. Coburn, D. Cooper,
sensitive to noises and outliers, which is unavoidable in EEG A. T. Crowder, E. Czeiter, M. Czosnyka, R. Diaz-Arrastia, J. Dreier, and
recordings. More efforts have to be done to ensure noises A.-C. Duhaime, ‘‘Traumatic brain injury: Integrated approaches to improve
and artifacts to be removed to ensure effective classifier training. The proposed method only has to undergo simple bandpass filtering and the removal of segments containing artifacts, yet it presents a performance that is near 100%.

In our previous works [52], [53], alpha band power and theta band power spectral density (PSD) were extracted from the EEG as features to train ECOC-SVM classifiers. However, they showed lower classification performance than the proposed method. Alpha band power and theta PSD can be included among the features for moderate TBI classification, but using alpha band power or theta PSD alone is insufficient. Other features, such as the correlation coefficient and phase difference, have to be extracted to provide sufficient information to train an SVM.

VII. CONCLUSION
In this article, the final proposed architecture and its parameters were obtained by conducting extensive experiments. To determine them, 250 bootstrap resamplings were performed for each experiment, and 3-fold cross-validation was used on the downsampled raw EEG. The parameters determined in this way are a learning rate of 0.001, a mini-batch size of 16, one convolutional layer with two 5×5 filters, one 2×2 average pooling layer, two FC layers with 128 neurons each, and one FC layer with three neurons.

Activations from the FC layers are used to train three ECOC-SVM classifiers, which are connected in parallel to form a high-performance majority voting ensemble for the classification of non-severe TBI patients and healthy subjects. It was shown that an ECOC-SVM classifier trained on FC layer activations performs better than conventional CNN architectures that use SoftMax for non-severe TBI classification. The effect of pre-processing was also explored using the proposed architecture, and it was found that pre-processed EEG yields better performance than raw EEG.

The proposed novel CNN ECOC-SVM majority voting ensemble architecture, which uses pre-processed EEG, presents high performance measures, and its performance was verified using 2000 bootstrap resamplings and 3-fold cross-validation. It achieves a classification accuracy of 99.76% and a precision of 99.72%. In addition,

prevention, clinical care, and research,'' Lancet Neurol., vol. 16, no. 12, pp. 987–1048, Dec. 2017.
[4] B. Lee and A. Newberg, ``Neuroimaging in traumatic brain imaging,'' NeuroRX, vol. 2, pp. 372–383, Apr. 2005.
[5] P. S. Ngoya, W. E. Muhogora, and R. D. Pitcher, ``Defining the diagnostic divide: An analysis of registered radiological equipment resources in a low-income African country,'' Pan Afr. Med. J., vol. 25, p. 99, Oct. 2016.
[6] A. M. Rosenbaum, A. Weintraub, R. Seel, J. Whyte, and R. Nakase-Richardson, Severe Traumatic Brain Injury: What to Expect in the Trauma Center, Hospital, and Beyond (The Traumatic Brain Injury Model System). Arlington, TX, USA: Model Systems Knowledge Translation Center, Jul. 2017.
[7] L. M. Ryan and D. L. Warden, ``Post concussion syndrome,'' Int. Rev. Psychiatry, vol. 15, no. 4, pp. 310–316, 2003.
[8] A. I. R. Maas, N. Stocchetti, and R. R. Bullock, ``Moderate and severe traumatic brain injury in adults,'' Lancet Neurol., vol. 7, no. 8, pp. 728–741, 2008.
[9] Traumatic Brain Injury. Accessed: Feb. 17, 2020. [Online]. Available: https://www.aans.org/en/Patients/Neurosurgical-Conditions-and-Treatments/Traumatic-Brain-Injury
[10] R. Ibrahim, S. Samian, M. Mazli, M. Amrizal, and S. M. Aljunid, ``Cost of magnetic resonance imaging (MRI) and computed tomography (CT) scan in UKMMC,'' BMC Health Services Res., vol. 12, no. S1, Nov. 2012.
[11] T. M. Reeves and B. S. Colley, Electrophysiological Approaches to Traumatic Brain Injury. Totowa, NJ, USA: Humana Press, 2012, pp. 313–330.
[12] D. B. Arciniegas, ``Clinical electrophysiologic assessments and mild traumatic brain injury: State-of-the-science and implications for clinical practice,'' Int. J. Psychophysiol., vol. 82, no. 1, pp. 41–52, Oct. 2011.
[13] D. Hanley, L. S. Prichep, N. Badjatia, J. Bazarian, R. Chiacchierini, K. C. Curley, J. Garrett, E. Jones, R. Naunheim, B. O'Neil, J. O'Neill, D. W. Wright, and J. S. Huff, ``A brain electrical activity electroencephalographic-based biomarker of functional impairment in traumatic brain injury: A multi-site validation trial,'' J. Neurotrauma, vol. 35, no. 1, pp. 41–47, Jan. 2018.
[14] J. N. Ianof and R. Anghinah, ``Traumatic brain injury: An EEG point of view,'' Dementia Neuropsychol., vol. 11, no. 1, pp. 3–5, Mar. 2017.
[15] A. Tolonen, M. O. K. Särkelä, R. S. K. Takala, A. Katila, J. Frantzén, J. P. Posti, M. Müller, M. van Gils, and O. Tenovuo, ``Quantitative EEG parameters for prediction of outcome in severe traumatic brain injury: Development study,'' Clin. EEG Neurosci., vol. 49, no. 4, pp. 248–257, Jul. 2018, doi: 10.1177/1550059417742232.
[16] P. Nunez, ``Toward a quantitative description of large-scale neocortical dynamic function and EEG,'' Behav. Brain Sci., vol. 23, no. 3, pp. 371–473, 2000.
[17] N. S. E. M. Noor and H. Ibrahim, ``Machine learning algorithms and quantitative electroencephalography predictors for outcome prediction in traumatic brain injury: A systematic review,'' IEEE Access, vol. 8, pp. 102075–102092, 2020.
[18] J. A. N. Fisher, S. Huang, M. Ye, M. Nabili, W. B. Wilent, V. Krauthamer, M. R. Myers, and C. G. Welle, ``Real-time detection and monitoring of acute brain injury utilizing evoked electroencephalographic potentials,'' IEEE Trans. Neural Syst. Rehabil. Eng., vol. 24, no. 9, pp. 1003–1012, Sep. 2016.
[19] J. McBride, X. Zhao, T. Nichols, V. Vagnini, N. Munro, D. Berry, and Y. Jiang, ``Scalp EEG-based discrimination of cognitive deficits after traumatic brain injury using event-related tsallis entropy analysis,'' IEEE Trans. Biomed. Eng., vol. 60, no. 1, pp. 90–96, Jan. 2013.
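The reported architecture and training parameters can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the authors' implementation: the input arrangement (a 64×64 single-channel map) and the ReLU activations are guesses, since the paper's EEG-to-input conversion is described elsewhere.

```python
import torch
import torch.nn as nn

class TbiCnn(nn.Module):
    """One conv layer (two 5x5 filters), one 2x2 average pooling layer,
    two 128-neuron FC layers, and a 3-neuron FC output layer, as reported."""

    def __init__(self, in_h=64, in_w=64, n_classes=3):  # input size is assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 2, kernel_size=5),  # two 5x5 filters
            nn.ReLU(),
            nn.AvgPool2d(2),                 # 2x2 average pooling
        )
        feat = 2 * ((in_h - 4) // 2) * ((in_w - 4) // 2)
        self.fc1 = nn.Sequential(nn.Linear(feat, 128), nn.ReLU())
        self.fc2 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.out = nn.Linear(128, n_classes)

    def forward(self, x, return_activations=False):
        z = self.features(x).flatten(1)
        a1 = self.fc1(z)
        a2 = self.fc2(a1)
        logits = self.out(a2)
        if return_activations:
            # These FC activations are what the ECOC-SVM stage consumes.
            return logits, (a1, a2, logits)
        return logits

model = TbiCnn()
opt = torch.optim.SGD(model.parameters(), lr=0.001)  # reported learning rate
batch = torch.randn(16, 1, 64, 64)                   # reported mini-batch size of 16
logits = model(batch)
```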
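The ECOC-SVM majority-voting stage can be approximated with scikit-learn's `OutputCodeClassifier`, as in the sketch below. The random stand-in activations, the linear kernel, and the code size are assumptions for illustration; in the real pipeline the three feature matrices are the activations of the trained CNN's FC layers.

```python
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

# Stand-in activations from the three FC layers (128, 128 and 3 units);
# shifting each class mean makes the toy problem separable.
rng = np.random.default_rng(42)
y = rng.integers(0, 3, size=120)  # three classes, matching the 3-neuron output
layer_acts = [rng.standard_normal((120, d)) + y[:, None] for d in (128, 128, 3)]

# One ECOC-SVM per FC layer, as described in the conclusion.
clfs = [
    OutputCodeClassifier(SVC(kernel="linear"), code_size=2, random_state=0).fit(X, y)
    for X in layer_acts
]

def majority_vote(test_acts):
    """Combine the three ECOC-SVM predictions by majority vote."""
    votes = np.stack([clf.predict(X) for clf, X in zip(clfs, test_acts)])
    return np.apply_along_axis(
        lambda v: np.bincount(v, minlength=3).argmax(), 0, votes
    )

pred = majority_vote(layer_acts)
```

Each `OutputCodeClassifier` decomposes the 3-class problem into several binary SVMs via an error-correcting codebook; the parallel vote then arbitrates between the three per-layer classifiers.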
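The evaluation protocol (bootstrap resampling combined with 3-fold cross-validation) can be sketched as follows. The stand-in data and classifier, and the reduced number of resamples, are assumptions to keep the example fast; the study reports 250 resamples for model selection and 2000 for the final check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.utils import resample

# Stand-in data; the real study uses EEG segments from TBI and healthy subjects.
X, y = make_classification(n_samples=90, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

def bootstrap_cv_accuracy(X, y, n_boot=25, n_folds=3, seed=0):
    """Mean 3-fold CV accuracy over bootstrap resamples of the data."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)  # sample with replacement
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=rng)
        for tr, te in skf.split(Xb, yb):
            clf = SVC().fit(Xb[tr], yb[tr])
            scores.append(clf.score(Xb[te], yb[te]))
    return float(np.mean(scores))

acc = bootstrap_cv_accuracy(X, y)
```

Averaging across resamples and folds gives the performance estimate; the spread of the per-fold scores would supply the confidence intervals.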

24962 VOLUME 9, 2021


C. Q. Lai et al.: CNN Utilizing ECOC-SVM for Classification of Non-Severe TBI From EEG Signal

[20] P. E. Rapp, D. O. Keyser, A. Albano, R. Hernandez, D. B. Gibson, R. A. Zambon, W. D. Hairston, J. D. Hughes, A. Krystal, and A. S. Nichols, ``Traumatic brain injury detection using electrophysiological methods,'' Frontiers Hum. Neurosci., vol. 9, p. 11, Feb. 2015.
[21] N. K. Yadav and K. J. Ciuffreda, ``Objective assessment of visual attention in mild traumatic brain injury (mTBI) using visual-evoked potentials (VEP),'' Brain Injury, vol. 29, no. 3, pp. 352–365, Feb. 2015, doi: 10.3109/02699052.2014.979229.
[22] S. Schmitt and M. A. Dichter, ``Electrophysiologic recordings in traumatic brain injury,'' in Traumatic Brain Injury, Part I (Handbook of Clinical Neurology), vol. 127, J. Grafman and A. M. Salazar, Eds. Amsterdam, The Netherlands: Elsevier, 2015, ch. 21, pp. 319–339.
[23] E. Başar, A. Gönder, and P. Ungan, ``Important relation between EEG and brain evoked potentials,'' Biol. Cybern., vol. 25, pp. 27–40, Mar. 1977.
[24] R. L. van den Brink, S. Nieuwenhuis, G. J. M. van Boxtel, G. van Luijtelaar, H. J. Eilander, and V. J. M. Wijnen, ``Task-free spectral EEG dynamics track and predict patient recovery from severe acquired brain injury,'' NeuroImage: Clin., vol. 17, pp. 43–52, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2213158217302449
[25] M. W. McNerney, T. Hobday, B. Cole, R. Ganong, N. Winans, D. Matthews, J. Hood, and S. Lane, ``Objective classification of mTBI using machine learning on a combination of frontopolar electroencephalography measurements and self-reported symptoms,'' Sports Med.-Open, vol. 5, no. 1, p. 14, Apr. 2019, doi: 10.1186/s40798-019-0187-y.
[26] M. K. Islam, A. Rastegarnia, and Z. Yang, ``Methods for artifact detection and removal from scalp EEG: A review,'' Neurophysiologie Clinique/Clin. Neurophysiol., vol. 46, nos. 4–5, pp. 287–305, Nov. 2016.
[27] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, ``Flexible, high performance convolutional neural networks for image classification,'' in Proc. 22nd Int. Joint Conf. Artif. Intell., 2011, pp. 1237–1242.
[28] K. Muhammad, J. Ahmad, I. Mehmood, S. Rho, and S. W. Baik, ``Convolutional neural networks based fire detection in surveillance videos,'' IEEE Access, vol. 6, pp. 18174–18183, 2018.
[29] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, ``Backpropagation applied to handwritten zip code recognition,'' Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989.
[30] Z. Wen, R. Xu, and J. Du, ``A novel convolutional neural networks for emotion recognition based on EEG signal,'' in Proc. Int. Conf. Secur., Pattern Anal., Cybern. (SPAC), Dec. 2017, pp. 672–677.
[31] S. U. Amin, M. Alsulaiman, G. Muhammad, M. A. Bencherif, and M. S. Hossain, ``Multilevel weighted feature fusion using convolutional neural networks for EEG motor imagery classification,'' IEEE Access, vol. 7, pp. 18940–18950, 2019.
[32] S. Yang, F. Deravi, and S. Hoque, ``Task sensitivity in EEG biometric recognition,'' Pattern Anal. Appl., vol. 21, no. 1, pp. 105–117, Feb. 2018.
[33] C. Q. Lai, H. Ibrahim, M. Z. Abdullah, J. M. Abdullah, S. A. Suandi, and A. Azman, ``A literature review on data conversion methods on EEG for convolution neural network applications,'' in Proc. 10th Int. Conf. Robot., Vision, Signal Process. Power Appl. (Lecture Notes in Electrical Engineering), vol. 547, M. Zawawi, S. S. Teoh, N. Abdullah, and M. M. Sazali, Eds. Singapore: Springer, 2019, pp. 521–527.
[34] C. Q. Lai, H. Ibrahim, M. Z. Abdullah, J. M. Abdullah, S. A. Suandi, and A. Azman, ``Arrangements of resting state electroencephalography as the input to convolutional neural network for biometric identification,'' Comput. Intell. Neurosci., vol. 2019, Jun. 2019, Art. no. 7895924.
[35] C. Q. Lai, H. Ibrahim, A. I. Abd Hamid, M. Z. Abdullah, A. Azman, and J. M. Abdullah, ``Detection of moderate traumatic brain injury from resting-state eye-closed electroencephalography,'' Comput. Intell. Neurosci., vol. 2020, Mar. 2020, Art. no. 8923906.
[36] Y. Bengio, ``Learning deep architectures for AI,'' Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[37] A. Sehgal and N. Kehtarnavaz, ``A convolutional neural network smartphone app for real-time voice activity detection,'' IEEE Access, vol. 6, pp. 9017–9026, 2018.
[38] L. Bottou, ``Large-scale machine learning with stochastic gradient descent,'' in Proc. 19th Int. Conf. Comput. Statist. (COMPSTAT), Y. Lechevallier and G. Saporta, Eds. Paris, France: Springer, Aug. 2010, pp. 177–187.
[39] M. W. McNerney, T. Hobday, B. Cole, R. Ganong, N. Winans, D. Matthews, J. Hood, and S. Lane, ``Objective classification of mTBI using machine learning on a combination of frontopolar electroencephalography measurements and self-reported symptoms,'' Sports Med.-Open, vol. 5, no. 1, pp. 1–8, Dec. 2019.
[40] L. S. Prichep, S. Ghosh Dastidar, A. Jacquin, W. Koppes, J. Miller, T. Radman, B. O'Neil, R. Naunheim, and J. S. Huff, ``Classification algorithms for the identification of structural injury in TBI using brain electrical activity,'' Comput. Biol. Med., vol. 53, pp. 125–133, Oct. 2014.
[41] A. Jacquin, S. Kanakia, D. Oberly, and L. S. Prichep, ``A multimodal biomarker for concussion identification, prognosis and management,'' Comput. Biol. Med., vol. 102, pp. 95–103, Nov. 2018.
[42] B. Albert, J. Zhang, A. Noyvirt, R. Setchi, H. Sjaaheim, S. Velikova, and F. Strisland, ``Automatic EEG processing for the early diagnosis of traumatic brain injury,'' Procedia Comput. Sci., vol. 96, pp. 703–712, 2016.
[43] G. E. Hine, E. Maiorana, and P. Campisi, ``Resting-state EEG: A study on its non-stationarity for biometric applications,'' in Proc. Int. Conf. Biometrics Special Interest Group (BIOSIG), Sep. 2017, pp. 1–5.
[44] B. Efron, ``Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods,'' Biometrika, vol. 68, no. 3, pp. 589–599, 1981.
[45] M. Kuhn and K. Johnson, Applied Predictive Modeling. New York, NY, USA: Springer, 2013.
[46] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, ``A guide to convolutional neural networks for computer vision,'' Synth. Lectures Comput. Vis., vol. 8, no. 1, pp. 1–207, Feb. 2018.
[47] Y. Bengio, ``Practical recommendations for gradient-based training of deep architectures,'' in Neural Networks: Tricks of the Trade. Berlin, Germany: Springer, 2012.
[48] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, ``On large-batch training for deep learning: Generalization gap and sharp minima,'' 2016, arXiv:1609.04836. [Online]. Available: https://arxiv.org/abs/1609.04836
[49] D. Masters and C. Luschi, ``Revisiting small batch training for deep neural networks,'' 2018, arXiv:1804.07612. [Online]. Available: https://arxiv.org/abs/1804.07612
[50] P. Golik, P. Doetsch, and H. Ney, ``Cross-entropy vs. squared error training: A theoretical and experimental comparison,'' in Proc. INTERSPEECH, 2013, pp. 1756–1760.
[51] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press, 2013.
[52] C. Q. Lai, M. Z. Abdullah, A. I. A. Hamid, A. Azman, J. M. Abdullah, and H. Ibrahim, ``Moderate traumatic brain injury identification from power spectral density of electroencephalography's frequency bands using support vector machine,'' in Proc. IEEE Int. Circuits Syst. Symp. (ICSyS), Sep. 2019, pp. 1–4.
[53] C. Q. Lai, M. Z. Abdullah, J. M. Abdullah, A. Azman, and H. Ibrahim, ``Screening of moderate traumatic brain injury from power feature of resting state electroencephalography using support vector machine,'' in Proc. 2nd Int. Conf. Electron. Electr. Eng. Technol., New York, NY, USA, Sep. 2019, pp. 99–103.
[54] L. Tan and J. Jiang, ``Signal sampling and quantization,'' in Digital Signal Processing, L. Tan and J. Jiang, Eds., 2nd ed. Boston, MA, USA: Academic, 2013, ch. 2, pp. 15–56.

CHI QIN LAI (Member, IEEE) received the B.Eng. degree in electrical and electronic engineering and the master's degree in image processing and machine learning from Universiti Sains Malaysia, Malaysia, in 2016, where he is currently pursuing the Ph.D. degree. He joined Intel as a Product Development Engineer. His research interests include machine learning and biomedical signal processing and analysis.


HAIDI IBRAHIM (Senior Member, IEEE) received the B.Eng. degree in electrical and electronic engineering from Universiti Sains Malaysia, Malaysia, and the Ph.D. degree in image processing from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, U.K., in 2005. His research interests include digital image and signal processing and analysis.

AZLINDA AZMAN received the Ph.D. degree in clinical social work from New York University. She was a Fulbright Scholar with New York University. She is currently a Professor in social work and the Dean of the School of Social Sciences, Universiti Sains Malaysia (USM), Penang, Malaysia. She is also the Convenor of the AIDS Action and Research Group (AARG), USM. Her research interests include social work education/curriculum, theory and methods in social work, and social work research. Her research interests also include poverty, HIV/AIDS, and drug related issues.

JAFRI MALIN ABDULLAH received the M.D. degree from the School of Medical Sciences, University Sains Malaysia, in 1986, and the Diplomate Certification of specialization in neurosurgery and the Ph.D. degree (magna cum laude) from the University of Ghent, Belgium, in 1994 and 1995, respectively. He studied at Sekolah Menengah Sains Bukit Mertajam and SMS Kelantan, Pengkalan Chepa and the Leederville Technical College, Perth, Western Australia, and the University of Western Australia. He was previously the Director of P3Neuro, The Center for Neuroscience Services and Research, University Sains Malaysia, from 2013 to 2018. He is currently the Director of the Brain Behaviour Cluster, School of Medical Sciences, USM. His research interests include new treatments in the field of aging brain cells in vivo and in vitro, aging brain and brain trauma, human pluripotential stem cells, neurooncology, medicinal chemistry in the field of biodegradable wafer antibiotics, drugs for movement disorders, CNS tuberculosis, epilepsy and pain, as well as ethnopharmacology. He is a member of the National STEM Movement and is working with the Ministry of Education, USM, and UPSI on the STEM Comic Project. He is a Fellow of the Academy Science Malaysia, the American College of Surgeons, the Royal College of Surgeons of Edinburgh, the Royal Society of Medicine, U.K., the World Federation of Neurosurgical Societies, and the American Association of Neurological Surgeons. He was awarded the prestigious Young National Malaysian Scientist Award in 1999 and the Top Research Scientist Award, Academy Science Malaysia, in 2013, by the Prime Minister of that period.

MOHD ZAID ABDULLAH (Member, IEEE) received the B.App.Sc. degree in electronic from Universiti Sains Malaysia (USM), in 1986, and the M.Sc. degree in instrument design and application and the Ph.D. degree in electrical impedance tomography from the Institute of Science and Technology, The University of Manchester, U.K., in 1989 and 1993, respectively. He joined Hitachi Semiconductor, Malaysia, as a Test Engineer. He is currently a Lecturer and a Professor with USM's School of Electrical and Electronic Engineering. He has published numerous research articles in international journals and conference proceedings. His research interests include microwave tomography, digital image processing, computer vision, and ultra-wide band sensing. One of his papers was awarded The Senior Moulton Medal for the Best Article published by the Institute of Chemical Engineering, in 2002.
