
A Report On

Motor Imagery Classification


Using Deep Learning
Atharva Dubey
1 Introduction
Brain-computer interfaces (BCIs) establish a direct link between a human
brain and an intelligent system which interprets the brain's signals and
translates them into a human-understandable format. BCI systems have been
widely applied to real-life applications, for example prosthetic limbs, night
terror and epilepsy detection, and even wheelchairs with basic controls. With
the advent of non-invasive and safe techniques to read and record EEG signals,
there has been much research into making EEG signals more useful and gaining
a deeper understanding of them. Motor imagery classification, one of the
branches of BCI, has been researched, developed, and even deployed. Yet the
complexity of the subject increases as the number of imagery actions increases;
moreover, EEG signals differ from person to person and cannot be generalized
for the same action. The general flow of classifying motor imagery actions
is obtaining data, pre-processing, feature extraction, and classification. EEG
signals have a low signal-to-noise ratio (SNR) and high correlation values
among the different channels. This calls for manual channel selection by
a domain expert, and therefore classification paradigms depend on human
knowledge of the domain instead of extracting features independently.
My work here focuses on exploring deep learning techniques, namely convo-
lutional neural networks and representation learning, to classify motor im-
agery data into various actions with no or minimal data preprocessing, manual
channel selection, or channel weighting.

2 Related Work
Convolutional neural networks have proved to be powerful feature extrac-
tors in the fields of computer vision, NLP, and even time series problems.
In the field of motor imagery classification too, CNNs have been the first
choice of many. Research at the US Army Research Laboratory introduced
a compact convolution-based motor imagery classifier and achieved
state-of-the-art results. Causal temporal networks are CNN-based neural
networks which are known to mimic the nature of recurrent models and
focus on only one part of the input at a time. Multiscale CNNs (citation here)
have used both spatial and temporal domain convolutions to generate a rich
feature bank and classify based on the features generated. EEG signals have
spatial correlations among each other. Apart from extracting common spa-
tial patterns (CSP) (citation here), projecting the data into the Riemannian
space extracts the correlations amongst the input channels by generating
an SPD matrix. Learning these correlations in this manifold space has proved
to be quite useful, especially in cases where a smaller number of training
samples is present, and has also proven better than traditional methods
for dimensionality reduction. Deep belief networks, also popular for feature
extraction, provide the features in their encoded stage.

Figure 1: Time series plots of two datapoints from the same class.

3 Dataset Description and Analysis


The dataset chosen is the BCI Competition IV dataset 2a (citation here).
The data has been preprocessed using a bandpass filter of 0.5 to 100 Hz,
is sampled at 250 Hz, and is passed through a 50 Hz notch filter to remove
noise from electrical appliances. EEG signals are recorded from 9 different
subjects using a 22-channel EEG headset with 3 additional EOG channels.
The motor imagery signals have been isolated per class from the continuous
stream of EEG signals, giving a total of 72 samples per class per subject,
and each motor imagery event is 313 timesteps long. Out of these 72 samples,
20 have been selected for testing and the remaining 52 for training.

3.1 Data Visualization


Before getting started with classification methods, it is useful to visualize
the data and look for common patterns, if any (Figure 1, Figure 2).
Inspecting the figures, there is a clear distinction between class 769 and
class 770, as can be seen from the yellow channel: class 769 is the one
without any peaks emerging from the yellow channel, while class 770 has a
distinctive peak at the timestamp of 100. This high-level view of the
different channels provides a good overview for ascertaining the differences
between the four classes. However, there is no clear distinction between
classes 771 and 772; the heatmaps in Figure 3 provide a closer view.

Figure 2: Time series plots of the classes; from the top left in clockwise
direction: class 769, class 770, class 771, class 772

Figure 3: Heatmaps of classes 771 (above) and 772 (below)

Figure 4: Time series plots of two datapoints from the same class.

The heatmap provides minute details about the 'activation' of the different
channels: the lighter the color, the higher the value of the signal on that
particular channel. It can be inferred that class 772 has activations at
the start and another activation later for a small duration.

3.2 Current Methods and their Implementations


To get a baseline model and a baseline score, two models were implemented
and experimented with to understand the caveats and intricacies of the
problem. Two basic models were chosen: first an LSTM-based model (Figure 4),
and second a temporal convolutional network, to gauge the effectiveness of
convolutional models on small datasets.

LSTM-based classifier – In any sequence classification task, the first mode
of approach is a long short-term memory (LSTM) based model, which is one of
the most popular methodologies and provides a strong baseline for evaluation
and comparison. The model was based on the paper LSTM-Based EEG
Classification in Motor Imagery Tasks and consisted of stacked LSTMs
followed by a linear layer with SoftMax activation for classification. The
methodology, as described in the paper, relied heavily on data pre-processing
and feature engineering for classification. The paper proposed the following
steps (the first two are sketched in code after the list):

• Normalize and apply a bandpass filter of 8-35 Hz.

• Apply piecewise aggregate approximation (PAA) on each channel.

• Channel weighting.

• The pre-processed data is then fed into the LSTM network for classification.
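
A minimal sketch of the first two steps, assuming SciPy/NumPy (the filter
order and the number of PAA segments are illustrative choices, not values
from the paper):

import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(x, low=8.0, high=35.0, fs=250.0, order=4):
    # Z-normalize each channel, then apply an 8-35 Hz Butterworth bandpass.
    x = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x, axis=-1)

def paa(x, segments=64):
    # Piecewise aggregate approximation: mean over equal-width windows.
    c, t = x.shape
    trimmed = x[:, : t - (t % segments)]
    return trimmed.reshape(c, segments, -1).mean(axis=-1)

x = np.random.randn(22, 313)        # one trial: 22 channels x 313 timesteps
features = paa(bandpass(x))         # shape (22, 64)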

I took two approaches to this paper: first, just raw data without any
pre-processing (to get a baseline score), and second, the method proposed in
the paper. I was not able to implement the channel weighting step, so I
skipped it and implemented the rest as-is.

• Raw data, no pre-processing – 27-30% accuracy

• Pre-processed data – 38% average accuracy, with 45% being the best

3.3 Causal Temporal Convolutions

Figure 5: Visualization of a Temporal Causal Convolution

Inspired by convolutional networks for sequential data, temporal convolutions
are built such that they have long memory and even an autoregressive
property (Figure 6). However, simple dilated convolutions can only look back
at the immediately previous layer. To overcome this, the dilation factor is
grown exponentially to obtain a large receptive field, which enables the
last layer to receive a broad representation of the input before
classification.

Figure 6: Visualization of a Temporal Causal Convolution

The architecture presented in the paper (Figure 7) consisted of 10 such
dilated convolutional blocks with residual connections, followed by the
classification layer. However, owing to the small size of the dataset,
10 layers had a large number of parameters, leading to poor generalization
and overfitting. Therefore, only 5 such layers were used. A sketch of one
such block follows.
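
A sketch of one dilated causal convolutional block with a residual
connection (a generic TCN block in PyTorch; the kernel size and channel
count are illustrative assumptions, not values from the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    # One dilated causal convolution with a residual (shortcut) connection.
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # left-pad only: no peeking at the future
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (batch, channels, time)
        out = self.conv(F.pad(x, (self.pad, 0)))
        return torch.relu(out) + x                # residual connection

# Five blocks with exponentially growing dilation (1, 2, 4, 8, 16),
# so the receptive field grows exponentially with depth.
tcn = nn.Sequential(*[CausalConvBlock(22, dilation=2 ** i) for i in range(5)])
y = tcn(torch.randn(8, 22, 313))                  # -> (8, 22, 313)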
The data was pre-processed using ICA followed by Z-score normalization, as
done in the paper. The learning rate was experimented with to prevent
overfitting; the learning rates tried were [0.01, 0.001, 0.0001, 1e-5]. The
optimizer was chosen as SGD instead of Adam, as it converged faster and
proved more stable. To overcome the issue of scarce data, one of the most
popular augmentation techniques is to feed the same data to the network
again, but this can in turn lead to overfitting.

Figure 7: Visualization of a Temporal Causal Convolution

Therefore, the same data point was added again only if a random number drawn
from a uniform distribution was greater than 0.2, i.e. a data point has an
80% probability of being duplicated. Epochs were kept at 10. Looking at the
loss-vs-epoch plot, the model converges quite fast, and the maximum accuracy
of 42% was obtained at a learning rate of 0.0001. Figure 8 compares the
accuracy obtained against the learning rate while keeping all other
parameters constant.

Figure 8: Training Plot
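
The duplication-based augmentation described above amounts to the following
minimal sketch (`dataset` stands for any list of training samples):

import random

def augment(dataset):
    # Each sample is duplicated when a uniform draw exceeds 0.2,
    # i.e. with probability 0.8.
    out = []
    for sample in dataset:
        out.append(sample)
        if random.random() > 0.2:
            out.append(sample)
    return out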

4 Proposed Models
The sections henceforth contain a discussion of the models implemented and
explored. The methods adopted draw motivation from the field of computer
vision as well as from representation learning.

4.1 Deep Residual Networks For Motor Imagery Classification Using Stacked Feature Maps
4.1.1 Abstract
Ever since AlexNet came out on top in the ImageNet classification challenge,
deep convolutional networks have dominated computer vision, and residual
connections have been among the most groundbreaking contributions to the
field of deep learning. They enable deeper architectures for more complex
tasks and circumvent the issue of vanishing gradients in the earlier layers
by adding a shortcut connection which enables gradient flow. Owing to the
complex nature of the signals and the slight variations between classes, it
becomes imperative to strike a balance between the depth of the model and
the size of the dataset. Increasing the dimension of the feature map exposes
more features of each class, leading to better discriminative power. I
propose stacking up feature maps to serve as the input to a residual neural
network. This enables the proposed ResNet to learn collectively from all
feature maps, essentially treating them like an image. The proposed method
has achieved 82.5% in 4-class classification and 100% in binary
classification.

4.1.2 Materials and Method


Let the input EEG signal belonging to the class \( C_j \in \{0, 1, 2, 3\} \)
be denoted by x. The dimension of the vector x will be \( c \times t \), c
being the number of channels and t the number of time steps.

Three feature maps are obtained by performing one-dimensional convolutions
along the temporal domain on the successive outputs of each map. If N
denotes the batch size and K the number of output channels, then

\[ x(N \times K \times t) = \beta(K) + \sum_{k=0}^{K} \omega(k) \ast x(c) \]

\[ \theta(N \times K \times t) = \beta(K) + \sum_{k=0}^{K} \omega(k) \ast x(k) \]

\[ \rho(N \times K \times t) = \beta(K) + \sum_{k=0}^{K} \omega(k) \ast \theta(k) \]

\[ \gamma(N \times K \times t) = \beta(K) + \sum_{k=0}^{K} \omega(k) \ast \rho(k) \]

where \( \ast \) is the cross-correlation function given by

\[ (f \ast g)[n] = \sum_{m=0}^{N-1} f[m]\, g[m+n] \]

where \( f \in \mathbb{R}^n \), \( g \in \mathbb{R}^m \), and
\( \theta, \rho, \gamma \) are the first, second, and third feature maps
respectively. The obtained feature maps are then stacked on top of each
other, with \( \theta \) on the top and \( \gamma \) on the bottom. The
resultant feature vector is of shape \( N \times 3 \times K \times t \).
Note that no non-linearity is applied between any of the obtained feature
maps. This new feature vector becomes the input to the residual network.
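
In code, the stacking might look as follows (a PyTorch sketch; K and the
kernel size are illustrative assumptions, and, as stated above, no
non-linearity is applied between the maps):

import torch
import torch.nn as nn

c, t, K = 22, 313, 64                    # channels, timesteps, output channels
up    = nn.Conv1d(c, K, kernel_size=7, padding=3)   # upscaling layer: c -> K
conv1 = nn.Conv1d(K, K, kernel_size=7, padding=3)
conv2 = nn.Conv1d(K, K, kernel_size=7, padding=3)
conv3 = nn.Conv1d(K, K, kernel_size=7, padding=3)

x = torch.randn(8, c, t)                 # a batch of N = 8 EEG trials
theta = conv1(up(x))                     # first feature map
rho   = conv2(theta)                     # second feature map
gamma = conv3(rho)                       # third feature map
# Stack into an "image" of shape (N, 3, K, t) for the residual network.
stacked = torch.stack([theta, rho, gamma], dim=1)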
The residual network contains three residual units, with the number of
channels increasing from 64 to 256 and finally 512. There is a max pooling
layer after each residual block, with kernel size 5 and stride 2. This is
followed by a linear classifier layer, which is preceded by a global max
pool. The residual block can be defined by the following set of equations,
where n is the number of channels:

\[ x(n) = r(N \times n \times K \times t) = \beta(n) + \sum_{0}^{n} \omega(n) \ast x(n) \]

\[ F(N \times n \times K \times t) = \beta(n) + \sum_{0}^{n} \omega(n) \ast x(n) \]

\[ F(N \times n \times K \times t) = \mathrm{ReLU}\big(F(N \times n \times K \times t)\big) \]

\[ F(N \times n \times K \times t) = \mathrm{BatchNorm}\big(F(N \times n \times K \times t)\big) \]

\[ r(n) = F(N \times n \times K \times t), \qquad \mathrm{Output} = r(n) + x(n) \]

where n indexes the n-th residual layer.
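
A residual unit matching these equations might be sketched as follows (2D
convolutions over the stacked (K, t) map; the kernel size and the 1x1
shortcut for channel matching are assumptions, not values from the text):

import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        # 1x1 convolution on the shortcut when channel counts differ.
        self.skip = (nn.Conv2d(in_ch, out_ch, kernel_size=1)
                     if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        f = self.bn(torch.relu(self.conv(x)))   # conv -> ReLU -> BatchNorm
        return f + self.skip(x)                 # shortcut connection

# Three units (64, 256, 512 channels), each followed by max pooling;
# the input is the stacked (N, 3, K, t) feature map from above.
blocks = nn.Sequential(
    ResidualUnit(3, 64), nn.MaxPool2d(kernel_size=5, stride=2),
    ResidualUnit(64, 256), nn.MaxPool2d(kernel_size=5, stride=2),
    ResidualUnit(256, 512), nn.MaxPool2d(kernel_size=5, stride=2),
)
out = blocks(torch.randn(8, 3, 64, 313))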

4.1.3 Training Paradigm


Out of the 72 samples available per class, 52 were chosen for training and
the remaining 20 were taken for testing. The optimizer was chosen as Adam
with a cyclic learning rate varying from 1e-3 to 1e-5. The results below
present the average accuracy obtained after 5-fold validation.
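
The cyclic schedule can be reproduced with PyTorch's built-in scheduler (a
sketch; the model is a placeholder, the step size is an illustrative choice,
and cycle_momentum must be disabled when the optimizer is Adam):

import torch
import torch.nn as nn

model = nn.Linear(313, 4)        # placeholder for the residual network above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# Cycle the learning rate between 1e-5 and 1e-3.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=1e-3,
    step_size_up=50, cycle_momentum=False)

for step in range(200):          # training loop (loss computation omitted)
    optimizer.step()
    scheduler.step()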

4.1.4 Results
Following the above training paradigm, this section discusses the four-class
classification. I also compare a 2-way classification, which throws light on
per-class performance.

Subject Accuracy (4-way)
Subject 1 77.5
Subject 2 65
Subject 3 67.5
Subject 4 63.75
Subject 5 74
Subject 6 67.5
Subject 7 71.25
Subject 8 67.5
Subject 9 82.5

Figure 9: Training Plot

4.1.5 Conclusion
Stacking feature maps and serving them as the input to a residual network is
the method developed in this work. The network contains one layer for
upscaling and 3 layers for extracting feature maps, followed by residual
blocks, a downsampling layer, global max pooling, and lastly a fully
connected layer for classification, with a total of 24 layers. Using a
cyclic learning rate enables the model to navigate the loss surface and take
appropriately large steps. Contrary to conventional approaches, in which
motor imagery classification relies heavily on data preprocessing, this
method makes no use of data preprocessing or feature engineering whatsoever.
The model can be made more robust by increasing the receptive field, i.e. by
stacking more feature maps.

4.2 Learning One Dimensional Representations of EEG Signals for Motor Imagery Classification
4.2.1 Abstract
Representation learning aims to produce a feature vector which can be used
to describe each and every sample belonging to a class and be representative
of it. Time series and sequential data have leveraged it the most; for
example, speech recognition and audio classification have achieved
state-of-the-art results using the popular time2vec model, and the word2vec
models are another example of how powerful such embeddings are. This paper
explores learning one-dimensional representations of EEG signals, obtained
by recreating the signal.

4.2.2 Introduction
A representation l of dimension d of an input signal x is a good
representation of the input vector if \( x = \phi(l) \), where \( \phi \),
the decoder network, recreates x with minimum deviation from the input
signal. This representation l is considered a feature vector, a signature on
the basis of which classification can be done. Apart from manual channel
selection and various dimensionality reduction and feature extraction
techniques like principal or independent component analysis, the powerful
Boltzmann machines have also been used to extract a feature vector by
recreating the same signal using their greedy training approach. Symbolic
aggregation methods, which are based on regression techniques, are also
popular in the field of time series analysis. Convolutional autoencoders
have proven to be very useful in the field of computer vision, in tasks like
image denoising, anomaly detection, and more complex tasks like single-image
super-resolution. They are generative networks trained in an unsupervised
manner, where the input is first encoded and then reconstructed from the
encoded state. This encoded state can be considered a signature
representative of that particular class, upon which classification is
performed. EEG signals are long and multi-channeled in nature, so it is
imperative that the feature vector captures both the spatial and the
temporal relations between the channels. Here, I propose a deep layered
convolutional autoencoder with skip connections, which encodes the given EEG
signal into a 1-dimensional representation. This 1-dimensional encoded
feature vector serves as the input vector to a classification model.

4.2.3 Materials and Method


Let the input EEG signal belonging to the class \( C_j \in \{0, 1, 2, 3\} \)
be denoted by x. The dimension of the vector x will be \( c \times t \), c
being the number of channels and t the number of time steps. This input
vector x serves as the input to the autoencoder and is to be reconstructed.
The autoencoder is an 11-layer-deep symmetric network, with the first 5
layers being 1D convolutions used to encode the input and the last 5 being
transposed convolutional layers used to recreate the input data. The
non-linearity used here is the parametric ReLU (PReLU). The decoder part of
the network consists of transposed convolutions, which are similar to
deconvolutions: a transposed convolution swaps the forward and backward
passes of a convolution, so instead of shrinking its input, it expands it.
Let \( y_i \) denote the hidden state after the i-th convolution during the
encoding operation, and \( \hat{y}_i \) denote the corresponding hidden
state during the decoding stage, where \( y_i \) and \( \hat{y}_i \) have
the same shape. The skip connection adds the encoding state \( y_i \) to the
hidden decoded state \( \hat{y}_i \). The overall architecture is visualized
in Figure 10, where the blue boxes represent the vectors after convolutions
and the green boxes represent deconvolutions; the size of each box
represents the size of the vector. On top of the encoded signatures, I have
explored 3 classification techniques, namely a support vector classifier
(SVC), GRUs, and an ANN; an analysis of each method follows.

Figure 10: A high-level structure of the Encoder-Decoder Architecture
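
A sketch of such an encoder-decoder with additive skip connections (PyTorch;
the channel progression is an illustrative assumption, chosen so the
bottleneck is a single [1 x 313] signature):

import torch
import torch.nn as nn

class SkipAutoencoder(nn.Module):
    # Symmetric 1D conv autoencoder; the bottleneck is the 1-D signature.
    def __init__(self, chans=(22, 16, 8, 4, 2, 1)):
        super().__init__()
        self.enc = nn.ModuleList([
            nn.Conv1d(chans[i], chans[i + 1], kernel_size=5, padding=2)
            for i in range(5)])
        self.dec = nn.ModuleList([
            nn.ConvTranspose1d(chans[5 - i], chans[4 - i], kernel_size=5, padding=2)
            for i in range(5)])
        self.act = nn.PReLU()

    def forward(self, x):                 # x: (N, 22, 313)
        states = []
        for layer in self.enc:
            x = self.act(layer(x))
            states.append(x)
        code = x                          # (N, 1, 313): the encoded signature
        for i, layer in enumerate(self.dec):
            x = self.act(layer(x))
            if i < 4:                     # skip connection: add the mirrored
                x = x + states[3 - i]     # encoder state of the same shape
        return code, x                    # signature and reconstruction

model = SkipAutoencoder()
code, recon = model(torch.randn(8, 22, 313))
loss = nn.functional.mse_loss(recon, torch.randn(8, 22, 313))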

4.2.4 Capturing the 1-dimensional signature using gated recurrent units
A gated recurrent unit (GRU) is quite similar to an LSTM and similarly
handles the vanishing or exploding gradient problem by using gating. The
main place where GRUs actually differ is that they manage the throughput of
past information and the decision to update via a single gate. Given that h
is the current hidden state of the cell, and that the biases, input weights,
and recurrent weights are given by b, U, and W respectively, the current
state is updated by

\[ h(t) = u(t-1)\, h(t-1) + \big(1 - u(t-1)\big)\, \sigma\Big( b + \sum U\, x(t-1) + \sum W\, r(t-1)\, h(t-1) \Big) \]

where x represents the input, t the time step, u the update gate, and r the
reset gate, which decides the amount of past information allowed to flow.
The amount of information allowed to pass is given by

\[ u(t) = \sigma\Big( b^{u} + \sum U^{u} x(t) + \sum W^{u} h(t) \Big) \]

The reset and update gates can individually ignore parts of the state
vector. The overall architecture consists of a GRU unrolled over the 313
timesteps of the signature, with a hidden size of 512, followed by a linear
layer for classification. The encoded data is first normalized using layer
normalization and then fed into the network.
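
A sketch of this classifier (PyTorch; the 313-step signature is fed one
scalar per timestep, consistent with the description above):

import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, steps=313, hidden=512, classes=4):
        super().__init__()
        self.norm = nn.LayerNorm(steps)           # layer-normalize each signature
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, classes)

    def forward(self, sig):                       # sig: (N, 313) signatures
        x = self.norm(sig).unsqueeze(-1)          # -> (N, 313, 1)
        _, h = self.gru(x)                        # h: (1, N, hidden)
        return self.fc(h.squeeze(0))              # -> (N, classes) logits

logits = GRUClassifier()(torch.randn(8, 313))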

4.2.5 Using SVMs for Classification of the 1D signature


The idea of the support vector classifier is to find the differentiating
hyperplanes that act as decision boundaries in D-dimensional space. In this
case, the dimension is 313, as the shape of our representative vector is
[1 × 313]. The classification of a data point depends on its position
relative to the decision hyperplane. For the kernel trick, the RBF kernel
was used.
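
With scikit-learn, this classifier reduces to a few lines (a sketch with
random stand-in data in place of the real encoded signatures):

import numpy as np
from sklearn.svm import SVC

# Stand-ins for the encoded signatures: 52 training samples per class,
# each a [1 x 313] vector.
X_train = np.random.randn(208, 313)
y_train = np.repeat([0, 1, 2, 3], 52)
clf = SVC(kernel="rbf")            # RBF kernel trick
clf.fit(X_train, y_train)
preds = clf.predict(np.random.randn(80, 313))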

Figure 11: A plot across timesteps showing the MSE between the original and
reconstructed signals.

4.2.6 Using Artificial Neural Networks (MLP) for classification


Here, the humble feed-forward neural network can prove to be a powerful
classifier and needs no introduction. The model consists of a 5-layer
network with PReLU as the activation and SoftMax at the end. The basic block
can be formalized as

\[ y = Wx + b \]

\[ \mathrm{out} = \mathrm{PReLU}(y) \]
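
A sketch of the 5-layer network (PyTorch; the hidden widths are illustrative
assumptions, not values from the text):

import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(313, 256), nn.PReLU(),
    nn.Linear(256, 128), nn.PReLU(),
    nn.Linear(128, 64),  nn.PReLU(),
    nn.Linear(64, 32),   nn.PReLU(),
    nn.Linear(32, 4),    nn.Softmax(dim=-1),
)
probs = mlp(torch.randn(8, 313))   # -> (8, 4) class probabilities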

4.2.7 Training Paradigm


For every model discussed above, the motor imagery signals corresponding to
each class were discretized, giving a signal vector of shape [22 × 313].
The autoencoder was trained with the SGD optimizer at a learning rate of
1e-3, and all the other above-mentioned methods were trained with Adam as
their optimizer, also at a learning rate of 1e-3.

4.2.8 Results
All the above methods rely on the quality of the data reconstruction and on
how distinct the signatures of one class are from the others. The
autoencoder achieves a mean squared error loss of 0.004 on the test dataset,
thus achieving a satisfactory quality of reconstructed data. The plots in
Figure 11 show the difference between the generated and the original data
across all 22 EEG channels over the 313 timesteps present; these plots were
generated from randomly chosen data in the test dataset. Figure 12
visualizes the signatures obtained for the various classes. As seen from the
plots, the one-dimensional feature vectors are distinct and achieve their
peaks at different times and by significantly different amounts, so a good
decision boundary can be expected.

Figure 12: Signatures obtained; from the top left in clockwise direction:
class 769, class 770, class 771, class 772

Subject Accuracy
Subject 1 62.5
Subject 2 60
Subject 3 65
Subject 4 65
Subject 5 55
Subject 6 79
Subject 7 70
Subject 8 63
Subject 9 70

Table 1: Accuracy Results of the ANN classifier

Subject Accuracy
Subject 1 67.5
Subject 2 60
Subject 3 65
Subject 4 65
Subject 5 70
Subject 6 67.5
Subject 7 70
Subject 8 63
Subject 9 72.5

Table 2: Accuracy Results of the SVM classifier

Subject Accuracy
Subject 1 40
Subject 2 35
Subject 3 50
Subject 4 23
Subject 5 45
Subject 6 23
Subject 7 40
Subject 8 39
Subject 9 50

Table 3: Accuracy Results of the GRU based classifier


The results are presented in Table 1 for the ANN, Table 2 for the SVM, and
Table 3 for the GRU-based classifier.

4.3 Geometrical Deep Learning for Motor Imagery Classification
4.3.1 Abstract
Classification of EEG signals for motor imagery has mainly been treated as a
time series classification problem, where the classification is based on the
temporal and spatial patterns appearing along the timesteps. This paper
explores an unconventional method for motor imagery classification where,
instead of treating it as a time series, the EEG electrodes spread across
the scalp are modeled as a graph data structure in which each node
corresponds to the signal generated over time, making the data non-Euclidean
in nature.

4.3.2 Introduction
Graph neural networks are among the most powerful deep learning techniques,
able to work on non-Euclidean data as well as in Euclidean space if modelled
correctly. Many applications such as social-network analysis, graph
analysis, point cloud segmentation, and molecular structures cannot be
represented as a vector and have to be modelled as a graph to correctly
capture the relations amongst all the elements in the data. Graph neural
networks are also useful for manifold classification. EEG signals can be
visualized in the same way, with every EEG channel connected to the others
by edges. Graph convolutional networks can capture, at once, the structure
of the channels spread across the EEG headset and the signals being
generated throughout time.

4.3.3 Materials and Method


A graph can be formalized as \( G = (V, E) \), where V is the set of
vertices and E the set of edges of the graph. Graph networks work by
learning node embeddings, taking into consideration not only each node
itself but also its neighborhood. The neighborhood of a node v is defined as

\[ N(v) = \{\, u \in V \mid (v, u) \in E \,\} \]

Figure 13: Topological graphs of the various classes

The graph constructed from the EEG signals has node attributes
\( X \in \mathbb{R}^{n \times d} \), where n denotes the number of nodes and
each attribute lies in the d-dimensional real space. In our case, the node
feature matrix is of shape [22 × 313], where each node's feature vector is
the signal corresponding to that channel. The graph can also have edge
attributes, and in this case the edge attribute is the correlation between
the channels. For different motor imagery events, the correlations among the
different channels are expected to differ, signifying the influence of one
channel on its neighbors and also capturing which channels are "activated"
during the motor imagery event.
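
One plausible construction consistent with this description, assuming
PyTorch Geometric is available (a fully connected graph whose directed edges
carry the channel correlations; the fully connected topology is an
assumption, not stated in the text):

import numpy as np
import torch
from torch_geometric.data import Data

def eeg_to_graph(trial):                        # trial: (22, 313) array
    x = torch.tensor(trial, dtype=torch.float)  # node features: raw signals
    corr = np.corrcoef(trial)                   # (22, 22) channel correlations
    src, dst = np.nonzero(~np.eye(22, dtype=bool))   # all off-diagonal pairs
    edge_index = torch.tensor(np.stack([src, dst]), dtype=torch.long)
    edge_attr = torch.tensor(corr[src, dst], dtype=torch.float).unsqueeze(-1)
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

graph = eeg_to_graph(np.random.randn(22, 313))  # 22 nodes, 462 weighted edges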

4.3.4 Training Paradigm


Out of the 72 samples available per class, 52 were chosen for training and
the remaining 20 were taken for testing. The optimizer was chosen as Adam
with a cyclic learning rate varying from 15 to 2. The results below present
the average accuracy obtained after 5-fold validation; Table 4 reports the
accuracy obtained using the graph-based approach.

Subject Accuracy
Subject 1 70
Subject 2 82.3
Subject 3 82.5
Subject 4 75
Subject 5 75
Subject 6 77.5
Subject 7 81
Subject 8 77
Subject 9 81

Table 4: Accuracy results of the graph based neural network

5 Conclusion
During the course of six months, I studied a wide variety of deep learning
techniques for the classification of EEG signals. To classify them broadly:
the residual convolutional method was mostly inspired by the field of
computer vision, specifically by video classification, while learning
representations of the EEG signals was inspired by the domain of natural
language processing, whose word2vec models opened up new avenues. This got
me thinking that, for a network to recreate data from the same class, there
has to be a distinctive signature for that class, and classifying that
signature would be much easier. While studying Riemannian techniques and
manifold learning, I read about how non-Euclidean data can be distinguished
using the sophisticated field of geometrical learning. This gave me the idea
of capturing the EEG data as a graph and "viewing" the whole topology at
once. The main objective of all the methodologies adopted was to avoid
manual feature selection and pre-processing as much as possible, with the
ideology of "let the network figure it out." There remain many unexplored
avenues even in the above-mentioned approaches; for example, the time2vec
model by Facebook research is more sophisticated at representation learning
than the autoencoder method described here. In the residual convolutional
architecture, the receptive field can be increased by stacking more feature
maps, which could further improve the performance of the model. The balance
between the complexity of the models and the amount of data available was
handled tactfully, for example by adjusting the receptive field, and the
results obtained were satisfactory; some of the accuracies obtained on this
particular dataset were higher than those of the other methodologies I have
read about. On a personal note, the previous six months have been full of
learning and exploring new concepts in machine learning, the places where
they can be applicable, and how they could be implemented in the field of
time series classification. I will continue to find and implement relevant
techniques as and when I come across them in the field of motor imagery
classification. My sincere thanks for providing me with this wonderful
opportunity.

