EI2023 Habib Deep Learning Based Speech Emotion Recognition For Parkinson Patient
Habib Khan¹, Mohib Ullah², Fadi Al-Machot³, Faouzi Alaya Cheikh², Muhammad Sajjad²

¹ Islamia College University Peshawar, 25000 Peshawar, Pakistan
² Norwegian University of Science and Technology, 2815 Gjøvik, Norway
³ Faculty of Science and Technology (REALTEK), Norwegian University of Life Sciences (NMBU), 1430 Ås, Norway
2.1. Pre-processing

Data preprocessing is a crucial step in preparing the data to achieve high accuracy. In the proposed framework, preprocessing starts by taking the audio data from the dataset. An adaptive wavelet thresholding function is introduced to enhance speech quality and the accuracy of automatic speech recognition (ASR). The adaptive wavelet-based threshold [19] cleans the audio signals of background noise, unnecessary information, and silent portions [20]. Here, we use a direct-relation approach to determine the correlation between energy and amplitude in voice signals. The energy-amplitude connection states that the energy carried by a wave is correlated with its amplitude: the amplitude of a wave is a measure of its energy level, with larger values indicating more powerful waves and smaller values indicating weaker ones. The amplitude of a wave is defined as the largest displacement by which a particle is moved away from its initial position. This connection justifies using the amplitude to filter noise out of audio signals. The procedure consists of three steps:

1. Read the audio file step by step at a sampling rate of 16,000 Hz.

2. Based on the energy-amplitude connection, calculate the maximum amplitude of each segment using Equation (1) and pass it through an appropriate threshold to filter out the noise and retain the salient portions.

3. Reconstruct the signal at the same sampling rate to remove the noise and inaudible portions of the original audio file.

D = A × sin(2 × π × f × t)    (1)

In Equation (1), D represents the particle's displacement, f is the frequency with respect to time t, and A denotes the signal's peak amplitude. Figure 2 depicts a block schematic of the pre-processing.

2.2. Generating Spectrograms

Spectrograms show how strong a signal is at various frequencies. To visually depict frequency information over time intervals in voice signals, the short-time Fourier transform (STFT) is employed: the STFT chops a longer voice signal into equal-length segments (frames) and then applies the fast Fourier transform (FFT) to each frame to obtain its Fourier spectrum. The x-axis of a spectrogram depicts time, while the y-axis shows the frequencies over a short interval of time. Spectrograms improve SER performance because they contain a wealth of information and discriminative features; that is why we use spectrograms instead of feeding the voice signals directly to a 1D CNN. The basic idea is to extract high-level discriminative features from the voice signals, and spectrograms are well suited for extracting such visual features of speech. In a speech signal, the spectrogram S(t, f) comprises numerous frequencies (f) over different times (t). The spectrograms were then fed to the proposed DL model for training. Figure 3 shows samples of the spectrograms extracted from each audio file using the STFT.

Fig. 3: Samples of the 2D generated spectrograms from the speech data.

Table 1: Hierarchical layering structure of the proposed CNN model.

Layers      | Kernel Size | No. of Neurons | Activation f(.) | Dropout rate
Conv2D_1    | 9×1         | 10             | ReLU            | -
Conv2D_2    | 3×1         | 10             | ReLU            | -
BatchNorm_1 | -           | -              | -               | -
MaxPool_1   | 2×1         | -              | -               | -
Conv2D_3    | 3×1         | 20             | ReLU            | -
Conv2D_4    | 3×1         | 20             | ReLU            | -
MaxPool_2   | 2×1         | -              | -               | -
BatchNorm_2 | -           | -              | -               | -
Conv2D_5    | 3×1         | 20             | ReLU            | -
MaxPool_3   | 2×1         | -              | -               | -
BatchNorm_3 | -           | -              | -               | -
Flatten     | -           | -              | -               | -
Dense_1     | -           | -              | -               | -
BatchNorm_4 | -           | -              | -               | -
Dense_2     | -           | -              | -               | -
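The three pre-processing steps of Section 2.1 and the STFT-based spectrogram generation of Section 2.2 can be sketched as follows. This is a minimal illustration, assuming a simple hard amplitude threshold in place of the adaptive wavelet threshold of [19]; the `keep_ratio` constant and the frame/hop sizes are hypothetical choices, not values from the paper.

```python
import numpy as np

SAMPLE_RATE = 16000  # step 1: audio is read at a 16,000 Hz sampling rate

def denoise(signal, keep_ratio=0.1):
    """Steps 2-3: threshold based on the energy-amplitude connection.
    Eq. (1) gives D = A*sin(2*pi*f*t), so |D| never exceeds the peak
    amplitude A; samples far below A carry little energy (noise/silence).
    keep_ratio is a hypothetical tuning constant, not from the paper."""
    peak = np.max(np.abs(signal))                 # maximum amplitude A
    mask = np.abs(signal) >= keep_ratio * peak    # keep salient portions
    return np.where(mask, signal, 0.0)            # same sample rate as input

def spectrogram(signal, frame_len=512, hop=256):
    """Section 2.2: chop the signal into equal-length frames and apply
    the FFT to each frame; rows = frequency f, columns = time t."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # magnitude spectrum S(t, f)

# toy example: a 1 kHz tone with weak background noise
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
voice = 0.8 * np.sin(2 * np.pi * 1000 * t)
noisy = voice + 0.01 * np.random.default_rng(0).standard_normal(len(t))
S = spectrogram(denoise(noisy))
strongest_bin = int(S.mean(axis=1).argmax())  # frequency resolution: 16000/512 Hz per bin
```

In the paper's pipeline, such magnitude arrays would then be rendered as fixed-size (e.g., 100 × 100) spectrogram images before being fed to the CNN.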
Table 2: Class-wise precision, recall, and F1 score on raw spectrograms (top) and clean spectrograms (bottom).

Raw spectrograms:
Emotion Nature  | Precision | Recall | F1 Score
Anger           | 0.81      | 0.62   | 0.70
Happy           | 0.91      | 1.00   | 0.95
Neutral         | 0.67      | 0.88   | 0.76
Sad             | 0.98      | 0.84   | 0.90
Weighted Avg.   | 0.84      | 0.83   | 0.83
Unweighted Avg. | 0.84      | 0.82   | 0.82

Clean spectrograms:
Emotion Nature  | Precision | Recall | F1 Score
Anger           | 0.98      | 0.99   | 0.99
Happy           | 0.98      | 1.00   | 0.99
Neutral         | 0.99      | 0.96   | 0.97
Sad             | 1.00      | 1.00   | 1.00
Weighted Avg.   | 0.99      | 0.99   | 0.99
Unweighted Avg. | 0.99      | 0.99   | 0.99
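As a sanity check on how the two kinds of averages above are formed: the unweighted (macro) average treats every emotion class equally, while the weighted average scales each class by its test support. A minimal sketch using the raw-spectrogram precision column; the support counts passed to `weighted_avg` would be the per-class test-set sizes, which the paper does not report, so the example only exercises the equal-support case.

```python
# per-class precision on raw spectrograms, taken from Table 2
precision = {"Anger": 0.81, "Happy": 0.91, "Neutral": 0.67, "Sad": 0.98}

def unweighted_avg(scores):
    """Macro average: every class counts equally, regardless of support."""
    return sum(scores.values()) / len(scores)

def weighted_avg(scores, support):
    """Support-weighted average: classes with more test samples count more."""
    total = sum(support.values())
    return sum(scores[c] * support[c] / total for c in scores)

macro_p = round(unweighted_avg(precision), 2)  # matches the 0.84 in Table 2
```

With equal supports the two averages coincide; the small gap between the weighted and unweighted rows in Table 2 therefore indicates a mildly imbalanced test split.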
CNN model for classification. The first layer, which extracts features from the input image, is the convolution layer. It executes an operation known as "convolution": by learning image features over small squares of input data, convolution preserves the relationship between pixels. It is an operation with two inputs: an image matrix and a kernel (filter). The ReLU function is employed as the activation function in the convolution layers to introduce nonlinearity and keep the feature values within a specific range. After the convolutional layers, pooling layers (PL) are applied to extract the most activated features from the feature maps and to reduce the model's computational complexity. Moreover, dropout and batch normalization layers are employed to avoid overfitting and to normalize the input values. After feature extraction, CNN layers are usually followed by fully connected (FC) layers, in which every neuron in the input layer is linked to every neuron in the next layer. The FC layers are primarily used to extract global features, which are then fed into a SoftMax classifier to determine the probability of each class. A CNN arranges all these convolutional and pooling layers, followed by the FC layers, in a hierarchical order; finally, a SoftMax layer maps this representation to the probabilities for the final classification task. We suggest a lightweight CNN model for SER of PD patients, shown in Figure 2. The network receives as input 100 × 100 spectrogram images produced from the emotional speech signals. The proposed framework mainly consists of five convolution layers and two dense layers. Technical details of the proposed model are given in Table 1.

4. EXPERIMENTAL RESULTS

This section presents the evaluation of the suggested approach. First, we go through the experimental setup, then the dataset utilized for the model's training and assessment, followed by the evaluation metrics, ablation analysis, and real-time testing. All of these stages are detailed in the sections below.

4.1. Experimental Setup

All experiments were carried out in a Python 3.7 virtual environment installed on a PC running Windows 10, equipped with a GTX GeForce TITAN 1070 graphics processing unit (GPU) with a memory of GB, an Intel® X5560 processor, and a clock speed of 2.80 GHz. Further, different frameworks and libraries were utilized during training: the proposed framework was trained with TensorFlow-GPU as the backend and Keras-GPU as the frontend, versions 2.0.0 and 2.34, respectively. The categorical cross-entropy loss function and the Adam optimizer with an initial learning rate of 0.0001 were used to compute the model's loss and update its weights during training, respectively. Moreover, we trained the proposed model with a mini-batch size of 30 for 100 epochs, which took almost two hours. Fig. 4 shows the training and loss graphs of the proposed model.

4.2. Dataset

It is generally believed that a lack of training data causes poor prediction. The primary barrier is the availability and accessibility of well-suited datasets for DL tasks. In domains like image or speech recognition, there are databases with millions of samples, such as ImageNet with 14 million samples and Google AudioSet with 2.1 million samples. However, several SER databases contain only a limited number of samples. Furthermore, these datasets are often made up of discrete emotional speech utterances; this is not always the case in real-life circumstances, since overlaps often occur across a speaker's speech flow. As a result, models built on discrete utterances would perform poorly in continuous speech environments. We can also create a superset by merging some dataset classes; e.g., the excited and happy features are almost the same in some acted datasets. Extensive research has been conducted to develop intelligent SER-based systems for PD, but the datasets are not available due to patients' privacy. However, the IEMOCAP dataset is a multimodal, multi-speaker database collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC) [21]. The dataset is divided into five sessions, and each session has two actors who record a script in different emotions such as anger, happiness, disgust, sadness, fear, neutral, and excitement. Since our target is four desired classes, we used only four emotions (anger, neutral, happy,
and sad) for the model's training. Our research aims to design a system that can accurately recognize the common speech emotions of PD patients, such as anger, happiness, neutral, and sadness. The given four classes contain 4180 speech files in total, and each class consists of 1045 voice files. We have taken 500 samples from the excited class of the same dataset and merged them into the happy class, due to the similar characteristics of both classes. A happy class is essential for PD patients because the only psychological treatment for PD patients is to make them happy, as they are always sad.

Table 4: Performance of the proposed CNN model in the spectrogram image-size experiments.

Image size | Training Accuracy | Validation Accuracy | Model size
50 × 50    | 88.99             | 87.25               | 2.5 MB
60 × 60    | 94.93             | 94.40               | 3.0 MB
70 × 70    | 96.45             | 95.83               | 3.5 MB
80 × 80    | 97.29             | 97.50               | 4.0 MB
90 × 90    | 97.64             | 98.09               | 4.5 MB
100 × 100  | 98.84             | 98.69               | 5.0 MB

After doing many experiments, we found that our proposed model achieved the best accuracy on images of size 100 × 100. Table 4 shows the statistical details of the experiments based on the size of the input spectrogram images.
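The SoftMax output layer and the categorical cross-entropy loss minimized by the Adam optimizer (Section 4.1) can be sketched in NumPy. The logits and the target below are made-up values for illustration, not outputs of the actual model:

```python
import numpy as np

def softmax(logits):
    """Map the FC-layer outputs to one probability per emotion class."""
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(probs, one_hot):
    """Loss the Adam optimizer minimizes during training."""
    return -np.sum(one_hot * np.log(probs + 1e-12))

# four classes: anger, happy, neutral, sad (hypothetical logits)
logits = np.array([2.0, 0.5, 0.1, -1.0])
probs = softmax(logits)
target = np.array([1.0, 0.0, 0.0, 0.0])  # true class: anger
loss = categorical_cross_entropy(probs, target)
```

As the predicted probability of the true class approaches 1, this loss approaches zero; it is the quantity the training curves in Fig. 4 track.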
Fig. 4: Accuracies and losses of the proposed model.

4.3. Evaluation

Experiments were conducted using the IEMOCAP dataset to evaluate the performance of the proposed technique for the emotion recognition of PD patients. Two sets of experiments were conducted, based on raw and clean spectrograms and on image-based validation of the proposed CNN. The results are discussed in detail in the following sub-sections.

4.4. Raw and Clean Spectrograms

We compared the results of raw spectrograms and clean spectrograms using the data of the IEMOCAP dataset. The proposed model achieved accuracies of 82% and 99% on the raw and clean spectrograms, respectively. Table 2 illustrates the results based on the raw and clean spectrograms in detail.

5. CONCLUSION

This study aimed to develop an SER system for PD patients. The SER literature confronts many problems in improving recognition accuracy while also reducing the model's overall computational complexity. In response to these problems, we developed a lightweight CNN-based ISER-Net with salient feature extraction to increase the accuracy and minimize the overall computational cost of the SER model. To reduce noise and silent signals in the voice data, we employed a dynamic adaptive threshold approach, which improved the performance of the network. The enhanced speech signals are then transformed into spectrograms to improve the accuracy of the proposed model while reducing its computational complexity. The suggested model's performance is assessed on the IEMOCAP dataset. In the future, we will explore attention-based networks and employ more datasets to build more robust networks. The model will be deployed on edge devices for real-time use to assist doctors in the clinical treatment of parkinsonians.