Feature Extraction Techniques Comparison For Emotion Recognition Using Acoustic Features
ISSN No:-2456-2165
IV. RESULTS
the minimum and maximum value along with the mean. The NumPy library is utilized for this function. A KNN algorithm is used for classification. After training the model, the results are stored and the same procedure is repeated for a new combination of the dataset, the feature extraction technique and the output selection type.

Feature    Output selection   Min accuracy   Max accuracy   Mean accuracy
mel        avg                0.5259         0.7430         0.6345
mel        minmaxavg          0.5556         0.6890         0.6223
mfcc       avg                0.6741         0.8343         0.7542
mfcc       minmaxavg          0.6667         0.8287         0.7477
tonnetz    avg                0.3556         0.5885         0.4720
tonnetz    minmaxavg          0.2741         0.5512         0.4126
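The two output selection types can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the frame-matrix shapes, the random data, and the `n_neighbors` value are assumptions; only the KNN classifier and the avg / min-max-avg reductions come from the text.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def reduce_frames(frames, mode="avg"):
    """Collapse a (n_frames, n_coeffs) feature matrix to one vector.

    mode="avg"       -> per-coefficient mean
    mode="minmaxavg" -> per-coefficient min, max and mean, concatenated
    """
    if mode == "avg":
        return frames.mean(axis=0)
    return np.concatenate([frames.min(axis=0),
                           frames.max(axis=0),
                           frames.mean(axis=0)])

# Illustrative data: 10 clips, each with 40 frames of 13 coefficients.
rng = np.random.default_rng(0)
clips = [rng.normal(size=(40, 13)) for _ in range(10)]
labels = ["angry", "sad", "neutral", "happy", "angry",
          "sad", "neutral", "happy", "angry", "sad"]

X = np.stack([reduce_frames(c, mode="minmaxavg") for c in clips])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
pred = knn.predict(X[:2])  # array of 2 predicted emotion labels
```

With "minmaxavg" each 13-coefficient clip becomes a 39-dimensional vector, which is what allows the same KNN model to be retrained per output selection type.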
The combined audio dataset contains different categories of emotions, out of which 4 are selected to train the model (Angry, Sad, Neutral, Happy). These 4 emotions are chosen because the dataset contains a high number of records for them.

As is evident from the table above, the top 3 performing feature extraction techniques are Mel Spectrogram Frequency, MFCC and Spectral Contrast. Further, more tests are carried out to check how the model performs with different combinations of these 3 features.
Here we can see that the model with the CORPUS JL dataset and all three features (mel, mfcc and contrast) performs the best of all, with a combined accuracy of 0.9128. It should be noted, though, that this accuracy is not much higher than that of the individual features alone. A similar thing can be seen with the RAVDESS dataset: MFCC alone, the best single feature with an accuracy of 0.9017, does almost as good a job as all the features combined. Without a significant improvement in accuracy, it cannot be said that a model combining more than one feature performs better than its single-feature counterparts. It should also be noted that MFCC generally works well with short-duration data, and the audio clips used here are 1 to 2 seconds long. Hence, the other features, or a combination of multiple features, may show different results for longer audio clips.

As seen in the confusion matrix plotted in Fig. 2, the model sometimes cannot predict correctly between sad and neutral. This may be because both the sad and neutral emotions have a lower tone, resulting in weak features.
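A per-class report and confusion matrix like those shown in this section are the standard scikit-learn evaluation outputs; a minimal sketch, using illustrative labels rather than the paper's data:

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["angry", "happy", "neutral", "sad"]
# Illustrative labels only; the paper's test set has 48 clips per class.
y_test = ["angry", "happy", "neutral", "sad", "neutral", "sad"]
y_pred = ["angry", "happy", "neutral", "neutral", "neutral", "sad"]

# Per-class precision, recall, f1-score and support.
print(classification_report(y_test, y_pred, labels=labels))

# Rows are true labels, columns predictions; the off-diagonal
# sad-predicted-as-neutral cell reflects the confusion noted above.
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)
```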
Fig. 2. Confusion Matrix

The classification report on the test set is shown below.

              precision    recall  f1-score   support

       angry       0.98      0.98      0.98        48
       happy       0.92      0.92      0.92        48
     neutral       0.81      0.88      0.84        48
         sad       0.89      0.81      0.85        48

    accuracy                           0.90       192
   macro avg       0.90      0.90      0.90       192
weighted avg       0.90      0.90      0.90       192

V. CONCLUSION

Using just the intensity and the variance of the sound waves in human speech, the underlying emotions can be understood to some extent without linguistic processing. This means that acoustic features can act as an important factor in Speech Emotion Recognition (SER) alongside Natural Language Processing (NLP). Even without knowing the actual words, the sound waves can be analyzed to understand the emotion. This result can be developed further by classifying the words in the speech based on their sentiments, positive and negative; this combination would yield much better results.
REFERENCES