
SPEECH EMOTION RECOGNITION

RAYA NADLIRA NURUL F - 2301172031

MEDIA INFORMATICS
TELKOM UNIVERSITY
2020
08/17/2022 1
BACKGROUND

Emotions: Sad, Happy, Disgust, Fear, Boredom, Neutral, etc.

Why is it difficult for an engine to guess emotions (behavioural modalities)?
1. The speech features of people differ across physiological states and neuropsychiatric statuses [4].
2. Features indicating different behavioural states may overlap, and there may be multiple sets of features expressing the same emotional state [5].

In this study, we use combined feature extraction: the Discrete Wavelet Transform (DWT) [6] plus several other features, i.e., zero-crossing rate, energy, peak, and the fast Fourier transform (FFT).

Function of Speech Emotion Recognition
OBJECTIVE

• THIS RESEARCH PRESENTS AN EMOTION RECOGNITION APPROACH AIMED AT IMPROVING THE CORRECT RECOGNITION RATE FOR HUMAN EMOTIONS [1].
REFERENCE TRACING

Author: Wei-Hua Cao et al. (2017)
Focus: Speaker-independent Speech Emotion Recognition Based on Random Forest Feature Selection Algorithm
Dataset: CASIA corpus: 4 people (2 male, 2 female), six basic emotions (surprise, happy, sad, angry, fear, and neutral). Features: pitch, short-time energy, zero-crossing rate, first-order derivative, and second-order derivative, extracted with OpenSMILE.
Accuracy: Achieves 78.6%, 2.2% higher than using Spearman.

Author: Ingo et al. (2018)
Focus: Utilizing Psychoacoustic Modeling to Improve Speech-Based Emotion Recognition
Dataset: AVIC (3 emotions), EmoDB (7 emotions), SUSAS (5 emotions). Features: F0, F0 envelope, LSPs, MFCC 0-12, ZCR, and VoiceProb, extracted with OpenSMILE.
Accuracy: Achieved performance gains between 0.94% and 4.86% absolute.

Author: Jian et al. (2018)
Focus: Speech Emotion Recognition Using Semi-supervised Learning with Ladder Networks
Dataset: Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, 10 people, 4 emotions (angry, happy, sad, neutral). Features extracted with OpenSMILE.
Accuracy: The ladder network can achieve 0.591 with different models.
REQUIREMENT SPECIFICATION

• DATASET
  • RYERSON AUDIO-VISUAL DATABASE OF EMOTIONAL SPEECH AND SONG (RAVDESS) DATASET
  • 24 ACTORS
  • SPEECH WITH EIGHT EMOTIONAL INTENTIONS (NEUTRAL, CALM, HAPPY, SAD, ANGRY, FEARFUL, SURPRISE, AND DISGUST)
  • THIS STUDY USES AUDIO UTTERANCES WITH NORMAL INTENSITY.
• FEATURE EXTRACTION FOR SPEECH EMOTION RECOGNITION
  • USING DISCRETE WAVELET TRANSFORM [6]

Table 1. The Proposed Spectral Features
Spectral Features: Peak, Energy, Zero-Crossing Rate (ZCR), and Fast Fourier Transform (FFT)
DESIGN SYSTEM OF SPEECH-EMOTION RECOGNITION

Input Speech → Remove Silence Area → Discrete Wavelet Transform → Frame Segmentation → Feature Extraction → Emotion Classification → Recognized Emotion
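The "Remove Silence Area" stage of the pipeline can be sketched as a simple frame-energy gate. The frame length and threshold below are illustrative assumptions, not values taken from the thesis:

```python
import numpy as np

def remove_silence(signal, frame_len=256, threshold=1e-3):
    """Keep only frames whose mean absolute amplitude reaches `threshold`.

    A minimal stand-in for the 'Remove Silence Area' stage; the actual
    silence criterion used in the thesis may differ.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal) // frame_len * frame_len          # drop the ragged tail
    frames = signal[:n].reshape(-1, frame_len)
    voiced = frames[np.mean(np.abs(frames), axis=1) >= threshold]
    return voiced.ravel()

# Example: a silent stretch followed by a constant "voiced" part.
speech = np.concatenate([np.zeros(512), 0.5 * np.ones(512)])
trimmed = remove_silence(speech, frame_len=256)
```

The remaining stages (DWT, segmentation, features, classification) are sketched on the slides that follow.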
INPUT SPEECH SIGNAL EMOTION

Figure 1. Neutral Emotion Signal
Figure 2. Sad Emotion Signal
Figure 3. Angry Emotion Signal

CONVERTING THE SPEECH SIGNAL INTO DISCRETE WAVELET TRANSFORM (DWT)

Wavelet type: Haar

Figure 4. DWT of Neutral Emotion
Figure 5. DWT of Sad Emotion
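A minimal sketch of the Haar decomposition, assuming the thesis keeps the approximation signal at each level (the PyWavelets library offers this via `pywt.wavedec`; plain NumPy is used here to stay self-contained):

```python
import numpy as np

def haar_approx(signal, level):
    """Approximation coefficients of a Haar DWT after `level` decompositions.

    Each step replaces the signal with scaled pairwise sums, halving its
    length; detail (difference) coefficients are discarded for brevity.
    """
    a = np.asarray(signal, dtype=float)
    for _ in range(level):
        if len(a) % 2:          # pad odd lengths by repeating the last sample
            a = np.append(a, a[-1])
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)
    return a

# One Haar level of a constant signal scales it by sqrt(2) and halves its length.
coeffs = haar_approx(np.ones(8), level=1)
```

Each additional level halves the length again, which is why the slides compare levels 8, 9, and 10 later on.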
FRAME SEGMENTATION

The wavelet signal is divided into five frames: Segment 1, Segment 2, Segment 3, Segment 4, and Segment 5.

Figure 7. Frame Segmentation of Emotion Signal
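The five-way split of Figure 7 can be done with `np.array_split`, which tolerates lengths that are not multiples of five; the five-segment count comes from the slides, while the example signal is made up:

```python
import numpy as np

# Hypothetical wavelet signal of 23 samples, split into the 5 frame segments.
wavelet_signal = np.arange(23.0)
segments = np.array_split(wavelet_signal, 5)
sizes = [len(s) for s in segments]   # leading segments absorb the remainder
```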
FEATURE

• ZCR
  THE RATE AT WHICH THE SIGNAL CHANGES FROM POSITIVE TO ZERO TO NEGATIVE, OR FROM NEGATIVE TO ZERO TO POSITIVE.
• PEAK
  PEAKS ARE DETECTED AS LOCAL MAXIMA. BASED ON FIGURE 8, THE NUMBER OF PEAKS IS 2 AND THE AVERAGE PEAK VALUES ARE [1, 1].

Figure 8. An Example of a Wavelet Signal

• ENERGY
  • AREA OF A TRAPEZOID = ((A + B) / 2) × H, WHERE A = BASE 1, B = BASE 2, H = HEIGHT.
  • ENERGY Ei = THE SUM OF THE TRAPEZOID AREAS UNDER THE WAVELET SIGNAL IN FRAME SEGMENT i, WHERE i = FRAME SEGMENT 1 … N.
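The three features above can be sketched as follows. The example signal is chosen to mimic Figure 8 (two unit-height peaks), and the unit-width trapezoid rule over |x| is one plausible reading of the energy formula on this slide, not a confirmed detail of the thesis:

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose sign differs."""
    s = np.sign(np.asarray(x, dtype=float))
    s[s == 0] = 1                  # count a touch of zero as one crossing
    return np.count_nonzero(np.diff(s)) / len(s)

def peaks(x):
    """Indices and values of strict local maxima."""
    x = np.asarray(x, dtype=float)
    idx = np.flatnonzero((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:])) + 1
    return idx, x[idx]

def energy(x):
    """Sum of unit-width trapezoid areas under |x| (assumed reading of the slide)."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.sum((x[:-1] + x[1:]) / 2.0)

# Signal shaped like Figure 8: two peaks of height 1.
sig = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0])
idx, vals = peaks(sig)             # 2 peaks with values [1, 1]
```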
FEATURE (CONT.)

• FOURIER TRANSFORM
  • DECOMPOSES A FUNCTION OF TIME INTO ITS FREQUENCIES.
  • FORMULA OF THE DISCRETE FOURIER TRANSFORM:
    H(k) = Σ_{n=0}^{N-1} h(n) e^(-2πikn/N), FOR 0 ≤ k ≤ N-1
    WHERE h(n) IS THE DISCRETE INPUT SERIES, H(k) IS THE FREQUENCY MAGNITUDE, AND N IS THE NUMBER OF DISCRETE INPUT SAMPLES.
  • THUS, THE FOURIER TRANSFORM VALUE OF EACH FRAME SEGMENT IS DETERMINED USING THE FORMULA
    f_n = max(H(k))
    WHERE n = FRAME SEGMENT 1 … N.
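Taking f_n = max(H(k)) per segment can be sketched with NumPy's FFT; using the magnitude |H(k)| is an assumption consistent with the slide's description of H(k) as the frequency magnitude:

```python
import numpy as np

def fft_feature(segment):
    """f_n = max(H(k)): the largest DFT magnitude of one frame segment."""
    return float(np.max(np.abs(np.fft.fft(segment))))

# A constant segment puts all its energy in the DC bin: H(0) = sum of samples.
value = fft_feature(np.ones(4))    # 4.0
```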
FEATURE EXTRACTION

• THE PEAK, ENERGY, AND FOURIER TRANSFORM FEATURES USE THE SEGMENTATION FRAMES TO ANALYZE INFORMATION, WHILE THE ZERO-CROSSING RATE FEATURE DOES NOT.
• THUS, A TOTAL OF 17 FEATURES IS EXTRACTED FOR EACH SPEECH SAMPLE, CONSISTING OF THE FOLLOWING VALUES:
  • ZERO-CROSSING RATE.
  • NP, THE NUMBER OF PEAKS OF THE SPEECH SIGNAL.
  • E1–E5, THE ENERGY OF EACH FRAME SEGMENT.
  • AP1–AP5, THE AVERAGE PEAK VALUE OF EACH FRAME SEGMENT.
  • F1–F5, THE FOURIER TRANSFORM VALUE OF EACH FRAME SEGMENT.
EMOTION CLASSIFICATION

• THE TRAINING SET USES 307 SAMPLES AND THE TEST SET USES 77 SAMPLES.
• THREE WIDELY USED CLASSIFICATION TECHNIQUES:
  • KNN: COMPUTED FROM A SIMPLE MAJORITY VOTE OF THE NEAREST NEIGHBORS OF EACH POINT. EACH OBJECT VOTES FOR ITS CLASS, AND THE CLASS WITH THE MOST VOTES IS TAKEN AS THE PREDICTION. THE CLOSEST POINTS ARE FOUND USING EUCLIDEAN DISTANCE.
  • RANDOM FOREST: AN ALGORITHM USED TO CLASSIFY LARGE AMOUNTS OF DATA. THE CLASSIFICATION IS DETERMINED BY VOTING AMONG THE TREES FORMED; THE WINNING CLASS IS THE ONE WITH THE MOST VOTES, CALLED THE MAJORITY VOTE.
  • NEURAL NETWORK: PROCESSES INFORMATION BASED ON HOW THE HUMAN BRAIN WORKS. THE CHARACTERISTICS OF A NEURAL NETWORK ARE SEEN IN THE PATTERN OF CONNECTIONS BETWEEN NEURONS, THE METHOD OF DETERMINING THE WEIGHT OF EACH CONNECTION, AND THE ACTIVATION FUNCTION.
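The KNN rule as described here (Euclidean distance, simple majority vote) can be sketched directly; the toy feature points and labels below are invented for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Majority vote among the k Euclidean-nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]

# Two made-up feature points per emotion; the query sits next to the "sad" pair.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y = ["sad", "sad", "angry", "angry"]
label = knn_predict(X, y, query=[0.0, 0.5], k=3)   # "sad"
```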
THE EMOTION CLASSIFICATION RESULT OF SPEECH USING KNN

True class \ Prediction   Neutral  Calm  Happy  Sad  Angry  Fearful  Disgust  Surprise
Neutral                      3       3     0     0     0      0        0        0
Calm                         2       2     2     0     0      0        0        0
Happy                        7       2     0     4     2      0        0        0
Sad                          1       2     2     0     0      1        0        0
Angry                        0       2     0     0     8      1        1        0
Fearful                      0       0     2     1     2      1        0        0
Disgust                      0       0     0     3     3      5        2        2
Surprise                     0       0     0     0     0      2        1        8

Table 3. Classification Result using KNN

Table 3 shows that only the happy and sad emotions cannot be recognized at all. Thus, the accuracy achieved is 31%.
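The 31% figure can be reproduced from Table 3: accuracy is the trace of the confusion matrix divided by the total number of test samples.

```python
import numpy as np

# Table 3 (KNN) confusion matrix; rows = true class, columns = prediction,
# ordered Neutral, Calm, Happy, Sad, Angry, Fearful, Disgust, Surprise.
cm = np.array([
    [3, 3, 0, 0, 0, 0, 0, 0],
    [2, 2, 2, 0, 0, 0, 0, 0],
    [7, 2, 0, 4, 2, 0, 0, 0],
    [1, 2, 2, 0, 0, 1, 0, 0],
    [0, 2, 0, 0, 8, 1, 1, 0],
    [0, 0, 2, 1, 2, 1, 0, 0],
    [0, 0, 0, 3, 3, 5, 2, 2],
    [0, 0, 0, 0, 0, 2, 1, 8],
])
accuracy = np.trace(cm) / cm.sum()   # 24 correct out of the 77 test samples
```

The same trace-over-sum calculation applies to Tables 4 and 5.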


THE EMOTION CLASSIFICATION RESULT OF SPEECH USING RANDOM FOREST

True class \ Prediction   Neutral  Calm  Happy  Sad  Angry  Fearful  Disgust  Surprise
Neutral                      6       0     3     0     1      0        0        0
Calm                         0      15     1     0     0      0        0        0
Happy                        0       0     6     1     0      0        0        0
Sad                          0       0     2     5     0      0        0        0
Angry                        0       0     0     0     8      0        0        0
Fearful                      0       0     0     0     0     13        0        0
Disgust                      0       0     0     0     0      0        6        0
Surprise                     0       0     0     0     0      0        2        8

Table 4. Classification Result using Random Forest

Table 4 shows that the angry and fearful emotions are all recognized correctly, while the other emotions are not fully recognized as themselves. Thus, the accuracy achieved is 87%.
THE EMOTION CLASSIFICATION RESULT OF SPEECH USING NEURAL NETWORK

True class \ Prediction   Neutral  Calm  Happy  Sad  Angry  Fearful  Disgust  Surprise
Neutral                     15       1     0     1     1      0        0        0
Calm                         1       6     0     0     0      0        0        0
Happy                        0       0    10     0     0      0        0        0
Sad                          0       0     0    13     0      0        0        0
Angry                        0       0     0     0     4      0        0        0
Fearful                      0       0     0     0     0      7        0        0
Disgust                      0       0     0     0     0      0        6        1
Surprise                     0       0     0     0     0      0        2        9

Table 5. Classification Result using Neural Network

Table 5 shows that the happy, sad, angry, and fearful emotions are all recognized correctly. Thus, the accuracy achieved is 90%.


EXPERIMENT RESULT OF EMOTION CLASSIFICATION

• BASED ON THE RESULTS IN TABLES 3-5, THE EMOTIONS CAN BE RECOGNIZED. THE PERFORMANCE ACCURACY OF KNN, RANDOM FOREST, AND NEURAL NETWORK ACHIEVES 31%, 87%, AND 90%, RESPECTIVELY, ON THE 10TH-LEVEL DWT.
• AMONG THE THREE CLASSIFIERS, KNN GETS THE LOWEST PERFORMANCE RESULTS.
• IN PREVIOUS RESULTS ON THE RAVDESS DATABASE, SUCH AS IN [1], THE AVERAGE ACCURACY WAS 79.4%.
• THE NEXT STEP IS TO CONDUCT EXPERIMENTS AT OTHER LEVELS OF THE DISCRETE WAVELET TRANSFORM (DWT) TO CLARIFY THE EMOTION RECOGNITION ACCURACY OF THE THREE CLASSIFIERS.
THE PERFORMANCE RESULT OF EACH LEVEL SIGNAL IN DISCRETE WAVELET TRANSFORM (DWT)

Accuracy of Emotion Classification (%)
Level of DWT   KNN   Random Forest   Neural Network
Level 8         39        91              74
Level 9         34        84              88
Level 10        31        87              90

Table 6. The Performance of the Three Classifiers on Each Level Signal in DWT

• Table 6 shows the performance of KNN, Random Forest, and Neural Network emotion classification at each DWT signal level. The neural network improves at each level (74%, 88%, and 90%, respectively), while the other two classifiers decline across the levels.
• This shows that neural networks have the best accuracy for emotion recognition.
CONCLUSION

1. THIS RESEARCH PRESENTS AN EMOTION RECOGNITION APPROACH AIMED AT IMPROVING THE RECOGNITION RATE FOR HUMAN EMOTIONS [1].

2. IN THIS STUDY, A PERSON'S EMOTIONAL STATE IS IDENTIFIED USING SEVERAL LEVELS OF DWT SIGNALS AND SEVERAL OTHER FEATURES, I.E., ZERO-CROSSING RATE, ENERGY, PEAK, AND FOURIER TRANSFORM.

3. KNN, RANDOM FOREST, AND NEURAL NETWORK CLASSIFIERS WERE ADOPTED TO CLASSIFY EMOTION. TABLES 3-5 SHOW THAT THE ACCURACY OF KNN, RANDOM FOREST, AND NEURAL NETWORK IS 31%, 87%, AND 90%, RESPECTIVELY, ON THE 10TH LEVEL OF DWT.

PROGRESS THESIS OF "SPEECH EMOTION RECOGNITION"

SUPERVISOR
1. Hertog Nugroho, Ph.D
   Advice from previous monitoring:
   1. Emotion classification using a Neural Network.
   2. Implementation of DWT signal levels to improve emotion recognition performance.
   Progress result: 1. Slide 17; 2. Slide 19

REVIEWER
1. Rimba Whidiana C, Ph.D
   Advice: Remove the word "novel" from the research purposes.
   Progress result: Slide 4
2. Dr. Ema R
   Advice: Choose the best classifier for emotion recognition.
   Progress result: Slides 16 and 17
