
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited, and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3196733

Confirming Customer Satisfaction with Tones of Speech

Yen Huei Ko 1, Ping-Yu Hsu 2, Yu-Chin Liu 3, and Po-Chiao Yang 4
1 Department of Business Administration, National Central University, Taoyuan 320317, R.O.C.
2 Department of Business Administration, National Central University, Taoyuan 320317, R.O.C.
3 Department of Information Management, Shih Hsin University, Taipei 116002, R.O.C.
4 Department of Business Administration, National Central University, Taoyuan 320317, R.O.C.

Corresponding author: Yen Huei Ko (e-mail: yensintong@gmail.com)

ABSTRACT As customer satisfaction explicitly leads to repurchase behavior, the level of customer satisfaction affects both sales performance and enterprise growth. Traditionally, measuring satisfaction requires customers to spend extra time filling out a post-purchase questionnaire survey. Recently, Automatic Speech Recognition (ASR) has been utilized to extract spoken words from conversations to measure customer satisfaction. However, as East Asian speakers tend to use vague words to express emotion, this approach has its limitations. To solve the problem, this study strove to complete the following tasks: a process was devised to collect customer voices expressing satisfaction together with verifiable ground truth; a data bank of 150 customer voices speaking in Mandarin was collected; the features of the voice data were extracted with MFCCs; as the size of the data bank was limited, an Auto-Encoder was utilized to further reduce the voice features; and models based on Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) and a Support Vector Machine (SVM) were constructed to predict satisfaction. With nested cross validation, the average accuracies of the LSTM and SVM reached 71.95% and 73.97%, respectively.

INDEX TERMS Customer satisfaction, LSTM-RNNs, MFCCs, SVM.

I. INTRODUCTION

Customer satisfaction is crucial for enterprise growth. Satisfaction strongly affects customer repurchase behavior [1], [2]. Oliver [3] proposed the Expectancy Confirmation Theory (ECT), which states that consumers evaluate the performance of products against prior expectations and that the discrepancy between expectations and reality determines the degree of customer satisfaction. Garbarino and Johnson [4] stated that customer decision-making behavior is governed by two mental constructs, namely, perceived service quality and customer satisfaction with the service or product being purchased. Satisfied customers are typically those who consider that their expectations are fulfilled. Therefore, evaluating customer satisfaction with each transaction is essential for enterprises to not only improve their service quality but also gain a clearer understanding of customer expectations [5].

Traditionally, customer satisfaction is measured through post-purchase questionnaire surveys or interviews [4], [6], [7]. In the telemarketing industry, surveys based on pre-prepared questionnaires are performed through telephone interactions between operators and customers [8]. However, several limitations constrain the effective application of telephone surveys. First, most customers are reluctant to participate in a telephone survey because of the time required for participation. Second, the cost of conducting telephone surveys is high because of the additional tasks imposed on call center agents conducting the surveys. Moreover, customers who choose to participate in such surveys tend to have extreme opinions. Customers with mild satisfaction or dissatisfaction usually skip these types of surveys [1]. Therefore, developing a tool for directly evaluating customer satisfaction without the need for post-purchase questionnaire surveys is invaluable.

Several studies have adopted text mining and natural language recognition techniques for measuring customer satisfaction. For example, Kang and Park [9] measured the levels of customer satisfaction with mobile services by using text mining to compile sentiment words from customer reviews and then conducting sentiment analysis to calculate sentiment scores.


Park [10] employed text mining and statistical analysis to calculate sentiment scores for online customer reviews in order to measure customer satisfaction. Automatic Speech Recognition (ASR) is a technique that is primarily used for transcribing speech into spoken words, and the texts corresponding to these spoken words can then be analyzed through various analytical models to measure customer satisfaction [11]–[14].

However, text alone might be unable to reflect the satisfaction level of customers. In particular, for reasons of politeness or avoiding awkwardness, East Asian customers might utter vague words that only somewhat suggest satisfaction. For example, Tanaka et al. [15] conducted an experiment that involved an emotional Stroop-like task in which participants were required to identify the emotions of facial expressions and ignore the corresponding voice (or vice versa). The experimental results indicated that Japanese people are more conciliatory than Dutch people when expressing their feelings. In general, Japanese people hesitate to express their true thoughts and tend to utter ambiguous answers when pressed. Liu et al. [16] confirmed that compared with the comprehension of English, that of Mandarin is more complicated and relies more on tone of voice. Furthermore, a Mandarin speaker is likely to focus mainly on the tone of someone's voice instead of their facial or verbal expression alone to understand their true feeling [17]. Banse and Scherer [18] posited that spoken language is embedded with a variety of nonverbal means to express emotions. As indicated by Bostanov and Kotchoubey [19], tone of voice is the most essential prosodic cue for emotional recognition. Thus, without the recognition of tone of voice, spoken words alone cannot precisely represent customer responses.

To capture the level of customer satisfaction without relying on post-consumption questionnaire surveys or speech recognition, this study strove to collect customer voices with different levels of satisfaction and build a model to distinguish their satisfaction. The aim of the present study was fivefold: design an experiment to collect voices corresponding to different levels of satisfaction; verify the ground truth of the voice data, whose satisfaction was measured with questionnaire surveys adapted from [20] and [21]; implement Mel-Frequency Cepstral Coefficients (MFCCs) to extract voice features that were transformed into quantified vectors; apply an Auto-Encoder (AE) to reduce attributes, because the number of manually collected voice samples was limited; and devise two classifiers based on Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) and a Support Vector Machine (SVM), as both have shown good performance in analyzing sequential ratio data. Last but not least, this research also compared their classification performances under different settings, namely, with and without the Auto-Encoder and with different network settings and hyperparameters.

The research results contribute to the literature on measuring customer satisfaction with voices in several respects. A process to collect customer satisfaction voices with verifiable ground truth was designed. A data bank of 150 customer voices speaking in Mandarin was collected. The process, along with feasible hyperparameters of MFCCs to extract the features of these voices, is shown. As the size of the data bank was limited, the AE proved to be beneficial in this study. Different hyperparameters of the LSTM and SVM were evaluated, and the best results are shown for future reference. To balance the data limitation against hyperparameter selection, the accuracy of the proposed models was verified with nested cross validation. The best average accuracies of the LSTM-RNNs and SVM were 71.95% and 73.97%, respectively. That the SVM outperformed the LSTM-RNNs may be due to its simplicity, as the data size was limited in this study.

The remainder of this paper is organized as follows. Section 2 reviews the importance of customer satisfaction and prior research that used speech recognition techniques to judge whether customers were satisfied, and describes the popularity of the MFCC process for extracting voice features. Section 3 describes the process used to solicit customer voices, the tool used to measure the ground truth of the collected data, the MFCC process and hyperparameters used to extract voice features, and the mathematical models and pseudocode of the LSTM-RNNs, AE, and SVM used in this study. Section 4 describes the collected voice data, the verification of the data validity, the clustering of the data into three levels of satisfaction, the process used to decide the AE hyperparameters, the nested cross validation used to decide the hyperparameters of the LSTM-RNNs and SVM, and a comparison of the accuracies of the models under various settings. Section 5 summarizes the contributions of this study, discusses implications of the experimental results, and suggests further research directions.

II. RELATED WORKS

A. SATISFACTION
According to ECT, satisfaction is an emotional attitude or a psychological state that results from the evaluative judgment of the expectation for and performance of a product or service being purchased. Oliver [3] posited that expectation is the baseline or reference level for a customer to evaluate a product or service. Ando et al. [5] demonstrated that the degree of customer satisfaction indicates the service quality of enterprises and their performance in customer engagement. Satisfaction is the cognitive assessment of expectation versus actual performance, and satisfaction is achieved when the level of perceived performance is higher than the initial level of expectation.

Several studies have evaluated customer satisfaction to predict repurchase intention [20], [22]–[24]. Moreover, studies have revealed that customer satisfaction is an essential factor for retaining long-term loyal customers and influencing repurchase behavior [1]–[3], [6], [25]. Customer repurchases indicate that customers' expectations are fulfilled.


Kincade et al. [7] posited that customer demand leads to purchase behavior and that repurchase behavior is driven by customer satisfaction. Furthermore, customer satisfaction explicitly leads to reduced operating costs and generates both customer referrals and positive word-of-mouth, which increase revenues and profitability [26]–[29]. Therefore, an enterprise must measure the level of customer satisfaction to improve its internal management and maintain valuable relationships with customers [1].

Traditionally, customer satisfaction is measured through post-consumption questionnaire surveys [7], [20], [21]. Text mining has been used to analyze sentiment words from customer reviews for measuring the level of customer satisfaction [9], [10]. Additionally, speech recognition techniques have been adopted to measure customer satisfaction [11]–[14]. Sun et al. [13] used information fusion to analyze customer satisfaction. In telemarketing, communication between call agents and customers is conducted through voice calls; a telephone operator administers a questionnaire survey to customers through phone calls. However, a telephone survey has some limitations, such as customer reluctance and cost constraints. Therefore, a tool must be developed for determining customer satisfaction by using acoustic data without conducting a survey.

In the present study, a discriminative model was developed for determining customer satisfaction. The study conducted an experiment to collect voice samples for analysis in the model. This model was developed to verify whether voice data can be used to distinguish the degree of customer satisfaction.

B. FEATURE EXTRACTION TECHNIQUES
Because people's vocal tracts have unique physiological structures, the frequency spectrum of speech signals can be used for various speech applications, including speech and speaker recognition [30]–[35]. For effective speech recognition, spectral feature extraction is used to transform raw speech into compressed signals. Effective feature extraction techniques include those based on MFCCs, Linear Prediction Coefficients (LPCs), and Linear Prediction Cepstral Coefficients (LPCCs). MFCCs have the advantages of reducing error rates, improving efficiency, and increasing accuracy [30]–[33], [35]. Singh et al. [36] reported that MFCC-based feature extraction methods achieve superior performance to methods based on prosodic features alone because MFCCs can be used to accurately discriminate signal characteristics. Tiwari [32] demonstrated that combining MFCCs with Vector Quantization (VQ) is suitable for feature extraction for speaker identification. Singh and Rajan [34] concluded that voice samples used for model training and testing should be collected under the same conditions; they also determined that noise is the main factor affecting the accuracy of MFCC-VQ based methods for speaker recognition. Martinez et al. [33] observed that feature extraction using MFCCs and VQ achieved higher accuracy than that achieved using LPCCs. Currently, MFCC-based feature extraction is the most widely used technique for speaker recognition [36].

Voice features extracted using MFCCs are short-term features that reflect the frequency spectrum of a speech signal. Prosodic features with rhythmic properties are long-term features that reflect a voice's fundamental frequency. Accordingly, extracting specific types of prosodic features, including pitch, intensity, and duration, can enhance acoustic applications [37]–[41]. The extraction of prosodic features is a relatively simple and effective process in speech and speaker recognition tasks. Sönmez et al. [42] developed a piecewise linear model for capturing stylized pitch contours to represent an individual's speaking style in order to achieve speaker verification. Pitch features reflect the rising-falling patterns of signals and enable robust and accurate recognition performance [38]. Sinith et al. [41] fed different combinations of MFCC-based features and prosodic features to an SVM classifier for speech emotion recognition. Their results indicated that the combination of MFCC-based features, pitch, and intensity engendered the highest accuracy in speech emotion recognition.

Accordingly, the present study collected quantified feature vectors that comprised voice features extracted using MFCCs and prosodic features, namely voice pitch and voice intensity, extracted using Praat, and fed them to the constructed LSTM-RNN model for measuring customer satisfaction.

III. METHODOLOGY

A. RESEARCH DESIGN
The experiment conducted in this study was based on those conducted in [43] and [44]. The experiment was designed to capture participants' confirmation of expected utility and utterances conveying their level of satisfaction simultaneously. Two sets of data were collected in the experiment: the first set comprised customer responses to a questionnaire survey adapted from the relevant literature, and the second set comprised audio clips indicating the same participants' satisfaction level in Mandarin. The questionnaire data were validated through statistical analysis and utilized for supervised learning to classify the audio data. Voice data collection was conducted in a soundproof laboratory to avoid noise interference. Through a data preprocessing process, features extracted from the voice data were transformed into quantified vectors that comprised MFCCs, logarithmic energy, voice pitch, and voice intensity. These vectors were then fed to the constructed LSTM-RNN model for predicting customer satisfaction. This model was trained and evaluated using a five-fold nested cross-validation method. Fig. 1 shows the architecture of the research framework.

FIGURE 1. Architecture of the research framework.
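The five-fold nested cross-validation used to train and evaluate the models can be summarized with the following minimal scikit-learn sketch; scikit-learn, the SVC estimator, and the parameter grid shown here are illustrative assumptions rather than the exact toolchain reported in this paper.

# Sketch of five-fold nested cross-validation: hyperparameters are tuned on the
# inner folds (grid search) and accuracy is estimated on the held-out outer folds.
# The SVM estimator and grid below are placeholders, not the paper's exact settings.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def nested_cv_accuracy(X, y, random_state=0):
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state + 1)
    param_grid = {"kernel": ["poly", "rbf", "sigmoid"], "C": list(range(1, 11))}
    search = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring="accuracy")
    scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
    return scores.mean(), scores.max()   # average and best outer-fold accuracy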


B. DATA COLLECTION PROTOCOL
The design of the conducted experiment was based on ECT [3], which states that satisfaction is the discrepancy between a consumer's expectation and the actual utility of a product. Therefore, satisfaction is determined by two factors: expectation (i.e., pre-consumption) and confirmation (i.e., post-consumption or experience). To collect information related to expectation, prior to being served, the participants were requested to indicate their order of preference for the following beverages: coffee, milk tea, green tea, fruit and vegetable juice, and Chinese bitter tea. Subsequently, approximately one-third of the participants were each served their most preferred, moderately preferred, and least preferred beverage; each participant was served one drink.

The data collection process involved the following steps:
1) Before drinking a beverage, the participants recorded a voice file by uttering a short statement in Mandarin, which is presented in Table 1. This statement translates to "I am satisfied with this item." This step helped familiarize the participants with the prompted utterance.
2) The participants were then invited into the recording room and served a beverage. The beverage was poured into a dark glass before the participants entered the room to ensure that they had no prior knowledge of what beverage would be served to them.
3) Immediately after drinking the provided beverage in the blinded test, the participants uttered the statement presented in Table 1, which was then recorded. This step was conducted to determine the participants' degree of satisfaction with the provided beverage.

TABLE 1
SATISFACTION STATEMENT IN MANDARIN
Please utter the following statement: 我對於該產品滿意

The post-consumption questionnaire used in this study was adapted from those presented in [20] and [21], which were based on the work of Oliver [3]. The level of satisfaction was measured using a 5-point Likert scale ranging from 0 ("Very Dissatisfied") to 4 ("Very Satisfied"). The devised questionnaire is presented in Table 2.

TABLE 2
SCALE ITEMS RELATED TO SATISFACTION IN THE POST-STUDY QUESTIONNAIRE
How do you feel after drinking the beverage?
S1 | I feel ___ (highly unacceptable, unacceptable, neutral, acceptable, or highly acceptable)
S2 | I feel ___ (very displeased, displeased, neutral, pleased, or very pleased)
S3 | I feel ___ (very dissatisfied, dissatisfied, neutral, satisfied, or very satisfied)
S4 | I feel ___ (absolutely terrible, terrible, neutral, delighted, or absolutely delighted)

Satisfaction Labeling: After verifying the validity of the satisfaction construct, this study used the factor loading and score obtained for each item to calculate the satisfaction level of each participant. The scores were discretized into three bins through the Visual Binning (VB) process in IBM SPSS. The three bins were marked as satisfaction, neutrality, and dissatisfaction. The participants and their corresponding audio files were labeled accordingly into the three bins.

C. MFCC PROCESSING
MFCC-based techniques have been widely used to extract crucial voice features [45]. The process of extracting voice features is shown in Fig. 2.

FIGURE 2. The process of voice feature extraction.

MFCC processing typically involves the following steps (an end-to-end code sketch is given at the end of this subsection):
1) Processing of the input signal: In the time domain, a signal waveform expresses the relationship between signal intensity and time. The dynamic signal x(t) is a continuous function of time. However, for conducting digital sampling and numerical calculations, x(t) must be converted into a discrete function x(n). The most common approaches for converting continuous analog signals into discrete signals are Pulse Amplitude Modulation (PAM) and Pulse Code Modulation (PCM).


PAM and PCM are used to sample the amplitude of an analog signal at equal intervals.
2) Framing: Because a voice signal changes with time, it must be partitioned into short frames to capture its acoustic characteristics. The size of a frame may be set to 20 to 40 ms (milliseconds) to avoid excessive changes between adjacent frames. According to the procedure described in [46], this study set the frame size to 25 ms. To avoid considerable changes between frames, consecutive frames can be set to overlap with each other. As in most relevant studies [30], [31], [33], [41], the overlap interval in this study was set to 10 ms.
3) Pre-emphasis: In the vocalization step, a high-pass filter can be used to compensate for the energy of high frequencies suppressed by the pronunciation system. The pre-emphasis formula can be expressed as follows:

y(n) = x(n) − α · x(n−1),   (1)

where x(n) is the voice signal of the n-th sampling point in a frame, y(n) is the output of the voice signal obtained after pre-emphasis, and α is the coefficient of the high-pass filter. In this study, α, which is usually a constant, was set to 0.97 in accordance with the suggestion of [46].
4) Hamming Windowing: After the framing step, a Hamming window function can be applied to each frame to reduce signal discontinuity and minimize the distortion in the spectra at the beginning and end of a frame:

ŷ(n) = y(n) · w(n),   (2)

w(n) = (1 − β) − β · cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1,   (3)

where ŷ(n) is the output signal obtained after Hamming windowing, w(n) is the Hamming window function, and β is the adjustment parameter for reducing the signal intensity at the beginning and end of a frame. The parameter β is usually set to approximately 0.46 [31], [47].
5) Fast Fourier Transform: This step utilizes the Discrete Fourier Transform (DFT) to transform each frame from the time domain to the frequency domain for effectively describing the signal characteristics. The DFT of a frame, for the k-th frequency among a total of K discrete frequencies, can be expressed as follows:

X(k) = Σ_{n=0}^{N−1} ŷ(n) · e^{−j2πkn/N},  k = 0, 1, …, K − 1,   (4)

where j is the imaginary unit. The periodogram-based power spectrum of each frame at frequency k can be expressed as follows:

P(k) = (1/N) · |X(k)|².   (5)

The characteristics of a signal can be obtained by observing the energy distribution of the spectrum at different frequencies.
6) Mel Filtering: Although P(k) can be derived through calculation, not all frequency values affect human hearing in the same manner. Moreover, frequencies related to human hearing should be emphasized. Therefore, Mel filter banks can be used to enhance and reduce the weights of P(k) in particular frequency intervals. A Mel filter bank contains a set of band-pass filters defined on a constant Mel-scale interval. As displayed in Fig. 3, Mel filters overlap with adjacent filters. Consequently, if the number of intervals is M, M + 2 border points exist in the frequency domain.

Definition 1: Let f be a frequency (in Hz). The Mel frequency can be computed as follows:

Mel(f) = 2595 · log10(1 + f / 700).

Given M intervals, and letting f_low and f_high be the lower and upper bounds of the frequencies of interest, respectively, the Mel frequency of the h-th border point can be expressed as follows:

Mel(f_h) = Mel(f_low) + h · (Mel(f_high) − Mel(f_low)) / (M + 1),

where h = 0, 1, …, M + 1.

Definition 2: Let I(f_h) = 700 · (10^{Mel(f_h)/2595} − 1) be the inverse Mel frequency of the h-th border point.

Definition 3: Given the h-th Mel frequency filter operated in the interval [I(f_{h−1}), I(f_{h+1})], H_h(k) is the weight of the Mel scale at discrete frequency k, where h = 1, 2, …, M. The relevant formula can be expressed as follows:

H_h(k) = 0, if f(k) < I(f_{h−1}) or f(k) > I(f_{h+1});
H_h(k) = (f(k) − I(f_{h−1})) / (I(f_h) − I(f_{h−1})), if I(f_{h−1}) ≤ f(k) ≤ I(f_h);
H_h(k) = (I(f_{h+1}) − f(k)) / (I(f_{h+1}) − I(f_h)), if I(f_h) < f(k) ≤ I(f_{h+1});

where H_h(f) = 1 when f = I(f_h).

FIGURE 3. Mel filter banks.

When the entire signal of each frame passes through a Mel filter bank, this operation is used to imitate the human ear to achieve precise frequency identification, smooth the spectrum, and reduce the number of calculations. The logarithmic filter bank energy E_h is the output of the weighted sum of the power spectrum passing through the Mel filter bank.

Definition 4: The following equation can be obtained when the number of Mel filters is M and the number of discrete frequencies is K:

E_h = ln( Σ_{k=0}^{K−1} H_h(k) · P(k) ),  h = 1, 2, …, M.

Through the Discrete Cosine Transform (DCT), the M log energy values obtained in the frequency domain can be converted, and the feature c_q can be extracted:

c_q = Σ_{h=1}^{M} E_h · cos( πq(h − 0.5) / M ),  q = 1, 2, …, Q,

where c_q is the q-th feature extracted using MFCCs and Q is the total number of extracted features for each examined frame. In practice, according to the procedures described in [33], [41], [46], and [48], the present study set Q and M to 12 and 26, respectively.
7) Delta Cepstrum Coefficient: A voice signal changes over time, similar to the slope of a formant at its transitions. Therefore, the trajectory of the MFCCs over time (i.e., the Delta cepstral coefficient) must be determined. The Delta coefficient is the first-order derivative of the MFCC feature vector over a time window. The addition of acceleration features such as Delta-Delta cepstral coefficients, which are obtained through double partial differentiation with respect to time, generally leads to superior classification performance [41].
8) Logarithmic Energy: The energy of each frame is a crucial feature that represents the variation in amplitude and provides acoustic information [33], [41], [48].

Definition 5: Given N sampling points in a Hamming window, the logarithmic energy of each frame can be expressed as follows:

E = ln( Σ_{n=0}^{N−1} ŷ(n)² ).

9) Quantified Feature Vector: This study combined the 13 features derived through MFCC processing with prosodic features of voice signals, which are widely used for speaker recognition [32], [36], [38], [48], [49]. The prosodic features, namely voice pitch contours and intensity, were extracted using Praat (version 6.0). Thus, the quantified feature vector for each frame contained 15 dimensions. When the first- and second-order derivatives over time windows were considered, the vector had a total of 45 dimensions. The vectors were fed to the LSTM-RNNs for satisfaction classification. Because the outputs were categorical variables, the present study adopted one-hot encoding to denote satisfaction, neutrality, and dissatisfaction, labeled "001", "010", and "100", respectively. Table 3 shows the information of the feature vector and labeling.

TABLE 3
INFORMATION OF FEATURE VECTOR AND LABELING
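The following is a minimal sketch of the frame-level feature assembly described in steps 1)–9), based on the python_speech_features package cited in [45]. Interpreting the stated 10-ms overlap as a 15-ms frame step, the 2048-point FFT size, and the pitch/intensity arrays (assumed to have been exported from Praat and aligned to the frames) are assumptions; the function and variable names are illustrative.

# Sketch of the 45-dimensional frame features: 12 MFCCs + log energy + pitch +
# intensity (15 values per 25-ms frame), extended with Delta and Delta-Delta.
# Assumes a mono 16-bit recording and frame-aligned Praat pitch/intensity values.
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, delta

def frame_features(wav_path, pitch_per_frame, intensity_per_frame):
    rate, signal = wavfile.read(wav_path)          # e.g., 44.1 kHz, 16-bit mono
    cep = mfcc(signal, samplerate=rate,
               winlen=0.025,      # 25-ms frames
               winstep=0.015,     # 10-ms overlap between consecutive frames
               numcep=13,         # 12 cepstra plus the appended log energy
               nfilt=26,          # 26 Mel filters
               nfft=2048,         # FFT size covering a 25-ms frame at 44.1 kHz
               preemph=0.97,      # pre-emphasis coefficient
               appendEnergy=True)
    n = min(len(cep), len(pitch_per_frame), len(intensity_per_frame))
    static = np.column_stack([cep[:n],
                              pitch_per_frame[:n],
                              intensity_per_frame[:n]])   # 15 values per frame
    d1 = delta(static, 2)                                  # Delta coefficients
    d2 = delta(d1, 2)                                      # Delta-Delta coefficients
    return np.hstack([static, d1, d2])                     # 45 values per frame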


D. LSTM MODEL CONSTRUCTION
The 45-dimensional vectors along with the corresponding labels were fed into LSTM-RNNs to develop a discriminative model for determining the level of customer satisfaction. Table 4 shows the notation for the LSTM-RNN formulation. x_t^p is the array of cepstra of frame t for participant p, and X_t^p is x_t^p extended with the Delta and Delta-Delta time derivatives. To simplify the notation, the participant index p is omitted in the LSTM-RNN formulation.

TABLE 4
NOTATIONS FOR LSTM-RNN FORMULATION
Notation | Description
W_i, W_o, W_f | the weights from the input cepstrum to the input gate, output gate, and forget gate, respectively
U_i, U_o, U_f | the weights for the transmission of the values in the hidden units to the input gate, output gate, and forget gate, respectively
b_i, b_o, b_f, b_h | the bias values of the input gate, output gate, forget gate, and hidden layer, respectively
h_t | the output values of the hidden units for frame t
i_t, o_t, f_t | the values of the input gate, output gate, and forget gate, respectively, for frame t
s_t, c_t | the short-term memory and long-term memory, respectively, for frame t
W_h, U_h | the weights from the input cepstrums and previous output values to the hidden unit, respectively
b_s | the bias value of the short-term memory
W_y | the weights from the output values to the output nodes, which represent a categorical variable
L | the number of hidden layers
H | the values of the hidden layer
ŷ | the ratio (Softmax) of the output values

For each frame t, the gate values are obtained by passing the weighted inputs through the logistic function σ:

i_t = σ(W_i X_t + U_i h_{t−1} + b_i),
o_t = σ(W_o X_t + U_o h_{t−1} + b_o),
f_t = σ(W_f X_t + U_f h_{t−1} + b_f),

where σ(z) = 1 / (1 + e^{−z}). The short-term memory, long-term memory, and hidden output for frame t are

s_t = tanh(W_h X_t + U_h h_{t−1} + b_s),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ s_t,
h_t = o_t ⊙ tanh(c_t).

The output value, ŷ, is obtained by passing the hidden values delivered for classification through the output weights and the Softmax function as

ŷ = Softmax(W_y h_T),

where T is the last frame. Fig. 4 shows the complete pseudocode.

FIGURE 4. Pseudocode of the discriminative model.

E. SVM MODEL CONSTRUCTION
SVM is a robust machine learning algorithm that can reach good performance with relatively limited training data [41]. SVM is mostly used in pattern classification by finding the best hyperplane with the maximum margin to separate the data of one class from those of the other. Typically, by using a non-linear kernel function, the original feature vector can be transformed into a space in which the data can be linearly separated. Given a training dataset {(z_i, y_i)}, where y_i ∈ {−1, +1}, and a kernel mapping φ, the goal is to adjust the vector of the separating hyperplane, (w, b), to maximize the margin between the supporting vectors under constraints:

min_{w, b, ξ}  (1/2)‖w‖² + C Σ_i ξ_i

with the constraints that

y_i (w · φ(z_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,

where ξ_i is the error distance and C is the cost of error inflicted on the classifier. To classify a sample z, the decision function of prediction is as follows:

ŷ = sign(w · φ(z) + b),

where sign(·) is the function that judges a positive or negative argument by returning +1 and −1, respectively.

F. AUTO-ENCODER NEURAL NETWORKS
AE neural networks comprise three major components: an encoder is used to reduce the dimensionality of the input data and develop a compressed representation of the input data; a latent space holds the compressed representation; and a decoder reconstructs the input data from the compressed representation. The encoder and decoder consist of a series of neural layers containing decreasing and increasing numbers of nodes, respectively.
1) Processing of the Encoder:

e_u = g_u(W_u e_{u−1} + b_u),  u = 1, 2, …, U,

where g_u, b_u, and W_u are the activation function, bias vector, and input weight matrix of the u-th encoding layer, respectively; e_0 is the original input vector, and e_U is the latent representation.
2) Processing of the Decoder:

d_v = g'_v(W'_v d_{v−1} + b'_v),  v = 1, 2, …, U,

where g'_v, b'_v, and W'_v are the activation function, bias vector, and input weight matrix of the v-th decoding layer, respectively, d_0 = e_U, and U is the number of encoding layers.
The loss function is the mean square error between the original input e_0 and the reconstruction d_U.
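A minimal sketch of such an auto-encoder is given below, following the bi-node layer scheme detailed in Section IV; TensorFlow/Keras is an assumption, as the paper does not state which framework was used, and the 37-dimensional latent size is only an example.

# Sketch of a fully connected auto-encoder: encoder layers shrink by two nodes
# per layer from the 45-dimensional input, the decoder mirrors the encoder, and
# the network is trained to reconstruct its input with MSE loss.
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(input_dim=45, latent_dim=37, activation="relu"):
    encoder_dims = list(range(input_dim - 2, latent_dim - 1, -2))  # 43, 41, 39, 37
    inputs = keras.Input(shape=(input_dim,))
    x = inputs
    for d in encoder_dims:
        x = layers.Dense(d, activation=activation)(x)
    latent = x                                                     # compressed representation
    for d in reversed(encoder_dims[:-1]):
        x = layers.Dense(d, activation=activation)(x)              # mirrored decoder
    outputs = layers.Dense(input_dim, activation="linear")(x)      # reconstruction
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, latent)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# Typical use: fit the auto-encoder on the 45-dimensional feature vectors, then
# feed encoder.predict(X) (the latent features) to the LSTM-RNN or SVM classifier.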

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. DATA COLLECTION
This study comprised two major tasks: determining the participants' degree of satisfaction through a post-consumption questionnaire survey, and developing a discriminative model of satisfaction by feeding quantified feature vectors to LSTM-RNNs. Because a dataset containing acoustic files related to satisfaction is not publicly available, this study collected an acoustic dataset by using the protocol described in Section III.

The recording equipment used in the experiment was an external microphone (E-books S60) that can collect signals similar to those collected by mobile phones. The participants were instructed to make the statement "我對於該產品滿意" (meaning "I am satisfied with this item") after drinking the provided beverage. The voices of the participants were recorded as acoustic files in .wav format.

B. EXPERIMENTAL RESULTS
1) Statistical Analysis: The demographic information of the 150 participants is presented in Table 5. The research sample had a diverse age profile, with 44% of the participants being aged 20-39 years. Moreover, 59% and 76% of the sample comprised women and college graduates, respectively.

TABLE 5
DEMOGRAPHICS OF THE PARTICIPANTS (TOTAL = 150)
Categories | Numbers of participants | Percentage
Age (years): 19 and below | 18 | 12%
Age (years): 20-39 | 66 | 44%
Age (years): 40-64 | 45 | 30%
Age (years): 65 and above | 21 | 14%
Gender: Female | 88 | 59%
Gender: Male | 62 | 41%
Education: High school | 24 | 16%
Education: College graduate | 114 | 76%
Education: Postgraduate | 12 | 8%

Table 6 presents the relationship between the level of satisfaction and the drink served. Green tea and Chinese bitter tea were the most and least preferred beverages, respectively. The number of different drinks served was balanced among the participants. Moreover, the numbers of satisfied, neutral, and dissatisfied participants were similar.

TABLE 6
SATISFACTION-RELATED INFORMATION FOR THE PROVIDED BEVERAGES
Beverage | Satisfaction | Neutrality | Dissatisfaction | Total
Coffee | 11 | 13 | 10 | 34
Milk tea | 9 | 12 | 9 | 30
Green tea | 15 | 12 | 4 | 31
Fruit and vegetable juice | 11 | 10 | 10 | 31
Chinese bitter tea | 5 | 5 | 14 | 24
Total | 51 | 52 | 47 | 150

After consuming the beverage, the participants were requested to fill out the adopted questionnaire surveys to indicate their level of satisfaction with the beverage served. The descriptive statistics of the questionnaire data are presented in Table 7. Four items related to satisfaction were evaluated on a 5-point Likert scale ranging from 0 ("very dissatisfied") to 4 ("very satisfied"). The total number of responses for each item was 150. The mean scores for the four items were 2.31, 2.11, 2.03, and 2.08, demonstrating a tendency toward neutrality in the conducted survey.

TABLE 7
DESCRIPTIVE STATISTICS OF THE QUESTIONNAIRE DATA
Statistic | Item 1 | Item 2 | Item 3 | Item 4
Numbers of responses | 150 | 150 | 150 | 150
Minimum | 0 | 0 | 0 | 0
Maximum | 4 | 4 | 4 | 4
Mean | 2.31 | 2.11 | 2.03 | 2.08
Median | 2.50 | 2.00 | 2.00 | 2.00
Mode | 3.00 | 2.00 | 2.00 | 2.00
S.D. | 1.12 | 0.95 | 1.01 | 1.05

The factor loadings of the four items are presented in Table 8, indicating that they are above the threshold of 0.6. The Cronbach's Alpha (CA), Average Variance Extracted (AVE), and Composite Reliability (CR) of the satisfaction construct were 0.950, 0.600, and 0.857, respectively, which are above the relevant thresholds of 0.7, 0.5, and 0.7, respectively [50], [51]. These results confirm the reliability and validity of the adopted questionnaire; thus, the survey results could be used for accurately identifying the level of satisfaction in the collected acoustic files.

TABLE 8
VALIDITY ANALYSIS
Construct: Satisfaction (AVE (0.5) = 0.600, CR (0.7) = 0.857)
Items | Factor loadings
S01 | 0.822
S02 | 0.805
S03 | 0.764
S04 | 0.701

The satisfaction vector obtained from the survey data for each participant comprised four features. The inner product of this vector with the factor loadings represented each participant's level of satisfaction. The VB process in SPSS (version 18) was used to automatically cluster the participants into three bins, and the corresponding result is displayed in Fig. 5, where the X-axis represents the degree of satisfaction and the Y-axis represents the number of participants. The cut points were set to 6.184 and 7.811. The numbers of participants clustered into satisfaction, neutrality, and dissatisfaction were 51, 52, and 47, respectively.

FIGURE 5. Clusters of participant satisfaction.
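The scoring and binning step can be illustrated with the short NumPy sketch below; the factor loadings and cut points are taken from Table 8 and Fig. 5, while the function name, the assumed equivalence with the SPSS VB output, and the example response are illustrative assumptions.

# Sketch of the satisfaction labeling step: each participant's four item scores
# (0-4 Likert) are combined as an inner product with the factor loadings from
# Table 8, then discretized into three classes with the cut points shown in Fig. 5.
import numpy as np

FACTOR_LOADINGS = np.array([0.822, 0.805, 0.764, 0.701])   # S01-S04, Table 8
CUT_POINTS = (6.184, 7.811)                                  # from the VB process, Fig. 5
LABELS = ("dissatisfaction", "neutrality", "satisfaction")

def satisfaction_label(item_scores):
    """item_scores: length-4 array of 0-4 Likert responses for one participant."""
    score = float(np.dot(item_scores, FACTOR_LOADINGS))      # weighted satisfaction score
    bin_index = int(np.digitize(score, CUT_POINTS))          # 0, 1, or 2
    return score, LABELS[bin_index]

# Example: a participant answering (3, 3, 2, 3) gets a score of about 8.51,
# which falls above 7.811 and is therefore labeled "satisfaction".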


2) Acoustic Extraction: The voice of each participant was recorded for approximately 2.5 s and saved in .wav format. The recorded voice files were then cleaned using Audacity 2.3, an open-source software program; its noise reduction function was used to eliminate environmental noise. The voices were divided into multiple 25-ms frames, and 12 features were extracted per frame. By combining the logarithmic energy as well as the voice intensity and voice pitch determined using Praat with the aforementioned 12 features, this study derived a 15-dimensional vector for each frame. Finally, by calculating the Delta and Delta-Delta cepstral coefficients between frames, the study derived 45-dimensional feature vectors. These vectors were fed to the developed discriminative model for detecting the level of satisfaction. The parameter values adopted for feature extraction are presented in Table 9.

TABLE 9
PARAMETERS USED FOR FEATURE EXTRACTION
Items | Parameter setting
Frame size | 25 ms
Frame overlap interval | 10 ms
Sampling rate | 44.1 kHz
Bit depth | 16
Sampling points, N | 1100
Pre-emphasis coefficient | 0.97
Hamming window parameter | 0.46
Number of Mel-scale filter banks | 26
Number of extracted features | 12

3) Feature Reconstruction with the Auto-Encoder: As the number of voice clips was 150 and the feature vectors had 45 dimensions, this study utilized the Auto-Encoder (AE) to further reduce the dimensionality and reconstruct the feature vectors. To decide the optimal number of hidden layers deployed in the AE neural networks, a bi-node encoding and decoding approach was adopted: the encoding side started with a layer of 43 hidden nodes, followed by a series of layers, each reducing two nodes from the previous layer, until the desired number of features was reached. The decoding side mirrored the encoding, with each layer adding two nodes. The loss function adopted was the mean square error (MSE).

The ranges of the hyperparameters considered are presented in Table 10. Five-fold cross validation was adopted to decide which hyperparameters were the most feasible for each AE configuration. The hyperparameters with the best average performance were selected to reconstruct the original feature vectors. Table 11 summarizes the AE configurations with the best performance. The feature vectors extracted from the most dense layer of the AE were fed to both the SVM and LSTM-RNN classifiers to develop the discriminative model.

TABLE 10
THE RANGES OF HYPERPARAMETERS USED FOR THE AUTO-ENCODER NEURAL NETWORKS
Hyperparameters | Range
Hidden layers in the encoder | 1, 2, 3, 4, 5
Dimensions of the hidden layers | 43, 41, 39, 37, 35
Number of epochs | 1 to 500
Batch size | 10
Activation function | Sigmoid, ReLU, ELU
Optimizer | Adam
Loss function | Mean square error
Objective function | Minimize (loss)

TABLE 11
THE SUMMARIES OF THE AE CONFIGURATIONS

4) Classification: To compare and contrast the performance, the original 45-dimensional feature vectors and the reconstructed feature vectors were fed into the LSTM-RNNs and SVM independently to develop the discriminative models. Each model was established and verified by five-fold nested cross-validation to tune the hyperparameters, train the model parameters, and evaluate the accuracy of the developed models.

For the LSTM-RNNs, categorical cross-entropy and Softmax were selected as the loss function and the activation function of the output layer, respectively. A grid search was conducted to determine the values of the hyperparameters among the given ranges, which are presented in Table 12. Table 13 shows the determined hyperparameters for the model with the original 45-dimensional feature vectors as the input. The optimized network contained only one hidden layer with 30 cells.

TABLE 12
THE RANGES OF HYPERPARAMETERS USED FOR MODEL OPTIMIZATION
Hyperparameters | Range
Hidden layers | 1, 2, 3
Number of cells | 25, 30, 35, 40
Learning rate | 0.001, 0.01, 0.1
Number of epochs | 1 to 50
Activation function | Sigmoid, ReLU, Tanh
Optimizer | Adam, SGD, RMSProp

TABLE 13
LSTM-RNN PARAMETERS FOR 45 DIMENSIONS
Parameter | Setting
Hidden layers | 1
Number of cells | 30
Learning rate | 0.001
Input variables | 45
Output variables | 3
Batch | All input data
Epochs | 10
Activation function | Sigmoid
Output activation function | Softmax
Optimizer | Adam
Loss function | Cross-entropy
Objective function | Minimize (loss)

The input data are fed sequentially into the LSTM-RNN model. The output data are accumulated in two ways. In the first one, the input is combined with the hidden value of the previous step to calculate the current state, and the hidden values of the last step, which contain the summarized information, are delivered for classification. In the second way, to preserve the information of all states, a flatten layer is utilized to collect the hidden value of each step for classification.
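A minimal Keras sketch of these two output arrangements is given below; TensorFlow/Keras is an assumption (the paper does not name its framework), and the cell count, activation, and optimizer merely echo Tables 12 and 13 for illustration.

# Sketch of the two LSTM output arrangements described above. n_frames is the
# number of frame vectors per utterance and n_features is 45 (or the AE latent size).
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(n_frames, n_features, cells=30, use_flatten=False, optimizer="adam"):
    inputs = keras.Input(shape=(n_frames, n_features))
    if use_flatten:
        # Second arrangement: keep the hidden value of every step, then flatten.
        x = layers.LSTM(cells, activation="sigmoid", return_sequences=True)(inputs)
        x = layers.Flatten()(x)
        x = layers.Dense(cells, activation="sigmoid")(x)   # hidden layer for further feature reduction
    else:
        # First arrangement: only the hidden values of the last step are classified.
        x = layers.LSTM(cells, activation="sigmoid")(inputs)
    outputs = layers.Dense(3, activation="softmax")(x)      # satisfaction / neutrality / dissatisfaction
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    return model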


In this study, a hidden layer was affixed to the flatten layer for further feature reduction. The experimental results are shown in Table 14.

TABLE 14
THE RESULTS OVERVIEW OF LSTM-RNNS WITH DIFFERENT AE CONFIGURATIONS

At first, the model without the flatten layer seemed to outperform the model with the flatten layer. However, with the enhancement of the AE, the models with flatten layers gradually improved and performed best with 37-dimensional latent features as the input, for which the highest and average prediction accuracies were 75.86% and 71.95%, respectively. The hyperparameters of the best model are shown in Table 15, and the confusion matrix is illustrated in Table 16.

TABLE 15
LSTM-RNN HYPERPARAMETERS OF THE BEST PERFORMANCE MODEL
Parameter | Setting
Hidden layers | 1
Number of cells | 30
Learning rate | 0.1
Input variables | 37
Output variables | 3
Batch | All input data
Epochs | 10
Activation function | Sigmoid
Output activation function | Softmax
Optimizer | RMSProp
Loss function | Cross-entropy
Objective function | Minimize (loss)

TABLE 16
CONFUSION MATRIX OF LSTM-RNNS WITH THE BEST PERFORMANCE
 | Satisfaction (001) | Neutrality (010) | Dissatisfaction (100)
Predicted satisfaction | 7 | 2 | 1
Predicted neutrality | 2 | 7 | 1
Predicted dissatisfaction | 1 | 0 | 8

An SVM, which is a supervised machine learning classifier, is among the most commonly used models in emotion classification [41], [48], [52]. Accordingly, this study compared the classification performance of the SVM models with that of the LSTM-RNN models. Three SVM hyperparameters were tuned through nested cross validation, namely, Kernel, Cost, and Gamma. The ranges of these hyperparameters are presented in Table 17. The evaluation results of the SVM with the original and the reconstructed feature vectors as input are summarized in Table 18.

TABLE 17
RANGES OF THE HYPERPARAMETERS OF SVM USED FOR MODEL OPTIMIZATION
Hyperparameters | Range
Kernel | Polynomial, RBF, Sigmoid
Cost | 1 to 10
Gamma | Scale, 0 to 1

TABLE 18
THE EVALUATION RESULTS OF THE SVM WITH DIFFERENT AE CONFIGURATIONS

Surprisingly, with the 37-dimensional latent features, the SVM with a Radial Basis Function (RBF) kernel and with Cost and Gamma set to 7 and 0.0991, respectively, reached the best and average accuracies of 79.31% and 73.97%, respectively, which were higher than those of the corresponding LSTM-RNN model. The average accuracy of the SVM models with the reconstructed feature vectors was significantly improved from 53.79% to 73.97%. Tables 18 and 19 show the performance under the various parameters and the confusion matrix of the best model.

TABLE 19
CONFUSION MATRIX OF SVM WITH THE BEST PERFORMANCE
 | Satisfaction (001) | Neutrality (010) | Dissatisfaction (100)
Predicted satisfaction | 5 | 3 | 2
Predicted neutrality | 0 | 9 | 1
Predicted dissatisfaction | 0 | 0 | 9

The AE neural networks reduced the high dimensionality of the feature vectors and improved the accuracy of both the LSTM-RNN and SVM models. Interestingly, in this study, the SVM models consistently outperformed the LSTM-RNN models when applying the AE, as shown in Fig. 6. The result revealed that when the amount of data is limited, SVM may outperform LSTM-RNNs due to its simplicity.

FIGURE 6. The Performance Comparison of LSTM-RNNs and SVM with different AE Configurations.
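For reference, the best-performing SVM configuration reported above corresponds roughly to the following call; scikit-learn is an assumption, as the paper does not state which SVM implementation was used, and X_latent and y are placeholders for the 37-dimensional AE features and the class labels.

# Sketch of the best-performing SVM configuration (RBF kernel, Cost = 7,
# Gamma = 0.0991), assuming scikit-learn.
from sklearn.svm import SVC

best_svm = SVC(kernel="rbf", C=7, gamma=0.0991)
# best_svm.fit(X_latent_train, y_train)
# accuracy = best_svm.score(X_latent_test, y_test)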


V. CONCLUSION
Customer satisfaction drives customer decisions and strongly affects customer repurchase behavior. Therefore, determining customer satisfaction is essential for enterprises to gain a clearer understanding of customer expectations and improve their customer service quality. Traditionally, a post-purchase questionnaire survey is conducted to collect this information. However, such surveys require considerable effort from both customers and enterprises. Therefore, a tool that can conveniently capture customer satisfaction is highly desirable. However, before the effectiveness of such a tool can be verified, a sound data set has to be collected.

This study developed a discriminative model for detecting the level of customer satisfaction by analyzing the implicit information embedded in the tone of customers' voices. The study methodology involved three steps: feature extraction, feature reduction, and model construction. Features extracted through an MFCC-based technique were combined with prosodic features, namely voice pitch and voice intensity, to obtain 45-dimensional feature vectors. The AE was used to reduce the features for model construction. The reconstructed feature vectors were then fed to the SVM and LSTM-RNNs to devise the discriminative model.

The level of satisfaction was divided into three categories: satisfied, neutral, and dissatisfied. The developed SVM model combined with the AE achieved the highest accuracy of 79.31% in identifying the participants' satisfaction level; it outperformed the LSTM-RNN model with the AE, which reached an accuracy of 75.86%.

A. IMPLICATIONS
During the recording of voice tones, this study instructed all participants to read the same statement: "I am satisfied with this item." To extend the findings of this study, future studies could request participants to read different sentences to verify whether satisfaction levels can still be accurately identified. Furthermore, researchers could use real-world acoustic data to verify the feasibility of the approach used in this study.

The results of this study indicated that the MFCC-based technique can extract voice features for satisfaction recognition. Future research could perform experiments by using other feature extraction methodologies, such as LPC- and LPCC-based techniques, and compare the results obtained.

This study confirmed that SVM and LSTM-RNNs can be used to construct a model that can detect the level of customer satisfaction. The AE neural networks used to reconstruct the features fed into the SVM and LSTM-RNNs can significantly improve the predictive accuracy. However, because machine learning technology is advancing rapidly, new and improved methodologies for evaluating customer satisfaction can be explored in future research.

Telemarketing has the advantages of saving time and effort in the execution of marketing campaigns and the provision of customer support. The identification of customer satisfaction through acoustic analysis can facilitate the implementation of customer relationship management. According to the results of this study, instead of measuring customer satisfaction through a post-purchase survey, customer satisfaction may be determined using a voice recognition model during interactions between agents and customers. The use of such a model can enhance customer relationships, provide managerial insights to improve internal management, and even support new marketing campaigns if it can be integrated into the business processes of enterprises. The findings of this study pave the way for understanding customer preference and attitude through voice analysis.

B. LIMITATIONS AND SUGGESTIONS FOR FURTHER RESEARCH
This study has some limitations. First, the voice recordings used in this study were in Mandarin. Future studies can collect acoustic data in other languages to verify the feasibility of the proposed method. Second, because self-collected voice files were used in this study, the performance of the classification model might have been affected by the quality of the recorded voices, environmental noise, and variations in the participants' voices. Future studies can execute voice recordings in different environmental settings and examine whether the developed model is still accurate. Third, beverages were served free of charge in this study. Customers' satisfaction level might change when they must pay for products or services. Therefore, caution should be exercised when generalizing the results of this research to environments where services or products are provided for a fee.

REFERENCES
[1] N. Kamaruddin, A. W. A. Rahman, and A. N. R. Shah, "Measuring customer satisfaction through speech using valence-arousal approach," in Proc. 6th Int. Conf. Inf. Commun. Technol. Muslim World (ICT4M), Nov. 2016, pp. 298–303.
[2] Q. Llimona, J. Luque, X. Anguera, Z. Hidalgo, S. Park, and N. Oliver, "Effect of gender and call duration on customer satisfaction in call center big data," in Proc. 16th Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH), Sep. 2015, pp. 1825–1829.
[3] R. L. Oliver, "A cognitive model of the antecedents and consequences of satisfaction decisions," J. Mark. Res., vol. 17, no. 4, pp. 460–469, Nov. 1980.
[4] E. Garbarino and M. S. Johnson, "The different roles of satisfaction, trust, and commitment in customer relationships," J. Mark., vol. 63, no. 2, pp. 70–87, Apr. 1999.
[5] A. Ando, R. Masumura, H. Kamiyama, S. Kobashikawa, Y. Aono, and T. Toda, "Customer satisfaction estimation in contact center calls based on a hierarchical multi-task model," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 715–728, Jan. 2020.
[6] J. J. Cronin and S. A. Taylor, "Measuring service quality: A reexamination and extension," J. Mark., vol. 56, no. 3, pp. 55–68, Jul. 1992.
[7] D. H. Kincade, V. L. Giddings, and H. J. Chen-Yu, "Impact of product-specific variables on consumers' post-consumption behaviour for apparel products: USA," J. Consum. Stud. Home Econ., vol. 22, no. 2, pp. 81–90, Jun. 1998.
[8] M. Brennan, S. Benson, and Z. Kearns, "The effect of introductions on telephone survey participation rates," Int. J. Mark. Res., vol. 47, no. 1, pp. 65–74, Jan. 2005.


[9] D. Kang and Y. Park, "Review-based measurement of customer satisfaction in mobile service: Sentiment analysis and VIKOR approach," Expert Syst. Appl., vol. 41, no. 4, Part 1, pp. 1041–1050, Mar. 2014.
[10] J. Park, "Framework for sentiment-driven evaluation of customer satisfaction with cosmetics brands," IEEE Access, vol. 8, pp. 98526–98538, May 2020.
[11] S. Godbole and S. Roy, "Text classification, business intelligence, and interactivity: Automating c-sat analysis for services industry," in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2008, pp. 911–919.
[12] Y. Park and S. C. Gates, "Towards real-time measurement of customer satisfaction using automatically generated call transcripts," in Proc. 18th ACM Conf. Inf. Knowl. Manage., Nov. 2009, pp. 1387–1396.
[13] J. Sun et al., "Information fusion in automatic user satisfaction analysis in call center," in Proc. 8th Int. Conf. Intell. Hum. Mach. Syst. Cybern. (IHMSC), Aug. 2016, vol. 01, pp. 425–428.
[14] M. Płaza and Ł. Pawlik, "Influence of the contact center systems development on key performance indicators," IEEE Access, vol. 9, pp. 44580–44591, Mar. 2021.
[15] A. Tanaka, A. Koizumi, H. Imai, S. Hiramatsu, E. Hiramoto, and B. de Gelder, "I feel your voice: Cultural differences in the multisensory perception of emotion," Psychol. Sci., vol. 21, no. 9, pp. 1259–1262, Aug. 2010.
[16] P. Liu, S. Rigoulot, and M. D. Pell, "Culture modulates the brain response to human expressions of emotion: Electrophysiological evidence," Neuropsychologia, vol. 67, pp. 1–13, Jan. 2015.
[17] "Cross-cultural communication: Much more than just a linguistic stretch," ScienceDaily, Feb. 24, 2015. [Online]. Available: https://www.sciencedaily.com/releases/2015/02/150224102843.htm. [Accessed: Apr. 22, 2020].
[18] R. Banse and K. R. Scherer, "Acoustic profiles in vocal emotion expression," J. Pers. Soc. Psychol., vol. 70, no. 3, pp. 614–636, Apr. 1996.
[19] V. Bostanov and B. Kotchoubey, "Recognition of affective prosody: Continuous wavelet measures of event-related brain potentials to emotional exclamations," Psychophysiology, vol. 41, no. 2, pp. 259–268, Mar. 2004.
[20] A. Bhattacherjee, "Understanding information systems continuance: An expectation-confirmation model," MIS Q., vol. 25, no. 3, pp. 351–370, Sep. 2001.
[21] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust. Speech Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
[22] E. W. Anderson and M. W. Sullivan, "The antecedents and consequences of customer satisfaction for firms," Mark. Sci., vol. 12, no. 2, pp. 125–143, May 1993.
[23] P. G. Patterson and L. W. Johnson, "Disconfirmation of expectations and the gap model of service quality: An integrated paradigm," J. Consum. Satisfaction Dissatisfaction Complaining Behav., vol. 6, no. 1, pp. 90–99, 1993.
[24] V. A. Zeithaml, L. L. Berry, and A. Parasuraman, "The behavioral consequences of service quality," J. Mark., vol. 60, no. 2, pp. 31–46, Apr. 1996.
[25] E. Karahanna, D. W. Straub, and N. L. Chervany, "Information technology adoption across time: A cross-sectional comparison of pre-adoption and post-adoption beliefs," MIS Q., vol. 23, no. 2, pp. 183–213, Jun. 1999.
[26] B. Edvardsson, M. D. Johnson, A. Gustafsson, and T. Strandvik, "The effects of satisfaction and loyalty on profits and growth: Products versus services," Total Qual. Manag., vol. 11, no. 7, pp. 917–927, Sep. 2000.
[27] Y.-K. Lee, C.-K. Lee, J. Choi, S.-M. Yoon, and R. J. Hart, "Tourism's role in urban regeneration: Examining the impact of environmental cues on emotion, satisfaction, loyalty, and support for Seoul's revitalized Cheonggyecheon stream district," J. Sustain. Tour., vol. 22, no. 5, pp. 726–749, Jan. 2014.
[28] C. Ranaweera and J. Prabhu, "The influence of satisfaction, trust and switching barriers on customer retention in a continuous purchasing setting," Int. J. Serv. Ind. Manag., vol. 14, no. 4, pp. 374–395, Oct. 2003.
[29] F. F. Reichheld, "The one number you need to grow," Harv. Bus. Rev., vol. 81, no. 12, pp. 46–55, Dec. 2003.
[30] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
[31] P. M. Chauhan and N. P. Desai, "Mel frequency cepstral coefficients (MFCC) based speaker identification in noisy environment using wiener filter," in Proc. Int. Conf. Green Comput. Commun. Elect. Eng. (ICGCCEE), Mar. 2014, pp. 1–5.
[32] V. Tiwari, "MFCC and its applications in speaker recognition," Int. J. Emerg. Tech., vol. 1, no. 1, pp. 19–22, Feb. 2010.
[33] J. Martinez, H. Perez, E. Escamilla, and M. M. Suzuki, "Speaker recognition using Mel frequency cepstral coefficients (MFCC) and vector quantization (VQ) techniques," in Proc. 22nd Int. Conf. Elect. Commun. Computers (CONIELECOMP), Feb. 2012, pp. 248–251.
[34] S. Singh and E. G. Rajan, "MFCC VQ based speaker recognition and its accuracy affecting factors," Int. J. Comput. Appl., vol. 21, no. 6, pp. 1–6, May 2011.
[35] L. Wang, K. Minami, K. Yamamoto, and S. Nakagawa, "Speaker identification by combining MFCC and phase information in noisy environments," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Mar. 2010, pp. 4502–4505.
[36] N. Singh, R. A. Khan, and R. Shree, "MFCC and prosodic feature extraction techniques: A comparative study," Int. J. Comput. Appl., vol. 54, no. 1, pp. 9–13, Sep. 2012.
[37] P. P. Dahake, K. Shaw, and P. Malathi, "Speaker dependent speech emotion recognition using MFCC and support vector machine," in Proc. Int. Conf. Autom. Control Dyn. Optim. Techn. (ICACDOT), Sep. 2016, pp. 1080–1084.
[38] G. Friedland, O. Vinyals, Y. Huang, and C. Muller, "Prosodic and other long-term features for speaker diarization," IEEE Trans. Audio Speech Lang. Process., vol. 17, no. 5, pp. 985–993, Jul. 2009.
[39] I. Luengo, E. Navas, I. Hernáez, and J. Sánchez, "Automatic emotion recognition using prosodic parameters," in Proc. 9th Eur. Conf. Speech Commun. Technol. (INTERSPEECH), 2005, pp. 493–496.
[40] R. W. M. Ng, T. Lee, C. Leung, B. Ma, and H. Li, "Analysis and selection of prosodic features for language identification," in Proc. Int. Conf. Asian Lang. Process., Dec. 2009, pp. 123–128.
[41] M. S. Sinith, E. Aswathi, T. M. Deepa, C. P. Shameema, and S. Rajan, "Emotion recognition from audio signals using support vector machine," in Proc. IEEE Recent Advances Intell. Comput. Syst. (RAICS), Dec. 2015, pp. 139–144.
[42] K. Sönmez, E. Shriberg, L. Heck, and M. Weintraub, "Modeling dynamic prosodic variation for speaker verification," in Proc. 5th Int. Conf. Spoken Lang. Process. (ICSLP), Nov. 1998, pp. 3189–3192.
[43] P. Rosso, L. Hurtado, E. Segarra, and E. Sanchis, "On the voice-activated question answering," IEEE Trans. Syst. Man Cybern. C, Appl. Rev., vol. 42, no. 1, pp. 75–85, Jan. 2012.
[44] X. Sun, Z. Pei, C. Zhang, G. Li, and J. Tao, "Design and analysis of a human-machine interaction system for researching human's dynamic emotion," IEEE Trans. Syst. Man Cybern. Syst., vol. 51, no. 10, pp. 6111–6121, Oct. 2021.
[45] "Welcome to python_speech_features's documentation!," python_speech_features 0.1.0 documentation. [Online]. Available: https://python-speech-features.readthedocs.io/en/latest/. [Accessed: May 20, 2017].
[46] F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proc. IEEE, vol. 66, no. 1, pp. 51–83, Jan. 1978.
[47] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," Proc. Inst. Phonetic Sciences, vol. 17, pp. 97–110, Mar. 1993.
[48] M. Pakyurek, M. Atmis, S. Kulac, and U. Uludag, "Extraction of novel features based on histograms of MFCCs used in emotion classification from generated original speech dataset," Elektron. ir Elektrotechnika, vol. 26, no. 1, pp. 46–51, Feb. 2020.

Yen Huei Ko is a Ph.D. candidate in the Department of Business Administration of National Central University, Jhongli District, Taoyuan City, Taiwan. He received the bachelor's degree from the Department of Chemical Engineering at National Cheng Kung University in 1987, the master's degree from the Department of Chemical Engineering at the University of California, Davis, in 1990, and the Master of Business Administration degree from the University of Southern California in 1993. His current research focuses on utilizing machine learning for data mining, business intelligence, and data modeling in business domains.

Dr. Ping-Yu Hsu graduated from the CSIE Department of National Taiwan University in 1987, received his master's degree from the Computer Science Department of New York University in 1991, and received his Ph.D. degree from the Computer Science Department of UCLA in 1995. He is a distinguished professor in the Department of Business Administration of National Central University in Taiwan and the secretary-in-chief of the Chinese ERP Association. He is currently the Dean of the School of Management at National Central University. His research interests focus on business-data-related applications, including business analytics, data mining, business intelligence, and adoption issues of enterprise systems. He has published more than 100 journal and conference articles. His papers have been published in Decision Support Systems, European Journal of Information Systems, IEEE Transactions, Information Systems, Information Sciences, and various other journals.

Dr. Yu-Chin Liu received her bachelor's and master's degrees from the Institute of Information Management of National Chiao Tung University between 1988 and 1994, and her Ph.D. degree from the Department of Business Administration of National Central University in 2006. She is an associate professor in the Department of Information Management of Shih Hsin University in Taiwan. Her research focuses on business analytics and data mining applications in business domains. She has published several journal and conference articles; her work has appeared in Information Sciences, Expert Systems with Applications, and various other journals.

Po-Chiao Yang received his B.S. degree in Information Management from Tamkang University in 2017 and the MBA degree from National Central University in 2019. He currently works as an infrastructure consultant and solution architect. His research interests include data mining, business intelligence, artificial intelligence, signal analytics, and adoption issues of enterprise systems. From 2017 to 2019, he focused his research on building business models with artificial intelligence, mainly with neural networks.