
A Study of VoIP Quality Evaluation: User Perception of Voice Quality from G.729, G.711 and G.722

Therdpong Daengsi
Faculty of Information Technology
King Mongkut’s University of Technology North Bangkok
Bangkok, Thailand

Apiruck Preechayasomboon
TOT Innovation Institute
TOT Public Company Limited
Pathumthani, Thailand

Chai Wutiwiwatchai
Human Language Technology Laboratory
National Electronics and Computer Technology Center
Pathumthani, Thailand

Saowanit Sukparungsee
Dept. of Applied Statistics, Faculty of Applied Science
King Mongkut’s University of Technology North Bangkok
Bangkok, Thailand

Abstract-This paper presents new evidence about user perception of VoIP quality that is inconsistent with the general understanding of three codecs known as G.729, G.711 and G.722. The study focuses on VoIP quality evaluation by Thai users speaking Thai, which is a tonal language. It was conducted using conversation-opinion tests, and the results, called MOS-CQS, were then analyzed carefully. The analysis found that the perception of the subjects, who were native Thai speakers, of G.729, G.711 and G.722 is not significantly different.

Keywords-VoIP quality evaluation/measurement/assessment; subjective tests; conversation-opinion tests; G.729; G.711; G.722

I. INTRODUCTION

Nowadays, VoIP technology is used broadly, not only in organizations but also for personal use (e.g. Google Talk and Skype). However, compared to some traditional technologies such as ISDN, which is used in public switched telephone networks (PSTN), VoIP technology still has some limitations regarding voice quality. Therefore, many methods to evaluate the problem have been issued [1].

In the past, VoIP quality evaluation has been studied based on English, which is a non-tonal language. Even though some research is available for tonal languages such as Chinese [2-5], those works mainly studied improved objective evaluation methods only. In 2006-07, Thai sounds were included in small parts of studies of BV16 and BV32 for IP telephony [6-7]. Moreover, there was also some research on Asian languages with subjective evaluation methods, but it excluded Thai subjects [8-10].

This means that the Thai language and Thai users have not yet been taken into consideration in VoIP quality evaluation, particularly with subjective measurement methods. Therefore, VoIP quality evaluation with native Thai speakers, who use Thai as their mother language, has been conducted and investigated in this paper. The motivation is evidence from earlier research reporting that cultural variation and language differences affect voice quality perception [11-12].

II. BACKGROUND

A. Tonal Feature of the Thai Language and the Tone Response of Thai Brains

Thai is a tonal language similar to Chinese but at the same time completely different in terms of the sound system. The Thai language is currently used by over 60 million people in Thailand. The tonal feature is important because different tones change the meaning of Thai words, as shown in Table I.

Surprisingly, it has been reported that the frontal operculum (FO) in the left hemisphere of Thai listeners' brains is activated when they listen to Thai sounds [13], their mother language [14], whereas the brains of English and Chinese listeners are not activated, although Chinese is also a tonal language. This can be explained by the Chinese subjects not being familiar with Thai sounds, as shown in Figure 1.

TABLE I. COMPARISON BETWEEN TWO THAI WORDS THAT HAVE A SIMILAR SOUND BUT NOT TONE.

Thai Sentence          Phonetic Symbol        English Meaning
บ้านฉันอยู่ใกล้วัด        ban tan ju klaj wat    My house is near a temple.
บ้านฉันอยู่ไกลวัด         ban tan ju klaj wat    My house is far (from) a temple.

Figure 1. When comparing the Thai lexical tone to the pitch task, only the Thai brain showed significant activation in the FO near Broca’s area [13].
B. Quality of Experience, Voice Quality Evaluation and Mean Opinion Score

Quality of Experience (QoE) has been defined as “the measure of how well a system or an application meets the user’s expectations” [15]. That means QoE focuses directly on user perception of the application, instead of considering network effects [15-16]. QoE evaluation therefore goes beyond, and covers, voice quality evaluation for VoIP applications. However, QoE is directly related to Quality of Service (QoS) [15], as is the Mean Opinion Score (MOS), which is the output of VoIP quality evaluation, as presented in Figure 2 [17].

The methods to evaluate voice quality are classified into subjective and objective evaluation methods [1]. Subjective evaluation methods have been described as accurate and highly reliable [18-19]. Conversation-opinion tests are among the methods recommended by ITU-T because they offer a high degree of realism [20]. However, the disadvantages of conversation-opinion tests are that they require two rooms with low background noise, high cost, high effort and good management skills. Moreover, these tests are time-consuming. The result from these tests is called Mean Opinion Score – Conversational Quality (MOS-CQS) [21]. The scale normally uses 5 for excellent, 4 for good, 3 for fair, 2 for poor and 1 for bad [20]. The scores voted by the subjects are then averaged to provide a single score. However, the results of the MOS-CQS could be affected by several kinds of variation, as follows [11]:

1) Cultural variation: in different cultures, there may be a difference between the meanings of excellent, good, ... and bad.
2) Language variation: many countries have their own languages. Some languages, such as Chinese and Thai, have a special feature called a tonal feature.
3) Individual variation: it is assumed that the standard deviation of individual variation is 0.5. Therefore, to obtain a 95% confidence interval for MOS-CQS, at least 24 subjects are required.
4) Balance of conditions: different laboratories could provide different conditions in a test and, in turn, different results.

C. Voice Codecs

When voice is carried over IP networks, a voice codec is needed to convert the voice signals into voice packets before transport. Normally, G.729 is used over WAN, whereas G.711 is used over LAN [22]. G.722 is an option for use over LAN. Therefore, only these three voice codecs are described in this paper, as follows [23-26]:

1) G.729: it is an 8 kbps coding technique called Conjugate Structure - Algebraic Code-Excited Linear Prediction (CS-ACELP). Its major advantage is requiring less bandwidth than G.711, which is the reason it is recommended for use over WAN. G.729 has several annexes, such as Annex A, called G.729A, which is the reduced-complexity version of G.729. Its MOS is 3.92.
2) G.711: it is a 64 kbps coding technique called Pulse Code Modulation (PCM). The subtypes of G.711 are G.711 μ-law and G.711 A-law. The μ-law is mainly used in the USA, Canada and Japan, whereas the A-law is used widely in Europe and the rest of the world, including Thailand [24]. This kind of codec is not new, because it has been used in ISDN since the 1990s. Its MOS is reported to be about 4.1.
3) G.722: it is a codec that can be used for a variety of higher-quality speech applications. It uses a bit rate of 64 kbps, like G.711. It has been mentioned in [24] that it was the first wideband codec issued by ITU-T. It can support audio bandwidth up to 7000 Hz. It is supposed to provide better voice quality than narrowband codecs such as G.711. However, it is not significantly better than G.711, since its MOS is about 4.1, as mentioned in [24].
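
The payload figures quoted for the three codecs can be turned into per-packet sizes and IP-layer bandwidth with a minimal Python sketch such as the one below. The payload bit rates (8 kbps for G.729, 64 kbps for G.711 and G.722) come from the codec descriptions; the 20 ms packetization interval and the 40-byte IPv4/UDP/RTP header overhead are assumptions chosen for illustration, not values reported in this study, and the function names are hypothetical.

```python
# Payload bit rates in bits per second, as quoted in the codec descriptions.
PAYLOAD_BPS = {"G.729": 8_000, "G.711": 64_000, "G.722": 64_000}

# Assumptions for illustration only (not taken from the study):
PACKET_INTERVAL_MS = 20      # one RTP packet every 20 ms
HEADER_BYTES = 20 + 8 + 12   # IPv4 + UDP + RTP headers, no link-layer framing


def payload_bytes(codec: str, interval_ms: int = PACKET_INTERVAL_MS) -> int:
    """Voice payload carried by one packet, in bytes."""
    return PAYLOAD_BPS[codec] * interval_ms // 1000 // 8


def ip_bandwidth_bps(codec: str, interval_ms: int = PACKET_INTERVAL_MS) -> int:
    """Bandwidth at the IP layer (payload plus IP/UDP/RTP headers), in bit/s."""
    packets_per_second = 1000 // interval_ms
    bytes_per_packet = payload_bytes(codec, interval_ms) + HEADER_BYTES
    return bytes_per_packet * 8 * packets_per_second


if __name__ == "__main__":
    for codec in PAYLOAD_BPS:
        print(codec, payload_bytes(codec), "bytes/packet,",
              ip_bandwidth_bps(codec), "bit/s at the IP layer")
```

Under these assumptions G.729 needs 24 kbps at the IP layer against 80 kbps for G.711 or G.722, so header overhead narrows the 8:1 payload ratio to roughly 3.3:1.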

Figure 2. (a) The point of QoE evaluation and (b) the points of VoIP quality evaluation are the same point of view, adopted and revised from [17].

III. METHODOLOGY

A. Test Design and Test Condition

This study was conducted using conversation-opinion tests. Subjects were invited pair by pair to sit in separate rooms and hold a conversation of about 3-5 minutes. Richard’s task [27] was selected as the conversation task. The condition variable for this test is the codec. The codecs tested were G.729, G.711 A-law (hereafter called G.711) and G.722. Each codec was designed to be tested by at least 12 pairs of subjects, because subjective tests are usually conducted with 24-32 subjects [11]. The tests were run under the best possible condition for the VoIP system and network, called the direct condition.

B. Laboratory and Test Facilities

Following [20], the conversation-opinion tests were held in a studio room and a sound recording room, which have been used as a laboratory, at the Central Library of KMUTNB, Bangkok [28].

As in Figure 2, the conversation-opinion tests were conducted with a ‘real’ VoIP system. The two phones used for testing were also ‘real’ IP phones that support the SIP signaling protocol. The VoIP server was implemented using the open-source Asterisk software, version 1.6.2. Theoretically, VoIP systems under the direct condition can provide a QN of infinity, as mentioned in [29-30]. Packet delay was very low (< 10 ms) and there was no packet loss.

D. Subjects

Each codec test requires at least 24 subjects, both male and female. Therefore, this study requires at least 72 subjects in total. The subjects, who represent a group of native Thai listeners, were student volunteers from KMUTNB.

E. Data Gathering

This study used a paper-based form. Subjects first read and listened to brief instructions about Richard’s task. After finishing the task, each pair of subjects answered questions about voice quality and some personal information (e.g. age). The results were then used to calculate the MOS-CQS.

IV. RESULTS

G.729 and G.722 were each tested by a different group of 12 pairs of subjects, whereas G.711 was tested by a group of 13 pairs of subjects. The total number of subjects was therefore 74, consisting of 47 male and 27 female subjects with an average age of 20.66 years (SD = 1.80 years). The results are presented in Figure 3.

V. ANALYSIS

From the results in Figure 3 (a), it can be seen that the MOS-CQS of G.729, G.711 and G.722 are not very different. Therefore, to verify whether the user perception of these three codecs is significantly different, the raw data were analyzed using ANOVA, a statistical tool, with the following hypotheses:

H0: The user perception of G.729, G.711 and G.722 is the same.
H1: The user perception of G.729, G.711 and G.722 is different.

The result of the ANOVA is presented in Table II.

VI. DISCUSSION

Figure 3 (a) shows that G.722 has the highest MOS-CQS of 4.21, G.711 is in the middle with 4.15, and G.729 has the lowest MOS-CQS of 4.13. This result is consistent with the general understanding that G.722 is better than G.711 and G.711 is better than G.729.

Figure 3. Comparison among G.729, G.711 and G.722 for (a) MOS-CQS (MOS by codec) and (b) percent of votes (% of votes by opinion score).

TABLE II. HYPOTHESIS TEST RESULT

Hypothesis                                              p-value
The user perception of G.729 vs. G.711 vs. G.722        0.880
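
The kind of calculation behind a one-way ANOVA on opinion scores can be sketched in a few lines of Python. The sketch computes the MOS and a normal-approximation 95% confidence interval for each codec group, then the ANOVA F statistic. The vote lists are invented for illustration and are not the data collected in this study; the function names are hypothetical, and converting F into a p-value is left to a statistics package because the Python standard library has no F distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical opinion scores (1-5) per codec; NOT the votes collected in the study.
votes = {
    "G.729": [4, 4, 5, 3, 4, 4, 5, 4],
    "G.711": [4, 5, 4, 3, 4, 5, 4, 4],
    "G.722": [5, 4, 4, 3, 5, 4, 5, 4],
}


def mos_with_ci(scores, z=1.96):
    """Return (MOS, half-width of the normal-approximation 95% CI)."""
    return mean(scores), z * stdev(scores) / sqrt(len(scores))


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [s for g in groups for s in g]
    grand_mean = mean(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    for codec, scores in votes.items():
        mos, half = mos_with_ci(scores)
        print(f"{codec}: MOS = {mos:.2f} +/- {half:.2f}")
    f_stat, df_b, df_w = one_way_anova_f(list(votes.values()))
    print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

An F value well below 1 means the variation between codec groups is small compared with the variation among subjects within a group, which is the situation a large p-value such as 0.880 describes. With the standard deviation of 0.5 assumed in Section II-B and 24 subjects, the same interval formula gives a half-width of about 1.96 × 0.5 / √24 ≈ 0.2 MOS.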
When considering the votes in Figure 3 (b), no codec received a score of 1 or 2, while all of them received scores of 3 in almost the same percentage. For the score of 5, however, G.722 clearly obtained the highest share of votes, G.711 was in the middle and G.729 was the lowest. For the score of 4, the order is reversed: G.729 obtained the highest share, G.711 the middle and G.722 the lowest. This could be the reason that G.722 obtains the MOS-CQS of 4.21, while G.729 obtains the MOS-CQS of 4.13.

However, considering Table II, the p-value obtained from the hypothesis test using ANOVA is 0.880, which is higher than 0.05. This means that the perception of users of these three codecs, G.729, G.711 and G.722, is not significantly different. This is inconsistent with the general understanding, even though G.729 uses only 8 kbps for its payload, whereas G.711, a narrowband codec, and G.722, a wideband codec, use 64 kbps.

VII. CONCLUSION AND FUTURE WORK

After studying a group of Thai subjects, the MOS-CQS for G.729, G.711 and G.722 have been obtained. Although the subjects were certainly not representative of the general Thai population, the results could serve as a benchmark for VoIP quality evaluation based on Thai users and might be used to calibrate objective measurement tools for Thai environments. Also, for Thai users, G.729 could be recommended to obtain voice quality as good as that of G.711 and G.722 while reducing network traffic, because it requires a payload bandwidth of only 8 kbps, whereas G.711 and G.722 require a payload bandwidth of 64 kbps. From another point of view, this study presents evidence that user perception of these three codecs is not significantly different, which is inconsistent with the general understanding about the voice quality provided by G.729, G.711 and G.722. Therefore, this process can be considered for verification with other languages. Thus, this evidence can be used to challenge or improve the development of voice quality provided by novel codecs for VoIP.

ACKNOWLEDGMENT

The authors thank all reviewers for their useful comments. Thanks also go to the lecturers, students, and staff of KMUTNB who supported this work, particularly Mr. Wiwat Suwanuntawong, the Central Library Studio staff, and Mr. Gary Sherriff, the international coordinator, Faculty of Information Technology (for editing). Lastly, the first author would like to dedicate this paper to Dr. Gareth Clayton, advisor, who sadly passed away.

REFERENCES

[1] F. D. Rango, M. Tropea, P. Fazio, and S. Marano, “Overview on VoIP: Subjective and Objective Measurement Methods,” IJCSNS, Vol. 6 No. 1B, 2006.
[2] J. Ren, H. Zhang, Y. Zhu and C. Gao, “Assessment of effects of different language in VOIP,” ICALIP2008, Shanghai, 2008.
[3] F. Chong, I. McLoughlin, and K. Pawlikowski, “A methodology for improving PESQ accuracy for chinese speech,” presented at the IEEE Region 10 Conf., TENCON, Melbourne, Nov. 2005.
[4] F. L. Chong, K. Pawlikowski, and I. V. McLoughlin, “Evaluation of ITU-T G.728 as a Voice over IP codec for Chinese Speech,” Australian Telecommunication Networks and Applications Conference, Dec 2003.
[5] Z. Ding, I. V. McLoughlin, and E. C. Tan, “Intelligibility evaluation of GSM coder for Mandarin speech using CDRT,” Speech Communication, vol. 38(1), pp. 161–165, Sep 2002.
[6] J.-H. Chen and J. Thyssen, “BroadVoice®16: A PacketCable Speech Coding Standard for Cable Telephony,” Proc. Asilomar Conf. Signals, Systems, Computers, Asilomar, CA, Oct 2006.
[7] J.-H. Chen and J. Thyssen, “The BroadVoice Speech Coding Algorithm,” Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. IV-537 - IV-540, Apr 2007.
[8] C. Quinquis, “Quality comparison of wideband coders including tandeming and transcoding,” ETSI Workshop on Speech and Noise in Wideband Communication, May 2007.
[9] N. Kitawaki and T. Tamada, “Subjective and Objective Quality Assessment for Noise Reduced Speech,” ETSI Workshop on Speech and Noise in Wideband Communication, May 2007.
[10] Z. Cai, N. Kitawaki, T. Yamada, and S. Makino, “Comparison of MOS evaluation characteristics for Chinese, Japanese, and English in IP telephony,” in Proc. International Universal Communication Symposium, pp. 1-4, Oct. 2010.
[11] A. W. Rix, “Comparison between subjective listening quality and P.862 PESQ score,” Psytechnics, Sep 2003.
[12] E.M. Yiu, B. Murdoch, K. Hird, P. Lau and E.M. Ho, “Cultural and language differences in voice quality perception: a preliminary investigation using synthesized signals,” Folia Phoniatr Logop, Vol. 60 (3), 2008, pp. 107–119.
[13] J. Gandour, D. Wong, L. Hsieh, B. Weinzapfel, D. V. Lancher and G. D. Hutchins, “A crosslinguistic PET study of tone perception,” J. Cogn. Neurosci., Massachusetts Institute of Technology, Jan 2000, Vol. 12, No. 1, pp. 207-222.
[14] W. Sittiprapaporn, C. Chindaduangratn, and N. Kotchabhakdi, “Long-term memory traces for familiar spoken words in tonal languages as revealed by the Mismatch negativity,” Songklanakarin J. Sci. Technol., 2004, Vol. 26, No. 6, pp. 779-786.
[15] H. Batteram et al., “Delivering Quality of Experience in Multimedia Networks,” Bell Labs Tech. J., 2010, Vol. 15(1), pp. 175-194.
[16] Nokia, “Quality of Experience (QoE) of mobile services: Can it be measured and improved?,” White paper, 2004.
[17] K. Kilkki, “Quality of Experience in Communications Ecosystem,” J. UCS, Mar 2008, Vol. 14, No. 5, pp. 615-624.
[18] M. Goudarzi, “Evaluation of Voice Quality in 3G Mobile Networks,” Thesis, University of Plymouth, Jun 2009.
[19] T. A. Hall, “Objective speech quality measures for Internet telephony,” in: Voice over IP (VoIP) Technology, Proceedings of SPIE, vol. 4522, Denver, CO, USA, 2001, pp. 128-136.
[20] ITU-T Recommendation P.800, “Methods for subjective determination of transmission quality,” Aug 1996.
[21] ITU-T Recommendation P.800.1, “Mean Opinion Score (MOS) terminology,” Jul 1996.
[22] Avaya Labs., “Avaya IP Voice Quality Network Requirements,” Avaya Inc, CO, Apr 2006.
[23] O. Hersent, J. Petit and D. Gurle, “IP Telephony: Deploying Voice-over-IP Protocols,” Wiley, 2005.
[24] S. Karapantazis and F.-N. Pavlidou, “VoIP: A comprehensive survey on a promising technology,” Comput. Networks, vol. 53, no. 12, pp. 2050-2090, August 2009.
[25] ITU-T Recommendation G.729, “Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP),” Jan 2007.
[26] ITU-T Recommendation G.722, “7 kHz Audio - Coding within 64 kbit/s,” 1988.
[27] ITU-T Recommendation P.805, “Subjective evaluation of conversational quality,” Apr 2007.
[28] T. Daengsi and K. Tontiwattanakul, “A Case of Improvement of Building Acoustics Using Available Equipments and Limited Resources,” Naresuan Research Conference 2010, Phitsanulok, Thailand, Jul 2010.
[29] ITU-T Recommendation P.830, “Subjective Performance Assessment of Telephone-Band and Wideband Digital Codecs,” Feb 1996.
[30] ITU-T Recommendation P.810, “Modulated Noise Reference Unit (MNRU),” Feb 1996.
