Nemo Voice Quality - White Paper

Confidential and Proprietary

Nemo Voice Quality

White Paper


1 THEORETICAL BACKGROUND
1.1 OVERVIEW
1.2 ITU-T RECOMMENDATIONS
1.2.1 ITU-T P.800
1.2.2 PERCEPTUAL MODELING
1.2.3 ITU-T P.861 ‘PSQM’
1.2.4 ITU-T P.862 ‘PESQ’
1.3 PERFORMANCE EVALUATION
1.3.1 ALGORITHM COMPARISON
1.3.2 PESQ SCORE RELIABILITY
2 NEMO VOICE QUALITY
2.1 OVERVIEW
2.1.1 STANDARD COMPLIANCE
2.1.2 NEMO AUDIO MODULE
2.2 MOBILE-TO-MOBILE TESTING
2.3 MOBILE-FIXED-MOBILE TESTING
3 REFERENCES

1 THEORETICAL BACKGROUND
1.1 OVERVIEW
The speech transmission path across mobile and fixed networks consists of many different elements.
Along the path there can be multiple speech codecs, analog-to-digital and digital-to-analog conversions,
echo cancellers, noise suppressors, adaptive level controllers, voice activity detectors, comfort noise
generators, signal enhancers and so on. In modern packet-switched networks, variable delays and
packet losses cause further problems. Moreover, especially in mobile networks, additional quality
degradation may, and usually will, result from bit errors on the air interface and from silent
gaps caused by, for example, handovers.

Such complex systems can inflict a large variety of degradations on speech signals. These
degradations include loudness loss, talker and listener echo, temporal gaps in the speech signal, filtering,
amplitude clipping, variable delays, distortions, channel errors, and artifacts from noise reduction
algorithms and from the operation of echo cancellers.

1.2 ITU-T RECOMMENDATIONS


The following sections are based on reference [1], with permission from Opticom GmbH.

1.2.1 ITU-T P.800

Because of their historical link to the assessment of telephone connections, methods for testing
telephone-band speech signals were first standardized within ITU-T. Recommendation P.800 [2]
defines the absolute category rating (ACR) test method, which has been used for the assessment of speech
codecs since 1993. Within the ACR test method, the ITU five-grade impairment scale is applied (see
table 1). Note that, because of the telecommunication environment, testing is done without
a comparison to an undistorted reference. This matches the typical situation of a phone call, where the
listener has no access to a reference such as the original voice of the other
party. However, a listening test according to P.800 can still be regarded as a
comparison between a test signal and a reference “in the mind” of the listener, because
the listener is very familiar with the natural sound of a human voice.

To allow comparison, and to be able to merge the results of different individuals, it is
necessary to anchor the listeners’ opinions to an absolute scale. For this purpose, predefined examples
with well-defined noise insertions at fixed modulated noise reference unit (MNRU) [3] settings are presented at
the beginning of the test. Each sample represents an example distortion corresponding to one grade of the
ITU-T five-grade impairment scale.

Impairment                       Grade
Imperceptible                    5
Perceptible, but not annoying    4
Slightly annoying                3
Annoying                         2
Very annoying                    1

Table 1: The ITU-T five-grade impairment scale.

Under these test conditions, a population of typically 20 to 50 test subjects is presented with an
identical series of speech fragments. Every test subject is asked to score each sample by applying
the impairment scale. After statistical processing of the individual results, a mean opinion score
(MOS) can be calculated. With thorough setups, such test results can be reproduced quite well, even
at different locations. However, the effort needed in terms of subjects and time is considerable, and
such test methods cannot be applied in a practical or field environment.
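
The statistical processing can be illustrated with a minimal sketch. It assumes integer votes on the five-grade scale and uses a normal approximation for the confidence interval; the votes are invented for illustration, and P.800 itself should be consulted for the prescribed analysis.

```python
import math
import statistics


def mean_opinion_score(scores):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(scores) < 2:
        raise ValueError("need at least two listener scores")
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("scores must be grades on the five-grade scale")
    mos = statistics.mean(scores)
    # Normal approximation: 1.96 is the two-sided 95% quantile.
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mos, half_width


# Hypothetical votes from 20 listeners for one speech sample.
votes = [4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 4, 5, 4, 3, 4, 4, 4]
mos, ci = mean_opinion_score(votes)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The confidence interval narrows only with the square root of the number of listeners, which is one reason such tests are costly.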

1.2.2 PERCEPTUAL MODELING

Throughout the development of compression schemes, assessing the resulting quality has been an open
issue. Consequently, the idea of replacing subjective tests with objective, computer-based
methods has been an ongoing focus of research and development. Early work motivated by
developments in speech coding was reported in [7]. Since then, several methods have been introduced.

International standardization of perceptual audio measurement techniques was mainly driven by two
expert groups within the International Telecommunication Union (ITU).

The underlying concepts of the proposed perceptual measurement algorithms are all quite similar. The
common structure of these algorithms is depicted in figure 1. The process of human perception is
modeled by a difference measurement technique which compares a reference signal
(i.e. the "input" signal to a codec) with a test signal (i.e. the "output" signal of the codec). First, the
algorithms apply an ear model to the reference and the test signal in order to calculate an estimate
of the audible signal components. The result can be thought of as the "internal representation" inside
the human auditory system. The comparison of the internal representations of the reference and the
test signal leads to an estimate of the audible difference. In order to derive an overall quality figure, this
information, which is a function of time, must be processed in the way the brain of a
subject would process it in a listening test. This part of the processing within an algorithm is referred to
as cognitive modeling. In the end, a total quality figure is derived, which can be compared to a
MOS (Mean Opinion Score) resulting from a listening test.

Figure 1: The underlying concept for perceptual measurement.

1.2.3 ITU-T P.861 ‘PSQM’

Subjective quality assessment of speech codecs is one of the key technologies in designing digital
telecommunication networks. Recommendation P.830 [4] defines subjective testing methodologies for
speech codecs. Since subjective quality assessment is time-consuming and expensive, it was
desirable to develop an objective quality assessment methodology that estimates the subjective
quality of speech codecs with less subjective testing.

In the past, the most widely-used objective speech quality measure demonstrating the performance of
speech codecs was the Signal-to-Noise Ratio (SNR = S/N). However, it was pointed out that the SNR
does not adequately predict subjective quality for modern network components. This is especially true
for recent low bit-rate codecs.
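
A toy illustration of why SNR can mislead is sketched below. It is only an assumption-laden example, not part of any ITU method: a 1 kHz tone sampled at 8 kHz is compared with a slightly attenuated copy and with a copy delayed by half a period. The delayed copy would sound the same to a listener, yet its sample-wise SNR is negative.

```python
import math


def snr_db(reference, degraded):
    """Sample-wise signal-to-noise ratio in dB, treating (reference - degraded) as noise."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


fs = 8000  # sampling rate in Hz (telephone band)
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]

# Copy attenuated by 10 percent: a small, audible level change.
attenuated = [0.9 * x for x in tone]

# Copy delayed by 4 samples, i.e. half a period of the 1 kHz tone.
# The waveform is merely inverted, which sounds the same.
delayed = [math.sin(2 * math.pi * 1000 * (n - 4) / fs) for n in range(800)]

print(f"attenuated: {snr_db(tone, attenuated):.1f} dB")
print(f"delayed:    {snr_db(tone, delayed):.1f} dB")
```

The delayed copy scores far worse than the attenuated one even though the listener hears no degradation, which is the kind of mismatch that perceptual models are intended to avoid.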

Within the telecommunication sector of the ITU, in 1996 study group 12 finalized the recommendation
P.861 [5] for the objective analysis of speech codecs. After a wide-ranging comparison of proposed
methods, the group opted for the PSQM algorithm. PSQM correlated up to 98 percent with the scores
of subjective listening tests.

To the extent that PSQM is a faithful representation of human perception and judgment processes,
inaudible differences between input and output will receive the same PSQM score. In particular, if the
input and the output are identical, PSQM will predict perfect quality irrespective of the quality of the
input signal.

Within PSQM, the physical signals constituting the source and coded speech are mapped onto
psychophysical representations that match the internal representations of the speech signals (the
representations inside our heads) as closely as possible. The quality of the coded speech is judged on
the basis of differences in the internal representation. This difference is used for the calculation of the
noise disturbance as a function of time and frequency. In PSQM, the average noise disturbance is
directly related to the quality of coded speech.

Besides perceptual modeling, the PSQM method also uses cognitive modeling in order to achieve a
high correlation between subjective and objective measurements [4]. The result is the estimated quality
of the received signal.

1.2.4 ITU-T P.862 ‘PESQ’

Driven by the demand for a verified test procedure for VoIP, an expert group within ITU-T SG12
worked on an improved speech quality model. After a competitive phase, the new model ‘PESQ’
was devised. PESQ stands for "Perceptual Evaluation of Speech Quality". PESQ combines and
further refines PSQM and PAMS. Extensive tests showed PESQ's superior performance,
especially for VoIP applications. In February 2001 PESQ was accepted as ITU-T Rec. P.862 [6].

When PSQM was standardized as P.861 [5], the scope of the standard was the state-of-the-art
codecs of the time, as mainly used for mobile transmission, like GSM. VoIP was not yet a topic.
The requirements for measurement equipment have changed dramatically since then. As a
consequence, the ITU set up a working group to revise the P.861 standard in order to cope with the
new demands arising from modern networks like VoIP. With these networks the measurement
algorithm has to deal with much higher distortions than with GSM codecs, but perhaps the most significant
factor is that the delay between the reference and the test signal is no longer constant.

A first approach to overcoming these problems was the development of PSQM+ (which, however, is not
included in the standard). It handled the larger distortions caused by, for example, burst
errors well, but still had significant problems compensating for varying delay.

With the ITU standard P.862 (PESQ) [6] this problem is addressed. PESQ combines
the psychoacoustic and cognitive model of PSQM+ with a time alignment algorithm that
handles varying delays. The main drawback of PESQ is that it is not designed for
streaming applications, which is why it cannot fully replace PSQM+. With PSQM and PESQ there
are now two standards that together cover the problem of measuring speech quality. Figure 2 gives an
overview of the structure of the PESQ algorithm and shows the new blocks which have been added to
the PSQM algorithm.

Figure 2: The structure of the PESQ algorithm.

One of the major advantages of PESQ compared to PSQM(+) is that it contains a robust time
alignment algorithm which is capable of handling varying delays. With PSQM, time alignment
was not part of the standard, and it was up to the implementers to take care of this issue. As
experience showed, only very few PSQM implementations came with a time alignment algorithm that
was well suited to static delays on real networks, and even fewer measurement systems were capable
of handling varying delays, as they appear on, for example, packet-based networks. As a consequence of
incorrect time alignment, parts of the reference and the test signal were compared that did not match
and therefore sounded different. This difference led to a PSQM score which was too
pessimistic and simply wrong. With PESQ, this shortcoming is eliminated and the user
gets realistic results for the device under test. There is no longer a danger that the tested system is
downgraded just because of a deficiency of the measurement algorithm.
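
The simplest form of time alignment, compensating one constant delay, can be sketched with a brute-force cross-correlation search. This is an illustrative assumption only: the alignment in PESQ is considerably more elaborate and also copes with delays that change during the sample. The function name and test signal are hypothetical.

```python
import random


def estimate_delay(reference, degraded, max_lag):
    """Return the lag in samples at which degraded best matches reference."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate reference[n] with degraded[n + lag].
        score = sum(r * d for r, d in zip(reference, degraded[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical test signal: noise-like reference, received 37 samples late
# and attenuated to 80 percent of its level.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(400)]
degraded = [0.0] * 37 + [0.8 * x for x in reference]

lag = estimate_delay(reference, degraded, max_lag=100)
# Cut the degraded signal so that it lines up with the reference.
aligned = degraded[lag:lag + len(reference)]
print("estimated delay:", lag, "samples")
```

Only after such alignment does it make sense to compare the two signals section by section; comparing misaligned sections is what produced the overly pessimistic scores described above.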

The PESQ score expresses the voice quality on a MOS-like scale. The PESQ score as defined by
ITU recommendation P.862 ranges from -0.5 (worst) up to 4.5 (best). This may be surprising at first glance,
since the ITU scale ranges up to 5.0, but the explanation is simple: PESQ simulates a listening test
and is optimized to reproduce the average result of all listeners (remember, MOS stands for Mean
Opinion Score). Statistics show that the best average result one can generally expect from a
listening test is not 5.0 but approximately 4.5. Subjects appear to be cautious about giving a score of 5,
meaning "excellent", even if there is no degradation at all.

The PESQ score is frequently also referred to as the PESQ MOS, which reflects the high correlation
between the PESQ score and MOS. Strictly speaking, however, this is not correct: the raw PESQ score
is not on the ITU P.800 scale, but it can be mapped to it by applying a simple mapping function. One such
function has been standardized by the ITU in P.862.1 [9]. Another mapping function is the so-called PESQ-LQ,
which recommendation P.862.1 has made obsolete.
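
As a sketch, the P.862.1 mapping can be written as a logistic function of the raw score. The coefficients below are the ones commonly quoted for P.862.1 and should be verified against the Recommendation [9] before being relied on; the function name is hypothetical.

```python
import math


def pesq_to_mos_lqo(raw_score):
    """Map a raw P.862 score (-0.5 to 4.5) to MOS-LQO, assuming the P.862.1 logistic form."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


for raw in (-0.5, 1.0, 2.0, 3.0, 4.0, 4.5):
    print(f"raw PESQ {raw:4.1f} -> MOS-LQO {pesq_to_mos_lqo(raw):.2f}")
```

Under this mapping the raw range of -0.5 to 4.5 corresponds to roughly 1.02 to 4.55 on the MOS-LQO scale, so a mapped score never reaches 5.0 either.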

1.3 PERFORMANCE EVALUATION

1.3.1 ALGORITHM COMPARISON

ITU-T study group 12 has evaluated the performance of different objective speech
quality assessment algorithms. Study group 12 question 9 was "Objective measurement of speech
quality under conditions of non-linear and time-variant processing". The performance of five algorithms,
named PACE, PAMS, TOSQA, VQI and PSQM, was investigated in [8]. Some results from the
evaluation can be seen in table 2. The third-order monotonic polynomial mapping is performed to
account for normal variations between groups of subjects and between test labs, and for the context of the
subjective experiment.
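
The correlation step of such an evaluation can be sketched as a plain Pearson correlation between subjective MOS values and (already mapped) objective scores per condition. The polynomial mapping itself is omitted here, and the numbers are invented for illustration; they are not taken from [8].

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-condition results: subjective MOS and objective scores.
subjective = [1.8, 2.4, 3.1, 3.6, 4.2]
objective = [1.9, 2.2, 3.3, 3.5, 4.0]
print(f"correlation = {pearson(subjective, objective):.3f}")
```

A coefficient close to 1 means the objective score ranks and spaces the conditions much as the listeners did.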

Table 2: Correlation coefficients per condition after third order monotonic polynomial mapping. Some
legacy algorithm results have been removed from original table.

Table 2 indicates that the PESQ method has the best overall performance and is the only algorithm in this
evaluation that meets the requirements shown in the Req column. It also
shows that many of the older algorithms fall well short of the requirements in numerous cases.

1.3.2 PESQ SCORE RELIABILITY

Generally, the voice quality measure produced by Nemo Outdoor, that is, the PESQ score mapped onto the
P.800 scale, is a reliable indicator of a network’s speech quality. However, one should not
pay too much attention to differences of hundredths, or even of a few tenths, in
objective MOS scores, because the algorithm’s precision is roughly 0.25 MOS.

Because GSM speech codecs (Half Rate, Full Rate, Enhanced Full Rate) are based on
vocoding, the theoretical maximum quality is 3.85 MOS on an EFR channel. This result is based on a simple
measurement: all one needs is a software codec (such as GSM EFR) that encodes a PCM speech file,
writes the GSM-encoded bit-stream to a file, and then decodes it back to
PCM. When the original and the encoded-decoded files are given to the PESQ library, the PESQ score should be
close to the above-mentioned value. In real life, the sample also goes through DA and AD conversions, which
affect the score minimally.

Some guidelines for evaluating the MOS follow. Generally, one can achieve scores of 3.5-4.0 MOS only in good
laboratory conditions using the GSM EFR codec. In casual test drives a good score would be between
2.0 and 3.0, and listeners normally cannot tell much difference between 2.0 and 3.0 MOS. Once the
quality drops to between 1.0 and 1.9, degradations become easily noticeable. The AMR codec may give
scores up to 0.5 MOS lower than GSM EFR because it lowers the bit-rate in good conditions, hence
the codec’s name: Adaptive Multi-Rate.

It is not essential whether some other tools claim to produce more precise or higher MOS scores than
a PESQ implementation. Other tools may use a MOS scale other than the official P.800 scale, in which case
their results are not comparable with P.800 MOS-LQO mapped results and
can be compared only with other results produced with the same tool. In the end, MOS just means
Mean Opinion Score, and the term can be used freely with any test material, so different vendors can use it
in different ways. As for precision, when compared with actual tests made with human
test subjects, there is no official algorithm that outperforms PESQ.

As PESQ is the latest objective algorithm and also the latest ITU-T recommended method, it is the best
possible method to be used in objective end-to-end measurements. Being seamlessly integrated into
Nemo Outdoor, PESQ-based voice quality makes the voice quality testing easy and enables network-
wide quality performance evaluation as seen by the end-user.

2 NEMO VOICE QUALITY


2.1 OVERVIEW
Nemo Voice Quality is an option for the Nemo Outdoor measurement system. Each Outdoor system can
include one to four VQ measurement ends. If more than one measurement end is installed on
Nemo Outdoor, it is also a multi-system based on the Nemo Multi platform.

One Nemo Outdoor Multi test laptop can run up to four simultaneous tests. These can be combined
freely from voice quality and data transfer tests and any test terminals supported by Nemo Outdoor,
except that some legacy terminals are limited to one per test system.

Nemo Outdoor VQ uses the PESQ library that is licensed from Opticom GmbH. The PESQ score is
calculated offline by comparing the original and the degraded sample files (so called intrusive testing).

2.1.1 STANDARD COMPLIANCE

Nemo Voice Quality uses the algorithm specified in ITU-T recommendation P.862 [6], also
known as PESQ. The PESQ score is mapped onto the MOS-LQO scale, as specified in ITU-T
recommendation P.800 [2], by using the mapping function described in ITU-T recommendation P.862.1
[9]. PESQ replaces the older P.861 PSQM [5] algorithm, which is now obsolete.

PESQ is compatible with all known mobile technologies, such as 2G, GSM, CDMA and 3G, and with codecs
such as HR, FR, EFR and AMR.

2.1.2 NEMO AUDIO MODULE

The Nemo Voice Quality sub-system is based on a custom-made audio card (Nemo Audio
Module) that handles not only the sending and receiving of test samples, but also all
digital-analog-digital conversions, mutual synchronization between measurement ends, and adjustment of
line levels for optimal performance.

Nemo Audio Module is an easily scalable, semi-automatic device that does not consume CPU
processing power. Test personnel can listen to the ongoing test by connecting a 3.5 mm headset to the Nemo
Audio Module, and there is also a line output connector for external recorders. Nemo Outdoor can also
store received samples to be played later during measurement file playback or for further analysis by
an expert listener. Each sample file is named automatically so that the name includes the date, time and MOS,
which makes it easy to pick out the most interesting samples from all recorded material later.

2.2 MOBILE-TO-MOBILE TESTING

[Figure: mobile-to-mobile test setup. Labels: test laptop (Nemo Outdoor), USB, MMAC2, two Nemo Audio
Modules, audio, serial trace and power.]

The test terminals used in voice quality measurements can be on separate laptops or attached to the
MMAC2 unit. The MMAC2 unit is needed when multiple measurement ends are connected to
Nemo Outdoor. Nemo Audio Module units can be installed on top of the MMAC2 rack, in which case
they get power from it. The Nemo Audio Module units can also be used as stand-alone devices,
powered by an external power supply or by the nearest PS/2 or USB port.

The test proceeds as follows. First, a reference sample is uploaded to each Nemo Audio Module. A
test mobile initiates a test call to the other mobile. After a connection is established, the mobile
initiating the call starts sending the reference sample and the other end receives it. The receiving end
calculates the PESQ score after the sample is fully received.

After that, depending on the test mode, the initiator can send the sample again (simplex TX mode) or
the receiver sends in turn (duplex mode). The sending is repeated, and alternated in duplex mode, as
long as the test call lasts.
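
The sending order in the two test modes can be sketched as follows. The mode names and the function are hypothetical and only mirror the description above; they are not Nemo Outdoor identifiers.

```python
def sending_directions(mode, sample_count):
    """Return which end sends each successive sample during one test call."""
    if mode == "simplex_tx":
        # The initiating mobile sends every sample.
        return ["initiator"] * sample_count
    if mode == "duplex":
        # The initiator sends first, then the two ends alternate.
        return ["initiator" if i % 2 == 0 else "receiver" for i in range(sample_count)]
    raise ValueError(f"unknown test mode: {mode!r}")


print(sending_directions("simplex_tx", 4))
print(sending_directions("duplex", 4))
```

In duplex mode each end therefore acts as sender for every other sample, so both directions of the call are measured within one test call.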

2.3 MOBILE-FIXED-MOBILE TESTING


Nemo Audio-Test Server is available during 2Q05. The hardware platform is the same as in Nemo
Data-Test Server, and both functions can be installed on the same server.
Each option has four PSTN or Ethernet lines. The server is based on Fedora Core 2 Linux and requires no
administration or configuration after initial set-up. The server stores measurement files in the same
file format as Nemo Outdoor, and the measurement files can be collected over an FTP connection.

[Figure: mobile-fixed-mobile test setup. Labels: test laptop (Nemo Outdoor), Nemo Audio Module, test
terminal, PLMN / PSTN, Nemo Audio Test Server (up to 12 PSTN fixed line interfaces), audio signal, trace.]

The test proceeds as follows:

• First, a reference sample is uploaded to the Nemo Audio Module. A test mobile initiates a test
call to one of the fixed line numbers of the Nemo Server, which has one or more Audio-test options
installed.
• The server answers the call. Nemo Outdoor/Audio Module starts sending the configured test
sample in the uplink direction to the Nemo Server.
• The server recognises the received voice sample and starts sending the same test
sample in the downlink direction to Nemo Outdoor.
• Both ends (Nemo Outdoor, Nemo Server) record the measurement samples, calculate the
MOS scores, and store the MOS results in the corresponding measurement files.

Along with the test samples, one can hear short pseudo-noise bursts between samples. These are used for
synchronization and do not affect the PESQ score, because they are removed before the
calculation.

Results from the server can be collected by using e.g. FTP.



3 REFERENCES
[1] Opticom GmbH, PesqOemLibraryManualV1.6.4.pdf

[2] ITU-T Recommendation P.800, Methods for subjective determination of transmission quality,
1996

[3] ITU-T Recommendation P.810, Modulated Noise Reference Unit (MNRU), 1996

[4] ITU-T Recommendation P.830, Subjective Performance Assessment of Telephone-Band and
Wideband Digital Codecs, 1996

[5] ITU-T Recommendation P.861, Objective Quality measurement of telephone-band (300 -
3400 Hz) speech codecs, 1996

[6] ITU-T Recommendation P.862, PESQ an objective method for end-to-end speech quality
assessment of narrowband telephone networks and speech codecs, February 2001

[7] KARJALAINEN M., A New Auditory Model for the Evaluation of Sound Quality of Audio
Systems, Proc. of the ICASSP 1985, pp. 608-611

[8] ITU-T SG12 contribution COM12-117, Solothurn, Germany, 6-10 March 2000

[9] ITU-T Recommendation P.862.1, Mapping function for transforming P.862 raw result scores to
MOS-LQO
