Representation Switch Smoothing for Adaptive HTTP Streaming

Michael Grafl, Christian Timmerer

Institute of Information Technology (ITEC), Alpen-Adria-Universität (AAU), Klagenfurt, Austria
michael.grafl@itec.aau.at, christian.timmerer@itec.aau.at

Abstract

When an adaptive media streaming system has to switch from one representation of the content to another, the switch causes viewer distraction. We introduce the concept of representation switch smoothing for alleviating the distraction and improving the overall quality of experience. As adaptive HTTP streaming systems typically deploy video buffers on the client side, the adaptation decision is known far enough ahead of playout time to perform a seamless transition between quality representations. We discuss implementation considerations for an adaptive HTTP streaming system with scalable video coding, present a subjective evaluation of the proposed approach, and identify factors that influence how smooth transitions are perceived.

Index Terms: adaptive streaming, representation switching, quality of experience

1. Introduction and Concept

Adaptive HTTP streaming has gained widespread adoption in delivering multimedia content within heterogeneous environments (i.e., networks, terminals, users). It is typically deployed on top of the existing (network) infrastructure without any quality guarantees and, thus, delivered on a best-effort basis. Various proprietary formats are available which may eventually converge to MPEG's standard on Dynamic Adaptive Streaming over HTTP (DASH) [1]. However, the way in which services, based on these formats, are perceived by the end user is deliberately not determined by the format itself but subject to research. In general, the multimedia content is provided in multiple versions – referred to as representations comprising different bitrates, resolutions, codecs, languages, etc. – and available in segments of about 2-10 seconds length. The client requests segments based on its context (e.g., current throughput which translates to available bandwidth) and may switch to different representations, typically at segment boundaries.

Frequent quality switches with high amplitudes in adaptive HTTP streaming sessions – e.g., switching from (very) high to (very) low bitrates – have been shown to annoy viewers and, thus, reduce the Quality of Experience (QoE) [2]. The disturbance can be reduced through intermediate quality levels [3] but in practice only very few levels (3-5) are deployed. Previous work focused only on quality switches at segment boundaries and viewers may still notice abrupt quality changes.

In this paper, we propose a more fine-grained approach, a smooth transition between representations, which we subsequently call representation switch smoothing. The goal of representation switch smoothing is to reduce the annoyance of quality switches even further. When the receiver is aware of an imminent switch to a lower representation, it can already reduce the playout quality of the current representation, enabling a smooth transition between the two representations. Frame by frame, the playout quality is slightly reduced. Vice versa, the playout quality is smoothly increased after a higher representation has been received. This concept is illustrated in Figure 1.

Figure 1: Adaptation with (a) traditional representation switching and (b) representation switch smoothing.

In client-driven streaming scenarios such as DASH, the adaptation decision is typically known at least one segment duration ahead of the playout time. While the current segment is played, the next segment has to be requested to ensure timely arrival. For the deployment of Scalable Video Coding (SVC) [4] in DASH, the time frame might be shorter, depending on whether enhancement layers of the segment are downloaded using HTTP pipelining [5][6]. Typical DASH clients already decide to adapt to a lower representation when still three or more 2-second segments are buffered [6]. If the adaptation logic pursues a conservative buffer management (e.g., [7]), the adaptation decision is taken even further ahead. In any case, the receiver is aware of pending representation switches ahead of playout time and can thus react by smoothing the quality transition.

Representation switch smoothing can be realized by an additional component in the decoding chain. This component is notified by the client's adaptation logic whenever the adaptation decision is changed. The amplitude of the switch has to be signaled as well. For SVC with medium-grain scalability (MGS) layers, this can be represented as the difference in MGS layers. In a more general system, the bitrates or the video qualities (e.g., PSNR) of the higher and lower representation may be signaled. If the first frame of the lower representation can already be decoded, its quality could be used by the representation switch smoothing component as reference to adjust the amount of noise it adds to frames of the higher representation. Depending on the amplitude of the representation switch, the smoothing component chooses the duration of the transition; higher amplitudes require longer durations.

In case of down-switching, the component adds increasing noise to the frames of the higher representation as detailed in Section 3 until it matches the quality of the lower representation just before the switching. In case of up-switching, the component adds noise with temporally decreasing intensity to the frames of the higher representation, such that the transition between representations becomes seamless.

The remainder of this paper is structured as follows. Related work is discussed in Section 2. In Section 3, implementation options for the representation switch smoothing component are explained. In Section 4, we conduct a subjective evaluation on whether representation switch smoothing has a positive impact on the QoE. Section 5 identifies several factors that influence how quality switches are perceived and discusses the potential impact of these factors on representation switch smoothing. The paper is concluded in Section 6 together with an outlook on future work.
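The behavior of the smoothing component outlined in Section 1 can be illustrated with a minimal sketch. The names `SwitchNotice`, `transition_frames`, and `degradation_ramp`, the linear duration rule, and the default values are assumptions made for illustration only; the paper does not prescribe them. The ramp value 1.0 stands for "degraded to the quality of the lower representation", 0.0 for "no added noise".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchNotice:
    """Hypothetical message from the adaptation logic to the smoothing component."""
    amplitude: int   # size of the switch, e.g. difference in MGS layers (>= 1)
    direction: str   # "down" or "up"


def transition_frames(amplitude: int, frames_per_step: int = 24,
                      max_frames: int = 240) -> int:
    """Pick a transition length; higher amplitudes get longer transitions.

    The linear rule and both default values are assumptions of this sketch.
    """
    if amplitude < 1:
        raise ValueError("amplitude must be at least 1")
    return min(amplitude * frames_per_step, max_frames)


def degradation_ramp(notice: SwitchNotice, frames_per_step: int = 24) -> list[float]:
    """Per-frame degradation in [0, 1] for the frames of the higher representation."""
    n = transition_frames(notice.amplitude, frames_per_step)
    if notice.direction == "down":
        # Noise grows and reaches the lower quality just before the switch.
        return [k / n for k in range(1, n + 1)]
    if notice.direction == "up":
        # Noise starts just below the lower quality and fades out to none.
        return [(n - k) / n for k in range(1, n + 1)]
    raise ValueError("direction must be 'down' or 'up'")
```

How the degradation value is turned into actual noise depends on the implementation option chosen in Section 3; the sketch only covers the timing of the transition.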

2. Related Work

In adaptive HTTP streaming an important factor for QoE is flickering due to switches between representations. Ni et al. [2] have evaluated the impact of flickering on the video acceptance by the viewer on mobile devices. They have investigated the effects of changing video qualities (noise flicker), video resolutions (blur flicker), and frame rates (motion flicker) for SVC at various configurations with periodic flickering durations. Periodic flickering means that a switch from the higher to the lower representation, and vice versa, occurred periodically, e.g., every 2 seconds. Their results show that frequent noise flickering between two SNR representations with a period below 2 seconds impairs the viewing quality down to a point where viewers would prefer the lower video representation altogether. For blur flickering, viewers preferred the constant lower representation (at half the original resolution) for flickering periods up to 2 seconds.

Mok et al. [3] have proposed a QoE-aware DASH system based on AVC. As quality switches of high amplitude (e.g., from highest to lowest representation) are annoying to viewers, the proposed adaptation algorithm inserts intermediate steps to avoid abrupt quality changes. Thus, the reduced amplitude of quality switches seems to outweigh the additional number of quality switches in terms of QoE. This also confirms an earlier study on quality switches by Zink et al. [8] that has evaluated viewers' preferences of various quality switching patterns. General trends in those patterns are that high amplitudes in down-switches should be avoided and that switching up is preferred to switching down (i.e., it is better to start with a low quality and switch up than to start with a high quality and switch down).

In a recent study, Sieber et al. [7] have proposed an SVC adaptation logic that reduces the number of quality switches by striving for a stable buffer level before increasing the number of consumed SVC layers. Their evaluations show a very high and stable overall playback quality of the proposed algorithm compared to other state-of-the-art SVC-DASH adaptation techniques. However, the comparison does not take the amplitude of quality switches into account.

3. Implementation Options

There are three options for implementing the smooth reduction of quality: either before, within, or after the decoder. As smoothing for up-switching is performed analogously to smoothing for down-switching, we only consider down-switching in our discussion.

The first option, denoted pre-decoder implementation, is to add a filter component before the decoder. This component alters the encoded bitstream by removing certain picture fidelity data. For SVC with MGS enhancement layers, a straightforward implementation is to remove transform coefficients (i.e., set them to 0) from the enhancement layer. For the $i$-th frame in the smooth transition, $c_i$ transform coefficients are removed as calculated in Equation (1).

$c_i = \left\lfloor \frac{i \cdot C}{d} \right\rfloor$    (1)

Let $d$ be the duration of the smooth transition (in frames) and $C$ be the total difference of transform coefficients between the higher and the lower representation.

This approach is easy to implement and independent of the decoder. However, a drawback is that changes from one frame are propagated within the group of pictures (GOP) due to motion compensation drift [4], causing unwanted artifacts.

Figure 3: Snapshots of Sequence 1 at (a) 2,000 kbps and (b) 400 kbps.

Figure 4: Snapshots of Sequence 2 at (a) 2,000 kbps and (b) 250 kbps.

The second option is an implementation inside the decoder, referred to as in-decoder implementation. Again, some picture fidelity data is removed from the coded frames, but without affecting the motion compensation of other frames. For SVC, this implies that the inverse transform of the residual data has to be performed twice. The number of transform coefficients to be removed per frame is the same as in the first implementation option. A simplified block diagram of the decoding process is given in Figure 2.

Figure 2: Simplified block diagram of the SVC decoding process for (a) traditional decoding, adopted from [9] and (b) decoding with representation switch smoothing.

Figure 2 (a) shows the original SVC decoder structure adopted from [9] with handling of base layer and enhancement layer residual data. Figure 2 (b) highlights the additional steps necessary for maintaining the original decoded picture buffer when performing representation switch smoothing. In contrast to the first implementation option, the representation switch smoothing is performed after the inverse quantization. The two operations are commutative; setting a transform coefficient to 0 has the same result before and after inverse quantization.

Since motion compensation is still based on the original, unimpaired coded video data, we expect the reconstructed frame to slightly differ from the case where the respective transform coefficients had been set to 0 in the encoding process. The assessment of the resulting video quality is subject to future work. Nevertheless, an implementation within the decoder is more accurate and robust than the pre-decoder implementation option as it avoids error propagation. Of course, it requires a specialized decoder.

Note that the first two implementation options will have to consider that SVC allows for custom scaling matrices, which may even change between frames. The scaling matrix provides the values by which the transform coefficients of a macroblock are inversely quantized. Full support for custom scaling matrices might increase the computational complexity of the implementation.

The third implementation option is to add a video filter component after the decoder for inserting additional noise into the decoded frames. We denote this as post-decoder implementation. This noise mimics the degrading quality to enable a smooth transition to the lower representation. The computational complexity is still slightly higher than for the first two implementation options.

This third implementation option is independent of the decoder and the video coding format and also avoids drift.
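The per-frame removal schedule of Equation (1) can be sketched in a few lines. The sketch assumes that Equation (1) is a linear ramp of the form ⌊i·C/d⌋; the symbol names `i`, `d`, and `C` are chosen here for illustration, with `d` counted in frames. The helper `zero_last_coefficients` is hypothetical: it zeroes the trailing coefficients of a block in scan order, which is one possible selection rule and not one the paper specifies.

```python
def coefficients_to_remove(i: int, d: int, C: int) -> int:
    """Number of enhancement-layer coefficients to zero in frame i (1 <= i <= d).

    Assumes a linear ramp: floor(i * C / d), reaching C in the last frame.
    """
    if d <= 0:
        raise ValueError("transition duration d must be positive")
    if not 1 <= i <= d:
        raise ValueError("frame index i must lie in 1..d")
    return (i * C) // d


def removal_schedule(d: int, C: int) -> list[int]:
    """Removal counts for all d frames of the smooth transition."""
    return [coefficients_to_remove(i, d, C) for i in range(1, d + 1)]


def zero_last_coefficients(coeffs: list[int], n: int) -> list[int]:
    """Hypothetical selection rule: zero the last n coefficients in scan order."""
    n = max(0, min(n, len(coeffs)))
    keep = len(coeffs) - n
    return coeffs[:keep] + [0] * n
```

For a 5-frame transition with a total difference of 12 coefficients, `removal_schedule(5, 12)` yields `[2, 4, 7, 9, 12]`, so the last frame of the transition has the full difference removed.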
While the third implementation option is video coding format independent, it has to know the extent to which the quality changes with the representation switch, and, subsequently, how the new quality can be approximated by the synthetic distortion. Such a general model for video quality approximation remains an open research challenge.

4. Evaluation

We have performed an initial evaluation of the representation switch smoothing approach for down-switching scenarios through subjective tests. As up-switching might be perceived differently from down-switching, the combination of both up- and down-switching in a single test sequence could bias the results. For example, viewers might experience a sudden increase in video quality as a positive event. Thus, we only considered down-switching in our first evaluation in order to decide whether the approach is worth pursuing.

We used two test sequences, both extracted from the open-content short film Tears of Steel [10]. Both sequences have durations of 15 seconds at resolution 1280x720 and 24 fps frame rate. Sequence 1 has high-motion content and was extracted starting at time point 7:43; Sequence 2, with low-motion content, was extracted starting at 1:57. The sequences were selected so as to avoid confusing scene changes, although both contain cuts.

The 15-second sequences were split into 5-second segments. We simulated a quality switch from a high bitrate (2,000 kbps) to a low bitrate (400 kbps for Sequence 1 and 250 kbps for Sequence 2) after 10 seconds. As Sequence 1 has higher temporal information, it was harder to compress for the encoder, causing already strong visible artifacts at 400 kbps. Snapshots of the high and low bitrate encodings are shown in Figure 3 for Sequence 1 and in Figure 4 for Sequence 2.

Each sequence was encoded once with a quality switch (after 10 seconds), and once with a smooth downward transition (between seconds 5 and 10). For the purpose of this test, the sequences were encoded to AVC at constant target bitrates with the FFmpeg encoder.

We observed that the encoder allocates bitrate poorly for the first few frames, especially at low target bitrates. In per-segment encoding, this caused unwanted distortion at segment boundaries. We thus decided to always encode the entire sequences and to split them into segments after encoding. In the absence of a working implementation of any of the aforementioned options, the smooth transition was realized by encoding the sequences at predetermined target bitrates (one per frame in the transition segment) and stitching the respective frames to a continuous segment. Thus, 120 encodings were used to obtain the 5-second transition.

The bitrates for the smooth transition were determined as follows. The sequence was first encoded at 5 sample bitrates (from 2,000 kbps to the lowest bitrate). The PSNR for the transition segment was calculated to obtain the rate-distortion performance. As the RD performance typically follows a logarithmic curve, a logarithmic curve fitting ($f(r)$) was computed as shown in Equation (2) in order to approximate the video quality ($q$) for bitrate $r$ and model parameters $a$ and $b$.

$f(r) = a \cdot \ln(r) + b$    (2)

The inverse function ($f^{-1}(q)$) of this curve fitting is shown in Equation (3).

$f^{-1}(q) = \exp\left(\frac{q - b}{a}\right)$    (3)

Based on this inverse function, the 120 bitrates were calculated that predicted a linear decrease of PSNR over the entire transition duration. The per-frame PSNR results for both versions are shown in Figure 5 for the two test sequences.

Figure 5: Per-frame PSNR results for quality switching and representation switch smoothing for (a) Sequence 1 and (b) Sequence 2.

One drawback of the applied solution is that the encoder uses different blocks for motion (and intra-) prediction at each bitrate. With low bitrates, blocking artifacts become increasingly visible. Due to the different predictions, the positions of the blocking artifacts change randomly for the extracted frame of each
respective bitrate. When stitching the frames from these encodings, this causes some temporal noise. This noise is particularly visible in low-motion areas of the picture. In contrast, the low-bitrate segment at the end of a sequence has blocking artifacts that continuously move through the scene. So, even though the blocking artifacts are clearly visible, their movements correlate with the actual motions in the scene. Due to the temporal noise in the transition, the actual visual quality might be lower than what is reported by PSNR. As this effect was only recognized after time-consuming encoding of the transition segments, and due to the lack of a more accurate short-term solution, the subjective tests were performed with the described transition segments. This means that representation switch smoothing based on one of the implementation options discussed in Section 3 may even provide better results than our evaluation.

The subjective tests were performed with 18 participants (13 male, 5 female) of age 23 to 45 adopting pair-wise comparison. The participants were told that the test concerned changes in video quality. No further indication as to the nature of the quality changes was given. The participants were presented with the two versions of each sequence (labeled Version a and Version b). One version contained the quality switch, the other the smooth transition. The attribution to Version a and Version b was changed between the two sequences (i.e., representation switch smoothing was shown in Version b of Sequence 1 and in Version a of Sequence 2). The participants were instructed that they may start with either version and may watch each version as often as they wanted. The videos were shown in full-screen mode on a Dell 1907FPc LCD monitor having a native display resolution of 1280x1024. The videos were shown without audio. The participants were asked to rate whether they preferred Version a, Version b, or saw no difference.

The results of the subjective tests are provided in Table 1. We performed the Kruskal-Wallis test [11] for both sequences to test for significance of our results. The Kruskal-Wallis test is the non-parametric counterpart of the one-way analysis of variance. For Sequence 1, the p-value is ( ), which means that the null hypothesis (i.e., viewers voting equally often for each of the three samples, thus being generally indifferent towards the transition technique) cannot be rejected. For Sequence 2, the p-value is ( ), which means that the null hypothesis has to be rejected. Representation switch smoothing performed significantly better for Sequence 2 than for Sequence 1. Several participants reported that the high motion of Sequence 1 made the two versions look indistinguishable. Many participants viewed each version at least two or three times before making a decision. There were no significant differences in the test results between male and female participants, although male participants tended to prefer representation switch smoothing slightly more than female participants. While the overall results show only a slight preference towards representation switch smoothing, we argue that further tests should be conducted, investigating the effects of smooth transitions on various configurations. Note also that the aforementioned temporal noise in the smooth transitions may have affected the test results.

Table 1: Subjective test results for the evaluation of representation switch smoothing.

Preferred Version
Sequence     | Quality Switching | Representation Switch Smoothing | No Difference
Sequence 1   | 5                 | 7                               | 6
Sequence 2   | 3                 | 12                              | 3

5. Discussion

For future subjective tests, the following evaluations should be performed. Main influence factors to test are the amplitude of the quality switch (e.g., measured as the bitrate difference), the duration of the smooth transition, as well as the amount of spatial and temporal information. Based on our experiences and feedback from test participants, we assume representation switch smoothing to achieve the highest gain for scenes with high spatial and low temporal information. Furthermore, we speculate that longer transition durations (e.g., 10 seconds) will better mask the quality changes.

Other possible influence factors that we identified in our evaluations are the base quality (in contrast to just the bitrate differences), the presence of cuts, the resolution, and the duration for which only low quality segments are available (e.g., only 2 seconds of low quality might not justify two 10-second transitions).

Furthermore, a comparison to the intermediate switching qualities of the QoE-aware DASH system [3] should be considered in future evaluations. A hybrid solution between QoE-aware DASH and representation switch smoothing might lead to a simpler implementation. That is, instead of a continuous decrease of video quality, several small discrete steps could be used, all being below the just-noticeable difference (JND) [12]. However, this would require an on-the-fly estimation of JND in order to set the number and amplitude of switches accordingly.

It has to be investigated whether smooth transitions are also useful for up-switching scenarios. As evaluated by Zink et al. [8], viewers prefer to watch low-quality segments followed by high-quality segments rather than the other way around. Thus, we infer that up-switching is perceived to be less annoying than down-switching. Furthermore, Seshadrinathan and Bovik [13] have reported that viewers give poor quality ratings to sharp video quality drops but do not increase ratings as eagerly when the video quality returns to its previous high state. From those results, we reason that up-switching is noticed less than down-switching. These two effects may diminish the benefits of a smooth transition for up-switching.

For test content generation, the aforementioned temporal noise should be avoided by implementing one of the suggested implementation options from Section 3. Instead of allowing participants to watch versions as often as they like, the test material could contain around 3-5 quality switches and be shown only once to create the same conditions for all participants. Additionally, a 5-point Likert scale could be used to better distinguish preferences between the tested versions.

6. Conclusions

In this paper, we have introduced the concept of representation switch smoothing. The approach avoids abrupt quality switches
by smoothly reducing the video quality on a per-frame basis. This avoids unnecessary viewer distraction in adaptive HTTP streaming. We have discussed three implementation options for the smoothing component in an SVC-based DASH system.

While down-switching is generally considered annoying, abrupt up-switching might even increase the QoE as viewers might be happy to notice visual improvements in the video quality. It has to be evaluated whether representation switch smoothing is beneficial for up-switching at all.

Our initial evaluations indicate a tendency towards the benefit of representation switch smoothing compared to hard quality switches. So far, we have only evaluated down-switching scenarios with very few configurations. Based on these evaluations, we have identified parameters and test methods for future subjective tests on the impact of representation switch smoothing on the QoE. Future work shall derive a model from these evaluations for configuring the duration of a quality transition against the amplitude of the representation switch.

7. Acknowledgements

This work was supported in part by the EC in the context of the ALICANTE project (FP7-ICT-248652).

8. References

[1] I. Sodagar, “MPEG-DASH: The Standard for Multimedia Streaming Over Internet,” IEEE Multimedia, vol. 18, no. 4, pp. 62–67, October-December 2011.

[2] P. Ni, R. Eg, A. Eichhorn, C. Griwodz, and P. Halvorsen, “Flicker effects in adaptive video streaming to handheld devices,” in Proc. 19th ACM International Conference on Multimedia, New York, NY, USA, pp. 463–472, 2011.

[3] R. K. P. Mok, X. Luo, E. W. W. Chan, and R. K. C. Chang, “QDASH: a QoE-aware DASH system,” in Proc. 3rd Multimedia Systems Conference, New York, NY, 2012.

[4] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the Scalable Video Coding Extension of the H.264/AVC Standard,” IEEE Trans. on CSVT, vol. 17, no. 9, 2007.

[5] C. Müller, “libdash supports now persistent connections and pipelining,” blog entry, URL: http://dash.itec.aau.at/?p=553, March 1, 2012. Accessed July 7, 2013.

[6] C. Müller, D. Renzi, S. Lederer, S. Battista, and C. Timmerer, “Using Scalable Video Coding for Dynamic Adaptive Streaming over HTTP in Mobile Environments,” in Proc. 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, August 2012.

[7] C. Sieber, T. Hoßfeld, T. Zinner, P. Tran-Gia, and C. Timmerer, “Implementation and User-centric Comparison of a Novel Adaptation Logic for DASH with SVC,” in Proc. IFIP/IEEE International Workshop on Quality of Experience Centric Management (QCMan), Ghent, Belgium, May 2013.

[8] M. Zink, O. Künzel, J. Schmitt, and R. Steinmetz, “Subjective Impression of Variations in Layer Encoded Videos,” in Quality of Service - IWQoS 2003, K. Jeffay, I. Stoica, and K. Wehrle, Eds. Springer Berlin Heidelberg, pp. 137–154, 2003.

[9] A. Segall and Jie Zhao, “Bit stream rewriting for SVC-to-AVC conversion,” in Proc. 15th IEEE International Conference on Image Processing, San Diego, CA, pp. 2776–2779, 2008.

[10] Tears of Steel, “Tears of Steel | Mango Open Movie Project,” Home Page, URL: http://mango.blender.org/, accessed July 7, 2013.

[11] R. Lowry, “Concepts and Applications of Inferential Statistics,” R. Lowry, 1998-2013. Available online: http://vassarstats.net/textbook/, accessed June 21, 2013.

[12] Y. Jia, W. Lin, and A. A. Kassim, “Estimating just-noticeable distortion for video,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 7, pp. 820–829, 2006.

[13] K. Seshadrinathan and A. C. Bovik, “Temporal hysteresis model of time varying subjective video quality,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), Prague, Czech Republic, pp. 1153–1156, May 2011.
