Transcd Ov

Video Transcoding Architectures
and Techniques: An Overview
IEEE Signal Processing Magazine March 2003

Present by Chen-hsiu Huang
Outline
Background & Introduction

Bit-rate Reduction
Spatial Resolution Reduction
Temporal Resolution Reduction
Error Resilience Transcoding
Scalable Image Coding
Scalable Coding
Conclude Remarks
Bit-rate Reduction
Background & Error Resilience Transcoding
Introduction Scalable Coding
Conclude Remarks
As the number of networks, types of devices,

and content formats increase, interoperability
between different systems and networks is
becoming more important.
We need provide a seamless interaction
between content creation and consumption.
In general, a transcoder relays video signals
from a transmitter in one system to a receiver
in another.
In earlier works, the majority of interest
focused on reducing the bit rate to meet an
available channel capacity.
– Conversions between CBR (constant bit-rate)
streams and VBR (variable bit-rate.)
– Achieve spatial & temporal resolution reduction
due to mobile devices with limited computing
power & display.
Scalable
Coding
Bit-rate Mobile device Spatial/Temporal

Reduction Reduction
Packet Radio services over

wireless network
Error-resilience Transcoding
Some of the common transcoding
operations:
Format, Bit-rate,
spatial, temporal
Bit-rate
Cascaded Pixel-domain
Approach
It is always possible to decode the original
signal, perform the intermediate processing,
and fully re-encode the processed signal
subject to any new constraints.
– But it’s very costly to do so.
– The quest for efficiency is major driving force
behind most of the transcoding activity.
However, any gains in efficiency should have
a minimal impact on the quality of the
transcoded video.
Illustration of cascaded pixel-domain
approach:
Complexity issue
Re-quantized to satisfy the new

output bit rate
Bit-rate Reduction
Bit-Rate Reduction Error Resilience Transcoding

Scalable Coding
Conclude Remarks
The objective of bit-rate reduction is to reduce the

bit-rate while maintaining low complexity and
achieve the highest quality possible.
ÎApplication: broadcasting, Internet streaming
The most straightforward:
Îdecode the video bit stream and fully re-encode
the reconstructed signal.
The best performance can be achieved by calculating
new motion vectors and model decisions for every
MB at the new rate. [2]
By re-using information contained in the original bit
stream and consider simplified architectures,
significant complexity saving can be achieved while
still maintaining acceptable quality. [1]-[7]
Over the years, focus has been centered on two
specific aspects, complexity & drift effect.
Architectures for MPEG Compressed Bitstream Scaling

Complexity & Drift Reduction
Drift can be explained as the blurring or
smoothing of successively predicted frames.
It is caused by the loss of high frequency data,
which creates a mismatch between the actual
reference frame used for prediction in the
encoder and the degraded reference frame.
To demonstrate the tradeoff between
complexity and quality, we will consider two
types of systems, a closed-loop system and
an open-looped system.
Transcoding Architecture
Simplified transcoding for bit-rate

reduction:
– (a) open-looped, partial decoding to DCT
coefficients, then re-quantize
– (b) closed-looped, drift compensation for
re-quantized data.
A frame memory is not
Will lead
required to drift
and there is no
needeffect
for IDCT
Open-looped System
In the open-loop system, the bit stream is variable-
length decoded (VLD) to extract the VLC words
corresponding to the quantized DCT coefficients, as
well as MB data corresponding to the motion vectors
and other MB-level information.
The quantized coefficients are inverse quantized and
then simply re-quantized to satisfy the new output bit
rate.
Finally the re-quantized coefficients are stored MB-
level information are variable length coded.
Another alternative way is to directly cut high
frequency data from each MB [2], which is less
complex.
– To cut the high frequency data without VLD, a bit profile for
the AC coefficients is maintained.
As MBs are processed, code words corresponding to
high frequency coefficients are eliminated to meet the
target bit-rate.
Alone similar lines, techniques to determine the
optimal breakpoint of non-zero DCT coefficients (in
zig-zag order) were presented in [3].
– Carried out for each MB, so that distortion is minimized and
rate constraints are satisfied.
Above two alternatives may also be used in
the closed-loop system, but their impact on
the overall complexity is less.
Regardless of the techniques used to achieve
the reduced rate, open-looped system are
relatively simply since a frame memory is not
required and there is no need for IDCT.
In terms of quality, better coding efficiency
can be obtained by the re-quantization
approach. However, open-loop architectures
are subject to drift effect.
Drift Effect
The reason for drift is due to loss of high-frequency
information.
Beginning with the I-frame, information is discarded
by the transcoder to meet the new target bit rate.
Incoming residual blocks are also subject to loss.
When a decoder receives the transcoded bit stream,
it will decode the I-frame with reduced quality.
When it’s time to decode the next P,B frames the
degraded I-frame is used as a predictive component
and added to a degraded residual component.
Consider the purpose of residual is to accurately
represent the difference between the original signal
and the motion-compensated prediction.
And now both the residual and predictive
components are different then what was originally
derived by the encoder, errors will be introduced in
the re-constructed frame.
As time goes on, the mismatch progressive increases,
resulting in the reconstructed frames becoming
severely degraded.
degraded prediction degraded residual

Closed-loop System
The closed-loop system aims to eliminate the
mismatch between predictive and residual
components by approximating the cascading
decoder-encoder architecture [4].
In the simplified structure, only one reconstruction
loop is required with one DCT and one IDCT.
Some architecture inaccuracy is introduced due to
the non-linear nature in which the re-construction
loops are combined.
However it has been found the approximation has
little effect on the quality [4].
Post-processing of MPEG-2 coded video
for transmission at lower bit-rate
Traditional Cascading Pixel-domain Approach
With one
Only the exception
of this slight
reconstruction
inaccuracy,
loop this
is required
Combined
architecture is
re-construction loop
with one DCT
mathematically
and one IDCT
equivalent to a
cascaded decoder-
encoder
architecture
In [8], additional causes of drift, e.g. due to floating
point inaccuracies, have been further studied.
Overall though, in comparison to the open-loop
architecture discussed earlier, drift is eliminated since
the mismatch between predictive and residual
components is compensated for.
compensated prediction compensated residual

Simulation Results
Closed-loop system is very close

to the cascaded architecture
Open-loop system suffers

from severe drift
Motion Compensation in the
DCT Domain
The closed-loop architecture described in the
previous section provides an effective transcoding
structure in which the MB re-construction is
performed in the DCT domain.
However, since the memory stores spatial domain
pixels, the additional DCT/IDCT is still needed.
This can be avoided though by utilizing the
compressed-domain methods for MC proposed by
S.F. Chang and Messerschmidt [9].
– In this way, it is possible to reconstruct reference frames
without decoding to the spatial domain.
Manipulation and compositing of MC-DCT

compressed video
Decoding completely in the compressed domain
could yield equivalent quality to spatial domain
decoding [10]. However, this was achieved with
floating-point matrix multiplication and proved to be
quite costly.
In [12] this computation was simplified by
approximating the floating-point elements by power of
two fractions so that shift operations could be used,
and in [13], simplifications have been achieved
through matrix decomposition techniques.
In [14], further simplifications of the DCT-based MC
process were achieved by exploiting the fact that the
stored DCT coefficients in the transcoder are mainly
concentrated in low-frequency areas.
– Therefore, only a few low-frequency coefficients are
significant and an accurate approximation to the MC process
that uses all coefficients can be made.
Bit-rate Reduction
Spatial Resolution Temporal Resolution Reduction
Reduction Scalable Image Coding
Scalable Coding
Conclude Remarks
With the emergence of mobile multimedia-

capable devices and the desire for users to
access video originally captured in a high
resolution, there is a strong need for efficient
way to reduce the spatial resolution of video
for delivery.
Some researchers have focused on the efficient
reuse of MB-level data in the context of the cascaded
pixel-domain architecture, while others have explored
the possibility of new alternative.
In [6],[7], the problem associated with mapping
motion vectors and MB-level data was addressed.
(including performance of motion vector refinement)
In [18], the authors propose to use DCT-domain
down-scaling and MC for transcoding.
In [19], a comprehensive study on the transcoding to
lower spatial/temporal resolutions and to different
encoding formats has been provided based on the
reuse of motion parameters.
CIF-to-QCIF video bitstream down-

conversion in the DCT domain
Motion Vector Mapping
When down-sampling four MBs to one MB, the
associated motion vectors have to be mapped, where
the number of motion vectors that may be associated
with an MB depends on the standard used and the
coding tools available in a given profile or extension.
Several methods to perform the particular mapping
have been described in [6],[7],[17],[19],[24].
To map from four motion vector to form the new MB,
a weighted average or median filters can be applied.
– This is referred to as a 4:1 mapping.
However, with certain compression standards, such
as MPEG-4 Visual and H.263, there is support in the
syntax for advanced prediction modes that allow one
motion vector per 8x8 luminance block.
– This is referred to as 1:1 mapping.
While 1:1 mapping provides a more accurate
representation of the motion, it is sometimes
inefficient to use since more bits must be used to
code four motion vectors.
An optimal mapping would adaptively select the best
mapping based on a rate-distortion criterion. (a good
evaluation: [6],[7],[19])
Field-to-frame transcoding with spatial and
temproal downsampling
Because MPEG-2 supports interlaced video, we also

need to consider field-based MV mapping.
In [27], the top-field motion vector was simply used.
An alternative scheme that averages the top and
bottom field motion vectors under certain conditions
was proposed in [21].
It is our opinion that the appropriate motion vector
mapping is depended on the down-conversion
scheme used.
We feel this is particular important for interlaced data,
where the target output may be a progressive frame.
Drift compensation for reduced spatial

resolution transcoding
DCT-domain Down Conversion
The most intuitive way to perform down conversion in
the DCT domain is to only retain the low-frequency
coefficients of each block and recompose the new
MB using the compositing techniques in [9].
– For conversion by a factor of 2, only the 4x4 DCT
coefficients of each 8x8 block in a MB are retained
– These low frequency coefficients from each block are then
used to form the output MB
A set of DCT-domain filters can been derived by
cascading these two operations.
More sophisticated filters that attempt to retain more
of the high frequency information in [28],[29]
These down-conversion filters can be applied
in both the horizontal and vertical directions
and to both frame-DCT and field-DCT blocks.
Variations of this filtering approach to convert
field-DCT blocks to frame-DCT blocks, and
vice-versa, have also been derived in [10].
Conversion of MB Type
In transcoding video bit streams to a lower spatial
resolution, a group of four MBs in the original
corresponds to one MB in the target video.
To ensure that the down-sampling process will not
generate an output MB in which its sub-blocks have
different coding modes, the mapping of MB modes to
the lower resolution must be considered ([6],[21]).
ZeroOut Intra-Inter Inter-Intra

ZeroOut
ZeroOut, the MB modes of the mixed MBs

are all modified to intermode. The MVs for the
intra-MBs are reset to zero and so are
corresponding DCT coefficients.
In this way, the input MBs that have been
converted are replicated with data from
corresponding blocks in the reference frame.
Intra-Inter
Intra-Inter, maps all MBs to intermode, but the motion
vectors for the intra-MBs are predicted.
The prediction can be based on the data in
neighboring blocks, which can include both texture
and motion data.
As an alternative, we can simply set the motion
vector to be zero, depending on which produces less
residual.
In an encoder, the mean absolute difference of the
residual blocks is typically used for mode decision.
Based on the predicted motion vector, a new residual
for the modified MB must be calculated.
Inter-Intra
Inter-Intra, the MB modes are all modified to

intramode.
In this case, there is no motion information
associated with the reduced resolution MB,
therefore all associated motion vector data is
reset to zero and the intra-DCT coefficients
are generated to replace the inter-DCT
coefficients.
To implement the Intra-Inter and Inter-Intra methods,
we need a decoding loop to reconstruct the full
resolution picture.
The reconstructed data is used as a reference to
convert the DCT coefficients from intra-to-inter or
inter-to-intra.
For a sequence of detail, the low complexity strategy
of zero-out can be used.
The performance of Inter-Intra is a little better than
intra-inter, because Inter-Intra can stop drift
propagation by transforming inter-blocks to intra-
blocks.
Intra-Refresh Architecture
In reduced resolution transcoding, drift error is
caused by many factor, such as requantization,
motion vector truncation, and down-sampling.
– By converting some percentage of intercoded blocks to
intracoded blocks, drift propagation can be controlled.
In the past, the concept of intra-refresh has
successfully been applied to error-resilience coding
scheme [30], and it has been found that the same
principle is also very useful for reducing the drift in a
transcoder [21].
Intra-refresh architecture for reduced spatial resolution trandcoding
Output blocks that originate from the frame store

are independent of other data, hence coded as
intrablocks; there is no picture drift associate with
these blocks.
The decision to code an intrablock from the frame
store depends on the MB coding modes and picture
statistics.
– Based on the coding mode, an output MB is converted if the
possibility of a mixed block is detected.
– Based on the picture statistics, the motion vector and
residual data are used to detect blocks that are likely
contribute to a larger drift error.
Picture quality can be maintained by employing an
intracoded block in its place.
Of course, the increase in the number of intrablocks
must be compensated for by the rate control by
adjust the quantization parameters so that the target
rate. [21]
Motion Vector Refinement
In all of the trandcoding methods described here,
significant complexity is reduced by assuming that
the motion vectors computed at the original bit rate
are simply reused in the reduced-rate bit stream.
It has been shown that reusing the motion vectors in
this way leads to nonoptimal transcoding results due
to the mismatch between prediction and residual
components [6],[7],[15].
To overcome this loss of quality without performing a
full motion re-estimation, motion vector refinement
schemes have been proposed.
Typically, the search window used for motion
vector refinement is relative small compared
to the original search window, e.g. [-2,+2].
Such schemes can be easily used with most
bit-rate reduction architectures for improved
quality, as well as the spatial and temporal
resolution reduction architectures.
Additional techniques and simulation results
for motion vector refinement can be found in
[6],[7],[15],[36],and [37].
Increasing the search window farther allows
more blocks to find their best match; however,
the overall gain is less.
Bit-rate Reduction
Temporal Resolution Temporal Resolution Reduction
Reduction Scalable Image Coding

Scalable Coding
Conclude Remarks
Reducing the temporal resolution of a video

bit stream is a technique that may be used to
reduce the bit-requirements imposed by a
network, to maintain higher quality of coded
frames, or to satisfy processing (battery)
limitations.
For temporal resolution reduction, it is necessary to
estimate the motion vectors from the current frame to
the previous non-skipped frame that will serve as a
reference frame in the receiver.
Solutions to this problem have been proposed in

[15],[32],and [33].
The re-estimation of motion is all that needs
to be done since new residuals will be
calculated.
However, if a DCT-domain transcoding
architecture is used, a method of re-
estimating the residuals in the DCT domain is
needed [34].
In [35], the issue of motion vector and
residual mapping has been addressed in the
context of a combined spatial-temporal
reduction in DCT domain based on the intra-
refresh architecture.
Motion Vector Re-estimation
We can solve it by tracing the motion vectors
back to the desired reference frame.
Since the predicted blocks in the current
frame are generally overlapping with multiple
blocks, bilinear interpolation of the motion
vectors may be used, where the weighting of
each input motion vector is proportional to the
amount of overlap with the predicted block.
A dominant vector selection scheme as
proposed in [15] and [35] may also be used.
Residual Re-estimation
With pixel-domain architectures, the residual between
the current frame and the new reference frame can
be easily computed.
For DCT-domain transcoding, the calculation should
be done directly using DCT-domain MC techniques
[9].
A novel architecture to compute this new residual in
the DCT-domain has been presented in [34] and [35].
In this work, the authors utilize direct addition of DCT
coefficients for MBs without MC, as well as an error-
compensating feedback loop for motion-
compensated MBs.
Bit-rate Reduction
Error-Resilience Temporal Resolution Reduction
Transcoding Scalable Image Coding

Scalable Coding
Conclude Remarks
Transmitting video over wireless channels

requires taking into account the conditions in
which the video will be transmitted because
wireless channels have lower bandwidth and
higher error rate.
Error-resilience transcoding for video over wireless
channels has been studied in [38] and [39].
In [38],
– First, they use a transcoder that injects spatial and temporal
resilience into an encoder bit stream where the amount of
resilience is tailored to the content of the video and the
prevailing error conditions, as characterized by bit error rate.
– Second, they derive analytical models that characterize how
corruption propagates in a video that is compressed using
motion-compensated encoding and subject to bit errors.
– Third, they use rate distortion theory to compute the optimal
allocation of bit rate between spatial resilience, temporal
resilience, and source rate.
The transcoder then injects this optimal resilience
into the bit stream.
In [39], the author propose error-resilience
video transcoding for internetwork
communication via GPRS.
Experiments showed superior transcoding
performances over the error-prone GPRS
channels to the non-resilience video.
[39] Error-resilience
[38] Error-resilience video transcoding
transcoding for for robust inter-network
video over communications
wireless channels using GPRS
Bit-rate Reduction
Scalable Image Coding Error Resilience Transcoding
Scalable Coding
Conclude Remarks
Although the primary focus of this article is on

video, it is worthwhile to mention the current
state-of-art in scalable image coding, namely the
JPEG-2000 standard.
The JPEG-2000 coder also employs bit-plane coding
techniques to achieve scalability, where both SNR
and spatial scalability are supported.
In contrast to existing scalable video coding schemes
that typically reply on a non-scalable base layer and
which are based on the DCT, this coder does not
reply on separate base and enhancement layers and
is based on the discrete wavelet transform (DWT).
The coding scheme employed by JPEG-2000 is often
referred to as an embedded coding scheme since the
bits that correspond to the various qualities and
spatial resolution can be organized into the bit stream
syntax in a manner that allows the progressive
reconstruction of images and arbitrary truncation at
any point in the stream.
Bit-rate Reduction
Scalable Coding Error Resilience Transcoding
Scalable Coding
Conclude Remarks
Encode the video once and then by simply

truncating certain layers or bits from the
original stream, lower qualities, spatial
resolutions, and/or temporal resolution could
be obtained.
MPEG-2 Video
The signal is encoded into a base layer and a few
enhancement layers, in which the enhancement
layers add spatial, temporal, and/or SNR quality to
the reconstructed base layer.
Specifically, the enhancement layer in SNR
scalability adds refinement data for the DCT
coefficients of the base layer.
For spatial scalability, the first enhancement layer
uses predictions from the base layer without the use
of motion vectors.
The enhancement layer in temporal scalability uses
predictions from the base layer using motion vectors.
Fine Granular Scalability (FGS)
FGS allows for a much finer scaling of bits in the
enhancement layer [42] (adopted by MPEG-4 visual
standard).
This is accomplished through a bit-plane coding
method of DCT coefficients in the enhancement layer,
which allows bit-plane to be truncated at any point.
In this way, the is rather proportional to the number of
enhancement bits received.
Another variation of the FGS scheme, known as
FGS-temporal, combines the FGS techniques with
temporal scalability [45].
Comparison
To make a comparison between scalable coding and
transcoding is rather complex since they address the
same problem from different points of view.
Scalable coding specifies the data format at the
encoding stage independently of the transmission
requirements, while transcoding converts the existing
data format to meet the current transmission
requirement.
Although scalable coding can provide low-cost
flexibility to meet the target, traditional schemes
sacrifice the coding efficiency compared to single-
layered coding.
Considering a cascaded transcoding architecture that
fully decodes and re-encodes the video according to
the new requirement, its coding performance will
always be better than traditional scalable coding; this
has been shown in at least one study [6].
Certainly, more studies is needed that accounts for
the latest scalable coding schemes. Steps in this
direction are now being made within the MPEG
community [48],[49].
Recent advances in video coding are showing that
the possibility for an efficient universally scalable
coding scheme is within reach; e.g. [50],[51].
In addition to the issue of coding efficiency, which is
likely to be solved soon, scalable coding will need to
define the application space that it could occupy.
For instance, content providers for high-quality
mainstream applications have already adopted
single-layered MPEG-2 video coding as the default,
hence a large number of MPEG-2 video content
already exists.
To access these existing MPEG-2 video contents
from various devices with varying terminal and
network capabilities, transcoding is needed.
For this reason, research on video transcoding of
single-layer streams has flourished and is not likely to
go away anytime soon.
The Future
However, in the short term, scalable coding may
satisfy a wide range of video applications outside this
space, and in the long term, we should not dismiss
the fact that a scalable coding format could replace
existing coding formats.
For now, scalable coding and transcoding should not
be viewed as opposing or competing technologies.
Instead, they are technologies that meet different
needs in a given application space and it’s likely that
they can happily coexist.
Bit-rate Reduction
Conclude Remarks Error Resilience Transcoding
Scalable Coding
Conclude Remarks
Other Transcoding Scheme
There are additional video transcoding
schemes including object-based transcoding
[48], transcoding between scalable and
single-layered video [53]-[55], and various
types of formats conversions [56]-[60].
Joint transcoding of multiple streams has
been presented in [61] and system filtering
has been described in [62].
Transcoding techniques that facilite trick-play
modes, e.g., fast and reverse playback, have
been discussed in [63] and [64].
Problems in the Future
Finding an optimal transcoding strategy
– Given several transcoding operations that would
satisfy given constraints, a means for deciding the
best one in a dynamic way [65].
Further study is needed toward a complete
algorithm that can measure and compare
quality across spatial-temporal scales.
– Possibly taking into account subjective factors,
and account for a wide range of potential
constraints (e.g., terminal, network, and user
characteristics).
Another topic is the transcoding of encrypted
bit streams.
– Transcoding of encrypted bit streams include
breaches in security by decrypting and re-
encrypting within the network.
These problems have been circumvented in
[69] with a secure scalable streaming format
that combines scalable coding techniques
with a progressive encryption technique.
However, handling this for non-scalable video
and streams encrypted with trditional
encryption techniques is still an open issue.

Transcd Ov

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Transcd Ov

Uploaded by

Copyright:

Available Formats

Video Transcoding Architectures

and Techniques: An Overview

IEEE Signal Processing Magazine March 2003

 Background & Introduction

As the number of networks, types of devices,

Bit-rate Mobile device Spatial/Temporal

Packet Radio services over

Re-quantized to satisfy the new

Bit-Rate Reduction Error Resilience Transcoding

The objective of bit-rate reduction is to reduce the

Architectures for MPEG Compressed Bitstream Scaling

 Simplified transcoding for bit-rate

degraded prediction degraded residual

compensated prediction compensated residual

Closed-loop system is very close

Open-loop system suffers

Manipulation and compositing of MC-DCT

With the emergence of mobile multimedia-

CIF-to-QCIF video bitstream down-

 Because MPEG-2 supports interlaced video, we also

Drift compensation for reduced spatial

ZeroOut Intra-Inter Inter-Intra

 ZeroOut, the MB modes of the mixed MBs

 Inter-Intra, the MB modes are all modified to

Output blocks that originate from the frame store

Reduction Scalable Image Coding

Reducing the temporal resolution of a video

 Solutions to this problem have been proposed in

Transcoding Scalable Image Coding

Transmitting video over wireless channels

Although the primary focus of this article is on

Encode the video once and then by simply

You might also like

Background & Introduction

Simplified transcoding for bit-rate

Because MPEG-2 supports interlaced video, we also

ZeroOut, the MB modes of the mixed MBs

Inter-Intra, the MB modes are all modified to

Solutions to this problem have been proposed in