A Statistical Adaptive Block-Matching

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO.
4, APRIL 2003 417
A Statistical Adaptive Block-Matching

Motion Estimation
Fulvio Moschetti, Member, IEEE, Murat Kunt, Fellow, IEEE, and Eric Debes, Member, IEEE
Abstract—In this paper, we address the problem of motion It is well known that the main computational bottleneck in a
estimation (ME) in digital video sequences and propose a new block-based video encoder is the motion estimation (ME). Most
fast, adaptive, and efficient block-matching algorithm. Higher often, ME is based on block-matching techniques. The use of an
quality and efficiency are achieved using a statistical model for the
motion vectors. This model introduces adaptation in the search exhaustive search algorithm is computationally so demanding
window, drastically reducing the number of positions where that every software implementation of a digital video encoder
correlation-type computation is performed. The efficiency is fur- uses a suboptimal search. Otherwise, it becomes impossible to
ther improved by progressively undersampling the macroblock. work in real time for CCIR 601 formats, with a decent quality,
Patterns for undersampling are proposed to obtain the maximum without specific added hardware.
benefit from single instruction multiple data (SIMD) instructions.
In contrast with existing motion-estimation techniques, search For this reason, several techniques have been proposed to re-
strategy and subsampled patterns are closely linked. This shows duce the computational burden of full search. However, as the
that a good search strategy is much more important than blindly hypotheses of previously introduced techniques based on a re-
reducing the number of pixels considered for the matching duced number of search positions turn out to be untrue in most
pattern. We describe an implementation of the proposed matching
of the real scenes, the resulting coding quality is often question-
strategy that exploits the very long instruction word (VLIW) and
SIMD technology available in the new Itanium Processor Family. able.
Results show that the proposed algorithm adapts easily to the To overcome the deficiencies of previous approaches, the
evolution of the scene avoiding annoying quality drops that can be proposed method relies on the statistical behavior of the
observed with other deterministic algorithms. The total number motion vectors data, thus improving adaptation. Furthermore,
of operations required by the proposed method is inferior to those
required by traditional approaches.
combining the search strategy with the subsampling of the
macroblock (MB) leads to a dramatic increase of efficiency.
Index Terms—Adaptive motion estimation, block matching, dig- Finally, by selecting the adequate subsampling patterns, a huge
ital video, MPEG, SIMD, VLIW.
benefit is gained from single instructions multiple data (SIMD)
instructions in the new generation of the Itanium Processor
I. INTRODUCTION Family (IPF).
This paper is organized as follows. In Section II, a short
R ESEARCH on digital video communication has received
continuously increasing interest over the last decade.
Some of this work converged into standards set on the market
overview of the block matching is given, along with general
considerations about its complexity; major algorithms that
diminish the number of operations are briefly described. In
(MPEG-1 [1] and MPEG-2 [2]; H.261 [3] and H.263 [4]),
Section III, an adaptive algorithm based on the statistical prop-
including the new multimedia standard (MPEG-4) [5] that is
erties of the motion field is introduced. In Section IV, we study
mainly based on the idea of second generation video coding
the impact of the adoption of particular subsampling patterns
[6], and the newly arrived H.264 [7]. However, the availability
for a MB; these patterns are particularly suited for processors
of higher bandwidth in communication channels, web-tv, video
that exploit the single instruction multiple data capabilities.
e-mail, video mobile peer to peer communications, and wireless
In the fifth session, an implementation for Itanium, the first
video servers broadcasting has continuously set new challenges
general purpose processor of IPF that exploits the VLIW and
and opened new horizons for digital video. Realtime MPEG
SIMD technology, is proposed. Results and Conclusions are
playback is nowadays a common practice on every personal
the last two sections of the paper.
computer. The situation worsens, though, when we consider
the video editing and encoding.
II. OVERVIEW OF BLOCK-MATCHING MOTION ESTIMATION
Manuscript received February 17, 2003; revised March 4, 2003. This paper A. Block Matching
was recommended by Associate Editor H. Sun.
F. Moschetti is with NTT DoCoMo, Inc. Multimedia Laboratories, Kanagawa In the representation used for block matching, an image is
239-8536, Japan (e-mail: fulvio@spg.yrp.nttdocomo.co.jp).
M. Kunt is with LTS-EPFL, Signal Processing Laboratory, Swiss Fed- divided into blocks of a general rectangular shape. In practical
eral Institute of Technology, CH 1015 Lausanne, Switzerland (e-mail: applications, these are squares of dimensions . For each
Murat.Kunt@epfl.ch). block in the current frame, the block of pixels in a reference
E. Debes is with Microprocessor Research, Intel Labs, Santa Clara, CA 95052
USA (e-mail: Eric.Debes@intel.com). frame that is the most similar—according to a particular crite-
Digital Object Identifier 10.1109/TCSVT.2003.811363 rion—is searched. The position difference between the current
1051-8215/03$17.00 © 2003 IEEE
418 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 4, APRIL 2003
Fig. 1. Percentage of CPU resources dedicated to different modules for an MPEG encoder. 90% of resources are taken by ME.
block and the block that turns out to be the most similar in the Using the MAD, for every position investigated inside the
reference frame represents the motion vector , given by search window, the number of operations needed is:
Subtractions Absolute values Additions
So, it clearly appears that the algorithm that is looking for every
(1) possible position in the search window, called a full-search al-
gorithm (FSA), represents a very intensive computational task
where represents the luminance component of the video signal at the encoding process.
and represents the time interval between the current and the Fig. 1 shows the computational effort required in an MPEG
reference frame. In the case of MPEG encoding, could also encoder by the different modules at the encoding process. The
take on a negative value because of the backward prediction of module that is clearly overwhelming the others is the ME with
the frames (MPEG-2) or VOP (MPEG-4). The values more than 90% of the total CPU time spent just for this task. The
and in (1) represent the horizontal and vertical displacement measurement has been taken running the MPEG Software Sim-
of the block in the reference frame and must be valid values ulation Group MPEG-2 [8] encoder on a PentiumIII processor,
within a certain area in which the best match is sought. This area using a profiling tool called Vtune [9]. MPEG 1, 2, and 4 en-
is called the search window. The argument of the sum in (1) is coders all show the same behavior.
a norm. This can be an or an norm called, respectively,
the mean squared error or mean absolute difference (MAD). In C. Existing Methods
terms of the accuracy of the estimation, they give similar results, Block matching can be seen as a three level nested loop
whilst from a computational point of view, the norm is more (Fig. 2). The outermost loop represents the search that covers
efficient, as it does not require an extra multiplication. In this the entire frame ( ). The next inner loop considers all the
work, since we are interested in fast solutions, only the MAD points in the search window ( ). Finally, the innermost
will be considered. loop represents the content of a block. The computational load
of block matching can be decreased by reducing the number of
B. Computational Complexity of the Block-Matching operations in any of these loops.
Algorithm Some algorithms decrease the complexity at the most external
If a block is pixels, the cost of searching the minima in loop exploiting the spatial and temporal redundancy of the mo-
a search window on a sequential machine is . tion information among the MBs, such as genetic algorithms
Since the match has to be computed for all the block patterns [10], [11] and hierarchical motion-estimation algorithms [12],
of the frame, the asymptotical complexity of the whole process [13]. The hypothesis behind the genetic algorithms is that the
is , where is the number of blocks in the frame. motion field does not change so much from one image to the
For example, in the case of CIF images (352 288 pixels), with next. In the case of changes, the population of motion vectors
and (corresponding to a horizontal, vertical, adapts to the new situation. The hierarchical ME relies on the
maximum motion vector of value 16), 2.8 giga-matches of hypothesis that ME computed onto subsampled images gives
pixels per second are needed, if this task has to be performed quite a fair motion field for the entire image. When this is done,
in real time (25 frames per second). a refinement is still necessary to improve the precision of motion
MOSCHETTI et al.: A STATISTICAL ADAPTIVE BLOCK-MATCHING ME 419
Fig. 2. Representation of block matching as a three level nested loop.
vectors. Usually, hierarchical search uses coarse-to-fine transition grows in a monotonic way with the distance from the posi-
tion to the next resolution level. However, as shown in [12], it tion of the absolute minimum in all the domain represented by
is also possible to implement the inverse transition in order to the search window. However, this is hardly true in the general
recover from local minima and to widen the search range. case, as was shown in [22], and it represents a possible source of
The second internal loop seeks for the best match inside error for the correct identification of the minimum of the MAD.
the search window. At this level, there are also several con- The second hypothesis is that “the motion field is gentle, smooth
tributions that have studied how to reduce the complexity of and varies slowly” [16]. Most of the improvements introduced
the block-matching operation without degrading the quality as to the tss strategy, like [16]–[19] are based on this assumption.
much [14]–[19]. The two-dimensional (2-D) logarithmic search This is quite true for video-conferencing-like sequences, but the
[14] is one of the first contributions in this domain. Three-step scenario changes significantly in the general case.
search (tss) [15] and its improvements, as the new three-step For example, let us consider the general structure of a group
(ntss) [16] and the four-step search (4ss) [17] also represent of pictures (GOP) or a group of visual objects (GOV) in an
important contributions in terms of reduction of computational MPEG encoding, that is IBBPBBPBBPBB. The standard does
complexity. In the tss, the center position, as well as eight not require encoding with this GOP format, but this is the sug-
coarsely spaced positions around the center, are tested first. gested and most commonly used one. The time distance between
Then, eight positions are again checked around the previous the two reference frames, in the case of the prediction, is ,
point of minimum, but the spacing is finer. This process is where is the time interval between two successive frames. In
continued for another step to give the final vector. Some other these cases, the magnitudes of the motion vectors are not any
contributions at this level are worth mentioning, such as the longer heavily packed around zero, except for very static se-
cross-search algorithm [18] and the unrestricted center-biased quences. It is also worth adding that the goodness of the predic-
diamond search (ucbds) [19]. tion of the frames is very important, since they represent then
Work has also been done at the most internal loop in order to the references for the following frames to be predicted.
the reduce the number of operations used to determine the sum In Fig. 3, the motion vectors distributions are shown for the
of the absolute difference at a MB level. In [20], Liu and Zac- sequences “Basket” and “Flower Garden” in CCIR 601 format,
carin propose quite an efficient method based on pixel subsam- and for “Stefan” and “Akiyo” in CIF. Measurements are shown
pling at the block level. The subsampling patterns are alternated for each sequence in an interval of 20 pixels and have been
over the different positions examined so that all the pixels of taken using a FSA block matching. Frames are considered with
the block contribute to the determination of the motion vector. a temporal distance of , and , with ms. The
In [21], the authors have better results, proposing a set of repre- first two time intervals correspond, respectively, to a rate of 25
sentative pixel patterns, even though they propose this approach frames per second and to the temporal interval between two
for hardware implementation. -frames or two video object planes (VOPs) in the case of tra-
ditional MPEG GOP or GOV. The last time interval considered
( ) is an example of scenarios coming from the extensive ex-
III. STATISTICAL ADAPTIVE BLOCK MATCHING-ALGORITHM ploitation of the feature of the temporal redundancy by strongly
reducing the frame rate to get higher compression ratios. We no-
A. Probability Distributions of Motion Vectors
tice that, apart from the very static “Akiyo” sequence, the distri-
References [14]–[19] can be considered as the most impor- bution of motion vectors along the axis is much more widely
tant algorithms that exploit the concept of reducing the points spread, while for the axis it is concentrated around zero, re-
checked in the search window. These algorithms are based on gardless the temporal distance between the current frame and its
two main hypotheses. First, it is assumed that the matching func- reference.
Fig. 3. Motion vector distributions for the sequences: (a) Basket, (b) Flower Garden, (c) Stefan, and (d) Akiyo. For each sequence, measurements are shown,
respectively, from left to right with a temporal distance between the current and the reference frame of T (1), 2T (2), and 5T (3).
In view of these results, a new algorithm is proposed, accu- Any motion vector of the MBs of this object can be expressed
rately searching along the more likely directions. as
B. Modeling the Motion Vector (3)
In MPEG-4 encoding, an object is delimited by its bounding where can be viewed as a noise vector representing the inac-
box. In case segmented objects are not available at the encoder, curacies of the block matching. When the whole frame is con-
object-oriented coding becomes a traditional coding and the sidered as an object, then obviously becomes the global motion
whole scene is viewed as an object. So in the general case, we vector.
can start considering a rigid object moving through the scene.
Let be the number of MBs covering this object and be the C. Noise Distribution
motion vector of the MB . The mean motion vector of the object
is given by Let be the probability density function (pdf) of the
vectors . The statistical characteristics, the shape, and the re-
gion of support of this function will indicate where to search and
(2) in which priority in order to reach the minimum of the MAD as
fast as possible. Experimental analysis shows that can
Fig. 4. Theoretical and measured cumulative density functions of the noise vectors distributions along the x and y directions for different sequences. The frames
for predictions are taken at various temporal distance 3T , with T = 40 ms.
be modeled as a generalized Gaussian distribution (ggd) as fol- the values 1 or 2, respectively. The ggd can also tend toward a
lows: uniform distribution when the parameter tends to infinity.
Since the ggd is a parametric distribution, it is important
to understand the behavior of the parameter that optimizes
the model. Experimental results (see Fig. 4) have shown that
the distribution model varies with values of comprised in
(4) an interval [0.6,2], according to the complexity of the scene.
This means that in most of the cases, the distribution of motion
vectors is comprised in a family of curves between a Laplacian
where is the mean vector of the noise vector distribution, is and a Gaussian. Actually, in some cases, the distribution can be
the standard deviation, and is the complete gamma func- even more concentrated around zero than a Laplacian, as this
tion. The main property of the ggd is that it spans from the Lapla- generally happens for static sequences. In Fig. 4, we can see
cian to the Gaussian distribution with the parameter taking on the cumulative density functions of the empirical and estimated
distributions for the sport sequences of Stefan (CIF) TABLE I

and Basket (CCIR 601), and for the more static sequence PERCENTAGE OF SIGNIFICANCE OF THE MATCH OF THE THEORETICAL AND
MEASURED DISTRIBUTION ALONG THE HORIZONTAL AND VERTICAL
of Coastguard (CIF). In particular, for each sequence, the DIRECTIONS. THESE VALUES ARE THE AVERAGE OF THE TWO SIGNIFICANCE
distributions along the two directions are shown. TESTS: AND KOLMOGOROV–SMIRNOV
The and the Kolmogorov–Smirnov tests have been carried
out in order to verify and measure the similarity between the
theoretical and measured distributions. Extensive description of
these two tests can be found in [23].
The test allows to measure up to which level of signif-
icance the “null hypothesis” that two data sets are drawn from
the same population distribution can be admitted or disregarded.
For the test, let us consider the partition
of the event space and the probabilities of these events denoted
by , with .
Let be the probabilities of these events given by the model.
We then have
Hypothesis H0 for checked in the frame . We use a traditional full search for ini-
Hypothesis H1 for tialization at the first predicted frame of a sequence, even though
for the initialization of the motion field, other fast algorithms,
Let be the number of successes of the event in a set of such as tss for example, might have also been employed.
trials. Exactly defining the points to be checked according to the
In our case, the partition is the set of possible vectors , noise pdf may be cumbersome in practical implementations.
determined with one pixel precision. The dimension of this As an approximation, one can define a dense search area
partition , also called the number of degrees of freedom, is and a coarse search region outside. The former is defined as
the number of admissible noise vectors. Given various parame- , around the position indicated by .
ters of our system, we can set to be comfortable. The Let and be the pdf of the noise vector distri-
number of noise vectors measured for each event is . The butions along the horizontal and vertical directions. According
measure of the test is then formulated as follows: to the target probability we want to find the minimum in
the SA area, the two values , are determined solving the
(5) following integral:
We can see from (5) that the difference is zero if

and it increases as increases. This test
gave us a first measure of the similarity between the theoretical (7)
and measured distributions. Once the value of is determined according to (4), the analyt-
For the Kolmogorov–Smirnov test, we define as the max- ical expression of the symmetrical distributions and
imum value of the absolute difference between two cumula- is available. The value of the integral gives the proba-
tive distribution functions, for the data and for the bility to get the minimum in the SA zone.
model. We then have Let us assume a Gaussian distribution as an extreme case of
the model. Our aim is to check the area where the probability to
(6) find the minimum is higher, in a more detailed way. A reason-
able value for the probability could be in this case around two
What makes the Kolmogorov–Smirnov test useful is that the thirds, (let us use 68% for this example), then we have ,
probability of the null hypothesis (data sets drawn from the same .
distribution) can be calculated, at least to a useful approximate Within the SA area, the points to be checked are set along a
level. This gives its significance to any observed nonzero value two pixels resolution grid. Outside the SA area, the other points
of . The percentage of significance of the match between the to be checked are set at a distance of , along the hori-
theoretical and measured distributions are shown in Table I. zontal and vertical directions, up to , , where
is fixed by the size of the search window [see Fig. 5(a)]. We
D. The Sigma Search Algorithm (SSA) use and since we are estimating the frame and
The algorithm exploits the previous results relative to the the statistical values for the current frame are not yet known.
noise vectors distribution. The minimum of the MAD in the If one of the inner grid points of the SA area indicates a min-
search window is found by distributing the points to be checked imum for the MAD, the motion vector is found by a one-step
according to the variance of the distribution. Let us consider the search around this point (eight pixel positions checked). If there
mean motion vector of an object in frame , as given by is no minimum within the SA area, then the direct neighbors
. The position pointed at by of the first minimum are checked at two-pixel’s resolution [see
determines the starting point around which further positions are Fig. 5(b)]. This type of search is iterated until the minimum is
(a)
(b)
6
Fig. 5. (a) Pattern checked at the first step for a search window of 14 pixels
6
horizontal and 7 vertical, for a case of = 4, = 2. (b) Direct neighbors
of the temporary minimum found at a 2 pixel resolution.
found at the center. Then, as before, a one-step search around

this point gives the motion vector.
By adapting the search strategy to the data, it is clear at this
stage that this algorithm has less chances to be trapped in a local
minima compared to other non-full-search methods [14]–[19].
The flowchart of the algorithm is shown in Fig. 6. Fig. 6. Flowchart of the SSA algorithm.
E. Complexity of Practical Implementation IV. SUBSAMPLING STRUCTURE
Concerning complexity, it is worth noting that updating the Another essential issue in speeding up the ME concerns
correct value of is a complex operation. In a practical case, the computation of the matching metric used. When a block
though, using the values obtained from experimental results, we matching must be performed between two frames, the matching
fixed the value of to 2. Identifying an exact value of has an criterion is usually evaluated using every pixel of the matching
important meaning for a theoretical understanding of the motion block. Since block matching is based on the idea that all pixels
field behavior, since an accurate and adaptive value can better in a block move by the same amount, a good estimation of the
approximate the distribution in the generalized Gaussian motion could be obtained, in principle, by employing only a
functions family. From a practical point of view, though, its im- fraction of all these pixels. It is evident that the fewer the pixels
pact on the algorithm implementation is mainly reflected in the considered in computing the match, the higher the inaccuracies
size of the search area SA, an area that heavily depends also on in determining the real motion vectors. Liu and Zaccarin [20]
. Given that in practical cases varies between 0.6 and 2, we proposed a technique based on this idea, which considered
have fixed to its maximum range value; as a consequence, it just one fourth of the total number of pixels forming the MB.
does not affect the SA area unnecessarily shrinking it. There- They use four subsampling patterns that are alternated with the
fore, the sole role of shaping the dimension of the SA area is searched locations. It is a very effective algorithm, but all the
attributed to , that is updated at every predicted frame. Up- positions in the search window, even if subsampled, have to
dating requires a number of operations that are proportional be considered, so it cannot be applied to every search strategy.
to the number of motion vectors; therefore, if is the number Chan et al. [21] instead proposed an adaptive pixel decimation
of MVs, we will need sub add mul plus two divisions in order to vary the number of selected pixels to be used for
to update and each frame, which is absolutely negligible in the computation of the matching criterion. They consider 4
the total accounting of operations of the ME. 4 block-patterns and select the subsampling patterns according
Fig. 7. Different subsampling patterns considered.
(a) (b)
Fig. 8. PSNR loss versus relative execution timegain. Results obtained applying the subsampling patterns P4, P3, P2, P1 to the FSA for the sequences (a) Akiyo
and (b) Stefan.
to the value of the gradient of the luminance. Even if it exploits Let be the probability of error of the prediction of the
the subsampling within the MB, a huge amount of tests are MAD, given the PMAD. Using this probability, we can find a
required to check which is the best pattern to be dynamically threshold parameter , such that is less than some threshold
chosen. The final amount of operations and branches issued for probability of error
an implementation onto general-purpose processors would be
too high. (8)
MAD PMAD
A. MAD and Partial MAD (PMAD)
In contrast, once the threshold is found, that can be set ac-
Let PMAD be the norm computed on a subset of pixels of cording to its impact in the overall distortion, it is possible to mea-
the MB. In order to better understand the impact of the usage of sure the probability of error in estimating the MAD using PMAD.
the PMAD instead of the MAD, it is particularly useful to rely Readers may refer to [25] for a complete description of the
on the model introduced by Lengwehasatit and Ortega in [24]. MAD versus PMAD model. We are interested here in adopting
The authors state and show that: (8) to measure the probability of correct MAD prediction using
a) the partial MAD is a good estimate of the final MAD; fixed PMAD patterns.
b) there exists a reliable model for the MAD-PMAD error,
and this model is about the same for any values of PMAD, B. SIMD-Oriented Subsampling Patterns and Their Impact
i.e., MAD PMAD on ME
MAD PMAD .
From the experimental data, it is possible to assume a The basic idea in choosing the different subsampling patterns
Laplacian model for the random variable corresponding used in our algorithm is the following. Since the block matching
to the difference between MAD and PMAD, namely is an operation well suited to a parallel implementation, and
MAD PMAD) , where is the Laplacian SIMD instruction sets are now present in almost every gen-
parameter found for the particular PMAD subsampling set eral-purpose processor, the subsampling pattern should exploit
adopted. This model is verified also for the four subsampling as much as possible the SIMD concept. In Fig. 7, the different
patterns P1–P4 shown in Fig. 7. patterns considered are shown. The choice of the patterns has
The introduction of a model for the MAD PMAD has in- been done considering a 16 16 pixels block that represents
deed a double advantage. First, it allows the development of fast a MB in a block-based encoder. The different patterns consid-
adaptive techniques. Second, for a fixed threshold of the error ered allow to understand the progressive impact of subsampling
in the prediction of the MAD once given the PMAD, it allows the macroblock in the estimation process. Pixels are grouped to-
measuring of the probability of error in fast matching for a fixed gether into a group of eight in order to fully exploit the SIMD
pattern. possibilities in 64-bit SIMD registers. In Fig. 8, the loss in terms
It should be stressed here that the general behavior of the ME

algorithm is determined by the search strategy, as shown in
Fig. 11(a). Using a subset of the MB pixels of course introduces
a degradation in the quality performance; nevertheless, this
degradation is limited. The priority of the search strategy
over the matching set in the overall performance of the ME
algorithm is clear. The final goal in fast motion-estimation
algorithms is obviously to achieve a good coding performance
while still reducing the number of operations. We have shown
here that the search strategy is the main responsible for the
overall quality, therefore reduction of operations should be
transferred to the adoption of adequate MB subsampling
patterns, instead of reducing strategically important search
positions that are the main responsible for the quality of the
Fig. 9. Scatter plot for MAD versus PMAD obtained for the different patterns
respectively: (a) P4, (b) P3, (c) P2, (d) P1. The test sequence is Akiyo.
ME operation. Moreover, we demonstrated that adoption of
subsampling patterns linked to parallel architectures has a
limited impact on quality. Establishing a precise relationship
between the approximation of the MAD through the PMAD
estimation and the impact on the PSNR is almost impossible,
since that is sequence and bit-rate dependent. The experiments
conducted suggest however that an error up to 1 in determining
the value of MAD does not have a noticeable impact on the
image quality. These values have been measured from different
sequences at different bit rates. So, a threshold equal to 2
in (8) could be a reasonable value for the correctness of the
estimation obtained through the subsampling. The probabilities
for getting a correct prediction using the subsampling patterns
in Fig. 7 are summarized in Table IV. Calculations have been
carried out using (8). We can see that the subsampling pattern
P2 gives good probabilities even when applied to very difficult
sport sequences such as Stefan.
Fig. 10. Scatter plot for MAD versus PMAD obtained for the different
patterns respectively: (a) P4, (b) P3, (c) P2, (d) P1. The test sequence is Stefan.
V. MOTION ESTIMATION ON ITANIUM
In this section, we want to show the usefulness, of the sub-
of decibels (PSNR) and the gain in terms of execution time rela-
sampling pattern P2, from an implementation point of view. In
tive to the FSA are plotted, using the different subsampling pat-
order to do that, we consider the Itanium architecture.
terns P1–P4 just introduced. In Fig. 8(a), the plot is relative to
Itanium is the result of a joint design by Intel and Hewlett
the sequence Akiyo; so, as would be expected, the loss is very
Packard. It includes a new set of features that its designers call
limited. In Fig. 8(b), the plot is relative to the sequence Stefan, a
EPIC (explicit parallel instruction computing). EPIC is similar
sport sequence. We can notice that considering half of the points
in concept to VLIW in the sense that both allow the compiler to
the PSNR result is the same as considering all of them. The real
explicitly group instructions for parallel execution. This tech-
turning point for the quality would be between one quarter and
nique eliminates much of the out of order logic that occupy a
one eighth of the pixels.
large portion of the die in current RISC and x86 processors.
Figs. 9 and 10 show the scatter plot of the MAD versus
Beside the introduction of the predication and speculation
PMAD for the subsampling patterns just introduced. We can
technique, IPF benefits from fundamental hardware enhance-
notice that MAD is still very close to PMAD for P2. Using a
ments. Particularly interesting for our needs are the 128 64-bit
more complex sequence such as Stefan instead of Akiyo does
general-purpose registers (r0–r127). These are used to hold
not change the behavior of the two variables.
values for integer and multimedia operations.
It is interesting now to further elaborate about the relationship
between the threshold, the probability of correct estimation, the
value of the MAD and the final influence on PSNR. A. Implementation of the Two-Step Search of the SSA on
Fig. 11(a) shows the plot of MAD in the sequence Stefan, Itanium Using the P2 Pattern
for the SSA algorithm, using different subsampling patterns, In this paragraph, we describe the implementation of the first
compared to tss and 4ss. In Fig. 11(b), the PSNR behavior step of the two-step search algorithm realized for Itanium. The
for the same frames and the same algorithms is depicted. We two-step search is used as an alternative to the search ending in
can see that, from the PSNR point of view, subsampling the the search area SA in the SSA algorithm. The one-step search
MB has a limited impact: this is an emblematic case. More approach adopted in the search area SA can be considered as a
extensive results for the pattern P2 are presented in Table II. subcase of the approach here described.
(a) (b)
Fig. 11. (a) MAD in the sequence Stefan for the SSA algorithm with the different subsampling patterns applied and for tss and 4ss. (b) PSNR in the sequence
Stefan for the SSA algorithm with the different subsampling patterns applied and for tss and 4ss.
TABLE II
PSNR AVERAGED OVER 100 FRAMES FOR THE DIFFERENCE SEQUENCES CONSIDERED. SSAP2 REFERS TO THE SSA ALGORITHM WITH THE PATTERN P2 OF FIG. 6
The algorithm is scalar and would definitely benefit from the

presence of further processing units than what is available now
for Itanium. This would probably be the case for the future scal-
able implementations of IPF as McKinley and Madison [26].
The algorithm implemented here adopts the MB subsampling
pattern already presented in the previous section whose quality
impact on the prediction has already been extensively analyzed.
Since just eight (64 bits) registers are necessary to store the in-
formation content of a MB, extensive use of registers and multi-
media instructions can dramatically reduce the amount of cache
and memory accesses.
When the difference between two MBs is computed, the cur-
rent MB (cur) is stored in eight registers (e.g., r40–r47 for the
Itanium implementation).
For what concerns the reference area, we can indicate four
main zones that have to be stored in the registers. These are rep-
resented as gray (downward diagonal pattern), orange (wide up- Fig. 12. Registers needed for the first step of the two-step search. Bolded black
ward diagonal pattern), blue (vertical pattern), and green (hori- lines show how the reference MB is shifted in the search area with a two pixel
distance.
zontal pattern) in Fig. 12. The gray and orange zones need ten
registers, since they also consider the vertical movement, lim-
ited in this case to two pixels displacement. The blue zone, for so there is no need to reload the registers. The blue zone then
the vertical displacement, partly overlaps with the orange zone, needs four register, as well as the green zone. So for the first step
Fig. 13. “shrp” instruction used to obtain intermediate positions.
(a)
(b)
Fig. 14. “psad” instruction for Itanium. Fig. 15. (a) PSNR and (b) MPQM for the sequence Stefan encoded at 1.1
Mbits/s.
of the two-step search, 28 registers are needed to store the area

where the matchings are going to be computed. Of course, since
we have to test positions shifted, in the reference area, of a value
of two pixels, consequently we need a certain amount of regis-
ters to store these temporal positions. Considering the number
of integer units that allow to compute the operations in parallel,
it will be sufficient to store in the registers just two subsampled
MBs for comparison with the reference block. They could be
stored respectively in the registers r80–r87 and r90–r97.
These temporal positions are created starting from the already
loaded reference. In this case, the instruction shrp shift right pair
is of great help (see Fig. 13) This shows how, once loaded the
area needed for the comparisons, the information corresponding (a)
to the different MBs is obtained, in the registers, from the pre-
viously loaded registers.
The great advantage of this approach allows solving two fun-
damental problems:
1) frequent cache accesses (saving in this way the cycles
needed to extract the information);
2) frequent memory misalignments.
The algorithm shows that the availability of further registers
would allow continuous improvements in the rapidity of execu-
tion. The other fundamental instruction is the “psad1,” the IPF
version of the Penitum III streaming instruction “psadbw” [27]. (b)
In the Itanium version the result is written into another register Fig. 16. (a) PSNR and (b) MPQM for the sequence Flower Garden encoded
and not in the second register as in the PIII (see Fig. 14). at 5 Mbits/s.
Itanium instructions are grouped in bundles of three. Up to two
bundles can be executed in parallel according to particular rules. bundles. An important limit is also represented by the data depen-
The limited number of integer and memory accesses anyway does dencies that set a limitation to the way in which data can be loaded
not allow an easy distribution of the instructions in the different into the registers and consequently processed.
TABLE III
MPQM SUBJECTIVE DISTORTION AVERAGED OVER 100 FRAMES FOR THE DIFFERENCE SEQUENCES CONSIDERED. SSAP2 REFERS TO THE SSA ALGORITHM
WITH THE PATTERN P2 OF FIG. 6
At the beginning, the implementation starts loading the regis- TABLE IV

ters. Once the loading ends, only integer instructions are left. Be- PROBABILITY OF CORRECT ESTIMATION OF THE MAD CONSIDERING
Th = 1 IN (8)
fore finishing the loading, it is possible to fill all the integer and
memory slots, according to the rules for the bundles execution.
This scalable algorithm has the benefit of a limited data and
control dependency so branches are completely avoided; data
dependencies would be further diminished by augmenting the
number of registers, allowing to process more blocks in par-
allel. It is worth noticing that without the introduced one-fourth
SIMD friendly MB subsampling, all this becomes impossible.
This confirms the adaptability of this algorithm and its subsam-
pling method to architectures integrating SIMD technology. So,
in order to execute all the operations needed for the first step
of the two-step search algorithm (nine matches), only 36 load-
ings are needed. With the previous approach, 64 loadings were
necessaries just to compare one MB to the other. Previously,
misalignments and register limitations determined the necessary
reloading of the MB at every position. In this case, the gain—not
considering the number of integer instructions that can be exe- assigned for the frame case. Half-pel precision vectors are
cuted in parallel within the bundle—is a factor of 16. found in a refined operation once determined the integer preci-
sion vector, as common practice in standard codecs. Original se-
quences are MPEG-2 encoded at a constant bit rate. Obviously,
VI. RESULTS
the quality of the encoded bitstream is directly affected by the
Quality performance results of the algorithm are presented in precision of the ME algorithm used.
terms of peak signal-to-noise ratio (PSNR) and moving picture In Fig. 15(a) and (b), results of the PSNR and MPQM are
quality metric (MPQM) [28], [29]. MPQM is a distortion mea- shown for the sequence Stefan (CIF), encoded at 1.1 Mbit/s
sure based on the analysis of the human visual system. It has using the different algorithms. For the Flower Garden sequence
been chosen for two main reasons: first, it is meant to be a mea- (CCIR 601) encoded at 5 Mbit/s the corresponding results are
sure that reflects what the user effectively sees, and second, it reported in Fig. 16(a) and (b). Table II summarizes the PSNR
can represent a cross-checking of the same PSNR simulations results for the different sequences. As we can see the SSA algo-
results. High values of MPQM represent poor quality. rithm (as well as SSAP2) gives always better results when com-
The MPEG Software Simulation Group (SSG) version of the pared to the ntss, tss, 4ss or ucbds algorithms. Table III shows the
MPEG-2 encoder [8] has been used for simulations. The 12 MPQM results. For quite static sequences, like “Coastguard” or
frames GOP IBBPBBPBBPBB has been considered. The am- “Akiyo,” no matter which algorithm is used, the differences are
plitude of the search window for the frames has been fixed very subtle. In contrast, in a sequence with a considerable mo-
to 32 pixels horizontal and 16 pixels vertical for the CCIR tion like “Stefan,” “Flower Garden,” or “Basket,” the difference
601 format, and to 16 pixels horizontal and 7 pixels vertical among the proposed and the other best performing algorithms is
for the CIF format. These values are commonly employed and noticeable. The presence of a sudden acceleration of the motion
have been used for better comparison. For the frames, in case in the sequence generally represents the main cause of an imme-
of forward or backward prediction, the amplitude of the search diate drop in quality, as we can see in Figs. 15 and 16. The same
window has been adequately scaled, according to the values figures show also that the SSA algorithm immediately adapts to
TABLE V
AVERAGE NUMBER OF DIFFERENCES AND EXTRACTIONS OF ABSOLUTE VALUES, NEEDED TO FIND THE MINIMUM OF THE MAD FOR A
MACROBLOCK, WITHIN THE RELEVANT SEARCHING WINDOW. THIS REPRESENTS ALSO THE NUMBER OF TIMES THE ARGUMENT OF THE
DOUBLE SUMMATION IN (1) IS EXECUTED FOR EACH MB TO BE MATCHED
Fig. 17. Performance of various ME algorithms when a scene change occurs at frame 40 (from Akiyo to Stefan).
the new situation avoiding very annoying quality drops. Com- coding quality, one of the main problems that might actually
pared to full search, these may reach 3 or 4 dB, as is the case for arise is related to scene changes, where temporal coherency of
the other deterministic algorithms. Table II gives the PSNR of the motion field is lost. Anyway, when inserting this algorithm
the SSA algorithm for the subpatterns presented in Section IV. in a standard video codec, this issue can be easily solved. In
The SSAP2 considers just one fourth of the number of pels of the the MPEG-2 standard, whenever a scene change arises, relevant
SSA and these are purposely sparsed in the MB to profit from MBs are coded as Intra. This is an unequivocal signal of scene
the SIMD registers and instructions. Table IV shows the com- change in a predicted frame. Therefore, when at least one fourth
plexity of the different algorithms for the different sequences. of the MBs are coded as Intra in a predicted frame, we reini-
This complexity is expressed in terms of number of differences tialize the motion field properties using a tss search strategy.
and extractions of absolute value necessary to find the minimum The case of a scene change occurring exactly at the temporal
of the MAD, for each MB, within the relevant search window, position of an Intra frame still remains to be explained. Even in
namely it describes how many times the argument of the double this case, the first predicted frame would not be so badly pre-
summation in (1) is executed for each MB to be matched. The dicted: in fact, the positions checked out of the SA area make
sequences “Barca,” “Basket,” and “Flower Garden ” are CCIR the algorithm behave as a repetitive 2 two-step search looking
601format, while “Akiyo,” “Stefan,” and “Coastguard” are CIF for convergence. At the next frame, though, the variance of the
format. From the computational point of view, the SSA algo- motion field would immediately increase. This would enlarge
rithm with the P2 pattern generally requires half of the oper- the search area and, consequently, the search algorithm would
ations needed by the other algorithms [15]–[17]. Concerning rapidly adapt to the new scene, as the variance of motion vectors
population at the previous frame would have been higher. At the [7] Joint model number 1, ITU-T SG16 Q.6 and ISO/IEC SC29/WG11,
third predicted frame, the situation is completely stabilized. A 2001.
[8] MPEG Software Simulation Group.. [Online]. Available:
similar situation appears in sequences with sudden change of http://www.mpeg.org/MPEG/MSSG/
directions as in Stefan. [9] Intel.. [Online]. Available: http://developer.intel.com/vtune/analyzer
In Fig. 17, we show the behavior of various search algorithms [10] K. H.-K. Chow and M. L. Liou, “Genetic motion search algorithm for
video compression,” IEEE Trans. Circuits Syst. Video Technol., vol. 3,
when a scene change occurs: at frame 40 of the sequence Akiyo, pp. 440–445, Dec. 1993.
we pass to frame 40 of the sequence Stefan, still at the same bit [11] M. Mattavelli, Motion Analysis and Estimation: From Ill-Posed Discrete
rate of 1.1 Mbits/s. Inverse Linear Problems to MPEG-2 Coding. Lausanne, Switzerland:
EPFL-LTS, 1997.
[12] F. Dufaux and M. Kunt, “Multigrid block matching motion estimation
VII. CONCLUSIONS with an adaptive local mesh refinement,” Proc. SPIE Visual Communi-
cation and Image Processing, Nov. 1992.
In this paper, a new algorithm to realize the block-matching [13] J. Chalidabhongse and C.-C. J. Kuo, “Fast motion vector estimation
ME has been presented. Compared to exhaustive and loga- using multiresolution-spatio-temporal correlations,” IEEE Trans. Cir-
cuits Syst. Video Technol., vol. 7, pp. 477–488, June 1997.
rithmic search, it requires fewer operations. This is obtained [14] J. R. Jain and A. K. Jain, “Displacement measurement and its application
in two different steps. The first step consists of reducing the in interframe image coding,” IEEE Trans. Commun., vol. COM-29, pp.
number of checking points in the search window by considering 1799–1808, 1981.
[15] T. Koga et al., “Motion compensated intraframe coding for video con-
a pattern that dynamically exploits the statistical properties ferencing,” in Proc. NTC 81, New Orleans, LA, 1981.
of the motion field. The second step consists of subsampling [16] R. Li, B. Zeng, and M. L. Liou, “A new three step search algorithm for
the MB for the computation of the MAD. The subsampling of block motion estimation,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 4, pp. 438–442, Aug. 1994.
the MB is realized in such a way that the advantage of using [17] L. M. Po and W. C. Ma, “A novel four-step search algorithm for fast
SIMD technologies currently available onto general purpose block motion estimation,” IEEE Trans. Circuits Syst. Video Technol.,
processors is considerable. Simulations have been run using vol. 6, pp. 313–317, June 1996.
[18] M. Ghanbari, “The cross-search algorithm for motion estimation,” IEEE
PSNR and MPQM. Overall, they show a better quality for Trans. Commun., vol. 38, pp. 950–953, July 1990.
the proposed algorithm when compared to the other classical [19] J. Y. Tham et al., “A novel unrestricted center-biased diamond search
searches, and also when the subsampling pattern is applied. algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video
Technol., vol. 8, pp. 369–377, Aug. 1998.
According to the local statistical properties of the sequence, [20] B. Liu and A. Zaccarin, “New fast algorithms for the estimation of block
the temporal possibility of having a higher number of points to motion vectors,” IEEE Trans. Circuits Syst. Video Technol., vol. 3, pp.
check in the search window is counterbalanced by the reduced 148–157, Apr. 1993.
[21] Y. L. Chan, W. L. Hui, and W. C. Siu, “A block motion vector estimation
number of pixels needed to compute the difference between using pattern based pixel decimation,” in Proc. IEEE Int. Symp. Circuits
two MBs. This brings an overall computational improvement and Systems, June 1997, pp. 1153–1157.
through statistical adaptation. Results also allow to validate [22] Y. L. Chan and W. C. Siu, “Relaiable block motion estimation through
the confidence measure of error surface,” Signal Processing, vol. 76, pp.
the hypothesis that all the pixels of the MB are moving with 135–146, 1999.
the same motion vector, as subsampling within the MB does [23] A. Papoulis, Probability, Random Variables and Stochastic Processes,
not affect the quality in a significant way. Accordingly, to M. G. Hill, Ed. New York: McGraw Hill, 1991.
[24] K. Lengwehasatit and A. Ortega, “Complexity-distortion tradeoffs in
obtain a better performance for rapidly varying sequences, it vector matching based on probabilistic partial distance techniques,” in
is preferable to check more points in the search window than Proc. Data Compression Conf. DCC’99, Salt Lake City, UT, 1999, pp.
to use all the pixels within the MB. All in all, the proposed 394–403.
[25] K. Lengwehasatit, “Complexity-distortion tradeoffs in image and video
algorithm provides better quality with higher increase in speed. compression,” in Electrical Engineering. Los Angeles: Univ. of
The performance of this algorithm may definitely improve Southern California, May 2000.
when semantically segmented objects are at one’s disposal, [26] L. Gwennap, Intel’s Itanium and IA-64: Technology and Market Fore-
cast: Micro Design Resources, 2000.
like in MPEG-4 object-oriented coding, thanks to the increased [27] J. H. Wolff, “Programming methods for the Pentium III processor’s
temporal coherency in the motion field. streaming simd extensions using the vtune performance enhancement
environment,” Intel Technol. J., 2nd quarter 1999.
[28] S. Winkler, “Vision models and quality metrics for image processing ap-
ACKNOWLEDGMENT plications,” in Signal Processing Laboratory. Lausanne, Switzerland:
EPFL, 2000.
Special thanks go to M. Sorci for constructive suggestions [29] L. C. J. Van den Branden, “Color moving pictures quality metrics,” in
and valuable insights. Proc. ICIP ’96, Lausanne, Switzerland, 1996.
REFERENCES
[1] Coding of moving pictures and associated audio information—Part 2:
Video, ISO/IEC 11 172-2, Information Technology, 1994. Fulvio Moschetti (S’99–M’01) received the M.S.
[2] Generic coding of moving pictures and associated audio informa- degree (cum laude) in electrical engineering from
tion—Part 2: Video, ISO/IEC Committee Draft 13 818-2, Information the University of Pavia, Pavia, Italy, in 1997, and
Technology, 1994. the Ph.D. degree from the Swiss Federal Institute of
[3] Video codec for audiovisual services at px64Kbit/s, ITU Recommenda- Technology, Lausanne, Switzerland, in 2001.
tion H.261. He was previously with Siemens Microelec-
[4] Video coding for low bitrate communications, ITU-T, Recommendation tronics Austria and the Swiss Federal Institute of
H.263, 1996. Technology. He is currently with NTT DoCoMo
[5] Motion pictures expert group-overview of the MPEG-4 standard, Multimedia Laboratories, Yokosuka, Japan. His
ISO/IEC JTC1/SC29/WG11 N2459. research interests include digital image and video
[6] M. Kunt, A. Ikonomopoulos, and M. Kocher, “Second generation image processing and image and video coding for mobile
coding techniques,” Proc. IEEE, vol. 73, no. 4, pp. 549–575, Apr. 1985. applications.
Murat Kunt (SM’70–M’74–SM’80–F’86) was born Eric Debes (M’03) received the M.S. degree in
in Ankara, Turkey, in 1945. He received the M.S. de- electrical and computer engineering from Supélec,
gree in physics and the Ph.D. degree in electrical en- France, in 1996, the M.S. degree in electrical engi-
gineering, both from the Swiss Federal Institute of neering from the Technical University Darmstadt,
Technology (EPFL), Lausanne, Switzerland, in 1969 Darmstadt, Germany, in 1997, and the Ph.D. degree
and 1974, respectively. in signal processing from the Swiss Federal Institute
From 1974 to 1976, he was a Visiting Scientist at of Technology, Lausanne, Switzerland.
the Research Laboratory of Electronics, Massachu- Since 2001, he has been a Researcher in the Media
setts Institute of Technology, Cambridge, where he Systems Group of the Microprocessor Research
developed compression techniques for X-ray images Labs, Intel Corporation, Santa Clara, CA. His
and electronic image files. In 1976, he returned to research interests include image and video coding
EPFL, where he is currently Professor of Electrical Engineering and Director and processing algorithms, as well as computer architecture and parallelism. In
of the Signal Processing Institute. He conducts teaching and research in dig- particular, he is working on new hardware features in the CPU and chipset to
ital signal and image processing, with applications to modeling, coding, pat- provide better media application performance and end-user quality of service
tern recognition, scene analysis, industrial developments, and biomedical en- with a given system and processor power envelope and/or energy budget.
gineering. His Laboratory participates in a large number of European projects
under various programmes such as Esprit, Eureka, Race, HCM, Commett, and
Cost. He is the author or co-author of more than 200 research papers and 15
books, and holds seven patents. He consults for governmental offices, including
the French General Assembly.
Dr. Kunt is the Editor-in-Chief of the Signal Processing Journal and
a founding member of the European Association for Signal Processing
(EURASIP). He serves as a Chairman and/or member of the scientific
committees of several international conferences and on the Editorial Boards
of the PROCEEDINGS OF THE IEEE, Pattern Recognition Letters, and Traitement
du Signal. He was the Co-Chairman of the first European Signal Processing
Conference in 1980 and the General Chairman of the International Image
Processing Conference (ICIP’96) in 1996, both held in Lausanne, Switzerland.
He was the President of the Swiss Association for Pattern Recognition from its
creation until 1997. He received the Gold Medal of EURASIP for meritorious
services, the IEEE ASSP Technical Achievement Award, the IEEE Third
Millennium Medal, an Honorary Doctorate from the Catholic University of
Louvain, the Technical Achievement Award of EURASIP, and the Imaging
Scientist of the Year Award of the IS&T and SPIE in 1983, 1997, 2000, 2001,
2003, and 2003, respectively.

A Statistical Adaptive Block-Matching

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Statistical Adaptive Block-Matching

Uploaded by

Copyright:

Available Formats

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO.

4, APRIL 2003 417

A Statistical Adaptive Block-Matching

Subtractions Absolute values Additions

Fig. 2. Representation of block matching as a three level nested loop.

B. Modeling the Motion Vector (3)

distributions for the sport sequences of Stefan (CIF) TABLE I

We can see from (5) that the difference is zero if

found at the center. Then, as before, a one-step search around

E. Complexity of Practical Implementation IV. SUBSAMPLING STRUCTURE

Fig. 7. Different subsampling patterns considered.

It should be stressed here that the general behavior of the ME

The algorithm is scalar and would definitely benefit from the

Fig. 13. “shrp” instruction used to obtain intermediate positions.

of the two-step search, 28 registers are needed to store the area

At the beginning, the implementation starts loading the regis- TABLE IV

You might also like