Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

Computer Vision and Image Understanding 116 (2012) 16–24

Contents lists available at SciVerse ScienceDirect

Computer Vision and Image Understanding


journal homepage: www.elsevier.com/locate/cviu

Improving sub-pixel accuracy for long range stereo


Stefan K. Gehrig a,⇑, Hernán Badino b, Uwe Franke a
a
Daimler AG, HPC 050-G024, 71059 Sindelfingen, Germany
b
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3890, USA

a r t i c l e i n f o a b s t r a c t

Article history: Dense stereo algorithms are able to estimate disparities at all pixels including untextured regions.
Received 19 May 2008 Typically these disparities are evaluated at integer disparity steps. A subsequent sub-pixel interpolation
Accepted 20 July 2011 often fails to propagate smoothness constraints on a sub-pixel level.
Available online 16 August 2011
We propose to increase the sub-pixel accuracy in low-textured regions in four possible ways: First, we
present an analysis that shows the benefit of evaluating the disparity space at fractional disparities. Sec-
Keywords: ond, we introduce a new disparity smoothing algorithm that preserves depth discontinuities and enforces
Stereo vision
smoothness on a sub-pixel level. Third, we present a novel stereo constraint (gravitational constraint)
Sub-pixel interpolation
3D reconstruction
that assumes sorted disparity values in vertical direction and guides global algorithms to reduce false
Object detection matches, especially in low-textured regions. Finally, we show how image sequence analysis improves
stereo accuracy without explicitly performing tracking. Our goal in this work is to obtain an accurate
3D reconstruction. Large-scale 3D reconstruction will benefit heavily from these sub-pixel refinements.
Results based on semi-global matching, obtained with the above mentioned algorithmic extensions are
shown for the Middlebury stereo ground truth data sets. The presented improvements, called Improve-
SubPix, turn out to be one of the top-performing algorithms when evaluating the set on a sub-pixel level
while being computationally efficient. Additional results are presented for urban scenes. The four
improvements are independent of the underlying type of stereo algorithm.
Ó 2011 Elsevier Inc. All rights reserved.

1. Introduction disparity steps is described. The following section introduces our


novel disparity smoothing algorithm. In Section 5, a new weak ste-
Recent work on stereo vision has focused on dense global algo- reo constraint is introduced. For image sequences, we present in
rithms that are able to obtain very accurate results. For a survey of Section 6 an efficient option to improve sub-pixel disparity estima-
the current state of the art refer to [1] and to the Middlebury stereo tion by combining several measurements relying on rough ego-
website that hosts the list of top-performing stereo algorithms motion information e.g. obtained by inertial sensors. Section 7
compared on ground truth data [2]. gives detailed results on fractional disparity sampling, on the dis-
The applications for dense, accurate disparity estimations are parity smoothing and on our gravitational stereo constraint. In
numerous, ranging from view interpolation to 3D reconstruction. our evaluation we use ground truth data and urban scenes. The
Our main interest lies in obtaining high accuracy disparity images final section comprises conclusions and future work. An abbrevi-
for 3D reconstruction at small disparities, i.e. at large distances. In ated version of this paper has been published in [3].
this paper we present four improvements for sub-pixel stereo
computation, focusing on low-textured image regions. Textured
regions benefit from the improvements as well. Our scope in this 2. Related work
paper is limited to obtaining accurate 3D reconstructions. 3D data
storage and view synthesis are not addressed. The 3D results are We limit our review to papers addressing sub-pixel estimation
shown from perspectives that allow us to assess the accuracy of of stereo correspondences.
the methods – they are not deemed suitable for view synthesis. Most stereo algorithms obtain disparities on an integer level. A
The paper is organized in the following way: Section 2 gives an simple sub-pixel interpolation can be done via parabola fitting of
overview of stereo sub-pixel interpolation techniques. In Section 3, the disparities in the vicinity of the best disparity [4]. Besides
the technique of sampling the disparity space at fractional parabola fitting, performing a so-called equiangular fit has been
introduced by Shimizu and Okutomi [5]. The results are similar
to parabola fitting, but reduce the disparity errors slightly for linear
⇑ Corresponding author. similarity criteria (e.g. sum of absolute differences). There has also
E-mail address: stefan.gehrig@daimler.com (S.K. Gehrig). been a growing interest in obtaining accurate sub-pixel disparities

1077-3142/$ - see front matter Ó 2011 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2011.07.008
S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24 17

since the parabola fitting approaches exhibit artifacts known as Besides smoothness, the epipolar constraint always holds for
pixel-locking [6]. These approaches either perform a careful inter- stereoscopic cameras setups. In addition, the ordering constraint
polation of the integer disparity results [7] or work on a disparity is known to be helpful in establishing stereo correspondences[1],
space image that is sampled at fractional disparities [8]. We extend yet is violated for thin objects smaller than the stereo baseline.
the latter idea in our contribution. We propose another helpful constraint for low-textured regions
A few global algorithms deliver sub-pixel disparity estimates which we call the gravitational constraint introduced in Section
directly: Belief propagation using an MMSE (minimum mean 5. It develops its full potential in outdoor scenes.
squared error) estimator yields sub-pixel disparities implicitly
[9]. In this paper, marginal probabilities per pixel and disparity
2.1. Semi-global matching
step constitute weights resulting in sub-pixel disparity values.
Another recent belief propagation variant also obtains sub-pixel
One of the many dense stereo algorithms listed on the Middle-
disparities by fitting surfaces on small segments [10]. An energy-
bury web page is semi-global matching (SGM) [19]. Our investiga-
based formulation of the disparity estimation problem is presented
tions are conducted with this algorithm due to its computational
in [11], which also results in sub-pixel disparity estimates. There,
efficiency. Among the top performing algorithms, Hirschmueller’s
several ideas from the optical flow estimation problem were
SGM has the lowest computational complexity. Real-time imple-
applied to stereo vision. These algorithms require large computa-
mentations for SGM exist for graphics processors (GPU) [20] and
tional resources. The basic dense disparity estimation of this ap-
for reconfigurable hardware (FPGA) [21].
proach is described in [12].
Roughly speaking, SGM performs an energy minimization in a
Most other sub-pixel algorithms get their integer disparity
dynamic-programming fashion on multiple (8 or 16) 1D paths
values with a correlation-based stereo method (e.g. [1]) and per-
approximating the 2D image. The energy consists of three parts:
form a refinement subsequently. Psarakis and Evangelidis [13]
a data term for photo-consistency, a small smoothness energy term
conduct a linear interpolation of adjacent correlation values and
for slanted surfaces that change the disparity slightly (parameter
solve a one-dimensional optimization problem. Nehab et al. [7]
P1), and a smoothness energy term for depth discontinuities
observe a bias in stereo matching due to the use of typically the left
(parameter P2).
image as the reference image. Nehab et al. treat both images sym-
SGM is computationally efficient and has been used very
metrically and a two-dimensional fit in the disparity space yields
successfully in aerial image 3D reconstruction. The algorithm has
accurate sub-pixel disparities. Stein et al. [14] take the integer
a real-time potential on pipe-line architectures such as graphic
disparities and perform a gradient-descent in the spirit of [15].
cards (GPU) or field-programmable gate-arrays (FPGA) and pro-
All three methods present results free of the pixel-locking effect.
cesses VGA image pairs in few seconds on standard CPUs.
The computational burden of these methods is quite low, espe-
A multi-baseline extension of SGM is straightforward and de-
cially Psarakis method. A fast method to remedy the pixel-locking
scribed in [19]. Instead of working on rectified images one walks
effect is also presented by Shimizu and Okutomi [6]. There, the
along slanted epipolar lines of the undistorted image which can
integer correlation values are resampled shifted by a half pixel
be done for all match images with one reference image. The
and both results are averaged. The result exhibits a slight locking
resulting disparity map results are combined with a median fil-
to integer and to half pixel values. All these methods rely on suffi-
ter. The algorithms described below can be combined for a mul-
cient structure to successfully correlate image patches.
ti-baseline application in the same way, allowing large-scale
A rigorous analysis of the sub-pixel effects in stereo matching
reconstructions.
by Scharstein and Szeliski has been conducted in [8]. There, a Fou-
rier analysis shows that using a sinc interpolator is in theory the
best interpolation to evaluate the disparity space image at frac- 3. Sampling the disparity at fractional disparities
tional disparities. Chen and Katz [16] comes to the same conclusion
in the context of optical flow estimation. As shown by Szeliski and Scharstein [8], the sinc interpolator is
The analysis of [8] neglected effects of the sensor geometry, the best interpolation to sample the disparity space. For image
especially the pixel fill ratio. When approaching such accuracies areas with sufficient texture this results in excellent sub-pixel
the physical layout of the camera becomes relevant. An investiga- disparities. However, in real images, a good portion of the image
tion of the effect of sensor geometry on correspondence measure- consists of low-textured or untextured areas. Due to noise, these
ments is presented in [17]. This analysis gets even more areas do not benefit from this interpolation strategy but need
complicated considering recent trends using micro-lenses on every support from a smoothness constraint. When working on discrete
pixel which gives a complicated photo-sensitivity distribution for disparity steps, the number of subdivisions of one disparity step is
every pixel. the dominant factor for improvement in low-textured areas. We
We are interested in obtaining accurate sub-pixel estimates for extend the analysis from [8] and perform an evaluation of the dis-
all pixels in the scene, including areas of low or no texture, at little parity steps at integer pixel, at half pixel and at quarter pixel level,
computational expense. We implemented several of the above sub- considering the full image including low-textured regions. Using
pixel algorithms, confirmed their accuracy provided sufficient semi-global matching we present results that show a significant
structure, however, obtained high disparity noise in low-textured improvement in stereo accuracy for fractional disparity sampling.
areas. This is due to the fact that the correlation values or other These results can be extended to any global stereo algorithm that
similarity costs do not have sufficient information in low-textured incorporates smoothness constraints. We also neglect the effect
regions. Such areas can only be filled with meaningful disparities of sensor geometry on sub-pixel interpolation as do most other
using global algorithms with smoothness constraints. One method Computer Vision publications.
that performs disparity smoothing similar to our algorithm is The evaluation of the disparity space at fractional disparities is
introduced in [18]. There, bilateral filtering (adaptive filter coeffi- straightforward, we only need to generate our matching costs at
cients depending on adjacency and on photoconsistency to the fractional pixel positions. Then, the original SGM algorithm is ap-
central pixel) is applied to obtain an improved disparity map, plied using the cost matrix with the disparity dimension increased
resulting in depth-discontinuity-preserving smoothing. This non- by the sampling factor 1, 2 or 4. We consider two similarity met-
iterative method yields excellent results but is computationally rics: the Birchfield–Tomasi (BT) metric [22], and pixel-wise mutual
quite expensive. information [23].
18 S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24

X
Table 1 Etot ¼ ðEdata ðx; yÞ þ kEsmooth ðx; yÞÞ: ð2Þ
Error percentages and MSE (in px2) for stereo ground truth data sets. The results get x;y
better with higher fractional disparity sampling (FDS). Tsu = Tsukuba, Ven = Venus,
Ted = Teddy, Con = Cones image pair. Disparity errors beyond 0.5 px are counted as Let d0 be the disparity of the central pixel of a given support region.
errors.
The empirical local disparity variance,
FDS 1 px 0.5 px 0.25 px
% err MSE % err MSE % err MSE 1 X N
 2 ¼ r2 ðd Þ;
Esmooth ðd0 Þ ¼ ðdi  dÞ 0 ð3Þ
Tsu 23.72 1.117 20.30 0.966 14.99 0.919
N  1 i¼1
Ven 10.80 0.643 8.60 0.502 8.17 0.330
Ted 23.71 8.782 21.16 8.731 19.96 9.674 is used as a smoothness energy contribution of each pixel. N is the
Con 15.04 5.194 12.98 4.861 12.57 5.302 number of pixels within the considered patch and d  its the average
disparity. This formulation shall emphasize the fact that the
Taking the BT similarity metric as described in [22], we obtain smoothness can be improved by slight modifications of d0. The
smoothness energy of d0 becomes minimal for d0 ¼ d. 
the interpolated BT values at half pixel resolution by interpolating
the grayvalues of the left and right image and then computing the How can an appropriate data term be formulated? Let dint be the
BT metric with the interpolated values. This is equivalent to integer disparity computed by a stereo algorithm of your choice.
running the BT metric on the interpolated image with twice the The standard parabolic fit delivers not only an improved estimate
original resolution horizontally. In our investigations with ground dsub, but also the curvature a of the parabola. The curvature is small
truth data, we obtained a slight performance gain when evaluating in low-textured regions, whereas it is large in textured regions. If
BT with these interpolated half-pixel values. When evaluating the we define the data term according to
disparity space at quarter pixel resolution, we only interpolate the
Edata ðd0 Þ ¼ aðd0  dsub Þ2 ; ð4Þ
BT matching costs based on above half-pixel values.
For pixel-wise mutual information, which is in essence a lookup the cost of choosing d0 unequal to the locally estimated dsub de-
table, we implement a linear interpolation of the grayvalues in the pends on the confidence of the fit. One could also incorporate the
match image. Instead of obtaining the cost by looking up image gradient or the grayvalue variance as a confidence measure
CMI(p, d) = MI[Ip, ref][Ip  d, m] we obtain the cost on the enlarged of the stereo fit. Weighting the data term with a confidence mea-
disparity space image: sure results in a significant reduction of disparity noise in low-tex-
C MI ðp; 2dÞ ¼ MI½Ip;ref ½Ipd;m  ured regions as shown in Table 3 from Section 7.
  It is easy to compute the best solution d0 for a certain image
Ipd;m þ Ipd1;m
C MI ðp; 2d þ 1Þ ¼ MI½Ip;ref  ð1Þ point. Partial derivation @Etot/@d0 = 0 yields
2
We stick to the notation of Hirschmueller, where p is a pixel in the 
adsub þ Nk d
image, d the investigated disparity, MI the mutual information look- d0 ¼ k
: ð5Þ
aþN
up table, Iref the reference image intensity and Im the match image
intensity. In Section 7 we show the results of our analysis with The higher that k is chosen, the smoother is the resulting disparity
different sampling steps. Notice that the increase in computation image. Applying Eq. (5) to the disparity image favors smooth recon-
time is linear with the number of (fractional) disparities to check. structions everywhere and ignores the fact that the disparities can
However, the gain in disparity accuracy is clearly visible when look- be discontinuous at object boundaries. Therefore, we modify Eq.
ing at Table 1. The percentage of wrong disparities is reduced by (3) according to
20% when comparing integer disparity sampling with quarter pixel  
disparities sampling. r2 ðd0 Þ
Esmooth ¼ r2norm arctan ; ð6Þ
r2norm
4. Disparity smoothing where r2norm denotes the expected variance due to noise. In areas
with small disparity variances (r  rnorm), the characteristic re-
Our second technique for improving the sub-pixel accuracy is mains unchanged, whereas at disparity edges the slope of the func-
based on the smoothness constraint. This constraint is widely used tion is small. That means that a change of d0 has only little effect on
to obtain good disparity estimates but often on a pixel level only. the total energy, thus switching off the smoothing at disparity
All commonly used local interpolation schemes neglect this discontinuities.
constraint while estimating depth with sub-pixel accuracy. Imag- Thanks to the easy derivative of the arctan function, the optimal
ine a house at 50 m distance and a stereo rig that yields a disparity d0 becomes:
of 5 pixels. An uncertainty of  14 px then leads to estimations
between 47.6 m and 52.6 m distance corresponding to an uncer- 
adsub þ Hd k 1
d0 ¼ with H ¼ : ð7Þ
tainty of ±2.5 m in depth. Note that the mean of these estimations aþH N 1 þ ðr=rnorm Þ4
is not the true value due to the non-linear relation between depth
and disparity. Obviously, such a reconstruction is not very likely to H expresses the homogeneity (approaching 0 for depth discontinu-
be accurate. ities, 1 for homogeneous areas) of the disparity values in the consid-
Our approach performs a depth-edge-preserving smoothing on ered patch. k is constant and gives equal weight to all disparities.
the disparity image, similar to [18] where bilateral filtering was The above equations are applied to small patches within the
used. Bilateral filtering and adaptive smoothing are closely related disparity image. In order to get close to the optimal solution of
to each other as shown in [24]. Our approach is similar to adaptive the above stated problem, we need to iterate Eq. (7) to propagate
smoothing, however, unlike other methods we also exploit the the updated disparity values. dsub remains the original value of
confidence of an established disparity value. Basically we again the input disparity image, whereas d0 is updated in every iteration.
assume a piecewise constant 3D world, now for disparities on a This disparity smoothing algorithm favors solutions that are
sub-pixel level. planar in 3D, i.e. fronto-parallel or slanted planes. This way, the
We treat the disparity estimation as an energy minimization algorithm is especially helpful for reconstructing buildings.
problem similar to [11] with: See Section 7.2 for results of this disparity smoothing algorithm.
S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24 19

5. Gravitational constraint

In Fig. 1 an urban scene is shown where the sky yields some


erroneous correspondences due to matching ambiguities. For cam-
eras where the optical axes point to the horizon, the following
observation is made: When traversing the image from the bottom
to the top, the distance to the 3D points on the rays tend to
increase (see Fig. 2). This is due to the fact that objects in the scene
are connected to the ground (due to gravity, therefore we refer to
the constraint as gravitational constraint). Summarized, we
assume sorted depths for every column in the image. If the optical
axes and the ground have a small angle, the constraint is still valid
for most scenes. The constraint is absolutely useless for scenes
viewed from the bird’s eye perspective.
Even for the scenario with the optical axes parallel to the ground,
scene elements which violate the constraint can easily be found:
Traffic signs, bridges, or ceilings in indoor environments. So, this
can only be incorporated as a weak constraint to disambiguate
matches in low-textured areas similar to the smoothness constraint. Fig. 2. Principle of the gravitational constraint.
This constraint is possible to apply for the majority of the stereo
algorithms. We show the adaptation of the gravitational constraint
to the semi-global matching algorithm. Here, this constraint is easy This equation is modified to the following term:
to apply since the algorithm works on multiple 1D paths. For the if dbest þ 1 P d :
bottom-up path the original cost accumulation is (see [19, Eq. 
(12)]): L0r ðp; dÞ ¼ Cðp; dÞ þ min L0r ðp  r; dÞ; L0r ðp  r; d  1Þ
 
L0r ðp; dÞ ¼ Cðp; dÞ þ min L0r ðp  r; dÞ; L0r ðp  r; d  1Þ þPG1 ; L0r ðp  r; d þ 1Þ þ P 1 ; min L0r ðp  r; iÞ þ P2
i
 ð9Þ
else : 
þP 1 ; L0r ðp  r; d þ 1Þ þ P1 ; min L0r ðp  r; iÞ þ P 2 ð8Þ
i
L0r ðp; dÞ ¼ Cðp; dÞ þ min L0r ðp  r; dÞ; L0r ðp  r; d  1Þ
Lr denotes the accumulated cost along one path, C(p, d) the match- 
ing cost for pixel p at disparity d. p  r is the previous pixel along þPG1 ; L0r ðp  r; d þ 1Þ þ P 1 ; min L0r ðp  r; iÞ þ PG2 ;
i
the considered path.

Fig. 1. Street scene with traffic sign-left image shown on the top left, right image on the top right. The disparity image using SGM with BT is shown at the bottom left, the
bottom right image shows the result with gravitational constraint. Note the wrong depth estimation in the sky on the left. Dark green marks unmatched areas. (For
interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
20 S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24

Fig. 3. Van scene (rectified left image shown on the left) and corresponding disparity scene (right). Here, SGM with BT is used. The cameras deliver 12 bits per pixel and have
a 55 cm baseline with a 26° field of view. Dark green marks unmatched areas. (For interpretation of the references to colour in this figure legend, the reader is referred to the
web version of this article.)

 
with PGi = PG  Pi. L0r represents a possible path L00 . . . L015 . dbest is the  Measurement: Disparity and variance images are computed
disparity position of the lowest cost from the previously visited pix- based on the current left and right images with the stereo
el, PG > 1 is the penalty factor for violating the gravitational con- method presented in the previous sections.
straint. The top-down path is modified in the same way. All other  Update: Every filter is updated with its corresponding measure-
paths keep their original shape of Eq. (8). In Section 7.3 we show ment. Current predicted disparities with no corresponding
the effect of this constraint compared on ground truth data and measurement (i.e. no disparity computed for the current image
on urban scenes. position) are not immediately deleted, in the expectation that
a measurement for this estimate will be computed in the next
6. Pixel-wise estimation of dynamic stereo information frame. Disparity measurements with no corresponding estimate
are considered new, and are added into the current Kalman filter
Up to now, we only considered single stereo pairs. In this sec- pool. Every remaining measured disparity has a corresponding
tion, we will evaluate stereo image sequences. The idea is to reduce estimate. If the measurement does not lie within a 3r distance
the main error of triangulated stereo measurement which is in the from the current estimate, the measurement is added as new,
depth component (see e.g. [25]). Tracking of features in an image replacing the estimate. Otherwise, the estimate is updated with
over time allows reduction of the position uncertainty [26]. Never- standard Kalman filter correction equations [27].
theless, tracking is computationally expensive and highly restricts
the real-time capability of a system with an increasing number of A detailed description of this pixel-wise estimation of stereo
features. In [27] an iconic (pixel-based) representation for disparity information can be found in [28].
estimation, without tracking, is introduced. The disparity image
represents the state vector of a Kalman filter, under the assump-
tion that every pixel is independent. This is equivalent to having 7. Results
many independent Kalman filters, one for every pixel position of
the disparity image. This leads to a more efficient computation. 7.1. Results for Fractional Disparity Sampling (FDS)
Nevertheless, the method assumes a static world. In complex traf-
fic scenarios, with multiple longitudinal moving objects, this leads The benefit of evaluating the disparity space at fractional dis-
to a bias in the estimation of disparity [28]. In order to cope with parities is shown in Table 1. The parameters P1 and P2 of Eq. (8)
this problem we extend this method to estimate not only disparity, were kept constant at 10 and 20, respectively. Adaptive P2 and
but also disparity rate, i.e. the longitudinal speed of the corre- hierarchical mutual information as explained in [19] was used. Dis-
sponding disparity point for every pixel of the image. continuity preserving interpolation as explained in [29] was
This method exploits the fact that multiple disparity observa- applied in the post-processing step. The remaining disparities are
tions of the same point are possible thanks to the knowledge of filled with the disparities from the neighbors to the left or right,
the camera motion. Disparity measurements which are consistent so no intelligent extrapolation is performed.
over time are considered as belonging to the same spatial point, The computation time for Cones and Teddy is less than 7 s for
and therefore, disparity variance is reduced accordingly. By observ- the quarter pixel resolution. We count all pixels exceeding 0.5 px
ing the change of disparity estimates over time, the simultaneous from ground truth as error for the remainder of the paper. The
estimation of disparity rate and disparity rate variance is available. mean square error (MSE) is computed based on all pixels, so a high
This stereo integration requires three main steps: value occurs due to a few mismatched pixels, but the reduction due
to sub-sub-pixel estimation (parabola interpolation and fractional
 Prediction: The state and variance of every Kalman filter is pre- disparity evaluation) can clearly be seen. Excluding outliers in
dicted up to the current time. This is equivalent to computing the MSE computation has the problem that the number of outliers
expected optical flow and disparity based on ego-motion [27], change. The MSE at quarter pixel accuracy slightly degrades for the
while assuming constant disparity rate. The prediction of the Cones and Teddy set which could be accounted for the ground
variances includes the addition of a driving noise parameter truth being accurate to a quarter pixel only [30].
that models the uncertainties of the system, such as ego-motion The evaluation of the disparity space at fractional disparities is
inaccuracy. Such an ego-motion prediction might also come especially beneficial at large distances where one disparity step
from additional inertial sensors-we use velocity and yaw-rate corresponds to several meters in 3D. One example is shown in Figs.
sensors in our experiments. 3 and 4.
S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24 21

Fig. 5. 3D reconstruction of the van rear shown above viewed from the bird’s eye
perspective. Disparity smoothing yields an almost planar surface.

Fig. 4. 3D reconstruction of the rear of the van from the previous figure viewed
from the bird’s eye view. We used pixel disparity sampling at the top and quarter
pixel sampling at the bottom image. The result with fractional disparity sampling
exhibits significantly less noise. The van has a perfectly planar rear.

Table 2
Error percentages and MSE (in px2) for stereo ground truth data sets. The MSE results
are better with disparity smoothing compared to Table 1. Disparity errors beyond
0.5 px are counted as errors.

FDS 1 px 0.5 px 0.25 px


% err MSE % err MSE % err MSE
Tsu 22.82 1.083 19.87 0.943 15.75 0.857
Ven 10.21 0.413 8.46 0.342 8.70 0.328
Ted 24.17 7.701 21.97 7.183 22.17 7.046
Con 15.74 4.912 13.86 4.334 14.78 4.596

Fig. 6. Top: Left image of an image pair viewing a building at 12 m distance.


Bottom: 3D reconstruction of the building viewed from the bird’s eye perspective.
Disparity smoothing yields almost planar building surfaces. The cameras deliver
12 bits per pixel and have a 55 cm baseline with a 26° field of view.

Table 3
Error percentages and MSE (in px2) for stereo ground truth data sets. The MSE results
improve significantly with disparity smoothing compared to Table 1. Only the low-
textured image parts are considered here. Between 24% (Cones) and 50% (Venus) of One can see clearly that the rear of the truck is excellently
the pixels in the scenes are low-textured. Disparity errors beyond 0.5 px are counted reconstructed for the quarter pixel case. This accuracy gain in ste-
as errors. reo facilitates a subsequent 3D analysis to detect objects. The sep-
Algo Base Smooth Smooth aration of two objects at similar distances becomes much easier.
FDS 1 px 1 px 0.25 px
% err MSE % err MSE % err MSE 7.2. Disparity smoothing results
Tsu 48.67 3.48 37.73 2.25 29.28 1.85
Ven 17.11 1.03 13.74 0.84 11.40 0.62 In Table 2 the MSE and error percentages for the disparity
Ted 39.22 14.76 31.94 11.62 25.17 9.72
Con 31.45 9.48 23.37 7.13 18.78 6.43
smoothing algorithm is shown on the stereo ground truth data
set. The MSE decreases for all image and sampling configurations
22 S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24

Table 4 Fig. 5 shows a 3D reconstruction of the van rear shown in Fig. 3.


Comparison of error percentages and MSE (in px2) of stereo ground truth data. The rear is reconstructed to an almost planar surface keeping
Incorporating the gravitational constraint yields consistently better results. Disparity
errors beyond 0.5 px are counted as errors.
smoothing artifacts minimal in the surrounding street or sky.
In Fig. 6 a facade reconstruction using this algorithm is shown.
FDS 0.25 px 0.25 px + gravitation The walls appear very planar and the two walls meet at a 90 angle.
% err MSE % err MSE Note the clean separation between the staircase and the wall. The
Tsu 7.83 0.9203 7.66 0.9682 eaves gutter on the right wall is also well preserved in its shape.
Ven 7.35 0.3493 6.64 0.3115
Ted 17.54 9.0838 17.43 7.1768
7.3. Gravitational constraint results
Con 10.76 4.9547 10.42 4.3229

Table 4 shows the performance gain when applying the gravita-


tional constraint to the ground truth data set. Here we did not ap-
compared to Table 1 while the error percentage slightly increases ply the parabola interpolation and used a penalty factor of PG = 3.
for the Cones and Teddy image pair. This is due to the fact that a At time of publication our algorithm using the gravitational
slight smoothing still occurs along depth discontinuities. We use constraint and fractional sampling, ImproveSubPix, was among
parameters k = 50,000, rnorm = 0.01, a mask size of 3  3 and 10 the top10-performing stereo algorithms on sub-pixel level [2],
iterations for the disparity smoothing. The computation time is being third for the Cones image pair. The improved SGM by Hir-
130 ms for the Cones images. In order to investigate the effect of schmueller [29] performs similarly, but uses several additional
disparity smoothing on low-textured regions, we also evaluated post-processing steps. This evaluation is based on a threshold of
the error percentage for these regions. Pixels where a vertical 0.5 px disparity deviation from ground truth. The used ground
3  3 Sobel filter returns a value of less than 2 greyvalues are con- truth accuracy is a quarter pixel or better.
sidered low-textured. The results of these areas are summarized in The gravitational constraint is very helpful in urban scenes to
Table 3. For reference we added the evaluation of error rates and disambiguate matches in low-textured areas, such as the street
MSE for SGM without disparity smoothing and omitted the results and the sky. One example is shown in Fig. 1. The original SGM re-
for 0.5 px disparity sampling and smoothing due to space limita- sult produces wrong matches in the sky (BT metric was used).
tions. Note the significantly higher error percentage level and These mismatches can be corrected with the gravitational
MSE. The error rate is about three times higher for these regions constraint. Often, such wrong estimations are removed by the
compared to the highly textured areas. right-left consistency check [19], which is performed here. Note
From Table 3, we see a significant improvement of the error that the traffic sign at the right side is still correctly measured
rates when enabling disparity smoothing (left vs. middle column), although it violates the gravitational constraint. The computational
reducing the error percentage by about 25%. Before, the error overhead incorporating this constraint is negligible.
percentage for all pixels was reduced only moderately (less than
10%) when enabling disparity smoothing. Adding fractional dispar- 7.4. Results for pixel-wise estimation of dynamic stereo information
ity sampling on top (right column) improves the results by another
20%. An example of the improvement achieved with the iconic repre-
The benefit of disparity smoothing becomes obvious when look- sentation is shown in Fig. 7. Fig. 7 shows two occupancy grids as
ing at slanted surfaces. We investigated a part of the Venus stereo computed in [31]. An occupancy grid registers stereo measure-
pair where an accuracy of 1/8 pixel is available. Taking the lower ments with their uncertainties in a 2D grid viewed from the birds
right image portion of the Venus image pair, we observe an eye view-the camera position on the bottom center. The occupancy
improvement from 7% false pixels to 4% and an MSE drop from grid shown on the left was obtained with standard output from the
0.080 to 0.057px2. stereo algorithm, while the occupancy grid on the right was

Fig. 7. Occupancy grid without stereo integration (left) and with stereo integration (right). Note the increase in accuracy for the trailer rear on the right side.
S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24 23

Fig. 8. Original disparity image (left) and integrated disparity image (right) for above image. Note the increase in stereo density.

computed using an integrated disparity image. The improvement References


can be seen by the reduction of the dispersion of registered mea-
surements. The trailer rear on the right side at approximately [1] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame
stereo correspondence algorithms, IJCV 47 (1) (2002) 7–42.
40 m is marked over the images. The integrated disparity image [2] D. Scharstein, R. Szeliski, Middlebury online stereo evaluation. <http://
(see Fig. 8 left) shows a great improvement over the standard mea- vision.middlebury.edu/stereo> (viewed on 28.07.11).
sured disparity image (see Fig. 8 right), both in depth accuracy and [3] S. Gehrig, U. Franke, Improving sub-pixel accuracy for long range stereo, in:
ICCV-Workshop VRML, Rio de Janeiro, Brasil, 2007.
in stereo measurement density. No discontinuity-preserving inter- [4] Q. Tian, M. Huhns, Algorithms for subpixel registration, Computer Vision,
polation was performed to show the stereo integration effect w.r.t. Graphics, and Image Processing 35 (1986) 220–233.
stereo density. [5] M. Shimizu, M. Okutomi, An analysis of subpixel estimation error on area-
based image matching, in: DSP 2002, 2002, pp. 1239–1242.
[6] M. Shimizu, M. Okutomi, Precise sub-pixel estimation on area-based matching,
in: ICCV, 2001, pp. 90–97.
8. Conclusions and future work [7] D. Nehab, S. Rusinkiewicz, J. Davis, Improved sub-pixel stereo correspondences
through symmetric refinement, in: Proceedings of International Conference on
Computer Vision 2005, 2005, pp. 557–562.
Summarizing, we have shown four ways to improve sub-pixel [8] R. Szeliski, D. Scharstein, Sampling the disparity space image, PAMI 26 (3)
stereo computation: (2004) 419–425.
[9] M. Tappen, W. Freeman, Comparison of graph cuts with belief propagation for
stereo, using identical mrf parameters, in: Proceedings of International
 Evaluate the disparity space image at fractional disparities. Conference on Computer Vision 2003, 2003, pp. 900–907.
 Apply the disparity smoothing algorithm to obtain a spatial [10] A. Klaus, M. Somann, K. Kramer, Segment-based stereo matching using belief
connection between adjacent pixels on a sub-pixel level. propagation and a self-adapting dissimilarity measure, in: International
Conference on Pattern Recognition, 2006.
 Exploit the gravitational constraint if the camera setup is [11] L. Alvarez, R. Deriche, J. Sanchez, J. Weickert, Dense disparity map estimation
designed that way. respecting image discontinuities: a pde and scale-space based approach,
 For image sequence analysis, integrate the stereo images over Journal of Visual Communication and Image Representation 13 (1–2) (2002)
3–21.
several frames. [12] L. Robert, R. Deriche, Dense depth map reconstruction: a minimization and
regularization approach which preserves discontinuities, in: European
For the first two improvements we could demonstrate clearly Conference on Computer Vision (ECCV), 1996, pp. 439–451.
[13] E. Psarakis, G. Evangelidis, An enhanced correlation-based method for stereo
the extra benefit for low-textured regions by conducting a separate
correspondence with sub-pixel accuracy, in: Proceedings of International
evaluation on the low-textured parts of the Middlebury data set. Conference on Computer Vision 2005, 2005, pp. 907–912.
All four improvements can be combined independently of each [14] A. Stein, A. Huertas, L. Matthies, Attenuating stereo pixel-locking via affine
other. To save computation time, it might be appealing to use the window adaption, in: International Conference on Robotics and Automation,
2006, pp. 914–921.
disparity smoothing algorithm without fractional disparity [15] B. Lucas, T. Kanade, An iterative image registration technique with an
sampling. The gravitational constraint is especially helpful to dis- application to stereo vision, in: International Joint Conference on Artificial
ambiguate matches in low-textured areas. In addition, these Intelligence, 1981, pp. 674–679.
[16] J. Chen, J. Katz, Elimination of peak-locking error in piv analysis using the
improvements are independent of the choice of stereo algorithm correlation method, Measurement Science & Technology 16 (2005) 562–574.
with two restrictions: For the gravitational constraint some verti- [17] J. Westerweel, Effect of sensor geometry on the performance of piv
cal connection between disparity estimates have to be established interrogation, in: Laser Techniques Applied to Fluid Mechanics, Springer-
Verlag, 2000, pp. 37–55.
and for image sequence analysis, ego-motion information must be [18] Q. Yang, R. Yang, J. Davis, D. Nister, Spatial-depth super resolution for range
available. One example to combine these algorithms for stereo images, in: Proceedings of International Conference on Computer Vision and
algorithms exploiting smoothness constraints only along the Pattern Recognition 2007, 2007.
[19] H. Hirschmueller, Accurate and efficient stereo processing by semi-global
epipolar line, e.g. dynamic programming stereo. There, it might matching and mutual information, in: Proceedings of International Conference
be beneficial to incorporate the gravitational constraint and dispar- on Computer Vision and Pattern Recognition 2005, San Diego, CA, vol. 2, 2005,
ity smoothing to obtain a vertical coupling. For multi-baseline pp. 807–814.
[20] I. Ernst, H. Hirschmueller, Mutual information based semi-global stereo
matching, the same improvements shown for two-view stereo
matching on the gpu, in: International Symposium of Visual Computing
are expected. The stereo integration approach is a computationally (ISVC), Las Vegas, 2008.
cheap step towards multi-baseline evaluations. [21] S. Gehrig, F. Eberli, T. Meyer, A real-time low-power stereo vision engine using
Finally, looking at our evaluation against ground truth, it semi-global matching, in: ICVS 2009, 2009, pp. 134–143.
[22] S. Birchfield, C. Tomasi, Depth discontinuities by pixel-to-pixel stereo, in:
becomes obvious that more accurate ground truth data would be Proceedings of International Conference on Computer Vision 1998, 1998, pp.
extremely valuable to further push the limits of stereo accuracy. 1073–1080.
24 S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24

[23] J. Kim, V. Kolmogorov, R. Zabih, Visual correspondence using energy [28] T. Vaudrey, H. Badino, S. Gehrig, Integrating disparity images by incorporating
minimization and mutual information, in: Proceedings of International disparity rate, in: Robot Vision 2008, 2008.
Conference on Computer Vision 2003, 2003, pp. 1033–1040. [29] H. Hirschmueller, Stereo vision in structured environments by consistent
[24] D. Barash, D. Camiciu, A common framework for nonlinear diffusion, adaptive semi-global matching and mutual information, in: Proceedings of
smoothing, bilateral filtering and mean shift, Image and Vision Computing 22 International Conference on Computer Vision and Pattern Recognition 2006,
(1) (2004) 73–81. New York, NY, 2006, pp. 2386–2393.
[25] L. Matthies, S.A. Shafer, Error modeling in stereo navigation, IEEE Transactions [30] D. Scharstein, R. Szeliski, High-accuracy stereo depth maps using structured
on Robotics and Automation RA-3 (3) (1987) 239–248. light, in: Proceedings of International Conference on Computer Vision and
[26] U. Franke, C. Rabe, H. Badino, S. Gehrig, 6d vision: Fusion of motion and stereo Pattern Recognition 2003, Madison, Wisconsin, 2003, pp. 195–202.
for robust environment perception, in: Pattern Recognition, DAGM [31] H. Badino, U. Franke, R. Mester, Free space computation using stochastic
Symposium 2005, Vienna, 2005, pp. 216–223. occupancy grids und dynamic programming, in: Workshop on Dynamical
[27] L. Matthies, R. Szelinski, T. Kanade, Kalman filter based algorithms for Vision@ICCV, 2007.
estimating depth from image sequences, International Journal of Computer
Vision 3 (1989) 209–238.

You might also like