Improving Sub-Pixel Accuracy For Long Range Stereo
Article history:
Received 19 May 2008
Accepted 20 July 2011
Available online 16 August 2011

Keywords: Stereo vision, Sub-pixel interpolation, 3D reconstruction, Object detection

Abstract

Dense stereo algorithms are able to estimate disparities at all pixels including untextured regions. Typically these disparities are evaluated at integer disparity steps. A subsequent sub-pixel interpolation often fails to propagate smoothness constraints on a sub-pixel level.

We propose to increase the sub-pixel accuracy in low-textured regions in four possible ways: First, we present an analysis that shows the benefit of evaluating the disparity space at fractional disparities. Second, we introduce a new disparity smoothing algorithm that preserves depth discontinuities and enforces smoothness on a sub-pixel level. Third, we present a novel stereo constraint (gravitational constraint) that assumes sorted disparity values in the vertical direction and guides global algorithms to reduce false matches, especially in low-textured regions. Finally, we show how image sequence analysis improves stereo accuracy without explicitly performing tracking. Our goal in this work is to obtain an accurate 3D reconstruction. Large-scale 3D reconstruction will benefit heavily from these sub-pixel refinements. Results based on semi-global matching, obtained with the above-mentioned algorithmic extensions, are shown for the Middlebury stereo ground truth data sets. The presented improvements, called ImproveSubPix, turn out to be one of the top-performing algorithms when evaluating the set on a sub-pixel level, while being computationally efficient. Additional results are presented for urban scenes. The four improvements are independent of the underlying type of stereo algorithm.

© 2011 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2011.07.008
S.K. Gehrig et al. / Computer Vision and Image Understanding 116 (2012) 16–24 17
since the parabola fitting approaches exhibit artifacts known as pixel-locking [6]. These approaches either perform a careful interpolation of the integer disparity results [7] or work on a disparity space image that is sampled at fractional disparities [8]. We extend the latter idea in our contribution.

A few global algorithms deliver sub-pixel disparity estimates directly: Belief propagation using an MMSE (minimum mean squared error) estimator yields sub-pixel disparities implicitly [9]. There, marginal probabilities per pixel and disparity step constitute weights resulting in sub-pixel disparity values. Another recent belief propagation variant also obtains sub-pixel disparities by fitting surfaces on small segments [10]. An energy-based formulation of the disparity estimation problem is presented in [11], which also results in sub-pixel disparity estimates. There, several ideas from the optical flow estimation problem were applied to stereo vision. These algorithms require large computational resources. The basic dense disparity estimation of this approach is described in [12].

Most other sub-pixel algorithms obtain their integer disparity values with a correlation-based stereo method (e.g. [1]) and perform a refinement subsequently. Psarakis and Evangelidis [13] conduct a linear interpolation of adjacent correlation values and solve a one-dimensional optimization problem. Nehab et al. [7] observe a bias in stereo matching due to the use of typically the left image as the reference image. They treat both images symmetrically, and a two-dimensional fit in the disparity space yields accurate sub-pixel disparities. Stein et al. [14] take the integer disparities and perform a gradient descent in the spirit of [15]. All three methods present results free of the pixel-locking effect. The computational burden of these methods is quite low, especially for Psarakis' method. A fast method to remedy the pixel-locking effect is also presented by Shimizu and Okutomi [6]. There, the integer correlation values are resampled, shifted by half a pixel, and both results are averaged. The result exhibits a slight locking to integer and to half-pixel values. All these methods rely on sufficient structure to successfully correlate image patches.

A rigorous analysis of the sub-pixel effects in stereo matching has been conducted by Scharstein and Szeliski in [8]. There, a Fourier analysis shows that using a sinc interpolator is in theory the best interpolation to evaluate the disparity space image at fractional disparities. Chen and Katz [16] come to the same conclusion in the context of optical flow estimation.

The analysis of [8] neglected effects of the sensor geometry, especially the pixel fill ratio. When approaching such accuracies, the physical layout of the camera becomes relevant. An investigation of the effect of sensor geometry on correspondence measurements is presented in [17]. This analysis becomes even more complicated considering the recent trend of using micro-lenses on every pixel, which gives a complicated photo-sensitivity distribution for every pixel.

We are interested in obtaining accurate sub-pixel estimates for all pixels in the scene, including areas of low or no texture, at little computational expense. We implemented several of the above sub-pixel algorithms and confirmed their accuracy given sufficient structure; however, we obtained high disparity noise in low-textured areas. This is due to the fact that the correlation values or other similarity costs do not carry sufficient information in low-textured regions. Such areas can only be filled with meaningful disparities using global algorithms with smoothness constraints. One method that performs disparity smoothing similar to our algorithm is introduced in [18]. There, bilateral filtering (adaptive filter coefficients depending on adjacency and on photo-consistency with the central pixel) is applied to obtain an improved disparity map, resulting in depth-discontinuity-preserving smoothing. This non-iterative method yields excellent results but is computationally quite expensive.

Besides smoothness, the epipolar constraint always holds for stereoscopic camera setups. In addition, the ordering constraint is known to be helpful in establishing stereo correspondences [1], yet it is violated for thin objects smaller than the stereo baseline. We propose another helpful constraint for low-textured regions, which we call the gravitational constraint, introduced in Section 5. It develops its full potential in outdoor scenes.

2.1. Semi-global matching

One of the many dense stereo algorithms listed on the Middlebury web page is semi-global matching (SGM) [19]. Our investigations are conducted with this algorithm due to its computational efficiency. Among the top-performing algorithms, Hirschmueller's SGM has the lowest computational complexity. Real-time implementations of SGM exist for graphics processors (GPU) [20] and for reconfigurable hardware (FPGA) [21].

Roughly speaking, SGM performs an energy minimization in a dynamic-programming fashion on multiple (8 or 16) 1D paths approximating the 2D image. The energy consists of three parts: a data term for photo-consistency, a small smoothness energy term for slanted surfaces that change the disparity slightly (parameter P1), and a smoothness energy term for depth discontinuities (parameter P2).

SGM is computationally efficient and has been used very successfully in aerial image 3D reconstruction. The algorithm has real-time potential on pipeline architectures such as graphics cards (GPU) or field-programmable gate arrays (FPGA), and it processes VGA image pairs in a few seconds on standard CPUs.

A multi-baseline extension of SGM is straightforward and described in [19]. Instead of working on rectified images, one walks along slanted epipolar lines of the undistorted image, which can be done for all match images with one reference image. The resulting disparity maps are combined with a median filter. The algorithms described below can be combined for a multi-baseline application in the same way, allowing large-scale reconstructions.

3. Sampling the disparity at fractional disparities

As shown by Scharstein and Szeliski [8], the sinc interpolator is the best interpolation to sample the disparity space. For image areas with sufficient texture this results in excellent sub-pixel disparities. However, in real images, a good portion of the image consists of low-textured or untextured areas. Due to noise, these areas do not benefit from this interpolation strategy but need support from a smoothness constraint. When working on discrete disparity steps, the number of subdivisions of one disparity step is the dominant factor for improvement in low-textured areas. We extend the analysis from [8] and perform an evaluation of the disparity steps at integer-pixel, half-pixel, and quarter-pixel level, considering the full image including low-textured regions. Using semi-global matching, we present results that show a significant improvement in stereo accuracy for fractional disparity sampling. These results can be extended to any global stereo algorithm that incorporates smoothness constraints. We neglect the effect of sensor geometry on sub-pixel interpolation, as do most other computer vision publications.

The evaluation of the disparity space at fractional disparities is straightforward: we only need to generate our matching costs at fractional pixel positions. Then the original SGM algorithm is applied using the cost matrix with the disparity dimension increased by the sampling factor 1, 2, or 4. We consider two similarity metrics: the Birchfield–Tomasi (BT) metric [22] and pixel-wise mutual information [23].
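As an illustrative sketch, a fractionally sampled cost volume can be built by linearly interpolating the match image at sub-pixel positions so that the disparity axis grows by the sampling factor. The fragment below (all names hypothetical) uses a plain absolute gray-value difference in place of the BT and mutual-information metrics described in the paper:

```python
import numpy as np

def fractional_cost_volume(left, right, d_max, factor=2):
    """Matching-cost volume sampled at fractional disparities.

    Simplified sketch: absolute gray-value differences stand in for
    the BT or mutual-information metrics. The right (match) image is
    linearly interpolated at sub-pixel columns, so the disparity axis
    of the returned volume is enlarged by `factor` (1, 2 or 4).
    """
    h, w = left.shape
    n_steps = d_max * factor
    costs = np.full((h, w, n_steps), 255.0)
    cols = np.arange(w)
    for step in range(n_steps):
        d = step / factor                       # fractional disparity
        src = cols - d                          # sub-pixel source column
        i0 = np.clip(np.floor(src).astype(int), 0, w - 1)
        i1 = np.clip(i0 + 1, 0, w - 1)
        frac = src - np.floor(src)
        valid = src >= 0
        # linear interpolation of the match-image gray values
        interp = (1 - frac) * right[:, i0] + frac * right[:, i1]
        costs[:, :, step] = np.where(valid, np.abs(left - interp), 255.0)
    return costs
```

The resulting volume can be handed to any aggregation scheme such as SGM; the winning index divided by the sampling factor gives the fractional disparity.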
Table 1
Error percentages and MSE (in px²) for stereo ground truth data sets. The results get better with higher fractional disparity sampling (FDS). Tsu = Tsukuba, Ven = Venus, Ted = Teddy, Con = Cones image pair. Disparity errors beyond 0.5 px are counted as errors.

FDS     1 px            0.5 px          0.25 px
        % err   MSE     % err   MSE     % err   MSE
Tsu     23.72   1.117   20.30   0.966   14.99   0.919
Ven     10.80   0.643    8.60   0.502    8.17   0.330
Ted     23.71   8.782   21.16   8.731   19.96   9.674
Con     15.04   5.194   12.98   4.861   12.57   5.302

Taking the BT similarity metric as described in [22], we obtain the interpolated BT values at half-pixel resolution by interpolating the gray values of the left and right image and then computing the BT metric with the interpolated values. This is equivalent to running the BT metric on an image interpolated to twice the original resolution horizontally. In our investigations with ground truth data, we obtained a slight performance gain when evaluating BT with these interpolated half-pixel values. When evaluating the disparity space at quarter-pixel resolution, we only interpolate the BT matching costs based on the above half-pixel values.

For pixel-wise mutual information, which is in essence a lookup table, we implement a linear interpolation of the gray values in the match image. Instead of obtaining the cost by looking up C_MI(p, d) = MI[I_{p,ref}][I_{p-d,m}], we obtain the cost on the enlarged disparity space image:

C_MI(p, 2d)     = MI[I_{p,ref}][I_{p-d,m}]
C_MI(p, 2d + 1) = MI[I_{p,ref}][(I_{p-d,m} + I_{p-d-1,m}) / 2]          (1)

We stick to the notation of Hirschmueller, where p is a pixel in the image, d the investigated disparity, MI the mutual information lookup table, I_ref the reference image intensity, and I_m the match image intensity. In Section 7 we show the results of our analysis with different sampling steps. Notice that the increase in computation time is linear in the number of (fractional) disparities to check. However, the gain in disparity accuracy is clearly visible in Table 1: the percentage of wrong disparities is reduced by 20% when comparing integer disparity sampling with quarter-pixel disparity sampling.

4. Disparity smoothing

Our second technique for improving the sub-pixel accuracy is based on the smoothness constraint. This constraint is widely used to obtain good disparity estimates, but often on a pixel level only. All commonly used local interpolation schemes neglect this constraint while estimating depth with sub-pixel accuracy. Imagine a house at 50 m distance and a stereo rig that yields a disparity of 5 pixels. An uncertainty of 1/4 px then leads to estimates between 47.6 m and 52.6 m distance, corresponding to an uncertainty of ±2.5 m in depth. Note that the mean of these estimates is not the true value, due to the non-linear relation between depth and disparity. Obviously, such a reconstruction is not very likely to be accurate.

Our approach performs a depth-edge-preserving smoothing on the disparity image, similar to [18] where bilateral filtering was used. Bilateral filtering and adaptive smoothing are closely related to each other, as shown in [24]. Our approach is similar to adaptive smoothing; however, unlike other methods, we also exploit the confidence of an established disparity value. Basically, we again assume a piecewise constant 3D world, now for disparities on a sub-pixel level.

We treat the disparity estimation as an energy minimization problem similar to [11] with:

E_tot = Σ_{x,y} (E_data(x, y) + λ E_smooth(x, y)).          (2)

Let d_0 be the disparity of the central pixel of a given support region. The empirical local disparity variance,

E_smooth(d_0) = 1/(N − 1) Σ_{i=1}^{N} (d_i − d̄)² = σ²(d_0),          (3)

is used as the smoothness energy contribution of each pixel. N is the number of pixels within the considered patch and d̄ its average disparity. This formulation shall emphasize the fact that the smoothness can be improved by slight modifications of d_0. The smoothness energy of d_0 becomes minimal for d_0 = d̄.

How can an appropriate data term be formulated? Let d_int be the integer disparity computed by a stereo algorithm of your choice. The standard parabolic fit delivers not only an improved estimate d_sub, but also the curvature a of the parabola. The curvature is small in low-textured regions, whereas it is large in textured regions. If we define the data term according to

E_data(d_0) = a (d_0 − d_sub)²,          (4)

the cost of choosing d_0 unequal to the locally estimated d_sub depends on the confidence of the fit. One could also incorporate the image gradient or the gray value variance as a confidence measure of the stereo fit. Weighting the data term with a confidence measure results in a significant reduction of disparity noise in low-textured regions, as shown in Table 3 in Section 7.

It is easy to compute the best solution d_0 for a certain image point. Setting the partial derivative ∂E_tot/∂d_0 = 0 yields

d_0 = (a d_sub + (λ/N) d̄) / (a + λ/N).          (5)

The higher λ is chosen, the smoother the resulting disparity image. Applying Eq. (5) to the disparity image favors smooth reconstructions everywhere and ignores the fact that the disparities can be discontinuous at object boundaries. Therefore, we modify Eq. (3) according to

E_smooth = σ²_norm arctan(σ²(d_0) / σ²_norm),          (6)

where σ²_norm denotes the expected variance due to noise. In areas with small disparity variances (σ ≪ σ_norm), the characteristic remains unchanged, whereas at disparity edges the slope of the function is small. That means that a change of d_0 has only little effect on the total energy, thus switching off the smoothing at disparity discontinuities.

Thanks to the easy derivative of the arctan function, the optimal d_0 becomes:

d_0 = (a d_sub + H d̄) / (a + H)   with   H = (λ/N) · 1 / (1 + (σ/σ_norm)⁴).          (7)

H expresses the homogeneity of the disparity values in the considered patch (the factor 1/(1 + (σ/σ_norm)⁴) approaches 0 at depth discontinuities and 1 in homogeneous areas). λ is constant and gives equal weight to all disparities.

The above equations are applied to small patches within the disparity image. In order to get close to the optimal solution of the above stated problem, we need to iterate Eq. (7) to propagate the updated disparity values. d_sub remains the original value of the input disparity image, whereas d_0 is updated in every iteration. This disparity smoothing algorithm favors solutions that are planar in 3D, i.e. fronto-parallel or slanted planes. This way, the algorithm is especially helpful for reconstructing buildings. See Section 7.2 for results of this disparity smoothing algorithm.
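As a sketch, the iterated update of Eq. (7) might look as follows in Python. Names, patch size, and parameter values are illustrative choices, not the paper's settings:

```python
import numpy as np

def smooth_disparities(d_sub, curvature, sigma_norm=0.5, lam=1.0,
                       radius=1, iterations=5):
    """Depth-discontinuity-preserving disparity smoothing, Eqs. (4)-(7).

    d_sub:     sub-pixel disparity image from a parabolic fit (fixed input).
    curvature: per-pixel curvature `a` of the fit, used as confidence.
    """
    d0 = d_sub.copy()
    h, w = d0.shape
    n = (2 * radius + 1) ** 2            # patch size N
    for _ in range(iterations):
        new = d0.copy()
        for y in range(radius, h - radius):
            for x in range(radius, w - radius):
                patch = d0[y - radius:y + radius + 1,
                           x - radius:x + radius + 1]
                d_bar = patch.mean()
                var = patch.var(ddof=1)  # empirical variance, Eq. (3)
                # homogeneity H: ~lam/n in flat areas, ~0 at depth edges
                H = (lam / n) / (1.0 + (np.sqrt(var) / sigma_norm) ** 4)
                a = curvature[y, x]
                new[y, x] = (a * d_sub[y, x] + H * d_bar) / (a + H)
        d0 = new
    return d0
```

In flat, low-confidence regions (small a) the update pulls d_0 toward the local mean d̄, while a large local disparity variance drives H toward zero and leaves depth edges untouched.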
5. Gravitational constraint
Fig. 1. Street scene with traffic sign: left image shown on the top left, right image on the top right. The disparity image using SGM with BT is shown at the bottom left; the bottom right image shows the result with the gravitational constraint. Note the wrong depth estimation in the sky on the left. Dark green marks unmatched areas. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 3. Van scene (rectified left image shown on the left) and the corresponding disparity image (right). Here, SGM with BT is used. The cameras deliver 12 bits per pixel and have a 55 cm baseline with a 26° field of view. Dark green marks unmatched areas. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
with P_Gi = P_G · P_i. L_r represents a possible path L_0 . . . L_15, and d_best is the disparity position of the lowest cost at the previously visited pixel; P_G > 1 is the penalty factor for violating the gravitational constraint. The top-down path is modified in the same way. All other paths keep their original shape of Eq. (8). In Section 7.3 we show the effect of this constraint on ground truth data and on urban scenes.

6. Pixel-wise estimation of dynamic stereo information

Up to now, we only considered single stereo pairs. In this section, we will evaluate stereo image sequences. The idea is to reduce the main error of triangulated stereo measurements, which is in the depth component (see e.g. [25]). Tracking of features in an image over time allows a reduction of the position uncertainty [26]. Nevertheless, tracking is computationally expensive and highly restricts the real-time capability of a system with an increasing number of features. In [27], an iconic (pixel-based) representation for disparity estimation, without tracking, is introduced. The disparity image represents the state vector of a Kalman filter, under the assumption that every pixel is independent. This is equivalent to having many independent Kalman filters, one for every pixel position of the disparity image. This leads to a more efficient computation. Nevertheless, the method assumes a static world. In complex traffic scenarios with multiple longitudinally moving objects, this leads to a bias in the estimation of disparity [28]. In order to cope with this problem, we extend this method to estimate not only the disparity but also the disparity rate, i.e. the longitudinal speed of the corresponding disparity point, for every pixel of the image.

This method exploits the fact that multiple disparity observations of the same point are possible thanks to the knowledge of the camera motion. Disparity measurements which are consistent over time are considered as belonging to the same spatial point, and therefore the disparity variance is reduced accordingly. By observing the change of disparity estimates over time, the simultaneous estimation of disparity rate and disparity rate variance is available. This stereo integration requires three main steps:

Prediction: The state and variance of every Kalman filter is predicted up to the current time. This is equivalent to computing the expected optical flow and disparity based on ego-motion [27], while assuming a constant disparity rate. The prediction of the variances includes the addition of a driving noise parameter that models the uncertainties of the system, such as ego-motion inaccuracy. Such an ego-motion prediction might also come from additional inertial sensors; we use velocity and yaw-rate sensors in our experiments.

Measurement: Disparity and variance images are computed based on the current left and right images with the stereo method presented in the previous sections.

Update: Every filter is updated with its corresponding measurement. Current predicted disparities with no corresponding measurement (i.e. no disparity computed for the current image position) are not immediately deleted, in the expectation that a measurement for this estimate will be computed in the next frame. Disparity measurements with no corresponding estimate are considered new and are added to the current Kalman filter pool. Every remaining measured disparity has a corresponding estimate. If the measurement does not lie within a 3σ distance from the current estimate, the measurement is added as new, replacing the estimate. Otherwise, the estimate is updated with the standard Kalman filter correction equations [27].

A detailed description of this pixel-wise estimation of stereo information can be found in [28].

7. Results

7.1. Results for fractional disparity sampling (FDS)

The benefit of evaluating the disparity space at fractional disparities is shown in Table 1. The parameters P1 and P2 of Eq. (8) were kept constant at 10 and 20, respectively. Adaptive P2 and hierarchical mutual information as explained in [19] were used. Discontinuity-preserving interpolation as explained in [29] was applied in the post-processing step. The remaining disparities are filled with the disparities from the neighbors to the left or right, so no intelligent extrapolation is performed.

The computation time for Cones and Teddy is less than 7 s at quarter-pixel resolution. We count all pixels deviating more than 0.5 px from ground truth as errors for the remainder of the paper. The mean square error (MSE) is computed based on all pixels, so a high value can occur due to a few mismatched pixels, but the reduction due to sub-pixel estimation (parabola interpolation and fractional disparity evaluation) can clearly be seen. Excluding outliers from the MSE computation has the problem that the number of outliers changes. The MSE at quarter-pixel accuracy slightly degrades for the Cones and Teddy sets, which could be attributed to the ground truth being accurate to a quarter pixel only [30].

The evaluation of the disparity space at fractional disparities is especially beneficial at large distances, where one disparity step corresponds to several meters in 3D. One example is shown in Figs. 3 and 4.
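The three integration steps of Section 6 can be sketched per pixel with a scalar Kalman filter. The fragment below (hypothetical names) keeps only the disparity state and omits the ego-motion warping of the prediction and the disparity-rate extension of the full method:

```python
import numpy as np

def integrate_disparity(est, est_var, meas, meas_var,
                        drive_noise=0.01, gate=3.0):
    """One predict/update cycle of per-pixel (iconic) disparity
    integration. NaN marks pixels without an estimate/measurement.

    est, est_var:   disparity estimate image and its variance.
    meas, meas_var: current stereo measurement and its variance.
    """
    # Prediction: inflate the variance with driving noise
    var = est_var + drive_noise
    new_est = est.copy()
    new_var = var.copy()

    have_est = ~np.isnan(est)
    have_meas = ~np.isnan(meas)

    # Measurements without an estimate start new filters
    start = have_meas & ~have_est
    new_est[start] = meas[start]
    new_var[start] = meas_var[start]

    both = have_meas & have_est
    # 3-sigma gating: inconsistent measurements replace the estimate
    outlier = both & (np.abs(meas - est) > gate * np.sqrt(var + meas_var))
    new_est[outlier] = meas[outlier]
    new_var[outlier] = meas_var[outlier]

    # Standard Kalman correction for consistent measurements
    ok = both & ~outlier
    k = var[ok] / (var[ok] + meas_var[ok])      # Kalman gain
    new_est[ok] = est[ok] + k * (meas[ok] - est[ok])
    new_var[ok] = (1.0 - k) * var[ok]
    return new_est, new_var
```

Predicted estimates with no measurement simply keep their (inflated) variance, mirroring the "not immediately deleted" behavior described above.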
Fig. 5. 3D reconstruction of the van rear shown above, viewed from the bird's eye perspective. Disparity smoothing yields an almost planar surface.

Fig. 4. 3D reconstruction of the rear of the van from the previous figure, viewed from the bird's eye view. We used integer-pixel disparity sampling for the top image and quarter-pixel sampling for the bottom image. The result with fractional disparity sampling exhibits significantly less noise. The van has a perfectly planar rear.
Table 2
Error percentages and MSE (in px²) for stereo ground truth data sets. The MSE results are better with disparity smoothing compared to Table 1. Disparity errors beyond 0.5 px are counted as errors.

Table 3
Error percentages and MSE (in px²) for stereo ground truth data sets. The MSE results improve significantly with disparity smoothing compared to Table 1. Only the low-textured image parts are considered here. Between 24% (Cones) and 50% (Venus) of the pixels in the scenes are low-textured. Disparity errors beyond 0.5 px are counted as errors.

Algo    Base            Smooth          Smooth
FDS     1 px            1 px            0.25 px
        % err   MSE     % err   MSE     % err   MSE
Tsu     48.67    3.48   37.73    2.25   29.28    1.85
Ven     17.11    1.03   13.74    0.84   11.40    0.62
Ted     39.22   14.76   31.94   11.62   25.17    9.72
Con     31.45    9.48   23.37    7.13   18.78    6.43

One can see clearly that the rear of the truck is excellently reconstructed in the quarter-pixel case. This accuracy gain in stereo facilitates a subsequent 3D analysis to detect objects. The separation of two objects at similar distances becomes much easier.

7.2. Disparity smoothing results

In Table 2, the MSE and error percentages for the disparity smoothing algorithm are shown on the stereo ground truth data set. The MSE decreases for all image and sampling configurations
Fig. 7. Occupancy grid without stereo integration (left) and with stereo integration (right). Note the increase in accuracy for the trailer rear on the right side.
Fig. 8. Original disparity image (left) and integrated disparity image (right) for the above image. Note the increase in stereo density.
[23] J. Kim, V. Kolmogorov, R. Zabih, Visual correspondence using energy minimization and mutual information, in: Proceedings of International Conference on Computer Vision 2003, 2003, pp. 1033–1040.
[24] D. Barash, D. Comaniciu, A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift, Image and Vision Computing 22 (1) (2004) 73–81.
[25] L. Matthies, S.A. Shafer, Error modeling in stereo navigation, IEEE Transactions on Robotics and Automation RA-3 (3) (1987) 239–248.
[26] U. Franke, C. Rabe, H. Badino, S. Gehrig, 6D vision: Fusion of motion and stereo for robust environment perception, in: Pattern Recognition, DAGM Symposium 2005, Vienna, 2005, pp. 216–223.
[27] L. Matthies, R. Szeliski, T. Kanade, Kalman filter-based algorithms for estimating depth from image sequences, International Journal of Computer Vision 3 (1989) 209–238.
[28] T. Vaudrey, H. Badino, S. Gehrig, Integrating disparity images by incorporating disparity rate, in: Robot Vision 2008, 2008.
[29] H. Hirschmueller, Stereo vision in structured environments by consistent semi-global matching and mutual information, in: Proceedings of International Conference on Computer Vision and Pattern Recognition 2006, New York, NY, 2006, pp. 2386–2393.
[30] D. Scharstein, R. Szeliski, High-accuracy stereo depth maps using structured light, in: Proceedings of International Conference on Computer Vision and Pattern Recognition 2003, Madison, Wisconsin, 2003, pp. 195–202.
[31] H. Badino, U. Franke, R. Mester, Free space computation using stochastic occupancy grids and dynamic programming, in: Workshop on Dynamical Vision at ICCV, 2007.