Perspective Transformation: Technical Background
Technical Background
Rick Parent, in Computer Animation (Third Edition), 2012
First, the data is translated so that the observer is moved to the origin. Then, the
observer's coordinate system (view vector, up vector, and the third vector required to
complete a left-handed coordinate system) is transformed by up to three rotations
so as to align the view vector with the global negative z-axis and the up vector
with the global y-axis. Finally, the z-axis is flipped by negating the z-coordinate.
All of the individual transformations can be represented by 4 × 4 transformation
matrices, which are multiplied together to produce a single compound world space
to eye space transformation matrix. This transformation prepares the data for the
perspective transformation by putting it in a form in which the perspective divide is
simply dividing by the point's z-coordinate.
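To make the composition concrete, here is a minimal sketch in C (not Parent's code) of multiplying the individual 4 × 4 matrices into the single compound world-space-to-eye-space matrix; the translation, rotation, and z-flip matrices are assumed to be built elsewhere.

```c
/* Minimal sketch of composing the world-to-eye transform described above.
 * The individual matrices (translate observer to origin, align view/up
 * vectors, flip z) are assumed to be built by helpers not shown here. */
typedef struct { float m[4][4]; } mat4;

/* out = a * b for 4x4 row-major matrices */
static mat4 mat4_mul(mat4 a, mat4 b) {
    mat4 out;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            out.m[i][j] = 0.0f;
            for (int k = 0; k < 4; ++k)
                out.m[i][j] += a.m[i][k] * b.m[k][j];
        }
    return out;
}

/* Points are transformed right to left: translate first, then rotate, then flip z. */
static mat4 world_to_eye(mat4 translate_to_origin, mat4 align_view_and_up, mat4 flip_z) {
    return mat4_mul(flip_z, mat4_mul(align_view_and_up, translate_to_origin));
}
```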
(6.3) (xi, yi)ᵀ = M (xo, yo, 1)ᵀ
Here (xi, yi) and (xo, yo) are the coordinates of a pixel in the input and output images, respectively,
and M is an affine matrix. An example of affine transformation has been given in
Chapter 2, where it was used to rotate an image 90 degrees. So in this chapter, we
focus on the perspective transformation. The API for both functions is very similar,
and everything we learn here can be applied to the affine transformation too.
(6.4) (xu, yu, zu)ᵀ = M (xo, yo, 1)ᵀ
Here (xo, yo) are pixel coordinates in the output image, and (xu, yu, zu) are uniform pixel
coordinates in the input image. The normal input pixel coordinates are given by
(6.5) xi = xu / zu,  yi = yu / zu
The algorithm implemented in this node computes the intensity of each output
image pixel by mapping it to the input image using Eqs. (6.4)–(6.5). Since there
is usually no one-to-one mapping between input and output pixels, the output
pixel intensity is computed by interpolating the intensities of the neighboring input
pixels. The specific interpolation method is given by the “type” parameter. If the
output pixel is mapped outside of the input image boundaries, then the border mode
is used to compute the input pixel intensity. The perspective node supports
VX_INTERPOLATION_NEAREST_NEIGHBOR and VX_INTERPOLATION_BILINEAR. Note that the
output image dimensions do not necessarily have to be equal to the input image
dimensions. This puts a not so obvious restriction on the output image: its
dimensions cannot be inferred from the input image dimensions, so the output
image cannot be a virtual image without a specified width and height. The same is
true for the affine transformation: the dimensions of the output image for both
vxWarpPerspectiveNode and vxWarpAffineNode must always be specified.
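As a concrete illustration of Eqs. (6.4)–(6.5), here is a plain-C sketch (not the OpenVX implementation) of how a single output pixel can be mapped back into the input image and sampled with bilinear interpolation and a constant border; the 3 × 3 matrix layout is an assumption made for the example.

```c
#include <math.h>
#include <stdint.h>

/* Sketch of the per-pixel mapping of Eqs. (6.4)-(6.5): the output pixel (xo, yo)
 * is mapped through a 3x3 matrix to uniform input coordinates, divided by the
 * third component, and sampled with bilinear interpolation. Pixels that fall
 * outside the input image receive a constant border value. */
static uint8_t sample_perspective(const uint8_t *in, int in_w, int in_h,
                                  const float m[3][3], int xo, int yo,
                                  uint8_t border_value)
{
    float xu = m[0][0] * xo + m[0][1] * yo + m[0][2];
    float yu = m[1][0] * xo + m[1][1] * yo + m[1][2];
    float zu = m[2][0] * xo + m[2][1] * yo + m[2][2];
    float xi = xu / zu, yi = yu / zu;                     /* Eq. (6.5) */

    int x0 = (int)floorf(xi), y0 = (int)floorf(yi);
    if (x0 < 0 || y0 < 0 || x0 + 1 >= in_w || y0 + 1 >= in_h)
        return border_value;                              /* constant border mode */

    float fx = xi - x0, fy = yi - y0;                     /* bilinear weights */
    float top = (1 - fx) * in[y0 * in_w + x0]       + fx * in[y0 * in_w + x0 + 1];
    float bot = (1 - fx) * in[(y0 + 1) * in_w + x0] + fx * in[(y0 + 1) * in_w + x0 + 1];
    return (uint8_t)((1 - fy) * top + fy * bot + 0.5f);
}
```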
To illustrate the OpenVX perspective transformation, we will use the previously
developed example of using the Hough transform to detect road lanes. Section
6.4.2 describes finding the vanishing point as a crossing of parallel lanes. We will
extend this sample to generate a bird's eye view from a single image. The bird's
eye view sample is implemented in “birds-eye/birdsEyeView.c,” which is created by
modifying “filter/houghLinesEx.c.” The result of the algorithm is shown in Fig. 6.14.
To reproduce these results, run
Since a road is flat, a change in camera position can be simulated with a perspective
transformation (see [26]). So, we need to come up with a perspective transformation
that sends the vanishing point to infinity, and this will make the road lines parallel
to each other. Since a perspective transformation depends on the vanishing point, it
will have to be generated during graph execution time, so we will need a user node
for that. We will discuss how to do this a little later; for now, let us see how we can
apply the perspective transformation to an image.
After adding the node that calculates the position of the vanishing point (“userFindVanishingPoint”),
we add the user node that returns a perspective transformation:
Then we apply the perspective transformation to the input image. Since “vxWarp-
PerspectiveNode” works with grayscale images only, we split the input image into
three channels, process each of them, and then combine them back into the output
image:
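A sketch of that graph structure might look as follows; the image and matrix identifiers are illustrative (the actual objects are created elsewhere in the sample), but vxChannelExtractNode, vxWarpPerspectiveNode, and vxChannelCombineNode are standard OpenVX kernels:

```c
/* Sketch of the split/warp/combine structure (identifiers are illustrative;
 * the images and the warp matrix are assumed to be created elsewhere). */
vxChannelExtractNode(graph, input_rgb, VX_CHANNEL_R, r);
vxChannelExtractNode(graph, input_rgb, VX_CHANNEL_G, g);
vxChannelExtractNode(graph, input_rgb, VX_CHANNEL_B, b);

vx_node warp_r = vxWarpPerspectiveNode(graph, r, warp_matrix, VX_INTERPOLATION_BILINEAR, r_warped);
vx_node warp_g = vxWarpPerspectiveNode(graph, g, warp_matrix, VX_INTERPOLATION_BILINEAR, g_warped);
vx_node warp_b = vxWarpPerspectiveNode(graph, b, warp_matrix, VX_INTERPOLATION_BILINEAR, b_warped);

vxChannelCombineNode(graph, r_warped, g_warped, b_warped, NULL, output_rgb);
```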
Note that the matrix generated by the user node is an input to the vxWarpPerspectiveNode.
Since no object metadata changes here, graph reverification will not be triggered for each
graph execution.
There will be a substantial number of pixels in the output image that are mapped
outside of the input image boundaries. We want them to be black, and so we set
the border mode to VX_BORDER_CONSTANT with the pixel value equal to 0. Also, note that
we use virtual images, so that an OpenVX implementation can execute this operation in a
more optimal way, for example, running the perspective transformation on a color image
in one pass. Since the warp node cannot infer the size of the output image from the input
image, the virtual images have to be initialized with specific values for width and
height; see the beginning of the implementation:
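A hedged sketch of what that setup could look like (the output size is an assumed value, and warp_r, warp_g, warp_b are the warp nodes from the snippet above): the warped virtual images are created with explicit dimensions, and the warp nodes get a constant black border.

```c
/* Virtual images for the warped planes need explicit dimensions, since the
 * warp node cannot infer the output size (out_w/out_h are assumed values). */
vx_uint32 out_w = 1280, out_h = 720;
vx_image r_warped = vxCreateVirtualImage(graph, out_w, out_h, VX_DF_IMAGE_U8);
vx_image g_warped = vxCreateVirtualImage(graph, out_w, out_h, VX_DF_IMAGE_U8);
vx_image b_warped = vxCreateVirtualImage(graph, out_w, out_h, VX_DF_IMAGE_U8);

/* Constant border of 0: pixels mapped outside the input come out black. */
vx_border_t border;
border.mode = VX_BORDER_CONSTANT;
border.constant_value.U8 = 0;
vxSetNodeAttribute(warp_r, VX_NODE_BORDER, &border, sizeof(border));
vxSetNodeAttribute(warp_g, VX_NODE_BORDER, &border, sizeof(border));
vxSetNodeAttribute(warp_b, VX_NODE_BORDER, &border, sizeof(border));
```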
Now let us see how we can create a perspective transformation during graph
execution time.
(6.6)
where
(6.7)
(6.8)
The rotation angle is chosen so that the vanishing point maps to infinity. Also,
we need to keep the part of the road in front of the camera in the view; otherwise,
our output will be a black image. So, we will add an additional pan and zoom
transformation given by the matrix Z:
(6.9)
Note that throughout this section, we will use the direct perspective transformation
that maps an input image to an output image. OpenVX uses an inverse matrix that
maps an output image to an input image, and we will address this only at the end, when
we generate the output matrix object.
This algorithm is implemented in the user node kernel. It has two input parameters: a
vx_array with one element corresponding to the vanishing point and the input image, which
is only needed to pass the required size of the output image. The output parameter is the
perspective transformation in a vx_matrix object. First, we get the input/output parameters
and the image width/height:
Then we initialize the intrinsics matrix and calculate its inverse (needed in (6.6)):
A downscaling factor of 4 is used here and further on because several operations, including
camera calibration and vanishing point detection, were done on an image resized down 4 times
in each dimension. Matrix inversion is implemented using the LAPACK library. Then we obtain
the coordinates of the vanishing point from the input argument:
Now we find the corresponding uniform coordinates of the vanishing point using
the inverse intrinsic matrix:
We are ready to find the angle from (6.8). Note that we do all matrix operations
with floating point arrays, and we will use a vx_matrix object only for the output:
Once we know the rotation matrix, we are ready to generate a perspective transfor-
mation that sends the vanishing point to infinity:
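A minimal numeric sketch of that step, assuming Eq. (6.6) has the common form H = K R K⁻¹ (this is an assumption, not the book's exact code): K is the intrinsics matrix, K⁻¹ its inverse, and R a rotation about the camera x-axis by the angle θ obtained from Eq. (6.8).

```c
#include <math.h>

/* out = a * b for 3x3 matrices */
static void mat3_mul(const float a[3][3], const float b[3][3], float out[3][3]) {
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            out[i][j] = 0.0f;
            for (int k = 0; k < 3; ++k)
                out[i][j] += a[i][k] * b[k][j];
        }
}

/* Sketch: H = K * R * K^-1, with R a rotation about the camera x-axis by the
 * angle theta chosen so that the vanishing point maps to infinity (Eq. (6.8)). */
static void birds_eye_homography(const float K[3][3], const float K_inv[3][3],
                                 float theta, float H[3][3])
{
    const float R[3][3] = {
        { 1.0f, 0.0f,         0.0f         },
        { 0.0f, cosf(theta), -sinf(theta)  },
        { 0.0f, sinf(theta),  cosf(theta)  },
    };
    float RK[3][3];
    mat3_mul(R, K_inv, RK);  /* R * K^-1 */
    mat3_mul(K, RK, H);      /* K * R * K^-1 */
}
```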
We also have to make sure that the important part of the image is visible after
this transformation. We will use an affine transformation that maps parallel lines
to parallel lines, but we cannot make it a separate node since if the image is empty
after the perspective transformation node, then the output image will be empty too.
For simplicity, we will construct this mapping as a pan and zoom transformation,
making sure two control points in the input image map inside the output image.
First, we generate the coordinates of the control points in the input image:
Now we generate a pan and zoom transformation that maps these points to the
upper and lower boundaries of the output image and multiply it to the left from the
perspective transformation:
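One simple way to build such a matrix Z (a sketch with illustrative formulas, not necessarily the book's exact choice): take the two control points after they have been mapped by the homography H, zoom so that their vertical span fills the output height, and pan so that the first point lands on the top edge and the pair is centered horizontally. The final mapping is then Z·H, using the 3 × 3 multiply from the previous sketch.

```c
/* Sketch of a pan-and-zoom matrix Z that maps two H-transformed control points
 * so their y-coordinates land on the top and bottom of the output image.
 * The centering choice for x is one simple option, not the only one. */
static void make_pan_zoom(float x1, float y1,       /* first control point after H */
                          float x2, float y2,       /* second control point after H */
                          float out_w, float out_h,
                          float Z[3][3])
{
    float s  = out_h / (y2 - y1);                   /* zoom: span fills the height */
    float ty = -s * y1;                             /* pan: first point to the top edge */
    float tx = 0.5f * out_w - 0.5f * s * (x1 + x2); /* pan: center horizontally */

    Z[0][0] = s;    Z[0][1] = 0.0f; Z[0][2] = tx;
    Z[1][0] = 0.0f; Z[1][1] = s;    Z[1][2] = ty;
    Z[2][0] = 0.0f; Z[2][1] = 0.0f; Z[2][2] = 1.0f;
}
```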
We have obtained the required perspective transformation. Note that OpenVX deals
with the inverse transposed homography transformation (see (6.4)), so we invert and
transpose the matrix before importing it:
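A sketch of that export step (mat3_inverse is an assumed helper, for example backed by LAPACK as the text mentions, and output_matrix is the user node's output vx_matrix):

```c
/* Invert the direct homography (OpenVX expects the output-to-input mapping),
 * transpose it, and copy it into the vx_matrix output parameter. */
float H_inv[3][3], H_vx[3][3];
mat3_inverse(H_final, H_inv);                      /* assumed helper, e.g. via LAPACK */
for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
        H_vx[i][j] = H_inv[j][i];                  /* transpose */
vxCopyMatrix(output_matrix, H_vx, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);
```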
The validation of this user node is implemented in its validation callback function. We check
that the output matrix is floating point and set the corresponding metadata:
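A sketch of such a validator (the callback name and the parameter index of the output matrix are illustrative; OpenVX 1.2+ allows matrix attributes to be set on a vx_meta_format):

```c
/* Validator sketch: publish 3x3 VX_TYPE_FLOAT32 metadata for the output matrix. */
vx_status VX_CALLBACK perspectiveValidator(vx_node node, const vx_reference parameters[],
                                           vx_uint32 num, vx_meta_format metas[])
{
    (void)node; (void)parameters; (void)num;
    vx_enum type = VX_TYPE_FLOAT32;
    vx_size rows = 3, cols = 3;
    vxSetMetaFormatAttribute(metas[2], VX_MATRIX_TYPE,    &type, sizeof(type));
    vxSetMetaFormatAttribute(metas[2], VX_MATRIX_ROWS,    &rows, sizeof(rows));
    vxSetMetaFormatAttribute(metas[2], VX_MATRIX_COLUMNS, &cols, sizeof(cols));
    return VX_SUCCESS;
}
```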
(5)
for small interframe rotations. Here, g(x, y) = 1/Z(x, y) is the inverse scene depth. Clearly, the optical
flow field can be arbitrarily complex, and does not necessarily obey a low-order
global motion model. However, several approximations to (5) exist that reduce the
dimensionality of the flow field. One possible approximation is to assume that
translations are small compared with the distance of the objects in the scene from
the camera. In this situation, image motion is caused purely by camera rotation, and
is given by
(6)
Equation (6) represents a true global motion model, with 3 df (ωx, ωy, ωz). When the
field of view (FOV) of the camera is small (i.e., when |x|, |y| ≪ 1), the second-order
terms can be neglected, giving a further simplified three-parameter global motion
model
(7)
(8)
Substituting (8) into (5) gives the eight parameter global motion model
(9)
for appropriately computed {ai, i = 0 … 7}. Equation (9) is called the pseudo-perspective
model or transformation.
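Since Eq. (9) is not reproduced here, the sketch below uses one common parameterization of the eight-parameter pseudo-perspective flow field; the grouping of the ai may differ from the text's.

```c
/* One common parameterization of the pseudo-perspective (eight-parameter) flow model. */
static void pseudo_perspective_flow(const double a[8], double x, double y,
                                    double *u, double *v)
{
    *u = a[0] + a[1] * x + a[2] * y + a[6] * x * x + a[7] * x * y;
    *v = a[3] + a[4] * x + a[5] * y + a[6] * x * y + a[7] * y * y;
}
```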
Equation (5) relating the optical flow with structure and motion assumes that the
interframe rotation is small. If this is not the case, the effect of camera motion must
be computed using projective geometry [27, 28]. Assume that an arbitrary point in
the 3D scene lies at (X0,Y0,Z0) in the reference frame of the first camera, and moves
to (X1, Y1, Z1) in the second. The effect of camera motion relates the two coordinate
systems according to
(10)
where the rotation matrix [rij] is a function of the interframe rotation angles (ωx, ωy, ωz). Combining (1) and (10) permits the
expression of the projection of the point in the second image in terms of that in the
first as
(11)
Assuming either that (a) points are distant compared to the interframe translation
(i.e., neglecting the effect of translation) or (b) a planar embedding of the real world
(8), the perspective transformation is obtained:
(12)
The flow field (u, v) is the difference between image plane coordinates (x1 − x0,y1 − y0)
across the entire image. When the FOV is small, it can be assumed that |pzx x0|, |pzy y0| ≪
|pzz|. Under this assumption, the flow field, as a function of image coordinate, is
given by
(13)
Other popular global deformations mapping the projection of a point between two
frames are the similarity and affine transformations, which are given by
(14)
(15)
respectively. Free parameters for the similarity model are the scale factor s, the image-plane
rotation θ, and the translation (b0, b1). Taking the difference between interframe
coordinates of the similarity transform gives the optical flow field model (7) with
one constraint on the free parameters. The affine transformation is a superset of the
similarity operator, and incorporates shear and skew as well. The optical flow field
corresponding to the coordinate affine transform (15) is also a 6-df affine model. The
perspective operator is a superset of the affine, as can be readily verified by setting
pzx = pzy = 0 in (12).
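A small sketch of evaluating the perspective coordinate transform of Eq. (12), with the third row written in the pzx, pzy, pzz notation used in the text (the numerator coefficient names are chosen here for illustration); setting pzx = pzy = 0 with pzz = 1 leaves the affine map of Eq. (15).

```c
/* Perspective map of Eq. (12): p[2][0..2] play the roles of pzx, pzy, pzz. */
static void perspective_map(const double p[3][3], double x0, double y0,
                            double *x1, double *y1)
{
    double denom = p[2][0] * x0 + p[2][1] * y0 + p[2][2];  /* pzx*x0 + pzy*y0 + pzz */
    *x1 = (p[0][0] * x0 + p[0][1] * y0 + p[0][2]) / denom;
    *y1 = (p[1][0] * x0 + p[1][1] * y0 + p[1][2]) / denom;
}
```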
The similarity, affine, and perspective transformations are group operators, which
means that each family of transformations constitutes an equivalence class. The
following four properties define group operators:
1. Closure: The composition of any two operators A, B ∈ G is also an operator in G.
2. Associativity: For all A, B, C ∈ G, (AB)C = A(BC).
3. Identity: There exists an identity operator I ∈ G such that AI = IA = A for every A ∈ G.
4. Inverse: For each operator A ∈ G, there exists an inverse A−1 ∈ G such that AA−1 =
A−1A = I.
The utility of the closure property is that a sequence of images can be rewarped to
an arbitrarily chosen “origin” frame using any single class of operators, and flows
computed only between adjacent frames. Since the inverse of each transformation
exists, the origin need not necessarily be the first frame of the sequence. Note
that the pseudo-perspective transformation (9) is not a group operator. Therefore,
to warp an image under a pseudo-perspective global deformation, it is necessary
to register each new image directly to the origin. This can get tricky when the
displacement between them is large, worse yet when the overlap between them is
small.
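A sketch of how the closure and inverse properties are used in practice: homographies are estimated only between adjacent frames, then composed into a mapping from each frame back to a chosen origin frame (the 3 × 3 representation of the operators is an assumption of the example).

```c
/* out = a * b for 3x3 matrices */
static void mul3(const double a[3][3], const double b[3][3], double out[3][3]) {
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            out[i][j] = 0.0;
            for (int k = 0; k < 3; ++k)
                out[i][j] += a[i][k] * b[k][j];
        }
}

/* pairwise[k-1] maps frame k to frame k-1; to_origin[k] maps frame k to frame 0.
 * Closure guarantees each composition is again an operator of the same class. */
static void accumulate_to_origin(const double pairwise[][3][3], int n,
                                 double to_origin[][3][3])
{
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            to_origin[0][i][j] = (i == j) ? 1.0 : 0.0;     /* identity for the origin */
    for (int k = 1; k < n; ++k)
        mul3(to_origin[k - 1], pairwise[k - 1], to_origin[k]);
}
```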
In the process of global motion estimation, each data point is the optical flow
at a specified pixel, described by the data vector (u, v, x, y). For the affine and
pseudo-perspective transformations, it is obvious that the unknowns form a set of
linear equations with coefficients that are functions of the data vector components.
The same is true for the perspective and similarity operators, although not obvious.
For the perspective transform, the denominators of (13) are multiplied out, while
for the similarity transform, the substitutions s0 = s cos θ and s1 = s sin θ give rise
to linear equations. In particular, the coefficients of the unknowns in the linear
equations for the similarity, affine and pseudo-perspective models are functions of
the coordinate (x, y) of the data point. Assuming that errors in the data are present
only in u and v, this implies that errors in the linear system for the similarity, affine
and pseudo-perspective transforms are present only in the “right-hand side.” In
contrast, errors exist in all terms for the perspective model. When errors in u, v are
Gaussian, the least squares (LS) solution of a system of equations of the form (9),
(14), or (15) yields the minimum-mean squared error estimate. For the perspective
case, the presence of errors in the “left-hand side” calls for a total least squares (TLS)
[29] approach. In practice, errors in (u, v) are seldom Gaussian, and simple linear
techniques are not sufficient.
7 Perspective Transformations
The most general linear transformation is the perspective transformation. Lines that
were parallel before perspective transformation can intersect after transformation.
This transformation is not generally useful for tomographic imaging data, but is
relevant for radiologic images where radiation from a point source interacts with
an object to produce a projected image on a plane. Likewise, it is relevant for
photographs where the light collected has all passed through the focal point of the
lens. The perspective transformation also rationalizes the extra constant row in the
matrix formulation of affine transformations. Figure 9 illustrates a two-dimensional
perspective image.
As in the one- and two-dimensional cases, all homogeneous coordinate vectors must
be rescaled to make the last element equal to unity. If the vectors are viewed as
two-dimensional rather than one-dimensional, this means that all real one-dimen-
sional coordinates lie along the two-dimensional line parameterized by the equation
y = 1. Rescaling of vectors to make the final element equal to unity is effectively the
same as moving any point that is not on the line y = 1 along a line through the
origin until it reaches the line y = 1. In this context, a one-dimensional translation
corresponds to a skew along the x-dimension. Since a skew along x does not change
the y coordinate, translations map points from the line y = 1 back to a modified
position on that line. This is illustrated in Fig. 10. In contrast, a skew along y will
shift points off of the line y = 1. When these points are rescaled to make the final
coordinate unity once again, a perspective distortion is induced. This is illustrated
in Fig. 10. The matrix description of a pure skew f along y is
FIGURE 10. The geometry underlying the embedding of a one-dimensional per-
spective transformation into a two-by-two homogeneous coordinate matrix. Real
points are defined to lie along the line y = 1. The upper left shows a one-dimen-
sional object with nine equally spaced subdivisions. Shearing along the x-dimension
does not move the object off of the line y = 1. The coordinates of all of the intervals
of the object are simply translated as shown in the upper right. Shearing along the
y-dimension moves points off of the line y = 1. A point off this line is remapped back
onto y = 1 by projecting a line from the origin through the point. The intersection of
the projection line with the line y = 1 is the remapped coordinate. This is equivalent
to rescaling the transformed vector to make its final coordinate equal to unity.
Projection lines are shown as dashed lines, and the resulting coordinates along y
= 1 are shown as small circles. Note that the distances between projected points
become progressively smaller from right to left. The gray line parallel to the skewed
object intersects the line y = 1 at the far left. This point is the vanishing point of
the transformation. A point infinitely far to the left before transformation will map
to this location. In this case, the skew is not sufficiently severe to move any part
of the object below the origin. Points below the origin will project to the left of the
vanishing point with a reversed order, and a point infinitely far to the right before
transformation will map to the vanishing point. Consequently, at the vanishing
point, there is a singularity where positive and negative infinities meet and spatial
directions become inverted. Two-dimensional perspective transformations can be
envisioned by extending the second real dimension out of the page. Three-dimen-
sional perspective transformations require a four-dimensional space.
so that
As x goes to positive infinity, the transformed coordinate x′ will go to s/f. This corresponds
to the vanishing point in the transformed image. The same vanishing point applies as x goes
to negative infinity. Note that a point with the original coordinate −1/f causes the denominator
to become zero. This corresponds to the intersection of the skewed one-dimensional image with
the y-axis. Points to one side of the value −1/f are mapped to positive infinity, while those
on the other side are mapped to negative infinity. The geometry underlying these relationships
can be seen in Fig. 10. From a practical standpoint, the singularities involving division by
zero, or the projection of the extremes in either direction to the same point, are generally
irrelevant, since they pertain to the physically impossible situation where the direction of
light or radiation is reversed.
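A small numeric sketch of the one-dimensional case discussed above. It assumes the combined homogeneous matrix has the general form [[s, t], [f, 1]] (a pure skew along y corresponds to s = 1, t = 0), so that after rescaling x′ = (s·x + t)/(f·x + 1), which tends to s/f as x goes to plus or minus infinity and blows up at x = −1/f.

```c
#include <stdio.h>

/* x' = (s*x + t) / (f*x + 1): skew the point off the line y = 1, then project
 * it back by rescaling the homogeneous coordinate to unity. */
static double perspective_1d(double s, double t, double f, double x)
{
    double xh = s * x + t;      /* transformed x before rescaling */
    double wh = f * x + 1.0;    /* transformed homogeneous coordinate */
    return xh / wh;
}

int main(void)
{
    double f = 0.25;            /* with s = 1, t = 0 the vanishing point is at s/f = 4 */
    for (double x = -3.0; x <= 3.0; x += 1.0)
        printf("x = %5.1f  ->  x' = %8.3f\n", x, perspective_1d(1.0, 0.0, f, x));
    return 0;
}
```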
In two dimensions, there are two parameters that control perspective. In the matrix
formulation here, they are f and g.
Many other theorems and types of invariant exist, but space prevents more than
a mention of them here. As an extension to the line and conic examples given in
this chapter, invariants have been produced which cover a conic and two coplanar
nontangent lines, a conic and two coplanar points, and two coplanar conics. Of
particular value is the group approach to the design of invariants (Mundy and
Zisserman, 1992a). However, certain mathematically viable invariants, such as those
that describe local shape parameters on curves, are too unstable for use in their full
generality because of image noise. Nevertheless, semidifferential invariants have
been shown (Section 19.5) to be capable of fulfilling essentially the same function.
Next, there is the warning of Åström (1995) that perspective transformations can
produce such incredible changes in shape that a duck silhouette can be projected
arbitrarily closely into something that looks like a rabbit or a circle, hence upsetting
invariant-based recognition.5 Although such reports seem absent from the previous
literature, Åström's work indicates that care must be taken to regard recognition via
invariants as hypothesis formation, which is capable of leading to false alarms.
Further research
Two areas demand further research. First, we are extending the domain of oper-
ations to include more general image operations such as geometric transforma-
tions (scaling, rotation, translation, shearing, and perspective transformations) and
filtering (smoothing, noise reduction, and image enhancement). Second, we want
to derive similar results for other compression techniques, including those that use
interframe coding (for example, H.261 or MPEG).
Finally, there is the possibility of designing a coding scheme that would make
transformations easier. The area of image coding and compression is an active,
current topic of interest in the research community. Many schemes for image coding
have been proposed, including vector quantization, transform coding, and sub-band
coding. Many practical algorithms, such as JPEG and MPEG, are hybrid solutions,
drawing ideas from several techniques. Another area for future research would be
to design a coding scheme that offered compression ratios competitive with current
algorithms, but simplified manipulation of the compressed data.
Wrapped-around points are those that undergo a change of sign of their w com-
ponent due to the perspective transformation. Algebraically, a condition for a
wrapped-around point is to start with a positive w and end with a negative one. We
can think of this as converting the point to definition space and testing its w:
Either way, the result is the same: we take the dot product of the pixel space point
with the fourth column of Tsd. A negative result means it's a wrapped-around point.
We update the inside-out range [ymin, ymax] to a correctly ordered half-infinite range
by simply updating ymin → −∞ to keep [−∞, ymax], or by updating ymax → +∞ to keep
[ymin, +∞].
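A sketch of that test (Tsd and the point come from the text; the row-major layout, with columns indexed 0–3, is an assumption of the example):

```c
typedef struct { double x, y, z, w; } vec4;

/* Dot the homogeneous pixel-space point with the fourth column of Tsd.
 * A negative result means the point's w changed sign: it wrapped around. */
static int is_wrapped_around(const double Tsd[4][4], vec4 p)
{
    double w_def = p.x * Tsd[0][3] + p.y * Tsd[1][3] + p.z * Tsd[2][3] + p.w * Tsd[3][3];
    return w_def < 0.0;
}
```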
W Pleasure, W Fun
Jim Blinn, in Jim Blinn's Corner, 2003
Mathematical Niceties
To simplify things a bit in this discussion, I'm not going to include the y coordinates
in any calculations. The problem can be adequately understood in terms of only
the x, z, and w coordinates, and the reduction in dimensionality will simplify things
considerably.
Next, let's define our coordinate systems. There are three of interest to us:
1. Eye space: All objects are translated so that the eye is at the origin and is looking
down the positive z axis (this, incidentally, is a left-handed coordinate system).
2. Perspective space: This occurs after multiplying points in eye space by a homo-
geneous perspective transformation.
3. Screen space: This occurs after dividing out the w component of the perspective
space points.
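A minimal sketch of that chain in C, restricted to the x, z, and w coordinates as above; the particular perspective matrix is left as a parameter rather than assumed.

```c
typedef struct { double x, z, w; } pt3;   /* homogeneous (x, z, w); y is omitted */

/* Eye space -> perspective space: multiply by a homogeneous perspective matrix P. */
static pt3 eye_to_perspective(pt3 e, const double P[3][3])
{
    pt3 p;
    p.x = P[0][0] * e.x + P[0][1] * e.z + P[0][2] * e.w;
    p.z = P[1][0] * e.x + P[1][1] * e.z + P[1][2] * e.w;
    p.w = P[2][0] * e.x + P[2][1] * e.z + P[2][2] * e.w;
    return p;
}

/* Perspective space -> screen space: divide out the w component. */
static pt3 perspective_to_screen(pt3 p)
{
    pt3 s = { p.x / p.w, p.z / p.w, 1.0 };
    return s;
}
```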
Finally, there is the question of notation. A mathematical symbol can convey a lot
of information if you give it a chance. The mathematical symbols I use here will
designate coordinates of various points in various coordinate systems. The three
things, then, that we want to explicitly convey are
The coordinate system will be a decoration over the letter, as follows: x (a bare
letter) means eye space; one wiggle over the letter means perspective space before
w division; two wiggles mean screen space (perspective space after w division).
Essentially, the number of wiggles over a letter tells how many transformations it
has gone through.
The name of the point will be a subscript.