Birds Eye View
Background
When cameras are mounted on a vehicle, the captured images exhibit strong perspective distortion, as shown in Fig. 1(a). This perspective effect makes it difficult for drivers—and computer vision algorithms—to accurately judge distances, detect obstacles, or perform reliable image analysis. To address this, the camera views must be geometrically transformed into a top-down, bird’s-eye representation, as illustrated in Fig. 1(b). By applying perspective transformation (or inverse perspective mapping), the raw images are converted into a normalized bird’s-eye view, shown in Fig. 1(c), which enables accurate perception, measurement, and scene understanding for both human operators and automated systems. [3]

To generate the output bird's-eye-view image, a geometric transformation based on a projection matrix is applied. This matrix defines the mapping between each output pixel (x,y) in the bird's-eye view and its corresponding pixel (u,v) in the original camera image. By using this projection relationship, the system reprojects the ground-plane region from the input image into a top-down perspective. The transformation is expressed as:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

where $H$ is a 3×3 projection (homography) matrix and $s$ is a scale factor arising from the homogeneous representation.
This transformation is commonly referred to as Inverse Perspective Mapping (IPM) [4]. IPM takes a frontal-view image as input, applies a homography, and generates a top-down (bird’s-eye) view of the scene by mapping pixels onto a 2D ground-plane frame. In the region immediately surrounding a vehicle—where the road surface can be approximated as planar—IPM works effectively and produces geometrically meaningful top-down projections.
However, IPM introduces distortions for objects farther from the camera because the mapping is non-homogeneous. Vertical structures such as vehicles, pedestrians, poles, or curbs become stretched, skewed, or compressed, as illustrated in the left image of Figure 2. These artifacts limit the accuracy and usable range of applications that depend solely on basic IPM.
More advanced post-transformation techniques, such as Incremental Spatial Transformer [5], can significantly improve geometric consistency at longer distances, producing results similar to the right image in Figure 2. For the purposes of this project, however, we use a standard IPM approach to illustrate the core concepts.
When IPM is applied without additional correction or refinement, the method relies on three key assumptions:
- The camera remains in a fixed pose relative to the ground plane.
- The road or ground surface is approximately planar.
- The ground surface is free of obstacles or significant height variations.

To apply IPM correctly and to generate an accurate bird’s-eye-view projection, we must understand how a camera maps 3D points in the world onto 2D image coordinates. This relationship is defined by the camera model, which combines intrinsic parameters, extrinsic parameters, and the geometry of the ground plane. From this formulation, we can derive a homography matrix that transforms ground-plane pixels from the input image into their corresponding positions in the bird’s-eye-view frame. The following section describes this camera model and the homography used to compute the transformation.
Camera Model and Homography
A pinhole camera can be modeled as a geometric device that projects a 3D point in the world onto a 2D point in the image plane. This projection is described by the general camera equation [6]:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{1}$$

Here, $P$ is the 3×4 camera projection matrix, which maps a 3D world coordinate (X,Y,Z) into an image coordinate (u,v) in homogeneous form, with $s$ an arbitrary scale factor.
Projection Matrix Decomposition
The projection matrix can be decomposed into intrinsic and extrinsic parameters:

$$P = K \, [\, R \mid t \,] \tag{2}$$

Where:
- $K$ is the intrinsic matrix, containing focal length and principal point parameters.
- $R$ is a rotation matrix describing camera orientation.
- $t$ is a translation vector describing camera position.
- Together, $[\, R \mid t \,]$ expresses the camera pose relative to the world coordinate frame.
This decomposition is essential because BEV requires an accurate mapping between ground-plane points and image pixels.
Mapping Points on a Plane
Any point lying on the ground plane can be parameterized in homogeneous form as:

$$\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = M \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{3}$$

Where:
- $(x, y, 1)^T$ represents a point in ground-plane coordinates.
- $M$ is a 4×3 matrix that embeds the plane equation into homogeneous coordinates (here written for the canonical ground plane Z = 0).
This formulation constrains all 3D points to lie on a planar surface—an assumption required by standard Inverse Perspective Mapping (IPM).
Deriving the Homography
Substituting (3) into the camera equation (1), we obtain:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P M \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{4}$$

This shows that the mapping between a point on the ground plane and its image projection is governed by the 3×3 homography matrix $H = PM$.
Homography is the mathematical foundation that enables Bird’s Eye View (BEV) systems to warp perspective images into a top-down representation.
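To make the derivation concrete, here is a small Python/OpenCV sketch (an illustration, not the project's implementation) that builds the ground-plane homography $H = K\,[r_1\; r_2\; t]$ from assumed intrinsic and extrinsic parameters and warps a frontal image into a top-down view; all numeric values and the file name front_view.png are placeholders:

```python
import cv2
import numpy as np

# Assumed intrinsics: focal lengths and principal point
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])

# Assumed extrinsics: camera pitched 60 degrees toward the ground, 2 m up
R, _ = cv2.Rodrigues(np.array([np.deg2rad(-60.0), 0.0, 0.0]))
t = np.array([[0.0], [0.0], [2.0]])

# For ground-plane points (Z = 0), P*M keeps only the first two columns
# of R plus the translation, so H = K [r1 r2 t]
H = K @ np.hstack([R[:, 0:1], R[:, 1:2], t])

# H maps ground-plane coordinates to image pixels; warpPerspective with
# inv(H) therefore resamples the image into the bird's-eye-view frame.
# In practice a scale-and-offset matrix is also composed with H so that
# metric ground coordinates land inside the output image.
frame = cv2.imread("front_view.png")
bev = cv2.warpPerspective(frame, np.linalg.inv(H), (800, 800))
```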
RidgeRun BEV Demo Workflow
The following image shows the basic processing path used by the BEV system; a minimal code sketch of the same path follows the list.
- Capture camera frames: obtain frames from the video sources (four cameras with a 180-degree field of view).
- Resize frames: resize the input frames to the desired input size.
- Remove lens distortion: apply a lens-undistortion algorithm to remove the fisheye effect.
- Perspective transformation: apply Inverse Perspective Mapping (IPM).
- Perspective mapping: map the points from the front-view image onto a 2D top-view (BEV) image.
- Enlarging: enlarge the image to cover the required region of interest (ROI).
- Cropping: crop the output of the enlargement step to obtain the needed view.
- Create the final top-view image.
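As a rough per-camera illustration of these steps (not the library's actual code), the following Python/OpenCV sketch chains them together; K, D, the homography H, and the scale and ROI values are all placeholder assumptions:

```python
import cv2
import numpy as np

def bev_pipeline(frame, K, D, H, input_size=(1280, 720),
                 scale=1.5, roi=(100, 100, 600, 600)):
    """One BEV iteration for a single camera; every parameter is a placeholder."""
    # 1. Resize the captured frame to the working input size
    frame = cv2.resize(frame, input_size)

    # 2. Remove fisheye distortion using the calibration results K and D
    frame = cv2.fisheye.undistortImage(frame, K, D, Knew=K)

    # 3. Perspective transformation/mapping: warp the front view to a top view
    h, w = frame.shape[:2]
    top = cv2.warpPerspective(frame, H, (w, h))

    # 4. Enlarging: scale up so the required region of interest is covered
    top = cv2.resize(top, None, fx=scale, fy=scale)

    # 5. Cropping: keep only the needed view
    x, y, cw, ch = roi
    return top[y:y + ch, x:x + cw]
```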


Removing fisheye distortion
Wide-angle cameras are needed to obtain the maximum field of view required for the Bird's Eye View perspective. These cameras usually produce fisheye images, which must be transformed to remove the distortion. After the camera calibration process is complete, the following parameters are obtained:
- K: Output 3x3 floating-point camera matrix
- D: Output vector of distortion coefficients
As stated in the OpenCV documentation, let P be a point in 3D with coordinates X in the world reference frame (stored in the matrix X). The coordinate vector of P in the camera reference frame is

$$X_c = R X + T$$

where R is the rotation matrix corresponding to the rotation vector om: R = rodrigues(om); call x, y and z the three coordinates of $X_c$. The pinhole projection coordinates of P are $[a; b]$ where

$$a = x/z, \qquad b = y/z, \qquad r^2 = a^2 + b^2, \qquad \theta = \arctan(r)$$

The fisheye distortion is defined by:

$$\theta_d = \theta \left(1 + k_1 \theta^2 + k_2 \theta^4 + k_3 \theta^6 + k_4 \theta^8\right)$$

The distorted point coordinates are $[x'; y']$ where

$$x' = \frac{\theta_d}{r}\, a, \qquad y' = \frac{\theta_d}{r}\, b$$

Finally, conversion into pixel coordinates yields the final pixel position $[u; v]$, where $u = f_x (x' + \alpha y') + c_x$ and $v = f_y y' + c_y$.
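Assuming the calibration was produced with OpenCV's fisheye module, the undistortion step might look like the sketch below; K, D, and the file name are placeholder values:

```python
import cv2
import numpy as np

# Placeholder calibration output (normally produced by cv2.fisheye.calibrate)
K = np.array([[400.0,   0.0, 640.0],
              [  0.0, 400.0, 360.0],
              [  0.0,   0.0,   1.0]])
D = np.array([-0.05, 0.01, 0.0, 0.0])  # k1..k4 fisheye coefficients

img = cv2.imread("fisheye_frame.png")
h, w = img.shape[:2]

# Estimate a new camera matrix that trades field of view against cropping
new_K = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(
    K, D, (w, h), np.eye(3), balance=0.0)

# Precompute the undistortion maps once, then remap every incoming frame
map1, map2 = cv2.fisheye.initUndistortRectifyMap(
    K, D, np.eye(3), new_K, (w, h), cv2.CV_16SC2)
undistorted = cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)
```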


Perspective Mapping
The first step is to apply the perspective mapping to the input image, transforming the angled front view into a top view. The enlargement step then matches the perspective of the output images across cameras, and the crop step captures the required region of interest, as seen in the following image.

Process Optimization
The process described so far is based on state-of-the-art papers and research on how to apply IPM and other techniques to generate a top-view image. The minimal steps always include perspective mapping to obtain the IPM result, enlargement to match the scales of the different cameras, and cropping to keep only the required part of the image. These steps work well for a single image in an isolated environment, analyzed to produce a single output frame. However, when the process must be repeated for every frame at high frame rates, the system load can become too high, and the achieved frame rate may therefore be lower than expected.
To enhance the performance of the library, we designed a simple method that wraps all of the steps into a single warp-perspective operation. In computer vision, two images of the same planar surface in space can be related by a homography matrix. This relationship has many practical applications, such as image rectification, image registration, and computing the rotation, scaling, and translation between two images. We rely on the last of these to speed up the process.
From an input image we need to map each source point to a destination point. The library uses a chessboard to collect the inner corners, automatically adjusts them into a perfectly square grid, and generates the output image shown in the following figure.

The relationship between the points is defined by:

$$s \begin{bmatrix} x_{dst} \\ y_{dst} \\ 1 \end{bmatrix} = H \begin{bmatrix} x_{src} \\ y_{src} \\ 1 \end{bmatrix}$$
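A sketch of how such point correspondences could be collected and converted into a homography with OpenCV (the 9×6 board size, grid spacing, and file name are assumptions):

```python
import cv2
import numpy as np

img = cv2.imread("chessboard_view.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect the inner corners of an assumed 9x6 chessboard
found, corners = cv2.findChessboardCorners(gray, (9, 6))
assert found, "chessboard not detected"
src = corners.reshape(-1, 2)

# Ideal destination grid: a perfectly square 9x6 lattice, 50 px per square
# (assumes the detector returned corners row by row, left to right)
square, offset = 50, 100
dst = np.array([[x * square + offset, y * square + offset]
                for y in range(6) for x in range(9)], dtype=np.float32)

# Estimate the homography relating the detected corners to the square grid
H, _ = cv2.findHomography(src, dst)
top = cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))
```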
The main objective is to focus on how to adjust the homography matrix so that it absorbs all the previous steps. With several geometric transformations, the library also allows the user to fine-tune the final image. Matrix multiplication allows quick and easy modification of the final homography; the operation is therefore defined by $T_o$ (move image to origin), $S$ (scale image), $T_b$ (move image back to original position), $T_m$ (move image to fine-adjust position), and $R_\theta$ (rotate image). The final homography ($H_f$) is defined by:

$$H_f = R_\theta \, T_m \, T_b \, S \, T_o \, H$$

where $T_o$, $S$, $T_b$, $T_m$, $R_\theta$ are:

$$T_o = \begin{bmatrix} 1 & 0 & -CenterX \\ 0 & 1 & -CenterY \\ 0 & 0 & 1 \end{bmatrix}, \quad S = \begin{bmatrix} ScaleX & 0 & 0 \\ 0 & ScaleY & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad T_b = \begin{bmatrix} 1 & 0 & CenterX \\ 0 & 1 & CenterY \\ 0 & 0 & 1 \end{bmatrix},$$

$$T_m = \begin{bmatrix} 1 & 0 & MoveX \\ 0 & 1 & MoveY \\ 0 & 0 & 1 \end{bmatrix}, \quad R_\theta = \begin{bmatrix} \cos(Angle) & -\sin(Angle) & 0 \\ \sin(Angle) & \cos(Angle) & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
The input parameters for each matrix are:
- CenterX: Calculated X center between the inner corners of the chessboard
- CenterY: Calculated Y center between the inner corners of the chessboard
- ScaleX, ScaleY: User defined scale factor for the output image
- MoveX, MoveY: User defined amount of image movement adjustment
- Angle: User defined rotation angle
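A sketch of this composition in NumPy, assuming the parameter names above and an already-computed IPM homography H:

```python
import numpy as np

def fine_tune_homography(H, center_x, center_y, scale_x, scale_y,
                         move_x, move_y, angle_rad):
    """Fold the scale, translation, and rotation adjustments into one matrix."""
    T_o = np.array([[1, 0, -center_x], [0, 1, -center_y], [0, 0, 1]], float)
    S   = np.array([[scale_x, 0, 0], [0, scale_y, 0], [0, 0, 1]], float)
    T_b = np.array([[1, 0, center_x], [0, 1, center_y], [0, 0, 1]], float)
    T_m = np.array([[1, 0, move_x], [0, 1, move_y], [0, 0, 1]], float)
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R   = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], float)
    # Compose right to left: H is applied first, the rotation last
    return R @ T_m @ T_b @ S @ T_o @ H
```

Because everything collapses into a single 3×3 matrix, each frame then requires only one warp-perspective call instead of separate warp, resize, and crop passes.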