Before jumping into the details of the topic, it is useful to recall how a digital image is formed. Light rays entering the camera are refracted by the lens and fall on the sensor, where their energy is absorbed; after the conversion operations (analog to digital, etc.), color information is produced for each pixel according to the wavelengths of the incoming light. The image is then formed in the digital environment from the intensity values of the pixels and their spatial relationships. The distance traveled by a light ray from the viewed object to the camera cannot be measured with passive visible-range electro-optic cameras such as surveillance cameras. Although the human brain can perceive the depth of a scene and make sense of it, classical computer vision techniques cannot extract this information without extra input. This is where the term perspective, referring to the distance and position of objects in the scene relative to the camera, comes in. Depth information about the objects can be obtained from the two-dimensional image plane when extra information about the scene is supplied.

In this article, we will explain how this is done with the transformation method called homography, and then how the approximate distance between points can be calculated.

**Perspective Free Image**

Resizing an image, rotating it, or even enlarging a part of the image area can all be performed with image transformation operations [1]. These transformations are generally carried out with the help of a transformation matrix. Linear algebra is used extensively both for calculating transformation matrices and for applying them to images for different purposes. The mathematics behind all these processes will not be covered here, to keep the article as simple and practical as possible; however, the relevant terms are given as references for further investigation.

The image in Figure 1 shows an example of transferring points from one two-dimensional plane to another. A question can be asked here: if the corresponding locations of a few points are known in both planes, can the positions of all points after the transformation be determined? The answer is yes. How this can be done with the help of the homography transformation matrix [2] is the main topic of this article. The solution to this example problem can be applied directly to the problem of eliminating the perspective effect in an image.

Now let’s look at how to obtain a perspective-free image by using reference distance information between pixels. In Figure 2, the effect of perspective can be observed on the surface of the building (image taken from [3]) in the original image on the left. In other words, since the right side of the building is farther from the camera than the left side, it occupies a smaller area in the image. By eliminating the perspective effect, the image on the right is obtained. To do this, a piece of 3D-world knowledge is used: the windows are rectangular. We will look for similar information in our examples and try to remove the perspective effect by referring to lines that we know are actually parallel to each other in the 3D world. Since a rectangle contains two pairs of parallel lines perpendicular to each other, finding any rectangle in the image will make our task much easier. The area marked in yellow in the left image of Figure 2 is actually a rectangle, and by using this information we can obtain a perspective-free image.

To explain the fundamentals of the topic, it would be more appropriate to start with a simple grid view with perspective like in Figure 3. In this image, there are many parallel lines and rectangles that we can utilize to calculate the homography matrix and then get rid of the perspective effect.

Let’s select a rectangle in the middle part of the image as in Figure 4 and call its corners P1, P2, P3, and P4, in clockwise order starting from the top left. Since these 4 points are the corners of a rectangle in the 3D world, the original image can be transferred to a perspective-free plane by using them. In the perspective-free image, these 4 points will again form a rectangle, just as in the 3D world. It is important to specify the dimensions of the rectangle as precisely as possible to form an accurate projected image. Here, the reference distance information is given by the square grid cells in the original image: the rectangle is 11 units wide and 8 units high.

(The cv2.getPerspectiveTransform(), cv2.warpPerspective(), and cv2.perspectiveTransform() functions of the OpenCV library will be used for the transformation operations in this post. You can check the OpenCV documentation for the usage of these functions.)

Now we can calculate the homography matrix with the help of the 4 points in the source image and their corresponding locations in the target image. After the homography calculation, we can apply the transformation to the source image with the obtained matrix and construct the image in Figure 5. We can see that the grid cells, which were under the perspective effect, have turned into squares in the transformed image. As the resulting image shows, if we ignore small errors, the perspective effect has been eliminated. The type of image obtained by this process is called a bird’s-eye view in the image processing literature. In other words, we have answered the question of what the scene would look like if we viewed it from above, like a bird, rather than from the camera center.

Let’s look at another example. In Figure 6, we see an image on the left in which the Empire State Building is visible from the ground to the top (image taken from [4]). In this image, the perspective effect is quite obvious. When the image is transformed using reference points on the building, as in the example above, we obtain a perspective-free image like the one on the right. Another observation is that the source and resulting images may differ in size (number of pixels). Therefore, the intensity values of some pixels in the transformed image need to be estimated through interpolation and back-projection techniques.

**Distance Measurement on Image**

In the next step, let’s look into calculating the distance between two points on the image (supposing they lie on the same plane in the real world) with the help of the homography matrix obtained from the operations above. When we repeat the operations applied to the previous images for the image in Figure 7 (taken from the ABODA dataset [5]), we obtain the image in Figure 9. During this process, while specifying the four points representing the rectangle in the source image, the distances between them (the width and height of the rectangle) should be measured or estimated with high accuracy. Since the reconstructed image will be a bird’s-eye view, the ground sample distance of the pixels can be assumed to be approximately equal all over the image. When we add the distance measurement capability to these processes, it would be fair to describe the whole process as a simple geographic camera calibration.

If we assume that the tiles in this example are square-shaped with 40 cm (0.4 m) edges, one dimension of the defined rectangle (the P1-P2 or P3-P4 distance) will be 1.2 meters and the other (the P2-P3 or P4-P1 distance) will be 1.6 meters, as seen in Figure 8.

After the calibration, we obtain a bird’s-eye view image as in Figure 9. If the horizon line (i.e., the vanishing point [6]) can be observed in the image being calibrated, it will not be possible to display all regions of the original image in the transformed one, as was also the case in Figure 6. Since the whole transformed image cannot be shown due to the vanishing point problem, only the area of interest is cropped and shown in Figure 9.

As a result, since the ground sample distance of the pixels is approximately uniform in all regions of the transformed image, the distance between any two points in the image can be calculated easily. Figure 10 shows some distance calculations performed on the image.

The accuracy of the distance measurement depends directly on the precision of the point selection and on the accuracy of the distances specified between the points. For this reason, to increase the accuracy of the process, it is also important that the reference region representing the whole image be large enough and as close to the center of the image as possible.

In addition, lens distortions such as radial distortion, which causes the barrel (or pincushion) effect encountered in fisheye cameras and cameras with wide-angle lenses, also affect the accuracy of the calibration. Effects that break the planar assumption in the image should be corrected before the geographic calibration process; otherwise, errors due to distortion may be observed both in the transformed image and in the distances calculated on it.

In this article, we eliminated the perspective effect in an image by calculating its homography matrix and applying the corresponding transformation. While obtaining the bird’s-eye view, we performed a simple geographic calibration of the image by specifying the measurements of the reference area, which allowed us to measure the distance between any two points on the image with high accuracy. I hope this blog post is helpful for those who want to get an idea about the subject.

Key references are as follows:

[1] Wikipedia contributors. (2022, May 25). *Transformation matrix*. Wikipedia. https://en.wikipedia.org/wiki/Transformation_matrix

[2] Wikipedia contributors. (2021, December 29). *Homography (computer vision)*. Wikipedia. https://en.wikipedia.org/wiki/Homography_(computer_vision)

[3] Hartley, R., & Zisserman, A. (2003). *Multiple view geometry in computer vision*. Cambridge University Press.

[4] Wikipedia contributors. (2022b, May 31). *Empire State Building*. Wikipedia. https://en.wikipedia.org/wiki/Empire_State_Building

[5] *ABODA dataset*. https://github.com/kevinlin311tw/ABODA

[6] Wikipedia contributors. (2022, May 15). *Vanishing point*. Wikipedia. https://en.wikipedia.org/wiki/Vanishing_point