Aerial image based 3D modeling for HD mapping

It is no longer strange to say that autonomous driving is where the future of the road leads. To get there, we have to jump through all sorts of hoops, and one of them is HD maps.

While typical maps have communicative features intended for people to see and understand, HD maps must be more specific and clear as they are intended for machines. HD maps include not only connections between roads, but also the number and type of lanes, connections between lanes, features of the road surface, and road traffic control devices including markers, traffic lights and signs.

The amount and variety of such information, however, are much greater than in existing maps used for driving navigation. Hence, NAVER LABS came to realize that HD maps cannot be built in the traditional way. By combining our AI and aerial image processing capabilities, NAVER LABS unveiled hybrid HD mapping, which organically integrates city-scale aerial images of a large area with data collected by the mobile mapping system (MMS).

While the MMS acquires specific information on the road and completes an HD map, a 3D model made of aerial images lays the foundation for a balanced map that covers what the MMS cannot see. In this article, we would like to introduce the main process of building a city-scale 3D model based on aerial images.


To build an aerial image based 3D model, we need photogrammetry. Simply put, photogrammetry is the measurement of the three-dimensional (3D) real world from images. In other words, it is about restoring 2D aerial images back into the 3D real world. How, then, can we restore 2D images into 3D? The key is to use disparity. Figure 1 shows the same object photographed from the left and right. By overlapping the two images, we see that the closer an object is to the camera, the greater the disparity, and vice versa.

Using this disparity, we can turn two or more 2D images into 3D, as in the bottom right of Figure 1. For example, an object in an aerial image may be a building roof if it is close to the camera, or the ground if it is far away from the camera.

[Figure 1]
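The link between disparity and distance can be made concrete with the standard formula for a rectified image pair, Z = f * B / d. The sketch below is only an illustration: the focal length, baseline and disparity values are made up, and real aerial images are not a rectified pair like this one.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Distance from the camera (m) for a rectified image pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical values: 10,000 px focal length, 300 m between the two exposures.
FOCAL_PX = 10_000.0
BASELINE_M = 300.0

roof_depth = depth_from_disparity(FOCAL_PX, BASELINE_M, 3_125.0)    # larger disparity
ground_depth = depth_from_disparity(FOCAL_PX, BASELINE_M, 3_000.0)  # smaller disparity

print(f"roof:   {roof_depth:.1f} m from the camera")
print(f"ground: {ground_depth:.1f} m from the camera")
print(f"height of the roof above the ground: {ground_depth - roof_depth:.1f} m")
```

The point with the larger disparity comes out closer to the camera, which is why a roof and the ground next to it can be told apart.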

Once you understand how 2D images turn into 3D, you may notice that the pose (position and orientation) of each image is crucial. As seen in Figure 2, when the pose of an image is incorrect, the same image point is restored to a different location in 3D. Hence, it is essential to accurately adjust the pose of images.

[Figure 2]
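To see how sensitive the result is to pose, here is a minimal sketch in a 2D cross-section (horizontal position and height). It assumes two cameras looking straight down from the same flying height, and all numbers are hypothetical. Only the camera position is perturbed; an orientation error would shift the result in a similar way.

```python
def observe(cam_x, flying_height, point):
    """Simulate what a camera at cam_x measures: tan(viewing angle) of a point."""
    x, height = point
    return (x - cam_x) / (flying_height - height)


def triangulate(cam1_x, cam2_x, flying_height, t1, t2):
    """Intersect the two viewing rays and return (x, height) of the point."""
    depth = (cam2_x - cam1_x) / (t1 - t2)  # distance below the cameras
    return cam1_x + t1 * depth, flying_height - depth


FLYING_HEIGHT = 1000.0
roof_corner = (120.0, 40.0)  # true position: x = 120 m, height = 40 m

t1 = observe(0.0, FLYING_HEIGHT, roof_corner)
t2 = observe(300.0, FLYING_HEIGHT, roof_corner)

# Correct poses recover the true point.
good = triangulate(0.0, 300.0, FLYING_HEIGHT, t1, t2)
# The second camera's position is wrong by 5 m.
bad = triangulate(0.0, 305.0, FLYING_HEIGHT, t1, t2)

print("with correct poses:", good)
print("with a 5 m pose error:", bad)
```

In this toy setup a 5 m error in one camera position moves the reconstructed roof corner by 16 m in height, several times the size of the pose error itself.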

NAVER LABS restored thousands of aerial images into the 3D real world following the procedure in Figure 3.

[Figure 3]

First, the same points in the thousands of images are connected as in Figure 4. This is called image matching, and such connections in images allow us to accurately estimate the pose of images.

[Figure 4]
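The article does not say which matcher was used. As a rough sketch, the code below assumes nearest-neighbor matching of feature descriptors with a ratio test, a common choice for SIFT-like descriptors. The 3-element descriptors and the `ratio` value are hypothetical (real SIFT descriptors have 128 elements).

```python
def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where desc_a[i] matches desc_b[j].

    A match is kept only if the best candidate is clearly closer than the
    second best (ratio test), which rejects ambiguous, repeated patterns.
    """
    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))

    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted(range(len(desc_b)), key=lambda j: sq_dist(da, desc_b[j]))
        if len(ranked) < 2:
            continue
        best, second = ranked[0], ranked[1]
        # Distances are squared, so the ratio threshold is squared as well.
        if sq_dist(da, desc_b[best]) < ratio ** 2 * sq_dist(da, desc_b[second]):
            matches.append((i, best))
    return matches


# Hypothetical descriptors from two overlapping images.
image_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
image_b = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.1], [0.5, 0.5, -0.1]]

matches = match_descriptors(image_a, image_b)
print(matches)
```

The third descriptor in `image_a` has two equally good candidates in `image_b`, so it is dropped rather than matched at random. The surviving pairs are the kind of connections that feed the pose estimation.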

After connecting the images, we estimate the accurate pose of images. This is called bundle adjustment (BA), and this technology is an integral part of photogrammetry. NAVER LABS performed batch BA and was able to accurately estimate the pose of thousands of images. Figure 5 visualizes this BA. While the images have different poses, BA aligns them all to accurate poses. In essence, BA is an optimization.

[Figure 5]
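The core of BA is minimizing the reprojection error, the gap between where a point is observed in an image and where the current pose estimate predicts it. The sketch below is a heavily reduced illustration, not the batch BA described above: it refines the position and height of a single camera in a 2D cross-section with Gauss-Newton while the 3D points are held fixed, whereas real BA refines all poses and points jointly. All names and numbers are hypothetical.

```python
def project(cam_x, cam_h, point):
    """Image measurement (tangent of the viewing angle) of a point (x, height)."""
    x, height = point
    return (x - cam_x) / (cam_h - height)


def refine_pose(cam_x, cam_h, points, observations, iterations=10):
    """Gauss-Newton refinement of one camera's (x, height) from reprojection error."""
    for _ in range(iterations):
        a = b = c = g1 = g2 = 0.0
        for point, obs in zip(points, observations):
            depth = cam_h - point[1]
            r = obs - project(cam_x, cam_h, point)  # reprojection residual
            jx = -1.0 / depth                       # d(projection) / d(cam_x)
            jh = -(point[0] - cam_x) / depth ** 2   # d(projection) / d(cam_h)
            # Accumulate the normal equations (J^T J) * delta = J^T r.
            a += jx * jx
            b += jx * jh
            c += jh * jh
            g1 += jx * r
            g2 += jh * r
        det = a * c - b * b
        cam_x += (c * g1 - b * g2) / det  # solve the 2x2 system directly
        cam_h += (a * g2 - b * g1) / det
    return cam_x, cam_h


points = [(50.0, 0.0), (120.0, 30.0), (200.0, 10.0), (-40.0, 5.0)]
true_pose = (100.0, 1000.0)
observations = [project(*true_pose, p) for p in points]

# Start from a rough pose, for example from GPS, and refine it.
est_x, est_h = refine_pose(90.0, 950.0, points, observations)
print(f"refined pose: x = {est_x:.3f}, height = {est_h:.3f}")
```

Starting 10 m and 50 m away from the true pose, the estimate returns to the pose that generated the observations. In practice the same idea is applied to thousands of images at once, with orientation and the 3D points included as unknowns.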

During BA, NAVER LABS adds one more condition: ground control points (GCPs). GCPs are points in the photographed area whose 3D coordinates have been measured. Why do we need GCPs? The real world (3D) has a pre-defined coordinate system. To place the reconstruction in 3D at its actual location on Earth, taking into account the Earth’s curvature, NAVER LABS used GCPs as constraints when performing BA. Figure 6 illustrates how BA is performed depending on the GCPs.

[Figure 6]
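One way to see why GCPs are needed: the reprojection error alone cannot tell where the model sits in the world. The sketch below uses a flat 2D cross-section (so the Earth's curvature is ignored) and shows that shifting every camera and every point by the same amount leaves the reprojection cost unchanged, while an extra GCP term penalizes the shift. The cost functions, the `weight` parameter and all values are assumptions made for illustration.

```python
def project(cam_x, cam_h, point):
    """Image measurement (tangent of the viewing angle) of a point (x, height)."""
    x, height = point
    return (x - cam_x) / (cam_h - height)


def reprojection_cost(cameras, points, observations):
    """Sum of squared reprojection residuals over all cameras and points."""
    return sum(
        (observations[i][j] - project(cx, ch, p)) ** 2
        for i, (cx, ch) in enumerate(cameras)
        for j, p in enumerate(points)
    )


def gcp_cost(points, gcps, weight=1.0):
    """Penalty for estimated points that drift from their surveyed coordinates.

    `gcps` maps a point index to its surveyed (x, height).
    """
    return weight * sum(
        (points[j][0] - gx) ** 2 + (points[j][1] - gh) ** 2
        for j, (gx, gh) in gcps.items()
    )


cameras = [(0.0, 1000.0), (300.0, 1000.0)]
points = [(120.0, 40.0), (200.0, 0.0)]
observations = [[project(cx, ch, p) for p in points] for cx, ch in cameras]
gcps = {1: (200.0, 0.0)}  # point 1 was surveyed on the ground

# Shift the whole model 25 m sideways.
shifted_cameras = [(cx + 25.0, ch) for cx, ch in cameras]
shifted_points = [(x + 25.0, h) for x, h in points]

print("reprojection cost, original:", reprojection_cost(cameras, points, observations))
print("reprojection cost, shifted: ", reprojection_cost(shifted_cameras, shifted_points, observations))
print("GCP cost, original:", gcp_cost(points, gcps))
print("GCP cost, shifted: ", gcp_cost(shifted_points, gcps))
```

Both models explain the images equally well, but only the original one agrees with the surveyed point. Adding the GCP term to the BA cost is one way to anchor the result to real-world coordinates.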

When the poses of aerial images are aligned, we can calculate the 3D structure of the ground surface.

Digital surface model (DSM)

As explained above, we can calculate the distance to an object projected in the image by using disparity. If you use matching points from commonly used detectors such as SIFT or SURF in images whose poses have been estimated, you can calculate spatial coordinates at the level of individual points. However, to make a 3D model that also covers regions where feature points are missing, too sparse, or mismatched because of repeated patterns, you need a “dense matching” algorithm.

First, as in Figure 7, by varying the distance (depth, expressed as disparity) for each pixel (x, y) of an image (master), we build the cost volume by quantifying how similar that pixel is to the adjacent images (slave) at each depth.

[Figure 7]
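As a minimal sketch of the idea, the code below builds a cost volume for a single image row of a rectified master/slave pair and then picks the cheapest disparity per pixel. It assumes the simplest possible cost (absolute intensity difference) and the convention that master pixel x corresponds to slave pixel x - d. The pixel values are made up.

```python
def build_cost_volume(master, slave, max_disparity):
    """cost[x][d] = |master[x] - slave[x - d]| for one image row.

    Candidates that fall outside the slave row get an infinite cost.
    """
    volume = []
    for x, value in enumerate(master):
        costs = []
        for d in range(max_disparity + 1):
            if x - d >= 0:
                costs.append(abs(value - slave[x - d]))
            else:
                costs.append(float("inf"))
        volume.append(costs)
    return volume


def winner_take_all(volume):
    """Pick, for every pixel independently, the disparity with the lowest cost."""
    return [min(range(len(costs)), key=costs.__getitem__) for costs in volume]


# Hypothetical intensities: from x = 2 onward the master row is the slave row
# shifted right by 2 pixels.
slave_row = [10, 20, 80, 90, 30, 40, 50, 60]
master_row = [5, 7, 10, 20, 80, 90, 30, 40]

volume = build_cost_volume(master_row, slave_row, max_disparity=3)
disparities = winner_take_all(volume)
print(disparities)
```

Away from the left border, every pixel recovers the shift of 2. On real images the per-pixel minimum is much noisier, which is why the cost volume is optimized further instead of being used directly.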

Methods to quantify the matching cost include absolute difference (AD), sum of absolute differences (SAD), normalized cross-correlation (NCC), census (as used with SGM) and DAISY. Because the 3D cost volume filled in this way contains various kinds of noise, global optimization is applied to find the most stable distance, with each pixel influencing its neighbors in 4 or 8 directions. Some of the most commonly used global optimization techniques include belief propagation, semi-global matching (SGM) and graph cuts. Since a single digital image can only yield discrete depths, dense matching results estimated from multiple images of the same area taken from different locations are combined to create a continuous 3D DSM.
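To illustrate how neighboring pixels influence one another, here is a small sketch of SGM-style cost aggregation. It is reduced to one image row and two path directions (left to right and right to left), whereas full SGM aggregates along several directions across the 2D image. The penalties `p1` (disparity changes by 1) and `p2` (larger jumps) and the cost values are hypothetical.

```python
def aggregate_path(volume, p1, p2):
    """Aggregate matching costs along one path through the row, SGM style."""
    aggregated = [list(volume[0])]
    for costs in volume[1:]:
        prev = aggregated[-1]
        prev_min = min(prev)
        row = []
        for d, cost in enumerate(costs):
            candidates = [prev[d], prev_min + p2]    # same disparity, or a big jump
            if d > 0:
                candidates.append(prev[d - 1] + p1)  # disparity changes by one
            if d + 1 < len(prev):
                candidates.append(prev[d + 1] + p1)
            # Subtracting prev_min keeps the values from growing along the path.
            row.append(cost + min(candidates) - prev_min)
        aggregated.append(row)
    return aggregated


def semi_global_costs(volume, p1=2, p2=6):
    """Sum the aggregated costs of the left-to-right and right-to-left paths."""
    forward = aggregate_path(volume, p1, p2)
    backward = aggregate_path(volume[::-1], p1, p2)[::-1]
    return [[f + b for f, b in zip(fr, br)] for fr, br in zip(forward, backward)]


def winner_take_all(volume):
    """Pick, for every pixel independently, the disparity with the lowest cost."""
    return [min(range(len(costs)), key=costs.__getitem__) for costs in volume]


# Hypothetical cost volume (5 pixels x 3 disparities); the true disparity is 1
# everywhere, but noise makes disparity 0 look cheapest at the middle pixel.
noisy_volume = [[5, 0, 5], [5, 1, 5], [2, 3, 6], [5, 0, 5], [5, 1, 5]]

raw = winner_take_all(noisy_volume)
smoothed = winner_take_all(semi_global_costs(noisy_volume))
print("per-pixel minimum:", raw)
print("after aggregation:", smoothed)
```

The isolated outlier at the middle pixel disappears once its neighbors are allowed to vote, because switching disparity for a single pixel costs more than the small gain in matching cost.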

3D model in Seoul

With the above process, we completed the DSM for the entirety of Seoul in 2019. Figure 8 applies different pseudo-colors depending on the height of the DSM, and Figure 9 shows the DSM completed as a 3D model, textured with the aerial images.

NAVER LABS is planning to update the Seoul 3D model and HD map with the latest photographs in 2020 and is preparing a method that effectively manages the life cycle of HD maps.

[Figure 8]

[Figure 9]
