Extracting road layout information from aerial images through deep learning technology

There are several ways to create high definition (HD) maps for autonomous driving vehicles. Among them, NAVER LABS utilizes data collected through both aerial images and mobile mapping system (MMS) vehicles. This is a unique method of hybrid HD mapping that is highly useful because it uses aerial images to extract road layout information, such as road markings and lanes, which is difficult to identify with MMS vehicles alone. However, the process of extracting information from aerial images requires labor and time. Therefore, we sought to address this issue with deep learning-based computer vision technology. The goal was to save cost and time by applying technology that automatically extracts necessary information from aerial images, and this automation technology is currently being applied to extract road marking and lane information. Read on for a brief overview of this process.

Dataset overview

First, manually annotated images of the Pangyo area were used as a training set to teach the deep learning model. As shown in Figure 1, road layout information consists of lanes and road markings. The ground sampling distance (GSD) of the aerial images used was 8 cm per pixel, and the annotation used to extract lane and road marking information was conducted on roads with four or more lanes running both ways.

[Figure 1]

Aerial images are provided as one large image after complex post-processing. Obstacles that appear in the images, such as vehicles and roadside trees, make it harder to train the deep learning model. For this reason, a digital elevation map (DEM) of the ground is used when merging adjacent images, so that obstacles are removed from the road surface to the greatest extent possible. Deep learning models trained with these images have allowed us to partially automate the process of annotating aerial images of the Seoul area.

Road marking recognition

The technology for automatically extracting road markings was built in two stages: first a classifier, then a detector. The types of road markings are shown in the figure below. Among them, recognition accuracy was relatively low for crosswalks, bike roads and text-based road markings, because they have few training samples or vary in size and appearance.

Stage one was the development of a road marking classifier intended to partially automate the road marking annotation process. Previously, workers annotated each road marking by manually drawing a bounding box around it and then selecting its type. With the newly developed classifier, workers only need to draw the boxes. The classifier processes the boxed road markings in a batch and automatically classifies them by type, thereby reducing man-hours. The differences in the annotation process before and after application of the road marking classifier are illustrated in the following video clip.

Using the previously annotated road markings as a training set, the deep learning model achieved 98.32% classification accuracy. Because annotation is carried out on regular PCs that are not equipped with a GPU, we used a lightweight neural network that can run in real time on platforms with limited computational resources. Applying this road marking classifier reduced the time required for the classification process by 49.27% and the total work time by 6.15%.
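
The two reductions reported above can be combined to estimate how much of the total annotation time was spent on classification. The sketch below is a back-of-the-envelope calculation, not a figure from the article: it assumes that only the classification step became faster and that all other steps took the same time as before.

```python
# Reductions reported in the article.
classification_time_reduction = 0.4927  # time saved within the classification step
total_time_reduction = 0.0615           # time saved across the whole annotation job

# Assumption (not stated in the article): only the classification step got faster,
# so total saving = classification share * classification saving.
classification_share = total_time_reduction / classification_time_reduction

print(f"implied share of classification in total work time: {classification_share:.1%}")
```

Under that assumption, classification accounted for roughly one eighth of the total work time before the classifier was introduced, which is consistent with a large per-step saving producing a modest overall saving.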

Stage two is the development of a road marking detector, whose ultimate goal is to fully automate the annotation process. With the detector, every annotation step is carried out automatically, from locating road markings with bounding boxes to classifying them. When the detector is used for partial automation instead, workers manually inspect the automatic recognition results as a final step. Unlike conventional object detection, the road marking detector must also estimate the angle of each box. To this end, we evaluated and compared recently published models and adopted the best-performing one [1], which achieved a mean Average Precision (mAP) of 0.70 and a mean absolute angle error of 3.08 degrees. An example of automatically detected road markings is provided in the figure below. This road marking detector will be applied to the annotation process in the future.

[Figure 2]
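
To make the angle estimate and its error metric concrete, here is a minimal Python sketch of a rotated box and a mean absolute angle error. The article does not say how the angle error is computed, so the function names, the box parameterization (center, width, height, angle) and the 180-degree period are assumptions made for illustration.

```python
import math

def rotated_box_corners(cx, cy, width, height, angle_deg):
    """Return the four corners of a box centered at (cx, cy), rotated by angle_deg."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in ((-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)):
        x, y = dx * width, dy * height
        # Rotate the corner offset, then translate it to the box center.
        corners.append((cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t))
    return corners

def angle_error_deg(predicted_deg, target_deg, period=180.0):
    """Smallest absolute difference between two angles, modulo `period`."""
    diff = abs(predicted_deg - target_deg) % period
    return min(diff, period - diff)

def mean_absolute_angle_error(predicted, target, period=180.0):
    """Average the wrapped angle errors over paired predictions and targets."""
    errors = [angle_error_deg(p, t, period) for p, t in zip(predicted, target)]
    return sum(errors) / len(errors)

# Hypothetical predictions and ground-truth angles, in degrees.
print(mean_absolute_angle_error([10.0, 179.0], [13.0, 1.0]))
```

The period of 180 degrees treats a rectangle and its half-turn as the same box, so 179 and 1 degrees are only 2 degrees apart. For markings whose direction matters, such as arrows, a period of 360 degrees may be the more appropriate choice.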

Lane recognition

Lane recognition from aerial images poses several unique technical challenges. Because the resolution of the aerial images is 8 cm per pixel, a lane line is only 1 to 2 pixels wide, so recognizing it without error requires a much higher degree of precision than typical semantic segmentation. Additionally, in the current lane annotation scheme, even broken segments are marked as one connected line, so the pixelwise lane segmentation results must be converted to a vectorized form of polylines. Furthermore, lanes that look alike, such as centerlines and no-stopping or no-parking lines, differ in type depending on their location.
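
The first challenge follows from simple arithmetic on the ground sampling distance. The sketch below converts a painted line width into pixels. The 8 cm per pixel GSD comes from the article, while the 10 cm and 15 cm line widths are hypothetical values chosen for illustration.

```python
GSD_CM_PER_PIXEL = 8.0  # ground sampling distance stated in the article

def width_in_pixels(width_cm, gsd_cm_per_pixel=GSD_CM_PER_PIXEL):
    """Convert a ground width in centimeters to a width in image pixels."""
    return width_cm / gsd_cm_per_pixel

# Hypothetical painted line widths; the article does not give them.
for width_cm in (10.0, 15.0):
    print(f"{width_cm:.0f} cm -> {width_in_pixels(width_cm):.2f} px")
```

Both widths land between 1 and 2 pixels, so a segmentation error of a single pixel can erase or displace most of a line.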

To address these challenges, we took a step-by-step approach, just as we did for road markings. To handle both the partial and complete automation scenarios of the annotation process, a deep learning model was constructed to learn semantic segmentation and lane classification simultaneously, as shown in Figure 3. In the partial automation scenario, when a worker marks a lane as a line, the classifier automatically classifies that lane. In the complete automation scenario, the semantic segmentation results are used to automatically locate each lane and classify its type.

[Figure 3]

To improve the accuracy of semantic segmentation, we used the GFF (Gated Fully Fusion) [2] structure in our deep learning model, which enabled us to achieve a lane classification accuracy of 91.5% and a segmentation accuracy of 0.84. An example of lane recognition results is provided in Figure 4.

[Figure 4]

The complete automation scenario ends with a post-processing step, shown in Figure 5, that converts the pixelwise lane recognition results to a vectorized form of polylines. First, the pixelwise semantic map obtained through recognition is replaced with a binary map for each lane type. Next, a bilateral filter and a skeletonization algorithm are applied to eliminate noise. The remaining pixels are clustered based on a KD tree, and each cluster is then expressed as a continuous, simplified polyline through the Douglas–Peucker algorithm. The lane recognition technology developed in this manner will also be applied to future annotations.

[Figure 5]
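
The last step of this pipeline, polyline simplification, can be sketched in a few lines of Python. This is a generic textbook version of the Douglas–Peucker algorithm, not the implementation used by NAVER LABS. It assumes the pixels of one cluster have already been ordered along the lane, and the tolerance `epsilon` and the sample points are hypothetical.

```python
import math

def point_to_segment_distance(point, start, end):
    """Distance from `point` to the line segment between `start` and `end`."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - sx, py - sy)
    # Project the point onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - sx) * dx + (py - sy) * dy) / length_sq))
    return math.hypot(px - (sx + t * dx), py - (sy + t * dy))

def douglas_peucker(points, epsilon):
    """Simplify an ordered polyline, dropping points within epsilon of the chord."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    max_dist, max_index = 0.0, 0
    for i in range(1, len(points) - 1):
        dist = point_to_segment_distance(points[i], start, end)
        if dist > max_dist:
            max_dist, max_index = dist, i
    if max_dist <= epsilon:
        return [start, end]
    # Keep the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[:max_index + 1], epsilon)
    right = douglas_peucker(points[max_index:], epsilon)
    return left[:-1] + right

# Hypothetical skeleton pixels: a slightly noisy horizontal run, then a vertical run.
skeleton = [(0, 0), (1, 0.05), (2, 0), (3, 0.05), (4, 0), (4, 1), (4, 2), (4, 3)]
print(douglas_peucker(skeleton, epsilon=0.5))
```

With this tolerance, the eight sample points reduce to the two end points and the corner at (4, 0). A larger `epsilon` yields fewer vertices at the cost of fidelity to the original skeleton.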

Integration of road marking recognition and lane recognition modules

Road layout information is generated by integrating and complementing the results of both road marking recognition and lane recognition modules. It is done in this manner because simply selecting the road marking bounding box detected by the road marking recognition module and selecting the lane vector extracted by the lane recognition module may lead to errors or duplications, making it difficult to express the road layout properly.

The lane recognition module uses semantic segmentation to segment not only lanes but also areas containing road markings. When both modules work properly, the location of a bounding box detected by the road marking recognition module will also be segmented as a road marking area by the lane recognition module. However, because the information obtained from the two modules may not always match, the boxes from the road marking recognition module are projected onto the result images from the lane recognition module, and the final results are generated from the aggregated probability vectors inside the boxes. For example, if a road marking is detected only by the road marking recognition module and not by the lane recognition module, it is considered valid only if the aggregated vector for its pixels has the highest probability for the "road marking" class. Otherwise, it is regarded as an error of the road marking recognition module and is deleted. Additionally, if a road marking and a lane are recognized as overlapping in the same position, the object with the higher confidence score from its corresponding module is used.
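
The validation rule described above can be sketched as follows. This is a simplified illustration under several assumptions that the article does not confirm: the class names are hypothetical, aggregation is taken to be a plain average of the per-pixel probability vectors, boxes are axis-aligned rather than rotated, and ties in confidence are resolved in favor of the road marking.

```python
ROAD_MARKING = "road_marking"                    # hypothetical class name
CLASSES = ("background", "lane", ROAD_MARKING)   # hypothetical class list

def aggregate_probabilities(prob_map, box):
    """Average the per-pixel class probability vectors inside a box.

    prob_map[y][x] is a sequence of class probabilities.
    box is (x0, y0, x1, y1) with x1 and y1 exclusive (axis-aligned for simplicity).
    """
    x0, y0, x1, y1 = box
    totals = [0.0] * len(CLASSES)
    count = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            for k, p in enumerate(prob_map[y][x]):
                totals[k] += p
            count += 1
    if count == 0:
        raise ValueError("box covers no pixels")
    return [t / count for t in totals]

def is_valid_road_marking(prob_map, box):
    """Accept the box only if 'road marking' is the most probable aggregated class."""
    mean_probs = aggregate_probabilities(prob_map, box)
    best = max(range(len(CLASSES)), key=lambda k: mean_probs[k])
    return CLASSES[best] == ROAD_MARKING

def resolve_overlap(marking, lane):
    """Keep whichever overlapping object has the higher confidence score."""
    return marking if marking["score"] >= lane["score"] else lane

# Hypothetical 2 x 3 probability map: left column is background, the rest is road marking.
bg, rm = [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]
prob_map = [[bg, rm, rm], [bg, rm, rm]]
print(is_valid_road_marking(prob_map, (1, 0, 3, 2)))
print(is_valid_road_marking(prob_map, (0, 0, 1, 2)))
```

Averaging is only one possible way to aggregate. A production system working with rotated boxes would also need to rasterize each box into a pixel mask before aggregating.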

We have introduced the deep learning-based computer vision technology developed to automatically extract road marking and lane information from aerial images. In addition to road markings and lanes, aerial images contain a wide variety of information regarding road structures. As aerial images are expected to attract growing attention and demand for their usefulness in creating high-precision maps, NAVER LABS will continue to research and develop automation technology for extracting meaningful information from aerial images, and to accumulate these foundational technologies for creating maps for autonomous driving.


[1] Yang et al., SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects, ICCV 2019
[2] Li et al., GFF: Gated Fully Fusion for Semantic Segmentation, arXiv:1904.01803, 2019
