[Mapping & Localization Challenge] Dataset Building Process and the Visual Localization Baseline Techniques

NAVER LABS launched the Mapping & Localization Challenge on April 8 to raise awareness of the importance of image-based localization technology, which is taking the world by storm, and to support university researchers’ studies across Korea. Challenge participants compete on the accuracy of visual localization (VL) in two tracks: indoor and outdoor. VL is a technology that enables high-precision localization in places where Global Positioning System (GPS) signals are weak, such as indoor spaces, skyscraper-filled city centers, and tunnels, by accurately estimating a six-degrees-of-freedom (6DoF) pose using only camera sensors.

NAVER LABS is providing its latest self-produced datasets, which have been used for actual research, to all participants of the NAVER LABS Mapping & Localization Challenge. This article introduces how the challenge is run as well as how the indoor and outdoor datasets disclosed to researchers were created.

 

1. Dataset Building Process

1) Indoor dataset building: NAVER LABS’ LiDAR SLAM

First, the building process for the indoor dataset. NAVER LABS generates indoor maps with a mapping robot called M1X, whose main body is equipped with various cameras, smartphones, and high-precision LiDAR sensors. NAVER LABS has also developed a backpack-type mapping device called COMET for spaces with irregular surfaces such as stairs.

An integral technology for mapping is NAVER LABS’ own high-precision LiDAR SLAM (LiDAR Simultaneous Localization And Mapping). One of its biggest advantages is that it can correct distortions in trajectory estimates by computing LiDAR-based odometry in environments where wheel odometry cannot be obtained. This estimated odometry provides initial trajectories between sequential LiDAR scans, enabling more precise and robust mapping.

Indoor mapping using LiDAR odometry
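
To make the idea concrete, here is a minimal sketch of how a LiDAR-odometry prediction can serve as the initial trajectory for scan-to-scan registration, refined here with Open3D’s ICP. This is only an illustration, not NAVER LABS’ actual SLAM pipeline; the constant-velocity `predict_odometry` helper is a hypothetical stand-in for any odometry model.

```python
import numpy as np
import open3d as o3d

def predict_odometry(prev_relative_pose: np.ndarray) -> np.ndarray:
    """Constant-velocity prior: assume the last relative motion repeats."""
    return prev_relative_pose

def register_scans(source: o3d.geometry.PointCloud,
                   target: o3d.geometry.PointCloud,
                   init_pose: np.ndarray) -> np.ndarray:
    """Refine the odometry prior with point-to-point ICP."""
    result = o3d.pipelines.registration.registration_icp(
        source, target,
        max_correspondence_distance=0.5,  # meters; tune per sensor
        init=init_pose,                   # odometry-predicted initial trajectory
        estimation_method=o3d.pipelines.registration.
            TransformationEstimationPointToPoint())
    return result.transformation          # 4x4 relative pose, source -> target
```

Starting ICP from the odometry prediction rather than the identity is what keeps registration stable when consecutive scans are far apart or the environment is repetitive.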

In addition, a method of merging the wide array of data collected at different times into a single map is essential. NAVER LABS employs loop closure, a method of recognizing a previously visited location and updating the map estimate accordingly. Loop closure based on LiDAR data enables very stable and precise matching between datasets.

Data matching through loop closure
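
The sketch below illustrates the loop-closure idea with Open3D’s pose-graph optimizer: sequential odometry edges are treated as reliable, while loop-closure edges between revisited places are marked uncertain so that wrong matches can be pruned during optimization. All inputs are hypothetical placeholders; this is a conceptual illustration, not NAVER LABS’ implementation.

```python
import open3d as o3d

def optimize_trajectory(absolute_poses, relative_poses, loop_edges):
    """Pose-graph optimization with loop closure (illustrative only).

    absolute_poses : list of 4x4 world-from-scan pose matrices (initial guess)
    relative_poses : list of 4x4 scan_i -> scan_{i+1} transforms from odometry
    loop_edges     : list of (i, j, 4x4 transform) for recognized revisits
    """
    graph = o3d.pipelines.registration.PoseGraph()
    for pose in absolute_poses:
        graph.nodes.append(o3d.pipelines.registration.PoseGraphNode(pose))
    # Sequential odometry edges: assumed reliable.
    for i, rel in enumerate(relative_poses):
        graph.edges.append(o3d.pipelines.registration.PoseGraphEdge(
            i, i + 1, rel, uncertain=False))
    # Loop-closure edges: marked uncertain so bad matches can be pruned.
    for i, j, rel in loop_edges:
        graph.edges.append(o3d.pipelines.registration.PoseGraphEdge(
            i, j, rel, uncertain=True))
    o3d.pipelines.registration.global_optimization(
        graph,
        o3d.pipelines.registration.GlobalOptimizationLevenbergMarquardt(),
        o3d.pipelines.registration.GlobalOptimizationConvergenceCriteria(),
        o3d.pipelines.registration.GlobalOptimizationOption(
            max_correspondence_distance=0.5, edge_prune_threshold=0.25))
    return [node.pose for node in graph.nodes]  # corrected trajectory
```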

As seen in the following, high-precision maps are created through processes such as mapping via M1X and COMET, LiDAR SLAM, and loop closure. The high-precision data contained in the map makes it possible to accurately estimate camera poses.

 

2) Outdoor dataset building: distortion correction and high-precision localization data

Next, the building process for the outdoor dataset. The outdoor dataset was extracted from NAVER LABS’ HD map production process, which combines aerial photographs with mobile mapping system (MMS) data collected by NAVER LABS’ in-house developed MMS vehicle called R1. R1 collects image and geometric data using its multiple cameras and LiDAR sensors. This outdoor dataset contains stereo camera images taken in front of the vehicle and omnidirectional geometric data collected by LiDAR sensors mounted on top. Two features of this dataset are worth mentioning.

First is distortion correction during geometric data collection. This outdoor dataset consists of geometric data collected by R1’s LiDAR sensors over more than 5 hours of driving in Pangyo and Yeouido; approximately 130,000 frames of 3D point cloud data, excluding the vehicle’s idle time, will be provided to the challenge participants. Unfortunately, raw LiDAR data collected from a moving vehicle may suffer from 3D geometric distortion depending on the vehicle’s speed at the time of data acquisition. Therefore, NAVER LABS applied its advanced localization technique to accurately calculate the vehicle’s pose and speed and correct such geometric distortions.

Geometric distortion correction through precise localization
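
As an illustration of the de-skewing idea (a simplified stand-in for NAVER LABS’ localization-based correction), the sketch below re-expresses every point of a LiDAR sweep in the sensor frame at the end of the sweep, assuming a constant velocity and yaw rate over the sweep. The function and parameter names are illustrative.

```python
import numpy as np

def deskew_sweep(points, timestamps, velocity, yaw_rate, sweep_period=0.1):
    """Undo motion distortion in a single LiDAR sweep (illustrative).

    points      : (N, 3) raw points in the sensor frame at their capture time
    timestamps  : (N,) per-point time offsets within the sweep, in seconds
    velocity    : (3,) vehicle velocity in the sensor frame (m/s)
    yaw_rate    : yaw rate in rad/s (planar motion assumed)
    """
    corrected = np.empty_like(points)
    for i in range(points.shape[0]):
        dt = sweep_period - timestamps[i]  # motion remaining until sweep end
        yaw = yaw_rate * dt
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        # The frame at sweep end equals the capture-time frame advanced by
        # (R, velocity * dt), so points map through the inverse of that motion.
        corrected[i] = R.T @ (points[i] - velocity * dt)
    return corrected
```

Without this correction, a wall scanned during a turn would appear curved; with it, all points of the sweep are consistent with a single vehicle pose.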

Second is providing high-precision localization data. For the outdoor dataset, NAVER LABS is providing accurate localization data (i.e., R1’s pose data) for the stereo images and geometric data collected at the time of data acquisition. While R1 has a high-performance GPS that can precisely localize the vehicle, the urban areas of Pangyo and Yeouido suffer GPS signal disruptions due to high-rise buildings, rendering some of the vehicle’s pose data unreliable. To compensate, NAVER LABS used the high-precision localization technology from its autonomous driving research to provide improved and precise pose data, which is also used to evaluate the localization results submitted by participants.

 

3) Protection of drivers’ vehicle information and pedestrians’ personality rights

Pedestrians’ faces and vehicles’ license plates have been blurred in the datasets provided for this challenge in order to protect drivers’ and pedestrians’ personal information as well as their personality rights. For efficient multi-scale learning and inference, NAVER LABS applied SNIPER, presented at NeurIPS 2018, and the AutoFocus algorithm, presented at ICCV 2019, supplemented the detections manually using LabelMe, and then applied Gaussian blur and median filtering to the detected regions as a secondary step.

Indoor/outdoor dataset blurring
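
A minimal sketch of this two-step blurring with OpenCV, assuming the face/plate bounding boxes have already been produced by the detectors (or corrected in LabelMe); the function name and kernel size are illustrative choices, not values from the actual pipeline.

```python
import cv2

def anonymize(image, boxes, ksize=31):
    """Blur detected regions: Gaussian blur followed by a median filter.

    boxes : iterable of (x, y, w, h) face/plate regions in pixel coordinates
    ksize : odd kernel size for both smoothing steps
    """
    out = image.copy()
    for (x, y, w, h) in boxes:
        roi = out[y:y + h, x:x + w]
        roi = cv2.GaussianBlur(roi, (ksize, ksize), 0)  # primary smoothing
        roi = cv2.medianBlur(roi, ksize)                # secondary filtering
        out[y:y + h, x:x + w] = roi
    return out
```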

 

2. Localization Baseline Techniques

To encourage many Korean researchers to join the challenge and to suggest performance evaluation criteria, NAVER LABS disclosed the localization performance of the indoor and outdoor track baseline algorithms on a leaderboard at the start of this challenge. The indoor track baseline algorithm uses a hybrid technique based on reference image retrieval and keypoint matching, both widely used for single-image localization.

1) Indoor Track Baseline

For the indoor track baseline, a hybrid technique combining RootSIFT and NetVLAD was used. NetVLAD is a deep learning-based method proposed to solve the image retrieval problem.

NetVLAD architecture [1]

With this technique, you can retrieve the reference images that are most relevant to the query image.
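
Once NetVLAD global descriptors have been computed, retrieval reduces to a nearest-neighbor search. The sketch below assumes precomputed, L2-normalized descriptors (`db_desc` for the reference set, `query_desc` for the query); the names are illustrative.

```python
import numpy as np

def retrieve_topk(query_desc: np.ndarray, db_desc: np.ndarray, k: int = 5):
    """Return the indices and scores of the k most similar reference images.

    query_desc : (D,) L2-normalized global descriptor of the query image
    db_desc    : (M, D) L2-normalized descriptors of the reference images
    """
    scores = db_desc @ query_desc    # cosine similarity for unit vectors
    topk = np.argsort(-scores)[:k]   # best-scoring reference images first
    return topk, scores[topk]
```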

To establish the correspondence between a retrieved reference image and the query image, keypoints that can link the two images are extracted. The scale-invariant feature transform (SIFT) used here is a traditional keypoint extraction algorithm that extracts features invariant to image scale and rotation. The vector describing each detected keypoint is referred to as a descriptor, and RootSIFT is an algorithm that re-normalizes these descriptors to improve the matching performance of SIFT.

Extracted SIFT/RootSIFT keypoints
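
For illustration, RootSIFT can be implemented on top of OpenCV’s SIFT in a few lines: L1-normalize each descriptor and take its element-wise square root, so that Euclidean distances between the results correspond to the Hellinger kernel on the original SIFT descriptors. A minimal sketch:

```python
import cv2
import numpy as np

def root_sift(image_gray):
    """Detect SIFT keypoints and convert their descriptors to RootSIFT."""
    sift = cv2.SIFT_create()
    keypoints, desc = sift.detectAndCompute(image_gray, None)
    if desc is None:
        return keypoints, None
    desc /= (desc.sum(axis=1, keepdims=True) + 1e-7)  # L1 normalization
    desc = np.sqrt(desc)                              # Hellinger mapping
    return keypoints, desc
```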

Using these descriptors, keypoints can be matched between query and reference images, as shown below.

Example of keypoint matching detected in different images
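
A minimal matching sketch using OpenCV’s brute-force matcher with Lowe’s ratio test; the 0.8 threshold is a conventional choice, not a value taken from the challenge baseline.

```python
import cv2

def match_descriptors(desc_query, desc_ref, ratio=0.8):
    """Match descriptors and keep only unambiguous correspondences."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc_query, desc_ref, k=2)
    good = []
    for pair in knn:
        # Ratio test: the best match must clearly beat the second best.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return good
```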

After calculating the 3D points corresponding to the reference image’s keypoints, a Perspective-n-Point (PnP) solver is used to estimate the query image’s 6DoF pose.

6DoF pose estimated by indoor VL pipeline
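
A sketch of this final step with OpenCV’s RANSAC-based PnP solver, assuming an intrinsic matrix `K` and already-undistorted keypoint coordinates; the function name and reprojection threshold are illustrative.

```python
import cv2
import numpy as np

def estimate_pose(points_3d, points_2d, K):
    """Estimate a 6DoF camera pose from 2D-3D correspondences via RANSAC PnP."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        K, distCoeffs=None, reprojectionError=3.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix from Rodrigues vector
    return R, tvec               # world-to-camera rotation and translation
```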

 

2) Outdoor Track Baseline

While the outdoor track baseline is similar to the indoor baseline, it is a slightly different hybrid technique that uses R2D2 (tuned) and NetVLAD. The dataset disclosed for the outdoor track comes with accurate poses for all mapping images. Hence, you can find the mapping image that is most similar to a given test image and obtain a pose close to the test image’s pose.

When you project the LiDAR geometric data, which is provided as mapping data, into a mapping image, you can obtain 3D coordinates for the mapping image’s keypoints. The PnP algorithm then uses the correspondences between these 3D coordinates and the 2D coordinates of keypoints matched between the test image and the mapping image to estimate the test image’s pose.

Schematic diagram of the outdoor track baseline algorithm
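
A simplified sketch of the projection step described above, assuming the mapping data provides camera intrinsics `K` and LiDAR-to-camera extrinsics `T_cam_lidar` (all names are illustrative):

```python
import numpy as np

def project_lidar(points_lidar, T_cam_lidar, K, image_shape):
    """Project LiDAR points into an image to pair pixels with 3D coordinates.

    points_lidar : (N, 3) points in the LiDAR frame
    T_cam_lidar  : (4, 4) rigid transform from LiDAR frame to camera frame
    K            : (3, 3) camera intrinsic matrix
    image_shape  : (height, width, ...) of the mapping image
    """
    n = points_lidar.shape[0]
    homog = np.hstack([points_lidar, np.ones((n, 1))])
    cam = (T_cam_lidar @ homog.T).T[:, :3]   # points in the camera frame
    cam = cam[cam[:, 2] > 0]                 # keep points in front of camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]              # perspective division
    h, w = image_shape[:2]
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
              (uv[:, 1] >= 0) & (uv[:, 1] < h))
    return uv[inside], cam[inside]           # pixel coords and 3D points
```

Pairing these projected pixels with the keypoints matched between the test and mapping images yields the 2D-3D correspondences that the PnP solver consumes.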

The outdoor track baseline algorithm uses the global image descriptor extracted by NetVLAD to index and search mapping images, and uses R2D2, a keypoint detection algorithm developed by NAVER LABS Europe in 2019 and fine-tuned on driving images, to detect and match image keypoints. Since the baseline result above was obtained using only the final frame of each stereo video given as a test case, we expect that participants who fully utilize the given data can achieve much better localization performance.

 

This has been a brief introduction to the dataset building process and the baseline localization technique for the NAVER LABS Mapping & Localization Challenge. NAVER LABS is in full support of the many researchers participating in the challenge. For more information and inquiries regarding the challenge, please visit the website below.

Go to the Mapping & Localization Challenge website

 

Reference

[1] Arandjelovic, Relja, et al. "NetVLAD: CNN architecture for weakly supervised place recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
