An image-based safe lane change (SLC) algorithm is proposed to aid lane-change maneuvers for both autonomous driving agents and human drivers. Before the ego-vehicle moves to a target lane, a binary classification (free or blocked) checks whether its surroundings are safe. For precise classification, the SLC uses a Convolutional Neural Network (ConvNet) that learns image features from a large-scale dataset.
A ConvNet is effective because it can extract subtle image features that hand-crafted functions could not capture before; however, we also come to doubt the ConvNet when its outputs do not align with our intuition. In fact, we cannot handle anomalous events if we do not understand how the ConvNet works. The road environment changes every moment, so we test autonomous driving functions cautiously before deploying them on the road. In other words, understanding the internal mechanisms of the ConvNet is essential before adopting it in autonomous driving systems.
Recent research on weakly-supervised object localization gives us a clue about how the ConvNet makes decisions. In this article, we introduce Class Activation Mapping (CAM) and analyze where the SLC algorithm looks in the images.
So, what is the weakly-supervised object localization task?
To solve well-defined machine learning problems, supervised learning algorithms require plenty of data points and the corresponding ground truth labels. For image classification, a dataset consists of images and the keywords that describe them. To learn a model for object detection, on the other hand, we need not only the object names but also the image coordinates of the objects (see Fig. 1). As a task becomes more difficult, building a new dataset for supervised learning costs more time and money. Researchers therefore look for ways to apply existing large-scale datasets to different domains. For example, weakly-supervised object localization tackles object detection using image classification datasets, in which the object localization labels are missing.
Fig. 1: The ground truth label for an image varies depending on the task: examples of ground truth labels for image classification (left) and for object detection (right)
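To make the difference between the two kinds of labels concrete, here is a minimal sketch in Python. The class names, field names, file name, and box coordinates are all hypothetical and chosen only for illustration; real datasets use their own formats.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ClassificationLabel:
    """Image-level label: only says what is in the image."""
    image_path: str
    class_name: str


@dataclass
class DetectionLabel:
    """Object-level label: says what is in the image and where."""
    image_path: str
    # Each object is (class_name, (x_min, y_min, x_max, y_max)) in pixels.
    objects: List[Tuple[str, Tuple[int, int, int, int]]]


def to_weak_labels(det: DetectionLabel) -> List[ClassificationLabel]:
    """Drop the boxes and keep only the distinct image-level class names."""
    names = sorted({name for name, _ in det.objects})
    return [ClassificationLabel(det.image_path, n) for n in names]


# Hypothetical example values, for illustration only.
det = DetectionLabel(
    image_path="street.jpg",
    objects=[
        ("car", (34, 120, 210, 260)),
        ("car", (300, 130, 420, 240)),
        ("pedestrian", (450, 100, 490, 230)),
    ],
)
weak = to_weak_labels(det)
print([w.class_name for w in weak])
```

A weakly-supervised localization method only ever sees labels like `weak`; the box coordinates in `det` are exactly the information it has to recover on its own.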
How do we learn a model for image classification?
For image classification, most ConvNet architectures can be divided into two parts: convolutional layers that compute image features and fully-connected layers that perform the classification (see Fig. 2).
Fig. 2: Image features are computed by the convolutional layers and then pass through the fully-connected layers to produce a prediction. During the training phase, supervised learning algorithms attempt to reduce the difference between the prediction (x) and the ground truth (y).
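The second half of this pipeline can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the feature tensor is random noise standing in for real convolutional output, the fully-connected layer is untrained, and cross-entropy is assumed as the measure of difference between the prediction x and the ground truth y (the article does not name the loss).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 512 feature channels on a 7x7 grid, 2 classes.
num_channels, height, width, num_classes = 512, 7, 7, 2

# Stand-in for the output of the convolutional layers.
features = rng.standard_normal((num_channels, height, width))

# Reshape the feature into a vector for the fully-connected layer.
# After this step the 2-D layout of the feature map is no longer explicit.
flat = features.reshape(-1)  # shape: (512 * 7 * 7,)

# A single fully-connected layer with random (untrained) parameters.
fc_weights = rng.standard_normal((num_classes, flat.size)) * 0.01
fc_bias = np.zeros(num_classes)
logits = fc_weights @ flat + fc_bias


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


x = softmax(logits)            # prediction (x in Fig. 2)
y = np.array([1.0, 0.0])       # one-hot ground truth (y in Fig. 2)
loss = -np.sum(y * np.log(x))  # cross-entropy between x and y
print(flat.shape, x, loss)
```

Training would repeatedly adjust the parameters of both parts to make `loss` smaller; that step is omitted here.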
We lose spatial information when an image feature is reshaped to feed the subsequent fully-connected layers. In the weakly-supervised object localization task, we instead exploit the intermediate image features computed by the convolutions to obtain the regions that are salient for a prediction. The CAM algorithm assumes that salient regions, which contain many parts of a certain object, are activated during classification.
More precisely, we explain the CAM algorithm using the VGG16 network architecture. Given a (3,224,224) input image, VGG16 produces image features of size (512,7,7) at the end of its convolutional stage. Think of this feature as a (7,7) map with 512 channels, where each channel contributes differently to the classification of the given object classes. The CAM algorithm learns the relative importance of the channels in the fully-connected layer that follows. Using those weights, we aggregate the feature maps over the channels and finally obtain a saliency map that shows where the ConvNet looks in the image to make a prediction (see Fig. 3).
Fig. 3: In the weakly-supervised object localization task, we have no information about object locations in the image, so we cannot use the supervised learning regime to learn a localization model. Instead, the CAM algorithm computes a weighted sum of the image features, where the weights are the parameters of the fully-connected layer that follows the convolutions. We can then see the activated areas on which the ConvNet focuses to predict a class.
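The weighted sum described above can be sketched with NumPy as follows. This is an illustrative sketch, not the SLC implementation: the features and weights are random stand-ins, the function names are hypothetical, and it assumes, as in the CAM paper by Zhou et al., that the features are globally average-pooled before a bias-free fully-connected layer.

```python
import numpy as np


def class_activation_map(features, fc_weights, class_idx):
    """Weighted sum of the feature maps over channels for one class.

    features:   (C, H, W) output of the convolutional stage
    fc_weights: (num_classes, C) weights of the fully-connected layer
                that follows global average pooling
    """
    channel_weights = fc_weights[class_idx]                  # (C,)
    return np.tensordot(channel_weights, features, axes=1)   # (H, W)


def normalize(cam):
    """Rescale a map to [0, 1] so it can be displayed as a heat map."""
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


rng = np.random.default_rng(0)

# Random stand-ins for VGG16-sized features and a 2-class classifier.
features = np.maximum(rng.standard_normal((512, 7, 7)), 0.0)
fc_weights = rng.standard_normal((2, 512))

# Class scores from global average pooling plus the fully-connected layer.
pooled = features.mean(axis=(1, 2))   # (512,)
logits = fc_weights @ pooled          # (2,)
predicted = int(np.argmax(logits))

cam = class_activation_map(features, fc_weights, predicted)
saliency = normalize(cam)
print(cam.shape, predicted)
```

Under these assumptions the spatial mean of `cam` equals the class score `logits[predicted]`, which is why the map can be read as a per-location breakdown of the prediction. To overlay it on the input, the (7,7) map would still need to be upsampled to the image size.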
Back to the story of autonomous driving research
To learn an SLC model, we annotated rear-side view images captured in various road environments according to the following criteria:
- Blocked if the ego-vehicle cannot physically move to the target lane;
- Free if the ego-vehicle can move to the target lane; and
- Undefined for an ambiguous situation such as a crosswalk or any other unusual scene.
The annotation rules are akin to a human driver's decision-making process for a lane change: we decide almost instantly whether to move to a target lane by checking the rear-side view mirrors. To tolerate differences in driving behavior when building the dataset, we accept a ground truth label only when multiple annotators agree on the status of the scene.
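The agreement rule can be illustrated with a small Python sketch. It assumes that "agree" means all annotators gave the same label; the article does not state the exact rule, and the function name and label strings are hypothetical.

```python
from typing import List, Optional

VALID_LABELS = {"blocked", "free", "undefined"}


def consensus_label(votes: List[str]) -> Optional[str]:
    """Return the label only when every annotator gave the same one.

    Returns None when there are no votes or the annotators disagree,
    meaning the image gets no ground truth label.
    """
    if not votes:
        return None
    for vote in votes:
        if vote not in VALID_LABELS:
            raise ValueError(f"unknown label: {vote!r}")
    return votes[0] if len(set(votes)) == 1 else None


# Hypothetical annotation results for three images.
print(consensus_label(["free", "free", "free"]))
print(consensus_label(["free", "blocked", "free"]))
print(consensus_label(["undefined", "undefined"]))
```

The first and third calls return the shared label, while the second returns `None` because the annotators disagree.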
Can the SLC model make the right prediction on a road it has never visited? Yes, it can. To examine the generalization performance of the SLC model, we tested it on images that were not used during the training phase and achieved a classification accuracy of 96.98%.
Using CAM, we also checked whether the SLC model works as we intended. We replaced the fully-connected layers of the SLC model with a single fully-connected layer of length 512. With the parameters of the convolutional layers fixed, we fine-tuned the SLC model on the same dataset to obtain saliency maps. As shown in Fig. 4, the SLC model, like a human driver, looks at the space in the adjacent lane to judge whether a lane change is likely to succeed.
Fig. 4: The classification result of the SLC model (left), and the CAM visualization highlighting the areas used for the prediction (right)
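The fine-tuning step, in which the convolutional parameters stay fixed and only the new fully-connected layer is trained, can be sketched on toy data with NumPy. Everything here is an assumption made for illustration: the features are random stand-ins with 8 channels instead of 512, the labels are synthetic, global average pooling precedes the fully-connected layer as in the CAM paper, and plain gradient descent on a cross-entropy loss replaces whatever optimizer was actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the real model uses 512 channels.
num_images, num_channels, height, width, num_classes = 200, 8, 7, 7, 2

# Stand-in for the frozen convolutional features (never updated below).
features = np.maximum(
    rng.standard_normal((num_images, num_channels, height, width)), 0.0)
pooled = features.mean(axis=(2, 3))  # global average pooling, (N, C)

# Synthetic labels so the toy problem is learnable: 0 = blocked, 1 = free.
labels = (pooled[:, 0] > pooled[:, 1]).astype(int)
onehot = np.eye(num_classes)[labels]


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(weights):
    probs = softmax(pooled @ weights.T)
    return float(-np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=1)))


# The fully-connected layer holds the only trainable parameters.
weights = np.zeros((num_classes, num_channels))
initial_loss = cross_entropy(weights)

for _ in range(500):
    probs = softmax(pooled @ weights.T)               # (N, K)
    grad = (probs - onehot).T @ pooled / num_images   # (K, C)
    weights -= 1.0 * grad                             # gradient descent step

final_loss = cross_entropy(weights)

# Saliency map of the first image for its labeled class.
cam = np.tensordot(weights[labels[0]], features[0], axes=1)  # (H, W)
print(initial_loss, final_loss, cam.shape)
```

Because only `weights` changes during training, the learned values directly serve as the per-channel importances needed to build the saliency maps.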
The following video was recorded inside an autonomous driving car running in a complex urban road environment; the results of the perception algorithms are displayed on the right. The SLC algorithm deployed in the NAVER LABS autonomous driving car secures the safety of lane-change operations.
1) S.-G. Jeong, J. Kim, S. Kim, and J. Min, End-to-end Learning of Image based Lane-Change Decision, in Proc. IEEE IV’17
2) B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning Deep Features for Discriminative Localization, in Proc. IEEE CVPR’16
3) matcaffe Implementation of class activation mapping: https://github.com/metalbubble/CAM
4) Keras Implementation of class activation mapping: https://github.com/jacobgil/keras-cam