Enabling robots to learn indoor autonomous driving algorithms that allow them to adapt to various settings and tasks through reinforcement learning

Mobile robots are increasingly rolling into a wide variety of services and everyday spaces. Research on autonomous driving software for robots must therefore tackle more diverse and complex problems than simply reaching a destination without colliding with obstacles.

AROUND C surrounded by children, a common situation for mobile robots deployed in daily living spaces

For example, in an environment with no people and no structural changes, a robot only needs to find the shortest path that avoids physical collisions with obstacles, which is relatively simple. Service robots in restaurants and cafes, however, operate with many people standing or moving around them. They must therefore navigate while accounting for obstacles whose size, shape and position change every second.

Above all, “people” are the most challenging obstacles for robots in everyday spaces. A robot that approaches people at high speed may make them anxious even if no collision occurs, so it needs to keep a greater distance from people than from other obstacles. To address this issue, most robots deployed in living spaces drive slowly, and when an unexpected obstacle (or person) appears, they stop in place and wait for it to move away. However, too much caution slows down the robot's original mission (e.g., delivery) and frustrates users.

From a software engineer's perspective, it is challenging to create autonomous driving algorithms that balance the relative importance (weights) of subjective factors such as “anxiety” and “frustration.” The weights that satisfy most people vary with the space where the robot operates, the work it performs and the users it serves, so finding good weights takes a great deal of trial and error. Even once they are found, there is no guarantee that the driving algorithm will work properly when they are applied.

To address such complex issues and design robots that can be rapidly assigned to various spaces for diverse tasks, NAVER LABS used meta deep reinforcement learning and Bayesian active learning.

Meta deep reinforcement learning

1) Deep reinforcement learning

Reinforcement learning is a process in which an agent interacts with an environment: it observes the state of the environment, affects the environment through actions, and experiences the resulting rewards. Through this interaction, the agent learns a policy, the function that determines which action to take in each state, so as to maximize the cumulative reward it expects to obtain from the present onward.
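
To make the quantity being maximized concrete, here is a minimal sketch of a cumulative reward computed from a sequence of per-step rewards. The discount factor `gamma` and the function name `discounted_return` are assumptions for illustration; the article does not state how future rewards are weighted.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative reward from the present step onward.

    Each reward is multiplied by gamma once per step of delay, so
    rewards further in the future count slightly less (assumed form).
    """
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: two neutral steps, one penalty, then a goal reward.
episode_rewards = [0.0, 0.0, -1.0, 10.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

A policy that scores well on this quantity accepts a small penalty now only if it leads to a larger reward later, which is the trade-off the rest of this article keeps returning to.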

Recent studies that incorporate deep learning into reinforcement learning have produced great results in decision-making and control, such as AlphaGo[1] and OpenAI Five,[2] and a large body of ongoing research applies them to robot behavior control. NAVER LABS has also studied deep reinforcement learning for the autonomous driving of AROUND robots and presented the results at events such as CES[3] and ICRA.[4]

Autonomous driving algorithms based on deep reinforcement learning do not require accurate maps and, with the help of a GPU, can very rapidly predict optimal actions that take both current and future rewards into account. However, training requires a tremendous amount of data and time, and the agent cannot adapt without additional training if the robot or the reward settings change. This makes the approach impractical for mobile service robots, which must behave differently depending on the characteristics of each place, task or user.

For example, suppose an agent was trained in an environment where it could move at a speed of 0.4 m/s and received a reward of -1 when it got close to a person. If, because users are unsatisfied, the robot is then modified to move at a maximum speed of 0.6 m/s and to receive a reward of -3 when it gets close to a person, the agent has to learn again from scratch through countless iterations, so it takes a very long time to become satisfactory.
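
The sketch below shows why such a change invalidates what the agent has learned: the same step is scored differently under the two settings. Only the numbers 0.4, 0.6, -1 and -3 come from the example above. The names `DrivingSettings` and `step_reward`, and the idea of rewarding progress in meters, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrivingSettings:
    max_speed: float           # m/s (0.4 is treated as a maximum here)
    near_person_reward: float  # reward received when close to a person


def step_reward(progress_m: float, near_person: bool, settings: DrivingSettings) -> float:
    """Hypothetical per-step reward: progress toward the goal plus a person term."""
    reward = progress_m
    if near_person:
        reward += settings.near_person_reward
    return reward


old_settings = DrivingSettings(max_speed=0.4, near_person_reward=-1.0)
new_settings = DrivingSettings(max_speed=0.6, near_person_reward=-3.0)

# The same step (0.5 m of progress while passing close to a person)
# is judged differently once the settings change.
print(step_reward(0.5, True, old_settings))
print(step_reward(0.5, True, new_settings))
```

A policy trained only under `old_settings` has never seen the new trade-off, which is why a conventional agent must be retrained when the settings change.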

2) Meta reinforcement learning

To address this issue, NAVER LABS relied on meta reinforcement learning. Meta learning is a subfield of machine learning that enables 'learning to learn': AI agents are trained to solve many kinds of problems drawn from a certain distribution, so that the trained agents can adapt quickly when new problems within that distribution are given.

Simulator using procedural generation technology

To allow AROUND robots to adapt immediately to various robot settings (maximum speed, rotation velocity, acceleration, etc.) and environmental settings (the structure and size of a space, rewards for taking the shortest path, rewards for colliding with obstacles, rewards for getting too close to people, etc.), NAVER LABS generates a new indoor environment in the simulator each time and samples random robot and reward settings. In addition, NAVER LABS places more than ten robots in one environment and trains them simultaneously, which greatly increases the amount of data available for learning and lets the robots develop the ability to deal with moving obstacles while avoiding each other.
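
As a rough illustration of the sampling step, the sketch below draws one random robot and reward configuration per training episode. The setting names follow the list above, but the numeric ranges, the uniform distribution, and the names `SETTING_RANGES`, `EpisodeSettings` and `sample_episode_settings` are all assumptions. The article does not publish the actual values or sampling scheme.

```python
import random
from dataclasses import dataclass

# Hypothetical (low, high) ranges; the real ranges are not given in the article.
SETTING_RANGES = {
    "max_speed": (0.2, 1.0),              # m/s
    "max_rotation_velocity": (0.5, 2.0),  # rad/s
    "max_acceleration": (0.2, 1.5),       # m/s^2
    "shortest_path_reward": (0.0, 1.0),
    "collision_reward": (-10.0, -1.0),
    "near_person_reward": (-5.0, 0.0),
}


@dataclass(frozen=True)
class EpisodeSettings:
    max_speed: float
    max_rotation_velocity: float
    max_acceleration: float
    shortest_path_reward: float
    collision_reward: float
    near_person_reward: float


def sample_episode_settings(rng: random.Random) -> EpisodeSettings:
    """Draw one robot and reward configuration uniformly from SETTING_RANGES."""
    return EpisodeSettings(
        **{name: rng.uniform(low, high) for name, (low, high) in SETTING_RANGES.items()}
    )


rng = random.Random(0)  # seeded only to make the sketch reproducible
for episode in range(3):
    print(episode, sample_episode_settings(rng))
```

Because every episode uses different settings, the agent cannot specialize in one configuration and is instead pushed to learn behavior that works across the whole sampled range.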

A still from a driving video of learned agents

As a result of using meta reinforcement learning, we obtained agents that immediately adjust their behavior when given any settings within the trained range, and that perform on par with agents trained using only those settings.

Performance comparison, across various settings, between meta reinforcement learning agents and agents trained using only the corresponding settings (refer to the paper for further details)

3) Bayesian active learning

Meta deep reinforcement learning yielded agents capable of adapting to different settings, but it was still difficult to determine which settings were best for the space, task and users the robots would serve. This typically requires a great deal of labor and time, because it relies on trial and error: engineers choose settings at their discretion, receive feedback from users or UX designers, and gradually modify them. To speed up this process, NAVER LABS surveyed preferences for various settings in simulation and, through joint research with NAVER LABS Europe, developed a Bayesian neural network[5] based algorithm that efficiently selects candidates for the optimal settings.

A Bayesian neural network (BNN) is a model that represents the parameters of the neural network as probability distributions rather than fixed values. It can perform well even when little training data is available, and it provides a measure of uncertainty for its predictions. We used the BNN to predict, from a small amount of preference survey data, which settings were more likely to be preferred. We then included in the next preference survey the settings that were expected to be highly preferred but also carried high uncertainty, which reduced the number of preference surveys required to find the optimal settings.
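
The selection idea can be sketched as follows. Each candidate setting has several predicted preference scores, standing in for predictions made with different weight samples from a BNN, and the candidates that combine a high mean with a high spread are chosen for the next survey. The additive score (mean plus weighted standard deviation), the names `select_survey_candidates` and `exploration_weight`, and the example numbers are assumptions, not the algorithm from the paper.

```python
from statistics import mean, pstdev


def select_survey_candidates(predicted_preferences, num_candidates=1, exploration_weight=1.0):
    """Rank candidate settings by predicted preference plus uncertainty.

    predicted_preferences maps a candidate name to a list of preference
    scores, one per sampled set of network weights (hypothetical data).
    """
    scores = {
        name: mean(samples) + exploration_weight * pstdev(samples)
        for name, samples in predicted_preferences.items()
    }
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)[:num_candidates]


# Made-up predictions for three candidate settings.
predictions = {
    "cautious": [0.30, 0.32, 0.31],  # low preference, low uncertainty
    "balanced": [0.60, 0.62, 0.61],  # high preference, low uncertainty
    "fast": [0.40, 0.80, 0.60],      # similar mean, high uncertainty
}

print(select_survey_candidates(predictions))
```

In this made-up example, "fast" is chosen over "balanced" even though its mean is slightly lower, because asking people about it resolves more uncertainty. Setting `exploration_weight=0.0` ignores uncertainty and selects "balanced" instead.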

A still from a video of the simulation-based preference survey

AROUND C Pilot service

To test the performance of the newly developed algorithm, a pilot service test was conducted, enabling a robot to serve beverages at Bear Better Cafe, located on the first floor of NAVER Green Factory, from November to December 2019.

NAVER LABS Robotics and UX Teams found a setting suitable for the café service through a preference survey, and applied the setting to the meta reinforcement learning agent to allow the robot to drive autonomously. The test confirmed that AROUND C delivered beverages rapidly and safely through crowds, even in chaotic and complex situations where it is almost impossible for ordinary service robots to operate.

AROUND C serving drinks in a crowded and complex situation


The paper on the meta reinforcement learning and Bayesian active learning technology employed in AROUND C was accepted to ICRA 2020, the world's largest robotics conference, and is awaiting presentation.[6]

After successfully completing the pilot test, NAVER LABS has continued its research on reinforcement learning to solve problems found in the service test and to achieve higher performance. AROUND robots equipped with further upgraded autonomous driving algorithms will be deployed throughout NAVER's new office building. We are hiring deep reinforcement learning engineers to apply reinforcement learning to more problems in the future, and we look forward to receiving applications from many people who are interested in state-of-the-art research on reinforcement learning.


[1] Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354-359.
[2] Berner, Christopher, et al. "Dota 2 with Large Scale Deep Reinforcement Learning." arXiv preprint arXiv:1912.06680 (2019).
[3] https://www.youtube.com/watch?v=jqSztRdd-mc
[4] Choi, Jinyoung, et al. "Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.
[5] Blundell, Charles, et al. "Weight uncertainty in neural networks." arXiv preprint arXiv:1505.05424 (2015).
[6] Choi, Jinyoung, et al. "Fast Adaptation of Deep Reinforcement Learning-Based Navigation Skills to Human Preference."
