In this blog post, we present LAMBDA, Nuro’s Multimodal Large Language Model (MLLM). LAMBDA performs question answering as part of The Nuro Driver’s™ onboard autonomy stack, running in real time on the hardware inside our autonomous vehicle platforms. LAMBDA has proven highly capable of reasoning about the wide variety of challenging situations that an L4 driverless system must handle. Running this system in real time on an autonomous vehicle is an important milestone toward leveraging this reasoning as part of The Nuro Driver’s™ autonomy decision making.
LAMBDA understands the high-level behavior and intention of the vehicles in front of the autonomous vehicle, and can reason about how it should react to them.
Autonomous driving is challenging in part because of the variety of situations faced when deploying a fleet at scale. Humans are able to reason about these “long-tail” challenges by drawing on knowledge of the world and common-sense reasoning learned from life experience to make intelligent decisions. Large Language Models (LLMs) are trained on massive amounts of data across a wide variety of tasks; while some of these tasks may not appear directly beneficial to driving, they capture general semantic relationships that mirror broad human experience more closely than driving data alone. We believe that utilizing these general language models will enable autonomous systems to better reason about scenarios that may be unseen even within an expansive driving training corpus.
In addition to reasoning and common-sense capabilities, LLMs and MLLMs offer a natural way to interact with embodied AI systems through spoken or written language. This can enable autonomy systems integrated with MLLMs to receive real-time instructions or produce explanations for the actions they take. It also provides a new method for improving the autonomy system: sourcing training data in the form of text explanations, which can be a more scalable, and in some respects richer, complement to collecting on-road driving data.
LAMBDA can understand important details about vulnerable road users, such as the pedestrian in the left scene that is walking a leashed dog, or the pedestrian in the right scene that is jogging.
At Nuro, we believe there are many potential use cases for a language-powered autonomy stack, which is why we have built LAMBDA as a multi-purpose language model that will benefit our in-vehicle experience, our autonomous decision making, and much more.
In this blog post, we share the first on-road demonstrations of LAMBDA as a commentary system that can describe to passengers its observations of the state of the world and the autonomy system’s decision making.
Left: LAMBDA speculates about the behavior of the oncoming vehicle while it is far away, and later correctly understands that it is yielding to the autonomous vehicle. Right: While the traffic light is green, LAMBDA correctly understands that the autonomous vehicle is remaining stopped because of the line of traffic.
State and Action Reasoning
LAMBDA has the capability to perform a diverse set of tasks and can respond to a wide variety of questions. To use LAMBDA for real-time decision making in complex road situations, and to reason about the best actions to take, it is necessary to teach LAMBDA to deeply understand the current state of the autonomous vehicle and its environment. We trained LAMBDA to express this understanding by responding to questions in natural language, explaining what is happening in the world and how it relates to the vehicle’s current action. Concretely, LAMBDA answers the primary question:
What are you doing, and why?
In addition to general comments about location and conditions (e.g., “I am in a neighborhood,” “I am on a multi-lane road”), LAMBDA can refer to specific road users by using their unique IDs assigned by Nuro’s perception stack. These IDs enable LAMBDA to reason about individual road users in an explicit and unambiguous way, and also allow for follow-up questions about specific road users. In the following examples, the autonomous vehicle is yielding to two cyclists in the scene on the left and to an oncoming vehicle in the scene on the right. In both examples, LAMBDA correctly refers to their unique IDs in its response.
LAMBDA can refer to road users by their unique IDs assigned by Nuro’s perception stack. On the left the response refers to the oncoming vehicle with ID 3139, and on the right the response refers to two cyclists with IDs 813 and 925.
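To make the ID-based referencing concrete, here is a minimal sketch of how a prompt for the primary question might be assembled from perception outputs. The class, field, and function names are illustrative assumptions, not Nuro’s actual interfaces.

```python
# Hypothetical sketch: composing a text prompt that exposes perception-assigned
# agent IDs, so the model's answer can reference road users unambiguously.
from dataclasses import dataclass


@dataclass
class TrackedAgent:
    agent_id: int   # unique ID assigned by the perception stack
    category: str   # e.g. "cyclist", "oncoming_vehicle", "pedestrian"
    relation: str   # short description of the agent's relation to the AV


def build_prompt(agents: list[TrackedAgent], question: str) -> str:
    """Combine the tracked agents and the passenger question into one text prompt."""
    agent_lines = "\n".join(
        f"- agent {a.agent_id}: {a.category}, {a.relation}" for a in agents
    )
    return f"Nearby road users:\n{agent_lines}\n\nQuestion: {question}"


agents = [
    TrackedAgent(813, "cyclist", "crossing ahead from the left"),
    TrackedAgent(925, "cyclist", "crossing ahead from the right"),
]
print(build_prompt(agents, "What are you doing, and why?"))
# A response can then refer back to agents 813 and 925 unambiguously, e.g.
# "I am yielding to cyclists 813 and 925 crossing in front of me."
```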
LAMBDA is flexible enough to answer a variety of questions that an intelligent driver may need to reason about, such as the current context of the scenario, what other agents may be thinking or intending, and how those agents interact with Nuro’s autonomous vehicles.
On the left LAMBDA answers questions about a situation where the traffic light is green, but an oncoming vehicle is turning in front of the autonomous vehicle. On the right LAMBDA answers various questions about two pedestrians and their dogs that are crossing the street ahead.
LAMBDA’s Architecture
As a Multimodal LLM integrated with our autonomy stack, LAMBDA can access the current state of the autonomous vehicle and its environment (the world state). This includes information about the Nuro vehicle (including the vehicle’s history, route, and kinematics), agent features (the information available for nearby road users), and other contextual features (including scene-level information provided by map and perception modules). A Behavior Foundation Model is employed to encode the world state features into continuous token representations that can be efficiently processed by the LLM.
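As a rough illustration of this token-level interface, the sketch below assumes a PyTorch-style stack in which a stand-in for the Behavior Foundation Model projects world-state features into the LLM’s embedding space. The module names, dimensions, and projection design are assumptions for illustration, not the actual LAMBDA implementation.

```python
# Minimal sketch: encode world-state features as continuous tokens and prepend
# them to the embedded question tokens before they reach the LLM.
import torch
import torch.nn as nn


class WorldStateEncoder(nn.Module):
    """Stand-in for a behavior foundation model: maps world-state features
    (ego history, agent features, scene context) to continuous tokens."""

    def __init__(self, feature_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, world_state: torch.Tensor) -> torch.Tensor:
        # world_state: (batch, num_tokens, feature_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(world_state)


def build_llm_inputs(world_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend world-state tokens to the question embeddings so the LLM attends
    to both modalities in a single sequence."""
    return torch.cat([world_tokens, text_embeds], dim=1)


encoder = WorldStateEncoder(feature_dim=256, llm_dim=4096)
world_tokens = encoder(torch.randn(1, 64, 256))            # continuous scene tokens
text_embeds = torch.randn(1, 32, 4096)                     # embedded question tokens
llm_inputs = build_llm_inputs(world_tokens, text_embeds)   # shape (1, 96, 4096)
```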
We use pre-trained base models for both the LLM and the Behavior Foundation Model, the latter of which is pre-trained on a dataset containing hundreds of millions of driving examples. The combined LAMBDA model is fine-tuned on a diverse set of data and tasks, including data collected through automated and manual labeling.
LAMBDA can understand the underlying reason behind different driving decisions. On the left the autonomous vehicle is slowing down to give space to a vehicle reversing out of a driveway, and on the right it’s decelerating before a left turn to let an oncoming cyclist pass first.
LAMBDA is optimized and deployed to run on the autonomous vehicle’s onboard hardware platform and does not require any remote connection to operate. To provide an interactive interface for passengers, a tablet connected to the onboard platform relays user questions and LAMBDA’s responses.
A screenshot from the tablet being used by a passenger interacting with LAMBDA inside the AV.
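As an illustration of this interaction pattern, here is a minimal sketch of a local question-and-answer loop between a tablet front end and an onboard process, with no remote connection involved. The function and queue names are hypothetical.

```python
# Hypothetical sketch: the tablet enqueues a question, an onboard worker answers
# it locally, and the answer is returned for display on the tablet.
import queue
import threading


def answer_onboard(question: str) -> str:
    """Placeholder for running the locally deployed model on onboard hardware."""
    return f"(onboard model response to: {question!r})"


def onboard_loop(questions: queue.Queue, responses: queue.Queue) -> None:
    """Consume passenger questions and publish answers, entirely on the vehicle."""
    while True:
        question = questions.get()
        if question is None:  # shutdown signal
            break
        responses.put(answer_onboard(question))


questions: queue.Queue = queue.Queue()
responses: queue.Queue = queue.Queue()
threading.Thread(target=onboard_loop, args=(questions, responses), daemon=True).start()

questions.put("What are you doing, and why?")   # sent from the tablet UI
print(responses.get())                          # displayed back on the tablet
questions.put(None)                             # stop the worker
```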
Towards LAMBDA for Real-Time Decision Making
The remarkable qualities of LLMs and MLLMs make them excellent candidates for helping autonomy systems generalize to new environments and uncommon road events (long-tail events). Nevertheless, several challenges have to be overcome before these models can be effectively utilized for decision making in long-tail driving situations.
- Reliably reason about essential factors for decision making. While off-the-shelf MLLMs have strong reasoning and common sense abilities that extend to driving decision making, further adaptation is required to ensure the safety and coverage necessary for L4 driverless deployment. It is particularly important to have robust scene understanding to ensure all important factors necessary for long-tail decision making are captured from the input in an efficient and accurate way. Currently, LAMBDA demonstrates advanced comprehension and reasoning in various road situations, and we continue to improve its accuracy by building scalable data collection, model training, and evaluation pipelines.
- Utilize LLM’s internet-scale knowledge and reasoning when making driving decisions. While LLMs can be trained to directly generate low-level driving plans that the autonomy’s control system can execute into actions, a key challenge is to ensure they utilize the knowledge and reasoning capability embedded inside the LLM. It may be possible to leverage these capabilities explicitly through natural language generation, for example in the form of high-level driving plans. However, successful realization of this system depends, to a great extent, on the balance of several design factors in high-level plan composition. These include the flexibility to allow long-tail decision making, the concreteness needed for reliable low-level conversion, the granularity for controlling nuances of a low-level plan, and the computation restrictions necessary for real-time decision making.
- Ensure the safety and security of the integrated autonomy system. LLMs are known to make occasional mistakes, including generating non-factual information known as hallucinations. While it is important to minimize errors that impact decision making, the autonomy system should remain robust against potential mistakes. In particular, it is important to keep validation processes that restrict executing any unsafe decisions based on faulty LLM reasoning; a minimal sketch of such a guardrail check is shown after this list. Furthermore, it is necessary to develop security measures that ensure any input provided externally (e.g., via human interaction) is passed to the LLM in a highly controlled manner.
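To illustrate what such a validation step could look like, the sketch below assumes the LLM proposes a high-level plan as structured data that a validator checks against hard constraints before anything is executed. The plan schema, limits, and checks are illustrative assumptions only.

```python
# Hypothetical guardrail: reject LLM-proposed plans that reference unknown agents,
# use unknown actions, or exceed speed limits; rejected plans fall back to the
# non-LLM autonomy behavior.
from dataclasses import dataclass


@dataclass
class HighLevelPlan:
    action: str                      # e.g. "yield", "proceed", "nudge_left"
    target_speed_mps: float
    referenced_agent_ids: list[int]  # agents the plan reasons about


ALLOWED_ACTIONS = {"yield", "proceed", "stop", "nudge_left", "nudge_right"}
MAX_SPEED_MPS = 15.0


def validate_plan(plan: HighLevelPlan, known_agent_ids: set[int]) -> bool:
    """Return True only if the plan passes all hard-constraint checks."""
    if plan.action not in ALLOWED_ACTIONS:
        return False
    if not 0.0 <= plan.target_speed_mps <= MAX_SPEED_MPS:
        return False
    if any(a not in known_agent_ids for a in plan.referenced_agent_ids):
        return False  # possible hallucinated agent reference
    return True


plan = HighLevelPlan(action="yield", target_speed_mps=2.0, referenced_agent_ids=[813, 925])
assert validate_plan(plan, known_agent_ids={813, 925, 3139})
```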
Despite the challenges, we believe Multimodal LLMs have the potential to revolutionize how autonomous vehicles make decisions. We are excited to demonstrate even more uses for LAMBDA in the future, as we continue to develop world-class Level 4 driving software.
If you believe in our mission and want to help build the future of autonomous driving, join us! https://www.nuro.ai/careers
By: Aryan Arbabi, Brian Yao, Haohuan Wang, Wuming Zhang, Georgie Mathews, Mathew Hanczor and Aleksandr Petiushko
The authors would like to also acknowledge the contributions from Aaron Weldy, Brandon Buckley, Erin Gracey, John Reliford, Kai Ang, Max Siegel, Mengxiong Liu, Santiago Sinisterra, Ury Zhilinsky, Yifei Shen and other Nuro team members not mentioned here who provided help and support.