2024/10/18

How Multi-Modal LLMs are Powering the Future of Autonomous Driving

In recent years, large language models (LLMs) have not only learned to understand language but also to “see” — marking a pivotal shift in artificial intelligence. What we are witnessing today is a confluence of multi-modal AI technologies, where models trained on text are being extended with the capability to interpret images and video. The potential of these models for autonomous driving is extraordinary. LLMs like OpenAI’s GPT-4o, Anthropic’s Claude and Google’s Gemini Pro, alongside cutting-edge open-source models like Mistral’s Pixtral, Alibaba’s Qwen-VL and Meta’s LLaMA 3.2, are demonstrating impressive progress in fields that were traditionally the domain of computer vision.

Multi-Modal LLMs: Learning to See

At the core of this revolution is the ability of LLMs to process both text and visual information. Traditionally, language models operated in a purely textual domain, excelling in generating and understanding words. However, with the rise of multi-modal systems, these models are now capable of interpreting images and associating them with textual descriptions. This advancement builds on the architecture of models like transformers, which allow for the parallel processing of different types of data.

For instance, when a language model is trained on a scenario like “a child runs into the street,” it not only understands the sentence but can also associate it with relevant visual data, such as images of children and streets. The key is that the textual concepts seen during training are paired with corresponding images. As the model learns from these text-image pairs, it becomes adept at recognizing objects in real-world settings, making it invaluable for computer vision applications.
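One simplified way to picture this pairing is a CLIP-style contrastive objective, in which matching image and caption embeddings are pulled together while mismatched pairs are pushed apart. The PyTorch sketch below is purely illustrative, with random tensors standing in for real image and text encoders; it is not the training code of any of the models mentioned above.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style loss: image/caption pairs sharing a row index are pulled
    together, all other combinations are pushed apart."""
    img = F.normalize(image_emb, dim=-1)   # unit vectors, so the dot product is a cosine similarity
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature   # similarity of every image with every caption
    targets = torch.arange(len(logits))    # the matching caption sits on the diagonal
    loss_img = F.cross_entropy(logits, targets)      # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_img + loss_txt) / 2

# Toy batch of 4 image/caption pairs, e.g. "a child runs into the street";
# the random tensors stand in for the outputs of an image and a text encoder.
print(contrastive_loss(torch.randn(4, 512), torch.randn(4, 512)))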

Why Multi-Modal Models Excel in Computer Vision

Autonomous driving relies heavily on real-time, accurate object detection. Current systems, like emergency braking assistants, can already recognize pedestrians and cyclists with high reliability. This is where mature machine learning models demonstrate their value. They can recognize objects in images or videos not only swiftly but with an impressive degree of accuracy. Since modern cars are equipped with an abundance of sensors, including cameras and LiDAR, the data necessary to train these models is generated in vast quantities.
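To get a feel for the object-detection side on its own, a pre-trained detector is enough to recognize pedestrians and cyclists in a single camera frame. The following sketch uses torchvision’s off-the-shelf Faster R-CNN on a placeholder image file; production driver-assistance systems rely on specialized, heavily optimized networks and sensor fusion rather than this exact pipeline.

import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn_v2,
    FasterRCNN_ResNet50_FPN_V2_Weights,
)

# Detector pre-trained on COCO, whose classes include "person" and "bicycle"
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn_v2(weights=weights).eval()
preprocess = weights.transforms()

# "front_camera.jpg" is a placeholder for a single front-camera frame
frame = read_image("front_camera.jpg")

with torch.no_grad():
    detections = model([preprocess(frame)])[0]

categories = weights.meta["categories"]
for label, score, box in zip(detections["labels"], detections["scores"], detections["boxes"]):
    if score > 0.8:  # keep only confident detections
        print(categories[int(label)], f"{score.item():.2f}", [round(v) for v in box.tolist()])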

As multi-modal LLMs continue to evolve, they are bridging the gap between object detection and language comprehension. This enables models to process both visual and textual data, which is crucial in dynamic environments like road traffic. By combining image recognition and natural language processing, these models can now understand scenarios with deeper context.

How Object Detection Works in Real-Time

Consider the scenario of a child running into the street. In a split second, a multi-modal model can detect the presence of the child, identify where they are in relation to the vehicle, and assess whether the child is running toward or away from the car. This is classic image recognition happening in real time. The model has been trained on millions of images containing similar objects and situations, allowing it to recognize and react with remarkable speed.

But the sophistication of multi-modal LLMs goes beyond this. The model doesn’t just identify a child and categorize them as a pedestrian. It also processes movement and positioning, which is crucial in determining whether the child is 1 meter or 10 meters away from the vehicle. This depth of understanding is critical in ensuring a timely and effective response, such as the automatic initiation of an emergency stop.
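How such a distance judgment can be approximated from a single camera is worth a concrete illustration. A common back-of-the-envelope method is the pinhole-camera relation: distance ≈ focal length (in pixels) × real object height / bounding-box height (in pixels). The focal length, assumed child height and example pixel heights below are made-up illustration values, not calibrated parameters of any real system, which would additionally fuse LiDAR and radar measurements.

def estimate_distance_m(bbox_height_px: float,
                        real_height_m: float = 1.2,       # assumed height of a small child
                        focal_length_px: float = 1400.0   # assumed camera focal length in pixels
                        ) -> float:
    """Rough monocular estimate: distance = focal length * real height / pixel height."""
    return focal_length_px * real_height_m / bbox_height_px

# A child occupying 170 px of image height is roughly 10 m away,
# while 1700 px would put them only about 1 m in front of the car.
for pixel_height in (170, 1700):
    print(f"{pixel_height:4d} px  ->  ~{estimate_distance_m(pixel_height):.1f} m")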

How Multi-Modal Models Transform Autonomous Driving

Autonomous vehicles can now do more than just detect objects. They are learning to understand the environment through the seamless integration of visual and linguistic data.

The following example shows a typical street scene. What do current multi-modal models recognize in it? We used the following prompt with Claude, Gemini Pro Vision and GPT-4o to find out:

This is the front-view image of your car. Please describe the current traffic scenario you’re in and detect all relevant objects and whether they are near or far away

[Image: front-view camera shot of a typical street scene]
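To reproduce this experiment, the prompt and the image are sent to each model through its chat API. The snippet below shows one way to do this with the OpenAI Python SDK and GPT-4o; the file name is a placeholder, and Claude and Gemini Pro Vision require their own SDKs with a slightly different message format.

import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# "front_view.jpg" is a placeholder for the street scene shown above
with open("front_view.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

prompt = ("This is the front-view image of your car. Please describe the current "
          "traffic scenario you're in and detect all relevant objects and whether "
          "they are near or far away")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)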

The models’ answers can be found in Table 1:

Table 1: Answers of the different models.

When a multi-modal model like LLaMA 3.2 processes a situation where “a child runs,” it does more than complete the phrase “into the street.” The model has been trained to associate that phrase with actual visual objects: the “child” in the image and the corresponding “street” context. This is what makes these models particularly powerful: they are not limited to language or image recognition in isolation. Instead, they process both in tandem, which is critical for the complex decision-making required in self-driving cars.

The concept of using this synergy between text and images to advance autonomous driving is not new. In fact, it goes back to the introduction of the transformer architecture by Google researchers in 2017. Today, models like Car-GPT, LeGEND, and Lingo2 are pushing the boundaries further. These models are not only able to recognize traffic objects, but they can also describe entire scenes in natural language and predict the next best action. For instance, if a car approaches an intersection, the model can both detect the presence of traffic lights and describe the scene in real time: “A red light is present, and a pedestrian is crossing.”

Beyond Object Detection: Coordinating Action

The real power of multi-modal models in autonomous driving lies in their ability to generate actionable insights. These models don’t just detect objects or describe a scene; they evaluate situations to inform decisions. Take the child running into the street example again: after detecting the child, the model can coordinate with other systems in the vehicle to initiate an emergency brake or adjust the car’s speed based on the distance of the child.
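What such coordination can look like at the interface level is sketched below: a deliberately simplified rule that maps a detection and an estimated distance to a driving action. The thresholds and the Action type are invented for illustration; real vehicles run certified, redundant planning and braking stacks.

from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    EMERGENCY_BRAKE = auto()
    SLOW_DOWN = auto()
    CONTINUE = auto()

@dataclass
class Detection:
    label: str              # e.g. "child"
    distance_m: float       # estimated distance to the object
    moving_toward_car: bool

def decide(detection: Detection, speed_kmh: float) -> Action:
    """Toy rule: the faster the car, the earlier it must react to a
    vulnerable road user such as a child, pedestrian or cyclist."""
    if detection.label not in {"child", "pedestrian", "cyclist"}:
        return Action.CONTINUE
    brake_distance_m = max(5.0, 0.2 * speed_kmh)  # illustrative reaction distance
    if detection.distance_m < brake_distance_m:
        return Action.EMERGENCY_BRAKE
    if detection.moving_toward_car and detection.distance_m < 3 * brake_distance_m:
        return Action.SLOW_DOWN
    return Action.CONTINUE

# A child 8 m ahead at 50 km/h triggers the emergency brake
print(decide(Detection("child", distance_m=8.0, moving_toward_car=True), speed_kmh=50))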

This ability to both “see” and “understand” is critical for the next phase of autonomous driving. Multi-modal models offer a path towards a more comprehensive integration of textual and physical data, providing cars with the situational awareness required to navigate complex environments with minimal human intervention.

Try it Yourself: LLaMA 3.2 and Beyond

For those curious about how these models work in practice, experimenting with open-source models like Meta’s LLaMA 3.2 on IBM’s watsonx platform offers a hands-on glimpse into the future. By feeding in real-world data such as images of traffic scenarios and pairing them with descriptive text, you can see firsthand how these models are advancing object detection and scene interpretation.
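A minimal sketch of such an experiment with the ibm-watsonx-ai Python SDK might look as follows. The endpoint URL, model ID and message schema shown here are assumptions that may differ between watsonx regions and SDK versions, so check the platform documentation and model catalog before running it.

import base64
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Placeholder credentials and project ID; the endpoint URL and model ID are
# assumptions and may differ on your watsonx instance.
credentials = Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="YOUR_API_KEY")
model = ModelInference(
    model_id="meta-llama/llama-3-2-11b-vision-instruct",
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
)

# "traffic_scene.jpg" is a placeholder for a real-world traffic image
with open("traffic_scene.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "text",
         "text": "Describe the traffic scenario and list all relevant objects, "
                 "noting whether they are near or far away."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ],
}]

response = model.chat(messages=messages)
print(response["choices"][0]["message"]["content"])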

The Road Ahead

Multi-modal large language models represent the next leap in artificial intelligence, combining vision and language in ways that unlock new possibilities for autonomous driving. As the technology evolves, it will continue to play an essential role in making self-driving cars safer and more reliable by improving object detection, scene comprehension, and decision-making.

The future of autonomous driving lies in the hands of these advanced AI systems, and with the rapid pace of innovation, the road ahead looks incredibly promising.

Ramon Wartala
Associate Partner | IBM Consulting