Avoiding the Crash: A Vision-Language Model Evaluation of Critical Traffic Scenarios

Apr 1, 2025 · David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Mert D. Pese

Key Contributions & Takeaways

  • Demonstrates that state-of-the-art VLMs (LLaVA-7B and MoE-LLaVA) identify potential crashes 1.13 to 1.33 seconds faster than human drivers.
  • Introduces Crash Prevention Efficiency (CPE), a novel metric evaluating VLM timing and proximity performance in crash sequences.
  • Provides a comprehensive evaluation framework for ADS real-time decision support based on frame-by-frame critical scenario diagnostics.
01  ·  The Problem

AV systems rely on DNNs that require periodic retraining. When the models become outdated, they can fail to recognize novel crash scenarios, which may lead to preventable accidents.

02  ·  The Approach

Real-world dashcam crash footage is decomposed frame-by-frame. LLaVA-7B and MoE-LLaVA predict the safest driving action at each frame: accelerate, brake, turn left, turn right, or maintain the current course.
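
As a rough illustration of the per-frame labeling step, the sketch below maps a free-text model response onto one of the five actions. The keyword table, the `parse_action` and `label_frames` names, and the `ask_vlm` callable are all hypothetical and are not taken from the paper; a real evaluation would likely need a more careful parser or a constrained prompt.

```python
from typing import Callable, Iterable, List

# Hypothetical keyword cues per action (an assumption, not the paper's parser).
# Order matters: the first action with a matching cue wins.
KEYWORDS = {
    "brake": ("brake", "slow down", "stop"),
    "accelerate": ("accelerate", "speed up"),
    "turn left": ("turn left", "steer left"),
    "turn right": ("turn right", "steer right"),
}


def parse_action(response: str) -> str:
    """Map a free-text VLM response to one of the five driving actions."""
    text = response.lower()
    for action, cues in KEYWORDS.items():
        if any(cue in text for cue in cues):
            return action
    # No cue found: treat the response as "maintain the current course".
    return "maintain"


def label_frames(frames: Iterable[object],
                 ask_vlm: Callable[[object], str]) -> List[str]:
    """Query a (hypothetical) VLM callable once per frame and parse each reply."""
    return [parse_action(ask_vlm(frame)) for frame in frames]
```

Here `ask_vlm` stands in for whatever call sends one frame plus a prompt to LLaVA-7B or MoE-LLaVA; it is left abstract because the page does not describe the inference interface.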

03  ·  The Metric: CPE

Crash Prevention Efficiency scores how early and how precisely a model detects the threat, measured against the point of no return (tPNR) and crash point (tx).

Frame-by-Frame VLM Analysis
MoE-LLaVA's real-time decision-making across two critical crash scenarios, from safe driving to the moment of impact.
Q1 · Can VLMs Overcome Human Drivers?
Red Light Violation · Dashcam · victim vehicle POV · VLMs detected threat 1.33 s earlier than humans
t₀ · Start → tR · Detection → tPNR · No Return → tx · Crash
Safe Zone
Start of video, approaching intersection at 8 mph
MoE-LLaVA

"You are driving at 8 mph approaching a four-way intersection."

VLM Detects Threat
Detection, car approaching crosswalk
MoE-LLaVA

"A car is approaching a crosswalk – exercise caution at the intersection."

Point of No Return
Point of no return, red-light violator in intersection
MoE-LLaVA

"Slow down – maintain a safe distance from the white car in front of you."

Crash Point
Crash, collision occurring
MoE-LLaVA

"You should slow down immediately – the situation ahead is critical."

Q2 · Can VLMs Overcome DNN-Based AV Systems?
Flipped Truck · Highway at Night · Tesla dashcam footage · VLMs detected hazard before braking initiated
t₀ · Start → tR · Detection → tPNR · No Return → tx · Crash
Safe Zone
Start of video, dark highway at night, 57 mph
MoE-LLaVA

"You are driving at 57 mph on a dark highway at night – maintain safe following distance."

VLM Detects Threat
Detection, obstacle visible on highway
MoE-LLaVA

"Be aware of potential obstacles or traffic conditions ahead on the highway."

Point of No Return
Point of no return, flipped truck blocking highway
MoE-LLaVA

"There is a truck on the road in a precarious position – slow down to avoid a collision."

Crash Point
Crash, Tesla collides with flipped truck
MoE-LLaVA

"Brake immediately – there is an obstruction across the full width of the road."

Timeline zones reflect each scenario's Crash Prevention Efficiency (CPE) intervals: the window between VLM detection (tR), the point of no return (tPNR), and impact (tx). Frames extracted from real-world dashcam footage used in the study.
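
The four timeline markers can be read as boundaries between zones. The following is a minimal sketch of that reading, assuming the ordering t₀ ≤ tR ≤ tPNR ≤ tx shown in the two timelines above; the `Timeline` class, the zone names, and any timestamps passed to it are illustrative and do not come from the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeline:
    """Scenario markers in seconds from the start of the clip (hypothetical helper)."""
    t0: float      # start of video
    t_r: float     # VLM detection
    t_pnr: float   # point of no return
    t_x: float     # crash

    def __post_init__(self) -> None:
        # Assumption: markers are ordered as in the timelines on this page.
        if not (self.t0 <= self.t_r <= self.t_pnr <= self.t_x):
            raise ValueError("expected t0 <= tR <= tPNR <= tx")

    def zone(self, t: float) -> str:
        """Return the zone that timestamp t falls into."""
        if t < self.t0 or t > self.t_x:
            raise ValueError("t lies outside the clip")
        if t < self.t_r:
            return "safe"
        if t < self.t_pnr:
            return "detected"
        if t < self.t_x:
            return "no return"
        return "crash"

    def reaction_window(self) -> float:
        """Seconds between detection and the point of no return."""
        return self.t_pnr - self.t_r
```

For example, `Timeline(0.0, 2.0, 3.5, 5.0).zone(4.0)` falls in the "no return" zone under these made-up timestamps.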

Crash Prevention Efficiency (CPE)
A novel metric measuring how early and precisely a model acts relative to the Point of No Return (tb, labeled tPNR in the timelines above).
Formula
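$$
\mathrm{CPE} =
\begin{cases}
e^{-\alpha (t_b - t_c)} & \text{if } t_c < t_b \quad \text{(early)} \\
1 & \text{if } t_c = t_b \quad \text{(on time)} \\
-e^{-\beta (t_c - t_b)} & \text{if } t_c > t_b \quad \text{(late)}
\end{cases}
$$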
Variables
  • tb = d/v: Point of No Return time
  • tc = td + tr: detection time plus reaction time
  • α: scaling factor for early actions
  • β: scaling factor for late actions; β > α
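
The piecewise definition translates directly into code. The sketch below follows the formula and variables as stated on this page; the function names and the default values of `alpha` and `beta` are placeholders chosen only to satisfy β > α, not the values used in the paper, and reading d as a distance and v as a speed is an assumption.

```python
import math


def point_of_no_return(d: float, v: float) -> float:
    """tb = d / v (assumed here: d is a distance and v a speed in matching units)."""
    if v <= 0:
        raise ValueError("v must be positive")
    return d / v


def action_time(t_d: float, t_r: float) -> float:
    """tc = td + tr: detection time plus reaction time."""
    return t_d + t_r


def cpe(t_c: float, t_b: float, alpha: float = 0.3, beta: float = 0.6) -> float:
    """Crash Prevention Efficiency per the piecewise formula on this page.

    alpha and beta are placeholder values (assumption); only beta > alpha
    is taken from the source.
    """
    if not 0 < alpha < beta:
        raise ValueError("expected 0 < alpha < beta")
    if t_c < t_b:                                   # early action
        return math.exp(-alpha * (t_b - t_c))
    if t_c > t_b:                                   # late action
        return -math.exp(-beta * (t_c - t_b))
    return 1.0                                      # exactly on time
```

With this definition, early actions always score in (0, 1], late actions in [−1, 0), and an action taken exactly at tb scores 1.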
CPE vs. Detection Timing
[Figure: CPE (vertical axis, −1 to 1) plotted against time t, with an Early Action region and a Late Action region. Marked values: LLaVA-7B CPE ≈ 0.60, MoE-LLaVA CPE ≈ 0.63, AV Systems CPE ≈ −0.07.]
Score & Interpretation
−1 very early · +1 ideal · −1 very late
Results highlight
  • LLaVA-7B: avg CPE 0.42–0.82
  • MoE-LLaVA: avg CPE 0.40–0.85
  • AV Systems: avg CPE −0.13 to −0.002
Abstract

"LLaVA-7B and MoE-LLaVA identified potential crash scenarios 1.13 to 1.33 seconds earlier than human drivers, highlighting their potential role in autonomous driving systems."

Autonomous Vehicles (AVs) have transformed transportation by reducing human error and enhancing traffic efficiency, driven by deep neural network (DNN) models that power image classification and object detection. However, to maintain optimal performance, these models require periodic re-training; failure to do so can result in malfunctions that may lead to accidents. Recently, Vision-Language Models (VLMs), such as LLaVA-7B and MoE-LLaVA, have emerged as powerful alternatives, capable of correlating visual and textual data with a high degree of accuracy. These models’ robustness and ability to generalize across diverse environments make them especially suited to analyzing complex driving scenarios like crashes. To evaluate the decision-making capabilities of these models across common crash scenarios, a set of real-world crash incident videos was collected. By decomposing these videos into frame-by-frame images, we task the VLMs with determining the appropriate driving action at each frame: accelerate, brake, turn left, turn right, or maintain the current course. For each frame, three sets of outputs are analyzed: the actual action executed in the video, the action a human driver would likely take to avoid a crash, and the action the VLM predicts as optimal to avoid a crash. To measure and compare the effectiveness of the VLMs, we introduce a metric called Crash Prevention Efficiency (CPE), which evaluates the model’s performance in detecting crash scenarios and taking appropriate actions to avoid them. CPE assesses how well a VLM can respond to potential crashes by analyzing both the timing of the detection and the proximity to a predefined point in the crash sequence. Our findings reveal that VLMs demonstrate a high level of consistency in decision-making, with the LLaVA-7B and MoE-LLaVA models identifying potential crash scenarios 1.13 to 1.33 seconds earlier than humans. This highlights their potential role in autonomous driving systems (ADS), supporting both real-time decision-making for human drivers and fully autonomous operations.
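
One way to turn per-frame action sequences into a detection lead time is sketched below. It assumes that "detection" means the first frame labeled with an evasive action and that the clip has a known, constant frame rate; the `EVASIVE` set, the function names, and the frame rate are illustrative assumptions, not the paper's procedure.

```python
from typing import Optional, Sequence

# Assumption: these actions count as a response to a perceived threat.
EVASIVE = {"brake", "turn left", "turn right"}


def first_evasive_frame(actions: Sequence[str]) -> Optional[int]:
    """Index of the first frame labeled with an evasive action, or None."""
    for index, action in enumerate(actions):
        if action in EVASIVE:
            return index
    return None


def lead_seconds(vlm_actions: Sequence[str],
                 human_actions: Sequence[str],
                 fps: float) -> Optional[float]:
    """Seconds by which the VLM's first evasive frame precedes the human's.

    Positive means the VLM reacted earlier. Returns None if either
    sequence never contains an evasive action.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    vlm_frame = first_evasive_frame(vlm_actions)
    human_frame = first_evasive_frame(human_actions)
    if vlm_frame is None or human_frame is None:
        return None
    return (human_frame - vlm_frame) / fps
```

The same comparison could be run against the action actually executed in the video, the third per-frame output described in the abstract.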
Venue: SAE Technical Paper 2025-01-8213, WCX SAE World Congress Experience
Authors
David Fernandez
PhD Candidate in Computer Science

David Fernandez is a PhD candidate in Computer Science at Clemson University, working on safe, efficient, and explainable AI for safety-critical systems. His research spans perception, adversarial robustness, and on-device deployment of large foundation models, including LLMs and VLMs, with five first-authored publications on component-level explainability, zero-shot reasoning, and adversarial scenario analysis, alongside collaborative work on edge AI for industrial agentic systems. Much of this research is grounded in autonomous driving, where trustworthiness, latency, and robustness constraints are unforgiving, but the underlying methods transfer broadly to other high-stakes domains.

As a member of Clemson’s VIPR-GS Research Program, he develops hierarchical LLM reasoning frameworks and VLM evaluation systems for the U.S. Army’s Next Generation Combat Vehicle (NGCV) program, focusing on zero-shot reasoning and component-level explainability under real-world deployment constraints.

At BMW Group, he designs agentic AI systems for enterprise environments, building autonomous prompt optimization pipelines that enable continual agent improvement without model retraining and context-aware moderation frameworks that detect coordinated multi-turn adversarial attacks in production deployments.