Avoiding the Crash: A Vision-Language Model Evaluation of Critical Traffic Scenarios

David Fernandez; Pedram MohajerAnsari; Amir Salarpour; Mert D. Pese

doi:10.4271/2025-01-8213

Apr 1, 2025

Outstanding Student Paper

Key Contributions & Takeaways

Demonstrates that state-of-the-art VLMs (LLaVA-7B and MoE-LLaVA) identify potential crashes 1.13 to 1.33 seconds earlier than human drivers.
Introduces Crash Prevention Efficiency (CPE), a novel metric evaluating VLM timing and proximity performance in crash sequences.
Provides a comprehensive evaluation framework for ADS real-time decision support based on frame-by-frame critical scenario diagnostics.

01 · The Problem

AV systems rely on DNNs that require periodic retraining. When the models go stale, they can fail to recognize novel crash scenarios, which may lead to preventable accidents.

02 · The Approach

Real-world dashcam crash footage is decomposed frame-by-frame. At each frame, LLaVA-7B and MoE-LLaVA predict the safest driving action: accelerate, brake, turn left, turn right, or maintain the current course.
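The per-frame labeling step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `ask_model` stands in for whatever call returns a VLM's free-text reply for a frame, and the keyword rule in `parse_action` is a hypothetical way to reduce a reply to one of the action labels named in the abstract.

```python
from typing import Callable, Iterable, List, Tuple

# Action vocabulary taken from the abstract.
ACTIONS = ("accelerate", "brake", "turn left", "turn right", "maintain")


def parse_action(reply: str) -> str:
    """Map a free-text reply to one action label.

    Hypothetical keyword rule for illustration only; the page does not
    describe how replies are mapped to actions.
    """
    text = reply.lower()
    if "brake" in text or "slow down" in text or "stop" in text:
        return "brake"
    if "turn left" in text:
        return "turn left"
    if "turn right" in text:
        return "turn right"
    if "accelerate" in text or "speed up" in text:
        return "accelerate"
    return "maintain"


def label_frames(
    frames: Iterable[object],
    ask_model: Callable[[object], str],
    fps: float,
) -> List[Tuple[float, str]]:
    """Return (timestamp in seconds, action label) for every frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [
        (index / fps, parse_action(ask_model(frame)))
        for index, frame in enumerate(frames)
    ]


# Usage with a stub model: canned replies keyed by a frame identifier.
canned = {
    "f0": "Maintain your current course.",
    "f1": "Slow down, a vehicle is crossing ahead.",
}
print(label_frames(["f0", "f1"], canned.get, fps=2.0))
```

Keeping the model call behind a plain callable lets the same loop run against either VLM, or against a stub as above.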

03 · The Metric: CPE

Crash Prevention Efficiency scores how early and how precisely a model detects the threat, measured against the point of no return (t_PNR) and crash point (t_x).

Frame-by-Frame VLM Analysis

MoE-LLaVA's real-time decision-making across two critical crash scenarios, from safe driving to the moment of impact.

Q1 · Can VLMs Overcome Human Drivers?

Red Light Violation · Dashcam · victim vehicle POV · VLMs detected threat 1.33 s earlier than humans

t₀ · Start | t_R · Detection | t_PNR · No Return | t_x · Crash

Safe Zone

Start of video, approaching intersection at 8 mph

MoE-LLaVA

"You are driving at 8 mph approaching a four-way intersection."

VLM Detects Threat

MoE-LLaVA

"A car is approaching a crosswalk – exercise caution at the intersection."

Point of No Return

MoE-LLaVA

"Slow down – maintain a safe distance from the white car in front of you."

Crash Point

MoE-LLaVA

"You should slow down immediately – the situation ahead is critical."

Q2 · Can VLMs Overcome DNN-Based AV Systems?

Flipped Truck · Highway at Night · Tesla dashcam footage · VLMs detected hazard before braking was initiated

t₀ · Start | t_R · Detection | t_PNR · No Return | t_x · Crash

Safe Zone

Start of video, dark highway at night, 57 mph

MoE-LLaVA

"You are driving at 57 mph on a dark highway at night – maintain safe following distance."

VLM Detects Threat

MoE-LLaVA

"Be aware of potential obstacles or traffic conditions ahead on the highway."

Point of No Return

MoE-LLaVA

"There is a truck on the road in a precarious position – slow down to avoid a collision."

Crash Point

Crash, Tesla collides with flipped truck

MoE-LLaVA

"Brake immediately – there is an obstruction across the full width of the road."

Timeline zones reflect each scenario's Crash Prevention Efficiency (CPE) intervals: the window between VLM detection (t_R), the point of no return (t_PNR), and impact (t_x). Frames extracted from real-world dashcam footage used in the study.

Crash Prevention Efficiency (CPE)

A novel metric measuring how early and how precisely a model acts relative to the Point of No Return (t_b, written t_PNR in the timelines).

Formula

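\[
\mathrm{CPE} =
\begin{cases}
e^{-\alpha (t_b - t_c)} & \text{if } t_c < t_b \quad \text{(early)} \\
1 & \text{if } t_c = t_b \quad \text{(on time)} \\
-e^{-\beta (t_c - t_b)} & \text{if } t_c > t_b \quad \text{(late)}
\end{cases}
\]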

Variables

t_b = d/v: Point of No Return time
t_c = t_d + t_r: detection time plus reaction time
α: scaling factor for early actions
β: scaling factor for late actions; β > α
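A small sketch of the piecewise definition above may help when checking values by hand. The default values of `alpha` and `beta` are placeholders chosen only to satisfy β > α, since the page does not report the values used in the paper. Reading `d` as the distance to the hazard and `v` as the vehicle speed is likewise an assumption, because the page gives only t_b = d/v.

```python
import math


def point_of_no_return(d: float, v: float) -> float:
    """t_b = d / v (assumed here: d = distance to hazard, v = speed)."""
    if v <= 0:
        raise ValueError("v must be positive")
    return d / v


def crash_prevention_efficiency(
    t_c: float,
    t_b: float,
    alpha: float = 0.5,  # placeholder value, not from the paper
    beta: float = 1.0,   # placeholder value, not from the paper
) -> float:
    """Evaluate the piecewise CPE formula for action time t_c and PNR time t_b."""
    if beta <= alpha:
        raise ValueError("the definition requires beta > alpha")
    if t_c < t_b:
        # Early action: positive score that approaches 1 as t_c nears t_b.
        return math.exp(-alpha * (t_b - t_c))
    if t_c == t_b:
        # Action exactly at the point of no return.
        return 1.0
    # Late action: negative score.
    return -math.exp(-beta * (t_c - t_b))


t_b = point_of_no_return(d=40.0, v=10.0)        # 4.0
print(crash_prevention_efficiency(3.0, t_b))    # acts 1 s early, about 0.607
print(crash_prevention_efficiency(5.0, t_b))    # acts 1 s late, about -0.368
```

Here t_c would be computed as t_d + t_r, the detection time plus the reaction time, before being passed in.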

CPE vs. Detection Timing

Score & Interpretation

−1: very early · +1: ideal · −1: very late

Results highlight

LLaVA-7B: avg CPE 0.42–0.82

MoE-LLaVA: avg CPE 0.40–0.85

AV Systems: avg CPE −0.13 to −0.002

Last updated on Apr 1, 2025

Abstract

"LLaVA-7B and MoE-LLaVA identified potential crash scenarios 1.13 to 1.33 seconds earlier than human drivers, highlighting their potential role in autonomous driving systems."

Autonomous Vehicles (AVs) have transformed transportation by reducing human error and enhancing traffic efficiency, driven by deep neural network (DNN) models that power image classification and object detection. However, to maintain optimal performance, these models require periodic re-training; failure to do so can result in malfunctions that may lead to accidents. Recently, Vision-Language Models (VLMs), such as LLaVA-7B and MoE-LLaVA, have emerged as powerful alternatives, capable of correlating visual and textual data with a high degree of accuracy. These models’ robustness and ability to generalize across diverse environments make them especially suited to analyzing complex driving scenarios like crashes. To evaluate the decision-making capabilities of these models across common crash scenarios, a set of real-world crash incident videos was collected. By decomposing these videos into frame-by-frame images, we task the VLMs with determining the appropriate driving action at each frame: accelerate, brake, turn left, turn right, or maintain the current course. For each frame, three sets of outputs are analyzed: the actual action executed in the video, the action a human driver would likely take to avoid a crash, and the action the VLM predicts as optimal to avoid a crash. To measure and compare the effectiveness of the VLMs, we introduce a metric called Crash Prevention Efficiency (CPE) which evaluates the model’s performance in detecting crash scenarios and taking appropriate actions to avoid them. CPE assesses how well a VLM can respond to potential crashes by analyzing both the timing of the detection and the proximity to a predefined point in the crash sequence. Our findings reveal that VLMs demonstrate a high level of consistency in decision-making, with LLaVA-7B and MoE-LLaVA models identifying potential crash scenarios 1.13 to 1.33 seconds earlier than humans, respectively. This highlights their potential role in autonomous driving systems (ADS), supporting both real-time decision-making for human drivers and fully autonomous operations.
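One way to read the "seconds earlier than humans" comparison is as the gap between the first frames at which each per-frame action sequence departs from maintaining course. The sketch below works under that assumption; the detection criterion and the function names are hypothetical, and the page does not state how the paper derives its lead times.

```python
from typing import Optional, Sequence


def first_evasive_index(
    actions: Sequence[str], baseline: str = "maintain"
) -> Optional[int]:
    """Index of the first frame whose action differs from the baseline."""
    for index, action in enumerate(actions):
        if action != baseline:
            return index
    return None


def detection_lead_seconds(
    human_actions: Sequence[str],
    vlm_actions: Sequence[str],
    fps: float,
) -> Optional[float]:
    """Seconds by which the VLM's first evasive action precedes the human's.

    Positive means the VLM reacted earlier. Returns None when either
    sequence never leaves the baseline action.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    human_first = first_evasive_index(human_actions)
    vlm_first = first_evasive_index(vlm_actions)
    if human_first is None or vlm_first is None:
        return None
    return (human_first - vlm_first) / fps


# Illustrative sequences, not data from the study.
human = ["maintain"] * 5 + ["brake"]
vlm = ["maintain"] * 2 + ["brake"] * 4
print(detection_lead_seconds(human, vlm, fps=3.0))
```

The same helper applies to the third per-frame sequence, the action actually executed in the video, by passing it in place of `human_actions`.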

Type: Conference paper

Venue: SAE Technical Paper 2025-01-8213, WCX SAE World Congress Experience

Vision Language Models · Autonomous Vehicles · Traffic Safety

Authors

David Fernandez

PhD Candidate in Computer Science

David Fernandez is a PhD candidate in Computer Science at Clemson University, working on safe, efficient, and explainable AI for safety-critical systems. His research spans perception, adversarial robustness, and on-device deployment of large foundation models, including LLMs and VLMs, with five first-authored publications on component-level explainability, zero-shot reasoning, and adversarial scenario analysis, alongside collaborative work on edge AI for industrial agentic systems. Much of this research is grounded in autonomous driving, where trustworthiness, latency, and robustness constraints are unforgiving, but the underlying methods transfer broadly to other high-stakes domains.

As a member of Clemson’s VIPR-GS Research Program, he develops hierarchical LLM reasoning frameworks and VLM evaluation systems for the U.S. Army’s Next Generation Combat Vehicle (NGCV) program, focusing on zero-shot reasoning and component-level explainability under real-world deployment constraints.

At BMW Group, he designs agentic AI systems for enterprise environments, building autonomous prompt optimization pipelines that enable continual agent improvement without model retraining and context-aware moderation frameworks that detect coordinated multi-turn adversarial attacks in production deployments.
