Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

Apr 7, 2026 · David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Mert D. Pese

Key Contributions & Takeaways

  • Adversarial patches transfer across VLM architectures with 73–91% success rate; attackers need no knowledge of the deployed model to mount effective attacks.
  • Introduces a Transfer Matrix framework revealing that CLIP-based vision encoders (Dolphins, LeapVAD) drive stronger bidirectional transferability than EVA-CLIP (OmniDrive).
  • Attacks persist across 64–79% of frames throughout the critical decision window, too sustained for temporal filtering or ensemble defenses to reliably mitigate.

01  ·  The Problem

Physical adversarial patches on road signs can manipulate VLM driving decisions. Attackers typically don't know which model a vehicle uses, yet that may not matter.

02  ·  The Method

Three VLMs (Dolphins, OmniDrive, LeapVAD) are evaluated with physically realizable patches in Crosswalk and Highway scenarios using Black-Box NES optimization.

03  ·  The Finding: Transferability

Patches optimised for one model remain 73–91% effective on others. Architectural diversity alone provides limited real-world protection.

Threat Model
Black-box attacker uses only physical patches on legitimate roadside infrastructure: no model internals, no gradients, no target knowledge required.
Figure (two panels): No patch · Drives safely (clean road scene, no adversarial patch on billboard) | With patch · Manipulated (road scene with adversarial patch on billboard)
Methodology
Five-stage pipeline: black-box patch optimization → physical deployment → VLM inference → semantic homogenization → transfer evaluation
01 Patch Generation → 02 CARLA Deployment → 03 VLM Inference → 04 CLIP Homogenization → 05 Transfer Evaluation
Phase 01  ·  Patch Generation
No gradients: NES converges in 6 000 queries

Patches are optimised via Natural Evolution Strategies using a CLIP-based semantic similarity loss over the VLM output command, with no access to model weights or gradients. Expectation over Transformation (EoT) bakes in physical robustness across viewing angles and lighting conditions.

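// Semantic adversarial loss (CLIP text embedding space)
$$L(\delta) = 1 - \cos\big(E_T(f(x \oplus M\delta)),\; E_T(f(x))\big)$$
// NES gradient estimate & patch update
$$g_k = \frac{1}{n\sigma} \sum_{i=1}^{n} \varepsilon_i \, L(\delta_k + \sigma\varepsilon_i)$$
$$\delta_{k+1} = \mathrm{clip}\big(\delta_k + \alpha\, g_k,\; [0,1]\big)$$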
Zero gradient access · 6 000 queries / model · EoT physical robustness · CARLA simulation
Figure: optimization loop. CARLA renders the patch in simulation → VLM inference on the patched scene → CLIP loss (semantic drift, cosine similarity ↓) → NES update (evolve patch δ, gradient-free, 6 000 queries / model) → adversarial patch δ* (physically deployable on a billboard)
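The NES estimate and clipped update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `toy_loss` is a hypothetical stand-in for the black-box query (render patch, run the VLM, score semantic drift), the function names and hyperparameters are assumptions, and the mean-subtracted baseline is a common variance-reduction step that is not part of the formula on this page. It leaves the expected gradient unchanged because the perturbations have zero mean.

```python
import numpy as np

def nes_gradient(loss_fn, delta, rng, n=100, sigma=0.1):
    """Estimate the gradient of loss_fn at delta from n black-box queries."""
    eps = rng.standard_normal((n, *delta.shape))        # Gaussian search directions
    losses = np.array([loss_fn(delta + sigma * e) for e in eps])
    losses = losses - losses.mean()                     # baseline (assumption, see lead-in)
    weights = losses.reshape((n,) + (1,) * delta.ndim)  # broadcast one loss per direction
    return (eps * weights).sum(axis=0) / (n * sigma)    # g_k = (n*sigma)^-1 * sum eps_i * L_i

def nes_attack(loss_fn, delta0, steps=60, alpha=0.5, seed=0, n=100, sigma=0.1):
    """Gradient-free ascent on loss_fn, keeping the patch inside [0, 1]."""
    rng = np.random.default_rng(seed)
    delta = delta0.copy()
    for _ in range(steps):
        g = nes_gradient(loss_fn, delta, rng, n=n, sigma=sigma)
        delta = np.clip(delta + alpha * g, 0.0, 1.0)    # delta_{k+1} = clip(delta_k + alpha * g_k)
    return delta

# Hypothetical stand-in for the real query: the loss peaks when the patch matches `target`.
target = np.full((4, 4), 0.8)

def toy_loss(delta):
    return 1.0 - float(np.mean((delta - target) ** 2))

delta0 = np.full((4, 4), 0.2)
delta_star = nes_attack(toy_loss, delta0)
print(toy_loss(delta0), toy_loss(delta_star))
```

With 60 steps of 100 queries each, the toy run spends 6 000 queries, matching the stated budget only for illustration; how the budget is split between population size and iterations is not given here.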
Adversarial patch δ*  →  deployed on simulated billboard
Phase 02  ·  CARLA Deployment
Patch rendered on a simulated billboard; VLM vehicles drive past it

The optimised patch δ* is placed on a roadside billboard inside the CARLA simulator. Ego-vehicles equipped with the target VLMs drive toward it under varied lighting and camera angles, producing the inference inputs for Phase 03.

S1 · Crosswalk: Bus Stop
Patch on shelter billboard at pedestrian crossing
S2 · Highway: Billboard
Patch on roadside billboard at highway speed
Phase 03  ·  VLM Inference
Three architectures receive the patched scene and produce driving decisions
Dolphins
OpenFlamingo-based
CLIP shared
Figure: Dolphins pipeline. Multi-frame video input → CLIP ViT-L/14 → Perceiver Resampler (compresses visual tokens) → cross-attention (Flamingo) → MPT-7B (frozen LM) → Grounded Chain-of-Thought output
Vision Encoder
CLIP ViT-L/14
InfoNCE contrastive  ·  24 layers  ·  1024-d
↓ Perceiver Resampler
Connector
Cross-attention (Flamingo)
Q: language tokens  ·  KV: visual features  ·  GCoT
Language Model
MPT-7B
6.7B params  ·  Grounded Chain of Thought
OmniDrive
Omni-L & Omni-Q variants
Distinct
Figure: OmniDrive pipeline. Cam 1 … Cam N → EVA-02-L (multi-view, 3D positional encoding) → MLP (Omni-L) or Q-Former (Omni-Q) → LLaMA2-7B (causal decoder) → 3D detection + language output
Vision Encoder
EVA-02-L
MIM distills CLIP  ·  multi-view N×C×H×W  ·  3D pos. enc.
↓ projected features
Connector
MLP or Q-Former
Omni-L: linear (LLaVA-style)  ·  Omni-Q: carrier + perception queries
Language Model
LLaMA2-7B
4096-d embeddings  ·  3D detection + language tasks
LeapVAD
Dual-process system
CLIP shared
Figure: LeapVAD pipeline. Multi-view, multi-frame input → Qwen-VL-7B (CLIP-based ViT, scene VLM) → scene description → Scene Encoder (256-d: ACT 128 + ACC 128) → Memory Bank (top-k cosine retrieval) → Sys-II: GPT-4o (analytical, logical) and Sys-I: 1.8B (heuristic, 5× faster) → control signal output
Vision Layer
Qwen-VL-7B
CLIP-based ViT  ·  multi-view + multi-frame  ·  7.1K CARLA fine-tune
↓ Scene Encoder → 256-d (ACT 128-d + ACC 128-d)
Memory System
Contrastive Scene Tokens
ACT + ACC spaces  ·  top-k cosine retrieval from memory bank
Dual-Process Reasoning
GPT-4o + Qwen1.5-1.8B
System-II: GPT-4o (analytical)  ·  System-I: 1.8B (5× faster)
VLM inference outputs → CLIP embedding space
★ Novel
Phase 04  ·  CLIP Homogenization
Architecture-agnostic evaluation via shared embedding space

VLMs generate heterogeneous outputs across different vocabularies and instruction formats. The key methodological contribution is projecting all outputs through a shared CLIP text embedding space, enabling direct cross-architecture semantic comparison. CLIP-family vision encoders create a shared visual attack surface, explaining the high bidirectional transfer between Dolphins and LeapVAD (0.82–0.91).

Figure: diverse outputs → projection → shared semantic space. Dolphins: "Turn left at crossroad" (natural language, GCoT). OmniDrive: "Ped@0.4m · brake·2.3m" (3D bbox + structured cmd). LeapVAD: "decel 2.5 m/s² · 8 tokens" (control signals + memory). All outputs pass through the CLIP text encoder into a shared ℝ⁵¹² space for cosine similarity comparison. CLIP-family (Dolphins, LeapVAD): 0.82–0.91. OmniDrive (EVA-02-L): 0.42–0.57 (partial).
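The homogenization idea, embedding every model's output text in one space and comparing by cosine similarity, can be sketched as follows. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical hashed bag-of-words stand-in for the CLIP text encoder (so the example runs without model weights), and the function names are not from the paper. A real pipeline would replace `toy_embed` with an actual CLIP text encoder.

```python
import zlib
import numpy as np

def toy_embed(text, dim=64):
    """Hypothetical stand-in for a CLIP text encoder: hashed bag of words."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0  # deterministic bucket per word
    return vec

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_drift(embed, clean_output, patched_output):
    """1 - cos(E_T(clean), E_T(patched)): 0 means the command is unchanged."""
    return 1.0 - cosine(embed(clean_output), embed(patched_output))

# Outputs in different formats land in the same vector space and become comparable.
clean = "Turn left at crossroad"
same = semantic_drift(toy_embed, clean, clean)
changed = semantic_drift(toy_embed, clean, "brake for pedestrian ahead")
print(same, changed)
```

Because the comparison happens after embedding, the same `semantic_drift` call works whether a model emits free text, structured commands, or control tokens rendered as text, which is the property the evaluation relies on.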
Why shared encoders create a shared attack surface
Dolphins (OpenFlamingo, Perceiver Resampler + cross-attention) and LeapVAD (Qwen-VL-7B) both ground vision in a CLIP-aligned InfoNCE contrastive manifold, so patches that maximise cosine drift in this space transfer directly between them (0.82–0.91). OmniDrive’s EVA-02-L is trained with masked image modeling (MIM) that distills CLIP, producing a partially distinct feature manifold. Its Q-Former connector (Omni-Q variant) further decouples spatial features during language alignment, routing the adversarial signal through a different pathway. Together these explain the lower but still substantial cross-cluster transfer.
  • $\mathrm{TR}_{ij}$ · Transfer rate
  • $\mathrm{TS}_{i}$ · Outgoing generalisation
  • Frame ASR · Temporal persistence
  • $\mathrm{VS}_{j}$ · Incoming susceptibility
CLIP embedding similarities  →  transfer matrix
Phase 05  ·  Transfer Evaluation
Two driving scenarios, each tested with patches from all three VLM architectures. Transfer rates show how well attacks generalise across unseen models.
Architecture Vulnerability Ranking
Dolphins: most vulnerable to incoming attacks  ·  LeapVAD: most transferable patches  ·  CLIP encoder drives strongest bidirectional transfer
Metric | Dolphins | OmniDrive | LeapVAD | Interpretation
--- | --- | --- | --- | ---
Self-Attack ASR | 76.0% | 84.4% | 71.7% | OmniDrive hardest to fool from outside, but optimised attacks on it succeed best
Vulnerability Score | 0.82 | 0.810 | 0.814 | Dolphins most affected by patches from other architectures
Transfer-Out Rate | 0.792 | 0.772 | 0.882 | LeapVAD patches generalise best across unseen architectures
Universal Patch ASR | 64.3% | 69.8% | 63.5% | A single patch fools all three; OmniDrive most susceptible
CLIP Enc. ASR (encoder-pathway) | 79.2% | 81.5% | 76.8% | Attack success rate routed through the shared CLIP encoder pathway; confirms encoder is the primary vulnerability driver
Implications
Security

Architectural diversity is not a sufficient defense. Patches optimised for one model remain 73–91% effective on others; an attacker with physical access to a roadside billboard needs no knowledge of which model the target vehicle runs.

Design

Shared CLIP encoders create a shared attack surface. Deploying models with distinct vision encoder families (CLIP vs. non-CLIP) would structurally limit transferability; EVA-02-L’s MIM training already reduces but does not eliminate cross-cluster transfer.

Defence

Temporal filtering and ensemble voting are insufficient. Attacks persist across 64–79% of frames in the critical decision window, long enough to corrupt decisions before any frame-level filter or multi-model vote can compensate.
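A small sketch shows why a frame-level majority vote struggles once most frames are manipulated. Everything here is illustrative: the frame labels are synthetic (not experimental data), the 5-frame causal window is an assumed filter design, and the manipulated frames are assumed to be spread through the window rather than clustered.

```python
import numpy as np

def frame_persistence(manipulated):
    """Fraction of frames in the decision window whose output is manipulated."""
    return float(np.mean(np.asarray(manipulated, dtype=float)))

def causal_majority_filter(manipulated, window=5):
    """At each frame, vote over the most recent `window` frames (past frames only)."""
    m = np.asarray(manipulated, dtype=int)
    out = np.zeros(len(m), dtype=bool)
    for t in range(len(m)):
        recent = m[max(0, t - window + 1): t + 1]
        out[t] = 2 * recent.sum() > len(recent)   # strict majority says "manipulated"
    return out

# Synthetic window: 14 of 20 frames manipulated (70%), inside the reported 64-79% range.
frames = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0] * 2, dtype=bool)
raw = frame_persistence(frames)
filtered = causal_majority_filter(frames)
print(raw, frame_persistence(filtered))
```

In this synthetic case the raw persistence is 70%, and after filtering every frame is flagged as manipulated: the vote smooths away the few clean frames instead of the attack. Different frame orderings or window sizes would change the exact numbers.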

Transfer rates evaluated via $\mathrm{TR}_{ij} = \mathrm{ASR}_{ij} / \mathrm{ASR}_{ii}$, normalising cross-architecture success against the same-model baseline. Diagonal entries (self-attacks) shown in gray. All experiments conducted in CARLA simulator with physically realizable patches. Scenarios: Crosswalk/Bus Stop and Highway/Billboard.
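The normalisation of cross-architecture success by the source model's self-attack rate can be sketched as below. The ASR values are placeholders chosen for easy arithmetic, not results from the study. The row/column convention (row i = model the patch was optimised on, column j = model attacked) and the off-diagonal-mean definitions of the outgoing and incoming scores are assumptions, since the page names those scores without giving formulas.

```python
import numpy as np

def transfer_matrix(asr):
    """TR_ij = ASR_ij / ASR_ii, with row i the source model and column j the target."""
    asr = np.asarray(asr, dtype=float)
    return asr / np.diag(asr)[:, None]            # divide each row by its self-attack ASR

def off_diagonal_mean(m, axis):
    """Mean over the off-diagonal entries along the given axis."""
    m = np.asarray(m, dtype=float)
    n = m.shape[0]
    masked = np.where(np.eye(n, dtype=bool), 0.0, m)
    return masked.sum(axis=axis) / (n - 1)

# Placeholder ASR matrix (illustrative numbers only, NOT the paper's measurements).
asr = [[0.80, 0.60, 0.72],
       [0.56, 0.70, 0.63],
       [0.45, 0.54, 0.90]]

tr = transfer_matrix(asr)
ts = off_diagonal_mean(tr, axis=1)   # assumed TS_i: how well patches from model i transfer out
vs = off_diagonal_mean(tr, axis=0)   # assumed VS_j: how susceptible model j is to foreign patches
print(np.round(tr, 3), np.round(ts, 3), np.round(vs, 3))
```

Dividing by the diagonal makes every self-attack entry exactly 1, so an off-diagonal value of 0.8 reads as "80% as effective as attacking the model the patch was built for".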

Abstract

"LLaVA-7B and MoE-LLaVA identified potential crash scenarios 1.13 to 1.33 seconds earlier than human drivers, highlighting their potential role in autonomous driving systems."

Vision-language models (VLMs) are increasingly used in autonomous driving because they combine visual perception with language-based reasoning, supporting more interpretable decision-making, yet their robustness to physical adversarial attacks, especially whether such attacks transfer across different VLM architectures, is not well understood and poses a practical risk when attackers do not know which model a vehicle uses. We address this gap with a systematic cross-architecture study of adversarial transferability in VLM-based driving, evaluating three representative architectures (Dolphins, OmniDrive, and LeapVAD) using physically realizable patches placed on roadside infrastructure in both crosswalk and highway scenarios. Our transfer-matrix evaluation shows high cross-architecture effectiveness, with transfer rates of 73-91% (mean TR = 0.815 for crosswalk and 0.833 for highway) and sustained frame-level manipulation over 64.7-79.4% of the critical decision window even when patches are not optimized for the target model.
Venue: SAE Technical Paper 2026-01-0170, WCX SAE World Congress Experience
Authors
David Fernandez
PhD Candidate in Computer Science

David Fernandez is a PhD candidate in Computer Science at Clemson University, working on safe, efficient, and explainable AI for safety-critical systems. His research spans perception, adversarial robustness, and on-device deployment of large foundation models, including LLMs and VLMs, with five first-authored publications on component-level explainability, zero-shot reasoning, and adversarial scenario analysis, alongside collaborative work on edge AI for industrial agentic systems. Much of this research is grounded in autonomous driving, where trustworthiness, latency, and robustness constraints are unforgiving, but the underlying methods transfer broadly to other high-stakes domains.

As a member of Clemson’s VIPR-GS Research Program, he develops hierarchical LLM reasoning frameworks and VLM evaluation systems for the U.S. Army’s Next Generation Combat Vehicle (NGCV) program, focusing on zero-shot reasoning and component-level explainability under real-world deployment constraints.

At BMW Group, he designs agentic AI systems for enterprise environments, building autonomous prompt optimization pipelines that enable continual agent improvement without model retraining and context-aware moderation frameworks that detect coordinated multi-turn adversarial attacks in production deployments.