Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Abstract
Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned / misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving functionality of VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of diffusion model aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code are available at https://github.com/JarvisUSTC/DiffPure-RobustVLM.
Community
Contributions Summary:
First systematic analysis showing that mainstream VLMs are not robust to Gaussian noise perturbations.
Propose Robust-VLGuard, a new dataset with image-text misalignment and Gaussian noise augmentation for fine-tuning VLMs.
Introduce DiffPure-VLM, a defense framework that leverages diffusion models to convert adversarial noise into Gaussian-like noise, improving robustness against optimization-based attacks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models (2025)
- Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks (2025)
- Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training (2025)
- HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States (2025)
- Survey of Adversarial Robustness in Multimodal Large Language Models (2025)
- Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs (2025)
- Universal Adversarial Attack on Aligned Multimodal LLMs (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 3
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper