A curated and structured collection of papers on hallucination in Video Large Language Models (Vid-LLMs), covering 19 evaluation benchmarks and 23 mitigation methods. Automatically updated monthly via arXiv search.
📄 Based on the survey: Distorted or Fabricated? A Survey on Hallucination in Video LLMs
- [2026/03] 🤖 Automated monthly arXiv paper update is now live! A GitHub Action runs on the 1st of each month to find new video hallucination papers and commits them directly to the main branch. Newly discovered papers that have not yet been classified can be found in `new_papers.md`.
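The Action itself is not shown in this README. As a rough illustration of how such a monthly job could query arXiv (the function name, search terms, and parameters below are assumptions for the sketch, not the repository's actual pipeline), the public arXiv API accepts a `search_query` over all fields, sorted by submission date:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_query(terms, max_results=50):
    """Build an arXiv API URL that requires every term to appear,
    returning the newest submissions first."""
    search = " AND ".join(f'all:"{t}"' for t in terms)
    params = {
        "search_query": search,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

# Example: papers mentioning both "video" and "hallucination"
url = build_arxiv_query(["video", "hallucination"])
print(url)
```

Fetching this URL returns an Atom feed whose entries can then be diffed against the papers already listed here before being appended to `new_papers.md`.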
📖 Table of Contents
📋 Taxonomy of Video Hallucinations
📊 Evaluation Benchmarks — 19 benchmarks
🔵 Spatiotemporal Dynamics
🟢 Referential Inconsistency
🟠 Context-Driven Fabrication
🟣 Audio-Visual Conflict
🛠️ Mitigation Strategies — 23 methods
🔵 Spatiotemporal Dynamics
🟢 Referential Inconsistency
🟠 Context-Driven Fabrication
🟣 Audio-Visual Conflict
🤝 Contributing
We propose a mechanism-driven taxonomy that classifies hallucinations in Video Large Language Models (Vid-LLMs) into two primary types:
- 🔷 **Dynamic Distortion**: the model correctly detects entities but misrepresents their temporal progression or referential consistency.
- 🔶 **Content Fabrication**: the model produces outputs that lack grounding in visual evidence and are instead influenced by learned priors.

*Mechanism-driven taxonomy of Vid-LLM hallucinations. Solid fill = benchmarks; striped fill = mitigation methods.*
Note
Benchmarks are organized by our mechanism-driven taxonomy. Each entry includes venue, date, and links to code/project pages where available.
Legend: entries in the Code column link to a project page or a GitHub repository; `-` = not available.
Event Misordering (4 papers)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | VidHalluc | CVPR 2025 | 12/2024 | |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | HAVEN | arXiv 2025 | 03/2025 | |
| MHBench: Demystifying Motion Hallucination in VideoLLMs | MHBench | AAAI 2025 | 01/2025 | |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | ARGUS | ICCV 2025 | 06/2025 | |
Duration Distortion (2 papers)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | VideoHallucer | arXiv 2024 | 06/2024 | |
| Online Video Understanding: OVBench and VideoChat-Online | OVBench | CVPR 2025 | 01/2025 | |
Frequency Confusion (2 papers)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| VidHal: Benchmarking Temporal Hallucinations in Vision LLMs | VidHal | arXiv 2024 | 11/2024 | |
| Vript: A Video Is Worth Thousands of Words | Vript | NeurIPS 2024 | 06/2024 | |
Character Conflation (2 papers)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding | EGOILLUSION | EMNLP 2025 | 11/2025 | |
| MESH: Measuring Hallucinations in Large Video Models | MESH | ACM MM 2025 | 09/2025 | |
Scene Conflation (1 paper)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding | ELV-Halluc | arXiv 2025 | 08/2025 | |
Object-Action Hallucination (2 papers)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding | VideoHallu | NeurIPS 2025 | 05/2025 | |
| Models See Hallucinations: Evaluating the Factuality in Video Captioning | FactVC | EMNLP 2023 | 03/2023 | |
Scene-Event Hallucination (3 papers)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| EventHallusion: Diagnosing Event Hallucinations in Video LLMs | EventHallusion | arXiv 2024 | 09/2024 | |
| NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models | NOAH | arXiv 2025 | 11/2025 | |
| RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives | RoadSocial | CVPR 2025 | 02/2025 | |
Action Attribution (2 papers)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models | AVHBench | ICLR 2025 | 10/2024 | |
| The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio | CMM | arXiv 2024 | 10/2024 | |
Emotion Inference (1 paper)
| Title | Benchmark | Venue | Date | Code |
|---|---|---|---|---|
| EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models | EmotionHallucer | arXiv 2025 | 05/2025 | |
Note
Methods are classified by the type of hallucination they target. The Training-Free column indicates whether the method is training-free (✔︎) or requires additional training (✘).
Event Misordering (3 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| SEASON: Mitigating Temporal Hallucination in Video LLMs via Self-Diagnostic Contrastive Decoding | SEASON | arXiv 2025 | 12/2025 | ✔︎ | - |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Video-thinking (TDPO) | arXiv 2025 | 03/2025 | ✘ | |
| SmartSight: Mitigating Hallucination in Video-LLMs via Temporal Attention Collapse | SmartSight | AAAI 2026 | 12/2025 | ✔︎ | - |
Duration Distortion (3 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| Temporal Insight Enhancement: Mitigating Temporal Hallucination in Video Understanding by MLLMs | Temporal Insight | ICPR 2024 | 01/2024 | ✔︎ | - |
| VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | DINO-HEAL | CVPR 2025 | 12/2024 | ✔︎ | |
| Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering | TAAE | arXiv 2025 | 05/2025 | ✘ | - |
Frequency Confusion (2 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | VTG-LLM | AAAI 2025 | 05/2024 | ✘ | |
| Vript: A Video Is Worth Thousands of Words | Vriptor | NeurIPS 2024 | 06/2024 | ✘ | |
Character Conflation (2 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens | Vista-LLaMA | CVPR 2024 | 12/2023 | ✘ | |
| Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding | VideoPLR | arXiv 2025 | 11/2025 | ✘ | |
Scene Conflation (2 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding | ELV-Halluc-DPO | arXiv 2025 | 08/2025 | ✘ | |
| Online Video Understanding: OVBench and VideoChat-Online | VideoChat-Online | CVPR 2025 | 01/2025 | ✘ | |
Object-Action Hallucination (2 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment | SANTA | WACV 2026 | 12/2025 | ✘ | |
| EventHallusion: Diagnosing Event Hallucinations in Video LLMs | TCD | arXiv 2024 | 09/2024 | ✔︎ | |
Scene-Event Hallucination (3 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations | MASH-VLM | CVPR 2025 | 03/2025 | ✘ | - |
| PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning | PaMi-VDPO | arXiv 2025 | 04/2025 | ✘ | - |
| Hallucination Reduction in Video-Language Models via Hierarchical Multimodal Consistency | MMA | IJCAI 2025 | 08/2025 | ✘ | - |
Both Object-Action & Scene-Event (2 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models | VistaDPO | ICML 2025 | 04/2025 | ✘ | |
| VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding | VideoHallu-GRPO | NeurIPS 2025 | 05/2025 | ✘ | |
Action Attribution (2 papers)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models | AVHModel-Align-FT | ICLR 2025 | 10/2024 | ✘ | |
| AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding | AVCD | NeurIPS 2025 | 05/2025 | ✔︎ | |
Emotion Inference (1 paper)
| Title | Method | Venue | Date | Training-Free | Code |
|---|---|---|---|---|---|
| EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models | PEP-MEK | arXiv 2025 | 05/2025 | ✔︎ | |
Tip
We welcome contributions from the community! Here's how you can help:
🔀 Pull Request — Add new papers, update code links, or correct errors
🐛 Open an Issue — Report mistakes, suggest missing papers, or request features
📝 PR Format Guide
Please follow this table structure when adding new entries (for mitigation methods, also include the Training-Free column with ✔︎/✘):
| [**Paper Title**](paper_link) | Method/Benchmark Name | Venue | MM/YYYY | [code](code_link) |
If you find this repository helpful, please consider giving it a ⭐
Maintained by the SmileLab team at Northeastern University.
