MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
Recent advances in multimodal large language models (MLLMs) have brought remarkable progress in video understanding.
However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios.
🎬 MT-Video-Bench fills this gap.
It emphasizes cross-scene reasoning, long-range dependencies, and interactive adaptability, thereby aligning closely with real-world application demands.
Figure 1. Illustration of multi-turn dialogues under single-scene and cross-scene settings. The evaluated questions corresponding to tasks are marked with underlining, and the scenes involved in the entire multi-turn dialogues are marked with blue dotted boxes.
Key information about MT-Video-Bench:
- 📌 135 videos from 5 major categories & 23 subcategories
- 💬 987 dialogues (each with 5–8 turns) and 5,805 QA pairs for evaluating six core abilities:
  - Object Reference
  - Memory Recall
  - Content Summary
  - Answer Refusal
  - Topic Shifting
  - Proactive Interaction
- 🧮 Long-video evaluation: durations up to 20 minutes
- 🧠 Very challenging: even the 🥇 best-performing model achieves only ⚠️ 68.45% overall accuracy, underscoring the considerable difficulty of this benchmark.
Figure 2. MT-Video-Bench covers a broad range of topics across five main categories: Movie, TV, Sports, Knowledge, and Life Record, each with multiple sub-topics, ensuring a diverse and balanced data distribution.
MT-Video-Bench is a new multi-turn video understanding benchmark; the comparison below shows how it differs from existing video-language benchmarks.
Figure 3. Comparison with other benchmarks. Avg. Q/V - the average number of QA pairs per video. Long - whether the average video length is greater than 10 minutes. Cross-Scene - whether the dialogue covers more than 4 scenes.
A glance at how MT-Video-Bench was built👇
- 🔎 Video Collection & Single-Scene Splitting: Manually collect videos → split into short clips using PySceneDetect → generate captions for each clip → merge related clips based on captions to form coherent single-scene videos (illustrative code sketches of the splitting and cross-scene merging steps follow Figure 4).
- 🧾 Cross-Scene Video Merging: Extract key frames → perform object detection → build a dynamic object memory bank → retrieve and merge segments sharing common objects or themes.
- 📦 Multi-Turn Dialogue Generation: Use Gemini 2.5 to automatically generate single-scene and cross-scene multi-turn dialogues → select the most suitable task for each scene → design cross-scene questions with an object-centered approach.
- 🚦 Human Quality Control: Remove cases with information leakage → manually verify QA alignment, factual correctness, and difficulty → ensure high-quality, contextually coherent multi-turn dialogues.
Figure 4. Data construction and refinement pipeline of MT-Video-Bench.
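Since the splitting step above names PySceneDetect, here is a minimal sketch of how such a split might look, assuming the PySceneDetect 0.6+ API and ffmpeg on PATH; the detection threshold and output handling are illustrative defaults, not the benchmark's exact settings.

```python
# Minimal sketch of single-scene splitting with PySceneDetect (v0.6+ API).
# The threshold and output naming are illustrative assumptions.
from scenedetect import ContentDetector, detect, split_video_ffmpeg

def split_into_scenes(video_path: str, threshold: float = 27.0):
    # Detect scene boundaries via content-based frame differencing.
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    # Write one clip per detected scene (requires ffmpeg on PATH).
    split_video_ffmpeg(video_path, scene_list)
    return [(start.get_seconds(), end.get_seconds()) for start, end in scene_list]

if __name__ == "__main__":
    for start, end in split_into_scenes("example_video.mp4"):
        print(f"scene: {start:.1f}s - {end:.1f}s")
```

The cross-scene merging step can be pictured as a small object memory bank that maps detected object labels to the clips they appear in; clips sharing an object are then grouped into one cross-scene video. The sketch below is hypothetical (the function names, greedy grouping, and toy detections are assumptions) and omits key-frame extraction, the object detector, and theme-level matching.

```python
# Hypothetical sketch of an object memory bank for cross-scene merging.
from collections import defaultdict

def build_object_memory_bank(clip_objects: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each detected object label to the set of clip IDs it appears in."""
    bank: dict[str, set[str]] = defaultdict(set)
    for clip_id, objects in clip_objects.items():
        for obj in objects:
            bank[obj].add(clip_id)
    return bank

def group_clips_by_shared_objects(clip_objects: dict[str, set[str]]) -> list[set[str]]:
    """Greedily group clips that share at least one detected object."""
    bank = build_object_memory_bank(clip_objects)
    groups: list[set[str]] = []
    for clips in bank.values():
        if len(clips) < 2:
            continue  # an object seen in only one clip cannot link scenes
        for group in groups:
            if group & clips:
                group |= clips  # extend an existing cross-scene group
                break
        else:
            groups.append(set(clips))
    return groups

# Toy detections: clip_1 and clip_3 share a "red car", so they merge into one
# cross-scene video; clip_2 shares nothing and stays single-scene.
clips = {
    "clip_1": {"red car", "man"},
    "clip_2": {"dog"},
    "clip_3": {"red car", "street"},
}
print(group_clips_by_shared_objects(clips))  # [{'clip_1', 'clip_3'}]
```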
Our dataset is under the CC-BY-NC-SA-4.0 license.
We do not own the copyright of any raw video files. Currently, we provide video access to researchers under the condition of acknowledging the above license. For the video data used, we respect and acknowledge any copyrights of the video authors.
If the original authors believe that their videos should be removed, please contact ynpan24@m.fudan.edu.cn or raise an issue directly.
We evaluate both closed- and open-source MLLMs on MT-Video-Bench. Closed-source models include Gemini 2.5 Pro, Gemini 2.5 Flash, and Doubao-Seed-1.6-vision, while open-source models cover 18 representative MLLMs from the Qwen2.5-VL, InternVL3.5, LLaVA, InternVideo, VideoChat, VideoLLaMA3, and MiniCPM series.
Figure 5. Evaluation results on MT-Video-Bench. "OR" - Object Reference. "MR" - Memory Recall. "CS" - Content Summary. "AR" - Answer Refusal. "TS" - Topic Shifting. "PI" - Proactive Interaction.
📦 More results can be seen here.
Figure 6. Performance comparison of Qwen2.5-VL-7B, InternVL3.5-8B (Think), and Gemini 2.5 Pro across various tasks under single-scene and cross-scene settings.
Figure 7. Performance comparison of four MLLMs across diverse video lengths.
Figure 8. Performance comparison of golden context, self-predicted context, and without context for the Qwen2.5-VL-7B model.
Figure 9. Ablation results of frames on different abilities. (a) Performance of Object Reference, Memory Recall, Content Summary, and Proactive Interaction; (b) Performance of Answer Refusal and Topic Shifting.
Figure 10. Ablation results of resolutions on different abilities.
We take the InternVL3.5 model as an example and provide the inference script. You can run:
```bash
python infer_internvl.py --model_type internvl4b
```

To evaluate the inference results, use the following command:

```bash
python eval.py --model_type internvl4b
```

If you find MT-Video-Bench useful for your research, please cite:
@misc{pan2025mtvideobenchholisticvideounderstanding,
title={MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues},
author={Yaning Pan and Zekun Wang and Qianqian Xie and Yongqian Wen and Yuanxing Zhang and Guohui Zhang and Haoxuan Hu and Zhiyu Pan and Yibing Huang and Zhidong Gan and Yonghong Lin and An Ping and Tianhao Peng and Jiaheng Liu},
year={2025},
eprint={2510.17722},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.17722},
}