🖋 Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Xiang Yue, Bo Li, Yuanhan Zhang, and Ziwei Liu
- [2025-2] 🎉🎉 We update the Video-MMMU leaderboard to include Qwen-2.5-VL-72B, Qwen-2.5-VL-7B, mPLUG-Owl3-7B, InternVideo2.5-Chat-8B, and VideoChat-Flash-7B@448.
- [2025-1] 🎉🎉 We introduce Video-MMMU, a massive, multi-modal, multi-disciplinary video benchmark that evaluates knowledge acquisition from educational videos.
Video-MMMU is the first benchmark to assess knowledge acquisition from educational videos, evaluating how well LMMs learn new knowledge from videos and apply what they learn in practice.
Video-MMMU features 300 lecture-style videos spanning 30 subjects across 6 professional disciplines: Art, Business, Science, Medicine, Humanities, and Engineering.
Each video is accompanied by 3 QA pairs, designed to evaluate video-based learning at different cognitive levels:
- Perception – Identifying key information.
- Comprehension – Understanding underlying concepts.
- Adaptation – Applying knowledge to new scenarios.
This results in 900 question-answer pairs (300 videos × 3 QA pairs per video), systematically measuring a model's ability to acquire and apply knowledge from videos.
Representative question types for each track (panel positions refer to the accompanying figure):
Perception
- ASR (Automatic Speech Recognition): The Art category (top left).
- OCR (Optical Character Recognition): The Business category (bottom left).
Comprehension
- Concept Comprehension: The Humanities category (top center).
- Problem-Solving Strategy Comprehension: The Science category (bottom center).
Adaptation
- Case Study Analysis: The Medicine category (top right).
- Problem-Solving Strategy Adaptation: The Engineering category (bottom right).
Traditional VideoQA benchmarks focus primarily on scene-based understanding, evaluating how well models interpret visual content. Video-MMMU takes a different approach—it is the first to treat videos as a source of knowledge, assessing how effectively large multimodal models (LMMs) acquire and apply information from educational videos.
A key novelty of Video-MMMU is that it evaluates not only a model's absolute accuracy but also its improvement after learning from a video. A model may initially fail an exam question; we then provide a video from which a human could learn to solve that question, and test whether the model improves after watching it. To quantify this knowledge gain, Video-MMMU introduces Δknowledge, the improvement on practice exam questions (the Adaptation track) after watching the video.
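One way to make this concrete (the exact normalization below is our assumption, not stated in this README) is to read Δknowledge as the Adaptation-track gain normalized by the headroom left above the no-video baseline:

$$
\Delta_{\mathrm{knowledge}} = \frac{\mathrm{Acc}_{\mathrm{after}} - \mathrm{Acc}_{\mathrm{before}}}{100\% - \mathrm{Acc}_{\mathrm{before}}} \times 100\%
$$

where Acc_before is the accuracy on the Adaptation-track questions without the video (the question_only setting described below) and Acc_after is the accuracy after watching the video.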
The evaluation of Video-MMMU is integrated into LMMs-Eval. Detailed instructions for running the evaluation are provided below.
For general usage, you can install the package from PyPI by running the following command:
pip install lmms-eval
For development, you can install the package by cloning the repository and running the following commands:
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
If you want to test LLaVA, you will also need to clone the LLaVA-NeXT repository and install it:
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
We use LLaVA-OneVision-7B as an example in the following commands. You can change `--model` and `--model_args` based on your requirements.
Evaluation of LLaVA-OneVision on Video-MMMU (all 3 tracks)
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
--tasks video_mmmu \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
Evaluate a single track of Video-MMMU
Perception track:
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
--tasks video_mmmu_perception \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
Comprehension track:
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
--tasks video_mmmu_comprehension \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
Adaptation track:
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
--tasks video_mmmu_adaptation \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
Evaluate the question_only track of Video-MMMU -- Knowledge Acquisition Experiment (Δknowledge)
The question_only track consists of 2-second videos that contain the image associated with the Adaptation track question. This serves as the no-video baseline for Δknowledge.
To evaluate this setting, you can use the following command:
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=1,torch_dtype=bfloat16 \
--tasks video_mmmu_adaptation_question_only \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
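After running both the Adaptation track (previous section) and this question_only baseline, Δknowledge can be computed from the two reported accuracies. Below is a minimal sketch under the normalized-gain assumption stated in the overview; the accuracy values are placeholders, not real results.

```python
# Minimal sketch: compute Δknowledge from the two runs above.
# The normalized-gain formula is our assumption; accuracies are placeholders.

def delta_knowledge(acc_question_only: float, acc_with_video: float) -> float:
    """Knowledge gain (in %) from watching the video, normalized by the
    headroom left above the question-only baseline."""
    return (acc_with_video - acc_question_only) / (100.0 - acc_question_only) * 100.0

# Example: question_only accuracy 50.0%, Adaptation-track accuracy 55.67%
print(round(delta_knowledge(50.0, 55.67), 1))  # -> 11.3
```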
Adaptation Track setting
To ensure compatibility with LMMs-Eval, the image associated with the Adaptation track question has been appended as the last frame of the video. A prompt has also been added to inform the model that the question image is located in this final frame.
As a result, you can execute the commands from the previous section without manually interleaving the image and video.
If you prefer an interleaved format, you can manually insert the image (either the last frame of the video or the `image 1` entry from the HF dataset) into the designated placeholder `<image 1>`.
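If you do build the interleaved input yourself, the sketch below shows one way to splice the question image into the placeholder. The dataset repo id, split name, and column names used here are assumptions; please check the HF dataset card and adjust before use.

```python
# Sketch only: the dataset repo id, split, and column names are assumptions.
from datasets import load_dataset

ds = load_dataset("lmms-lab/VideoMMMU", split="adaptation")  # assumed repo/split
sample = ds[0]

question = sample["question"]        # assumed field: question text containing "<image 1>"
question_image = sample["image 1"]   # assumed field: the question image (PIL.Image)

# Split the question text around the placeholder and interleave the image between the parts.
before, _, after = question.partition("<image 1>")
interleaved_prompt = [before, question_image, after]
# Feed `interleaved_prompt` to a model API that accepts mixed lists of text and images.
```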
We evaluate various open-source and proprietary LMMs. The table below provides a detailed comparison. To submit your model results, please send an email to [email protected].
Model | Overall | Perception | Comprehension | Adaptation | Δknowledge |
---|---|---|---|---|---|
Human Expert | 74.44 | 84.33 | 78.67 | 60.33 | +33.1 |
Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 | +11.4 |
GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 | +15.6 |
Qwen-2.5-VL-72B | 60.22 | 69.33 | 61.00 | 50.33 | +9.7 |
Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 | +8.7 |
Aria | 50.78 | 65.67 | 46.67 | 40.00 | +3.2 |
Gemini 1.5 Flash | 49.78 | 57.33 | 49.00 | 43.00 | -3.3 |
LLaVA-Video-72B | 49.67 | 59.67 | 46.00 | 43.33 | +7.1 |
LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 | +6.6 |
Qwen-2.5-VL-7B | 47.44 | 58.33 | 44.33 | 39.67 | +2.2 |
InternVideo2.5-Chat-8B | 43.00 | 54.67 | 41.67 | 32.67 | +3.0 |
mPLUG-Owl3-7B | 42.00 | 49.33 | 38.67 | 38.00 | +7.5 |
MAmmoTH-VL-8B | 41.78 | 51.67 | 40.00 | 33.67 | +1.5 |
VideoChat-Flash-7B@448 | 41.67 | 51.67 | 40.67 | 32.67 | -1.3 |
InternVL2-8B | 37.44 | 47.33 | 33.33 | 31.67 | -8.5 |
LLaVA-Video-7B | 36.11 | 41.67 | 33.33 | 33.33 | -5.3 |
VILA1.5-40B | 34.00 | 38.67 | 30.67 | 32.67 | +9.4 |
LLaVA-OneVision-7B | 33.89 | 40.00 | 31.00 | 30.67 | -5.6 |
Llama-3.2-11B | 30.00 | 35.67 | 32.33 | 22.00 | - |
LongVA-7B | 23.98 | 24.00 | 24.33 | 23.67 | -7.0 |
VILA1.5-8B | 20.89 | 20.33 | 17.33 | 25.00 | +5.9 |
@article{hu2025videommmu,
title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
journal={arXiv preprint arXiv:2501.13826},
year={2025},
url={https://arxiv.org/abs/2501.13826}
}