AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation

🔍 Introduction

This repository contains the AIGVE-MACS model, a unified model for AI-Generated Video Evaluation (AIGVE), as presented in the paper AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation. AIGVE-MACS is a unified Vision-Language Model (VLM) for evaluating AI-generated videos. It produces both numerical scores (from 0 to 5) and natural language justifications across 9 human-aligned aspects of video quality:

Metric	Description
Technical Quality	Assesses the technical aspects of the video, including whether the resolution is sufficient for object recognition, whether the colors are natural, and whether there is an absence of noise or artifacts.
Dynamic	Measures the extent of pixel changes throughout the video, focusing on significant object or camera movements and changes in environmental factors such as daylight, weather, or seasons.
Consistency	Evaluates whether objects in the video maintain consistent properties, avoiding glitches, flickering, or unexpected changes.
Physics	Determines if the scene adheres to physical laws, ensuring that object behaviors and interactions are realistic and aligned with real-world physics.
Element Presence	Checks if all objects mentioned in the instructions are present in the video. The score is based on the proportion of objects that are correctly included.
Element Quality	Assesses the realism and fidelity of objects in the video, awarding higher scores for detailed, natural, and visually appealing appearances.
Action/Interaction Presence	Evaluates whether all actions and interactions described in the instructions are accurately represented in the video.
Action/Interaction Quality	Measures the naturalness and smoothness of actions and interactions, with higher scores for those that are realistic, lifelike, and seamlessly integrated into the scene.
Overall	Reflects the comprehensive quality of the video based on all metrics, allowing raters to incorporate their subjective preferences into the evaluation.

🚀 Quickstart

Installation

pip install transformers accelerate
pip install qwen-vl-utils

Example Usage

from transformers import AutoProcessor
from models import Qwen2_5_VLForConditionalGeneration
import torch
from qwen_vl_utils import process_vision_info

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xiaoliux/AIGVE-MACS", 
    torch_dtype=torch.bfloat16, 
    attn_implementation="flash_attention_2"
).to("cuda:0")

processor = AutoProcessor.from_pretrained("xiaoliux/AIGVE-MACS", use_fast=True)

# Compose input message
def get_user_message(video_frames, prompt):
    messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "video",
                        "resized_height": 480,
                        "resized_width": 854,
                        'fps': 1,
                        "video": video_frames,
                    },
                    {
                        "type": "text",
                        "text": "You are an expert in evaluating AI-Generated Videos, you evaluate videos in the following 9 aspects: "
                                "1. technical_quality: including whether the resolution is sufficient for object recognition, whether the colors are natural, and whether there is an absence of noise or artifacts. "
                                "2. dynamic: the extent of pixel changes throughout the video, focusing on significant object or camera movements and changes in environmental factors such as daylight, weather, or seasons. "
                                "3. consistency: whether objects in the video maintain consistent properties, avoiding glitches, flickering, or unexpected changes." 
                                "4. physics: Determines if the scene adheres to physical laws." 
                                "5. element_presentence: Checks if all objects mentioned in the instructions are present in the video. "
                                "6. element_quality: Assesses the realism and fidelity of objects in the video, awarding higher scores for detailed, natural, and visually appealing appearances. "
                                "7. action_presentence: Evaluates whether all actions and interactions described in the instructions are accurately represented in the video. "
                                "8. action_quality: Measures the naturalness and smoothness of actions and interactions, with higher scores for those that are realistic, lifelike, and seamlessly integrated into the scene." 
                                "9. overall: Reflects the comprehensive quality of the video based on all metrics. "
                                "The score can be chosen from [0, 5] with whole numbers. You should also include the comment for each score. "
                                "Please output as a JSON."
                                f"The video instruction is: {prompt}"
                    },
                ],
            },
        ]
    return messages

# Example inputs
video_frames = ["/path/to/frame1.png", "/path/to/frame2.png", ...]
prompt = "A tiger runs across a snowy field while snowflakes fall."

messages = get_user_message(video_frames, prompt)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
).to("cuda:0")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=1500)
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True
)[0]

print("Evaluation Result:\n", output_text)

📁 Output Format

{
  "technical_quality": {"score": 5, "comment": "..."},
  "dynamic": {"score": 4, "comment": "..."},
  "consistency": {"score": 5, "comment": "..."},
  ...
  "overall": {"score": 4, "comment": "..."}
}

📊 Main Results on AIGVE-BENCH 2

📈 Score Correlation (Spearman’s ρ ↑)

Method	TQ	Dy	CS	Phy	EP	EQ	AP	AQ	OR	AVG
GPT-4o	34.71	7.05	18.12	20.28	23.10	30.47	36.57	31.58	38.57	26.72
GPT-4.1	36.49	5.81	26.68	19.87	27.22	28.77	32.75	20.22	29.98	25.31
Qwen2.5-VL	8.77	4.00	1.24	-6.01	9.19	10.19	18.74	0.72	9.59	6.27
VideoLLaMA3	15.94	19.44	11.70	13.21	-3.13	12.27	13.61	-0.69	11.58	10.44
AIGVE-MACS	40.60	57.31	61.49	64.36	40.32	40.81	44.31	60.71	59.88	52.20

TQ: Technical Quality, Dy: Dynamics, CS: Consistency, Phy: Physics
EP/EQ: Element Presence/Quality, AP/AQ: Action Presence/Quality, OR: Overall

💬 Comment Generation Quality

Method	ROUGE-1 ↑	ROUGE-L ↑	BERTScore ↑	UniEval-Fact ↑	G-Eval ↑
GPT-4o	18.30	15.86	74.90	40.84	2.10
GPT-4.1	15.80	12.94	73.99	43.99	2.10
Qwen2.5-VL	17.95	15.31	74.31	42.32	2.37
VideoLLaMA3	19.99	17.67	75.35	40.21	2.18
AIGVE-MACS	49.50	38.00	85.87	57.04	3.42

📚 Citation

@article{liu2025aigvemacs,
  title={AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation},
  author={Xiao Liu and Jiawei Zhang},
  journal={arXiv preprint arXiv:2507.01255},
  year={2025}
}

🔗 Additional Resources

📄 Paper on arXiv
🤗 Hugging Face Model
🗃️ AIGVE-BENCH 2 Dataset: Coming soon

Downloads last month: 50

Safetensors

Model size

8B params

Tensor type

BF16

Inference Providers NEW

Video-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for xiaoliux/AIGVE-MACS

Base model

Qwen/Qwen2.5-VL-7B-Instruct

Finetuned

(792)

this model