AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation

πŸ” Introduction

This repository contains the AIGVE-MACS model, a unified model for AI-Generated Video Evaluation (AIGVE), as presented in the paper AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation. AIGVE-MACS is a unified Vision-Language Model (VLM) for evaluating AI-generated videos. It produces both numerical scores (from 0 to 5) and natural language justifications across 9 human-aligned aspects of video quality:

Metric Description
Technical Quality Assesses the technical aspects of the video, including whether the resolution is sufficient for object recognition, whether the colors are natural, and whether there is an absence of noise or artifacts.
Dynamic Measures the extent of pixel changes throughout the video, focusing on significant object or camera movements and changes in environmental factors such as daylight, weather, or seasons.
Consistency Evaluates whether objects in the video maintain consistent properties, avoiding glitches, flickering, or unexpected changes.
Physics Determines if the scene adheres to physical laws, ensuring that object behaviors and interactions are realistic and aligned with real-world physics.
Element Presence Checks if all objects mentioned in the instructions are present in the video. The score is based on the proportion of objects that are correctly included.
Element Quality Assesses the realism and fidelity of objects in the video, awarding higher scores for detailed, natural, and visually appealing appearances.
Action/Interaction Presence Evaluates whether all actions and interactions described in the instructions are accurately represented in the video.
Action/Interaction Quality Measures the naturalness and smoothness of actions and interactions, with higher scores for those that are realistic, lifelike, and seamlessly integrated into the scene.
Overall Reflects the comprehensive quality of the video based on all metrics, allowing raters to incorporate their subjective preferences into the evaluation.

πŸš€ Quickstart

Installation

pip install transformers accelerate
pip install qwen-vl-utils

Example Usage

from transformers import AutoProcessor
from models import Qwen2_5_VLForConditionalGeneration
import torch
from qwen_vl_utils import process_vision_info

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xiaoliux/AIGVE-MACS", 
    torch_dtype=torch.bfloat16, 
    attn_implementation="flash_attention_2"
).to("cuda:0")

processor = AutoProcessor.from_pretrained("xiaoliux/AIGVE-MACS", use_fast=True)

# Compose input message
def get_user_message(video_frames, prompt):
    messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "video",
                        "resized_height": 480,
                        "resized_width": 854,
                        'fps': 1,
                        "video": video_frames,
                    },
                    {
                        "type": "text",
                        "text": "You are an expert in evaluating AI-Generated Videos, you evaluate videos in the following 9 aspects: "
                                "1. technical_quality: including whether the resolution is sufficient for object recognition, whether the colors are natural, and whether there is an absence of noise or artifacts. "
                                "2. dynamic: the extent of pixel changes throughout the video, focusing on significant object or camera movements and changes in environmental factors such as daylight, weather, or seasons. "
                                "3. consistency: whether objects in the video maintain consistent properties, avoiding glitches, flickering, or unexpected changes." 
                                "4. physics: Determines if the scene adheres to physical laws." 
                                "5. element_presentence: Checks if all objects mentioned in the instructions are present in the video. "
                                "6. element_quality: Assesses the realism and fidelity of objects in the video, awarding higher scores for detailed, natural, and visually appealing appearances. "
                                "7. action_presentence: Evaluates whether all actions and interactions described in the instructions are accurately represented in the video. "
                                "8. action_quality: Measures the naturalness and smoothness of actions and interactions, with higher scores for those that are realistic, lifelike, and seamlessly integrated into the scene." 
                                "9. overall: Reflects the comprehensive quality of the video based on all metrics. "
                                "The score can be chosen from [0, 5] with whole numbers. You should also include the comment for each score. "
                                "Please output as a JSON."
                                f"The video instruction is: {prompt}"
                    },
                ],
            },
        ]
    return messages

# Example inputs
video_frames = ["/path/to/frame1.png", "/path/to/frame2.png", ...]
prompt = "A tiger runs across a snowy field while snowflakes fall."

messages = get_user_message(video_frames, prompt)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
).to("cuda:0")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=1500)
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True
)[0]

print("Evaluation Result:\n", output_text)

πŸ“ Output Format

{
  "technical_quality": {"score": 5, "comment": "..."},
  "dynamic": {"score": 4, "comment": "..."},
  "consistency": {"score": 5, "comment": "..."},
  ...
  "overall": {"score": 4, "comment": "..."}
}

πŸ“Š Main Results on AIGVE-BENCH 2

πŸ“ˆ Score Correlation (Spearman’s ρ ↑)

Method TQ Dy CS Phy EP EQ AP AQ OR AVG
GPT-4o 34.71 7.05 18.12 20.28 23.10 30.47 36.57 31.58 38.57 26.72
GPT-4.1 36.49 5.81 26.68 19.87 27.22 28.77 32.75 20.22 29.98 25.31
Qwen2.5-VL 8.77 4.00 1.24 -6.01 9.19 10.19 18.74 0.72 9.59 6.27
VideoLLaMA3 15.94 19.44 11.70 13.21 -3.13 12.27 13.61 -0.69 11.58 10.44
AIGVE-MACS 40.60 57.31 61.49 64.36 40.32 40.81 44.31 60.71 59.88 52.20

TQ: Technical Quality, Dy: Dynamics, CS: Consistency, Phy: Physics
EP/EQ: Element Presence/Quality, AP/AQ: Action Presence/Quality, OR: Overall


πŸ’¬ Comment Generation Quality

Method ROUGE-1 ↑ ROUGE-L ↑ BERTScore ↑ UniEval-Fact ↑ G-Eval ↑
GPT-4o 18.30 15.86 74.90 40.84 2.10
GPT-4.1 15.80 12.94 73.99 43.99 2.10
Qwen2.5-VL 17.95 15.31 74.31 42.32 2.37
VideoLLaMA3 19.99 17.67 75.35 40.21 2.18
AIGVE-MACS 49.50 38.00 85.87 57.04 3.42

πŸ“š Citation

@article{liu2025aigvemacs,
  title={AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation},
  author={Xiao Liu and Jiawei Zhang},
  journal={arXiv preprint arXiv:2507.01255},
  year={2025}
}

πŸ”— Additional Resources


Downloads last month
14
Safetensors
Model size
8.29B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for xiaoliux/AIGVE-MACS

Finetuned
(438)
this model