cakelens-v5
Open-source AI-generated video detection model
Please see the blog post and the open-source Python library and CLI tool cakelens-v5 for more details.
Installation
Install the package with its dependencies:
pip install cakelens-v5
Command Line Interface
The package provides a command-line tool, cakelens, for easy video detection:
Basic Usage
# Using Hugging Face Hub (recommended)
cakelens video.mp4
# Using local model file
cakelens video.mp4 --model-path model.pt
Options
- --model-path: Path to the model checkpoint file (optional; loads from Hugging Face Hub if not provided)
- --batch-size: Batch size for inference (default: 1)
- --device: Device to run inference on (cpu, cuda, mps); auto-detected if not specified
- --verbose, -v: Enable verbose logging
- --output: Output file path for results (JSON format)
Examples
# Basic detection (uses Hugging Face Hub)
cakelens video.mp4
# Using local model file
cakelens video.mp4 --model-path model.pt
# With custom batch size and device
cakelens video.mp4 --batch-size 4 --device cuda
# Save results to JSON file
cakelens video.mp4 --output results.json
# Verbose output
cakelens video.mp4 --verbose
Output
The tool provides:
- Real-time prediction percentages for each label
- Final mean predictions across all frames
- Option to save results in JSON format (see the sketch after this list)
- Detailed logging (with the --verbose flag)
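As a rough sketch of consuming the saved results, the snippet below reads a results.json file written with --output. The field names used here (video_filepath, frame_count, predictions) are an assumption borrowed from the verdict attributes in the programmatic example below; inspect your own results.json to confirm the actual schema.
import json
# Load results saved via: cakelens video.mp4 --output results.json
# NOTE: these field names are assumptions mirroring the verdict attributes
# shown in the Programmatic Usage section; verify them against a real file.
with open("results.json") as f:
    results = json.load(f)
print(f"Video: {results.get('video_filepath')}")
print(f"Frame count: {results.get('frame_count')}")
for i, prob in enumerate(results.get("predictions", [])):
    print(f"  Label {i}: {prob * 100:.2f}%")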
Programmatic Usage
You can also use the detection functionality programmatically in your Python code:
Basic Detection
import pathlib
from cakelens.detect import Detector
from cakelens.model import Model
# Create the model and load its weights from Hugging Face Hub
model = Model()
model.load_from_huggingface_hub()
# or, if you have a local model file:
# import torch
# model.load_state_dict(torch.load("model.pt")["model_state_dict"])
# Create the detector
detector = Detector(
    model=model,
    batch_size=1,
    device="cpu",  # or "cuda", "mps", or None for auto-detection
)
# Run detection on a single video
video_path = pathlib.Path("video.mp4")
verdict = detector.detect(video_path)
# Access the results
print(f"Video: {verdict.video_filepath}")
print(f"Frame count: {verdict.frame_count}")
print("Predictions:")
for i, prob in enumerate(verdict.predictions):
    print(f"  Label {i}: {prob * 100:.2f}%")
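Building on the same API, here is a minimal sketch for scanning a whole directory of videos; the videos/ path is a placeholder, and the snippet assumes only the Detector and verdict behavior shown above.
import pathlib
from cakelens.detect import Detector
from cakelens.model import Model
model = Model()
model.load_from_huggingface_hub()
detector = Detector(model=model, batch_size=1, device=None)  # auto-detect device
# Scan every .mp4 file in a (placeholder) videos/ directory
for video_path in sorted(pathlib.Path("videos").glob("*.mp4")):
    verdict = detector.detect(video_path)
    print(f"{verdict.video_filepath}: {verdict.frame_count} frames")
    for i, prob in enumerate(verdict.predictions):
        print(f"  Label {i}: {prob * 100:.2f}%")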
Labels
The model can detect the following labels:
- AI_GEN: Is the video AI-generated or not?
- ANIME_2D: Is the video in 2D anime style?
- ANIME_3D: Is the video in 3D anime style?
- VIDEO_GAME: Does the video look like a video game?
- KLING: Is the video generated by Kling?
- HIGGSFIELD: Is the video generated by Higgsfield?
- WAN: Is the video generated by Wan?
- MIDJOURNEY: Is the video generated using images from Midjourney?
- HAILUO: Is the video generated by Hailuo?
- RAY: Is the video generated by Ray?
- VEO: Is the video generated by Veo?
- RUNWAY: Is the video generated by Runway?
- SORA: Is the video generated by Sora?
- CHATGPT: Is the video generated using images from ChatGPT?
- PIKA: Is the video generated by Pika?
- HUNYUAN: Is the video generated by Hunyuan?
- VIDU: Is the video generated by Vidu?
Note: The AI_GEN label is the most accurate as it has the most training data. Other labels have limited training data and may be less accurate.
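Continuing from the verdict object in the Basic Detection example, you can pair label names with probabilities as sketched below. The list order here is an assumption that verdict.predictions follows the label list above; verify it against the library's own label definitions.
# Assumed label order mirroring the list above; verify against the
# library's label definitions before relying on it.
LABELS = [
    "AI_GEN", "ANIME_2D", "ANIME_3D", "VIDEO_GAME", "KLING", "HIGGSFIELD",
    "WAN", "MIDJOURNEY", "HAILUO", "RAY", "VEO", "RUNWAY", "SORA",
    "CHATGPT", "PIKA", "HUNYUAN", "VIDU",
]
for name, prob in zip(LABELS, verdict.predictions):
    print(f"{name}: {prob * 100:.2f}%")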
Accuracy
The precision-recall (PR) curve of the model is shown below:
At a threshold of 0.5, the model has a precision of 0.77 and a recall of 0.74. The dataset contains 5,093 videos for training and 498 videos for validation. Please note that the model is not perfect and may make mistakes.
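For intuition: at this operating point, a precision of 0.77 means that roughly 77% of the videos the model flags as AI-generated actually are, while a recall of 0.74 means it catches about 74% of the AI-generated videos in the validation set. Combining the two, the F1 score is 2 × 0.77 × 0.74 / (0.77 + 0.74) ≈ 0.75.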