---
library_name: transformers
license: mit
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- vidchapters
- video
- video-chaptering
---
# Chapter-Llama Models

This repository contains the model checkpoints used in the paper ["Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs"](https://arxiv.org/abs/2504.00072) (CVPR 2025).

## Models Overview

Chapter-Llama is based on Llama-3.1-8B-Instruct fine-tuned with LoRA adapters. We provide three model variants:

1. **asr-10k**: Model trained with ASR from 10k videos of the VidChapters-7M dataset
   - Used for our speech-based frame selector
   - Input: Only speech transcripts with timestamps

2. **captions_asr-10k**: Model trained with Captions+ASR from 10k videos
   - Our primary model used for most experiments
   - Input: Both speech transcripts and visual captions with timestamps

3. **captions_asr-1k**: Model trained with Captions+ASR from 1k videos
   - Smaller training set variant
   - Input: Both speech transcripts and visual captions with timestamps

## Model Performance

Our best model achieves an F1 score of 45.3 on the VidChapters-7M benchmark, substantially outperforming previous state-of-the-art methods.

## Usage

The models can be downloaded and used with the [Chapter-Llama codebase](https://github.com/lucas-ventura/chapter-llama):

```bash
# Download model LoRA adapters
python tools/download/models.py "asr-10k" --local_dir "."
python tools/download/models.py "captions_asr-10k" --local_dir "."
python tools/download/models.py "captions_asr-1k" --local_dir "."

# Inference on a single video
python inference.py /path/to/your/video.mp4
```
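Alternatively, once an adapter has been downloaded, it can be attached to the base model directly with `transformers` and `peft`. The sketch below is a minimal illustration, not the official inference path (the codebase's `inference.py` handles prompt construction and chaptering); the local adapter directory `./asr-10k` is an assumed path corresponding to the download command above.

```python
# Minimal sketch: load the base model and attach a downloaded LoRA adapter.
# Not the official inference path -- use inference.py from the codebase for full chaptering.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_dir = "./asr-10k"  # assumed local path created by the download script above

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Wrap the frozen base weights with the LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_dir)
model.eval()
```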

## Model Architecture

- Base model: Llama-3.1-8B-Instruct
- Adaptation: LoRA fine-tuning
- Input format: Text tokens representing ASR and/or frame captions with timestamps
- Output format: Timestamps for chapter boundaries and free-form chapter titles
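As a purely hypothetical illustration of post-processing (the exact output serialization is defined in the Chapter-Llama codebase), generated chapter lines of an assumed form `HH:MM:SS - Title` could be parsed into `(seconds, title)` pairs like this:

```python
# Hypothetical sketch: parse chapter lines of the assumed form "HH:MM:SS - Title".
# The actual output format is defined by the Chapter-Llama codebase.
import re

CHAPTER_LINE = re.compile(r"^(\d{1,2}):(\d{2}):(\d{2})\s*[-:]\s*(.+)$")

def parse_chapters(generated_text: str) -> list[tuple[int, str]]:
    chapters = []
    for line in generated_text.splitlines():
        match = CHAPTER_LINE.match(line.strip())
        if match:
            h, m, s, title = match.groups()
            chapters.append((int(h) * 3600 + int(m) * 60 + int(s), title.strip()))
    return chapters

print(parse_chapters("00:00:00 - Introduction\n00:05:42 - Main results"))
```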

## Citation

If you use these models in your work, please cite our paper:

```bibtex
@article{ventura25chapter,
    title     = {{Chapter-Llama}: Efficient Chaptering in Hour-Long Videos with {LLM}s},
    author    = {Lucas Ventura and Antoine Yang and Cordelia Schmid and G{\"u}l Varol},
    journal   = {CVPR},
    year      = {2025}
}
```

## Links

- [Paper](https://arxiv.org/abs/2504.00072)
- [Project Page](https://imagine.enpc.fr/~lucas.ventura/chapter-llama/)
- [GitHub Repository](https://github.com/lucas-ventura/chapter-llama)

## License

These models are distributed under the MIT License. Please check the [repository](https://github.com/lucas-ventura/chapter-llama) for more details.