---
library_name: transformers
license: mit
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- vidchapters
- video
- video-chaptering
---

# Chapter-Llama Models

This repository contains the model checkpoints used in the paper ["Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs"](https://arxiv.org/abs/2504.00072) (CVPR 2025).

## Models Overview

Chapter-Llama is based on Llama-3.1-8B-Instruct fine-tuned with LoRA adapters. We provide three main model variants:

1. **asr-10k**: Model trained with ASR from 10k videos of the VidChapters-7M dataset
   - Used for our speech-based frame selector
   - Input: only speech transcripts with timestamps
2. **captions_asr-10k**: Model trained with captions + ASR from 10k videos
   - Our primary model, used for most experiments
   - Input: both speech transcripts and visual captions with timestamps
3. **captions_asr-1k**: Model trained with captions + ASR from 1k videos
   - Smaller training-set variant
   - Input: both speech transcripts and visual captions with timestamps

## Model Performance

Our best model achieves an F1 score of 45.3 on the VidChapters-7M benchmark, substantially outperforming previous state-of-the-art methods.

## Usage

The models can be downloaded and used with the [Chapter-Llama codebase](https://github.com/lucas-ventura/chapter-llama). A standalone loading sketch with Transformers and PEFT is also included at the end of this card.

```bash
# Download model LoRA adapters
python tools/download/models.py "asr-10k" --local_dir "."
python tools/download/models.py "captions_asr-10k" --local_dir "."
python tools/download/models.py "captions_asr-1k" --local_dir "."

# Inference on a single video
python inference.py /path/to/your/video.mp4
```

## Model Architecture

- Base model: Llama-3.1-8B-Instruct
- Adaptation: LoRA fine-tuning
- Input format: Text tokens representing ASR and/or frame captions with timestamps
- Output format: Timestamps for chapter boundaries and free-form chapter titles

## Citation

If you use these models in your work, please cite our paper:

```bibtex
@article{ventura25chapter,
  title = {{Chapter-Llama}: Efficient Chaptering in Hour-Long Videos with {LLM}s},
  author = {Lucas Ventura and Antoine Yang and Cordelia Schmid and G{\"u}l Varol},
  journal = {CVPR},
  year = {2025}
}
```

## Links

- [Paper](https://arxiv.org/abs/2504.00072)
- [Project Page](https://imagine.enpc.fr/~lucas.ventura/chapter-llama/)
- [GitHub Repository](https://github.com/lucas-ventura/chapter-llama)

## License

These models are distributed under the MIT License. Please check the [repository](https://github.com/lucas-ventura/chapter-llama) for more details.
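
## Loading the Adapters with Transformers + PEFT (Sketch)

For readers who want to experiment outside the codebase, below is a minimal, untested sketch of attaching one of the LoRA adapters to the base model with Hugging Face Transformers and PEFT. The local adapter path and the example prompt format are illustrative assumptions, not the actual prompt construction used by Chapter-Llama; the supported inference path remains `inference.py` in the codebase.

```python
# Minimal sketch, assuming the LoRA adapters have been downloaded locally with
# tools/download/models.py. Paths and prompt formatting below are assumptions
# for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_dir = "./captions_asr-10k"  # assumed local path after download

# Load the frozen base model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

# Attach the Chapter-Llama LoRA weights on top of the base model.
model = PeftModel.from_pretrained(base_model, adapter_dir)
model.eval()

# Illustrative input: timestamped ASR (and optionally caption) text.
# The real prompt template is defined by the Chapter-Llama codebase.
prompt = "00:00:05: Welcome to the video.\n00:01:20: First topic starts here.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The model generates chapter boundary timestamps with free-form titles as text.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```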