A competitive, human-aligned detailed video captioning model based on VILA-v1.5-13B.
It produces detailed captions for input videos and is described in the paper "Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption".
For more details, please refer to our project page: https://sais-fuxi.github.io/projects/cockatiel