FastVLM-0.5B Video Analysis and Captioning

This Colab notebook demonstrates how to use the Apple FastVLM-0.5B model from Hugging Face (apple/FastVLM-0.5B) to perform video analysis and generate captions for video frames.

The notebook covers the following steps:

Model Loading: Loading the FastVLM-0.5B model and its processor using the Hugging Face transformers library.
Image Captioning: Testing the model on sample images.
Video Processing: Reading a video file (here, /content/drive/MyDrive/VLMs/vlm_warehouse.mp4) and extracting frames.
Inference on Video Frames: Running the FastVLM model on selected video frames to generate descriptions.
Caption Overlay and Video Generation: Creating a new video file in which each original frame is shown with its generated caption, either overlaid on the frame or stacked below it. Captions are produced only on key frames, so the displayed caption changes when a new key frame is processed.
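
The key-frame logic in the last two steps can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names, the fixed `stride` between key frames, and the `caption_fn` callback (standing in for a FastVLM inference call) are all assumptions made for the example.

```python
from typing import Callable, List


def keyframe_indices(total_frames: int, stride: int) -> List[int]:
    """Return the indices of the frames that would be sent to the model.

    Assumption: key frames are taken every `stride` frames, starting at 0.
    """
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    return list(range(0, total_frames, stride))


def captions_for_all_frames(
    total_frames: int,
    stride: int,
    caption_fn: Callable[[int], str],
) -> List[str]:
    """Assign a caption to every frame while running inference only on key frames.

    `caption_fn` is a hypothetical stand-in for the model call; it receives a
    frame index and returns a caption string. Frames between two key frames
    reuse the caption of the most recent key frame.
    """
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    captions: List[str] = []
    current = ""
    for idx in range(total_frames):
        if idx % stride == 0:
            current = caption_fn(idx)  # the only place inference would run
        captions.append(current)
    return captions
```

With this layout, a 7-frame clip and a stride of 3 would trigger inference on frames 0, 3, and 6 only, and every other frame would carry the caption of the key frame before it.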

Usage

You can open this notebook directly in Google Colab by clicking the "Open in Colab" badge on the repository page.

To run the video analysis section, make sure you have a video file available in your Google Drive at the path specified in the notebook (currently set to /content/drive/MyDrive/VLMs/vlm_warehouse.mp4).
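
A small check like the one below can catch a missing file before any frames are read. It is a sketch only: `require_video` is a hypothetical helper, not something defined in the notebook, and it assumes Google Drive has already been mounted at /content/drive.

```python
from pathlib import Path

# Path used in the notebook; change it if your video lives elsewhere.
VIDEO_PATH = "/content/drive/MyDrive/VLMs/vlm_warehouse.mp4"


def require_video(path: str) -> Path:
    """Return `path` as a Path, or raise if no file exists there.

    Hypothetical helper for illustration. It assumes Drive is already
    mounted, so a missing file means either the mount or the path is wrong.
    """
    video = Path(path)
    if not video.is_file():
        raise FileNotFoundError(
            f"Video not found at {video}. "
            "Mount Google Drive and check the path set in the notebook."
        )
    return video
```

Calling `require_video(VIDEO_PATH)` at the top of the video analysis section would then fail early with a readable message instead of producing an empty frame list later.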

Model Details

Model ID: apple/FastVLM-0.5B
Model Type: Vision-Language Model
Library: Hugging Face transformers

Datasets Used

Conceptual Captions (used for initial model testing)
Custom video file (vlm_warehouse.mp4 from Google Drive)

Example Output

A stacked video in which each original frame is shown with its generated caption in a band at the bottom.
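
The geometry of such a stacked frame can be worked out before any drawing happens. The sketch below only computes the wrapped caption lines and the resulting output height; the function name and the default character width, line height, and padding are illustrative assumptions, not values taken from the notebook.

```python
import textwrap
from typing import List, Tuple


def layout_caption(
    caption: str,
    frame_width: int,
    frame_height: int,
    char_width: int = 12,   # assumed average glyph width in pixels
    line_height: int = 28,  # assumed height of one text line in pixels
    padding: int = 10,      # assumed margin around the text in pixels
) -> Tuple[List[str], int]:
    """Wrap a caption to the frame width and return (lines, output_height).

    The caption band sits below the original frame, so the output height is
    the frame height plus the height of the band.
    """
    max_chars = max(1, (frame_width - 2 * padding) // char_width)
    lines = textwrap.wrap(caption, width=max_chars) or [""]
    band_height = len(lines) * line_height + 2 * padding
    return lines, frame_height + band_height
```

Because a video file needs every frame to have the same size, in practice the band height would likely be fixed once (for example, sized for the longest expected caption) rather than recomputed per frame.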

Acknowledgements

The developers of the FastVLM-0.5B model.
The Hugging Face team for the transformers and huggingface_hub libraries.
Google Colab for providing the environment.

Feel free to explore and adapt this notebook for your own video analysis tasks!