---
license: apache-2.0
language:
- en
base_model:
- OpenGVLab/InternVideo2_distillation_models
pipeline_tag: video-classification
---

# cminst/StreamMamba

### Vision-Language Model and <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> checkpoints

<details>
<summary>License: Apache-2.0</summary>

This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.

</details>

---

## Overview

**InternVideo2-B14** is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for downstream tasks, including video classification and adaptive frame skipping.

**Base Model**: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)

**Pipeline Tag**: `video-classification` (supports vision-language and video-only tasks)

---

## Model Details

### Included Checkpoints

| Filename | Size | Description |
|----------|------|-------------|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using **FiLM** (Feature-wise Linear Modulation) and **Mamba** layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt` | 500 MB | <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt` | 599 MB | Lightweight **MobileCLIP** variant (BLT) for resource-constrained applications. |
| `lstm_ckpt.pt` | 530 MB | Contains InternVideo2-B14 weights and MobileCLIP weights, along with a trained LSTM (used for ablating against Mamba). |
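
The checkpoints above are plain `.pt` files, so one way to fetch and inspect them is with `huggingface_hub` and PyTorch. The following is a minimal sketch, not an official loading script. It assumes that `huggingface_hub` and `torch` are installed, that the files sit at the repository root under the names in the table, and that each file unpickles to a dictionary-like object. The helper names (`download_checkpoint`, `summarize_entries`, `inspect_checkpoint`) are hypothetical, and the internal key layout of each checkpoint is not documented here.

```python
from pathlib import Path

REPO_ID = "cminst/StreamMamba"

# Filenames copied from the table above; sizes are approximate, in MB.
CHECKPOINT_SIZES_MB = {
    "cross_mamba_film_warmup.pt": 504,
    "mamba_mobileclip_ckpt.pt": 500,
    "internvideo2_clip.pt": 5.55,
    "internvideo2_vision.pt": 205,
    "mobileclip_blt.pt": 599,
    "lstm_ckpt.pt": 530,
}


def download_checkpoint(filename: str, repo_id: str = REPO_ID) -> Path:
    """Download one checkpoint from the Hub and return its local cache path."""
    if filename not in CHECKPOINT_SIZES_MB:
        raise KeyError(f"unknown checkpoint: {filename}")
    # Imported lazily so the pure helpers work without this dependency.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename=filename))


def summarize_entries(checkpoint: dict, limit: int = 10) -> list:
    """Describe the first `limit` top-level entries as (key, description) pairs."""
    summary = []
    for key in list(checkpoint)[:limit]:
        value = checkpoint[key]
        shape = getattr(value, "shape", None)
        # Tensors report their shape; anything else reports its type name.
        description = str(tuple(shape)) if shape is not None else type(value).__name__
        summary.append((key, description))
    return summary


def inspect_checkpoint(filename: str) -> list:
    """Download a checkpoint, load it on CPU, and summarize its top-level keys."""
    import torch

    path = download_checkpoint(filename)
    # Assumption: the file is a pickled dict. Recent PyTorch versions default to
    # weights_only=True, which can reject checkpoints that store extra objects.
    state = torch.load(path, map_location="cpu")
    return summarize_entries(state)
```

Starting with `inspect_checkpoint("internvideo2_clip.pt")` keeps the first download small, since that is the smallest file listed. The printed keys are the only reliable guide to how a given checkpoint should be mapped onto a model definition.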

#### <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> Self-Predictive Frame Skipping (SPFS)

The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos. Each checkpoint file includes:

- MobileCLIP vision/text encoders
- InternVideo2-B14 vision encoder weights
- Mamba temporal aggregator (merged from `mamba_mobileclip_ckpt.pt`)
- SPFS-specific weights for frame selection
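
The individual filenames inside `spfs_r64` are not listed here, so a short script can enumerate them before anything is downloaded. This is a sketch that assumes `huggingface_hub` is installed and that the folder sits at the repository root; `files_in_folder` and `list_spfs_checkpoints` are hypothetical helper names.

```python
from typing import Iterable, List

REPO_ID = "cminst/StreamMamba"
SPFS_FOLDER = "spfs_r64"


def files_in_folder(paths: Iterable[str], folder: str) -> List[str]:
    """Keep only repo paths that live inside `folder`, sorted for stable output."""
    # The trailing slash stops similarly named folders from matching.
    prefix = folder.rstrip("/") + "/"
    return sorted(p for p in paths if p.startswith(prefix))


def list_spfs_checkpoints(repo_id: str = REPO_ID) -> List[str]:
    """Query the Hub for every file stored under the SPFS folder."""
    # Imported lazily so files_in_folder works without this dependency.
    from huggingface_hub import list_repo_files

    return files_in_folder(list_repo_files(repo_id), SPFS_FOLDER)
```

Each returned path can be passed as the `filename` argument of `hf_hub_download` to fetch that file.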

<style>
.streammamba-glow {
  color: #000; /* Black text with a cyan glow */
  text-shadow: 0 0 8px #03d3fc, 0 0 13px #03d3fc, 0 0 16px #03d3fc;
  transition: text-shadow 0.3s ease-in-out;
  position: relative;
  z-index: 1;
}

.streammamba-glow:hover {
  text-shadow: 0 0 12px #03d3fc, 0 0 24px #03d3fc, 0 0 36px #03d3fc;
}

.glow-ring {
  content: '';
  position: absolute;
  top: 50%;
  left: 50%;
  width: 8px;
  height: 8px;
  background: #03d3fc;
  border-radius: 50%;
  transform: translate(-50%, -50%);
  box-shadow: 0 0 10px #03d3fc, 0 0 20px #03d3fc, 0 0 30px #03d3fc;
  opacity: 0.6;
  animation: pulse 2s infinite;
  z-index: -1;
}

@keyframes pulse {
  0%, 100% { transform: translate(-50%, -50%) scale(1); opacity: 0.6; }
  50% { transform: translate(-50%, -50%) scale(1.5); opacity: 0.2; }
}
</style>