---
license: apache-2.0
language:
- en
base_model:
- OpenGVLab/InternVideo2_distillation_models
pipeline_tag: video-classification
---
# cminst/StreamMamba
### Vision-Language Model and <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> checkpoints
<details>
<summary>License: Apache-2.0</summary>
This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.
</details>
---
## Overview
**InternVideo2-B14** is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for downstream tasks, including video classification and adaptive frame skipping.
**Base Model**: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)
**Pipeline Tag**: `video-classification` (supports vision-language and video-only tasks)
---
## Model Details
### Included Checkpoints
| Filename | Size | Description |
|-------------------------|----------|-----------------------------------------------------------------------------|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using **FiLM** (Feature-wise Linear Modulation) and **Mamba** layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt` | 500 MB | <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt` | 599 MB | Lightweight **MobileCLIP** variant (BLT) for resource-constrained applications. |
| `lstm_ckpt.pt` | 530 MB | Contains InternVideo2-B14 and MobileCLIP weights along with a trained LSTM, used as an ablation baseline against Mamba. |
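
As a minimal sketch of how one of the files above might be fetched and inspected, the snippet below downloads a checkpoint with `huggingface_hub` and loads it on CPU with PyTorch. It assumes the repository id is `cminst/StreamMamba` (taken from the title of this card) and that the file deserializes to a dictionary. The internal key layout is not documented here, so the hypothetical helper `top_level_prefixes` only summarizes whatever keys it finds.

```python
from collections import Counter

REPO_ID = "cminst/StreamMamba"  # assumption: repo id taken from this card's title


def top_level_prefixes(keys):
    """Count keys by their first dotted component, e.g. 'vision.blocks.0' -> 'vision'."""
    return Counter(str(key).split(".", 1)[0] for key in keys)


def inspect_checkpoint(filename="internvideo2_clip.pt"):
    """Download one checkpoint, load it on CPU, and print a summary of its keys."""
    # Lazy imports: both packages are needed only when this function is called.
    import torch
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=filename)
    # map_location="cpu" avoids requiring a GPU just to inspect the file.
    checkpoint = torch.load(path, map_location="cpu")
    if isinstance(checkpoint, dict):
        for prefix, count in top_level_prefixes(checkpoint).most_common():
            print(f"{prefix}: {count} entries")
    else:
        print(f"Loaded object of type {type(checkpoint).__name__}")
    return checkpoint
```

Calling `inspect_checkpoint()` with no arguments fetches the smallest file in the table (`internvideo2_clip.pt`, 5.55 MB), which makes it a cheap first check before downloading the larger checkpoints.
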
#### <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> Self-Predictive Frame Skipping (SPFS)
The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos. Each checkpoint file includes:
- MobileCLIP vision/text encoders
- InternVideo2-B14 vision encoder weights
- Mamba temporal aggregator (merged from `mamba_mobileclip_ckpt.pt`)
- SPFS-specific weights for frame selection
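
The individual file names inside `spfs_r64` are not listed on this card. One way to discover them is sketched below, assuming the repository id `cminst/StreamMamba` and using `list_repo_files` from `huggingface_hub`. `files_in_folder` and `list_spfs_checkpoints` are hypothetical helpers introduced for illustration.

```python
def files_in_folder(paths, folder="spfs_r64"):
    """Return the repo-relative paths that sit inside `folder`, sorted."""
    prefix = folder.rstrip("/") + "/"
    return sorted(path for path in paths if path.startswith(prefix))


def list_spfs_checkpoints(repo_id="cminst/StreamMamba"):
    """Query the Hub for all files in the repo and keep those under spfs_r64/."""
    # Lazy import so files_in_folder can be used without huggingface_hub.
    from huggingface_hub import list_repo_files

    return files_in_folder(list_repo_files(repo_id))
```

Each returned path can then be passed as the `filename` argument of `hf_hub_download` to fetch that checkpoint.
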
<style>
.streammamba-glow {
color: #000; /* Black text with a blue glow */
text-shadow: 0 0 8px #03d3fc, 0 0 13px #03d3fc, 0 0 16px #03d3fc;
transition: text-shadow 0.3s ease-in-out;
position: relative;
z-index: 1;
}
.streammamba-glow:hover {
text-shadow: 0 0 12px #03d3fc, 0 0 24px #03d3fc, 0 0 36px #03d3fc;
}
.glow-ring {
content: '';
position: absolute;
top: 50%;
left: 50%;
width: 8px;
height: 8px;
background: #03d3fc;
border-radius: 50%;
transform: translate(-50%, -50%);
box-shadow: 0 0 10px #03d3fc, 0 0 20px #03d3fc, 0 0 30px #03d3fc;
opacity: 0.6;
animation: pulse 2s infinite;
z-index: -1;
}
@keyframes pulse {
0%, 100% { transform: translate(-50%, -50%) scale(1); opacity: 0.6; }
50% { transform: translate(-50%, -50%) scale(1.5); opacity: 0.2; }
}
</style>