---
license: apache-2.0
language:
- en
base_model:
- OpenGVLab/InternVideo2_distillation_models
pipeline_tag: video-classification
---

# cminst/StreamMamba
### Vision-Language Model and <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> checkpoints

<details>
<summary>License: Apache-2.0</summary>
This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.
</details>

---

## Overview
**InternVideo2-B14** is a pre-trained vision-language model from the InternVideo2 family, designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints built around it for downstream tasks, including video classification and adaptive frame skipping.

**Base Model**: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)

**Pipeline Tag**: `video-classification` (supports vision-language and video-only tasks)
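
All checkpoints are stored as individual files in this repository, so they can be fetched one at a time with `huggingface_hub`. The sketch below is illustrative only: it downloads files to the local cache and leaves wiring them into the InternVideo2 / StreamMamba code out of scope.

```python
# Minimal sketch: fetch individual checkpoint files from this repository.
# Only the download step is shown; building the models from these weights
# requires the corresponding InternVideo2 / StreamMamba code (not shown here).
from huggingface_hub import hf_hub_download

vision_ckpt = hf_hub_download(
    repo_id="cminst/StreamMamba",
    filename="internvideo2_vision.pt",
)
clip_ckpt = hf_hub_download(
    repo_id="cminst/StreamMamba",
    filename="internvideo2_clip.pt",
)
print(vision_ckpt)  # local cache path of the downloaded file
```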

---

## Model Details

### Included Checkpoints
| Filename                | Size     | Description                                                                 |
|-------------------------|----------|-----------------------------------------------------------------------------|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using **FiLM** (Feature-wise Linear Modulation) and **Mamba** layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt`   | 500 MB | <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt`       | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt`     | 205 MB  | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt`          | 599 MB  | Lightweight **MobileCLIP** variant (BLT) for resource-constrained applications. |
| `lstm_ckpt.pt`               | 530 MB  | Contains InternVideo2-B14 and MobileCLIP weights, along with a trained LSTM used as an ablation baseline against the Mamba aggregator. |
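
Since these are plain PyTorch checkpoint files, a quick way to see what a given file from the table above holds is to download it and inspect it on CPU. The snippet below is a minimal, hedged sketch: it assumes the `.pt` files were written with `torch.save` and makes no assumption about their exact key layout.

```python
# Illustrative sketch: peek inside a checkpoint without building any model.
# Assumes the .pt file was written with torch.save; depending on your torch
# version you may need to pass weights_only=False to torch.load.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="cminst/StreamMamba", filename="internvideo2_clip.pt")
ckpt = torch.load(path, map_location="cpu")

if isinstance(ckpt, dict):
    # Print the first few entries: tensor shapes, or the type of nested objects.
    for key, value in list(ckpt.items())[:10]:
        desc = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
        print(key, desc)
else:
    print(type(ckpt))
```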

#### <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> Self-Predictive Frame Skipping (SPFS)
The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos; a short inspection sketch follows the component list below. Each checkpoint file includes:
- MobileCLIP vision/text encoders
- InternVideo2-B14 vision encoder weights
- Mamba temporal aggregator (merged from `mamba_mobileclip_ckpt.pt`)
- SPFS-specific weights for frame selection
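
One way to confirm which sub-modules a given SPFS checkpoint bundles is to load it and group its keys by their top-level prefix. The sketch below is purely illustrative: it assumes the files under `spfs_r64` are standard `torch.save` dictionaries, lists the folder rather than hard-coding a file name, and makes no assumption about the actual key names.

```python
# Illustrative sketch: group a bundled SPFS checkpoint's parameters by the
# first component of each key, to see which merged sub-modules (MobileCLIP
# encoders, InternVideo2-B14 vision backbone, Mamba aggregator, SPFS head)
# it contains. Key prefixes are not assumed in advance.
from collections import Counter

import torch
from huggingface_hub import hf_hub_download, list_repo_files

spfs_files = [
    f for f in list_repo_files("cminst/StreamMamba")
    if f.startswith("spfs_r64/") and f.endswith(".pt")
]
path = hf_hub_download(repo_id="cminst/StreamMamba", filename=spfs_files[0])

ckpt = torch.load(path, map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

prefixes = Counter(key.split(".")[0] for key in state)
for prefix, count in prefixes.most_common():
    print(f"{prefix}: {count} entries")
```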

<style>
.streammamba-glow {
  color: #000; /* black text; the glow comes from the cyan text-shadow */
  text-shadow: 0 0 8px #03d3fc, 0 0 13px #03d3fc, 0 0 16px #03d3fc;
  transition: text-shadow 0.3s ease-in-out;
  position: relative;
  z-index: 1;
}

.streammamba-glow:hover {
  text-shadow: 0 0 12px #03d3fc, 0 0 24px #03d3fc, 0 0 36px #03d3fc;
}

.glow-ring {
  position: absolute;
  top: 50%;
  left: 50%;
  width: 8px;
  height: 8px;
  background: #03d3fc;
  border-radius: 50%;
  transform: translate(-50%, -50%);
  box-shadow: 0 0 10px #03d3fc, 0 0 20px #03d3fc, 0 0 30px #03d3fc;
  opacity: 0.6;
  animation: pulse 2s infinite;
  z-index: -1;
}

@keyframes pulse {
  0%, 100% { transform: translate(-50%, -50%) scale(1); opacity: 0.6; }
  50% { transform: translate(-50%, -50%) scale(1.5); opacity: 0.2; }
}
</style>