---
license: apache-2.0
language:
- en
base_model:
- OpenGVLab/InternVideo2_distillation_models
pipeline_tag: video-classification
---
# InternVideo2-B14
### Vision-Language Model and <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> checkpoints
<details>
<summary>License: Apache-2.0</summary>
This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.
</details>
---
## Overview
**InternVideo2-B14** is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for various downstream tasks, including video classification and frame-skipping systems.
**Base Model**: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)
**Pipeline Tag**: `video-classification` (supports vision-language and video-only tasks)
---
## Model Details
### Included Checkpoints
| Filename | Size | Description |
|-------------------------|----------|-----------------------------------------------------------------------------|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using **FiLM** (Feature-wise Linear Modulation) and **Mamba** layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt` | 500 MB | <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt` | 599 MB | Lightweight **MobileCLIP** variant (BLT) for resource-constrained applications. |
#### <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> Self-Predictive Frame Skipping (SPFS)
The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos. Each checkpoint file includes:
- MobileCLIP vision/text encoders
- InternVideo2-B14 vision encoder weights
- Mamba temporal aggregator (merged from `mamba_mobileclip_ckpt.pt`)
- SPFS-specific weights for frame selection
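Because each `spfs_r64` checkpoint bundles several components into one file, a common pattern is to split the merged state dict back into its parts by filtering on parameter-name prefixes. The sketch below illustrates this with a toy dict standing in for `torch.load(...)`; the prefix names (`vision_encoder.`, `mamba.`, `spfs.`) are illustrative assumptions, not the actual key names used in these checkpoints:

```python
# Sketch: split a merged checkpoint's state dict by component prefix.
# NOTE: the prefixes used here are hypothetical; inspect the real
# checkpoint's keys (state_dict.keys()) to find the actual names.

def split_by_prefix(state_dict, prefixes):
    """Group parameters by the first matching prefix, stripping it."""
    components = {p: {} for p in prefixes}
    for name, value in state_dict.items():
        for prefix in prefixes:
            if name.startswith(prefix):
                components[prefix][name[len(prefix):]] = value
                break
    return components

# Toy state dict standing in for torch.load("spfs_r64/<checkpoint>.pt")
toy_ckpt = {
    "vision_encoder.blocks.0.weight": "...",
    "mamba.layers.0.A_log": "...",
    "spfs.predictor.weight": "...",
}
parts = split_by_prefix(toy_ckpt, ["vision_encoder.", "mamba.", "spfs."])
```

Each entry of `parts` can then be loaded into the corresponding sub-module on its own (e.g. only the Mamba temporal aggregator), which is useful when a downstream task does not need the full SPFS system.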
<style>
.streammamba-glow {
  color: #000; /* Black text; the glow supplies the blue */
text-shadow: 0 0 8px #03d3fc, 0 0 13px #03d3fc, 0 0 16px #03d3fc;
transition: text-shadow 0.3s ease-in-out;
position: relative;
z-index: 1;
}
.streammamba-glow:hover {
text-shadow: 0 0 12px #03d3fc, 0 0 24px #03d3fc, 0 0 36px #03d3fc;
}
.glow-ring {
position: absolute;
top: 50%;
left: 50%;
width: 8px;
height: 8px;
background: #03d3fc;
border-radius: 50%;
transform: translate(-50%, -50%);
box-shadow: 0 0 10px #03d3fc, 0 0 20px #03d3fc, 0 0 30px #03d3fc;
opacity: 0.6;
animation: pulse 2s infinite;
z-index: -1;
}
@keyframes pulse {
0%, 100% { transform: translate(-50%, -50%) scale(1); opacity: 0.6; }
50% { transform: translate(-50%, -50%) scale(1.5); opacity: 0.2; }
}
</style>