---
license: apache-2.0
language:
- en
base_model:
- OpenGVLab/InternVideo2_distillation_models
pipeline_tag: video-classification
---

# InternVideo2-B14
### Vision-Language Model and <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> checkpoints

<details>
<summary>License: Apache-2.0</summary>
This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.
</details>

---

## Overview
**InternVideo2-B14** is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for a range of downstream tasks, including video classification and adaptive frame skipping.

**Base Model**: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)

**Pipeline Tag**: `video-classification` (supports vision-language and video-only tasks)
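
The checkpoints are plain PyTorch files. A minimal loading sketch follows, assuming standard `torch.load`-compatible state dicts; the model classes that consume these weights live in the InternVideo repository and are not shown here:

```python
# Minimal loading sketch: the checkpoints are assumed to be standard
# PyTorch files. Inspect the contents before wiring them into a model.
import torch

# Load onto CPU first to avoid GPU memory surprises with large files.
state = torch.load("internvideo2_vision.pt", map_location="cpu")

# Checkpoints may be raw state dicts or wrapper dicts with extra metadata.
if isinstance(state, dict):
    print(list(state.keys())[:10])
```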

---

## Model Details

### Included Checkpoints
| Filename                | Size     | Description                                                                 |
|-------------------------|----------|-----------------------------------------------------------------------------|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using **FiLM** (Feature-wise Linear Modulation, sketched below the table) and **Mamba** layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt`   | 500 MB | <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt`       | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt`     | 205 MB  | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt`          | 599 MB  | Lightweight **MobileCLIP** variant (BLT) for resource-constrained applications. |
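
FiLM conditions visual features on text by predicting a per-channel scale and shift from the text embedding. The sketch below shows the general technique only; the module names and dimensions are illustrative assumptions, not the actual architecture inside `cross_mamba_film_warmup.pt`:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: y = gamma(text) * x + beta(text)."""

    def __init__(self, text_dim: int, feat_dim: int):
        super().__init__()
        # One projection predicts both the scale (gamma) and shift (beta).
        self.to_gamma_beta = nn.Linear(text_dim, 2 * feat_dim)

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x:    (batch, frames, feat_dim) per-frame visual features
        # text: (batch, text_dim) pooled text embedding
        gamma, beta = self.to_gamma_beta(text).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)

# Illustrative dimensions only.
film = FiLM(text_dim=512, feat_dim=768)
out = film(torch.randn(2, 8, 768), torch.randn(2, 512))  # -> (2, 8, 768)
```

Per the table above, this style of text-conditioned modulation is paired with Mamba layers so the text query can steer the temporal aggregation of frame features.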

#### <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> Self-Predictive Frame Skipping (SPFS)
The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos (a hypothetical sketch of the skipping loop follows the list below). Each checkpoint file includes:
- MobileCLIP vision/text encoders
- InternVideo2-B14 vision encoder weights
- Mamba temporal aggregator (merged from `mamba_mobileclip_ckpt.pt`)
- SPFS-specific weights for frame selection
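
The exact frame-selection mechanism is not documented in this card, so the following is a hypothetical sketch of a self-predictive skipping loop: a lightweight head predicts the next frame's embedding from the current one, and the heavy vision encoder only runs when that prediction is not confident. All function names (`encode`, `predict_next`, `confidence`) are placeholders, not the checkpoint's API.

```python
import torch

@torch.no_grad()
def spfs_stream(frames, encode, predict_next, confidence, threshold=0.9):
    """Hypothetical self-predictive frame-skipping loop (placeholder API).

    encode:       frame -> embedding (the heavy vision encoder)
    predict_next: embedding -> predicted embedding of the next frame
    confidence:   predicted embedding -> scalar in [0, 1]
    """
    emb = encode(frames[0])           # the first frame is always encoded
    embeddings = [emb]
    for frame in frames[1:]:
        pred = predict_next(emb)
        if confidence(pred) >= threshold:
            emb = pred                # skip: reuse the predicted embedding
        else:
            emb = encode(frame)       # prediction uncertain: encode for real
        embeddings.append(emb)
    return torch.stack(embeddings)    # (num_frames, embed_dim)
```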

<style>
.streammamba-glow {
  color: #000; /* black text; the blue comes from the glow shadow */
  text-shadow: 0 0 8px #03d3fc, 0 0 13px #03d3fc, 0 0 16px #03d3fc;
  transition: text-shadow 0.3s ease-in-out;
  position: relative;
  z-index: 1;
}

.streammamba-glow:hover {
  text-shadow: 0 0 12px #03d3fc, 0 0 24px #03d3fc, 0 0 36px #03d3fc;
}

.glow-ring {
  position: absolute;
  top: 50%;
  left: 50%;
  width: 8px;
  height: 8px;
  background: #03d3fc;
  border-radius: 50%;
  transform: translate(-50%, -50%);
  box-shadow: 0 0 10px #03d3fc, 0 0 20px #03d3fc, 0 0 30px #03d3fc;
  opacity: 0.6;
  animation: pulse 2s infinite;
  z-index: -1;
}

@keyframes pulse {
  0%, 100% { transform: translate(-50%, -50%) scale(1); opacity: 0.6; }
  50% { transform: translate(-50%, -50%) scale(1.5); opacity: 0.2; }
}
</style>