	---
	license: apache-2.0
	language:
	- en
	base_model:
	- OpenGVLab/InternVideo2_distillation_models
	pipeline_tag: video-classification
	---

	# cminst/StreamMamba
	### Vision-Language Model and <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> checkpoints

	<details>
	<summary>License: Apache-2.0</summary>
	This model is licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 License</a>.
	</details>

	---

	## Overview
	InternVideo2-B14 is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for downstream tasks, including video classification and adaptive frame skipping.

	Base Model: [OpenGVLab/InternVideo2_distillation_models](https://github.com/OpenGVLab/InternVideo)

	Pipeline Tag: `video-classification` (supports vision-language and video-only tasks)

	---

	## Model Details

	### Included Checkpoints
	| Filename | Size | Description |
	|----------|------|-------------|
	| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using FiLM (Feature-wise Linear Modulation) and Mamba layers for temporal modeling. |
	| `mamba_mobileclip_ckpt.pt` | 500 MB | <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
	| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
	| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
	| `mobileclip_blt.pt` | 599 MB | Lightweight MobileCLIP variant (BLT) for resource-constrained applications. |
	| `lstm_ckpt.pt` | 530 MB | Contains InternVideo2-B14 weights and MobileCLIP weights, along with a trained LSTM (used for ablating against Mamba). |
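
For readers unfamiliar with FiLM, the idea can be sketched in a few lines: each feature channel is scaled by `gamma` and shifted by `beta`, where both vectors are normally predicted from a conditioning input such as a text embedding. This is a minimal illustration in plain Python, not the code behind `cross_mamba_film_warmup.pt`; the function name `film` and the toy values are assumptions, and a real model would apply this to tensors with learned layers producing `gamma` and `beta`.

```python
from typing import List


def film(features: List[List[float]],
         gamma: List[float],
         beta: List[float]) -> List[List[float]]:
    """Apply feature-wise linear modulation: y[t][c] = gamma[c] * x[t][c] + beta[c].

    `features` holds one feature vector per frame. In a real model, `gamma`
    and `beta` would be predicted from the conditioning signal (assumed here
    to be a text embedding); in this sketch they are passed in directly.
    """
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same length")
    modulated = []
    for frame in features:
        if len(frame) != len(gamma):
            raise ValueError("each frame must have one value per channel")
        modulated.append([g * x + b for x, g, b in zip(frame, gamma, beta)])
    return modulated


# Toy example: two frames with three channels each (hypothetical values).
frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
gamma = [1.0, 0.5, 0.0]   # per-channel scale
beta = [0.0, 1.0, 2.0]    # per-channel shift
modulated = film(frames, gamma, beta)
print(modulated)  # [[1.0, 2.0, 2.0], [4.0, 3.5, 2.0]]
```

The same `gamma` and `beta` are applied to every frame, so the conditioning signal changes how channels are weighted without changing the temporal structure of the sequence.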

	#### <span style="position: relative; cursor: help;"><span class="streammamba-glow">StreamMamba</span><span class="glow-ring"></span></span> Self-Predictive Frame Skipping (SPFS)
	The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos. Each checkpoint file includes:
	- MobileCLIP vision/text encoders
	- InternVideo2-B14 vision encoder weights
	- Mamba temporal aggregator (merged from `mamba_mobileclip_ckpt.pt`)
	- SPFS-specific weights for frame selection
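
Because each SPFS checkpoint bundles several components, it can help to check which ones a file actually contains before loading it into a model. The sketch below groups parameter names by their leading prefix. It assumes the checkpoint is a plain dictionary keyed by dotted parameter names, which this README does not confirm; the helper names and the demo key names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_by_prefix(keys: Iterable[str], depth: int = 1) -> Dict[str, int]:
    """Count parameter names by their first `depth` dot-separated parts."""
    counts = Counter(".".join(key.split(".")[:depth]) for key in keys)
    return dict(counts)


def load_checkpoint_keys(path: str) -> List[str]:
    """Read the top-level keys of a checkpoint file.

    Assumption: the file is a dict saved with torch.save. Some checkpoints
    nest the weights under another key, in which case the caller would need
    to descend one level before grouping.
    """
    import torch  # imported lazily so the grouping helper works without torch

    checkpoint = torch.load(path, map_location="cpu")
    if not isinstance(checkpoint, dict):
        raise TypeError("expected the checkpoint to be a dict")
    return list(checkpoint.keys())


# Hypothetical key names; the real names in spfs_r64 may differ.
demo_keys = [
    "vision_encoder.patch_embed.weight",
    "vision_encoder.blocks.0.attn.weight",
    "text_encoder.embed.weight",
    "mamba.layers.0.in_proj.weight",
    "spfs_head.weight",
]
summary = count_by_prefix(demo_keys)
print(summary)
```

With a downloaded file, `count_by_prefix(load_checkpoint_keys(path))` would give the same kind of summary for the real checkpoint, provided the assumption about its layout holds.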

	<style>
	.streammamba-glow {
	color: #000; /* Black text; the blue comes from the text-shadow glow */
	text-shadow: 0 0 8px #03d3fc, 0 0 13px #03d3fc, 0 0 16px #03d3fc;
	transition: text-shadow 0.3s ease-in-out;
	position: relative;
	z-index: 1;
	}

	.streammamba-glow:hover {
	text-shadow: 0 0 12px #03d3fc, 0 0 24px #03d3fc, 0 0 36px #03d3fc;
	}

	.glow-ring {
	content: '';
	position: absolute;
	top: 50%;
	left: 50%;
	width: 8px;
	height: 8px;
	background: #03d3fc;
	border-radius: 50%;
	transform: translate(-50%, -50%);
	box-shadow: 0 0 10px #03d3fc, 0 0 20px #03d3fc, 0 0 30px #03d3fc;
	opacity: 0.6;
	animation: pulse 2s infinite;
	z-index: -1;
	}

	@keyframes pulse {
	0%, 100% { transform: translate(-50%, -50%) scale(1); opacity: 0.6; }
	50% { transform: translate(-50%, -50%) scale(1.5); opacity: 0.2; }
	}
	</style>