Marlin 2B — MLX 8-bit (Apple Silicon)
MLX 8-bit quantized version of NemoStation/Marlin-2B for fast inference on Apple Silicon Macs.
Key specs
| | Original (PyTorch) | This model (MLX hybrid) |
|---|---|---|
| Precision | bfloat16 | 8-bit (9.625 avg bits/weight) |
| Size | 5.1 GB | 2.5 GB |
| Peak memory | ~7.5 GB | ~5 GB |
| Caption speed | ~130s/video | ~28s/video |
| Timestamps | ✅ Correct | ✅ Correct |
| Speedup | — | 4.6x |
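The ratios in the table can be re-derived from its own numbers. The short check below is only arithmetic on the figures quoted in this card (the ~22s pure-MLX time comes from the Performance section); it shows that the 4.6x speedup corresponds to the hybrid mode, not the pure MLX mode.

```python
# Figures copied from the tables in this card (all approximate).
pytorch_seconds = 130.0   # PyTorch MPS caption time per video
hybrid_seconds = 28.0     # hybrid (HF inputs + MLX generation)
pure_mlx_seconds = 22.0   # pure MLX, from the Performance table
pytorch_size_gb, mlx_size_gb = 5.1, 2.5

speedup = pytorch_seconds / hybrid_seconds
pure_speedup = pytorch_seconds / pure_mlx_seconds
size_ratio = pytorch_size_gb / mlx_size_gb

print(f"hybrid speedup:   {speedup:.1f}x")       # 4.6x
print(f"pure MLX speedup: {pure_speedup:.1f}x")  # 5.9x
print(f"size reduction:   {size_ratio:.1f}x")    # 2.0x
```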
What is Marlin?
Marlin is a 2B video VLM for dense captioning and temporal grounding:
- Caption mode: `Scene: <paragraph>` + `Events: <start-end> <description>`
- Find mode: Given a query, returns `From X.X to Y.Y.`
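Both output formats are regular enough to parse with a couple of regular expressions. The sketch below is a minimal example, not part of the model or of mlx-vlm. It assumes one event per line, times in seconds, and optional spaces around the hyphen. The helper names `parse_caption` and `parse_find` and the sample text are made up for illustration.

```python
import re
from typing import List, Optional, Tuple

# Assumed event shape: "<start - end> description", one event per line.
EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*(.*)")
# Assumed find-mode shape: "From X.X to Y.Y."
FIND_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)")


def parse_caption(text: str) -> Tuple[str, List[Tuple[float, float, str]]]:
    """Split caption-mode output into a scene paragraph and timed events."""
    scene_part, _, events_part = text.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = [
        (float(m.group(1)), float(m.group(2)), m.group(3).strip())
        for m in EVENT_RE.finditer(events_part)
    ]
    return scene, events


def parse_find(text: str) -> Optional[Tuple[float, float]]:
    """Extract (start, end) from find-mode output, or None if absent."""
    m = FIND_RE.search(text)
    return (float(m.group(1)), float(m.group(2))) if m else None


# Hypothetical model output, used only to exercise the parsers.
sample = (
    "Scene: A kitchen with a wooden table.\n"
    "Events:\n"
    "<0.0 - 3.5> A person enters.\n"
    "<3.5-8.0> They pour coffee."
)
scene, events = parse_caption(sample)
span = parse_find("From 2.5 to 6.0.")
print(scene)
print(events)
print(span)
```

If the model emits events in a different layout (for example several per line), the event pattern would need adjusting.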
Recommended usage: Hybrid approach
The hybrid approach uses HF transformers for input preparation (correct M-RoPE temporal positions) and MLX for fast generation. This gives correct timestamps + MLX speed.
Install
pip install mlx-vlm "transformers>=5.7.0" torch torchcodec "qwen-vl-utils>=0.0.14"
Code
import os
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchcodec"
os.environ["VIDEO_MAX_PIXELS"] = "200704"
os.environ["FPS"] = "2.0"
os.environ["FPS_MAX_FRAMES"] = "240"
os.environ["FPS_MIN_FRAMES"] = "4"
import mlx.core as mx
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from mlx_vlm import load as mlx_load
from mlx_vlm.models.cache import make_prompt_cache
HF_MODEL = "NemoStation/Marlin-2B"
MLX_MODEL = "junwatu/Marlin-2B-MLX-8bit"
CAPTION_PROMPT = (
    "Provide a spatial description of this clip followed by time-ranged events.\n"
    "For each event, give the time range as <start - end> and a short description."
)
# 1. Prepare inputs with HF (correct position encoding)
hf_processor = AutoProcessor.from_pretrained(HF_MODEL, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL, trust_remote_code=True, dtype=torch.float32, low_cpu_mem_usage=True
)
messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": CAPTION_PROMPT},
]}]
inputs = hf_processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
)
# Compute the M-RoPE position ids with the reference implementation
with torch.no_grad():
    position_ids, _ = hf_model.model.get_rope_index(
        input_ids=inputs["input_ids"],
        mm_token_type_ids=inputs["mm_token_type_ids"],
        video_grid_thw=inputs.get("video_grid_thw"),
        attention_mask=inputs.get("attention_mask"),
    )
del hf_model # free memory
# 2. Generate with MLX (fast)
mlx_model, mlx_processor = mlx_load(MLX_MODEL)
input_ids = mx.array(inputs["input_ids"].numpy())
pixel_values = mx.array(inputs["pixel_values_videos"].numpy())
video_grid_thw = mx.array(inputs["video_grid_thw"].numpy())
embedding_output = mlx_model.get_input_embeddings(
    input_ids, pixel_values,
    mask=mx.array(inputs["attention_mask"].numpy()),
    video_grid_thw=video_grid_thw,
)
# Override the MLX position ids with the ones computed by HF
mlx_model.language_model._position_ids = mx.array(position_ids.numpy())
prompt_cache = make_prompt_cache(mlx_model.language_model)
# Prefill: run the full prompt once to populate the cache
outputs = mlx_model.language_model(
    input_ids, inputs_embeds=embedding_output.inputs_embeds, cache=prompt_cache
)
mx.eval([c.state for c in prompt_cache])
# Greedy decode
eos = mlx_model.config.eos_token_id
y = mx.argmax(outputs.logits[:, -1, :], axis=-1, keepdims=True)
tokens = []
for _ in range(384):  # at most 384 new tokens
    t = y.item()
    if t == eos:
        break
    tokens.append(t)
    outputs = mlx_model.language_model(y, cache=prompt_cache)
    mx.eval([c.state for c in prompt_cache])
    y = mx.argmax(outputs.logits[:, -1, :], axis=-1, keepdims=True)
text = mlx_processor.tokenizer.decode(tokens, skip_special_tokens=True)
print(text)
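The environment variables at the top of the script bound how many frames are sampled from a clip. The sketch below only illustrates the arithmetic, assuming the settings mean what their names suggest: sample at `FPS` frames per second, then clamp the count between `FPS_MIN_FRAMES` and `FPS_MAX_FRAMES`. It is not the sampling code used by `qwen-vl-utils` or `transformers`, and `estimate_frames` is a made-up helper.

```python
def estimate_frames(duration_s: float, fps: float = 2.0,
                    min_frames: int = 4, max_frames: int = 240) -> int:
    """Estimate the sampled frame count under the assumed clamp rule."""
    raw = round(duration_s * fps)
    return max(min_frames, min(max_frames, raw))


print(estimate_frames(8.0))    # 8-second clip at 2 fps -> 16
print(estimate_frames(1.0))    # short clip is raised to the minimum -> 4
print(estimate_frames(300.0))  # long clip is capped at the maximum -> 240
```

Under this assumption, any clip longer than 120 seconds hits the 240-frame cap at 2 fps.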
Alternative: Pure MLX (faster, compressed timestamps)
If you only need scene descriptions and don't need precise timestamps:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("junwatu/Marlin-2B-MLX-8bit")
config = load_config("junwatu/Marlin-2B-MLX-8bit")
# CAPTION_PROMPT is the same prompt string defined in the hybrid example
prompt = apply_chat_template(processor, config, CAPTION_PROMPT, video=["video.mp4"], fps=2.0)
output = generate(model, processor, prompt, video=["video.mp4"], max_tokens=384, fps=2.0)
print(output.text) # ~22s, timestamps compressed but descriptions accurate
Serving as an API
# See scripts/serve_marlin_mlx.py in the repo for a FastAPI server
# Key: endpoint must be async to keep MLX on the main event loop thread
uvicorn.run(app, host="0.0.0.0", port=8080, workers=1, loop="asyncio")
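The note about async endpoints follows from how FastAPI dispatches handlers: `async def` endpoints run on the event loop thread, while plain `def` endpoints are handed to a worker thread pool. The standard-library sketch below shows that difference without FastAPI. `asyncio.to_thread` stands in for the thread-pool dispatch, and the two endpoint functions are placeholders, not code from `scripts/serve_marlin_mlx.py`.

```python
import asyncio
import threading


async def async_endpoint() -> str:
    # Runs as a coroutine on the event loop thread.
    return threading.current_thread().name


def sync_endpoint() -> str:
    # Stand-in for a plain `def` endpoint.
    return threading.current_thread().name


async def main() -> tuple:
    loop_thread = threading.current_thread().name
    async_thread = await async_endpoint()
    # Stand-in for FastAPI's thread-pool dispatch of `def` endpoints.
    sync_thread = await asyncio.to_thread(sync_endpoint)
    return loop_thread, async_thread, sync_thread


loop_thread, async_thread, sync_thread = asyncio.run(main())
print(async_thread == loop_thread)  # True: same thread as the event loop
print(sync_thread == loop_thread)   # False: ran on a worker thread
```

One consequence is that generation inside an `async def` endpoint blocks the event loop while it runs, so with `workers=1` requests are effectively served one at a time.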
Conversion details
# Patch vision_config.model_type from "qwen3_5_vision" to "qwen3_5" first
python -m mlx_vlm.convert --hf-path NemoStation/Marlin-2B --mlx-path Marlin-2B-MLX-8bit -q --q-bits 8
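The patch step mentioned in the comment can be scripted. The sketch below assumes the base model has been downloaded to a local directory and that its `config.json` has a `vision_config.model_type` field, as the comment implies. The helper name `patch_vision_model_type` is hypothetical, and the demo runs on a toy config in a temporary directory rather than on the real model files.

```python
import json
import tempfile
from pathlib import Path


def patch_vision_model_type(config_path: Path,
                            old: str = "qwen3_5_vision",
                            new: str = "qwen3_5") -> bool:
    """Rewrite vision_config.model_type in place; return True if changed."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    vision = config.get("vision_config", {})
    if vision.get("model_type") != old:
        return False  # nothing to patch (already patched or field absent)
    vision["model_type"] = new
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True


# Demo on a toy config; the real file has many more fields.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"vision_config": {"model_type": "qwen3_5_vision"}}))
    changed = patch_vision_model_type(cfg)
    patched_type = json.loads(cfg.read_text())["vision_config"]["model_type"]
    changed_again = patch_vision_model_type(cfg)

print(changed, patched_type, changed_again)
```

For a real conversion, the function would be pointed at the `config.json` of the local copy, and `--hf-path` would presumably point at that directory instead of the Hub ID.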
Performance (Apple Silicon)
Tested on 8-second video clips:
| Mode | Caption time | Timestamps |
|---|---|---|
| Pure MLX | ~22s | Compressed (mlx-vlm M-RoPE limitation) |
| Hybrid (recommended) | ~28s | ✅ Correct |
| PyTorch MPS (original) | ~130s | ✅ Correct |
Known limitations
- Pure mlx-vlm mode produces compressed timestamps due to missing `mm_token_type_ids` support in their Qwen3.5 M-RoPE implementation. Use the hybrid approach for correct timestamps.
- The hybrid approach requires both `transformers` and `mlx-vlm` installed.
Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.10+
- `mlx-vlm >= 0.5.0`
- `transformers >= 5.7.0` (for hybrid mode)
- `torch`, `torchcodec`, `qwen-vl-utils >= 0.0.14`
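A quick way to confirm the minimum versions is to query installed package metadata. The sketch below uses only the standard library. `version_tuple` is a deliberately crude comparison that looks at the first three numeric components and ignores pre-release tags, and `check_requirements` is a hypothetical helper, not part of any of the listed packages.

```python
import re
from importlib import metadata

# Minimum versions taken from the list above.
REQUIRED = {"mlx-vlm": "0.5.0", "transformers": "5.7.0", "qwen-vl-utils": "0.0.14"}


def version_tuple(version: str) -> tuple:
    """Crude parse: first three numeric components, pre-release tags ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def check_requirements(requirements: dict = REQUIRED) -> dict:
    """Map each package name to 'missing' or its installed version and status."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = f"{installed} ({'ok' if ok else 'needs >= ' + minimum})"
    return report


for package, status in check_requirements().items():
    print(f"{package}: {status}")
```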
License
Apache 2.0 (same as base model)