Marlin 2B — MLX 8-bit (Apple Silicon)

MLX 8-bit quantized version of NemoStation/Marlin-2B for fast inference on Apple Silicon Macs.

Key specs

Metric	Original (PyTorch)	This model (MLX hybrid)
Precision	bfloat16	8-bit (9.625 avg bits/weight)
Size	5.1 GB	2.5 GB
Peak memory	~7.5 GB	~5 GB
Caption time (8-second clip)	~130s	~28s
Timestamps	✅ Correct	✅ Correct
Speedup	—	4.6x

What is Marlin?

Marlin is a 2B-parameter video vision-language model (VLM) for dense captioning and temporal grounding. It has two modes:

Caption mode: returns Scene: <paragraph> followed by Events: <start-end> <description>.
Find mode: given a query, returns a time range of the form From X.X to Y.Y.
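
Both output formats are regular enough to post-process with a couple of regular expressions. The sketch below is a hypothetical helper, not part of the Marlin repository. It assumes Find mode answers look like `From 1.5 to 4.0.` and that caption events appear one per line as `<start - end> description`, which is what the caption prompt asks for. Real model output may deviate, so treat unmatched lines as a cue to inspect the raw text.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    start: float
    end: float
    description: str


# Assumed formats (hypothetical; verify against real model output).
_FIND_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)")
_EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*(.+)")


def parse_find(text: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds from a Find-mode answer, or None."""
    m = _FIND_RE.search(text)
    return (float(m.group(1)), float(m.group(2))) if m else None


def parse_events(text: str) -> List[Event]:
    """Collect every line that matches the assumed event format."""
    events = []
    for line in text.splitlines():
        m = _EVENT_RE.search(line)
        if m:
            events.append(
                Event(float(m.group(1)), float(m.group(2)), m.group(3).strip())
            )
    return events
```

Lines that do not match, such as the Scene paragraph, are skipped rather than raising, so the same function can be run over the whole caption.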

Recommended usage: Hybrid approach

The hybrid approach uses Hugging Face transformers to prepare the inputs, so that the M-RoPE temporal position IDs are computed correctly, and MLX for fast generation. The result is correct timestamps at close to pure-MLX speed.

Install

pip install "mlx-vlm>=0.5.0" "transformers>=5.7.0" torch torchcodec "qwen-vl-utils>=0.0.14"

Code

import os
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchcodec"
os.environ["VIDEO_MAX_PIXELS"] = "200704"
os.environ["FPS"] = "2.0"
os.environ["FPS_MAX_FRAMES"] = "240"
os.environ["FPS_MIN_FRAMES"] = "4"

import mlx.core as mx
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from mlx_vlm import load as mlx_load
from mlx_vlm.models.cache import make_prompt_cache

HF_MODEL = "NemoStation/Marlin-2B"
MLX_MODEL = "junwatu/Marlin-2B-MLX-8bit"

CAPTION_PROMPT = (
    "Provide a spatial description of this clip followed by time-ranged events.\n"
    "For each event, give the time range as <start - end> and a short description."
)

# 1. Prepare inputs with HF (correct position encoding)
hf_processor = AutoProcessor.from_pretrained(HF_MODEL, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL, trust_remote_code=True, dtype=torch.float32, low_cpu_mem_usage=True
)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": CAPTION_PROMPT},
]}]
inputs = hf_processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
)
with torch.no_grad():
    position_ids, _ = hf_model.model.get_rope_index(
        input_ids=inputs["input_ids"],
        mm_token_type_ids=inputs["mm_token_type_ids"],
        video_grid_thw=inputs.get("video_grid_thw"),
        attention_mask=inputs.get("attention_mask"),
    )
del hf_model  # free memory

# 2. Generate with MLX (fast)
mlx_model, mlx_processor = mlx_load(MLX_MODEL)
input_ids = mx.array(inputs["input_ids"].numpy())
pixel_values = mx.array(inputs["pixel_values_videos"].numpy())
video_grid_thw = mx.array(inputs["video_grid_thw"].numpy())

embedding_output = mlx_model.get_input_embeddings(
    input_ids, pixel_values,
    mask=mx.array(inputs["attention_mask"].numpy()),
    video_grid_thw=video_grid_thw,
)
mlx_model.language_model._position_ids = mx.array(position_ids.numpy())

prompt_cache = make_prompt_cache(mlx_model.language_model)
outputs = mlx_model.language_model(
    input_ids, inputs_embeds=embedding_output.inputs_embeds, cache=prompt_cache
)
mx.eval([c.state for c in prompt_cache])

# Greedy decode
eos = mlx_model.config.eos_token_id
y = mx.argmax(outputs.logits[:, -1, :], axis=-1, keepdims=True)
tokens = []
for _ in range(384):
    t = y.item()
    if t == eos:
        break
    tokens.append(t)
    outputs = mlx_model.language_model(y, cache=prompt_cache)
    mx.eval([c.state for c in prompt_cache])
    y = mx.argmax(outputs.logits[:, -1, :], axis=-1, keepdims=True)

text = mlx_processor.tokenizer.decode(tokens, skip_special_tokens=True)
print(text)

Alternative: Pure MLX (faster, compressed timestamps)

If you only need scene descriptions and don't need precise timestamps:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("junwatu/Marlin-2B-MLX-8bit")
config = load_config("junwatu/Marlin-2B-MLX-8bit")

# CAPTION_PROMPT is the prompt string defined in the hybrid example above
prompt = apply_chat_template(processor, config, CAPTION_PROMPT, video=["video.mp4"], fps=2.0)
output = generate(model, processor, prompt, video=["video.mp4"], max_tokens=384, fps=2.0)
print(output.text)  # ~22s, timestamps compressed but descriptions accurate

Serving as an API

# See scripts/serve_marlin_mlx.py in the repo for a FastAPI server.
# Key: endpoints must be async so that MLX stays on the main event loop thread.
import uvicorn

# `app` is the FastAPI application object (see the script referenced above)
uvicorn.run(app, host="0.0.0.0", port=8080, workers=1, loop="asyncio")
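
The async requirement follows from how FastAPI schedules handlers: `async def` endpoints run on the event loop thread, while plain `def` endpoints are dispatched to a worker thread pool. The standard-library sketch below imitates the two cases with `asyncio.to_thread`, so the difference is visible without starting a server. The function names are illustrative and nothing here touches MLX.

```python
import asyncio
import threading


def current_thread_name() -> str:
    return threading.current_thread().name


async def async_style_handler() -> str:
    # Runs directly on the event loop thread, like an `async def` endpoint.
    return current_thread_name()


async def sync_style_handler() -> str:
    # Imitates a plain `def` endpoint, which is offloaded to a worker thread.
    return await asyncio.to_thread(current_thread_name)


async def main() -> tuple:
    loop_thread = current_thread_name()
    return loop_thread, await async_style_handler(), await sync_style_handler()


loop_thread, async_thread, sync_thread = asyncio.run(main())
print(f"event loop: {loop_thread}, async: {async_thread}, sync-style: {sync_thread}")
```

The async handler reports the same thread as the event loop, whereas the sync-style handler reports a pool worker, which is the situation the `async` endpoint avoids.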

Conversion details

# Patch vision_config.model_type from "qwen3_5_vision" to "qwen3_5" first
python -m mlx_vlm.convert --hf-path NemoStation/Marlin-2B --mlx-path Marlin-2B-MLX-8bit -q --q-bits 8
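
The patch mentioned in the comment can be scripted. This is a minimal sketch that assumes the field lives in `config.json` under `vision_config.model_type` in a local copy of the checkpoint. The helper name and the directory layout are assumptions, not part of mlx-vlm.

```python
import json
from pathlib import Path


def patch_vision_model_type(
    model_dir: str, old: str = "qwen3_5_vision", new: str = "qwen3_5"
) -> bool:
    """Rewrite vision_config.model_type in config.json; return True if changed."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    vision_config = config.get("vision_config", {})
    if vision_config.get("model_type") != old:
        # Already patched, or the layout differs from the assumption above.
        return False
    vision_config["model_type"] = new
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

The function is idempotent: a second call on the same directory returns False and leaves the file untouched.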

Performance (Apple Silicon)

Tested on 8-second video clips:

Mode	Caption time	Timestamps
Pure MLX	~22s	Compressed (mlx-vlm M-RoPE limitation)
Hybrid (recommended)	~28s	✅ Correct
PyTorch MPS (original)	~130s	✅ Correct
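
The 4.6x speedup quoted in the key specs is the ratio of the PyTorch MPS time to the hybrid time. A quick check of that arithmetic, using the approximate timings from the table:

```python
# Approximate caption times in seconds for an 8-second clip (from the table).
caption_seconds = {"Pure MLX": 22, "Hybrid": 28, "PyTorch MPS": 130}

baseline = caption_seconds["PyTorch MPS"]
speedups = {
    mode: round(baseline / secs, 1) for mode, secs in caption_seconds.items()
}

for mode, factor in speedups.items():
    print(f"{mode}: {factor}x")
# Pure MLX: 5.9x
# Hybrid: 4.6x
# PyTorch MPS: 1.0x
```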

Known limitations

Pure mlx-vlm mode produces compressed timestamps because mlx-vlm's Qwen3.5 M-RoPE implementation does not support mm_token_type_ids. Use the hybrid approach when timestamps must be correct.
The hybrid approach requires both transformers and mlx-vlm to be installed.

Requirements

Apple Silicon Mac (M1/M2/M3/M4)
Python 3.10+
mlx-vlm >= 0.5.0
transformers >= 5.7.0 (for hybrid mode)
torch, torchcodec, qwen-vl-utils >= 0.0.14
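
These requirements can be checked before downloading any weights. The following is a hypothetical preflight helper, not something shipped with the model. It uses `importlib.metadata` and a deliberately naive version parser, so prefer `packaging.version` for anything stricter.

```python
import platform
import sys
from importlib import metadata

# Minimum versions from the requirements list; None means "any version".
REQUIRED = {
    "mlx-vlm": "0.5.0",
    "transformers": "5.7.0",
    "torch": None,
    "torchcodec": None,
    "qwen-vl-utils": "0.0.14",
}


def version_tuple(version: str) -> tuple:
    """Naive parse: leading digits of the first three dotted components."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_environment() -> list:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if not (sys.platform == "darwin" and platform.machine() == "arm64"):
        problems.append("not an Apple Silicon Mac")
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ required")
    for package, minimum in REQUIRED.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if minimum and version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{package} {installed} < {minimum}")
    return problems
```

Calling `check_environment()` and printing each returned string gives a quick summary of what is missing.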

License

Apache 2.0 (same as base model)
