Marlin 2B — MLX 8-bit (Apple Silicon)

MLX 8-bit quantized version of NemoStation/Marlin-2B for fast inference on Apple Silicon Macs.

Key specs

| | Original (PyTorch) | This model (MLX hybrid) |
|---|---|---|
| Precision | bfloat16 | 8-bit (9.625 avg bits/weight) |
| Size | 5.1 GB | 2.5 GB |
| Peak memory | ~7.5 GB | ~5 GB |
| Caption speed | ~130s/video | ~28s/video |
| Timestamps | ✅ Correct | ✅ Correct |
| Speedup | (baseline) | 4.6x |

What is Marlin?

Marlin is a 2B-parameter video vision-language model (VLM) for dense captioning and temporal grounding. It has two modes:

  • Caption mode: returns Scene: <paragraph> followed by Events: with one <start-end> <description> entry per event.
  • Find mode: given a query, returns From X.X to Y.Y.
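The two reply formats above lend themselves to a small parser. The following is a minimal sketch, not part of the model's tooling: the names `parse_caption`, `parse_find`, and `Event` are hypothetical, and the exact layout (one event per line, times in seconds, optional angle brackets around the range) is an assumption based on the format summary and the caption prompt.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed formats (inferred from the summary above, not confirmed):
#   event line: "<0.0 - 2.5> description"   (angle brackets optional)
#   find reply: "From 1.2 to 3.4."
_NUM = r"(\d+(?:\.\d+)?)"
EVENT_RE = re.compile(rf"^\s*<?\s*{_NUM}\s*-\s*{_NUM}\s*>?\s*(.+?)\s*$")
FIND_RE = re.compile(rf"From\s+{_NUM}\s+to\s+{_NUM}")


@dataclass
class Event:
    start: float
    end: float
    description: str


def parse_caption(text: str) -> tuple[str, list[Event]]:
    """Split a caption-mode reply into the scene paragraph and its events."""
    scene_part, _, events_part = text.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = []
    for line in events_part.splitlines():
        m = EVENT_RE.match(line)
        if m:  # lines that do not look like an event are skipped
            events.append(Event(float(m.group(1)), float(m.group(2)), m.group(3)))
    return scene, events


def parse_find(text: str) -> tuple[float, float] | None:
    """Extract (start, end) from a find-mode reply, or None if absent."""
    m = FIND_RE.search(text)
    return (float(m.group(1)), float(m.group(2))) if m else None
```

If real replies differ from the assumed layout (for example, several events on one line), the regular expressions would need adjusting.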

Recommended usage: Hybrid approach

The hybrid approach uses HF transformers to prepare the inputs and compute the M-RoPE position IDs (so temporal positions are correct), then hands generation to MLX. The result is correct timestamps at close to pure-MLX speed (~28s vs ~22s per 8-second clip).

Install

pip install mlx-vlm "transformers>=5.7.0" torch torchcodec "qwen-vl-utils>=0.0.14"

Code

import os
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchcodec"
os.environ["VIDEO_MAX_PIXELS"] = "200704"
os.environ["FPS"] = "2.0"
os.environ["FPS_MAX_FRAMES"] = "240"
os.environ["FPS_MIN_FRAMES"] = "4"

import mlx.core as mx
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from mlx_vlm import load as mlx_load
from mlx_vlm.models.cache import make_prompt_cache

HF_MODEL = "NemoStation/Marlin-2B"
MLX_MODEL = "junwatu/Marlin-2B-MLX-8bit"

CAPTION_PROMPT = (
    "Provide a spatial description of this clip followed by time-ranged events.\n"
    "For each event, give the time range as <start - end> and a short description."
)

# 1. Prepare inputs with HF (correct position encoding)
hf_processor = AutoProcessor.from_pretrained(HF_MODEL, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL, trust_remote_code=True, dtype=torch.float32, low_cpu_mem_usage=True
)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": CAPTION_PROMPT},
]}]
inputs = hf_processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
)
with torch.no_grad():
    position_ids, _ = hf_model.model.get_rope_index(
        input_ids=inputs["input_ids"],
        mm_token_type_ids=inputs["mm_token_type_ids"],
        video_grid_thw=inputs.get("video_grid_thw"),
        attention_mask=inputs.get("attention_mask"),
    )
del hf_model  # free memory

# 2. Generate with MLX (fast)
mlx_model, mlx_processor = mlx_load(MLX_MODEL)
input_ids = mx.array(inputs["input_ids"].numpy())
pixel_values = mx.array(inputs["pixel_values_videos"].numpy())
video_grid_thw = mx.array(inputs["video_grid_thw"].numpy())

embedding_output = mlx_model.get_input_embeddings(
    input_ids, pixel_values,
    mask=mx.array(inputs["attention_mask"].numpy()),
    video_grid_thw=video_grid_thw,
)
mlx_model.language_model._position_ids = mx.array(position_ids.numpy())

prompt_cache = make_prompt_cache(mlx_model.language_model)
outputs = mlx_model.language_model(
    input_ids, inputs_embeds=embedding_output.inputs_embeds, cache=prompt_cache
)
mx.eval([c.state for c in prompt_cache])

# Greedy decode
eos = mlx_model.config.eos_token_id
y = mx.argmax(outputs.logits[:, -1, :], axis=-1, keepdims=True)
tokens = []
for _ in range(384):
    t = y.item()
    if t == eos:
        break
    tokens.append(t)
    outputs = mlx_model.language_model(y, cache=prompt_cache)
    mx.eval([c.state for c in prompt_cache])
    y = mx.argmax(outputs.logits[:, -1, :], axis=-1, keepdims=True)

text = mlx_processor.tokenizer.decode(tokens, skip_special_tokens=True)
print(text)
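The greedy loop above compares each token to a single `eos` value. Some Hugging Face configs store `eos_token_id` as a list rather than an int, in which case a plain `t == eos` comparison would never match. Whether that applies to this model's config is an assumption worth checking. The sketch below factors the loop into a framework-independent helper that accepts either form. `greedy_decode`, `normalize_eos`, and the `step` callable are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, Union


def normalize_eos(eos: Union[int, Iterable[int], None]) -> set[int]:
    """Return the EOS id(s) as a set, whether given an int, a list, or None."""
    if eos is None:
        return set()
    if isinstance(eos, int):
        return {eos}
    return set(eos)


def greedy_decode(
    step: Callable[[int], int],
    first_token: int,
    eos: Union[int, Iterable[int], None],
    max_tokens: int = 384,
) -> list[int]:
    """Collect tokens until an EOS id appears or max_tokens is reached.

    `step(token)` is a hypothetical callable that feeds one token to the
    model and returns the next greedy (argmax) token id.
    """
    stop = normalize_eos(eos)
    tokens: list[int] = []
    token = first_token
    for _ in range(max_tokens):
        if token in stop:
            break
        tokens.append(token)
        token = step(token)
    return tokens
```

In the hybrid example, `step` would wrap the call to `mlx_model.language_model(y, cache=prompt_cache)` followed by the `mx.argmax` over the last position, and `first_token` would be the argmax of the prefill logits.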

Alternative: Pure MLX (faster, compressed timestamps)

If you only need scene descriptions and don't need precise timestamps:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("junwatu/Marlin-2B-MLX-8bit")
config = load_config("junwatu/Marlin-2B-MLX-8bit")

# CAPTION_PROMPT is the same string defined in the hybrid example above
prompt = apply_chat_template(processor, config, CAPTION_PROMPT, video=["video.mp4"], fps=2.0)
output = generate(model, processor, prompt, video=["video.mp4"], max_tokens=384, fps=2.0)
print(output.text)  # ~22s, timestamps compressed but descriptions accurate

Serving as an API

# See scripts/serve_marlin_mlx.py in the repo for a FastAPI server.
# Key: endpoints must be `async def` to keep MLX on the main event loop thread.
import uvicorn

# `app` is the FastAPI application defined in scripts/serve_marlin_mlx.py
uvicorn.run(app, host="0.0.0.0", port=8080, workers=1, loop="asyncio")
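The reason the endpoint must be async comes down to which thread runs the handler: FastAPI dispatches plain `def` endpoints to a worker thread pool, while `async def` endpoints run on the event loop thread. The following standard-library sketch reproduces that difference without FastAPI. The handler names are hypothetical stand-ins, and `run_in_executor` is used here only to imitate the thread-pool dispatch.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def sync_handler() -> str:
    # Stand-in for a plain `def` endpoint: executed on a pool thread.
    return threading.current_thread().name


async def async_handler() -> str:
    # Stand-in for an `async def` endpoint: executed on the event loop thread.
    return threading.current_thread().name


async def main() -> dict[str, str]:
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1, thread_name_prefix="worker") as pool:
        from_pool = await loop.run_in_executor(pool, sync_handler)
    on_loop = await async_handler()
    return {
        "loop": threading.current_thread().name,
        "async": on_loop,
        "sync": from_pool,
    }


result = asyncio.run(main())
print(result)
```

The `async` entry matches the loop thread, while the `sync` entry reports a worker thread. One consequence of keeping generation on the loop thread is that a long caption request blocks other requests until it finishes, which is consistent with running a single worker.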

Conversion details

# Patch vision_config.model_type from "qwen3_5_vision" to "qwen3_5" first
python -m mlx_vlm.convert --hf-path NemoStation/Marlin-2B --mlx-path Marlin-2B-MLX-8bit -q --q-bits 8
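The patch step mentioned in the comment above can be scripted. This is a minimal sketch under two assumptions: that the base model has been downloaded to a local directory, and that its `config.json` has a top-level `vision_config` object with a `model_type` key. The function name `patch_vision_model_type` is hypothetical.

```python
import json
from pathlib import Path


def patch_vision_model_type(
    config_path: str,
    old: str = "qwen3_5_vision",
    new: str = "qwen3_5",
) -> bool:
    """Rewrite vision_config.model_type in a local config.json.

    Returns True if the file was changed, False if there was nothing to patch.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    vision = config.get("vision_config")
    if not isinstance(vision, dict) or vision.get("model_type") != old:
        return False  # key missing, or value already patched
    vision["model_type"] = new
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

After patching, the conversion command would presumably be pointed at the local directory via `--hf-path` instead of the hub ID. Whether a local path is accepted there should be checked against the installed mlx-vlm version.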

Performance (Apple Silicon)

Tested on 8-second video clips:

| Mode | Caption time | Timestamps |
|---|---|---|
| Pure MLX | ~22s | Compressed (mlx-vlm M-RoPE limitation) |
| Hybrid (recommended) | ~28s | ✅ Correct |
| PyTorch MPS (original) | ~130s | ✅ Correct |
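As a quick check on these figures, the speedups relative to the PyTorch baseline and the processing time per second of video follow directly from the table. The timings are approximate, so the derived values are rough as well.

```python
clip_seconds = 8.0
baseline_s = 130.0  # PyTorch MPS (original)
timings_s = {"Pure MLX": 22.0, "Hybrid": 28.0}

# Speedup over the PyTorch baseline, rounded to one decimal place.
speedups = {mode: round(baseline_s / t, 1) for mode, t in timings_s.items()}

# Seconds of compute per second of video for each MLX mode.
per_video_second = {mode: t / clip_seconds for mode, t in timings_s.items()}

print(speedups)          # {'Pure MLX': 5.9, 'Hybrid': 4.6}
print(per_video_second)  # {'Pure MLX': 2.75, 'Hybrid': 3.5}
```

The hybrid figure matches the 4.6x speedup quoted in the key specs. Neither MLX mode is real-time on these clips, since both take more than one second of compute per second of video.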

Known limitations

  • Pure mlx-vlm mode produces compressed timestamps because mlx-vlm's Qwen3.5 M-RoPE implementation does not support mm_token_type_ids. Use the hybrid approach for correct timestamps.
  • The hybrid approach requires both transformers and mlx-vlm installed.

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • Python 3.10+
  • mlx-vlm >= 0.5.0
  • transformers >= 5.7.0 (for hybrid mode)
  • torch, torchcodec, qwen-vl-utils >= 0.0.14

License

Apache 2.0 (same as base model)
