DolphinGR00T-N1.5-3B-Zero
by Eric Hartford
I love GR00T, but NVIDIA's license - tsk-tsk, no no no, that won't do at all.
Also, all their inference code is wrapped in hard-coded CUDA dependencies. Rude.
The world - our future, and our children's future - deserves a high-quality, permissively licensed robot control model that isn't tied to any specific hardware.
This repo contains a fully open-source Apache 2.0 licensed, randomly initialized version of the GR00T-N1.5-3B architecture for humanoid robot control. This model has the exact same architecture as NVIDIA's GR00T-N1.5-3B but with random weights.
And NO it's NOT gonna be uncensored! It's driving a humanoid robot you guys! I am not trying to burn down the world here! (you can easily finetune it to do ANYTHING you want it to.)
I created this model using this script.
The purpose is to distill GR00T into an Apache-2.0 licensed version.
The whole job looks like this:
- Make an Apache 2.0 licensed "blank slate" with the right shape (this repo).
- Track down the sub-components that are already Apache 2.0 and bring those weights in (Qwen3-1.7B, for instance, is used as the language tower).
- For the missing components, find an initialization that's better than "random" - like merging similar models into the correct shape.
- Distill GR00T onto it with online logit distillation (sketched below). The model is small, so it's easy to load both teacher and student into VRAM.
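For the curious, here is a minimal sketch of what that online distillation step could look like, assuming teacher and student are both loaded and return comparable output tensors for the same batch. The function name, the KL-on-softened-logits loss, and the `teacher(**batch)` / `student(**batch)` forward interface are illustrative assumptions, not the actual distillation script.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, temperature=2.0):
    """One online distillation step: the student matches the frozen teacher's outputs.
    NOTE: teacher/student returning comparable logit tensors is an assumption."""
    with torch.no_grad():
        teacher_logits = teacher(**batch)
    student_logits = student(**batch)

    # Standard soft-target KL divergence, scaled by temperature^2
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()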

Model Description
DolphinGR00T-N1.5-3B-Zero is a Vision-Language-Action (VLA) model designed for humanoid robot control:
- Architecture: Dual-system design with vision-language backbone (Eagle-based with Qwen3 LLM) and diffusion transformer action head
- Parameters: 2,724M total (1,655M backbone in bfloat16, 1,069M action head in float32)
- License: Apache-2.0 (fully open source)
- Weights: Randomly initialized - no pre-training, ready for your own training
Key Features
- ✅ Exact architecture match with NVIDIA GR00T-N1.5-3B
- ✅ No license restrictions - Apache-2.0 throughout
- ✅ Mixed precision ready - bfloat16 backbone, float32 action head
- ✅ Multi-modal inputs - images, language instructions, and robot proprioception
- ✅ Continuous action output via diffusion transformer
Installation
pip install torch transformers safetensors
Usage
Loading the Model
import torch
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained(
    "DolphinGR00T-N1.5-3B-Zero",
    trust_remote_code=True,
    torch_dtype="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DolphinGR00T-N1.5-3B-Zero")

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
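Once loaded, you can sanity-check the parameter split described above. The `backbone` and `action_head` attribute names here follow the simplified examples used throughout this card and may differ in the actual module tree.

# Rough sanity check of the parameter counts and dtypes described above
def count_params_millions(module):
    return sum(p.numel() for p in module.parameters()) / 1e6

print(f"Backbone:    {count_params_millions(model.backbone):.0f}M, "
      f"dtype {next(model.backbone.parameters()).dtype}")
print(f"Action head: {count_params_millions(model.action_head):.0f}M, "
      f"dtype {next(model.action_head.parameters()).dtype}")
# Expected roughly: ~1655M in torch.bfloat16 and ~1069M in torch.float32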
Inference Example
import torch
import torch.nn.functional as F
from PIL import Image
import numpy as np

def prepare_image(image_path, target_size=(224, 224)):
    """Prepare image for model input"""
    image = Image.open(image_path).convert('RGB')
    image = image.resize(target_size)
    # Normalize to [-1, 1]
    image = np.array(image).astype(np.float32) / 127.5 - 1.0
    image = torch.from_numpy(image).permute(2, 0, 1)
    return image

def inference(model, tokenizer, image_paths, instruction, robot_state, device):
    """
    Run inference to generate robot actions

    Args:
        image_paths: List of paths to camera images
        instruction: Natural language instruction
        robot_state: Current robot proprioception (joint angles, etc.)
        device: torch device

    Returns:
        actions: Predicted robot actions
    """
    model.eval()
    with torch.no_grad():
        # Prepare inputs
        images = torch.stack([prepare_image(path) for path in image_paths])
        images = images.unsqueeze(0).to(device)  # Add batch dimension

        # Tokenize instruction
        text_inputs = tokenizer(
            instruction,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=256
        ).to(device)

        # Robot state (example: 32-dim joint angles)
        if isinstance(robot_state, list):
            robot_state = torch.tensor(robot_state, dtype=torch.float32)
        robot_state = robot_state.unsqueeze(0).to(device)

        # Forward pass through backbone
        # Note: This is a simplified example - actual implementation depends on model interface
        vision_features = model.backbone.eagle_model.vision_model(images)

        # Process language
        language_features = model.backbone.eagle_model.language_model.model(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask
        ).last_hidden_state

        # Combine features (simplified - actual fusion may be more complex)
        combined_features = torch.cat([
            vision_features.mean(dim=1),   # Pool vision features
            language_features.mean(dim=1)  # Pool language features
        ], dim=-1)

        # Generate actions through diffusion process
        # This is a simplified placeholder - actual diffusion requires multiple steps
        action_features = model.action_head.model(
            combined_features,
            timesteps=torch.zeros(1, device=device),
            context=robot_state
        )

        # Decode to action space
        actions = model.action_head.action_decoder(action_features)

    return actions

# Example usage
image_paths = ["camera1.jpg", "camera2.jpg"]
instruction = "Pick up the red cube and place it on the table"
robot_state = torch.randn(32)  # Example: 32 joint angles

actions = inference(model, tokenizer, image_paths, instruction, robot_state, device)
print(f"Predicted actions shape: {actions.shape}")
Training Example
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import get_linear_schedule_with_warmup

class RobotDataset(Dataset):
    """Example dataset for robot manipulation tasks"""
    def __init__(self, data_path, tokenizer, transform=None):
        self.data = []  # Load your data here
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return dict with keys: images, instruction, robot_state, target_actions
        sample = self.data[idx]

        # Process images
        images = torch.stack([self.transform(img) for img in sample['images']])

        # Tokenize instruction
        text = self.tokenizer(
            sample['instruction'],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=256
        )

        return {
            'images': images,
            'input_ids': text['input_ids'].squeeze(),
            'attention_mask': text['attention_mask'].squeeze(),
            'robot_state': torch.tensor(sample['robot_state'], dtype=torch.float32),
            'target_actions': torch.tensor(sample['target_actions'], dtype=torch.float32)
        }

def train_step(model, batch, criterion, device):
    """Single training step"""
    # Move batch to device
    images = batch['images'].to(device)
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    robot_state = batch['robot_state'].to(device)
    target_actions = batch['target_actions'].to(device)

    # Forward pass (simplified - actual implementation may differ)
    # Process vision
    vision_features = model.backbone.eagle_model.vision_model(images)

    # Process language
    language_output = model.backbone.eagle_model.language_model.model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    language_features = language_output.last_hidden_state

    # Combine modalities
    combined_features = torch.cat([
        vision_features.mean(dim=1),
        language_features.mean(dim=1)
    ], dim=-1)

    # Generate actions (simplified diffusion)
    predicted_actions = model.action_head(
        combined_features,
        context=robot_state
    )

    # Compute loss
    loss = criterion(predicted_actions, target_actions)
    return loss

# Training setup
def train_model(model, train_dataset, val_dataset, config):
    """Main training loop"""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=config['batch_size'],
        shuffle=True,
        num_workers=4
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=config['batch_size'],
        shuffle=False,
        num_workers=4
    )

    # Setup optimizer with different learning rates for backbone and action head
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.parameters(), 'lr': config['backbone_lr']},
        {'params': model.action_head.parameters(), 'lr': config['action_head_lr']}
    ], weight_decay=config['weight_decay'])

    # Learning rate scheduler
    num_training_steps = len(train_loader) * config['num_epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['warmup_steps'],
        num_training_steps=num_training_steps
    )

    # Loss function
    criterion = nn.MSELoss()  # or nn.L1Loss() for action prediction

    # Training loop
    for epoch in range(config['num_epochs']):
        model.train()
        total_loss = 0

        for batch_idx, batch in enumerate(train_loader):
            optimizer.zero_grad()
            loss = train_step(model, batch, criterion, device)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                model.parameters(),
                config['max_grad_norm']
            )

            optimizer.step()
            scheduler.step()
            total_loss += loss.item()

            if batch_idx % config['log_interval'] == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                loss = train_step(model, batch, criterion, device)
                val_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)
        print(f"Epoch {epoch}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Save checkpoint
        if (epoch + 1) % config['save_interval'] == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss,
            }, f"checkpoint_epoch_{epoch+1}.pt")

# Example configuration
config = {
    'batch_size': 16,
    'num_epochs': 100,
    'backbone_lr': 1e-5,
    'action_head_lr': 1e-4,
    'weight_decay': 0.01,
    'warmup_steps': 1000,
    'max_grad_norm': 1.0,
    'log_interval': 10,
    'save_interval': 10
}

# Create dataset (you need to implement data loading)
# train_dataset = RobotDataset("path/to/train/data", tokenizer)
# val_dataset = RobotDataset("path/to/val/data", tokenizer)

# Train model
# train_model(model, train_dataset, val_dataset, config)
Fine-tuning Tips
- Mixed Precision Training: The model is designed for mixed precision. Use torch.cuda.amp for faster training:

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

with autocast():
    loss = train_step(model, batch, criterion, device)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
- Gradient Checkpointing: For memory-efficient training:
model.backbone.eagle_model.language_model.gradient_checkpointing_enable()
- Frozen Backbone Training: Start by training only the action head:
# Freeze backbone
for param in model.backbone.parameters():
    param.requires_grad = False

# Train only action head
optimizer = torch.optim.AdamW(
    model.action_head.parameters(),
    lr=1e-4
)
Model Architecture
The model consists of two main components:
1. Vision-Language Backbone (System 2)
- Vision Encoder: Based on Eagle vision model with 27 transformer layers
- Language Model: Qwen3-based LLM with 12 layers, 2048 hidden dim
- Cross-modal Fusion: MLP connector between vision and language
2. Action Head (System 1)
- Diffusion Transformer: 16 DiT blocks for action generation
- State Encoder: Processes robot proprioception
- Action Decoder: Outputs continuous robot actions
- Self-Attention Blocks: 4 transformer blocks for vision-language features
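Because the inference example above collapses the diffusion process into a single placeholder call, here is a rough sketch of what an iterative denoising loop over the action head could look like. The step count, the Euler-style update, the action horizon and dimension, and the `model.action_head.model` / `action_decoder` call signatures are all assumptions carried over from the simplified examples, not the model's actual sampler.

import torch

def sample_actions(model, combined_features, robot_state, device,
                   num_steps=10, horizon=16, action_dim=32):
    """Illustrative iterative denoising loop (assumed interface, not the real sampler)."""
    # Start from pure noise over an assumed (horizon, action_dim) action chunk
    actions = torch.randn(1, horizon, action_dim, device=device)
    for step in range(num_steps):
        t = torch.full((1,), step / num_steps, device=device)
        # Predict an update direction conditioned on fused features and robot state
        features = model.action_head.model(combined_features, timesteps=t, context=robot_state)
        update = model.action_head.action_decoder(features)
        # Simple Euler-style integration toward the denoised action trajectory
        actions = actions + update.view_as(actions) / num_steps
    return actions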
Limitations
- This is a blank model with random weights - it requires training before use
- No pre-trained knowledge or capabilities
- Designed for humanoid robots but can be adapted for other embodiments
- Requires significant computational resources for training
Citation
If you use this model in your research, please cite:
@software{DolphinGR00T2024,
  title   = {DolphinGR00T-N1.5-3B-Zero: A Permissively Licensed Reimplementation of GR00T-N1.5-3B},
  author  = {Eric Hartford},
  year    = {2024},
  license = {Apache-2.0}
}
License
Apache-2.0 - this model is fully open source, with no restrictions beyond the standard Apache-2.0 terms.
Acknowledgments
This is an independent implementation of the GR00T architecture for the open-source community. The architecture is based on publicly available information about NVIDIA's GR00T-N1.5 model, but contains no proprietary code or weights.