DolphinGR00T-N1.5-3B-Zero
by Eric Hartford
I love GR00T, but NVIDIA's license - tsk-tsk, no no no, that won't do at all.
Also, all their inference code is wrapped in hard-coded CUDA dependencies. Rude.
The world - our future, and our children's future - deserves a high-quality, permissively licensed robot control model that isn't tied to any specific hardware.
This repo contains a fully open-source Apache 2.0 licensed, randomly initialized version of the GR00T-N1.5-3B architecture for humanoid robot control. This model has the exact same architecture as NVIDIA's GR00T-N1.5-3B but with random weights.
And NO it's NOT gonna be uncensored! It's driving a humanoid robot you guys! I am not trying to burn down the world here! (you can easily finetune it to do ANYTHING you want it to.)
I created this model using this script.
The purpose is to distill GR00T into an Apache-2.0 licensed version.
The whole job looks like this:
- Make an Apache 2.0 licensed "blank slate" with the right shape (this repo).
- Track down the sub-components that are already Apache 2.0 and bring those weights in (Qwen3-1.7B, for instance, is used as the language tower).
- For the missing components, find an initialization that's better than "random" - like merging similar models into the correct shape.
- Distill GR00T onto it with online logit distillation (sketched below). The model is small, so it's easy to load both teacher and student into VRAM.
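For the curious, here is a minimal sketch of what that online distillation step could look like, assuming teacher and student are both loaded and return comparable output tensors for the same batch. The function name, the KL-on-softened-logits loss, and the `teacher(**batch)` / `student(**batch)` forward interface are illustrative assumptions, not the actual distillation script.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, temperature=2.0):
    """One online distillation step: the student matches the frozen teacher's outputs.
    NOTE: teacher/student returning comparable logit tensors is an assumption."""
    with torch.no_grad():
        teacher_logits = teacher(**batch)
    student_logits = student(**batch)

    # Standard soft-target KL divergence, scaled by temperature^2
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()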

Model Description
DolphinGR00T-N1.5-3B-Zero is a Vision-Language-Action (VLA) model designed for humanoid robot control:
- Architecture: Dual-system design with vision-language backbone (Eagle-based with Qwen3 LLM) and diffusion transformer action head
- Parameters: 2,724M total (1,655M backbone in bfloat16, 1,069M action head in float32)
- License: Apache-2.0 (fully open source)
- Weights: Randomly initialized - no pre-training, ready for your own training
Key Features
- ✅ Exact architecture match with NVIDIA GR00T-N1.5-3B
- ✅ No license restrictions - Apache-2.0 throughout
- ✅ Mixed precision ready - bfloat16 backbone, float32 action head
- ✅ Multi-modal inputs - images, language instructions, and robot proprioception
- ✅ Continuous action output via diffusion transformer
Installation
pip install torch transformers safetensors
Usage
Loading the Model
import torch
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained(
    "DolphinGR00T-N1.5-3B-Zero",
    trust_remote_code=True,
    torch_dtype="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DolphinGR00T-N1.5-3B-Zero")

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
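Once loaded, you can sanity-check the parameter split described above. The `backbone` and `action_head` attribute names here follow the simplified examples used throughout this card and may differ in the actual module tree.

# Rough sanity check of the parameter counts and dtypes described above
def count_params_millions(module):
    return sum(p.numel() for p in module.parameters()) / 1e6

print(f"Backbone:    {count_params_millions(model.backbone):.0f}M, "
      f"dtype {next(model.backbone.parameters()).dtype}")
print(f"Action head: {count_params_millions(model.action_head):.0f}M, "
      f"dtype {next(model.action_head.parameters()).dtype}")
# Expected roughly: ~1655M in torch.bfloat16 and ~1069M in torch.float32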
Inference Example
import torch
import torch.nn.functional as F
from PIL import Image
import numpy as np

def prepare_image(image_path, target_size=(224, 224)):
    """Prepare image for model input"""
    image = Image.open(image_path).convert('RGB')
    image = image.resize(target_size)
    # Normalize to [-1, 1]
    image = np.array(image).astype(np.float32) / 127.5 - 1.0
    image = torch.from_numpy(image).permute(2, 0, 1)
    return image

def inference(model, tokenizer, image_paths, instruction, robot_state, device):
    """
    Run inference to generate robot actions

    Args:
        image_paths: List of paths to camera images
        instruction: Natural language instruction
        robot_state: Current robot proprioception (joint angles, etc.)
        device: torch device

    Returns:
        actions: Predicted robot actions
    """
    model.eval()
    with torch.no_grad():
        # Prepare inputs
        images = torch.stack([prepare_image(path) for path in image_paths])
        images = images.unsqueeze(0).to(device)  # Add batch dimension

        # Tokenize instruction
        text_inputs = tokenizer(
            instruction,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=256
        ).to(device)

        # Robot state (example: 32-dim joint angles)
        if isinstance(robot_state, list):
            robot_state = torch.tensor(robot_state, dtype=torch.float32)
        robot_state = robot_state.unsqueeze(0).to(device)

        # Forward pass through backbone
        # Note: This is a simplified example - actual implementation depends on model interface
        vision_features = model.backbone.eagle_model.vision_model(images)

        # Process language
        language_features = model.backbone.eagle_model.language_model.model(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask
        ).last_hidden_state

        # Combine features (simplified - actual fusion may be more complex)
        combined_features = torch.cat([
            vision_features.mean(dim=1),   # Pool vision features
            language_features.mean(dim=1)  # Pool language features
        ], dim=-1)

        # Generate actions through diffusion process
        # This is a simplified placeholder - actual diffusion requires multiple steps
        action_features = model.action_head.model(
            combined_features,
            timesteps=torch.zeros(1, device=device),
            context=robot_state
        )

        # Decode to action space
        actions = model.action_head.action_decoder(action_features)

    return actions

# Example usage
image_paths = ["camera1.jpg", "camera2.jpg"]
instruction = "Pick up the red cube and place it on the table"
robot_state = torch.randn(32)  # Example: 32 joint angles

actions = inference(model, tokenizer, image_paths, instruction, robot_state, device)
print(f"Predicted actions shape: {actions.shape}")
Training Example
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import get_linear_schedule_with_warmup

class RobotDataset(Dataset):
    """Example dataset for robot manipulation tasks"""
    def __init__(self, data_path, tokenizer, transform=None):
        self.data = []  # Load your data here
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return dict with keys: images, instruction, robot_state, target_actions
        sample = self.data[idx]

        # Process images
        images = torch.stack([self.transform(img) for img in sample['images']])

        # Tokenize instruction
        text = self.tokenizer(
            sample['instruction'],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=256
        )

        return {
            'images': images,
            'input_ids': text['input_ids'].squeeze(),
            'attention_mask': text['attention_mask'].squeeze(),
            'robot_state': torch.tensor(sample['robot_state'], dtype=torch.float32),
            'target_actions': torch.tensor(sample['target_actions'], dtype=torch.float32)
        }

def train_step(model, batch, criterion, device):
    """Single training step"""
    # Move batch to device
    images = batch['images'].to(device)
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    robot_state = batch['robot_state'].to(device)
    target_actions = batch['target_actions'].to(device)

    # Forward pass (simplified - actual implementation may differ)
    # Process vision
    vision_features = model.backbone.eagle_model.vision_model(images)

    # Process language
    language_output = model.backbone.eagle_model.language_model.model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    language_features = language_output.last_hidden_state

    # Combine modalities
    combined_features = torch.cat([
        vision_features.mean(dim=1),
        language_features.mean(dim=1)
    ], dim=-1)

    # Generate actions (simplified diffusion)
    predicted_actions = model.action_head(
        combined_features,
        context=robot_state
    )

    # Compute loss
    loss = criterion(predicted_actions, target_actions)
    return loss

# Training setup
def train_model(model, train_dataset, val_dataset, config):
    """Main training loop"""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Create dataloaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=config['batch_size'],
        shuffle=True,
        num_workers=4
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=config['batch_size'],
        shuffle=False,
        num_workers=4
    )

    # Setup optimizer with different learning rates for backbone and action head
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.parameters(), 'lr': config['backbone_lr']},
        {'params': model.action_head.parameters(), 'lr': config['action_head_lr']}
    ], weight_decay=config['weight_decay'])

    # Learning rate scheduler
    num_training_steps = len(train_loader) * config['num_epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['warmup_steps'],
        num_training_steps=num_training_steps
    )

    # Loss function
    criterion = nn.MSELoss()  # or nn.L1Loss() for action prediction

    # Training loop
    for epoch in range(config['num_epochs']):
        model.train()
        total_loss = 0

        for batch_idx, batch in enumerate(train_loader):
            optimizer.zero_grad()
            loss = train_step(model, batch, criterion, device)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                model.parameters(),
                config['max_grad_norm']
            )

            optimizer.step()
            scheduler.step()
            total_loss += loss.item()

            if batch_idx % config['log_interval'] == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                loss = train_step(model, batch, criterion, device)
                val_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)
        print(f"Epoch {epoch}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Save checkpoint
        if (epoch + 1) % config['save_interval'] == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss,
            }, f"checkpoint_epoch_{epoch+1}.pt")

# Example configuration
config = {
    'batch_size': 16,
    'num_epochs': 100,
    'backbone_lr': 1e-5,
    'action_head_lr': 1e-4,
    'weight_decay': 0.01,
    'warmup_steps': 1000,
    'max_grad_norm': 1.0,
    'log_interval': 10,
    'save_interval': 10
}

# Create dataset (you need to implement data loading)
# train_dataset = RobotDataset("path/to/train/data", tokenizer)
# val_dataset = RobotDataset("path/to/val/data", tokenizer)

# Train model
# train_model(model, train_dataset, val_dataset, config)
Fine-tuning Tips
- Mixed Precision Training: The model is designed for mixed precision. Use torch.cuda.amp for faster training:

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

with autocast():
    loss = train_step(model, batch, criterion, device)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
- Gradient Checkpointing: For memory-efficient training:
model.backbone.eagle_model.language_model.gradient_checkpointing_enable()
- Frozen Backbone Training: Start by training only the action head:
# Freeze backbone
for param in model.backbone.parameters():
    param.requires_grad = False

# Train only action head
optimizer = torch.optim.AdamW(
    model.action_head.parameters(),
    lr=1e-4
)
Model Architecture
The model consists of two main components:
1. Vision-Language Backbone (System 2)
- Vision Encoder: Based on Eagle vision model with 27 transformer layers
- Language Model: Qwen3-based LLM with 12 layers, 2048 hidden dim
- Cross-modal Fusion: MLP connector between vision and language
2. Action Head (System 1)
- Diffusion Transformer: 16 DiT blocks for action generation
- State Encoder: Processes robot proprioception
- Action Decoder: Outputs continuous robot actions
- Self-Attention Blocks: 4 transformer blocks for vision-language features
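Because the inference example above collapses the diffusion process into a single placeholder call, here is a rough sketch of what an iterative denoising loop over the action head could look like. The step count, the Euler-style update, the action horizon and dimension, and the `model.action_head.model` / `action_decoder` call signatures are all assumptions carried over from the simplified examples, not the model's actual sampler.

import torch

def sample_actions(model, combined_features, robot_state, device,
                   num_steps=10, horizon=16, action_dim=32):
    """Illustrative iterative denoising loop (assumed interface, not the real sampler)."""
    # Start from pure noise over an assumed (horizon, action_dim) action chunk
    actions = torch.randn(1, horizon, action_dim, device=device)
    for step in range(num_steps):
        t = torch.full((1,), step / num_steps, device=device)
        # Predict an update direction conditioned on fused features and robot state
        features = model.action_head.model(combined_features, timesteps=t, context=robot_state)
        update = model.action_head.action_decoder(features)
        # Simple Euler-style integration toward the denoised action trajectory
        actions = actions + update.view_as(actions) / num_steps
    return actions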
Limitations
- This is a blank model with random weights - it requires training before use
- No pre-trained knowledge or capabilities
- Designed for humanoid robots but can be adapted for other embodiments
- Requires significant computational resources for training
Citation
If you use this model in your research, please cite:
@software{DolphinGR00T2024,
  title   = {DolphinGR00T-N1.5-3B-Zero: A Permissively Licensed Reimplementation of GR00T-N1.5-3B},
  author  = {Eric Hartford},
  year    = {2024},
  license = {Apache-2.0}
}
License
Apache-2.0 - this model is fully open source, with no restrictions beyond the standard Apache-2.0 terms.
Acknowledgments
This is an independent implementation of the GR00T architecture for the open-source community. The architecture is based on publicly available information about NVIDIA's GR00T-N1.5 model, but contains no proprietary code or weights.