LLaDA-346M: Large Language Diffusion with Masking
Model Description
This is a 346-million-parameter Large Language Diffusion Model trained with a masked diffusion process. It demonstrates that diffusion-based approaches can be a viable alternative to autoregressive language models.
Key Features
- Architecture: Masked Diffusion Model (MDM) with Transformer encoder
- Parameters: 346M
- Sequence Length: 512 tokens
- Vocab Size: 50,257 (GPT-2)
- Training Data: 50,000 WikiText-2 samples
Model Architecture
```
Token Embeddings (50257 × 1024)
        ↓
Position Embeddings (512 × 1024)
        ↓
Time Embeddings (MLP)
        ↓
Transformer Encoder (12 layers, 16 heads)
  ├── Self-Attention
  └── Feed-Forward (4096 dim)
        ↓
Output Projection (1024 × 50257)
```
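A minimal PyTorch sketch of this layout, assuming standard `nn.TransformerEncoder` building blocks (class and argument names here are illustrative; the released `MaskedDiffusionModel` may differ in detail):

```python
import torch
import torch.nn as nn

class MaskedDiffusionBackbone(nn.Module):
    """Encoder-only backbone matching the diagram above (illustrative sketch)."""

    def __init__(self, vocab_size=50257, hidden_dim=1024, num_layers=12,
                 num_heads=16, ff_dim=4096, max_seq_length=512, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)    # 50257 x 1024
        self.pos_emb = nn.Embedding(max_seq_length, hidden_dim)  # 512 x 1024
        # Time embedding: map the scalar diffusion time t to a hidden_dim vector.
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True, norm_first=True,  # pre-LayerNorm
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(hidden_dim, vocab_size)        # 1024 x 50257

    def forward(self, input_ids, t):
        # input_ids: (batch, seq_len) token ids; t: (batch,) diffusion time in [0, 1]
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        h = self.token_emb(input_ids) + self.pos_emb(positions)
        h = h + self.time_mlp(t[:, None].float())[:, None, :]    # broadcast over positions
        h = self.encoder(h)
        return self.out_proj(h)                                  # per-position token logits
```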
Training Details
- Algorithm: Masked Diffusion Model (MDM)
- Loss Function: Cross-entropy on masked positions
- Optimizer: AdamW (lr=3e-5, betas=(0.9, 0.95))
- Batch Size: 16 (effective: 32 with grad accumulation)
- Gradient Checkpointing: Enabled
- Mixed Precision: AMP (FP32/FP16)
- Epochs: 4
- Training Samples: 50,000
- GPU: NVIDIA V100 (22GB VRAM)
- Training Time: ~20 hours
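For reference, a minimal sketch of the optimizer and mixed-precision setup implied by these settings (variable names are illustrative and not taken from the training code; `model` is the MaskedDiffusionModel constructed as in the Usage section below):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.95))
scaler = torch.cuda.amp.GradScaler()   # AMP: FP16 compute with FP32 master weights
accumulation_steps = 2                 # micro-batch of 16 x 2 steps = effective batch of 32
```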
Performance
| Metric | Value |
|---|---|
| Initial Loss | 5.96 |
| Final Loss | 4.94 |
| Loss Reduction | 17.1% |
| Total Parameters | 346M |
| Model Size (FP32) | 1.38 GB |
Usage
Installation
```bash
pip install transformers torch
```
Loading the Model
```python
import torch
from transformers import AutoTokenizer
from your_module import MaskedDiffusionModel

# Load model
model = MaskedDiffusionModel(
    vocab_size=50257,
    hidden_dim=1024,
    num_layers=12,
    num_heads=16,
    ff_dim=4096,
    dropout=0.1,
    max_seq_length=512,
    num_timesteps=100,
)

# Load weights
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```
Text Generation
```python
from diffusion_sampler import DiffusionSampler

sampler = DiffusionSampler(model, tokenizer, config, device)

# Generate text
text = sampler.generate(
    prompt="The future of AI",
    num_steps=40,
    temperature=0.8,
    top_p=0.9,
)
print(text)
```
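Internally, generation is an iterative unmasking loop. The sketch below shows the idea for a single sequence using simple confidence-based unmasking; it is not the released `DiffusionSampler` (which also applies top-p filtering), and all names here are illustrative:

```python
import torch

@torch.no_grad()
def iterative_unmask(model, ids, mask_token_id, num_steps=40, temperature=0.8):
    """Reveal masked tokens over num_steps rounds (assumes batch size 1)."""
    for step in range(num_steps):
        still_masked = ids == mask_token_id
        if not still_masked.any():
            break
        # Diffusion time runs from 1 towards 0 as more tokens are revealed.
        t = torch.full((ids.size(0),), 1.0 - step / num_steps, device=ids.device)
        probs = torch.softmax(model(ids, t) / temperature, dim=-1)
        confidence, candidates = probs.max(dim=-1)        # best token per position
        confidence = confidence.masked_fill(~still_masked, -1.0)
        # Unmask a share of the remaining masked positions, most confident first.
        k = max(1, still_masked.sum().item() // (num_steps - step))
        reveal = torch.zeros_like(still_masked)
        reveal.view(-1)[confidence.view(-1).topk(k).indices] = True
        ids = torch.where(reveal, candidates, ids)
    return ids
```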
Model Characteristics
Advantages
✅ Bidirectional Context: Attends to the full sequence, unlike left-to-right autoregressive models
✅ Parallel Generation: Can predict multiple masked tokens simultaneously
✅ Reversal Invariance: Comparable performance on forward and reversed tasks
✅ Global Coherence: Iterative refinement reduces error accumulation
Limitations
❌ Slower generation (iterative denoising process)
❌ Requires more compute for inference
❌ Not fine-tuned for specific tasks
Training Process
Forward Process
- Gradually and randomly mask tokens in the clean sequence
- At timestep t ∈ [0, 1], each token is masked independently with probability t
- Creates a noisy version of the input (see the sketch below)
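A minimal sketch of this corruption step (the function and `mask_token_id` are illustrative assumptions; how the mask token is represented in the released code is not specified here):

```python
import torch

def forward_mask(input_ids, t, mask_token_id):
    """Mask each token independently with probability t (the diffusion time)."""
    # input_ids: (batch, seq_len) token ids; t: (batch,) in [0, 1]
    mask = torch.rand_like(input_ids, dtype=torch.float) < t[:, None]
    noisy_ids = input_ids.masked_fill(mask, mask_token_id)
    return noisy_ids, mask   # `mask` marks the positions the model must reconstruct
```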
Reverse Process
- Iteratively predict and unmask tokens
- Uses the transformer to predict the original tokens at masked positions
- Trained with cross-entropy loss on the masked tokens only (see the sketch below)
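The corresponding training objective, computed only on masked positions, looks roughly like this sketch (reusing the hypothetical `forward_mask` above; the LLaDA paper additionally reweights the loss by 1/t, omitted here for brevity):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, input_ids, mask_token_id):
    # Sample one diffusion time per sequence and corrupt the clean input.
    t = torch.rand(input_ids.size(0), device=input_ids.device)
    noisy_ids, mask = forward_mask(input_ids, t, mask_token_id)
    logits = model(noisy_ids, t)                 # (batch, seq_len, vocab)
    # Cross-entropy only where tokens were masked; unmasked positions are ignored.
    return F.cross_entropy(logits[mask], input_ids[mask])
```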
Optimization Techniques
- Gradient Checkpointing: Save memory during backprop
- Mixed Precision (AMP): Use FP16 where possible
- Gradient Accumulation: Simulate larger batches
- Layer Norm First: Improved training stability
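Putting these together, one optimization step looks roughly like the sketch below. It assumes the `optimizer`, `scaler`, and `accumulation_steps` sketched under Training Details, the `masked_diffusion_loss` sketch above, a `train_loader` yielding batches of token ids, and a reserved `mask_token_id`; none of these names come from the released training code.

```python
import torch

optimizer.zero_grad(set_to_none=True)
for step, input_ids in enumerate(train_loader):
    input_ids = input_ids.to("cuda")
    # Mixed precision: run the forward/backward pass in FP16 where numerically safe.
    with torch.cuda.amp.autocast():
        loss = masked_diffusion_loss(model, input_ids, mask_token_id)
        loss = loss / accumulation_steps          # gradient accumulation
    # With gradient checkpointing enabled on the encoder, activations are
    # recomputed during this backward pass instead of being stored.
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)                    # unscales gradients, then AdamW update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```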
Citation
If you use this model, please cite:
```bibtex
@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and others},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}
```
License
MIT License - Feel free to use for research and commercial purposes
Acknowledgments
- Based on "Large Language Diffusion Models" (Nie et al., 2025)
- Built with PyTorch and Transformers
- Trained on WikiText-2 dataset
- Inspired by diffusion models for vision (DiT, Genie)
Contact & Support
For issues, questions, or suggestions, please open an issue on GitHub or contact the model author.