LLaDA-346M: Large Language Diffusion with Masking

Model Description

This is a 346 million-parameter Large Language Diffusion Model trained with a masked diffusion process. It demonstrates that diffusion-based approaches can be a viable alternative to autoregressive language models.

Key Features

  • Architecture: Masked Diffusion Model (MDM) with Transformer encoder
  • Parameters: 346M
  • Sequence Length: 512 tokens
  • Vocab Size: 50,257 (GPT-2)
  • Training Data: 50,000 WikiText-2 samples

Model Architecture

Token Embeddings (50257 × 1024)
    ↓
Position Embeddings (512 × 1024)
    ↓
Time Embeddings (MLP)
    ↓
Transformer Encoder (12 layers, 16 heads)
    ├─ Self-Attention
    └─ Feed-Forward (4096 dim)
    ↓
Output Projection (1024 × 50257)
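
The following is a minimal PyTorch sketch of what a class with this layout could look like; it is not the repository's implementation, and details such as the time-conditioning MLP and pre-LayerNorm encoder layers are illustrative assumptions.

import torch
import torch.nn as nn

class MaskedDiffusionModel(nn.Module):
    """Illustrative encoder-only masked diffusion model matching the diagram above."""
    def __init__(self, vocab_size=50257, hidden_dim=1024, num_layers=12,
                 num_heads=16, ff_dim=4096, dropout=0.1,
                 max_seq_length=512, num_timesteps=100):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_seq_length, hidden_dim)
        # num_timesteps is kept for interface parity; this sketch conditions on a continuous t
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True, norm_first=True  # pre-LayerNorm
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids, t):
        # input_ids: (batch, seq) token ids; t: (batch,) timesteps in [0, 1]
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        h = self.token_emb(input_ids) + self.pos_emb(positions)
        h = h + self.time_mlp(t.view(-1, 1, 1).float())  # broadcast time signal over sequence
        h = self.encoder(h)
        return self.out_proj(h)  # (batch, seq, vocab) logits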

Training Details

  • Algorithm: Masked Diffusion Model (MDM)
  • Loss Function: Cross-entropy on masked positions
  • Optimizer: AdamW (lr=3e-5, betas=(0.9, 0.95))
  • Batch Size: 16 (effective: 32 with grad accumulation)
  • Gradient Checkpointing: Enabled
  • Mixed Precision: AMP (FP32/FP16)
  • Epochs: 4
  • Training Samples: 50,000
  • GPU: NVIDIA V100 (22GB VRAM)
  • Training Time: ~20 hours
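
A short sketch of the optimizer and mixed-precision setup these settings imply (variable names are illustrative; model refers to the network sketched in the architecture section):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.95))
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling for FP16/FP32 AMP
accumulation_steps = 2                 # micro-batch of 16 -> effective batch of 32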

Performance

Metric               Value
Initial Loss         5.96
Final Loss           4.94
Loss Reduction       17.1%
Total Parameters     346M
Model Size (FP32)    1.38 GB
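
The FP32 checkpoint size follows directly from the parameter count: roughly 346M parameters × 4 bytes per parameter ≈ 1.38 GB.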

Usage

Installation

pip install transformers torch

Loading the Model

import torch
from transformers import AutoTokenizer
from your_module import MaskedDiffusionModel

# Load model
model = MaskedDiffusionModel(
    vocab_size=50257,
    hidden_dim=1024,
    num_layers=12,
    num_heads=16,
    ff_dim=4096,
    dropout=0.1,
    max_seq_length=512,
    num_timesteps=100
)

# Load weights
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

Text Generation

from diffusion_sampler import DiffusionSampler

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# config is the same configuration object used during training
sampler = DiffusionSampler(model, tokenizer, config, device)

# Generate text
text = sampler.generate(
    prompt="The future of AI",
    num_steps=40,
    temperature=0.8,
    top_p=0.9
)
print(text)

Model Characteristics

Advantages

✅ Bidirectional Context: Attends to the full sequence, unlike left-to-right autoregressive models
✅ Parallel Generation: Can predict multiple tokens simultaneously
✅ Reversal Invariance: Equal performance on forward and reverse tasks
✅ Global Coherence: Reduces error accumulation

Limitations

❌ Slower generation (iterative denoising process)
❌ Requires more compute for inference
❌ Not fine-tuned for specific tasks

Training Process

Forward Process

  • Gradually and randomly masks tokens in the input
  • At timestep t ∈ [0, 1], each token is masked independently with probability t
  • Creates a noisy version of the input (see the sketch below)
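
A minimal sketch of this masking step; the mask token id below is a placeholder, since the actual id depends on how the vocabulary was set up during training:

import torch

MASK_TOKEN_ID = 50256  # placeholder; the real mask id depends on the training setup

def forward_mask(input_ids, t):
    """Mask each token independently with probability t (the timestep)."""
    # input_ids: (batch, seq) token ids; t: (batch,) timesteps in [0, 1]
    mask = torch.rand_like(input_ids, dtype=torch.float) < t.view(-1, 1)
    noisy_ids = input_ids.masked_fill(mask, MASK_TOKEN_ID)
    return noisy_ids, mask  # mask marks the positions the model must reconstruct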

Reverse Process

  • Iteratively predicts and unmasks tokens (see the sketch below)
  • Uses the transformer encoder to predict masked positions
  • Trained with cross-entropy loss on masked tokens only
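
A simplified sketch of one possible reverse loop, assuming confidence-based unmasking of one position per step; the repository's DiffusionSampler may use a different schedule:

import torch

@torch.no_grad()
def reverse_denoise(model, noisy_ids, mask, num_steps=40):
    """Iteratively predict masked tokens, fixing the most confident prediction each step."""
    ids, still_masked = noisy_ids.clone(), mask.clone()
    rows = torch.arange(ids.size(0), device=ids.device)
    for step in range(num_steps):
        t = torch.full((ids.size(0),), 1.0 - step / num_steps, device=ids.device)
        logits = model(ids, t)                        # (batch, seq, vocab)
        conf, preds = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~still_masked, -1.0)  # only score masked positions
        best = conf.argmax(dim=-1)                    # most confident masked slot per row
        active = still_masked[rows, best]             # skip rows that are already finished
        ids[rows, best] = torch.where(active, preds[rows, best], ids[rows, best])
        still_masked[rows, best] = False
        if not still_masked.any():                    # stop once everything is unmasked
            break
    return ids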

Optimization Techniques

  • Gradient Checkpointing: Save memory during backprop
  • Mixed Precision (AMP): Use FP16 where possible
  • Gradient Accumulation: Simulate larger batches
  • Layer Norm First: Improved training stability
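
A condensed sketch of how these pieces fit together in one training step, reusing the optimizer, scaler, and forward_mask helper sketched above (gradient checkpointing is assumed to be enabled inside the model itself):

import torch
import torch.nn.functional as F

def train_step(model, batch_ids, optimizer, scaler, accumulation_steps, step):
    """One masked-diffusion training step with AMP and gradient accumulation."""
    t = torch.rand(batch_ids.size(0), device=batch_ids.device)   # timestep ~ U[0, 1]
    noisy_ids, mask = forward_mask(batch_ids, t)
    with torch.cuda.amp.autocast():                              # FP16 where numerically safe
        logits = model(noisy_ids, t)
        loss = F.cross_entropy(logits[mask], batch_ids[mask])    # masked positions only
    scaler.scale(loss / accumulation_steps).backward()           # accumulate scaled gradients
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)                                   # unscale and apply update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
    return loss.item()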

Citation

If you use this model, please cite:

@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and others},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}

License

MIT License - Feel free to use for research and commercial purposes

Acknowledgments

  • Based on "Large Language Diffusion Models" (Nie et al., 2025)
  • Built with PyTorch and Transformers
  • Trained on WikiText-2 dataset
  • Inspired by diffusion models for vision (DiT, Genie)

Contact & Support

For issues, questions, or suggestions, please open an issue on GitHub or contact the model author.


