GPT-2 from Scratch
This is a 125M-parameter GPT-2 model implemented and trained from scratch.
Model Description
- Model type: GPT-2 (125M parameters)
- Architecture: Transformer-based autoregressive language model following the original GPT-2 design (see the configuration sketch after this list)
- Training data: Combined dataset (33GB) from:
  - HuggingFaceFW/fineweb-edu
  - bigcode/the-stack
  - Skylion007/openwebtext
- Training approach: Built and trained from scratch, not fine-tuned from an existing checkpoint
- Language: English
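
This card does not spell out the layer, head, or embedding sizes. The sketch below assumes the standard GPT-2 small configuration (12 layers, 12 heads, 768-dimensional embeddings, 50,257-token vocabulary), which together with the 1024-token context works out to roughly 124M parameters; treat these values as assumptions rather than a description of this exact checkpoint.

```python
from dataclasses import dataclass

@dataclass
class GPT2Config:
    # Standard GPT-2 small settings (assumed; not stated explicitly in this card)
    vocab_size: int = 50257   # GPT-2 BPE vocabulary
    context_len: int = 1024   # matches the context length listed below
    n_layer: int = 12         # transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # embedding / hidden dimension

def approx_param_count(cfg: GPT2Config) -> int:
    """Rough parameter count: embeddings plus per-block attention/MLP weights."""
    embed = cfg.vocab_size * cfg.n_embd + cfg.context_len * cfg.n_embd
    per_block = 12 * cfg.n_embd ** 2   # attention (4*d^2) + MLP (8*d^2), ignoring biases/LayerNorm
    return embed + cfg.n_layer * per_block

print(f"~{approx_param_count(GPT2Config()) / 1e6:.0f}M parameters")  # ≈ 124M
```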
Intended Uses & Limitations
- Intended use: Research and experimentation with language models; a reference implementation for reproducing GPT-2 (a hedged loading sketch follows this list)
- Limitations: At 125M parameters (versus 175B for GPT-3), the model has limited ability to generate coherent long-form text or to follow complex instructions
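
If the trained weights are exported in the Hugging Face `transformers` format (the card does not say whether they are), loading and sampling from the model would look roughly like the sketch below. The checkpoint path is a placeholder, not a real repository ID.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Placeholder path: replace with the actual checkpoint location or repo ID.
checkpoint = "path/to/gpt2-from-scratch"

tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
model = GPT2LMHeadModel.from_pretrained(checkpoint)

inputs = tokenizer("The transformer architecture", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```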
Training Details
- Training corpus: Approximately 4.5B tokens
- Training duration: 4 epochs (approximately 52 hours total)
- Hardware: 2× NVIDIA RTX 4090 GPUs via vast.ai
- Estimated cost: $35 for complete training
- Token context: 1024 tokens
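
For scale, the figures above combined with the hyperparameters listed below give a rough back-of-the-envelope step count. This assumes `mega_batch_size` is the effective global batch in sequences, presumably reached via gradient accumulation over the per-device `batch_size` of 8.

```python
# Back-of-the-envelope step count from the Training Details above and the
# hyperparameters listed below (mega_batch_size assumed to be the effective
# global batch in sequences).
tokens_per_epoch = 4.5e9
epochs = 4
context_len = 1024
mega_batch_size = 512          # sequences per optimizer step

tokens_per_step = mega_batch_size * context_len       # ≈ 524K tokens
steps_per_epoch = tokens_per_epoch / tokens_per_step  # ≈ 8.6K steps
total_steps = steps_per_epoch * epochs                # ≈ 34K steps

print(f"{tokens_per_step:,.0f} tokens/step, ~{total_steps:,.0f} optimizer steps")
```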
Hyperparameters
- context_len: 1024
- seed: 42
- epochs: 4
- batch_size: 8
- mega_batch_size: 512
- grad_clip: 1.0
- optimizer: "adamw"
- max_lr: 6.0e-4
- min_lr: 6.0e-5
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1
- warmup_steps: 720
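
The `max_lr`, `min_lr`, and `warmup_steps` values above suggest a warmup-then-decay schedule. The sketch below assumes linear warmup followed by cosine decay to `min_lr`, which is common in GPT-2 reproductions, but the card does not name the exact schedule; the `total_steps` default is the rough estimate from the Training Details section.

```python
import math

def lr_at_step(step: int, max_lr: float = 6.0e-4, min_lr: float = 6.0e-5,
               warmup_steps: int = 720, total_steps: int = 34_000) -> float:
    """Linear warmup followed by cosine decay to min_lr (assumed schedule)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)
```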
Performance and Evaluation
This model was built as an educational exercise to reproduce the GPT-2 architecture from scratch. While it demonstrates the core capabilities of transformer-based language models, its performance is naturally limited compared to larger, more extensively trained models.
Contact
GitHub: thecr7guy2