GPT-2 from Scratch

This model is an implementation of the GPT-2 architecture (125M parameters), trained entirely from scratch.

Model Description

  • Model type: GPT-2 (125M parameters)
  • Architecture: Transformer-based autoregressive language model following the original GPT-2 design
  • Training data: a combined 33 GB dataset drawn from:
    • HuggingFaceFW/fineweb-edu
    • bigcode/the-stack
    • Skylion007/openwebtext
  • Training approach: Built and trained from scratch, not fine-tuned from an existing checkpoint
  • Language: English
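
The weights are published as safetensors. Assuming they follow the standard Hugging Face GPT-2 layout and that the repo (thecr7guy/GPT2fromScratch) also ships a GPT-2 tokenizer, a minimal generation sketch could look like the following; if the checkpoint uses a custom format, the loading code will differ.

```python
# Minimal generation sketch. Assumption: the repo follows the standard
# Hugging Face GPT-2 layout and includes a tokenizer; adjust if it does not.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_id = "thecr7guy/GPT2fromScratch"
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

prompt = "The transformer architecture"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,          # the model was trained with a 1024-token context
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```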

Intended Uses & Limitations

  • Intended use: Research and experimentation with language models; reference implementation for reproducing GPT-2
  • Limitations: With only 125M parameters (compared to, for example, GPT-3's 175B), the model has limited ability to generate coherent long-form text or to follow complex instructions

Training Details

  • Training corpus: Approximately 4.5B tokens
  • Training duration: 4 epochs (approximately 52 hours total)
  • Hardware: 2× NVIDIA RTX 4090 GPUs via vast.ai
  • Estimated cost: $35 for complete training
  • Context length: 1024 tokens
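
A sketch of how documents from the datasets listed above could be packed into fixed 1024-token training blocks is shown below. The use of the stock gpt2 BPE tokenizer, the choice of HuggingFaceFW/fineweb-edu as the example source, the text field name, and the end-of-text separator are assumptions based on common GPT-2 data pipelines, not details taken from the actual training code.

```python
# Hypothetical packing sketch: stream documents, tokenize with the GPT-2 BPE
# vocabulary (an assumption), and cut the token stream into 1024-token blocks.
from datasets import load_dataset
from transformers import GPT2TokenizerFast

CONTEXT_LEN = 1024
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def packed_blocks(dataset_name="HuggingFaceFW/fineweb-edu", text_field="text"):
    """Yield lists of CONTEXT_LEN token ids packed from streamed documents."""
    stream = load_dataset(dataset_name, split="train", streaming=True)
    buffer = []
    for example in stream:
        # Separate documents with GPT-2's end-of-text token so boundaries are visible.
        buffer.extend(tokenizer(example[text_field])["input_ids"])
        buffer.append(tokenizer.eos_token_id)
        while len(buffer) >= CONTEXT_LEN:
            block, buffer = buffer[:CONTEXT_LEN], buffer[CONTEXT_LEN:]
            yield block  # one 1024-token training example

# Example: materialize a single block.
first_block = next(packed_blocks())
print(len(first_block))  # 1024
```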

Hyperparameters

  • context_len: 1024
  • seed: 42
  • epochs: 4
  • batch_size: 8
  • mega_batch_size: 512
  • grad_clip: 1.0
  • optimizer: "adamw"
  • max_lr: 6.0e-4
  • min_lr: 6.0e-5
  • beta1: 0.9
  • beta2: 0.95
  • weight_decay: 0.1
  • warmup_steps: 720
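
A sketch of how these settings might be wired together in PyTorch follows. The cosine decay from max_lr to min_lr after warmup and the reading of mega_batch_size as gradient accumulation (512 / 8 = 64 micro-batches per optimizer step) are assumptions drawn from common GPT-2 training recipes; the total step count below is illustrative only.

```python
# Sketch of the optimizer and LR schedule implied by the hyperparameters above.
# Assumptions: cosine decay to min_lr after linear warmup, and mega_batch_size
# realized via gradient accumulation (512 / 8 = 64 micro-batches per step).
import math
import torch

model = torch.nn.Linear(8, 8)           # stand-in for the 125M-parameter GPT-2 model

max_lr, min_lr = 6.0e-4, 6.0e-5
warmup_steps = 720
total_steps = 10_000                    # illustrative only; not stated on the card
grad_accum_steps = 512 // 8             # mega_batch_size / batch_size

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=max_lr,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_at(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed schedule shape)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Per optimizer step: accumulate gradients over grad_accum_steps micro-batches
# of batch_size 8 (forward/backward elided here), then clip, set the LR, and step.
for step in range(total_steps):
    # ... grad_accum_steps x (forward pass, loss / grad_accum_steps, backward) ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # grad_clip = 1.0
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.step()
    optimizer.zero_grad()
```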

Performance and Evaluation

This model was built as an educational exercise to reproduce the GPT-2 architecture from scratch. While it demonstrates the core capabilities of transformer-based language models, its performance is naturally limited compared to larger, more extensively trained models.

Contact

GitHub: thecr7guy2
