GPT-2 from Scratch

This model is an implementation of the GPT-2 architecture (125M parameters), trained entirely from scratch.

Model Description

  • Model type: GPT-2 (125M parameters)
  • Architecture: Transformer-based autoregressive language model following the original GPT-2 design
  • Training data: a combined 33 GB dataset drawn from:
    • HuggingFaceFW/fineweb-edu
    • bigcode/the-stack
    • Skylion007/openwebtext
  • Training approach: Built and trained from scratch, not fine-tuned from an existing checkpoint
  • Language: English
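
The weights are published as safetensors. Assuming they follow the standard Hugging Face GPT-2 layout and that the repo (thecr7guy/GPT2fromScratch) also ships a GPT-2 tokenizer, a minimal generation sketch could look like the following; if the checkpoint uses a custom format, the loading code will differ.

```python
# Minimal generation sketch. Assumption: the repo follows the standard
# Hugging Face GPT-2 layout and includes a tokenizer; adjust if it does not.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_id = "thecr7guy/GPT2fromScratch"
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

prompt = "The transformer architecture"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,          # the model was trained with a 1024-token context
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```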

Intended Uses & Limitations

  • Intended use: Research and experimentation with language models; reference implementation for reproducing GPT-2
  • Limitations: With only 125M parameters (compared to, for example, GPT-3's 175B), the model has limited ability to generate coherent long-form text or to follow complex instructions

Training Details

  • Training corpus: Approximately 4.5B tokens
  • Training duration: 4 epochs (approximately 52 hours total)
  • Hardware: 2× NVIDIA RTX 4090 GPUs via vast.ai
  • Estimated cost: $35 for complete training
  • Context length: 1024 tokens
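
A sketch of how documents from the datasets listed above could be packed into fixed 1024-token training blocks is shown below. The use of the stock gpt2 BPE tokenizer, the choice of HuggingFaceFW/fineweb-edu as the example source, the text field name, and the end-of-text separator are assumptions based on common GPT-2 data pipelines, not details taken from the actual training code.

```python
# Hypothetical packing sketch: stream documents, tokenize with the GPT-2 BPE
# vocabulary (an assumption), and cut the token stream into 1024-token blocks.
from datasets import load_dataset
from transformers import GPT2TokenizerFast

CONTEXT_LEN = 1024
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def packed_blocks(dataset_name="HuggingFaceFW/fineweb-edu", text_field="text"):
    """Yield lists of CONTEXT_LEN token ids packed from streamed documents."""
    stream = load_dataset(dataset_name, split="train", streaming=True)
    buffer = []
    for example in stream:
        # Separate documents with GPT-2's end-of-text token so boundaries are visible.
        buffer.extend(tokenizer(example[text_field])["input_ids"])
        buffer.append(tokenizer.eos_token_id)
        while len(buffer) >= CONTEXT_LEN:
            block, buffer = buffer[:CONTEXT_LEN], buffer[CONTEXT_LEN:]
            yield block  # one 1024-token training example

# Example: materialize a single block.
first_block = next(packed_blocks())
print(len(first_block))  # 1024
```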

Hyperparameters

  • context_len: 1024
  • seed: 42
  • epochs: 4
  • batch_size: 8
  • mega_batch_size: 512
  • grad_clip: 1.0
  • optimizer: "adamw"
  • max_lr: 6.0e-4
  • min_lr: 6.0e-5
  • beta1: 0.9
  • beta2: 0.95
  • weight_decay: 0.1
  • warmup_steps: 720
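
A sketch of how these settings might be wired together in PyTorch follows. The cosine decay from max_lr to min_lr after warmup and the reading of mega_batch_size as gradient accumulation (512 / 8 = 64 micro-batches per optimizer step) are assumptions drawn from common GPT-2 training recipes; the total step count below is illustrative only.

```python
# Sketch of the optimizer and LR schedule implied by the hyperparameters above.
# Assumptions: cosine decay to min_lr after linear warmup, and mega_batch_size
# realized via gradient accumulation (512 / 8 = 64 micro-batches per step).
import math
import torch

model = torch.nn.Linear(8, 8)           # stand-in for the 125M-parameter GPT-2 model

max_lr, min_lr = 6.0e-4, 6.0e-5
warmup_steps = 720
total_steps = 10_000                    # illustrative only; not stated on the card
grad_accum_steps = 512 // 8             # mega_batch_size / batch_size

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=max_lr,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_at(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed schedule shape)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Per optimizer step: accumulate gradients over grad_accum_steps micro-batches
# of batch_size 8 (forward/backward elided here), then clip, set the LR, and step.
for step in range(total_steps):
    # ... grad_accum_steps x (forward pass, loss / grad_accum_steps, backward) ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # grad_clip = 1.0
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.step()
    optimizer.zero_grad()
```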

Performance and Evaluation

This model was built as an educational exercise to reproduce the GPT-2 architecture from scratch. While it demonstrates the core capabilities of transformer-based language models, its performance is naturally limited compared to larger, more extensively trained models.

Contact

GitHub: thecr7guy2
