GPT-1: Transformer Model (10M)
Model Architecture
Bigram Baseline
- Input Representation: Each character is mapped to an embedding vector from an embedding table.
- Bigram Modeling: Generates text by predicting the next token based on the previous token.
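A minimal sketch of such a bigram baseline in PyTorch, in the spirit of the Karpathy tutorial this project follows; class and variable names are illustrative, not necessarily the exact ones used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    """Bigram baseline: the embedding row for a token is read directly
    as the logits for the next token."""
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)            # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # append one sampled character at a time, conditioned only on the last one
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)      # next-token distribution
            idx = torch.cat([idx, torch.multinomial(probs, num_samples=1)], dim=1)
        return idx
```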
Self-Attention Mechanism
- Attention Weights: Captures dependencies between tokens using scaled dot-product attention.
- Positional Encoding: Adds position information to the token embeddings so the model can distinguish the order of tokens in input sequences.
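A sketch of a single causal self-attention head using the usual query/key/value projections; the hyperparameter names (`n_embd`, `head_size`, `block_size`) and the dropout value are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One head of causal (masked) scaled dot-product self-attention."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # causal mask: each position may attend only to itself and earlier positions
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # scaled dot-product attention weights
        wei = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))        # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                               # (B, T, head_size)
```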
Transformer Blocks
- Multi-Head Self-Attention: Allows the model to attend to different parts of the input sequence simultaneously.
- Feedforward Layers: Two-layer MLP with nonlinear activations applied to the output of the attention mechanism.
- Residual Connections: Adds each sub-layer's input back to its output (skip connections) around the attention and feedforward layers to improve gradient flow.
- Layer Normalization: Stabilizes training by normalizing activations at each step.
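A sketch of one decoder block combining these pieces. For brevity it uses PyTorch's built-in `nn.MultiheadAttention`, whereas the project implements the attention components from scratch; the pre-norm placement of LayerNorm shown here is an assumption:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Decoder block: multi-head causal self-attention + two-layer MLP,
    each wrapped in a residual connection and preceded by LayerNorm."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                 # two-layer feedforward with nonlinearity
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        # boolean causal mask: True marks future positions that must not be attended to
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x):
        T = x.size(1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=self.causal_mask[:T, :T],
                                need_weights=False)
        x = x + attn_out                          # residual around attention
        x = x + self.mlp(self.ln2(x))             # residual around feedforward
        return x
```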
Scaling Techniques
- Dropout Regularization: Reduces overfitting by randomly zeroing activations during training.
- Decoder Architecture: Implements transformer blocks focused exclusively on autoregressive generation (no encoder).
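Putting these together, a sketch of the decoder-only model: token and positional embeddings, a stack of blocks (reusing the `TransformerBlock` sketch above), a final LayerNorm, and a linear head over the character vocabulary. The learned positional embeddings and all hyperparameter values are placeholders, not the trained configuration:

```python
import torch
import torch.nn as nn

class CharGPT(nn.Module):
    """Decoder-only character model: embeddings -> N transformer blocks -> LM head."""
    def __init__(self, vocab_size, n_embd=256, n_head=4, n_layer=4,
                 block_size=128, dropout=0.1):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)      # character embeddings
        self.pos_emb = nn.Embedding(block_size, n_embd)      # learned positions (assumption)
        self.drop = nn.Dropout(dropout)                      # dropout regularization
        self.blocks = nn.ModuleList(
            [TransformerBlock(n_embd, n_head, block_size, dropout) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                    # (B, T, vocab_size)
```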
Training Configuration
Dataset
- Source: A cleaned and merged dataset of the 'Harry Potter Novels' collection sourced from Kaggle.
- Size: ~6 million characters (~6 million tokens, since the model operates at the character level).
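A sketch of the character-level preprocessing this implies: build the vocabulary from the distinct characters in the corpus and map text to integer ids. The filename is hypothetical:

```python
# Character-level tokenization: the vocabulary is simply the set of
# distinct characters appearing in the merged corpus.
with open("harry_potter.txt", "r", encoding="utf-8") as f:   # hypothetical filename
    text = f.read()

chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(vocab_size, encode("Harry"))
```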
Training Setup
- Hardware: NVIDIA GeForce GTX 1650 GPU (4GB VRAM).
- Framework: PyTorch (custom implementation of transformer components).
- Time: Approximately 90 minutes for training and generating output.
Optimizer
- Adam Optimizer: Adaptive gradient optimization for efficient weight updates.
- Learning Rate Schedule: Warm-up period followed by decay.
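One way to wire this up in PyTorch is Adam plus a `LambdaLR` schedule; the sketch below uses linear warm-up followed by cosine decay. The schedule shape, learning rate, and step counts are assumptions; only "warm-up then decay" comes from the description above:

```python
import math
import torch

model = torch.nn.Linear(8, 8)   # placeholder; in practice this is the transformer model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

warmup_steps, max_steps = 200, 5000   # illustrative values

def lr_lambda(step):
    # linear warm-up, then cosine decay toward zero
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# inside the training loop, after each batch:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```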
Loss Function
- Negative Log Likelihood (NLL): Penalizes the model in proportion to how little probability it assigns to the actual next tokens (equivalent to cross-entropy over the softmax output).
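Concretely, for next-character prediction this is the standard cross-entropy between the model's logits and the actual next characters; the tensor shapes below are illustrative:

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 65                      # batch, sequence length, vocab size (illustrative)
logits = torch.randn(B, T, V)           # model output
targets = torch.randint(0, V, (B, T))   # actual next characters

# cross_entropy applies log-softmax followed by negative log likelihood
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))

# equivalent explicit NLL formulation
loss_nll = F.nll_loss(F.log_softmax(logits, dim=-1).view(B * T, V), targets.view(B * T))
print(loss.item(), loss_nll.item())     # the two values match
```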
Evaluation and Output
- Generated Text: The final output is saved in `generated.txt`. Examples include Harry Potter-like sentences generated using character-level language modeling.
- Qualitative Analysis: Insights into the model's ability to generate coherent and contextually relevant text based on training data.
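A sketch of the autoregressive sampling loop that produces such output, assuming a trained model and the `decode` helper from the earlier sketches; the saving step is shown as comments because it depends on those objects:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressively sample characters: predict the next character,
    append it to the running context, and repeat."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]             # crop to the context window
        logits = model(idx_cond)[:, -1, :]          # logits for the last position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

# e.g. start from a single character id and decode with the itos/decode helpers:
# context = torch.zeros((1, 1), dtype=torch.long, device=device)
# text = decode(generate(model, context, 2000, block_size)[0].tolist())
# with open("generated.txt", "w", encoding="utf-8") as f:
#     f.write(text)
```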
Documentation
For a more detailed breakdown of each component and concept, visit the Road to GPT Documentation Site, which includes visualizations, explanations, and annotated notebooks.
Acknowledgments
This implementation is inspired by the Let’s Build GPT from Scratch video by Andrej Karpathy. Special thanks to the Kaggle community for providing the raw dataset.
For more projects, check out my Portfolio Site.