GPT-1: Transformer Model (10M)
Model Architecture
Bigram Baseline
- Input Representation: Each character is mapped to an embedding vector from an embedding table.
- Bigram Modeling: Generates text by predicting the next token based on the previous token.
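A minimal sketch of such a bigram baseline in PyTorch, in the spirit of the Karpathy tutorial this project follows; class and variable names are illustrative, not necessarily the exact ones used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    """Bigram baseline: the embedding row for a token is read directly
    as the logits for the next token."""
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)            # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # append one sampled character at a time, conditioned only on the last one
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)      # next-token distribution
            idx = torch.cat([idx, torch.multinomial(probs, num_samples=1)], dim=1)
        return idx
```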
Self-Attention Mechanism
- Attention Weights: Captures dependencies between tokens using scaled dot-product attention.
- Positional Encoding: Adds position information to the token embeddings so the model can distinguish the order of tokens in input sequences.
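A sketch of a single causal self-attention head using the usual query/key/value projections; the hyperparameter names (`n_embd`, `head_size`, `block_size`) and the dropout value are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One head of causal (masked) scaled dot-product self-attention."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # causal mask: each position may attend only to itself and earlier positions
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # scaled dot-product attention weights
        wei = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))        # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                               # (B, T, head_size)
```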
Transformer Blocks
- Multi-Head Self-Attention: Allows the model to attend to different parts of the input sequence simultaneously.
- Feedforward Layers: Two-layer MLP with nonlinear activations applied to the output of the attention mechanism.
- Residual Connections: Adds each sub-layer's input back to its output (skip connections) around the attention and feedforward layers to improve gradient flow.
- Layer Normalization: Stabilizes training by normalizing activations at each step.
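A sketch of one decoder block combining these pieces. For brevity it uses PyTorch's built-in `nn.MultiheadAttention`, whereas the project implements the attention components from scratch; the pre-norm placement of LayerNorm shown here is an assumption:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Decoder block: multi-head causal self-attention + two-layer MLP,
    each wrapped in a residual connection and preceded by LayerNorm."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                 # two-layer feedforward with nonlinearity
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        # boolean causal mask: True marks future positions that must not be attended to
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x):
        T = x.size(1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=self.causal_mask[:T, :T],
                                need_weights=False)
        x = x + attn_out                          # residual around attention
        x = x + self.mlp(self.ln2(x))             # residual around feedforward
        return x
```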
Scaling Techniques
- Dropout Regularization: Reduces overfitting by randomly zeroing activations during training.
- Decoder Architecture: Implements transformer blocks focused exclusively on autoregressive generation (no encoder).
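Putting these together, a sketch of the decoder-only model: token and positional embeddings, a stack of blocks (reusing the `TransformerBlock` sketch above), a final LayerNorm, and a linear head over the character vocabulary. The learned positional embeddings and all hyperparameter values are placeholders, not the trained configuration:

```python
import torch
import torch.nn as nn

class CharGPT(nn.Module):
    """Decoder-only character model: embeddings -> N transformer blocks -> LM head."""
    def __init__(self, vocab_size, n_embd=256, n_head=4, n_layer=4,
                 block_size=128, dropout=0.1):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)      # character embeddings
        self.pos_emb = nn.Embedding(block_size, n_embd)      # learned positions (assumption)
        self.drop = nn.Dropout(dropout)                      # dropout regularization
        self.blocks = nn.ModuleList(
            [TransformerBlock(n_embd, n_head, block_size, dropout) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                    # (B, T, vocab_size)
```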
Training Configuration
Dataset
- Source: A cleaned and merged dataset of the 'Harry Potter Novels' collection sourced from Kaggle.
- Size: ~6 million characters (~6 million tokens, since the model operates at the character level).
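A sketch of the character-level preprocessing this implies: build the vocabulary from the distinct characters in the corpus and map text to integer ids. The filename is hypothetical:

```python
# Character-level tokenization: the vocabulary is simply the set of
# distinct characters appearing in the merged corpus.
with open("harry_potter.txt", "r", encoding="utf-8") as f:   # hypothetical filename
    text = f.read()

chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(vocab_size, encode("Harry"))
```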
Training Setup
- Hardware: NVIDIA GeForce GTX 1650 GPU (4GB VRAM).
- Framework: PyTorch (custom implementation of transformer components).
- Time: Approximately 90 minutes for training and generating output.
Optimizer
- Adam Optimizer: Adaptive gradient optimization for efficient weight updates.
- Learning Rate Schedule: Warm-up period followed by decay.
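One way to wire this up in PyTorch is Adam plus a `LambdaLR` schedule; the sketch below uses linear warm-up followed by cosine decay. The schedule shape, learning rate, and step counts are assumptions; only "warm-up then decay" comes from the description above:

```python
import math
import torch

model = torch.nn.Linear(8, 8)   # placeholder; in practice this is the transformer model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

warmup_steps, max_steps = 200, 5000   # illustrative values

def lr_lambda(step):
    # linear warm-up, then cosine decay toward zero
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# inside the training loop, after each batch:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```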
Loss Function
- Negative Log Likelihood (NLL): Penalizes the model in proportion to how little probability it assigns to the actual next tokens (equivalent to cross-entropy over the softmax output).
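Concretely, for next-character prediction this is the standard cross-entropy between the model's logits and the actual next characters; the tensor shapes below are illustrative:

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 65                      # batch, sequence length, vocab size (illustrative)
logits = torch.randn(B, T, V)           # model output
targets = torch.randint(0, V, (B, T))   # actual next characters

# cross_entropy applies log-softmax followed by negative log likelihood
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))

# equivalent explicit NLL formulation
loss_nll = F.nll_loss(F.log_softmax(logits, dim=-1).view(B * T, V), targets.view(B * T))
print(loss.item(), loss_nll.item())     # the two values match
```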
Evaluation and Output
- Generated Text: The final output is saved in `generated.txt`. Examples include Harry Potter-like sentences generated using character-level language modeling.
- Qualitative Analysis: Insights into the model's ability to generate coherent and contextually relevant text based on training data.
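A sketch of the autoregressive sampling loop that produces such output, assuming a trained model and the `decode` helper from the earlier sketches; the saving step is shown as comments because it depends on those objects:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressively sample characters: predict the next character,
    append it to the running context, and repeat."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]             # crop to the context window
        logits = model(idx_cond)[:, -1, :]          # logits for the last position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

# e.g. start from a single character id and decode with the itos/decode helpers:
# context = torch.zeros((1, 1), dtype=torch.long, device=device)
# text = decode(generate(model, context, 2000, block_size)[0].tolist())
# with open("generated.txt", "w", encoding="utf-8") as f:
#     f.write(text)
```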
Documentation
For a more detailed breakdown of each component and concept, visit the Road to GPT Documentation Site, which includes visualizations, explanations, and annotated notebooks.
Acknowledgments
This implementation is inspired by the Let’s Build GPT from Scratch video by Andrej Karpathy. Special thanks to the Kaggle community for providing the raw dataset.
For more projects, check out my Portfolio Site.