# TinyStories-GPT2-10k

**TinyStories-GPT2-10k** is a lightweight, decoder-only transformer model trained from scratch on a tokenized version of the TinyStories dataset. It uses a custom Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 10,000 tokens, making it well suited for experiments in efficient language modeling, scaling laws, and low-resource fine-tuning.
## Model Architecture
This model follows the core GPT-2 architectural principles with a few simplifications to reduce parameter count and training cost.
| Component | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-like) |
| Layers | 8 |
| Embedding Size | 128 |
| Attention Heads | 16 |
| Feedforward Size | 512 (4× expansion) |
| Sequence Length | 1024 |
| Vocabulary Size | 10,000 |
| Total Parameters | ~2.99M |
| Dropout / Bias | None (disabled for simplicity) |
| Weight Tying | Enabled (input/output embeddings) |
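For reference, below is a minimal sketch of how this configuration could be expressed with the `transformers` `GPT2Config` class. This is an illustrative assumption only: the checkpoint was trained with a custom GPT-2-style implementation, and the Hugging Face class differs in some details (for example, it always includes bias terms).

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative mapping of the table above onto GPT2Config field names.
config = GPT2Config(
    vocab_size=10_000,         # custom BPE vocabulary
    n_positions=1024,          # maximum sequence length
    n_embd=128,                # embedding size
    n_layer=8,                 # decoder layers
    n_head=16,                 # attention heads (head dim = 128 / 16 = 8)
    n_inner=512,               # feedforward size (4x expansion)
    resid_pdrop=0.0,           # dropout disabled
    embd_pdrop=0.0,
    attn_pdrop=0.0,
    tie_word_embeddings=True,  # share input/output embedding weights
)

model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
```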
### Initialization

Weights were initialized from a normal distribution $\mathcal{N}(0,\ 0.02)$, with residual-path projections additionally scaled by $\frac{1}{\sqrt{2N}}$, where $N = 8$ is the number of decoder layers, following GPT-2's residual accumulation strategy.
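A sketch of that scheme, assuming a plain PyTorch implementation (the helper name and the way residual projections are identified are illustrative, not the actual training code):

```python
import math
import torch.nn as nn

N_LAYERS = 8      # number of decoder layers
BASE_STD = 0.02   # std of the base normal initialization

def init_weights(module: nn.Module, is_residual_proj: bool = False) -> None:
    """Illustrative GPT-2-style init; how residual projections are flagged is an assumption."""
    std = BASE_STD
    if is_residual_proj:
        # Scale residual-path projections by 1/sqrt(2N) so the residual stream's variance
        # stays roughly constant as contributions from 2N sub-layers (attention + MLP) accumulate.
        std = BASE_STD / math.sqrt(2 * N_LAYERS)
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)
```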
## Training Configuration
| Setting | Value |
|---|---|
| Dataset | TinyStories-tokenized-10k |
| Tokenizer | Custom BPE (10k vocab) |
| Training Tokens | 459M |
| Validation Tokens | 4.6M |
| Max Tokens Seen | ~1.37B |
| Epochs | 3 |
| Batch Size | 48 × 512 |
| Optimizer | AdamW |
| Learning Rate | 0.06 (linear decay, 256 warmup steps) |
| Betas | (0.9, 0.95) |
| Weight Decay | 0.1 |
| Device | A100 GPU |
| Training Time | ~72 minutes |
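A sketch of an optimizer and learning-rate schedule matching this table, assuming PyTorch's `AdamW` and a linear warmup-then-decay. The total step count is a placeholder, `model` refers to the network being trained (e.g., the architecture sketch above), and the original training loop is not part of this repository.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

WARMUP_STEPS = 256
TOTAL_STEPS = 10_000  # placeholder; in practice derived from tokens seen / batch size

optimizer = AdamW(model.parameters(), lr=0.06, betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 256 steps, then linear decay toward zero.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)
```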
### Performance

| Metric | Value |
|---|---|
| Initial Loss | 9.23 |
| Final Train Loss | 4.98 |
| Best Validation Loss | 4.69 |
| Overfitting | Not observed |
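Assuming these are mean per-token cross-entropy losses in nats, they can be converted to perplexity with a one-liner:

```python
import math

best_val_loss = 4.69
print(math.exp(best_val_loss))  # ~108.8, i.e. perplexity over the 10k-token vocabulary
```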
## Tokenizer

The model was trained with a custom BPE tokenizer built from the TinyStories dataset using the Hugging Face `tokenizers` library. The tokenizer was capped at 10,000 tokens and saved as `bpe-tokenizer_tinystories.json`.
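A sketch of how such a tokenizer can be built with the `tokenizers` library; the pre-tokenizer, special tokens, and corpus file name below are assumptions, not the exact settings used for the released file:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a 10k-vocab BPE tokenizer on a plain-text dump of TinyStories (hypothetical file name).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=10_000, special_tokens=["[UNK]", "<|endoftext|>"])
tokenizer.train(files=["tinystories_train.txt"], trainer=trainer)
tokenizer.save("bpe-tokenizer_tinystories.json")

# Reload the saved tokenizer later.
tokenizer = Tokenizer.from_file("bpe-tokenizer_tinystories.json")
```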
## Files Included

- `best_model.pt`: Final model weights
- `bpe-tokenizer_tinystories.json`: BPE tokenizer (10k vocab)
- `config.yaml`: Architecture and training configuration
- `loss_history.json`: Per-epoch training losses
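A sketch of inspecting these files directly; whether `best_model.pt` stores a full module or a state dict, and the exact keys in `config.yaml`, are assumptions to verify against the actual files:

```python
import json
import torch
import yaml  # requires PyYAML

config = yaml.safe_load(open("config.yaml"))              # architecture / training settings
state = torch.load("best_model.pt", map_location="cpu")   # likely a state_dict (assumption)
loss_history = json.load(open("loss_history.json"))       # per-epoch training losses

print(config)
print(type(state))
print(loss_history)
```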
## Related Resources

- Dataset: `KabirBakhshaei/TinyStories-tokenized-10k`
- Base Dataset: `roneneldan/TinyStories`
## Inference Example

```python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

# Load the custom 10k-vocab BPE tokenizer and the model weights from the Hub.
tokenizer = GPT2TokenizerFast.from_pretrained(
    "KabirBakhshaei/TinyStories-GPT2-10k",
    tokenizer_file="bpe-tokenizer_tinystories.json",
)
model = GPT2LMHeadModel.from_pretrained("KabirBakhshaei/TinyStories-GPT2-10k")

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling must be enabled for temperature / top_k to take effect.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.9, top_k=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
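Note that `temperature` and `top_k` only influence generation when sampling is enabled, which is why `do_sample=True` is passed to `generate` above; without it, decoding is greedy and those arguments are ignored.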