---
title: StoryLlama
emoji: 📖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---

# Introducing StoryLlama - A Smaller Language Model for Bedtime Stories!

- I trained an 88M-parameter Llama architecture, coded from the ground up, to build a small instruct model, going through the stages mentioned below from scratch.
- Trained on the TinyStories dataset from HuggingFace (roughly 4B tokens) for a total of 5000 steps.

### Pretraining

#### Dataset

- I used the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset from HuggingFace.

1) Train dataset - approx. 2M records
2) Val dataset - approx. 26K records

---

#### ModelArgs (Hyperparameters)

Below is a table summarizing the configuration parameters for the model:

| Parameter | Description | Default Value | Type |
|--------------------------------|-----------------------------------------------------------------------------|-----------------------------------|-----------|
| `epochs` | Number of training epochs | `4` | `int` |
| `block_size` | Size of each block (context length) | `512` | `int` |
| `batch_size` | Batch size for training | `64` | `int` |
| `inference` | Inference mode (not specified) | `None` | `None` |
| `embeddings_dims` | Dimensionality of embeddings | `512` | `int` |
| `attn_dropout` | Dropout rate for attention layers | `0.1` | `float` |
| `no_of_heads` | Number of attention heads | `8` | `int` |
| `dropout` | Dropout rate for the model | `0.1` | `float` |
| `val_epochs` | Number of validation epochs | `2` | `int` |
| `max_lr` | Maximum learning rate | `6e-4` | `float` |
| `no_of_decoder_layers` | Number of decoder layers | `8` | `int` |
| `weight_decay_optim` | Weight decay for the optimizer | `0.1` | `float` |
| `beta_1` | Beta 1 for Adam optimizer | `0.9` | `float` |
| `beta_2` | Beta 2 for Adam optimizer | `0.95` | `float` |
| `clip` | Gradient clipping value | `1.0` | `float` |
| `device` | Device to run the model (`cuda` or `cpu`) | `'cuda'` | `str` |
| `no_kv_heads` | Number of key-value heads | `2` | `int` |
| `vocab_size` | Size of the vocabulary | `50304` | `int` |
| `eps` | Epsilon value for numerical stability | `1e-5` | `float` |
| `dtype` | Data type for tensors (`bfloat16` if supported, else `float16`) | `'bfloat16'` or `'float16'` | `str` |
| `save_checkpoint_dir` | Directory to save model checkpoints | `"checkpoints"` | `str` |
| `prompt` | Default prompt for inference | `"Once upon a time"` | `str` |
| `save_checkpoint_iter` | Save checkpoint every N iterations | `50` | `int` |
| `total_iters` | Total number of training iterations | `10000` | `int` |
| `eval_iters` | Evaluate model every N iterations | `50` | `int` |
| `eval_check` | Check evaluation metrics every N iterations | `100` | `int` |
| `warmup_iters` | Number of warmup iterations for learning rate scheduling | `700` | `int` |
| `min_lr` | Minimum learning rate (10% of `max_lr`) | `0.1 * max_lr` | `float` |
| `lr_decay_iters` | Number of iterations for learning rate decay | `10000` | `int` |
| `total_batch_size` | Total batch size across all devices | `524288` | `int` |
| `micro_batch_size` | Micro batch size per device | `batch_size` | `int` |
| `gradient_accumulation_steps` | Gradient accumulation steps | `524288` | `int` |
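
These hyperparameters presumably map onto a single config object inside the trainer. Below is a minimal sketch of what such a dataclass could look like, with names and defaults copied from the table; the actual `ModelArgs` class in `trainer.py` may differ.

```python
from dataclasses import dataclass

import torch


@dataclass
class ModelArgs:
    # Training schedule
    epochs: int = 4
    val_epochs: int = 2
    total_iters: int = 10000
    warmup_iters: int = 700
    lr_decay_iters: int = 10000
    # Model shape
    block_size: int = 512
    embeddings_dims: int = 512
    no_of_heads: int = 8
    no_kv_heads: int = 2
    no_of_decoder_layers: int = 8
    vocab_size: int = 50304
    attn_dropout: float = 0.1
    dropout: float = 0.1
    eps: float = 1e-5
    # Optimizer
    max_lr: float = 6e-4
    min_lr: float = 0.1 * max_lr
    weight_decay_optim: float = 0.1
    beta_1: float = 0.9
    beta_2: float = 0.95
    clip: float = 1.0
    # Batching
    batch_size: int = 64
    micro_batch_size: int = batch_size
    total_batch_size: int = 524288
    gradient_accumulation_steps: int = 524288  # value as listed in the table above
    # Runtime / checkpointing
    device: str = "cuda"
    # Prefer bfloat16 when the GPU supports it, otherwise fall back to float16
    dtype: str = (
        "bfloat16"
        if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
        else "float16"
    )
    save_checkpoint_dir: str = "checkpoints"
    save_checkpoint_iter: int = 50
    eval_iters: int = 50
    eval_check: int = 100
    prompt: str = "Once upon a time"
```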
---

#### Hardware Setup

- Used DDP via PyTorch `torchrun` on 2x NVIDIA A100 SXM GPUs (80 GB VRAM each) rented on runpod.io.
- The model is about 0.768 GB in size but needs around 4 GB of VRAM when loaded in fp32 precision.

---

#### Frameworks

**PyTorch**

---

#### Epochs/Steps

- Iterations (train) = 5k
- Val iterations = every 50 steps

---

#### Losses

- Train loss - 1.43
- Val loss - 1.45

---

#### Screenshots of the loss curves

- Loss Curves (Train and Val)

![Loss Curves (Train and Val)](images/loss_curves.jpg)

---

#### Output

- Prompt: Once upon a time

![Prompt: Once upon a time](images/sample.jpg)

---

### Local setup

#### Requirements

```bash
git clone https://github.com/YuvrajSingh-mist/StoryLlama.git
cd StoryLlama
bash ./install.sh
```

- A wandb.ai account for plotting the loss curves
- On your terminal, run

```bash
wandb login
```

- Enter the API key and follow the instructions; once you are successfully logged in, follow the steps below.
- Download the model

```bash
python download_model_weight.py
```

---

### Running

#### Training a model

- Kindly change `device` to any of your available CUDA GPUs.

To run:

```bash
bash ./install.sh
```

```bash
torchrun --standalone --nproc_per_node=gpu trainer.py \
--epochs 10 \
--block_size 256 \
--batch_size 128 \
--embeddings_dims 768 \
--attn_dropout 0.2 \
--no_of_heads 12 \
--dropout 0.2 \
--val_epochs 3 \
--max_lr 5e-4 \
--no_of_decoder_layers 6 \
--weight_decay_optim 0.01 \
--beta_1 0.85 \
--beta_2 0.99 \
--clip 0.5 \
--device "cuda" \
--no_kv_heads 4 \
--vocab_size 50257 \
--eps 1e-6 \
--dtype "float16" \
--save_checkpoint_dir "model_checkpoints" \
--prompt "Once upon a time" \
--save_checkpoint_iter 100 \
--total_iters 5000 \
--eval_iters 200 \
--eval_check 500 \
--warmup_iters 1000 \
--min_lr 1e-5 \
--lr_decay_iters 2000 \
--total_batch_size 262144 \
--micro_batch_size 128 \
--gradient_accumulation_steps 4
```

- `--standalone` - use when all the GPUs are on a single node
- `--nproc_per_node` - number of GPUs to use; pass the keyword `gpu` to use all available GPUs

#### Inference on a model

```bash
python inference.py --prompt "Once upon a time" --max_length 100 --temperature 0.8 --topk 50
```
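
The `--temperature` and `--topk` flags suggest temperature-scaled top-k sampling. As a rough reference, here is a minimal sketch of such a generation loop, assuming a model that returns logits of shape `(batch, seq, vocab)` and a tokenizer exposing `encode`/`decode`; the actual loop in `inference.py` may differ.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def generate(model, tokenizer, prompt, max_length=100, temperature=0.8, topk=50, device="cuda"):
    # Encode the prompt into token ids: shape (1, T)
    ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long, device=device)
    for _ in range(max_length):
        logits = model(ids)[:, -1, :] / temperature      # last-step logits, scaled by temperature
        topv, _ = torch.topk(logits, topk)               # values of the top-k logits
        logits[logits < topv[:, [-1]]] = float("-inf")   # mask everything below the k-th value
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```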