# Model Card for gpt2
This model is a reproduction of gpt2, following Andrej Karpathy's GPT tutorial series and the original GPT-2 paper.

The model was trained using the nano-gpt library, which follows the pattern from Karpathy's excellent content with some additional packaging and infrastructure work to make it more maintainable and reusable.
## Model Details

### Model Description
GPT-2 is a transformer model pretrained on a large corpus of English-only text with no human labeling. This is the smallest version of GPT-2, with 124M parameters.
The model follows the standard GPT-2 architecture, with transformer blocks containing (see the sketch after this list):
- Multi-head causal self-attention
- Layer normalization
- MLP blocks with GELU activation
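For illustration only, here is a minimal PyTorch sketch of one such block in the nanoGPT style; the class names and `Config` fields are hypothetical and do not necessarily match the nano-gpt implementation:

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class Config:
    n_embd: int = 768  # hidden size of the 124M model
    n_head: int = 12   # attention heads


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention."""

    def __init__(self, config: Config) -> None:
        super().__init__()
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # q, k, v
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) and apply causal attention.
        q, k, v = (
            t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            for t in (q, k, v)
        )
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.c_proj(y.transpose(1, 2).contiguous().view(B, T, C))


class Block(nn.Module):
    """One transformer block: pre-norm attention, then pre-norm MLP."""

    def __init__(self, config: Config) -> None:
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(approximate="tanh"),  # GPT-2 uses the tanh approximation
            nn.Linear(4 * config.n_embd, config.n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```

In the 124M configuration of GPT-2, twelve such blocks are stacked between the token/position embeddings and the language-model head.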
The model was trained on the 10B token sample of FineWeb-Edu, a dataset of educational web pages.
- Developed by: Allen Porter
### Model Sources
- Repository: https://github.com/allenporter/nano-gpt/
## How to Get Started with the Model
This model is stored in safetensors format, the same format used by the gpt2 model released by OpenAI.
The easiest way to load this model is to use the nano-gpt command line tool. You can install the package from PyPI. Here is an example using a virtual environment with uv:
```shell
$ uv venv --python=3.13
$ source .venv/bin/activate
$ uv pip install nano-gpt
```
You may then load this pretrained model:
```shell
$ nano-gpt sample --pretrained=allenporter/gpt2
> Hello, I'm a language model, you're doing your application, I've put your main program and you want to model. Here are some things
> Hello, I'm a language model, so let's have a look at a few very old and popular dialects with some basic information about some of
> Hello, I'm a language model, but I also use a number of core vocabulary from the Python language and some data structures from the web to
> Hello, I'm a language model, so this is about building a language to help my students to express themselves in all possible situations when they are in
> Hello, I'm a language model, who wrote my first 'hello' and never used it, but my first 'hello' can't be in
```
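Because the checkpoint shares the gpt2 weight layout, it should also load with the Hugging Face transformers library. A minimal sketch, assuming the hub id allenporter/gpt2 and using the stock gpt2 tokenizer (the model was trained with GPT-2's vocabulary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stock GPT-2 tokenizer; the checkpoint itself may not ship tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("allenporter/gpt2")

inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(outputs[0]))
```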
## Training Details

### Training Data
This model was trained on the 10B token sample of https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
### Training Procedure

#### Preprocessing
To make data loading efficient, the dataset was pre-tokenized with the GPT-2 tokenizer and sharded using the nano-gpt prepare_dataset command, which splits the output into manageable chunks.
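For illustration, here is a rough sketch of that kind of pre-tokenization and sharding using tiktoken; the shard size, file names, and datasets usage are assumptions, not nano-gpt's actual prepare_dataset implementation:

```python
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
SHARD_SIZE = 100_000_000  # tokens per shard (assumed; pick what fits memory)

ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")

buf: list[int] = []
shard = 0
for doc in ds:
    # Delimit documents with the GPT-2 end-of-text token.
    buf.extend(enc.encode_ordinary(doc["text"]) + [enc.eot_token])
    while len(buf) >= SHARD_SIZE:
        # GPT-2 token ids fit in uint16 (vocab size 50257 < 65536).
        np.array(buf[:SHARD_SIZE], dtype=np.uint16).tofile(f"shard_{shard:04d}.bin")
        buf = buf[SHARD_SIZE:]
        shard += 1
```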
#### Training Hyperparameters
- Training regime: See `train_config` in `config.json` for the hyperparameters.
The main features of the training process are (see the sketch after this list):
- Learning rate scheduling with warmup
- Gradient clipping for stable training
- Model compilation for improved performance where available
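A minimal PyTorch sketch of these features, with assumed hyperparameter values rather than the ones in `train_config`, and a toy model standing in for the GPT:

```python
import math

import torch
from torch import nn

MAX_LR, MIN_LR = 6e-4, 6e-5           # assumed values for illustration
WARMUP_STEPS, MAX_STEPS = 715, 19072  # warmup length is an assumption


def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    ratio = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * ratio))


model = nn.Linear(8, 8)  # toy stand-in for the GPT model
if hasattr(torch, "compile"):
    model = torch.compile(model)  # compile for performance where available
optimizer = torch.optim.AdamW(model.parameters(), lr=MAX_LR)

for step in range(MAX_STEPS):
    loss = model(torch.randn(4, 8)).pow(2).mean()  # dummy loss
    loss.backward()
    # Clip the global gradient norm for stable training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.step()
    optimizer.zero_grad()
```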
#### Speeds, Sizes, Times
The model was trained using 8 x A100 GPUs for one full epoch of the 10B token dataset, which is 19072 steps (roughly 0.5M tokens per step). Training took about 2 hours.
## Evaluation
The model was evaluated using the HellaSwag dataset. Results: TBD.
The nano-gpt train command has built-in support for evaluating against the validation dataset as well as HellaSwag in between training steps. Every 500 steps the model was evaluated against the validation dataset and HellaSwag.
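For reference, HellaSwag is typically scored by rendering each of the four candidate endings after the context and picking the ending with the lowest average per-token loss. A minimal sketch of that idea, not nano-gpt's evaluator:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("allenporter/gpt2").eval()


@torch.no_grad()
def pick_ending(context: str, endings: list[str]) -> int:
    """Return the index of the ending with the lowest mean per-token loss."""
    losses = []
    for ending in endings:
        ctx_ids = tokenizer.encode(context)
        end_ids = tokenizer.encode(" " + ending)
        ids = torch.tensor([ctx_ids + end_ids])
        logits = model(ids).logits
        # Score only the ending tokens, each predicted from the prior position.
        loss = F.cross_entropy(
            logits[0, len(ctx_ids) - 1 : -1],
            ids[0, len(ctx_ids) :],
        )
        losses.append(loss.item())
    return min(range(len(endings)), key=losses.__getitem__)
```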
## Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: A100
- Hours used: 2
- Cloud Provider: Lambda Labs
- Compute Region: Arizona
- Power estimate: 8 GPUs * 0.325 kW/GPU = 2.6 kW. 2.6 kW * 2 hours = 5.2 kWh
- CO2 Estimate: 2.6 pounds of CO2 equivalent assuming 500 lbs CO2e per MWh