SimpleStories 35M
SimpleStories-35M is a 35 million parameter language model trained on the SimpleStories dataset. It is the largest model in the SimpleStories family of small language models, which ranges from 1.25M to 35M parameters and offers a spectrum of capabilities while maintaining efficiency; the 35M model gives the best performance across all evaluation metrics. The model training and evaluation code can be found here: https://github.com/danbraunai/simple_stories_train/tree/main/simple_stories_train
Model Variants
| Model Name | Parameters | Description |
|---|---|---|
| SimpleStories-35M | 35 million | Our largest model, offering the best performance across all metrics |
| SimpleStories-30M | 30 million | A slightly smaller model with comparable performance |
| SimpleStories-11M | 11 million | Medium-sized model with a good balance of performance and efficiency |
| SimpleStories-5M | 5 million | Smaller model suitable for resource-constrained environments |
| SimpleStories-1.25M | 1.25 million | Our smallest model, remarkably capable despite its tiny size |
Performance Comparison
Our models demonstrate strong performance across various evaluation metrics, as shown in the chart below. The trained models are scored with a model-as-a-judge evaluation framework on the following dimensions (a rough prompt sketch is given further below):
- Originality: Measures the uniqueness and creativity of generated content
- Coherence: Evaluates the logical flow and consistency of generated stories
- Grammar: Assesses grammatical correctness and linguistic quality
- Quality: Holistic evaluation of overall text generation quality
The larger models (35M, 30M) achieve the best performance, particularly in coherence and grammar, while even our smallest 1.25M parameter model produces readable and coherent content. In the visualization, SimpleStories-35M scores 90.8 on Grammar, 85.7 on Coherence, 81.5 on Quality, and 72.5 on Originality.
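For readers unfamiliar with this kind of evaluation, the sketch below shows one possible way a judge model could be prompted to score a story on these four dimensions. The judge model name, prompt wording, and JSON parsing are placeholders for illustration only, not the actual evaluation code that produced the scores above.

# Hypothetical model-as-a-judge scoring sketch. The judge model name, prompt
# wording, and JSON parsing are assumptions, not the project's evaluation code.
import json
from transformers import pipeline

judge = pipeline("text-generation", model="YOUR-JUDGE-MODEL")  # placeholder judge

def judge_story(story: str) -> dict:
    prompt = (
        "Rate the following story from 0 to 100 on originality, coherence, "
        "grammar, and overall quality. Answer as JSON with the keys "
        "originality, coherence, grammar, quality.\n\nStory:\n" + story
    )
    reply = judge(prompt, max_new_tokens=100, return_full_text=False)[0]["generated_text"]
    return json.loads(reply)  # assumes the judge replies with valid JSON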
Dataset
The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:
- Story annotation with high-level concepts: theme, topic, style, etc.
- Higher semantic and syntactic diversity through seeded story generation
- Generated by 2024 models
- Several pre-computed NLP metrics to aid filtering (see the loading sketch after this list)
- ASCII-only guarantee for the English dataset
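As a minimal sketch of working with these annotations, the snippet below loads the dataset from the Hugging Face Hub and filters on one field. The dataset identifier and column name used here are assumptions; check the SimpleStories dataset card for the actual ID and schema.

# Minimal sketch, not the project's own tooling.
# The dataset ID and the "story" column name are assumptions; consult the
# SimpleStories dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("SimpleStories/SimpleStories", split="train")  # hypothetical ID
print(ds.column_names)  # inspect which annotations and metrics are available

# Example: keep only short stories, assuming a "story" text column exists
short_stories = ds.filter(lambda ex: len(ex["story"].split()) < 200)
print(len(short_stories))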
Tokenizer
We trained a custom WordPiece tokenizer with a small vocabulary of 4096 tokens. We conducted morphological and coverage-gain analyses on the dataset to build a compact tokenizer without compromising generation quality.
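To see the compact vocabulary in action, you can load the tokenizer from the model repository (the same path used in the Usage section below) and inspect how a sentence is split into WordPiece tokens; the example sentence is arbitrary.

# Inspect the custom 4096-token WordPiece tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("chandan-sreedhara/SimpleStories-35M")
print(tokenizer.vocab_size)  # expected to be 4096 per the description above

tokens = tokenizer.tokenize("The curious cat looked at the moon.")
print(tokens)  # rare words are split into several WordPiece sub-tokens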
Installation
Follow the steps at https://github.com/danbraunai/simple_stories_train to install the simple_stories_train package.
Usage
Here's how to use any model in the SimpleStories family:
from transformers import AutoTokenizer
import torch
from simple_stories_train.models.llama import Llama
from simple_stories_train.models.model_configs import MODEL_CONFIGS
# Select the model size you want to use
model_size = "35M" # Options: "35M", "30M", "11M", "5M", "1.25M"
# Load model configuration
model_config = MODEL_CONFIGS[model_size]
# Load appropriate model
model_path = f"chandan-sreedhara/SimpleStories-{model_size}"
model = Llama.from_pretrained(model_path, model_config)
model.to("cuda")
model.eval()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Define your prompt
prompt = "The curious cat looked at the"
# IMPORTANT: Use tokenizer without special tokens
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
input_ids = inputs.input_ids.to("cuda")
# IMPORTANT: Set correct EOS token ID (not the default from tokenizer)
eos_token_id = 1
# Generate text
with torch.no_grad():
    output_ids = model.generate(
        idx=input_ids,
        max_new_tokens=800,
        temperature=0.7,
        top_k=40,
        eos_token_id=eos_token_id
    )
# Decode output
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"Generated text:\n{output_text}")
Limitations
- These models are trained primarily on English content
- The smaller models (5M, 1.25M) may show limitations in handling complex narrative structures
Citation
If you use these models in your research, please cite:
@misc{sreedhara2025simplestories,
  author = {Sreedhara, Chandan},
  title = {SimpleStories Model Family},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/chandan-sreedhara/SimpleStories-35M}}
}
Acknowledgements
These models build upon the work done in the TinyStories project by Eldan and Li, with the SimpleStories dataset created by Lennart Finke and the training code created by Dan Braun.