# SmallCoder (303M)

SmallCoder is a 303M-parameter LLaMA-style language model trained from scratch for code generation and algorithmic reasoning.

This checkpoint is a 6B-token Supervised Fine-Tuning (SFT) run that fixes a critical End-of-Sequence (EOS) token bug present in earlier versions.

Despite its compact size, SmallCoder achieves state-of-the-art (SOTA) coding performance among <500M-parameter models, rivaling 1B–7B parameter LLMs.

Trained with support from Google's TPU Research Cloud (TRC) program.
## Key Results
| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|---|---|---|---|
| SmallCoder (Stage 4.1) | 303M | 27.4 % | 31.0 % |
| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |
SmallCoder nearly matches Mistral-7B on HumanEval while being roughly 23× smaller.
## Model Architecture
A LLaMA-type causal decoder with standard Multi-Head Attention (MHA).
```python
LlamaConfig(
    vocab_size=49152,              # StarCoder tokenizer
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,
    intermediate_size=3072,
    max_position_embeddings=1024,
)
```
| Parameter | Value |
|---|---|
| Total parameters | ≈ 303 M |
| Context length | 1 024 tokens |
| Tokenizer | bigcode/starcoder |
| Architecture type | LLaMA (MHA, non-GQA) |
| Precision | bfloat16 |
| Optimizer | AdamW XLA |
| Hardware | TPU v4-32 (TRC) |
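
The parameter count can be sanity-checked by instantiating the config directly in `transformers`. A minimal sketch, assuming the block above maps one-to-one onto `LlamaConfig` with untied input/output embeddings (the library default):

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,
    intermediate_size=3072,
    max_position_embeddings=1024,
)

# Randomly initialised weights, used only to count parameters;
# the printed figure should land close to the ~303 M reported above.
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```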
## Training Curriculum (4 Stages, 29.8B tokens)
| Stage | Tokens (B) | Dataset | Objective | Loss (start → end) |
|---|---|---|---|---|
| 1. Linguistic Base | 6.3 | FineWeb-Edu | General English grounding | 10.87 → 2.58 |
| 2. Code Specialization | 7.5 | 60 % Nemotron Synthetic Code / 40 % StarCoderData | Code syntax & reasoning | 5.00 → 1.25 |
| 3. Math & Knowledge | 10.0 | Nemotron CC-Math / FineWiki / OpenWebMath | Mathematical reasoning | 2.77 → 1.55 |
| 4.1 SFT (EOS Fixed) | 6.0 | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |
Total ≈ 29.8 B tokens of curated curriculum learning.
## Detailed Benchmarks (Stage 4.1 SFT)
| Domain | Benchmark | Metric | Score |
|---|---|---|---|
| Code | HumanEval (0-shot) | pass@1 | 27.4 % |
| Code | MBPP (3-shot) | pass@1 | 31.0 % |
| Math | GSM8k (0-shot) | exact match | 4.55 % |
| Knowledge | Wikitext-2 | perplexity ↓ | 167.6 |
| Reasoning | ARC (Easy/Challenge) | acc norm | 34.6 / 22.8 % |
| Commonsense | HellaSwag | acc norm | 28.3 % |
`humaneval`/`mbpp` were computed with manual evaluation (`max_new_tokens=512`, `temp=0.2`) due to SFT format truncation issues in `lm-eval`.
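
For reference, the sketch below shows one plausible way to reproduce such a manual pass@1 run with the stated decoding settings, using the reference `human-eval` harness (github.com/openai/human-eval). The raw-prompt completion setup and post-processing are assumptions, not the exact script behind the table:

```python
# Hedged sketch of a manual HumanEval pass@1 run; prompt handling is an
# assumption and may differ from the protocol used for the scores above.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from human_eval.data import read_problems, write_jsonl  # pip install human-eval

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

samples = []
for task_id, problem in read_problems().items():
    # Complete the raw HumanEval prompt (function signature + docstring).
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.2,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Score afterwards with: evaluate_functional_correctness samples.jsonl
```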
## Known Limitations
- **Code-specialized model:** tuned for Python and algorithmic reasoning; performance on general text, math, and commonsense tasks is poor.
- **Short context:** trained on 1 024-token sequences only; performance degrades on longer inputs.
- **Tokenizer bias:** uses the `bigcode/starcoder` BPE vocabulary, which is optimized for code rather than prose.
## Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# The model was fine-tuned on the "User:" / "Assistant:" dialogue format.
prompt = """User: Write a Python function to compute Fibonacci numbers.
Assistant:"""

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,   # EOS handling fixed in this checkpoint
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Trained using the "User:" / "Assistant:" dialogue format.
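
Because the dialogue format is plain text, the model can occasionally run past its answer and start a new `User:` turn instead of emitting EOS. A minimal sketch of one way to guard against this with a custom `StoppingCriteria`; the stop-on-`"User:"` heuristic is an assumption for illustration, not part of the released training setup:

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)


class StopOnUserTurn(StoppingCriteria):
    """Stop generating as soon as the newly generated text starts a 'User:' turn."""

    def __init__(self, tokenizer, prompt_len):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len

    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_len:])
        return "User:" in new_text


prompt = "User: Write a Python function that checks if a string is a palindrome.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
prompt_len = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        stopping_criteria=StoppingCriteriaList([StopOnUserTurn(tokenizer, prompt_len)]),
    )

# Keep only the assistant reply, trimming any trailing 'User:' fragment.
reply = tokenizer.decode(outputs[0, prompt_len:], skip_special_tokens=True)
print(reply.split("User:", 1)[0].strip())
```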
## Citation
If you use SmallCoder (303M) in your research, please cite:
```bibtex
@misc{smallcoder303m,
  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
  author = {Da Silva, Ilan},
  year   = {2025},
  url    = {https://huggingface.co/Beebey/smallcoder-303m},
  note   = {Trained with Google TPU Research Cloud (TRC) support}
}
```
## Acknowledgements
This model was trained with support from the Google TPU Research Cloud (TRC) program. Special thanks to the open datasets that enabled this work: FineWeb, StarCoderData, Nemotron, and OpenWebMath.
## Summary
| Category | Description |
|---|---|
| Type | Code LLM (LLaMA-style) |
| Parameters | 303 M |
| Training tokens | ~29.8 B |
| Specialty | Code generation & reasoning |
| Context window | 1 024 tokens |
| Tokenizer | bigcode/starcoder |
| License | Apache 2.0 |
| Hardware | TPU v4 (TRC Program) |
SmallCoder (303M) demonstrates that a carefully designed <500M-parameter model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval and proving that efficient, compact, open models still matter.