SparkNet 70M v5
SparkNet 70M v5 is the final 70M-parameter checkpoint from the SparkNet research run by DienerTech. It is a compact GPT-2–style decoder (12 layers, 512 hidden size, 8 attention heads, 1024-token context) that was trained for ~1B tokens on a custom mixture of high-quality web and document corpora. The release ships with the SparkNet v5 tokenizer and weights stored in model.safetensors, ready for direct use via 🤗 Transformers.
Special thanks to CodeLion for inspiring the One Billion Token Challenge, and for providing the high-quality datasets used in this training run.
Model Details
- Developer: DienerTech
- Architecture: GPT-2–style causal decoder (approx. 70M parameters), dropout 0.1. Trained with a cosine LR schedule and AdamW (fused).
- Context length: 1,024 tokens.
- Tokenizer: SparkNet v5 byte-level BPE (vocab size 50,257, EOS = `` and `<|pad|>` padding).
- Framework: PyTorch / 🤗 Transformers 4.46+.
- Checkpoint: Converted to `model.safetensors` for safe loading; no `pytorch_model.bin` left in the repo.
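As a rough sanity check on the parameter figure, the sketch below counts weights for a standard GPT-2 block layout using the sizes listed above. The 4x MLP width and the tied input/output embeddings are assumptions carried over from stock GPT-2, not details confirmed by this card, and `count_gpt2_params` is a hypothetical helper, so treat the result as an estimate only.

```python
def count_gpt2_params(vocab_size: int, n_positions: int, d_model: int, n_layers: int) -> int:
    """Estimate parameters of a GPT-2 style decoder.

    Assumes (not confirmed by the card): MLP width of 4 * d_model,
    biases on every linear layer, and an output head tied to the
    token embedding (so it adds no extra weights).
    """
    token_emb = vocab_size * d_model
    pos_emb = n_positions * d_model
    # Per block: QKV (3d^2), attention out-proj (d^2), two MLP matrices (4d^2 each),
    # plus biases and two LayerNorms, which sum to 13 * d_model.
    per_block = 12 * d_model * d_model + 13 * d_model
    final_ln = 2 * d_model
    return token_emb + pos_emb + n_layers * per_block + final_ln


# Sizes taken from the card: vocab 50,257, context 1,024, hidden 512, 12 layers.
estimate = count_gpt2_params(vocab_size=50257, n_positions=1024, d_model=512, n_layers=12)
print(f"{estimate:,} parameters (tied embeddings assumed)")
```

Under these assumptions the estimate comes to roughly 64M. An untied output head would add about 25.7M more, so the exact total depends on details the card does not state.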
Intended Use
- Lightweight text generation experiments, story/note drafting, or as a base for instruction-tuning / domain adaptation (LoRA, QLoRA, etc.).
- Research on small-model scaling laws or tokenizer experimentation.
Limitations & Risks
- No RLHF / instruction tuning; the model only continues text, so it may need few-shot or carefully framed prompts to produce useful output.
- Training data is predominantly public web/document text, so bias, toxicity, or outdated information may surface.
- Not evaluated for safety-critical deployments—perform your own alignment and red-teaming before production use.
Training Data
- 1B tokens packed into 1,024-token blocks (`datasets/sparknet-v5-1b`).
- Sources sampled uniformly across: `codelion/finepdfs-1B`, `codelion/dclm-baseline-1B`, `codelion/fineweb-edu-1B`, plus curated DienerTech blog data.
- Validation set: `wikitext-2-raw-v1` (standard Hugging Face split).
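Packing into fixed-length blocks can be sketched as follows. This is a minimal illustration, not the actual preprocessing script: the helper name `pack_into_blocks`, the EOS separator between documents, and dropping the final partial block are all assumptions.

```python
from typing import Iterable, List


def pack_into_blocks(docs: Iterable[List[int]], block_size: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized documents and cut the stream into equal blocks.

    Assumptions (hypothetical, not taken from the card): each document is
    followed by an EOS token, and a trailing partial block is discarded.
    """
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for token_ids in docs:
        buffer.extend(token_ids)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks


# Toy example with block_size=4 instead of 1,024 and EOS id 0.
toy_docs = [[1, 2, 3], [4, 5], [6]]
toy_blocks = pack_into_blocks(toy_docs, block_size=4, eos_id=0)
print(toy_blocks)
```

With real data, `block_size` would be 1,024 and `eos_id` would come from the tokenizer.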
Training Procedure
- Optimizer: AdamW (fused) with β₁=0.9, β₂=0.95, weight decay 0.1, gradient clipping at 1.0.
- Learning rate: 1e-4 peak with 3% warmup then cosine decay.
- Batching: per-device batch size 32, gradient accumulation 2 → 65,536 tokens/step.
- Budget: 1,000,000,000 effective tokens (≈15,259 steps).
- Hardware: Single 24GB+ NVIDIA GPU with TF32 + Flash Attention enabled.
- Best checkpoint: step 14,000 with eval loss 4.99 on WikiText-2 (logged via `trainer_state.json`).
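The batching and budget figures above can be reproduced with a little arithmetic, and the learning-rate schedule can be sketched alongside them. The snippet assumes a single device, linear warmup, warmup steps rounded up, and cosine decay to zero; those details are assumptions for illustration, not confirmed settings of this run.

```python
import math

SEQ_LEN = 1024                # tokens per packed block
PER_DEVICE_BATCH = 32
GRAD_ACCUM = 2
TOKEN_BUDGET = 1_000_000_000
PEAK_LR = 1e-4
WARMUP_RATIO = 0.03

# One optimizer step consumes batch * accumulation * sequence length tokens.
tokens_per_step = SEQ_LEN * PER_DEVICE_BATCH * GRAD_ACCUM
total_steps = math.ceil(TOKEN_BUDGET / tokens_per_step)
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)  # rounding up is an assumption


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(tokens_per_step, total_steps, warmup_steps)
print(lr_at(14_000))  # learning rate near the best-checkpoint step
```

The first two printed values match the 65,536 tokens/step and roughly 15,259 steps quoted in the list.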
Evaluation
Formal downstream evaluation has not been run yet. Inside trainer_state.json, the best validation (WikiText-2) cross-entropy reached 4.9869 at step 14k. If you benchmark the model (e.g., with lm-eval-harness), please consider contributing results back to the card via a PR.
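For readers who prefer perplexity, the reported cross-entropy converts directly. The sketch assumes the logged loss is the mean per-token negative log-likelihood in nats, which is the usual convention for causal LM training losses but is not stated explicitly in this card.

```python
import math

eval_loss = 4.9869  # best WikiText-2 validation cross-entropy from the card

# Perplexity is the exponential of the mean per-token loss (in nats).
perplexity = math.exp(eval_loss)
print(f"token-level perplexity ~ {perplexity:.1f}")
```

This is a token-level figure over the model's own BPE vocabulary, so it is not directly comparable to word-level WikiText-2 perplexities reported elsewhere.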
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "DienerTech/sparknet-70m-v5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16 on older GPUs
    device_map="auto",  # requires the accelerate package
)
prompt = "In a distant research lab, a tiny transformer model awakened and"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sample a continuation; this is a base model, so it simply extends the prompt.
output = model.generate(
    **inputs,
    max_new_tokens=120,
    temperature=0.9,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Citation
@software{sparknet70mv5,
  author = {DienerTech},
  title = {SparkNet 70M v5},
  year = {2025},
  url = {https://huggingface.co/DienerTech/sparknet-70m-v5}
}
Please open an issue or PR on the DienerTech Hugging Face repo if you have feedback, evaluations, or fine-tuned variants to share.