SparkNet 70M v5

SparkNet 70M v5 is the final 70M-parameter checkpoint from DienerTech's SparkNet research run. It is a compact GPT-2–style decoder (12 layers, hidden size 512, 8 attention heads, 1,024-token context) trained on ~1B tokens from a custom mixture of high-quality web and document corpora. The release ships with the SparkNet v5 tokenizer and weights stored in model.safetensors, ready for direct use via 🤗 Transformers.

Special thanks to CodeLion for inspiring the One Billion Token Challenge and for providing the high-quality datasets used in this training run.

Model Details

Developer: DienerTech
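Architecture: GPT-2–style causal decoder with 12 layers, hidden size 512, and 8 attention heads (nominally 70M parameters; the Safetensors metadata reports 64.1M), dropout 0.1. Trained with fused AdamW and a cosine LR schedule.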
Context length: 1,024 tokens.
Tokenizer: SparkNet v5 byte-level BPE (vocab size 50,257, EOS = `` and <|pad|> padding).
Framework: PyTorch / 🤗 Transformers 4.46+.
Checkpoint: Converted to model.safetensors for safe loading; no pytorch_model.bin left in the repo.
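
The gap between the nominal 70M and the reported 64.1M can be sanity-checked from the dimensions listed above. The following is a minimal sketch, assuming the standard GPT-2 layout: learned position embeddings, a 4x MLP expansion, biases on all linear layers, and an output head tied to the token embeddings. None of these details are confirmed by the card, and `gpt2_param_count` is a hypothetical helper, not part of the release.

```python
def gpt2_param_count(n_layer: int, d_model: int, vocab_size: int, n_ctx: int) -> int:
    """Count parameters for an assumed standard GPT-2 layout."""
    d_ff = 4 * d_model  # assumption: GPT-2's default 4x MLP expansion

    # Token embedding table plus learned position embedding table.
    embeddings = vocab_size * d_model + n_ctx * d_model

    # Fused QKV projection and output projection, each with a bias.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)

    # Two-layer MLP, each linear layer with a bias.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)

    per_layer = attention + mlp + layer_norms
    final_norm = 2 * d_model

    # Assumption: the LM head shares weights with the token embedding,
    # so it contributes no additional parameters.
    return embeddings + n_layer * per_layer + final_norm


total = gpt2_param_count(n_layer=12, d_model=512, vocab_size=50_257, n_ctx=1_024)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes to 64,085,504, which is consistent with the 64.1M figure shown for the Safetensors checkpoint. The number of attention heads does not affect the total, since the heads only partition the hidden dimension.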

Intended Use

Lightweight text generation experiments, story/note drafting, or as a base for instruction-tuning / domain adaptation (LoRA, QLoRA, etc.).
Research on small-model scaling laws or tokenizer experimentation.

Limitations & Risks

No RLHF / instruction tuning; outputs are raw next-token continuations, so the model will not reliably follow instructions and usually needs completion-style or few-shot prompts.
Training data is predominantly public web/document text, so bias, toxicity, or outdated information may surface.
Not evaluated for safety-critical deployments. Perform your own alignment and red-teaming before production use.

Training Data

1B tokens packed into 1,024-token blocks (datasets/sparknet-v5-1b).
Sources sampled uniformly across: codelion/finepdfs-1B, codelion/dclm-baseline-1B, codelion/fineweb-edu-1B, plus curated DienerTech blog data.
Validation set: wikitext-2-raw-v1 (standard Hugging Face split).
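
Packing into fixed-length blocks, as mentioned above, is commonly done by concatenating tokenized documents with a separator token and slicing the stream into equal chunks. The sketch below illustrates that idea on toy token IDs. It is not the actual SparkNet preprocessing script: `pack_into_blocks`, the `eos_id` value, and the choice to drop the trailing partial block are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def pack_into_blocks(
    docs: Iterable[List[int]], block_size: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate token-ID lists, separated by eos_id, into fixed-size blocks."""
    buffer: List[int] = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Assumption: any trailing partial block is simply dropped.


# Toy example with block_size=4 instead of 1,024 and a hypothetical eos_id of 0.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_into_blocks(toy_docs, block_size=4, eos_id=0))
print(blocks)
```

With 1,024-token blocks, a 1B-token budget corresponds to roughly 976,000 training sequences, and documents can straddle block boundaries as the second toy block shows.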

Training Procedure

Optimizer: AdamW (fused) with β₁=0.9, β₂=0.95, weight decay 0.1, gradient clipping at 1.0.
Learning rate: 1e-4 peak with 3% warmup then cosine decay.
Batching: per-device batch size 32, gradient accumulation 2 → 65,536 tokens/step.
Budget: 1,000,000,000 effective tokens (≈15,259 steps).
Hardware: Single 24GB+ NVIDIA GPU with TF32 + Flash Attention enabled.
Best checkpoint: step 14,000, with eval loss 4.99 on WikiText-2 (recorded in trainer_state.json).
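
The batching and budget figures above follow from simple arithmetic, and the learning-rate schedule can be written as a short function. This is a sketch under stated assumptions: a single device, linear warmup, warmup steps rounded down, and cosine decay to zero. The card only says "3% warmup then cosine decay", so the warmup shape and the final learning rate are guesses, and `lr_at` is a hypothetical helper rather than the training code.

```python
import math

SEQ_LEN = 1_024            # tokens per packed block
PER_DEVICE_BATCH = 32      # sequences per forward pass
GRAD_ACCUM = 2             # gradient accumulation steps
TOKEN_BUDGET = 1_000_000_000
PEAK_LR = 1e-4
WARMUP_FRACTION = 0.03

# One optimizer step on a single GPU consumes this many tokens.
tokens_per_step = PER_DEVICE_BATCH * GRAD_ACCUM * SEQ_LEN

# Steps needed to cover the full token budget.
total_steps = math.ceil(TOKEN_BUDGET / tokens_per_step)

# Assumption: warmup length is the fraction of total steps, rounded down.
warmup_steps = int(WARMUP_FRACTION * total_steps)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed linear warmup, cosine to 0)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"tokens/step: {tokens_per_step:,}")
print(f"total steps: {total_steps:,}")
print(f"warmup steps: {warmup_steps}")
print(f"lr at step 14,000: {lr_at(14_000):.2e}")
```

The computed values of 65,536 tokens per step and 15,259 total steps match the figures listed above. Under the decay-to-zero assumption, the best checkpoint at step 14,000 falls late in the schedule, where the learning rate is already a small fraction of its peak.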

Evaluation

Formal downstream evaluation has not been run yet. The only available metric is the validation cross-entropy on WikiText-2 recorded in trainer_state.json, which reached its best value of 4.9869 at step 14,000. If you benchmark the model (e.g., with lm-eval-harness), please consider contributing results back to the card via a PR.
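
For readers more used to perplexity than cross-entropy, the conversion is a single exponential. The sketch below assumes the logged loss is a mean per-token cross-entropy in nats (natural logarithm), which is the usual convention for causal language model training in 🤗 Transformers but is not stated on the card.

```python
import math

eval_loss = 4.9869  # best WikiText-2 validation cross-entropy from the card

# Assumption: the loss is a mean per-token value in nats, so perplexity = exp(loss).
perplexity = math.exp(eval_loss)

print(f"token-level perplexity ≈ {perplexity:.1f}")
```

This gives a token-level perplexity of roughly 146 under the SparkNet v5 tokenizer. It is not directly comparable to word-level WikiText-2 perplexities or to numbers from models that use a different tokenizer.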

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "DienerTech/sparknet-70m-v5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16 on older GPUs
    device_map="auto",  # requires the accelerate package
)

prompt = "In a distant research lab, a tiny transformer model awakened and"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=120,
    temperature=0.9,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Citation

@software{sparknet70mv5,
  author = {DienerTech},
  title = {SparkNet 70M v5},
  year = {2025},
  url = {https://huggingface.co/DienerTech/sparknet-70m-v5}
}

Please open an issue or PR on the DienerTech Hugging Face repo if you have feedback, evaluations, or fine-tuned variants to share.

Safetensors: model size 64.1M params, tensor type F32.
