DienerTech committed on
Commit
9fcbc63
·
verified ·
1 Parent(s): 11c49e5

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,99 @@
---
language:
- en
library_name: transformers
license: apache-2.0
tags:
- sparknet
- causal-lm
- text-generation
- gpt
- pytorch
- 70m
pipeline_tag: text-generation
model-index:
- name: SparkNet-70M-v5
  results: []
---

# SparkNet 70M v5

SparkNet 70M v5 is the final 70M-parameter checkpoint from the SparkNet research run by **DienerTech**. It is a compact GPT-2–style decoder (12 layers, 512 hidden size, 8 attention heads, 1024-token context) trained for ~1B tokens on a custom mixture of high-quality web and document corpora. The release ships with the SparkNet v5 tokenizer and weights stored in `model.safetensors`, ready for direct use via 🤗 Transformers.

## Model Details

- **Developer**: DienerTech
- **Architecture**: GPT-2–style causal decoder (approx. 70M parameters), dropout 0.1, cosine LR schedule, AdamW (fused); a config sketch follows this list.
- **Context length**: 1,024 tokens.
- **Tokenizer**: SparkNet v5 byte-level BPE (vocab size 50,257; EOS token id 50256; `<|pad|>` padding token).
- **Framework**: PyTorch / 🤗 Transformers 4.46+.
- **Checkpoint**: Converted to `model.safetensors` for safe loading; no `pytorch_model.bin` is left in the repo.
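
The architecture bullet above maps onto a stock `GPT2Config`. The following is a minimal sketch built from the values in this repo's `config.json` (it is not the original training script); with tied input/output embeddings the printed count comes to roughly 64M parameters, which matches the ~256 MB float32 checkpoint divided by 4 bytes.

```python
# Minimal sketch: rebuild the SparkNet 70M architecture from the shipped config.json values.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,          # 1,024-token context
    n_embd=512,                # hidden size
    n_layer=12,
    n_head=8,
    activation_function="gelu_new",
    resid_pdrop=0.1,           # dropout 0.1 throughout
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)

model = GPT2LMHeadModel(config)   # randomly initialized; use from_pretrained for the released weights
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # ~64.1M with tied embeddings
```
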
## Intended Use

- Lightweight text generation experiments, story/note drafting, or as a base for instruction-tuning / domain adaptation (LoRA, QLoRA, etc.); a LoRA sketch follows this list.
- Research on small-model scaling laws or tokenizer experimentation.
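
Since the list above mentions LoRA-style adaptation, here is a minimal, hypothetical sketch using the 🤗 PEFT library. The rank, alpha, and the `c_attn` target module are illustrative defaults for GPT-2-style blocks, not settings published by DienerTech.

```python
# Illustrative LoRA setup for lightweight fine-tuning (hyperparameters are assumptions, not official values).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "DienerTech/sparknet-70m-v5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,                        # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # fused QKV projection in GPT-2-style attention
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```
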
## Limitations & Risks

- No RLHF / instruction tuning; outputs will be generic next-token predictions and may require prompting tricks.
- Training data is predominantly public web/document text, so bias, toxicity, or outdated information may surface.
- Not evaluated for safety-critical deployments; perform your own alignment and red-teaming before production use.

## Training Data

- 1B tokens packed into 1,024-token blocks (`datasets/sparknet-v5-1b`); a packing sketch follows this list.
- Sources sampled uniformly across: `codelion/finepdfs-1B`, `codelion/dclm-baseline-1B`, `codelion/fineweb-edu-1B`, plus curated DienerTech blog data.
- Validation set: `wikitext-2-raw-v1` (standard Hugging Face split).
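
The preprocessing pipeline itself is not part of this repo. The sketch below shows one common way to build fixed 1,024-token blocks with 🤗 Datasets: tokenize, concatenate, and cut. The `text` column name and the use of `codelion/fineweb-edu-1B` as the example source are assumptions for illustration only.

```python
# Hypothetical packing sketch: tokenize documents, concatenate, and slice into 1,024-token blocks.
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DienerTech/sparknet-70m-v5")
block_size = 1024

# One of the listed sources; the "text" column name is assumed.
raw = load_dataset("codelion/fineweb-edu-1B", split="train")

def pack(batch):
    # A real pipeline would typically insert an EOS separator between documents.
    ids = list(chain.from_iterable(tokenizer(batch["text"])["input_ids"]))
    n = (len(ids) // block_size) * block_size
    blocks = [ids[i : i + block_size] for i in range(0, n, block_size)]
    # Causal LM: labels are the inputs; the shift happens inside the model.
    return {"input_ids": blocks, "labels": [b[:] for b in blocks]}

packed = raw.map(pack, batched=True, remove_columns=raw.column_names)
```
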
## Training Procedure

- **Optimizer**: AdamW (fused) with β₁=0.9, β₂=0.95, weight decay 0.1, gradient clipping at 1.0.
- **Learning rate**: 1e-4 peak with 3% warmup, then cosine decay.
- **Batching**: per-device batch size 32, gradient accumulation 2 → 65,536 tokens/step (see the arithmetic check after this list).
- **Budget**: 1,000,000,000 effective tokens (≈15,259 steps).
- **Hardware**: Single 24GB+ NVIDIA GPU with TF32 + Flash Attention enabled.
- **Best checkpoint**: step 14,000 with eval loss 4.99 on WikiText-2 (logged via `trainer_state.json`).
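
The tokens-per-step and step-count figures follow directly from the batch configuration above:

```python
# Quick arithmetic check on the reported training budget.
context_length = 1024
per_device_batch = 32
grad_accum = 2

tokens_per_step = context_length * per_device_batch * grad_accum
print(tokens_per_step)  # 65536 tokens per optimizer step

token_budget = 1_000_000_000
print(token_budget / tokens_per_step)  # ~15258.8, i.e. ≈15,259 steps to cover the 1B-token budget
```
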
## Evaluation

Formal downstream evaluation has not been run yet. Inside `trainer_state.json`, the best validation (WikiText-2) cross-entropy reached **4.9869** at step 14k. If you benchmark the model (e.g., with lm-eval-harness), please consider contributing results back to the card via a PR.
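
For intuition, that cross-entropy converts to perplexity as `exp(loss)`, assuming the value in `trainer_state.json` is the usual mean per-token loss in nats:

```python
# Convert the reported WikiText-2 cross-entropy to perplexity (assumes per-token loss in nats).
import math

eval_loss = 4.9869
print(f"perplexity ≈ {math.exp(eval_loss):.1f}")  # ≈ 146.5
```
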
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "DienerTech/sparknet-70m-v5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16 on older GPUs
    device_map="auto",
)

prompt = "In a distant research lab, a tiny transformer model awakened and"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=120,
    temperature=0.9,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Citation

```
@software{sparknet70mv5,
  author = {DienerTech},
  title = {SparkNet 70M v5},
  year = {2025},
  url = {https://huggingface.co/DienerTech/sparknet-70m-v5}
}
```

Please open an issue or PR on the DienerTech Hugging Face repo if you have feedback, evaluations, or fine-tuned variants to share.
config.json ADDED
@@ -0,0 +1,31 @@
{
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "dtype": "float32",
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 512,
  "n_head": 8,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.57.1",
  "use_cache": false,
  "vocab_size": 50257
}
generation_config.json ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.57.1"
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:223dace1d2c6be60c9f8793863e1795b36a24e53ff188b83d0949bd6af0c49e6
size 256356888
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,19 @@
{
  "added_tokens_decoder": {
    "1": {
      "content": "<|pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": "",
  "extra_special_tokens": {},
  "model_max_length": 1024,
  "pad_token": "",
  "padding_side": "right",
  "tokenizer_class": "PreTrainedTokenizerFast"
}
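
The `pad_token` field above is empty in the shipped config, while the model card and `added_tokens_decoder` point to `<|pad|>` (id 1). If batched inference needs padding and `AutoTokenizer` does not resolve a pad token, one hedged workaround is to assign `<|pad|>` explicitly; this is an assumption about the intended setup, not documented behaviour.

```python
# Hypothetical workaround: ensure a pad token is set before padding batched prompts.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DienerTech/sparknet-70m-v5")
if tokenizer.pad_token is None:
    tokenizer.pad_token = "<|pad|>"  # assumption: <|pad|> (id 1) is the intended padding token

batch = tokenizer(
    ["SparkNet is", "Small language models can"],
    padding=True,                    # right-padded, per padding_side in this config
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```
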
training_metadata.json ADDED
@@ -0,0 +1,15 @@
{
  "run_name": "sparknet-70m-v5",
  "timestamp": "2025-11-15T07:24:08.029637",
  "params": {
    "n_embd": 512,
    "n_layer": 12,
    "n_head": 8,
    "context_length": 1024,
    "token_budget": 1000000000
  },
  "datasets": [
    "sparknet-v5-1b"
  ],
  "notes": "V5 | Custom tokenizer, dropout, cosine LR, static 1B token dataset."
}