kunjcr2 committed
Commit 4e17640 · verified · 1 Parent(s): 82b46bf

Update README.md
Files changed (1):
1. README.md +247 -3
README.md CHANGED
---
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- decoder-only
- nlp
- autoregressive
- rope
- gqa
- rmsnorm
- swiglu
- from-scratch
datasets:
- roneneldan/TinyStories
license: apache-2.0
model-index:
- name: GatorGPT2
  results: []
---

# 🐊 GatorGPT2

**GatorGPT2** is a small, decoder-only Transformer trained from scratch on a subset of **TinyStories** for next-token prediction.
It uses **RoPE** (rotary positional embeddings), **GQA** (grouped-query attention), **RMSNorm**, and a **SwiGLU** MLP.
The tokenizer is **tiktoken** with the **p50k_base** vocabulary.

> **Repo**: `kunjcr2/GatorGPT2`
> **Intended use**: research, experimentation, and educational demos for training and serving custom LMs

---

## 🔧 Architecture

- **Type**: Decoder-only, causal LM
- **Layers**: `num_hidden_layers = 10`
- **Hidden size**: `hidden_size = 448`
- **Heads**: `num_attention_heads = 8` (GQA; 2 KV heads shared across the 8 query heads)
- **FFN**: SwiGLU, `d_ff ≈ 2× hidden_size`
- **Norm**: RMSNorm (pre-norm blocks)
- **Positional encoding**: RoPE
- **Vocab**: `vocab_size = 50257` (tiktoken p50k_base)
- **Context length**: `max_position_embeddings = 1024`
- **Weight tying**: output head tied with the token embeddings
- **Files**:
  - `pytorch_model.bin` (or `model.safetensors`)
  - `config.json` (`model_type: "gator-transformer"`, `auto_map` provided)
  - `modeling_gator.py`, `configuration_gator.py`, `__init__.py`
  - `tokenizer_manifest.json` → `{ "library": "tiktoken", "encoding": "p50k_base" }`

> Custom code is loaded via `trust_remote_code=True`.
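
For intuition, here is a minimal, generic sketch of the GQA head layout implied by the numbers above (8 query heads sharing 2 KV heads, `head_dim = 448 / 8 = 56`). It is an illustration only, not the code in `modeling_gator.py`; RoPE and the causal mask are omitted.

```python
import torch

# Dimensions from the table above; the layout is a generic GQA illustration.
hidden_size, n_q_heads, n_kv_heads, seq_len = 448, 8, 2, 16
head_dim = hidden_size // n_q_heads         # 56
group_size = n_q_heads // n_kv_heads        # 4 query heads share each KV head

q = torch.randn(1, n_q_heads, seq_len, head_dim)
k = torch.randn(1, n_kv_heads, seq_len, head_dim)
v = torch.randn(1, n_kv_heads, seq_len, head_dim)

# Expand the 2 KV heads so every group of 4 query heads sees its shared KV head
k = k.repeat_interleave(group_size, dim=1)  # (1, 8, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1) @ v
print(attn.shape)                           # torch.Size([1, 8, 16, 56])
```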

---

## 📦 Install

```bash
pip install torch transformers tiktoken
```

---

## 🚀 Quickstart (Transformers + tiktoken)

```python
import torch
from transformers import AutoModelForCausalLM
import tiktoken

MODEL_ID = "kunjcr2/GatorGPT2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load model (uses custom modeling code)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float32,
).to(DEVICE).eval()

# Tokenizer (p50k_base via tiktoken)
tok = tiktoken.get_encoding("p50k_base")

def generate_greedy(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tok.encode(prompt)
    x = torch.tensor([ids], device=DEVICE)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(x)
        logits = out["logits"] if isinstance(out, dict) else out.logits
        next_id = int(torch.argmax(logits[0, -1]))
        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()

print(generate_greedy("Little girl was"))
```

### Temperature-only sampling (no top-k/p)

```python
def generate_temp(prompt, max_new_tokens=64, temperature=0.9):
    ids = tok.encode(prompt)
    x = torch.tensor([ids], device=DEVICE)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(x).logits[0, -1] / max(temperature, 1e-6)
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1).item()
        x = torch.cat([x, torch.tensor([[next_id]], device=DEVICE)], dim=1)
    return tok.decode(x[0].tolist()).replace("<|endoftext|>", "").strip()
```

---

## 🌐 Serving with vLLM (Optional)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model kunjcr2/GatorGPT2 \
  --tokenizer kunjcr2/GatorGPT2 \
  --trust-remote-code \
  --dtype float32 \
  --max-model-len 1024 \
  --host 0.0.0.0 --port 8000
```

Call it:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"kunjcr2/GatorGPT2","prompt":"Little girl was","max_tokens":64,"temperature":0.9}'
```
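
The same request from Python, as a sketch: it assumes the `openai` client package (v1+) is installed and that the server above is running on port 8000. vLLM does not check the API key by default, so any placeholder works.

```python
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="kunjcr2/GatorGPT2",
    prompt="Little girl was",
    max_tokens=64,
    temperature=0.9,
)
print(resp.choices[0].text)
```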

---

## 🧪 Training Summary

* **Data**: `roneneldan/TinyStories` (train split; a subset of the ~1.5M stories)
* **Objective**: causal LM (next-token prediction), cross-entropy loss
* **Optimizer**: AdamW (`lr=3e-4`, `weight_decay=0.01`, `eps=1e-8`)
* **Precision**: bf16 autocast on CUDA during the forward pass, for speed
* **Batching**: sliding windows via a `FastDataset` (e.g. window size 512, stride 256); see the sketch below
* **Eval**: periodic validation over fixed batches; train loss downsampled to eval steps for plotting
* **Hardware**: intended for A100-class GPUs; also runs on CPU for debugging (slowly)

> This is a *from-scratch* toy/educational model; quality depends heavily on the number of training steps, data cleaning, and the schedule. Expect simple, short English generations.
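
A minimal sketch of the sliding-window batching and optimizer setup described above. The `FastDataset` below is a stand-in written for this card (the real class lives in the training code), and the snippet reuses `model` and `DEVICE` from the Quickstart.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FastDataset(Dataset):
    """Sliding windows over one long token stream (stand-in, not the project's class)."""
    def __init__(self, token_ids, window=512, stride=256):
        self.tokens = torch.tensor(token_ids, dtype=torch.long)
        self.starts = range(0, len(token_ids) - window - 1, stride)
        self.window = window

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, i):
        s = self.starts[i]
        chunk = self.tokens[s : s + self.window + 1]
        return chunk[:-1], chunk[1:]  # inputs and next-token targets

# Optimizer settings as reported above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01, eps=1e-8)

def train_step(x, y):
    # bf16 autocast on CUDA for the forward pass, as in the training summary
    with torch.autocast("cuda", dtype=torch.bfloat16, enabled=torch.cuda.is_available()):
        logits = model(x.to(DEVICE)).logits
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.to(DEVICE).view(-1)
        )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```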

---

## ✅ Intended Use

* Research on small decoder-only Transformers
* Educational demos (training, saving, pushing to the model hub, vLLM serving)
* Baseline for experimenting with:

  * LoRA/QLoRA, quantization, distillation
  * Attention variants (Flash-Attention, GQA configs)
  * Data curation and scaling laws

**Not** intended for production or safety-critical use.

---

## ⚠️ Limitations & Risks

* Trained on children's story data ⇒ limited world knowledge & reasoning
* May output incoherent, repetitive, or undesirable text
* No instruction tuning or RLHF
* The tokenizer is `tiktoken p50k_base` (not a standard HF tokenizer), so the examples use `tiktoken` directly

---

## 📁 Repo Structure

```
.
├── config.json
├── pytorch_model.bin          # or model.safetensors
├── modeling_gator.py          # custom architecture (RoPE, GQA, RMSNorm, SwiGLU)
├── configuration_gator.py
├── __init__.py
└── tokenizer_manifest.json    # { "library": "tiktoken", "encoding": "p50k_base" }
```

`config.json` includes:

```json
{
  "model_type": "gator-transformer",
  "architectures": ["GatorModel"],
  "auto_map": {
    "AutoConfig": "configuration_gator.GatorConfig",
    "AutoModelForCausalLM": "modeling_gator.GatorModel"
  }
}
```
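
Because of the `auto_map` entries, `AutoConfig`/`AutoModelForCausalLM` resolve to the custom classes when `trust_remote_code=True`. A quick sanity check:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("kunjcr2/GatorGPT2", trust_remote_code=True)
print(type(cfg).__name__)   # expected: GatorConfig (from configuration_gator.py)
print(cfg.model_type)       # "gator-transformer"
```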
198
+
199
+ ---
200
+
201
+ ## πŸ“Š Evaluation
202
+
203
+ No formal benchmarks reported. You can compute loss/perplexity on your own validation subset:
204
+
205
+ ```python
206
+ import math, torch
207
+ from torch.utils.data import DataLoader, TensorDataset
208
+
209
+ # ...build a DataLoader of (input_ids, target_ids) pairs...
210
+ def eval_loss(model, loader, device="cuda"):
211
+ model.eval(); total, n = 0.0, 0
212
+ with torch.no_grad():
213
+ for x, y in loader:
214
+ x, y = x.to(device), y.to(device)
215
+ logits = model(x).logits
216
+ loss = torch.nn.functional.cross_entropy(
217
+ logits.view(-1, logits.size(-1)), y.view(-1)
218
+ )
219
+ total += loss.item(); n += 1
220
+ return total / max(n,1)
221
+
222
+ val_loss = eval_loss(model, your_val_loader)
223
+ print("val loss:", val_loss, " ppl:", math.exp(val_loss))
224
+ ```
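
One way to build `your_val_loader` (the placeholder above), shown as a hedged sketch: it assumes the 🤗 `datasets` library is installed, tokenizes a slice of the TinyStories validation split with the same `p50k_base` encoding, and cuts fixed windows; the window size mirrors the training summary but is otherwise arbitrary.

```python
import tiktoken
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, TensorDataset

tok = tiktoken.get_encoding("p50k_base")
val_texts = load_dataset("roneneldan/TinyStories", split="validation[:1000]")["text"]

# Concatenate stories into one token stream, then cut fixed 512-token windows
stream = [t for s in val_texts for t in tok.encode(s) + [tok.eot_token]]
window, xs, ys = 512, [], []
for i in range(0, len(stream) - window - 1, window):
    chunk = stream[i : i + window + 1]
    xs.append(chunk[:-1])   # inputs
    ys.append(chunk[1:])    # next-token targets

your_val_loader = DataLoader(
    TensorDataset(torch.tensor(xs), torch.tensor(ys)), batch_size=8
)
# eval_loss(model, your_val_loader, device=DEVICE) then works as above
```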

---

## 📜 License

**Apache-2.0**

---

## 🙌 Acknowledgements

* **TinyStories** dataset by Ronen Eldan et al. (`roneneldan/TinyStories`)
* Community tooling: **PyTorch**, **🤗 Transformers**, **tiktoken**, **vLLM**

---

## ✉️ Citation

If you use this model, please cite this repository:

```bibtex
@software{GatorGPT2_2025,
  author = {Kunj},
  title  = {GatorGPT2: a small decoder-only Transformer with RoPE+GQA},
  year   = {2025},
  url    = {https://huggingface.co/kunjcr2/GatorGPT2}
}
```