mav23 committed
Commit 5a9f4f0 · verified · 1 Parent(s): 93d59c1

Upload folder using huggingface_hub

Files changed (3):
  1. .gitattributes +1 -0
  2. README.md +368 -0
  3. biggie-smollm-0.15b-base.Q4_0.gguf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+biggie-smollm-0.15b-base.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,368 @@
---
base_model: HuggingFaceTB/SmolLM-135M
datasets:
- LDJnr/Capybara
inference:
  parameters:
    model_file: biggie_groked_int8_q8_0.gguf
    temperature: 1
license: mit
---

### TINY Frankenstein of [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) upped to 0.18b
Use this frankenbase for training.
Sorry for the mislabelling: the model is 0.18B (181M parameters), not 0.15B.
I did not expect this repo to blow up, and now all the training scripts depend on it.

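If you just want to pull the frankenbase into your own training script, here's a minimal loading sketch with plain 🤗 Transformers (standard `AutoModelForCausalLM` usage; the bfloat16 dtype is only a suggestion):

```python
# Minimal sketch: load the frankenbase for further training.
# Standard transformers usage; pick whatever dtype/device fits your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nisten/Biggie-SmoLlm-0.15B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Should print roughly 181M parameters (hence 0.18b, not 0.15b).
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```
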
* ## CITE WORK FROM THIS HF PAGE AND [@cognitivecompai](https://huggingface.co/ehartford)'s OPTIMIZER IN YOUR FUTURE PAPERS OR I WILL DRAG YOUR ORG ON TWITTER LIKE I DID WITH COHERE LOL (we're cool now btw, visited them :)
* https://github.com/cognitivecomputations/grokadamw
* https://github.com/SakanaAI/evolutionary-model-merge/
* https://huggingface.co/blog/smollm

> [!TIP]
> 🐧 If you're impatient, get the trained checkpoint file that runs on 1 CPU core:
>
> `wget https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base/resolve/main/biggie_groked_int8_q8_0.gguf`
>
> Make sure to install the latest llama.cpp first; it's easy on Linux and Mac:
>
> `git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j`

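If you'd rather grab the file from Python instead of wget, a small sketch with `huggingface_hub` (the standard `hf_hub_download` helper; it caches locally and returns the path):

```python
# Alternative to the wget above: fetch the GGUF checkpoint via huggingface_hub.
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="nisten/Biggie-SmoLlm-0.15B-Base",
    filename="biggie_groked_int8_q8_0.gguf",
)
print(gguf_path)  # local cache path of the ~164 MB file
```
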
Now for the magic trained finetune that runs at insane speeds:

The settings are very finicky, so be careful with your experimentation.
```bash
./llama-cli -fa -b 512 -ctv q8_0 -ctk q8_0 --min-p 0.3 --top-p 0.85 --keep -1 \
  -p "You are a NASA JPL Scientist. Human: I want to bring my cat to mars." \
  --in-prefix "<|im_start|>Human:" --reverse-prompt "Human:" \
  -m biggie_groked_int8_q8_0.gguf -co -cnv \
  -c 1024 -n 700 --temp 1.5 -ngl 0 -t 1
```
Yup, that's no GPU, 1 CPU core.
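
If you'd rather drive the same GGUF from Python, here's a rough equivalent using the llama-cpp-python bindings; treat the parameter names (`n_threads`, `min_p`, etc.) as assumptions about that library's API and check them against the version you install. The values simply mirror the flags above (the `-fa` flash-attention flag is left out):

```python
# Rough Python equivalent of the llama-cli call above, using llama-cpp-python
# (pip install llama-cpp-python). Parameter names are assumptions about that
# library's API; the values mirror the CLI flags.
from llama_cpp import Llama

llm = Llama(
    model_path="biggie_groked_int8_q8_0.gguf",
    n_ctx=1024,       # -c 1024
    n_threads=1,      # -t 1  (single CPU core)
    n_gpu_layers=0,   # -ngl 0 (no GPU)
)

out = llm(
    "You are a NASA JPL Scientist. Human: I want to bring my cat to mars.\nAssistant:",
    max_tokens=700,   # -n 700
    temperature=1.5,  # --temp 1.5
    top_p=0.85,       # --top-p 0.85
    min_p=0.3,        # --min-p 0.3
    stop=["Human:"],  # --reverse-prompt "Human:"
)
print(out["choices"][0]["text"])
```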

This base model was built via semi-automated continuous merging to figure out the recipe.
The model is noticeably more coherent.

The temperature and min-p settings still need to be adjusted, but even at the default temp 0 it stayed coherent for the first 100 tokens.
It's an amazing option for further training. And this is a merge of the base, not the instruct!

## 🧠 What's Really Going Down Here?

We're talking about a convergence of a whole bunch of stuff; more papers will be written about this:

1. **Evolutionary Merging**
2. **BitNet Integration**
3. **Experimental GrokAdamW Optimizer**

## Prior work, from last week

Credit for the optimizer goes to [@cognitivecompai](https://github.com/cognitivecomputations/grokadamw) for laying the groundwork with the original GrokAdamW optimizer.

## LET'S TRY OUT THE EXPERIMENTAL GROKKED FINETUNE:

```bash
wget https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base/resolve/main/biggie_groked_int8_q8_0.gguf
```

Yes, we will be talking with a 164 MB file that runs at 160 tokens per second on a single CPU core.
## You read all of that correctly: yes, 1 CPU core, 160 tps https://x.com/nisten/status/1819752034305970649
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/nTNISjByBkN7bJZzuOvOw.png)

## 🚀 Run it with NO GPU and only one CPU core with these settings
```bash
./llama-cli -fa -b 512 -ctv q8_0 -ctk q8_0 --min-p 0.3 --top-p 0.85 --keep -1 \
  -p "You are a NASA JPL Scientist. Human: I want to bring my cat to mars." \
  -m biggie_groked_int8_q8_0.gguf -co -cnv \
  --in-prefix "<|im_start|>Human:" --reverse-prompt "Human:" \
  -c 1024 -n 512 --temp 1.5 -ngl 0 -t 1
```

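To sanity-check the single-core tokens-per-second claim from Python, here is a small timing sketch (again assuming the llama-cpp-python API from the earlier example; the `usage` fields follow its OpenAI-style response and may differ by version, and the throughput you see will depend on your CPU):

```python
# Rough single-core throughput check for the GGUF checkpoint.
import time
from llama_cpp import Llama

llm = Llama(model_path="biggie_groked_int8_q8_0.gguf",
            n_ctx=1024, n_threads=1, n_gpu_layers=0)

start = time.time()
out = llm("Human: Tell me about llamas.\nAssistant:", max_tokens=256, temperature=1.0)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"~{generated / elapsed:.0f} tokens/sec on one CPU core")
```
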
## 🏋️ Training Tutorial: MAKE YOUR OWN BIGGIE_SMOLLM

Clone the repo like you're stealing code from the future:
```bash
git clone https://github.com/nisten/grokadamw
cd grokadamw
```

Fire up the training script and watch the magic happen:
```bash
python smoltrainer.py
```

## 💻 Do it from scratch yourself
Install the secret sauce (dependencies):
```bash
pip install torch transformers datasets tqdm
```

Make a file named `meow.py`, copy-paste in the code below, and then run it with `python meow.py`:

```python
import torch
import torch.nn as nn
import logging
from datasets import load_dataset, Dataset
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from torch.cuda.amp import autocast
import warnings
from tqdm import tqdm

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

MODEL_NAME = "nisten/Biggie-SmoLlm-0.15B-Base"
MAX_LENGTH = 2048
BATCH_SIZE = 8
LEARNING_RATE = 2e-4
MAX_STEPS = 3000
GRADIENT_ACCUMULATION_STEPS = 2
NUM_WARMUP_STEPS = 30
OUTPUT_DIR = "./capybara_finetuned_results"

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# GrokAdamW: an AdamW variant that folds an EMA of the gradient (the "grok" term)
# into the update, modulated by an external grokking signal (here: the training loss).
class GrokAdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2,
                 alpha_init=0.98, lamb=2.0, gamma=0.1, grokking_signal_fns=None,
                 grokking_signal_decay_rate=0.1, gradient_clipping=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
                        alpha_init=alpha_init, lamb=lamb, gamma=gamma,
                        grokking_signal_fns=grokking_signal_fns,
                        grokking_signal_decay_rate=grokking_signal_decay_rate,
                        gradient_clipping=gradient_clipping)
        super(GrokAdamW, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            grokking_signal = self._compute_grokking_signal(group)
            for i, p in enumerate(group['params']):
                if p.grad is None:
                    continue
                grad = p.grad

                if group['gradient_clipping'] > 0:
                    grad = torch.clamp(grad, -group['gradient_clipping'], group['gradient_clipping'])

                state = self.state[p]

                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['grok_ema'] = torch.zeros_like(p, memory_format=torch.preserve_format)

                exp_avg, exp_avg_sq, grok_ema = state['exp_avg'], state['exp_avg_sq'], state['grok_ema']
                beta1, beta2 = group['betas']

                state['step'] += 1

                # Per-parameter momentum decay: later parameters get a smaller beta1
                layer_beta1 = beta1 * (1 - group['gamma'])**i

                # Grokking EMA: blend the current gradient into a slow-moving average,
                # with a blend rate that decays as the grokking signal grows
                alpha = group['alpha_init'] * torch.exp(torch.tensor(-group['grokking_signal_decay_rate'] * grokking_signal))
                grok_ema.mul_(alpha).add_(grad, alpha=1 - alpha)
                grok_grad = grad + group['lamb'] * grok_ema

                exp_avg.mul_(layer_beta1).add_(grok_grad, alpha=1 - layer_beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grok_grad, grok_grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group['eps'])
                step_size = group['lr']

                # Decoupled weight decay, as in AdamW
                if group['weight_decay'] != 0:
                    p.data.mul_(1 - group['lr'] * group['weight_decay'])

                p.addcdiv_(exp_avg, denom, value=-step_size)

        return loss

    def _compute_grokking_signal(self, group):
        if group['grokking_signal_fns'] is None:
            return 0.0

        signals = []
        for fn in group['grokking_signal_fns']:
            try:
                signal = fn()
                if signal is not None:
                    signals.append(signal)
            except Exception as e:
                logger.warning(f"Error in grokking_signal_fn: {e}. Ignoring this function.")

        if not signals:
            return 0.0

        return sum(signals) / len(signals)

def format_capybara_prompts(examples):
    texts = []
    for conversation in examples['conversation']:
        formatted_text = ""
        for turn in conversation:
            if 'input' in turn:
                formatted_text += f"Human: {turn['input']}\n\n"
            if 'output' in turn:
                formatted_text += f"Assistant: {turn['output']}\n\n"
        texts.append(formatted_text.strip())
    return {"text": texts}

# Trainer subclass that exposes the current loss so GrokAdamW can use it as its signal.
class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.grokking_signal = 0.0

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        return (loss, outputs) if return_outputs else loss

    def training_step(self, model, inputs):
        model.train()
        inputs = self._prepare_inputs(inputs)

        with autocast(dtype=torch.bfloat16):
            loss = self.compute_loss(model, inputs)

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        loss.backward()

        # Feed the current loss back to GrokAdamW as the grokking signal
        self.grokking_signal = loss.item()

        return loss.detach()

def grokking_signal_fn():
    return trainer.grokking_signal

def main():
    logger.info(f"🚀 Initializing {MODEL_NAME} finetuning with GrokAdamW")

    try:
        config = AutoConfig.from_pretrained(MODEL_NAME)
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
    except Exception as e:
        logger.error(f"❌ Failed to load model or tokenizer: {str(e)}")
        return

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = model.config.eos_token_id

    logger.info("📚 Loading Capybara dataset")
    try:
        capybara_dataset = load_dataset("LDJnr/Capybara", split="train")
        capybara_dataset = capybara_dataset.map(format_capybara_prompts, batched=True, remove_columns=capybara_dataset.column_names)
    except Exception as e:
        logger.error(f"❌ Failed to load Capybara dataset: {str(e)}")
        return

    logger.info(f"📊 Capybara dataset size: {len(capybara_dataset)}")

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LENGTH)

    logger.info("🔢 Tokenizing dataset")
    tokenized_dataset = capybara_dataset.map(tokenize_function, batched=True, remove_columns=capybara_dataset.column_names)

    logger.info("🏋️ Setting up the training arguments")
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=3,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        learning_rate=LEARNING_RATE,
        weight_decay=0.01,
        bf16=True,
        logging_steps=10,
        save_steps=300,
        save_total_limit=10,
        dataloader_num_workers=4,
        warmup_steps=NUM_WARMUP_STEPS,
        gradient_checkpointing=True,
        evaluation_strategy="steps",
        eval_steps=300,
        max_steps=MAX_STEPS,
        fp16=False,
        optim="adamw_hf",
        lr_scheduler_type="cosine",
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
    )

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    optimizer = GrokAdamW(
        model.parameters(),
        lr=LEARNING_RATE,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.01,
        alpha_init=0.98,
        lamb=2.0,
        gamma=0.1,
        grokking_signal_fns=[grokking_signal_fn],
        grokking_signal_decay_rate=0.1,
        gradient_clipping=1.0
    )

    logger.info("🏃‍♂️ Initializing Trainer with GrokAdamW")
    global trainer
    trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        eval_dataset=tokenized_dataset.select(range(min(1000, len(tokenized_dataset)))),
        data_collator=data_collator,
        optimizers=(optimizer, None),
    )

    logger.info("🔥 Starting the training with GrokAdamW")
    try:
        trainer.train()
    except Exception as e:
        logger.error(f"❌ Training failed: {str(e)}")
        return

    logger.info("💾 Saving the model")
    try:
        trainer.save_model(OUTPUT_DIR)
    except Exception as e:
        logger.error(f"❌ Failed to save model: {str(e)}")

    logger.info("🎉 Finetuning with GrokAdamW completed!")

if __name__ == "__main__":
    main()
```
🚀 Now go forth and train, accelerate that code!

> **Note:** You'll need about 14 GB of VRAM. If you only have 8 GB, change `BATCH_SIZE` to 4.

Results will appear in `./capybara_finetuned_results`.
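
Once training finishes, here is a quick sketch for reloading the saved checkpoint and sampling from it (plain transformers `generate`; the prompt and sampling values are just examples):

```python
# Minimal sketch: reload the finetuned weights and sample from them.
# The script above does not save the tokenizer, so it is taken from the base repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nisten/Biggie-SmoLlm-0.15B-Base")
model = AutoModelForCausalLM.from_pretrained("./capybara_finetuned_results",
                                             torch_dtype=torch.bfloat16)

prompt = "Human: I want to bring my cat to mars.\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128, do_sample=True,
                        temperature=0.8, top_p=0.85)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```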

---

### Author

**Nisten Tahiraj**  
🏢 [rakun.ai](https://rakun.ai)  
📍 Toronto, Canada

---
Happy training!
<video controls autoplay muted src="https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/WCLhKzZWbrLo8BETGaKvI.qt"></video>
biggie-smollm-0.15b-base.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c17f1378f6a9b674f69b3649ead902e48bbd7812268d00ff0dc1f5a11691df74
+size 117630976