Nikity committed on
Commit
0e7be3d
·
verified ·
1 Parent(s): f838414

initial public release

Files changed (1)
  1. README.md +351 -3
README.md CHANGED
@@ -1,3 +1,351 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - Nikity/Kyoto-Corpus
5
+ language:
6
+ - en
7
+ base_model:
8
+ - Nikity/lille-130m-base
9
+ new_version: Nikity/lille-130m-instruct
10
+ model-index:
11
+ - name: lille-130m-instruct
12
+   results:
13
+   - task:
14
+       type: text-generation
15
+     dataset:
16
+       name: arc_challenge
17
+       type: arc_challenge
18
+     metrics:
19
+     - name: ARC (Challenge)
20
+       type: Accuracy
21
+       value: 15.05
22
+   - task:
23
+       type: text-generation
24
+     dataset:
25
+       name: arc_easy
26
+       type: arc_easy
27
+     metrics:
28
+     - name: ARC (Easy)
29
+       type: Accuracy
30
+       value: 21.4
31
+   - task:
32
+       type: text-generation
33
+     dataset:
34
+       name: gpqa
35
+       type: gpqa
36
+     metrics:
37
+     - name: GPQA
38
+       type: Accuracy
39
+       value: 12.73
40
+   - task:
41
+       type: text-generation
42
+     dataset:
43
+       name: gsm8k
44
+       type: gsm8k
45
+     metrics:
46
+     - name: GSM8K
47
+       type: Accuracy
48
+       value: 7.73
49
+   - task:
50
+       type: text-generation
51
+     dataset:
52
+       name: ifeval
53
+       type: ifeval
54
+     metrics:
55
+     - name: IFEVAL
56
+       type: Accuracy
57
+       value: 9.01
58
+   - task:
59
+       type: text-generation
60
+     dataset:
61
+       name: math
62
+       type: math
63
+     metrics:
64
+     - name: MATH (Level 5)
65
+       type: Accuracy
66
+       value: 1.91
67
+   - task:
68
+       type: text-generation
69
+     dataset:
70
+       name: mmlu
71
+       type: mmlu
72
+     metrics:
73
+     - name: MMLU
74
+       type: Accuracy
75
+       value: 22.76
76
+   - task:
77
+       type: text-generation
78
+     dataset:
79
+       name: mt_bench
80
+       type: mt_bench
81
+     metrics:
82
+     - name: MT-Bench
83
+       type: Accuracy
84
+       value: 8.2
85
+   - task:
86
+       type: text-generation
87
+     dataset:
88
+       name: truthful_qa
89
+       type: truthful_qa
90
+     metrics:
91
+     - name: TruthfulQA
92
+       type: Accuracy
93
+       value: 9.06
94
+ ---
95
+
96
+ # Lille 130M Instruct
97
+
98
+ ![Lille-Header](assets/lille-header.png)
99
+
100
+ > **You are currently viewing the `lille-130m-instruct` model card.**
101
+ >
102
+ > View the base model here: **[Nikity/lille-130m-base](https://huggingface.co/Nikity/lille-130m-base)**
103
+
104
+ ## Table of Contents
105
+ 1. [Model Summary](#-model-summary)
106
+ 2. [Evaluation](#-evaluation)
107
+ 3. [How to Use](#-how-to-use)
108
+ 4. [Training and Finetuning](#-training-and-finetuning)
109
+ 5. [Training Details](#-training-details)
110
+ 6. [Limitations](#limitations)
111
+ 7. [The Truly Open-Source Repos](#-the-truly-open-source-repos)
112
+ 8. [License](#-license)
113
+ 9. [Citation](#citation)
114
+
115
+ ## ✨ Model Summary
116
+
117
+ **Lille** is a 130-million-parameter language model built from the ground up as a core component of a completely open-source deep learning stack. The name Lille reflects both its compact size and strong capabilities, capturing the idea that less can be more. It draws on the Norwegian word lille (‘small’ or ‘little’) as well as the French city Lille, giving it both meaning and place. The model was trained using a custom tokenizer, a curated dataset, and a memory-efficient optimizer, all of which are publicly available.
118
+
119
+ The model comes in two versions:
120
+ * **`Lille-130M-Base`**: The foundational model pretrained on 4.27 billion tokens from the [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset, after a post-processing filter that keeps only the highest-quality content. It has strong general knowledge and text completion abilities.
121
+ * **`Lille-130M-Instruct`**: The instruction-tuned version, fine-tuned on the **[Kyoto-Corpus](https://huggingface.co/datasets/Nikity/Kyoto-Corpus)**. It excels at following user commands, engaging in chat, and performing a variety of instruction-based tasks.
122
+
123
+ The model architecture is a modern Transformer decoder featuring Grouped-Query Attention (GQA), RoPE, and RMSNorm, making it efficient and performant for its size.
124
+
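To make the efficiency point about GQA concrete, here is a rough back-of-the-envelope sketch of the key/value cache size with GQA versus full multi-head attention. It uses the values listed under Model Architecture (24 layers, embedding size 640, 10 attention heads, 2 KV heads, 512-token context). The head dimension of 64 (640 / 10), the bfloat16 cache (2 bytes per value), and the batch size of 1 are assumptions, and `kv_cache_bytes` is a hypothetical helper, not part of the Lille codebase.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the K and V caches for one sequence (batch size 1)."""
    # Factor 2 accounts for storing both keys and values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Values from the Model Architecture section.
N_LAYERS, D_MODEL, N_HEADS, N_KV_HEADS, CONTEXT = 24, 640, 10, 2, 512
HEAD_DIM = D_MODEL // N_HEADS  # ASSUMPTION: head_dim = embedding size / heads

gqa_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT)
mha_bytes = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, CONTEXT)

print(f"GQA cache: {gqa_bytes / 2**20:.1f} MiB")
print(f"MHA cache: {mha_bytes / 2**20:.1f} MiB")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks by the ratio of attention heads to KV heads, since only the number of cached heads changes.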
125
+ *Note on parameter count: While the model name is `130M` for simplicity, the actual parameter count is closer to 140 million.*
126
+
127
+ ## 📊 Evaluation
128
+
129
+ All evaluations were conducted using **[simple-eval](https://github.com/Nikityyy/simple-eval)**, our open-source evaluation framework. Benchmarks are run in a zero-shot setting unless specified otherwise.
130
+
131
+ #### `Lille-130M-Instruct`
132
+
133
+ ![Evaluations](assets/evaluations.png)
134
+
135
+ > Evaluations for other LLMs are sourced from the <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard</a>, or from their respective model cards when leaderboard data is unavailable. For Lille 130M Instruct, evaluations are performed using <a href="https://github.com/Nikityyy/simple-eval">simple-eval</a>. ARC-C and ARC-E for SmolLM2 are also evaluated using <a href="https://github.com/Nikityyy/simple-eval">simple-eval</a>.
136
+
137
+ ## 🚀 How to Use
138
+
139
+ ### 1. SimpleAI SDK (Recommended for Easy Use)
140
+
141
+ The easiest way to get started with Lille is by using the `simpleai-sdk`, which handles all the boilerplate for you and provides a simple, high-level API for both Hugging Face and ONNX backends.
142
+
143
+ ```bash
144
+ pip install simpleai-sdk
145
+ ```
146
+
147
+ ```python
148
+ from simple_ai import lille
149
+
150
+ # This will download and cache the model on first run.
151
+ # Specify the model version: "130m-instruct" (default) or "130m-base"
152
+ # Specify the backend: "huggingface" (default) or "onnx"
153
+ model = lille("huggingface", "130m-instruct")
154
+
155
+ # --- For Chat (with instruct model) ---
156
+ print("--- Chat Example ---")
157
+ response1 = model.chat("What is the capital of France?", max_new_tokens=50)
158
+ print(f"Bot: {response1}")
159
+
160
+ response2 = model.chat("And what is its population?", max_new_tokens=50, top_p=0.90)
161
+ print(f"Bot: {response2}")
162
+
163
+ # This resets the chat history
164
+ model.reset_chat()
165
+
166
+ # --- For Text Completion (with base or instruct model) ---
167
+ prompt = "Artificial Intelligence is"
168
+ response = model.generate(prompt, max_new_tokens=50, temperature=0.9)
169
+ print(f"\n--- Completion Example ---\n{prompt}{response}")
170
+ ```
171
+
172
+ ### 2. Standard Hugging Face Transformers (currently also requires `simpleai-sdk`)
173
+
174
+ You can also use the model directly with the `transformers` library for more advanced use cases.
175
+
176
+ ```bash
177
+ pip install transformers torch simpleai-sdk
178
+ ```
179
+
180
+ ```python
181
+ import torch
182
+ from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
183
+ from simple_ai.model_hf import LilleConfig, LilleForCausalLM
184
+
185
+ # 1. Register the custom model architecture with Hugging Face
186
+ AutoConfig.register("lille-130m", LilleConfig)
187
+ AutoModelForCausalLM.register(LilleConfig, LilleForCausalLM)
188
+
189
+ # 2. Define constants and setup device
190
+ MODEL = "Nikity/lille-130m-instruct"
191
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
192
+
193
+ # 3. Load tokenizer and model
194
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
195
+ model = AutoModelForCausalLM.from_pretrained(
196
+     MODEL,
197
+     torch_dtype="auto",
198
+     device_map=DEVICE,
199
+ )
200
+
201
+ # 4. Prepare chat prompt and tokenize it
202
+ chat = [{"role": "user", "content": "What is the capital of France?"}]
203
+ inputs = tokenizer.apply_chat_template(
204
+     chat,
205
+     add_generation_prompt=True,
206
+     return_tensors="pt"
207
+ ).to(DEVICE)
208
+
209
+ # 5. Generate a response
210
+ with torch.inference_mode():
211
+     outputs = model.generate(
212
+         input_ids=inputs,
213
+         max_new_tokens=512,
214
+         eos_token_id=tokenizer.eos_token_id,
215
+         pad_token_id=tokenizer.pad_token_id,
216
+         do_sample=True,
217
+         temperature=0.5,
218
+         top_p=0.95,
219
+     )
220
+
221
+ # 6. Decode and print the response
222
+ response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
223
+ print(response)
224
+ ```
225
+
226
+ ## 🚀 Training and Finetuning
227
+
228
+ You can replicate the pretraining of `Lille-130M-Base` or fine-tune it on your own dataset using the provided scripts.
229
+
230
+ #### 1. Setup
231
+
232
+ First, clone the repository and install the required dependencies:
233
+
234
+ ```bash
235
+ git clone https://github.com/Nikityyy/lille
236
+ cd lille
237
+ pip install -r requirements.txt
238
+ ```
239
+
240
+ **Note on the Optimizer:** The default `Sophia-Triton` optimizer requires the [Triton](https://triton-lang.org/main/getting-started/installation.html) library. Triton is officially supported on Linux with NVIDIA GPUs. While experimental installation on Windows is possible, it can be a complex and difficult process. For a much simpler setup on **Windows and macOS**, or if you prefer not to install Triton, it is highly recommended to use a pure PyTorch implementation of Sophia instead:
241
+
242
+ 1. Replace the contents of the `sophia_triton.py` file with the code from [this link](https://github.com/Liuhong99/Sophia/blob/main/sophia.py).
243
+ 2. The `train.py` script should work without any import changes, as the class name `SophiaG` is the same.
244
+
245
+ #### 2. Data Preparation
246
+
247
+ The training script expects data in a specific `.npz` format containing tokenized documents and their offsets.
248
+
249
+ **For Pretraining (like FineWeb-Edu):**
250
+
251
+ Use the `prepare_dataset_fineweb.py` script. It will stream the dataset from Hugging Face, apply filters, tokenize the text, and save it in the required format.
252
+
253
+ ```bash
254
+ python prepare_dataset_fineweb.py
255
+ ```
256
+ This will create `data/fineweb_edu_sample_10BT/train.npz` and `val.npz`.
257
+
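As an illustration of what "tokenized documents and their offsets" can look like inside an `.npz` file, here is a minimal sketch using NumPy. The array names `tokens` and `offsets`, the `uint16` dtype, and both helper functions are assumptions made for this example; the actual layout is whatever `prepare_dataset_fineweb.py` and `train.py` in the repository define.

```python
import numpy as np


def pack_documents(docs, path):
    """Store token-id lists as one flat array plus document boundaries."""
    lengths = np.array([len(d) for d in docs], dtype=np.int64)
    # offsets[i] is where document i starts; offsets[-1] is the total length.
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    # uint16 is enough for a 32,768-entry vocabulary (ASSUMED dtype).
    tokens = np.concatenate([np.asarray(d, dtype=np.uint16) for d in docs])
    np.savez(path, tokens=tokens, offsets=offsets)


def read_document(path, index):
    """Return the token ids of a single document from the packed file."""
    with np.load(path) as data:
        start, end = data["offsets"][index], data["offsets"][index + 1]
        return data["tokens"][start:end].tolist()
```

Keeping one flat token array with an offsets array avoids ragged storage and lets a data loader slice out any document without scanning the whole file.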
258
+ **For Finetuning (Instruction Datasets):**
259
+
260
+ Use the `prepare_dataset.py` script. Your input data should be a single `.txt` file where each example is separated by the `<|endoftext|>` token.
261
+
262
+ 1. Place your data file, for example, at `data/my_dataset/train.txt`.
263
+ 2. Modify the `input_file_path` and `output_dir` variables in `prepare_dataset.py`.
264
+ 3. Run the script:
265
+
266
+ ```bash
267
+ python prepare_dataset.py
268
+ ```
269
+ This will create `train.npz` and `val.npz` in your specified output directory.
270
+
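The input file for `prepare_dataset.py` described above can be assembled with a few lines of Python. This is a minimal sketch: `write_examples` and `read_examples` are hypothetical helpers, and how each example should be formatted internally (for instance, which chat markers it uses) is not specified here, so the strings below are placeholders.

```python
from pathlib import Path

EOT = "<|endoftext|>"  # separator token named in the README


def write_examples(examples, path):
    """Write all examples into one .txt file, each terminated by EOT."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    text = "".join(example.strip() + EOT for example in examples)
    path.write_text(text, encoding="utf-8")


def read_examples(path):
    """Split the file back into examples, dropping empty trailing chunks."""
    text = Path(path).read_text(encoding="utf-8")
    return [chunk for chunk in text.split(EOT) if chunk.strip()]
```

For example, `write_examples(["first example", "second example"], "data/my_dataset/train.txt")` produces the file that step 1 expects.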
271
+ #### 3. Running the Training Script
272
+
273
+ All training logic is handled by `train.py`. You can configure hyperparameters directly at the top of this file.
274
+
275
+ **To Pretrain from Scratch:**
276
+
277
+ 1. Ensure you have prepared a pretraining dataset.
278
+ 2. In `train.py`, set `finetune = False`.
279
+ 3. Configure pretraining parameters like `data_dir`, `batch_size`, etc.
280
+ 4. Run the script:
281
+
282
+ ```bash
283
+ python train.py
284
+ ```
285
+
286
+ **To Fine-tune a Pretrained Model:**
287
+
288
+ 1. Ensure you have prepared a fine-tuning dataset.
289
+ 2. In `train.py`, set `finetune = True`.
290
+ 3. Set `resume_checkpoint` to the path of the pretrained model checkpoint (e.g., `checkpoints/best_model.pt`).
291
+ 4. Configure fine-tuning parameters like `finetune_data_dir` and `finetune_learning_rate`.
292
+ 5. Run the script:
293
+
294
+ ```bash
295
+ python train.py
296
+ ```
297
+
298
+ Checkpoints will be saved in the directory specified by `out_dir` (for pretraining) or `finetune_out_dir` (for fine-tuning). The best model based on validation loss will be saved as `best_model.pt`.
299
+
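The relationship between these settings can be summarized in a short sketch. The variable names mirror the ones mentioned above, but every value is a placeholder, and grouping them in a dataclass with an `active_paths` helper is purely illustrative: in the repository they are plain variables at the top of `train.py`.

```python
from dataclasses import dataclass


@dataclass
class TrainSettings:
    finetune: bool = False
    # Pretraining settings (placeholder values).
    data_dir: str = "data/fineweb_edu_sample_10BT"
    out_dir: str = "checkpoints"
    # Fine-tuning settings (placeholder values).
    finetune_data_dir: str = "data/my_dataset"
    finetune_out_dir: str = "checkpoints_finetune"
    finetune_learning_rate: float = 1e-5
    resume_checkpoint: str = "checkpoints/best_model.pt"


def active_paths(settings):
    """Return (data directory, output directory, best-checkpoint path)."""
    if settings.finetune:
        data, out = settings.finetune_data_dir, settings.finetune_out_dir
    else:
        data, out = settings.data_dir, settings.out_dir
    return data, out, f"{out}/best_model.pt"


print(active_paths(TrainSettings(finetune=False)))
print(active_paths(TrainSettings(finetune=True)))
```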
300
+ ## 🛠️ Training Details
301
+
302
+ ### Pretraining (`Lille-130M-Base`)
303
+ * **Dataset:** Pretrained on **4.27 billion tokens** from the `sample-10BT` configuration of the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.
304
+ * **Tokenizer:** The custom **[Hastings](https://github.com/Nikityyy/Hastings)** tokenizer with a 32,768 vocabulary size.
305
+ * **Optimizer:** The memory-efficient **[Sophia-Triton](https://github.com/Nikityyy/Sophia-Triton)** optimizer.
306
+ * **Hardware:** Trained on a single NVIDIA RTX 4070 Ti.
307
+ * **Precision:** bfloat16.
308
+
309
+ ### Instruction Tuning (`Lille-130M-Instruct`)
310
+ * **Dataset:** Supervised Fine-Tuning (SFT) was performed on the **[Kyoto-Corpus](https://github.com/Nikityyy/Kyoto-Corpus)**, a high-quality, curated collection of conversational and instructional data.
311
+
312
+ ### Model Architecture
313
+ * **Type:** Transformer Decoder
314
+ * **Layers:** 24
315
+ * **Embedding Size:** 640
316
+ * **Attention Heads:** 10
317
+ * **KV Heads (GQA):** 2
318
+ * **Context Length:** 512 tokens
319
+
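The figures above allow a rough parameter estimate. The sketch below is an approximation under stated assumptions, not the model's actual accounting: the feed-forward width (`ffn_hidden = 1728`), a gated three-matrix MLP, bias-free linear layers, and the choice between tied and untied embeddings are not given in this model card, and `DecoderShape` is a hypothetical helper class.

```python
from dataclasses import dataclass


@dataclass
class DecoderShape:
    # Values from the Model Architecture list.
    n_layers: int = 24
    d_model: int = 640
    n_heads: int = 10
    n_kv_heads: int = 2
    vocab_size: int = 32_768
    # ASSUMPTIONS: not stated in the model card.
    ffn_hidden: int = 1_728
    tied_embeddings: bool = True


def attention_params(s):
    head_dim = s.d_model // s.n_heads
    q_and_out = 2 * s.d_model * s.d_model              # query + output projections
    k_and_v = 2 * s.d_model * s.n_kv_heads * head_dim  # smaller thanks to GQA
    return q_and_out + k_and_v


def ffn_params(s):
    # ASSUMED gated MLP: gate, up, and down projections.
    return 3 * s.d_model * s.ffn_hidden


def total_params(s):
    embeddings = s.vocab_size * s.d_model * (1 if s.tied_embeddings else 2)
    norms = (2 * s.n_layers + 1) * s.d_model  # two RMSNorms per layer + final
    return embeddings + s.n_layers * (attention_params(s) + ffn_params(s)) + norms


for tied in (True, False):
    shape = DecoderShape(tied_embeddings=tied)
    print(f"tied={tied}: ~{total_params(shape) / 1e6:.1f}M parameters")
```

With these assumed values the estimate runs from about 124M (tied embeddings) to about 145M (untied), which brackets the roughly 140M mentioned in the model summary; the true figure depends on the real feed-forward width and embedding setup.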
320
+ ## Limitations
321
+
322
+ Lille models primarily understand and generate content in English. While powerful for their size, they can produce text that may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.
323
+
324
+ ## 🛠️ The Truly Open-Source Repos
325
+
326
+ Lille is a key component of my initiative to build and release a complete, truly open-source stack for language modeling. All components are designed to work together seamlessly.
327
+
328
+ * **Tokenizer:** **[Hastings](https://github.com/Nikityyy/Hastings)** - A modern, efficient tokenizer with a 32k vocabulary.
329
+ * **Dataset:** **[Kyoto-Corpus](https://github.com/Nikityyy/Kyoto-Corpus)** - A high-quality, small-scale dataset for instruction tuning.
330
+ * **Model:** **[lille](https://github.com/Nikityyy/lille)** (this model) - A powerful 130-million-parameter model trained from scratch.
331
+ * **Optimizer:** **[Sophia-Triton](https://github.com/Nikityyy/Sophia-Triton)** - A memory-efficient, Triton-based implementation of the SophiaG optimizer.
332
+ * **Evaluations:** **[simple-eval](https://github.com/Nikityyy/simple-eval)** - A straightforward framework for evaluating model performance using an LLM as a Judge.
333
+
334
+ ## 📜 License
335
+
336
+ This project is licensed under the Apache-2.0 License.
337
+
338
+ ## Citation
339
+
340
+ If you use Lille or any part of this open-source stack in your work, please consider citing it:
341
+
342
+ ```bibtex
343
+ @misc{lille-130m,
344
+ author = {Nikita Berger},
345
+ title = {Lille: A Truly Open-Source 130M Language Model},
346
+ year = {2025},
347
+ publisher = {GitHub},
348
+ journal = {GitHub repository},
349
+ howpublished = {\url{https://github.com/Nikityyy/lille}}
350
+ }
351
+ ```