---
license: apache-2.0
datasets:
- bigcode/the-stack
- bigcode/the-stack-v2
- bigcode/starcoderdata
- bigcode/commitpack
library_name: transformers
tags:
- code
model-index:
- name: Mellum-4b-base
  results:
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2591
      verified: false
    - name: EM ≤ 8k
      type: exact_match
      value: 0.2797
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 2k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2820
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 4k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2795
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 8k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2777
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 12k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2453
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_python_v1.1
      name: RepoBench 1.1 (Python, 16k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2110
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_java_v1.1
      name: RepoBench 1.1 (Java)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2858
      verified: false
    - name: EM ≤ 8k
      type: exact_match
      value: 0.3108
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_java_v1.1
      name: RepoBench 1.1 (Java, 2k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.3202
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_java_v1.1
      name: RepoBench 1.1 (Java, 4k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.3212
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_java_v1.1
      name: RepoBench 1.1 (Java, 8k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2910
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_java_v1.1
      name: RepoBench 1.1 (Java, 12k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2492
      verified: false
  - task:
      type: text-generation
    dataset:
      type: tianyang/repobench_java_v1.1
      name: RepoBench 1.1 (Java, 16k)
    metrics:
    - name: EM
      type: exact_match
      value: 0.2474
      verified: false
  - task:
      type: text-generation
    dataset:
      type: gonglinyuan/safim
      name: SAFIM
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.3811
      verified: false
  - task:
      type: text-generation
    dataset:
      type: gonglinyuan/safim
      name: SAFIM (Algorithmic)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.2530
      verified: false
  - task:
      type: text-generation
    dataset:
      type: gonglinyuan/safim
      name: SAFIM (Control)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.3839
      verified: false
  - task:
      type: text-generation
    dataset:
      type: gonglinyuan/safim
      name: SAFIM (API)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.5065
      verified: false
  - task:
      type: text-generation
    dataset:
      type: loubnabnl/humaneval_infilling
      name: HumanEval Infilling (Single-Line)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.6621
      verified: false
  - task:
      type: text-generation
    dataset:
      type: loubnabnl/humaneval_infilling
      name: HumanEval Infilling (Multi-Line)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.3852
      verified: false
  - task:
      type: text-generation
    dataset:
      type: loubnabnl/humaneval_infilling
      name: HumanEval Infilling (Random Span)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.2969
      verified: false
---
# Model Description
Mellum-4b-base is JetBrains' first open-source large language model (LLM) optimized for code-related tasks.

Trained on over 4 trillion tokens with a context window of 8192 tokens across multiple programming languages, Mellum-4b-base is tailored specifically for code completion.
The model follows a LLaMA-style architecture with 4 billion parameters, making it efficient for both cloud inference (e.g., via vLLM) and local deployment (e.g., using llama.cpp or Ollama).
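For cloud-style inference, a minimal offline-generation sketch with vLLM could look like the following (not from the original card; the prompt and sampling parameters are illustrative, not tuned recommendations):

```python
# Minimal vLLM sketch: greedy decoding of a short completion.
from vllm import LLM, SamplingParams

llm = LLM(model="JetBrains/Mellum-4b-base")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["def quicksort(arr):"], params)
print(outputs[0].outputs[0].text)
```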
Mellum was trained using Automatic Mixed Precision (AMP) with bf16 precision.
The version uploaded to Hugging Face retains the bf16 format for public use.
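To keep the published weights in bf16 at load time rather than upcasting to the fp32 default, you can pass the dtype explicitly; a minimal sketch:

```python
# Load the uploaded bf16 checkpoint without upcasting to fp32.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "JetBrains/Mellum-4b-base",
    torch_dtype=torch.bfloat16,
)
```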
Designed for integration into professional developer tooling (e.g., intelligent code suggestions in IDEs), AI-powered coding assistants, and research on code understanding and generation, Mellum is also well-suited for educational applications and fine-tuning experiments.

This release includes the base model as well as several SFT models.
Keep in mind that the base model is not fine-tuned for downstream tasks out of the box; however, it fully supports supervised fine-tuning (SFT) and reinforcement learning (RL) for adaptation to specific applications.
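As one rough illustration of such adaptation, a bare-bones SFT run with the standard Hugging Face `Trainer` might look like this (a sketch only; the training file name and hyperparameters are placeholders, not recommendations from the model authors):

```python
# Minimal SFT sketch; "my_code_corpus.txt" is a hypothetical training file.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("JetBrains/Mellum-4b-base")
model = AutoModelForCausalLM.from_pretrained("JetBrains/Mellum-4b-base")

ds = load_dataset("text", data_files={"train": "my_code_corpus.txt"})["train"]
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mellum-4b-sft", bf16=True, num_train_epochs=1),
    train_dataset=ds,
    # mlm=False gives the standard causal (next-token) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```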
# Training Data
- Total Training Tokens: ~4.2 trillion tokens
- Corpus: The Stack, StarCoder Training Dataset, The Stack v2, CommitPack, English Wikipedia

# Training Details
- Context Window: 8,192 tokens
- Optimization: Standard language modeling objective (written out below)
- Hardware: Cluster of 256 NVIDIA H200 GPUs with InfiniBand
- Training Duration: ~20 days
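For reference, the standard causal language modeling objective named above minimizes the negative log-likelihood of each token given its preceding context:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
$$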
# Benchmarks
In addition to the base model scores, we also report scores for a version of Mellum fine-tuned for Python, to give users an estimate of the model's potential capabilities.

## RepoBench 1.1
- Type: single-line
- Languages: Python and Java
- Metric: Exact Match (EM), % (a simplified sketch of the metric follows below)

Since Mellum has a maximum context window of 8k, we report both the average performance across all evaluated context lengths (2k, 4k, 8k, 12k, and 16k) and the average over context lengths within its supported range (≤ 8k).
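For intuition, exact match is all-or-nothing per example; a simplified sketch (the exact normalization RepoBench applies may differ):

```python
def exact_match(prediction: str, ground_truth: str) -> bool:
    # Scores 1 only if the completed line matches the reference exactly
    # (after trimming surrounding whitespace in this simplified version).
    return prediction.strip() == ground_truth.strip()
```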
### Python Subset
| Model                | 2k     | 4k     | 8k     | 12k    | 16k    | Avg    | Avg ≤ 8k |
|----------------------|--------|--------|--------|--------|--------|--------|----------|
| Mellum-4b-sft-python | 29.24% | 30.60% | 29.77% | 26.80% | 25.43% | 28.37% | 29.87%   |
| Mellum-4b-base       | 28.20% | 27.95% | 27.77% | 24.53% | 21.10% | 25.91% | 27.97%   |

### Java Subset
| Model          | 2k     | 4k     | 8k     | 12k    | 16k    | Avg    | Avg ≤ 8k |
|----------------|--------|--------|--------|--------|--------|--------|----------|
| Mellum-4b-base | 32.02% | 32.12% | 29.10% | 24.92% | 24.74% | 28.58% | 31.08%   |
## Syntax-Aware Fill-in-the-Middle (SAFIM)
- Type: mix of multi-line and single-line
- Languages: multi-language
- Metric: pass@1, %

| Model                | Algorithmic | Control | API    | Average |
|----------------------|-------------|---------|--------|---------|
| Mellum-4b-sft-python | 33.16%      | 36.11%  | 57.10% | 42.12%  |
| Mellum-4b-base       | 25.30%      | 38.39%  | 50.65% | 38.11%  |

## HumanEval Infilling
- Type: single-line and multi-line
- Languages: Python
- Metric: pass@1, %

| Model                | Single-Line | Multi-Line | Random Span |
|----------------------|-------------|------------|-------------|
| Mellum-4b-sft-python | 80.45%      | 48.19%     | 37.68%      |
| Mellum-4b-base       | 66.21%      | 38.52%     | 29.70%      |
# Limitations
- Biases: May reflect biases present in public codebases; for example, the model will likely produce code similar in style to open-source repositories.
- Security: Code suggestions should not be assumed to be secure or free of vulnerabilities.
# Sample Usage
Here are examples of how to run and sample from the model.

## Generic generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

example = """
import sys
import os
import time

sys.path.append(os.getcwd())

from cluster.prepare_data import get_headers_pairs_list, write_dist_matrix
from cluster.token_edit_distance import get_distance_matrix

if len(sys.argv) < 3:
    print(
        "Too few arguments. You should provide: \n1. dataset_filename" +
        "\n2. output_data_filename"
    )
    sys.exit()

start = time.perf_counter()
dataset_filename_ = sys.argv[1]
output_data_filename_ = sys.argv[2]

headers_pairs = get_headers_pairs_list(dataset_filename_, verbose=True)

dist_matrix, max_dist = get_distance_matrix(
    list(map(lambda x: x[1], headers_pairs)),
    verbose=True
)

write_dist_matrix(dist_matrix, max_dist, output_data_filename_, verbose=True)

end = time.perf_counter()
"""

tokenizer = AutoTokenizer.from_pretrained('JetBrains/Mellum-4b-base')
model = AutoModelForCausalLM.from_pretrained('JetBrains/Mellum-4b-base')
encoded_input = tokenizer(example, return_tensors='pt', return_token_type_ids=False)
input_len = len(encoded_input["input_ids"][0])
out = model.generate(
    **encoded_input,
    max_new_tokens=100,
)
print("### Context")
print(tokenizer.decode(out[0][:input_len]))
print("### Prediction")
print(tokenizer.decode(out[0][input_len:]))
```

## Fill-in-the-middle generation
```python
# Reuses `tokenizer` and `model` loaded in the previous example.
prefix = """
def fibonacci(n: int) -> int:
"""

suffix = """
if __name__ == "__main__":
    print(fibonacci(10))
"""

encoded_input = tokenizer(f"<fim_suffix>{suffix}<fim_prefix>{prefix}<fim_middle>", return_tensors='pt', return_token_type_ids=False)
input_len = len(encoded_input["input_ids"][0])
out = model.generate(
    **encoded_input,
    max_new_tokens=100,
)
print(tokenizer.decode(out[0][input_len:]))
```
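Note that the template above places the suffix before the prefix. If you build FIM prompts in several places, a small helper (hypothetical, not part of the model's API) keeps that token order consistent:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Mellum's FIM template as used above: suffix, then prefix, then
    # <fim_middle> to request the infill.
    return f"<fim_suffix>{suffix}<fim_prefix>{prefix}<fim_middle>"
```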

# Citation
If you use this model, please cite:

```bibtex
@misc{Mellum-4b-base,
    title = {Mellum-4b-base},
    author = {Pavlichenko, Nikita and Nazarov, Iurii and Dolgov, Ivan and Reshetnikova, Julia and Garanina, Ekaterina and Lasocki, Karol and Boitsov, Sergei and Karaeva, Dariia and Bondyrev, Ivan and Sheptyakov, Maksim and Ustalov, Dmitry and Abramov, Nikita and Kolomyttseva, Olga and Lysaniuk, Kseniia and Zavidnyi, Ilia and Semenkin, Anton and Sazanovich, Uladzislau},
    year = {2025},
}
```

# Contact
For questions, collaborations, and requests, reach out to us at [email protected].