jujbob committed on
Commit f08c852 · verified · 1 Parent(s): 5834afa

Update README.md

Files changed (1)
  1. README.md +146 -12
README.md CHANGED
@@ -6,30 +6,50 @@ license: apache-2.0
  <img src="https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.png?raw=true" style="width: 100%; max-width: 1100px;">
  </p>

- # 🦾 KORMo-10B

  **KORMo-10B** is a **10.8B parameter fully open LLM** capable of handling both **Korean and English**.
- The model, training code, and training data are all **fully open**, allowing anyone to reproduce and extend it.

  - 🧠 **Model Size**: 10.8B parameters
  - 🗣️ **Languages**: Korean / English
  - 🪄 **Training Data**: Synthetic data + public datasets
  - 🧪 **License**: Apache 2.0 (commercial use permitted)

  ---

  ## 🔗 Links

  - 🤗 **Hugging Face**: [👉 Model Download](https://huggingface.co/KORMo-Team)
  - 💻 **GitHub Repository**: [👉 Training and Inference Code](https://github.com/MLP-Lab/KORMo-tutorial)

  ---

- ## 🆕 Update News
- - 🚀 **Oct 2025**: Official release of KORMo v1.0!
-
- ---
-
  ## Model Architecture
  | Item | Description |
  |:----|:------------|
@@ -38,7 +58,7 @@ The model, training code, and training data are all **fully open**, allowing any
  | Context Length | 32K |
  | Languages | Korean, English |
  | License | Apache 2.0 |
-
  ---

  ## 📈 Benchmark Performance
@@ -74,11 +94,10 @@ The model, training code, and training data are all **fully open**, allowing any
  | kr_clinical_qa | 77.32 | 53.97 | 48.33 | 46.22 | 65.84 | 80.00 | 63.54 | 60.00 | 77.22 |
  | **Korean Avg.** | **58.15** | 47.37 | 35.82 | 39.34 | 60.94 | 63.35 | 49.60 | 49.60 | 60.37 |

- ---

- ## 📝 Qualitative Evaluation (LLM-as-a-Judge)

- | Benchmark | KORMo-10B | smolLM3-3B | olmo2-7B | olmo2-13B | kanana1.5-8B | qwen3-8B | llama3.1-8B | exaone3.5-8B* | gemma3-12B |
  |:----------|---------:|----------:|---------:|---------:|------------:|--------:|------------:|-------------:|-----------:|
  | MT-Bench (EN) | 8.32 | 7.15 | 7.32 | 7.64 | 8.45 | 8.70 | 6.32 | 8.15 | 8.70 |
  | KO-MT-Bench (KO) | 8.54 | - | - | - | 8.02 | 8.16 | 4.27 | 8.13 | 8.51 |
@@ -87,8 +106,123 @@ The model, training code, and training data are all **fully open**, allowing any

  ---

  ## Contact
  - KyungTae Lim, Professor at KAIST. `[email protected]`

-
  ## Contributor
 
  <img src="https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.png?raw=true" style="width: 100%; max-width: 1100px;">
  </p>

+
+ ## 🆕 Update News
+ - 🚀 **2025-10-09 (Hangeul Day)**: Official release of KORMo-10B-base (be aware that this is the base model, not an SFT model!).
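Because this release is the pretrained base model rather than an SFT model, it is intended for raw text completion, not chat. The snippet below is only a minimal sketch of that difference and is not part of the official examples; the repo id `KORMo-Team/KORMo-10B-base` is an assumption taken from the release name above, so verify the exact name on the Hugging Face organization page.

```python
# Minimal sketch: raw completion with the base (non-SFT) checkpoint.
# NOTE: the repo id below is assumed from the release name; verify it on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_name = "KORMo-Team/KORMo-10B-base"  # assumption, not confirmed in this README
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(
    base_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# No chat template here: a base model simply continues the given text.
prompt = "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```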
+ ---
+ ## 🦾 About KORMo
  **KORMo-10B** is a **10.8B parameter fully open LLM** capable of handling both **Korean and English**.
+ The model, training code, and training data are all **fully open**, allowing anyone to reproduce and extend them.

  - 🧠 **Model Size**: 10.8B parameters
  - 🗣️ **Languages**: Korean / English
  - 🪄 **Training Data**: Synthetic data + public datasets
  - 🧪 **License**: Apache 2.0 (commercial use permitted)

+ ```
+ KORMo is the first fully open-source LLM from outside the English-speaking world, created with public-interest use in mind.
+ Our goal is an environment in which anyone can build and advance a world-class language model on their own.
+ KORMo's main features are:
+
+ 1. A 10B-class Korean-English reasoning language model designed to be trained from scratch.
+ 2. The training data, code, every intermediate checkpoint, and the tutorials are 100% open, so anyone can reproduce and extend a near-SOTA model.
+ 3. We release the full 3.7T-token training corpus, including ultra-high-quality, full-lifecycle Korean data (pretraining, post-training, general, reasoning, reinforcement learning, etc.) that has never been published before.
+ 4. All of this work was carried out by eight undergraduate and master's students in the MLP Lab at KAIST's Graduate School of Culture Technology, and is documented in a 45-page technical report.
+
+ If you have used Korean models so far, you have probably seen benchmark scores that look great while something feels off in real use,
+ or watched a model fall apart the moment you fine-tune it. Frustrating, right?
+
+ KORMo tackles that problem head-on.
+ Because every intermediate checkpoint and the post-training data are released together, you can put your own data on top of the base model and run fine-tuning or reinforcement learning in whatever direction you want (see the sketch after this block).
+ 👉 "If you want a good Korean model, build it yourself. It even fine-tunes on a free Colab GPU! 🤗"
+ ```
+
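As a concrete illustration of that workflow, here is a minimal sketch of attaching LoRA adapters to a released checkpoint with the `peft` library before supervised fine-tuning on your own data. This is not the official recipe (see the Colab QLoRA tutorial linked below for that); the target module names and hyperparameters are assumptions you would adapt to KORMo's actual architecture.

```python
# Minimal sketch: wrap a KORMo checkpoint with LoRA adapters via peft.
# Assumptions: the peft package is installed and the target_modules names
# match the attention projections used by the model (verify against its config).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
import torch

model_name = "KORMo-Team/KORMo-10B-sft"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,                      # adapter rank (assumed value, tune to your budget)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights will be updated

# From here, train on your own SFT data with the Trainer/TRL setup of your choice,
# e.g. following the QLoRA notebook in the tutorial repository.
```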
  ---

  ## 🔗 Links

+ - 📖 **Technical Report**: [👉 Archive](https://huggingface.co/KORMo-Team)
  - 🤗 **Hugging Face**: [👉 Model Download](https://huggingface.co/KORMo-Team)
  - 💻 **GitHub Repository**: [👉 Training and Inference Code](https://github.com/MLP-Lab/KORMo-tutorial)
+ - 🔉 **Tutorial**: [👉 Instruction Tuning on Google Colab](https://colab.research.google.com/github/MLP-Lab/KORMo-tutorial/blob/main/tutorial/02.sft_qlora.ipynb) [👉 YouTube Tutorial](https://www.youtube.com/@MLPLab)

  ---

+ <!--
  ## Model Architecture
  | Item | Description |
  |:----|:------------|

  | Context Length | 32K |
  | Languages | Korean, English |
  | License | Apache 2.0 |
+ -->
  ---

  ## 📈 Benchmark Performance

  | kr_clinical_qa | 77.32 | 53.97 | 48.33 | 46.22 | 65.84 | 80.00 | 63.54 | 60.00 | 77.22 |
  | **Korean Avg.** | **58.15** | 47.37 | 35.82 | 39.34 | 60.94 | 63.35 | 49.60 | 49.60 | 60.37 |

+ ### 📝 Qualitative Evaluation (LLM-as-a-Judge)

+ | Benchmark | KORMo-10B | smolLM3-3B | olmo2-7B | olmo2-13B | kanana1.5-8B | qwen3-8B | llama3.1-8B | exaone3.5-8B | gemma3-12B |
  |:----------|---------:|----------:|---------:|---------:|------------:|--------:|------------:|-------------:|-----------:|
  | MT-Bench (EN) | 8.32 | 7.15 | 7.32 | 7.64 | 8.45 | 8.70 | 6.32 | 8.15 | 8.70 |
  | KO-MT-Bench (KO) | 8.54 | - | - | - | 8.02 | 8.16 | 4.27 | 8.13 | 8.51 |

  ---

+ ## 📦 Installation
+
+ ### 1. Clone the repository
+ ```bash
+ git clone https://github.com/MLP-Lab/KORMo-tutorial.git
+ cd KORMo-tutorial
+ ```
+ ### 2. Create and activate a virtual environment (optional but recommended)
+ ```bash
+ uv venv
+ source .venv/bin/activate
+ ```
+ ### 3. Install KORMo
+ ```bash
+ uv pip install -e .
+ ```
+
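Before running the inference example below, a quick optional check that the environment is ready; this assumes the editable install pulls in `torch` and `transformers`, which is not guaranteed by the README itself, so install them explicitly if the imports fail.

```python
# Optional sanity check: verify the libraries used by the examples below are importable.
# Assumption: `uv pip install -e .` installs torch and transformers as dependencies;
# if not, install them manually before running the inference example.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```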
+ ---
+ ## 🚀 Inference Example
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = "KORMo-Team/KORMo-10B-sft"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+
+ messages = [
+     {"role": "user", "content": "What happens inside a black hole?"}
+ ]
+
+ chat_prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=False
+ )
+
+ inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=1024,
+     )
+
+ response = tokenizer.decode(output_ids[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
+ print("Assistant:", response)
+ ```
+
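The example above uses the default (greedy-like) decoding settings; for more varied output you can pass the standard sampling arguments of `generate`. The values below are illustrative only, not recommendations from the KORMo team, and the snippet reuses `model`, `tokenizer`, and `inputs` from the example above.

```python
# Sampling instead of greedy decoding; reuses `model`, `tokenizer`, and `inputs`
# from the inference example above. The parameter values are illustrative defaults.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,        # enable stochastic sampling
        temperature=0.7,       # lower = more deterministic
        top_p=0.9,             # nucleus sampling cutoff
        repetition_penalty=1.05,
    )
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```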
+ ## 🧠 Enabling Thinking Mode
+
+ If you want to enable **thinking** mode, simply set `enable_thinking=True`:
+
+ ```python
+ chat_prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=True
+ )
+ ```
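If you are unsure exactly what the flag changes, it can help to render the prompt both ways and compare; the reasoning delimiters are defined by the model's chat template, so inspect the printed strings rather than assuming a particular tag format. A small sketch, reusing `tokenizer` and `messages` from the inference example:

```python
# Render the chat prompt with and without thinking mode and compare the two.
# Any markers inserted are defined by KORMo's chat template, so check the output
# rather than assuming a specific tag format.
for thinking in (False, True):
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,
    )
    print(f"--- enable_thinking={thinking} ---")
    print(prompt)
```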
+ ---
+
+ ## 🪄 Using Specific Revisions (Training Checkpoints)
+
+ KORMo provides multiple model revisions corresponding to different training stages and checkpoints.
+ You can load a specific revision with the `revision` parameter in `from_pretrained`.
+
+ ### 📁 Stage 1 Model (sft-stage1)
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = "KORMo-Team/KORMo-10B-sft"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     revision="sft-stage1",  # Load Stage 1 checkpoint
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ ```
+
+ ### 🚀 Main Model (Final Checkpoint: sft-stage2-ckpt2)
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = "KORMo-Team/KORMo-10B-sft"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     revision="sft-stage2-ckpt2",  # Load final main checkpoint
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ ```
+
+ > 💡 **Tip**:
+ > - Use `sft-stage1` for ablation studies or comparison experiments.
+ > - Use `sft-stage2-ckpt2` as the **main production model**.
+
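To see which revisions are actually published before pinning one, you can list the repository's branches with `huggingface_hub`. A small sketch; which names appear depends on what the KORMo team has pushed to the Hub.

```python
# List the revisions (branches) available for the model repository on the Hub.
# The returned names depend on what has actually been pushed by the KORMo team.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("KORMo-Team/KORMo-10B-sft")
for branch in refs.branches:
    print(branch.name, branch.target_commit)
```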
+ ---
+
  ## Contact
  - KyungTae Lim, Professor at KAIST. `[email protected]`

  ## Contributor