---
language:
- ms
- en
- zh
- ta
---

# Malaysian gemma-3-1b-it

Continued finetuning of https://huggingface.co/google/gemma-3-1b-it on a highly curated 1.5B-token Malaysian instruction dataset.
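A minimal inference sketch with `transformers`; the repo id below is an assumption inferred from this card's title, so adjust it to the actual checkpoint name on the Hub:

```python
# Inference sketch -- MODEL_ID is an assumption based on this card's title;
# replace it with the actual checkpoint name.
MODEL_ID = "mesolitica/Malaysian-gemma-3-1b-it"  # assumed repo id

def build_messages(prompt: str) -> list:
    """Wrap a user prompt in the single-turn chat format Gemma instruct models expect."""
    return [{"role": "user", "content": prompt}]

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # Heavy imports kept local so the sketch is cheap to import.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    inputs = tokenizer.apply_chat_template(
        build_messages(prompt), add_generation_prompt=True, return_tensors="pt"
    )
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example call (downloads the model):
# print(generate("Apa khabar? Terangkan secara ringkas tentang Parlimen Malaysia."))
```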

## Improvement

1. Responds in Mandarin, Tamil, Jawi, Manglish, and the Johor, Kedah, Kelantan, Pahang, Perak, Sabah, Sarawak, Selangor, Negeri Sembilan and Terengganu dialects.
2. Able to code in Mandarin, Tamil, Jawi, Manglish, and the same local dialects.
3. Handles multi-turn Malaysian context, such as topics related to Malaysian legislation, politics, religions and languages.

## Training session

Finetuned on [mesolitica/Malaysian-SFT](https://huggingface.co/datasets/mesolitica/Malaysian-SFT) to make the model understand Malaysian context.
  
## How we train

1. LoRA on `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"]`.
2. Rank 128 with alpha 256, i.e. a LoRA scaling factor (alpha / rank) of 2.0.
3. Multipacking at 8192 context length with proper SDPA causal masking to prevent cross-document contamination, and with correct per-document position ids.
4. Chunked CCE (Cut Cross-Entropy) loss for LoRA.
5. WandB at https://wandb.ai/huseinzol05/lora-embedding-128-gemma3-1b-malaysian-8k?nw=nwuserhuseinzol05
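The rank/alpha pair in point 2 implies a LoRA scaling factor of alpha / rank = 256 / 128 = 2.0. A small NumPy sketch (illustrative layer sizes, not the model's real ones) of how such an adapter modifies a frozen weight:

```python
import numpy as np

rank, alpha = 128, 256
scaling = alpha / rank  # 256 / 128 = 2.0, the "alpha of 2.0" noted above

d_out, d_in = 512, 512  # illustrative dimensions only
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))             # trainable up-projection, zero-initialised

# Effective weight during/after training: base plus scaled low-rank update.
W_eff = W + scaling * (B @ A)

# With B zero-initialised, the adapter starts as an exact no-op.
assert np.allclose(W_eff, W)
print(scaling)  # 2.0
```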

Source code at https://github.com/mesolitica/malaya/tree/master/session/gemma3
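The multipacking setup in point 3 of the training list can be sketched in plain Python: position ids restart at 0 for each packed document, and the causal mask only permits attention within a document. `doc_lens` below is a hypothetical packed batch, not taken from the training code:

```python
def packed_position_ids(doc_lens):
    """Position ids restart at 0 for every packed document."""
    return [p for n in doc_lens for p in range(n)]

def packed_causal_mask(doc_lens):
    """mask[i][j] is True iff query token i may attend to key token j:
    same document AND j <= i (causal)."""
    doc_id = [d for d, n in enumerate(doc_lens) for _ in range(n)]
    total = sum(doc_lens)
    return [
        [doc_id[i] == doc_id[j] and j <= i for j in range(total)]
        for i in range(total)
    ]

# Three documents of lengths 3, 2 and 4 packed into one 9-token sequence.
doc_lens = [3, 2, 4]
print(packed_position_ids(doc_lens))
# -> [0, 1, 2, 0, 1, 0, 1, 2, 3]

mask = packed_causal_mask(doc_lens)
assert mask[3][2] is False  # token 3 (doc 1) cannot see token 2 (doc 0)
assert mask[4][3] is True   # within doc 1, causal attention is allowed
```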

## Benchmark

### MalayMMLU

Based on 0-shot first-token accuracy:

```
                     Model   Accuracy   shot by_letter        category
0  Malaysian-gemma-3-1b-it  48.096603  0shot      True            STEM
1  Malaysian-gemma-3-1b-it  47.423664  0shot      True        Language
2  Malaysian-gemma-3-1b-it  47.210176  0shot      True  Social science
3  Malaysian-gemma-3-1b-it  47.709283  0shot      True          Others
4  Malaysian-gemma-3-1b-it  51.786121  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Malaysian-gemma-3-1b-it
Metric : first
Shot : 0shot
average accuracy 48.27158964192789
accuracy for STEM 48.09660253786328
accuracy for Language 47.4236641221374
accuracy for Social science 47.21017635154669
accuracy for Others 47.70928280163108
accuracy for Humanities 51.786120591581344
```
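The reported average is the example-weighted mean of the per-category accuracies, using the category counts printed above; a quick sanity check:

```python
# Reproduce the reported average from the per-category numbers above.
counts = {
    "STEM": 2443, "Language": 6288, "Social science": 6918,
    "Others": 4169, "Humanities": 4395,
}
accuracy = {
    "STEM": 48.09660253786328,
    "Language": 47.4236641221374,
    "Social science": 47.21017635154669,
    "Others": 47.70928280163108,
    "Humanities": 51.786120591581344,
}

total = sum(counts.values())  # 24213 questions in total
weighted = sum(accuracy[c] * counts[c] for c in counts) / total
print(round(weighted, 6))  # -> 48.27159, matching the reported 48.27158964...
```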

## Acknowledgement

Special thanks to https://www.sns.com.my and Nvidia for the 8x H100 node!