nejumi committed
Commit 98ecc64 · verified · 1 Parent(s): ad84935

Update README_en.md

Files changed (1)
  1. README_en.md +8 -1
README_en.md CHANGED
@@ -22,12 +22,14 @@ Evaluation results from [Nejumi LLM Leaderboard 3 (W&B)](https://wandb.ai/wandb-
 Blue: Original
 Orange: 8bit
 Green: 4bit
+
 ### Benchmark Overall Results
 | Model | GLP Average | ALT Average | Overall Average |
 |--------|---------|---------|----------|
 | phi-4 Int4 | 0.5815 | 0.6953 | 0.6384 |
 | phi-4 Int8 | 0.5948 | 0.7015 | 0.6482 |
 | phi-4 Original | 0.5950 | 0.7005 | 0.6477 |
+
 ### General Language Performance (GLP) Details
 | Subcategory | Int4 | Int8 | Original |
 |-------------|------|------|------|
@@ -38,9 +40,12 @@ Green: 4bit
 | Mathematical Reasoning | 0.5400 | 0.5967 | 0.5817 |
 | Extraction | 0.3304 | 0.3408 | 0.3470 |
 | Knowledge & QA | 0.5587 | 0.5735 | 0.5685 |
-| English | 0.3035 | 0.2351 | 0.2158 |
+| MMLU_en | 0.3035 | 0.2351 | 0.2158 |
 | Semantic Analysis | 0.4220 | 0.5200 | 0.5070 |
 | Syntax Analysis | 0.4399 | 0.4967 | 0.4903 |
+
+Note: The low MMLU_en scores are due to the model's inability to strictly follow the required answer format for this benchmark, rather than reflecting its actual knowledge or reasoning capabilities.
+
 ### Alignment (ALT) Details
 | Subcategory | Int4 | Int8 | Original |
 |-------------|------|------|------|
@@ -50,6 +55,7 @@ Green: 4bit
 | Bias | 0.8858 | 0.8730 | 0.8650 |
 | Robustness | 0.3717 | 0.4208 | 0.4226 |
 | Truthfulness | 0.5292 | 0.4983 | 0.5206 |
+
 ### Benchmark Scores
 | Benchmark | Int4 | Int8 | Original |
 |-------------|------|------|------|
@@ -57,6 +63,7 @@ Green: 4bit
 | JASTER (2-shot) | 0.6136 | 0.6441 | 0.6398 |
 | MT-Bench | 8.2438 | 8.2000 | 8.1313 |
 | LCTG | 0.6860 | 0.6670 | 0.6750 |
+
 ---
 ## Model Characteristics & Evaluation
 - **High Stability**: Standard GPTQ quantization achieves sufficient performance for 14B class models
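
For reference, below is a minimal usage sketch of loading a GPTQ-quantized checkpoint such as the Int4 variant evaluated above with Hugging Face transformers. The repository id is a placeholder not taken from this commit, and a GPTQ backend (auto-gptq or gptqmodel, plus optimum) is assumed to be installed.

```python
# Minimal sketch, not part of this commit: loading an Int4 GPTQ checkpoint with transformers.
# "your-org/phi-4-GPTQ-Int4" is a placeholder repository id; substitute the actual quantized repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/phi-4-GPTQ-Int4"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers reads the GPTQ quantization config stored in the checkpoint and
# dispatches to the installed GPTQ backend; device_map="auto" places layers on available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the trade-offs of 4-bit weight quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```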