Evaluation results from Nejumi LLM Leaderboard 3 (W&B).

|
(Chart legend) Blue: Original / Orange: 8bit / Green: 4bit

### Benchmark Overall Results

| Model | GLP Average | ALT Average | Overall Average |
|----------------|-------------|-------------|-----------------|
| phi-4 Int4 | 0.5815 | 0.6953 | 0.6384 |
| phi-4 Int8 | 0.5948 | 0.7015 | 0.6482 |
| phi-4 Original | 0.5950 | 0.7005 | 0.6477 |

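As a sanity check on the overall results, the Overall Average matches the simple mean of the GLP and ALT averages to within the table's four-decimal rounding. A minimal sketch (equal weighting of GLP and ALT is an assumption inferred from the published figures, not a documented leaderboard formula):

```python
# Published figures: (GLP average, ALT average, overall average) per model.
rows = {
    "phi-4 Int4":     (0.5815, 0.6953, 0.6384),
    "phi-4 Int8":     (0.5948, 0.7015, 0.6482),
    "phi-4 Original": (0.5950, 0.7005, 0.6477),
}

for model, (glp, alt, overall) in rows.items():
    mean = (glp + alt) / 2
    # Allow for rounding in the published four-decimal figures.
    assert abs(mean - overall) <= 1e-4, (model, mean, overall)
    print(f"{model}: mean={mean:.5f}, published={overall:.4f}")
```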
### General Language Performance (GLP) Details

| Subcategory | Int4 | Int8 | Original |
|-------------|------|------|----------|
| Mathematical Reasoning | 0.5400 | 0.5967 | 0.5817 |
| Extraction | 0.3304 | 0.3408 | 0.3470 |
| Knowledge & QA | 0.5587 | 0.5735 | 0.5685 |
| MMLU_en | 0.3035 | 0.2351 | 0.2158 |
| Semantic Analysis | 0.4220 | 0.5200 | 0.5070 |
| Syntax Analysis | 0.4399 | 0.4967 | 0.4903 |

Note: The low MMLU_en scores are due to the model's inability to strictly follow the answer format this benchmark requires, rather than reflecting its actual knowledge or reasoning capabilities.

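The note above can be made concrete: under strict exact-match scoring, a correct answer wrapped in any extra text scores zero. A minimal sketch of such a scorer (hypothetical illustration, not the actual leaderboard harness):

```python
def exact_choice_score(output: str, gold: str) -> int:
    """Score 1 only if the model output is exactly the expected choice label."""
    return int(output.strip() == gold)

# A bare label matches; a verbose but correct answer still scores 0.
assert exact_choice_score("A", "A") == 1
assert exact_choice_score("The answer is A.", "A") == 0
```

This is why a format-sensitive benchmark can rank a knowledgeable model poorly: the scores measure instruction following on the output format as much as the underlying capability.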
### Alignment (ALT) Details

| Subcategory | Int4 | Int8 | Original |
|-------------|------|------|----------|
| Bias | 0.8858 | 0.8730 | 0.8650 |
| Robustness | 0.3717 | 0.4208 | 0.4226 |
| Truthfulness | 0.5292 | 0.4983 | 0.5206 |

### Benchmark Scores

| Benchmark | Int4 | Int8 | Original |
|-----------|------|------|----------|
| JASTER (2-shot) | 0.6136 | 0.6441 | 0.6398 |
| MT-Bench | 8.2438 | 8.2000 | 8.1313 |
| LCTG | 0.6860 | 0.6670 | 0.6750 |

Note: MT-Bench is scored on a 10-point scale; the other benchmarks are normalized to the 0–1 range.

---

## Model Characteristics & Evaluation

- **High Stability**: Standard GPTQ quantization achieves sufficient performance for 14B-class models
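The stability claim is consistent with basic quantization arithmetic: a signed 8-bit grid has 16 times the resolution of a 4-bit one, so rounding error shrinks sharply, which matches Int8 sitting almost on top of the original scores in the tables above. A minimal sketch of plain round-to-nearest weight quantization error (illustrative only; GPTQ additionally compensates rounding error using second-order weight statistics):

```python
import random

def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization to a signed integer grid,
    then dequantization back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

random.seed(0)
# Toy stand-in for a layer's weights (Gaussian, typical small std-dev).
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4):
    deq = quantize_rtn(weights, bits)
    mse = sum((w - q) ** 2 for w, q in zip(weights, deq)) / len(weights)
    print(f"{bits}-bit round-to-nearest MSE: {mse:.3e}")
```

The 4-bit grid produces a much larger mean-squared error than the 8-bit grid, which is the headroom that error-compensating schemes like GPTQ exist to claw back.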