Update README.md
README.md
CHANGED
@@ -71,10 +71,10 @@ Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.
 |google/gemma-3-12b-it |57,388,402 |2.506 |47,285,432 |1.307 |354,241,840 |1.788 |6,240,944 |3.341 |95,520,817 |2.245 |103,950,626 |2.329 |131,398,147 |3.045 |
 |Qwen/Qwen3-8B |84,408,084 |3.686 |46,884,593 |1.296 |395,581,536 |1.996 |7,956,741 |4.259 |116,115,062 |2.728 |132,597,427 |2.971 |173,571,099 |4.022 |
 |meta-llama/Llama-3.1-8B-Instruct|57,226,997 |2.499 |46,085,724 |1.274 |382,143,751 |1.928 |7,386,873 |3.954 |104,974,733 |2.467 |119,123,733 |2.669 |150,189,294 |3.48 |
-|microsoft/Phi-4-mini-instruct |59,447,036 |2.596 |45,423,925
-|CohereLabs/aya-expanse-8b |50,973,632 |2.226 |47,364,187 |1.309 |353,221,932 |1.782 |6,614,719 |3.541 |93,089,697 |2.187 |112,612,668
-
-|Comments <td colspan=2> Significant 27 % improvement over the Aya-Expanse baseline; absolute leader in Ukrainian tokenization.<td colspan=2>Tokens-per-word for English rises by less than 4 % compared with the baseline.<td colspan=2>
+|microsoft/Phi-4-mini-instruct |59,447,036 |2.596 |45,423,925 |**1.256** |335,188,687 |**1.691** |5,995,822 |3.209 |91,824,464 |**2.158** |102,472,523 |2.296 |119,587,038 |**2.771** |
+|CohereLabs/aya-expanse-8b |50,973,632 |2.226 |47,364,187 |1.309 |353,221,932 |1.782 |6,614,719 |3.541 |93,089,697 |2.187 |112,612,668 |**2.523** |141,262,943 |3.273 |
+|**ayayay-tokenizer (Ours)** |37,094,157 |**1.62**🤩 |48,288,882 |1.335 |372,587,959 |1.88 |4,238,587 |**2.269** |107,331,167 |2.522 |114,292,191 |2.561 |133,618,186 |3.096 |
+|Comments <td colspan=2> Significant 27 % improvement over the Aya-Expanse baseline; absolute leader in Ukrainian tokenization.<td colspan=2>Tokens-per-word for English rises by less than 4 % compared with the baseline.<td colspan=2>Ayayay tokenizer retains strong multilingual capabilities <td colspan=2>Shows significant improvement on QIRIM Cyrillic versus the original aya and other tokenizers<td colspan=2>Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.<td colspan=4> Other Cyrillic languages, such as Bulgarian and Belarusian, perform well after the token replacement; Belarusian improves especially noticeably.|
 
 
 ## Contents
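
The percentages quoted in the Comments row can be sanity-checked against the table rows in the diff above. The sketch below is only an illustration: the column headers are not visible in this hunk, so it assumes (following the order of the Comments cells) that the first value in each row is the Ukrainian token count and the fourth is the English tokens-per-word figure. The helper `relative_change` and the dictionary keys are hypothetical names introduced here, not part of the README.

```python
def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of `new` against `baseline` (negative means fewer tokens)."""
    return (new - baseline) / baseline


# Values copied from the added rows of the diff.
# ASSUMPTION: column 1 is the Ukrainian token count, column 4 is English tokens per word.
aya_expanse = {"uk_tokens": 50_973_632, "en_tokens_per_word": 1.309}
ayayay = {"uk_tokens": 37_094_157, "en_tokens_per_word": 1.335}

uk_change = relative_change(ayayay["uk_tokens"], aya_expanse["uk_tokens"])
en_change = relative_change(
    ayayay["en_tokens_per_word"], aya_expanse["en_tokens_per_word"]
)

print(f"Ukrainian token count vs. Aya-Expanse: {uk_change:+.1%}")
print(f"English tokens per word vs. Aya-Expanse: {en_change:+.1%}")
```

Under these assumptions, the Ukrainian figure matches the "27 %" in the Comments row, and the English increase stays below the stated 4 % bound.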