Commit 50b53cc · verified · Parent: 4241aed

Update README.md

Files changed (1):
  1. README.md +7 -7
README.md CHANGED
@@ -65,7 +65,7 @@ print(toks.input_ids) # [123903, 175118, 167580, 196099] - only 4 tokens 💪
  Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).
  ||lang-uk/malyuk |100k texts|allenai/c4(en)| 100k texts|allenai/c4(es, fr, it, de) | 400k texts |QIRIM/crh_monocorpus(Cyrillic) | 94 texts |allenai/c4(ru) | 100k texts|allenai/c4(bg) | 100k texts|allenai/c4(be)| 100k texts|
  |--------------------------------|-------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|-----------------------------------------------------------------------------------------------|---------|--------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|
- |words count <td colspan=2>22,898,164 |36,170,971 | |198,173,216 | |1,868,259 | |42,557,519 | |44,627,199 | |43,153,645 | |
+ |words count <td colspan=2>22,898,164 <td colspan=2>36,170,971 <td colspan=2>198,173,216 <td colspan=2>1,868,259 <td colspan=2>42,557,519 <td colspan=2>44,627,199 <td colspan=2>43,153,645 |
  ||||||||||||||||
  |tokenizers |tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|
  |google/gemma-3-12b-it |57,388,402 |2.506 |47,285,432 |1.307 |354,241,840 |1.788 |6,240,944 |3.341 |95,520,817 |2.245 |103,950,626 |2.329 |131,398,147 |3.045 |
@@ -79,17 +79,17 @@ Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.
 
  ## Contents
 
- - **`tokenizer.json`** Byte‐level tokenizer spec (vocab, merges, model settings).
+ - [tokenizer.json](tokenizer.json) Byte‐level tokenizer spec (vocab, merges, model settings).
 
- - **`tokenizer_utf8.json`** Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
+ - [tokenizer_utf8.json](tokenizer_utf8.json) Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
 
- - **`malyuk_qirim_tokenizer.json`** Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000.
+ - [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json) Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000.
 
- - **`merge_info.json`** Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).
+ - [merge_info.json](merge_info.json) Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).
 
- - **`tokenizer_config.json`** Configuration metadata.
+ - [tokenizer_config.json](tokenizer_config.json) Configuration metadata.
 
- - **`special_tokens_map.json`** Mapping of special token (The same with Aya).
+ - [special_tokens_map.json](special_tokens_map.json) Mapping of special tokens (same as in Aya).
 
  ## Initialisation of embeddings for new tokens in Aya-Expanse models
  Some tokens are identical to those in the original Aya-Expanse tokenizer. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest—and often effective—alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
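For reference, the files listed under Contents load with the standard Hugging Face APIs. A minimal sketch, assuming the usual repo layout (the repo id is taken from the merge_info.json link above; the full README may already show equivalent usage):

```python
# Minimal loading sketch for the files listed under "Contents".
# Assumes the standard Hugging Face layout; repo id taken from the merge_info.json link.
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from transformers import AutoTokenizer

REPO = "transhumanist-already-exists/ayayay_tokenizer"

# Main tokenizer: tokenizer.json + tokenizer_config.json + special_tokens_map.json.
toks = AutoTokenizer.from_pretrained(REPO)
print(toks("Привіт, світ!").input_ids)

# Stand-alone Aya-style Malyuk/QIRIM tokenizer.
path = hf_hub_download(repo_id=REPO, filename="malyuk_qirim_tokenizer.json")
malyuk = Tokenizer.from_file(path)
print(malyuk.encode("Привіт, світ!").ids)
```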
 
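A minimal sketch of the random-initialisation baseline, assuming an Aya-Expanse checkpoint and that merge_info.json exposes the affected token IDs (the key name below is hypothetical); Focus or Zett would replace the initialisation step:

```python
# Sketch of the random-initialisation baseline (not the Focus/Zett route).
# Assumptions: the Aya-Expanse checkpoint id, and that merge_info.json stores the
# affected token IDs under a key named "added_token_ids" (hypothetical).
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM

REPO = "transhumanist-already-exists/ayayay_tokenizer"
model = AutoModelForCausalLM.from_pretrained("CohereForAI/aya-expanse-8b")  # assumed base model

with open(hf_hub_download(repo_id=REPO, filename="merge_info.json")) as f:
    new_ids = torch.tensor(json.load(f)["added_token_ids"])  # hypothetical key

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    # Vocab size is unchanged: the new tokens reuse the IDs of the replaced Aya tokens,
    # so only those rows are re-initialised from the statistics of the untouched rows.
    keep = torch.ones(emb.shape[0], dtype=torch.bool)
    keep[new_ids] = False
    mean, std = emb[keep].mean(), emb[keep].std()
    emb[new_ids] = torch.randn(len(new_ids), emb.shape[1], dtype=emb.dtype, device=emb.device) * std + mean
# Train the re-initialised rows with a warm-up schedule, as suggested above.
```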