transhumanist-already-exists/ayayay-tokenizer

corpus-linguistics

transhumanist-already-exists committed on Jun 8

Commit 3a98085 · verified · 1 Parent(s): 50b53cc

Update README.md

Files changed (1)

README.md +6 -6

@@ -79,17 +79,17 @@ Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.
 ## Contents
-- [tokenizer.json](tokenizer.json) Byte‐level tokenizer spec (vocab, merges, model settings).
-- [tokenizer_utf8.json](tokenizer_utf8.json) Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
-- [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json) Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000.
-- [merge_info.json](merge_info.json) Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).
-- [tokenizer_config.json](tokenizer_config.json) Configuration metadata.
-- [special_tokens_map.json](special_tokens_map.json) Mapping of special token (The same with Aya).
+- [tokenizer.json](tokenizer.json): Byte‐level tokenizer spec (vocab, merges, model settings).
+- [tokenizer_utf8.json](tokenizer_utf8.json): Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
+- [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json): Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000.
+- [merge_info.json](merge_info.json): Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).
+- [tokenizer_config.json](tokenizer_config.json): Configuration metadata.
+- [special_tokens_map.json](special_tokens_map.json): Mapping of special token (The same with Aya).
 ## Initialisation of embeddings for new tokens in Aya-Expanse models
 Some tokens are identical to those in the original Aya-Expanse tokenizer. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest—and often effective—alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
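
The random-initialisation option mentioned above can be sketched roughly as follows. This is a dependency-free illustration, not the author's procedure: `init_new_rows`, `warmup_lr`, and the `new_token_ids` argument are hypothetical names, the embedding table is modelled as a plain list of rows, and the noise is matched to the mean and standard deviation of the untouched rows (one common heuristic, assumed here). In practice the same idea would be applied to the model's embedding tensor, with the IDs taken from `merge_info.json`, whose exact schema is not shown here.

```python
import math
import random


def init_new_rows(embeddings, new_token_ids, seed=0):
    """Overwrite the rows of new tokens with Gaussian noise.

    The noise uses the mean and standard deviation of the rows that are
    kept (an assumed heuristic), so new vectors start on the same scale.
    """
    rng = random.Random(seed)
    new_ids = set(new_token_ids)
    # Collect every value from rows that belong to unchanged tokens.
    kept = [v for i, row in enumerate(embeddings) if i not in new_ids for v in row]
    if not kept:
        raise ValueError("at least one row must be left unchanged")
    mean = sum(kept) / len(kept)
    std = math.sqrt(sum((v - mean) ** 2 for v in kept) / len(kept))
    for i in new_ids:
        embeddings[i] = [rng.gauss(mean, std) for _ in embeddings[i]]
    return embeddings


def warmup_lr(step, base_lr, warmup_steps):
    """Linear warm-up: ramp from 0 to base_lr, then hold base_lr."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)
```

For example, `init_new_rows(table, new_token_ids=[2])` leaves rows 0 and 1 untouched and redraws row 2, while `warmup_lr(step, 1e-3, 100)` gives the learning rate to use at each step while the new rows are trained.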