diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..4db0835e9b69a4cdbaa833c96fc6361a8c2c9112 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,30 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bin.* filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zstandard filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text +*sp.model filter=lfs diff=lfs merge=lfs -text +*sp.vocab filter=lfs diff=lfs merge=lfs -text +*.arpa filter=lfs diff=lfs merge=lfs -text diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..1deb33371a70835b6347a64793f14606f3d9acba --- /dev/null +++ b/LICENSE @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. 
+ + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of 
the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright 2021-2022 Eduardo González Ponferrada + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/README.md b/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a72d4697b1650fb063482ee83d5098361dacbdec --- /dev/null +++ b/README.md @@ -0,0 +1,70 @@ +--- +language: +- es +- af +- ar +- arz +- as +- bn +- fr +- sw +- eu +- ca +- zh +- en +- hi +- ur +- id +- pt +- vi +- gu +- kn +- ml +- mr +- ta +- te +- yo +tags: +- kenlm +- perplexity +- n-gram +- kneser-ney +- bigscience +license: mit +datasets: +- wikipedia +- oscar +duplicated_from: edugp/kenlm +--- + +# KenLM models +This repo contains several KenLM models trained on different tokenized datasets and languages. +KenLM models are probabilistic n-gram language models. 
One use case of these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlikely to appear on Wikipedia (high perplexity), or very simple non-informative sentences that could appear repeatedly (low perplexity). + +At the root of this repo you will find different directories named after the dataset the models were trained on (e.g. `wikipedia`, `oscar`). Within each directory, you will find several models trained on different language subsets of the dataset (e.g. `en (English)`, `es (Spanish)`, `fr (French)`). For each language you will find three different files: +* `{language}.arpa.bin`: The trained KenLM model binary +* `{language}.sp.model`: The trained SentencePiece model used for tokenization +* `{language}.sp.vocab`: The vocabulary file for the SentencePiece model + +The models have been trained using some of the preprocessing steps from [cc_net](https://github.com/facebookresearch/cc_net), in particular replacing numbers with zeros and normalizing punctuation. It is therefore important to keep the default values for the parameters `lower_case`, `remove_accents`, `normalize_numbers` and `punctuation` when using the pre-trained models, so that the same pre-processing steps are replicated at inference time. 
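As a rough illustration of the two default preprocessing steps mentioned above (digits replaced with zeros, Unicode punctuation mapped to ASCII), the sketch below mirrors the logic of `normalize` in `model.py` in simplified form. The names `DIGIT_RE`, `UNICODE_PUNCT` and the standalone `normalize` function are illustrative assumptions, and the punctuation table is only a small subset of the full one.

```python
import re

# Illustrative stand-ins (assumption): model.py keeps these as class attributes
# and uses a much larger punctuation table.
DIGIT_RE = re.compile(r"\d")
UNICODE_PUNCT = {"«": '"', "»": '"', "“": '"', "”": '"', "–": "-", "…": "..."}


def normalize(line: str) -> str:
    """Default-style normalization: keep case and accents, zero out digits, map punctuation."""
    line = line.strip()
    line = DIGIT_RE.sub("0", line)  # every digit becomes "0"
    return "".join(UNICODE_PUNCT.get(ch, ch) for ch in line)  # char-by-char replacement


print(normalize("«Hola» – 2021…"))
```

Because the models were trained on text normalized this way, text scored at inference time should presumably go through the same steps; otherwise digits and punctuation would be tokenized differently from what the model saw during training.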
+ +# Dependencies +* KenLM: `pip install https://github.com/kpu/kenlm/archive/master.zip` +* SentencePiece: `pip install sentencepiece` + +# Example: +``` +from model import KenlmModel + + +# Load model trained on English wikipedia +model = KenlmModel.from_pretrained("wikipedia", "en") + +# Get perplexity +model.get_perplexity("I am very perplexed") +# 341.3 (low perplexity, since sentence style is formal and with no grammar mistakes) + +model.get_perplexity("im hella trippin") +# 46793.5 (high perplexity, since the sentence is colloquial and contains grammar mistakes) +``` +In the example above we see that, since Wikipedia is a collection of encyclopedic articles, a KenLM model trained on it will naturally give lower perplexity scores to sentences with formal language and no grammar mistakes than colloquial sentences with grammar mistakes. \ No newline at end of file diff --git a/mc4/ig.arpa.bin b/mc4/ig.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..36f746a8c96cdb98a6c9ef74d6779e7267abc5f1 --- /dev/null +++ b/mc4/ig.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:bf7c27f49a2368b603b363f093417352e36a7227cef7a214631050efbd3c04b0 +size 2025000654 diff --git a/mc4/ig.arpa.trie.bin b/mc4/ig.arpa.trie.bin new file mode 100644 index 0000000000000000000000000000000000000000..7e540655b548a9fca20f1eba56b2dc452eb3528f --- /dev/null +++ b/mc4/ig.arpa.trie.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:557f588644ccca523169495129fa258851a552885ebf400a2a2b9c3d9a26d476 +size 952564451 diff --git a/mc4/ig.sp.model b/mc4/ig.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..df009044e2d8137e905383b90ad4620e60b3d601 --- /dev/null +++ b/mc4/ig.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c987d7b1646ba45136dc06d509a1e667a9337f197db54e0472cc09f4ee4f0986 +size 885958 diff --git a/mc4/ig.sp.vocab b/mc4/ig.sp.vocab new file mode 100644 
index 0000000000000000000000000000000000000000..f52815705914425723371f9ff2987a14479b14a5 --- /dev/null +++ b/mc4/ig.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ca68f0396db1795beac9ca8aa8ed14782fb1cc01c52de7e97b35c5c7b0f0ee91 +size 683760 diff --git a/mc4/ny.arpa.bin b/mc4/ny.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..a58320640ce1593fdc1f61fb13b26bd23cd6f41d --- /dev/null +++ b/mc4/ny.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5d6dbd24886c53dda31f687f0e929473f378d396f97b8cf05bf8e73bdc49e2e2 +size 3556104547 diff --git a/mc4/ny.arpa.trie.bin b/mc4/ny.arpa.trie.bin new file mode 100644 index 0000000000000000000000000000000000000000..9cb4af44516e173b15bcb74cdcbba7db7fef02a6 --- /dev/null +++ b/mc4/ny.arpa.trie.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c385e996bd99aacabcc6479616e8ec8c0eac73c8c1064e68649bc0fc283741cb +size 1694641098 diff --git a/mc4/ny.sp.model b/mc4/ny.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..8ff90805af8e972231cfa7563eecf87f0e8d6d01 --- /dev/null +++ b/mc4/ny.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d45d61f40c2df9f20a85749f43724ad2f475b6ba2fb09d86db2983719cfc2b25 +size 902895 diff --git a/mc4/ny.sp.vocab b/mc4/ny.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..7b5811ac813689f06179fdcbbf2d82b50e703862 --- /dev/null +++ b/mc4/ny.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4367a6be4ca21743c2c465c731d0b3f8c6bfe4bbd2355f79ff1ea64ed5515988 +size 700958 diff --git a/mc4/sn.arpa.bin b/mc4/sn.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..6eb851381ba49a721b0929b13aabc69bd8895842 --- /dev/null +++ b/mc4/sn.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:039060e9ac7db9039c1141c6ba231b4e1cf97a5aee63aeb1f4641042fc08bd4c 
+size 4802972178 diff --git a/mc4/sn.arpa.trie.bin b/mc4/sn.arpa.trie.bin new file mode 100644 index 0000000000000000000000000000000000000000..58ce7b037d05316af902f76fb37515db04d41b7f --- /dev/null +++ b/mc4/sn.arpa.trie.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:82bc0994e6cd2783dda3a23e257727aa28e833f72cdc95be2406b102a2edea25 +size 2331081939 diff --git a/mc4/sn.sp.model b/mc4/sn.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..98832329858165378d025f8e036eb78d155b9ff6 --- /dev/null +++ b/mc4/sn.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:47d52abd2c26da133b6901ef3bfda12d032e7c7e90b668c6c41ab2a179dc66aa +size 900223 diff --git a/mc4/sn.sp.vocab b/mc4/sn.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..542fb22cf14a4ce56f88367bab5b7d0f1a4a7a7b --- /dev/null +++ b/mc4/sn.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:98ee54644a596b0d5f476eaf68726ee442834d9bdcef97716bb7aeb8d758130e +size 698093 diff --git a/mc4/st.arpa.bin b/mc4/st.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..591b7a2442229fbbd611ce1f9aa327b4a31f748b --- /dev/null +++ b/mc4/st.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b1aa917b1f02c4fb2b6927ce2dc46e50d5f8171dc375d333207ddb06c0dbd6d6 +size 2103573048 diff --git a/mc4/st.arpa.trie.bin b/mc4/st.arpa.trie.bin new file mode 100644 index 0000000000000000000000000000000000000000..1fef09a621e79c403e7e1f341950f89638a14bb2 --- /dev/null +++ b/mc4/st.arpa.trie.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:611c4bc26db22fbf4abdc396dd25f9138da577c6ace9997e12a172387d58270f +size 984849770 diff --git a/mc4/st.sp.model b/mc4/st.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..4314fdcb3df63ebfbba3b1102b0e1a3b3f8a117e --- /dev/null +++ b/mc4/st.sp.model @@ -0,0 +1,3 @@ +version 
https://git-lfs.github.com/spec/v1 +oid sha256:3a8e2d5442162fa975ef8bfbdbd2201eaf98bec71ab614bd74036a287974b538 +size 922202 diff --git a/mc4/st.sp.vocab b/mc4/st.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..716fbf717e1eb0cbca6b5095a4837297eabaeab1 --- /dev/null +++ b/mc4/st.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:987599705afc4ad40b02282871783a40464991c9d2cabd529217b049d6371446 +size 720086 diff --git a/mc4/xh.arpa.bin b/mc4/xh.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..dd0229314ebf03fa9e1c788c467931dde81fa24e --- /dev/null +++ b/mc4/xh.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b2c5cb3fee548ce0629849d47747d141efff1b2de7c9dc5e2be166f946b072b8 +size 1917154691 diff --git a/mc4/xh.arpa.trie.bin b/mc4/xh.arpa.trie.bin new file mode 100644 index 0000000000000000000000000000000000000000..2e3186b32027c97585443c1a83acef3ede71301a --- /dev/null +++ b/mc4/xh.arpa.trie.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:013bbccd45e9c775e511870e163b7e03a18bd1e2092c9e283b817b5b47149ba9 +size 928523775 diff --git a/mc4/xh.sp.model b/mc4/xh.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..44b7cd13e003017345cc1f68af0be90977226702 --- /dev/null +++ b/mc4/xh.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8d4278c5447178104c6b689b6e7d85b59cbfe8a9e9054eb5ee47267f9ca86b91 +size 899309 diff --git a/mc4/xh.sp.vocab b/mc4/xh.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..a7fd8261ce3831423887061e1a15c31c78e5edbd --- /dev/null +++ b/mc4/xh.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1a1ef95b95c11a58262d7975acaa3bfc8aaaa23baf731b7c535ec4fa97b35c79 +size 697154 diff --git a/mc4/zu.arpa.bin b/mc4/zu.arpa.bin new file mode 100644 index 
0000000000000000000000000000000000000000..1f3acfdc668903b9ea3eafa12c05959c0451de39 --- /dev/null +++ b/mc4/zu.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1e1393f05347e426060fa98d5b8a763834b35b513579d32ff9937446855283bf +size 6377741974 diff --git a/mc4/zu.arpa.trie.bin b/mc4/zu.arpa.trie.bin new file mode 100644 index 0000000000000000000000000000000000000000..962e0cf4d0082b757abb5adc05791fafba701639 --- /dev/null +++ b/mc4/zu.arpa.trie.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:03d9ec365a667f6061a96f8592cacde990d8dda18783b847dfb4ad7b650de21c +size 3105856237 diff --git a/mc4/zu.sp.model b/mc4/zu.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..8d8c3b2cb135454cba8812cd305787c982602b3f --- /dev/null +++ b/mc4/zu.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e8242c97953d7f31ad00a36fd4b3ea347eb056a5c2a97ceeae983273a8430adb +size 868576 diff --git a/mc4/zu.sp.vocab b/mc4/zu.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..35fa863545f45f85551e2fba80b69c4662292d1a --- /dev/null +++ b/mc4/zu.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0230bbd8d49a103701575ee218420589ee52a644c172084017522fc0733a74f6 +size 666409 diff --git a/model.py b/model.py new file mode 100644 index 0000000000000000000000000000000000000000..5bf4015950324cf9c092bcb2e72c634e2f12e3fa --- /dev/null +++ b/model.py @@ -0,0 +1,163 @@ +import os +import re +import unicodedata +from typing import Dict + +import kenlm +import sentencepiece +from huggingface_hub import cached_download, hf_hub_url + + +class SentencePiece: + def __init__( + self, + model: str, + ): + super().__init__() + self.sp = sentencepiece.SentencePieceProcessor() + self.sp.load(str(model)) + + def do(self, text: str) -> str: + tokenized = self.sp.encode_as_pieces(text) + return " ".join(tokenized) + + +class KenlmModel: + digit_re: re.Pattern 
= re.compile(r"\d") + unicode_punct: Dict[str, str] = { + "，": ",", + "。": ".", + "、": ",", + "„": '"', + "”": '"', + "“": '"', + "«": '"', + "»": '"', + "１": '"', + "」": '"', + "「": '"', + "《": '"', + "》": '"', + "´": "'", + "∶": ":", + "：": ":", + "？": "?", + "！": "!", + "（": "(", + "）": ")", + "；": ";", + "–": "-", + "—": " - ", + "．": ". ", + "～": "~", + "’": "'", + "…": "...", + "━": "-", + "〈": "<", + "〉": ">", + "【": "[", + "】": "]", + "％": "%", + "►": "-", + } + unicode_punct_re = re.compile(f"[{''.join(unicode_punct.keys())}]") + non_printing_chars_re = re.compile( + f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]" + ) + kenlm_model_dir = None + sentence_piece_model_dir = None + + def __init__( + self, + model_dataset: str, + language: str, + lower_case: bool = False, + remove_accents: bool = False, + normalize_numbers: bool = True, + punctuation: int = 1, + ): + self.model = kenlm.Model(os.path.join(model_dataset, f"{language}.arpa.bin")) + self.tokenizer = SentencePiece(os.path.join(model_dataset, f"{language}.sp.model")) + self.accent = remove_accents + self.case = lower_case + self.numbers = normalize_numbers + self.punct = punctuation + + @classmethod + def from_pretrained( + cls, + model_dataset: str, + language: str, + ): + return cls( + model_dataset, + language, + False, + False, + True, + 1, + ) + + def pp(self, log_score, length): + return 10.0 ** (-log_score / length) + + def get_perplexity(self, doc: str, normalize_cc_net: bool = True): + if normalize_cc_net: + doc = self.normalize( + doc, + accent=self.accent, + case=self.case, + numbers=self.numbers, + punct=self.punct, + ) + # Tokenize (after normalizing): See https://github.com/facebookresearch/cc_net/blob/bda555bd1cf1ee2e0b925363e62a61cd46c8b60d/cc_net/mine.py#L352 for full pipeline + doc = self.tokenizer.do(doc) + doc_log_score, doc_length = 0, 0 + for line in doc.split("\n"): + log_score = self.model.score(line) + length = len(line.split()) + 1 + doc_log_score += 
log_score + doc_length += length + return round(self.pp(doc_log_score, doc_length), 1) + + def normalize( + self, + line: str, + accent: bool = True, + case: bool = True, + numbers: bool = True, + punct: int = 1, + ) -> str: + line = line.strip() + if not line: + return line + if case: + line = line.lower() + if accent: + line = self.strip_accents(line) + if numbers: + line = self.digit_re.sub("0", line) + if punct == 1: + line = self.replace_unicode_punct(line) + elif punct == 2: + line = self.remove_unicode_punct(line) + line = self.remove_non_printing_char(line) + return line + + def strip_accents(self, line: str) -> str: + """Strips accents from a piece of text.""" + nfd = unicodedata.normalize("NFD", line) + output = [c for c in nfd if unicodedata.category(c) != "Mn"] + if len(output) == len(line): + return line + return "".join(output) + + def replace_unicode_punct(self, text: str) -> str: + return "".join(self.unicode_punct.get(c, c) for c in text) + + def remove_unicode_punct(self, text: str) -> str: + """More aggressive version of replace_unicode_punct but also faster.""" + return self.unicode_punct_re.sub("", text) + + def remove_non_printing_char(self, text: str) -> str: + return self.non_printing_chars_re.sub("", text) diff --git a/oscar/af.arpa.bin b/oscar/af.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..979247ba6c8fc0d6130c0f664e56752d1bc1f094 --- /dev/null +++ b/oscar/af.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ad1c7a15e7dc4552fbe331387cf1d2aa9b2995354ff58743969d943e4fbc9a2b +size 1699310488 diff --git a/oscar/af.sp.model b/oscar/af.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..e69179bba713ae22f661dd721c8c8e8035ff6b6a --- /dev/null +++ b/oscar/af.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a3249527d2b6fff0db32f3feeadf4acc806034c58aa2877b7804c1dbd079faf0 +size 965654 diff --git a/oscar/af.sp.vocab b/oscar/af.sp.vocab 
new file mode 100644 index 0000000000000000000000000000000000000000..4bcb1a81170526a8f21ac15f0980986c110d3bbe --- /dev/null +++ b/oscar/af.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:141388a76a9e49b671152fe2962f8114e2aab1fef540751f385a10e6dd0fe3a6 +size 763598 diff --git a/oscar/ar.arpa.bin b/oscar/ar.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..2d390e329bbe7e0da6ca7b8462ce0fd16371d8c3 --- /dev/null +++ b/oscar/ar.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8e845ca65266408136026abc25279a7e9df45d6d4c48a16f15fcbcacb48ffd43 +size 22880497746 diff --git a/oscar/ar.sp.model b/oscar/ar.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..5373bd11ff2eb4c486c31935639929baaaf5833a --- /dev/null +++ b/oscar/ar.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:88b367b53707607953aeb7ca5e28d6904aedae50278cdd799fd15e5d5088bd4b +size 1073054 diff --git a/oscar/ar.sp.vocab b/oscar/ar.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..c75073aed1490cc05d6a4fa38f1cb5c25e040db8 --- /dev/null +++ b/oscar/ar.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:fd019d3c9e386bdbaeb1e37a6faeb2457b761142a852f2a9e97ef79dffbb6254 +size 870884 diff --git a/oscar/arz.arpa.bin b/oscar/arz.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..cf43e49a851713c9868ff8c344c588386fa95fbe --- /dev/null +++ b/oscar/arz.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8d84ceb2b604df092a6862acc780f13d294ffcee482a6fdda418e0f2b0c83b87 +size 288231441 diff --git a/oscar/arz.sp.model b/oscar/arz.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..5f69e67035cc1421cba24ae43e0f2b5ad7220efd --- /dev/null +++ b/oscar/arz.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:b98bc5055ff7f03c83c37da78933399ab5d13e26fa9ae634d95a8a4c1360e895 +size 1063144 diff --git a/oscar/arz.sp.vocab b/oscar/arz.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..ee65b7c1ae2ab482c85ea97fcb9c24fb0af3e049 --- /dev/null +++ b/oscar/arz.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:3e05eb13200da58a46dc4feec9d2daac5c167f11caf5aed860aca503c4122e77 +size 861047 diff --git a/oscar/as.arpa.bin b/oscar/as.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..252aa167b762db77c1e3aa36b0fbcdc2552d9d20 --- /dev/null +++ b/oscar/as.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f6f0bef94b578617b8bd942b108938496a1c21e9be8b645db256bc3e54fedb54 +size 410830531 diff --git a/oscar/as.sp.model b/oscar/as.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..d7b69a9b111b28c03cb15ae47f733f6904b2341a --- /dev/null +++ b/oscar/as.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:3e5d588d095646ba80f0cfd15e83d418852fde6d97861bc6c4ba655f4b06d476 +size 1274635 diff --git a/oscar/as.sp.vocab b/oscar/as.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..b09013805eca4c7de705635c44f93737c57b941b --- /dev/null +++ b/oscar/as.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2c3e4bc59c25b48174798a378e433b2d966fdfc3ac84ded0519348d953bc4ece +size 1072886 diff --git a/oscar/bn.arpa.bin b/oscar/bn.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..f4a426101bf3b6b285db2cad2a4f8c6014540c82 --- /dev/null +++ b/oscar/bn.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:50fe651938e0286e9b0be63d212658573dd5ed3f1e6061d09f006bc5393a114a +size 16813119142 diff --git a/oscar/bn.sp.model b/oscar/bn.sp.model new file mode 100644 index 
0000000000000000000000000000000000000000..e1a8d72bd42b07317ed7acc4ecc79c721ead5bf4 --- /dev/null +++ b/oscar/bn.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:fd007f17eb669e546d0564896b514a8816cf1805b7ef1ab53879fcf467f49410 +size 1367929 diff --git a/oscar/bn.sp.vocab b/oscar/bn.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..a831bd06f18a80c90d8ddb6847eb7129d0266206 --- /dev/null +++ b/oscar/bn.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6c51b211e9e184239c53e9bca1b44e56e728c55802adc6854d8ef9c9ba0a74c9 +size 1165756 diff --git a/oscar/ca.arpa.bin b/oscar/ca.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..80f756060376c9c1c5c64395a57b8c3e49e1c9a2 --- /dev/null +++ b/oscar/ca.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:db1f9b6be02b3df19df0ba55ff697607ef7b4ca3a5045a4211f2c4f6ca9ee788 +size 12565197725 diff --git a/oscar/ca.sp.model b/oscar/ca.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..71ec5b4f76c020bd97419fe9544b2013f74e414c --- /dev/null +++ b/oscar/ca.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:3b8ac5a125f25995996995f4ac14acfcfdf8125657e22ab0b6897b39044d22b9 +size 960913 diff --git a/oscar/ca.sp.vocab b/oscar/ca.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..810d84066346a188e4355df93611cc2871f53f8f --- /dev/null +++ b/oscar/ca.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e30e0486b473439e2c96eae51689b220529e40de7e6eee8f586ffc09afb89d3b +size 758607 diff --git a/oscar/en.arpa.bin b/oscar/en.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..d77868e8631bcd8b79aded943376864bdc0024e0 --- /dev/null +++ b/oscar/en.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:c85daa6f6a85d4e4ec14b1695d82bdc47ca6fd8f8034ecc4947e383aadf0d6f8 +size 34132973800 diff --git a/oscar/en.sp.model b/oscar/en.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..bbdee7f77c106a33a1834686946ccff1988d31c2 --- /dev/null +++ b/oscar/en.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d81c313df13d5194f65a7ef37d583934f4c151091c68c70a72625f98232c7223 +size 936812 diff --git a/oscar/en.sp.vocab b/oscar/en.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..4f61e063711e78fb281eaf8a82e7f123a2159b1b --- /dev/null +++ b/oscar/en.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6a08d8e45267868c780d86c7ae2829bd00dec3621bf991bb99ed4deafb36c88d +size 734613 diff --git a/oscar/es.arpa.bin b/oscar/es.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..d630d4b96c8d0848b4ea933e7d31506c88a65fc4 --- /dev/null +++ b/oscar/es.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7665a9e1a610104827a3c4e6c4b86a3a6c4a9507f7fd74a2e06673134d39bf62 +size 21752241307 diff --git a/oscar/es.arpa.trie.bin b/oscar/es.arpa.trie.bin new file mode 100644 index 0000000000000000000000000000000000000000..7c3fb914f539ee3cf41fccf2436cc67dde9a46da --- /dev/null +++ b/oscar/es.arpa.trie.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7cae1d0ee375a65e6dd70e04a4f42d18d9717f39bb4d786d6fcf294fc2163e2a +size 10317550833 diff --git a/oscar/es.sp.model b/oscar/es.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..8679955bfe11ab9c1f341e22464603050b773e0c --- /dev/null +++ b/oscar/es.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:60cf65b358504763244160f45ef4eaf23f53420d50a729368e10b1b8f110fe83 +size 968322 diff --git a/oscar/es.sp.vocab b/oscar/es.sp.vocab new file mode 100644 index 
0000000000000000000000000000000000000000..831c1cd59c84b80eee823d12f4f007e043cd2ebe --- /dev/null +++ b/oscar/es.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:936e2334f04ff9a65db238735c7d624e5917fa757c74a02e0463fe7aca162d25 +size 766070 diff --git a/oscar/eu.arpa.bin b/oscar/eu.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..9c0fb7be7576f9ed5801b65512cdf03cacda6de6 --- /dev/null +++ b/oscar/eu.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b080c9f7ce5fbd928b5092e9a24d2e5feec9a0ce951ed7d58e86378b75a3adef +size 3448327088 diff --git a/oscar/eu.sp.model b/oscar/eu.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..a7ca3e1741a4ff9c3e2fc1f48d403d51bcedc60a --- /dev/null +++ b/oscar/eu.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:acc5a5670ead3f43ef434cb2537d1b2845387b4a9161a009bbab08a35356ca91 +size 977887 diff --git a/oscar/eu.sp.vocab b/oscar/eu.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..4b2809cd41a9904ce4c739dbbabd213b24f8503b --- /dev/null +++ b/oscar/eu.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:87b240e0281e67747af45bbc88037d2d3ab5ca5a66e014b2d3dc95e2dc12d987 +size 775786 diff --git a/oscar/fr.arpa.bin b/oscar/fr.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..259a8918feb99dec4cbf2ced5bcad502dd0a13d8 --- /dev/null +++ b/oscar/fr.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:fc4e68abb3a035c0b7fea192ebb3d2aa0344932b95539bd4ea09175b0a9cb957 +size 17909261916 diff --git a/oscar/fr.sp.model b/oscar/fr.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..f25f84c1fe33f2358efc9920d44743433e1d13f8 --- /dev/null +++ b/oscar/fr.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:226f8995135e2563bc07082d663a1bb3964791114c0ab493603839d7ccff6123 +size 953526 diff --git a/oscar/fr.sp.vocab b/oscar/fr.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..c087fcb37fb11c82753bd397d19505a9a1760b3b --- /dev/null +++ b/oscar/fr.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0b6ef91e7cc3420593f55b3071e9fc79a0eb88133ca6652f77c841dd9ad37397 +size 751344 diff --git a/oscar/gu.arpa.bin b/oscar/gu.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..24fa2c6e90aa22ef6e6b5ea197867a1003e576b4 --- /dev/null +++ b/oscar/gu.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:bedffe2f3ae219eed6303fadcc3e11d3bc345e6cd6aeabb94afd6241fa2110bd +size 3409368861 diff --git a/oscar/gu.sp.model b/oscar/gu.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..e92f1ea4d162925b2c90848e0d156c5e6d3f96b3 --- /dev/null +++ b/oscar/gu.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:61baa071c5da3bba96f7de66d592acfeb4b54dbd1bd5a64f85b4f4f505c0d483 +size 1279618 diff --git a/oscar/gu.sp.vocab b/oscar/gu.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..78ea425b08149440bd0a7535e7d581eb924e4df0 --- /dev/null +++ b/oscar/gu.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c8f3df0e3182ecabc498f1cf4991bf05d5f1abe1ba416eef3e00f99a3101e469 +size 1077391 diff --git a/oscar/hi.arpa.bin b/oscar/hi.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..7f11f50f69cf8f6a282a04cc02b63468960bce83 --- /dev/null +++ b/oscar/hi.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ddc0dc7b1e704d0921693e7410dae55eddc45445b38f8843ab29356970c56adc +size 15540100982 diff --git a/oscar/hi.sp.model b/oscar/hi.sp.model new file mode 100644 index 
0000000000000000000000000000000000000000..205321d4d1c49094495df1acee8ce4b4375597b4 --- /dev/null +++ b/oscar/hi.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7557f95d5b53f9b86d450d68b8f4ef00ee3b71d7d54f19e16a0014a455889b20 +size 1247474 diff --git a/oscar/hi.sp.vocab b/oscar/hi.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..2afa933ef070a2d2e1d1e3fee2f0572673276067 --- /dev/null +++ b/oscar/hi.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2c721b86b60b9784443d0c9429eaab4487a1e8c98b1dfefc07643a9443ed4b70 +size 1045400 diff --git a/oscar/id.arpa.bin b/oscar/id.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..335f901f86577347dcfa647a2b8c2e528677a548 --- /dev/null +++ b/oscar/id.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:041de8869bd35ea3974273121e0d8032710a32a9830d8fc87bace1989bbe7798 +size 12521082121 diff --git a/oscar/id.sp.model b/oscar/id.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..f07bc312f8bb283d2d72401cd0bfb671b24ce2ca --- /dev/null +++ b/oscar/id.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:22cf0d46a8d53615f9315ce19627a7d13b13b10737f0885ff5fb501fc3e9c3cd +size 938674 diff --git a/oscar/id.sp.vocab b/oscar/id.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..f2aa3cbb7d52a1a8025e95546f686d0e1816e6ae --- /dev/null +++ b/oscar/id.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:254a7ac16e6226f10616150bebdd4e927946310d534541645a8d2b32703f75ea +size 736567 diff --git a/oscar/kn.arpa.bin b/oscar/kn.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..91e69e8691e61a4ebb1ba3c9fc85ebf361b4fa02 --- /dev/null +++ b/oscar/kn.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:aae01ed67f8ed9c0fe3b64a412920a057bc8543bc7f25502b54a34fd4466a7f7 +size 4244656986 diff --git a/oscar/kn.sp.model b/oscar/kn.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..df87ddaa5bc6996c1e07d5daae2cda585d09ba41 --- /dev/null +++ b/oscar/kn.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2b26e2ccd47828938f8dc37cb8db2f8ae08c27f24f0cb31bb1eb7260f3033350 +size 1450801 diff --git a/oscar/kn.sp.vocab b/oscar/kn.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..1d4bd7f31ad9fbf55d74b2aecaa58b246c391ad3 --- /dev/null +++ b/oscar/kn.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f727e38c84d5f88455c0a55a70cf4db91b024f9e2180d8f73bcc05467d9af97f +size 1248652 diff --git a/oscar/ml.arpa.bin b/oscar/ml.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..16655f918428bc210872c0e1451f0746c6b694a6 --- /dev/null +++ b/oscar/ml.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6fb33c87c6c63c2082cbe61f99651c893461920413e43acc3d0a9333fbf05be5 +size 8838203502 diff --git a/oscar/ml.sp.model b/oscar/ml.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..95f63b8367fa20899d99a498c6d165df3d40f635 --- /dev/null +++ b/oscar/ml.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2c77be766284f6ebe1713c5ac9b6fd9e17c0e535026979165c8e429127e4485d +size 1502428 diff --git a/oscar/ml.sp.vocab b/oscar/ml.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..d147e833736668f349e0e4a378bc1f55f89bb78e --- /dev/null +++ b/oscar/ml.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:67f8d1f69088d91793fe07f9fee15aa36287b936280e02db501d79bbbf43de04 +size 1300245 diff --git a/oscar/mr.arpa.bin b/oscar/mr.arpa.bin new file mode 100644 index 
0000000000000000000000000000000000000000..e4e0e8c5d6b923c7b97efddbf23556790acdd3cb --- /dev/null +++ b/oscar/mr.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:77cc71c6ea7f6ae4cd6348e4305b2ca7b39cad73c0b638996605f9872c8bc008 +size 5852783358 diff --git a/oscar/mr.sp.model b/oscar/mr.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..811932cd6650c654f9b193b2c155b8d97970783d --- /dev/null +++ b/oscar/mr.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e5821af648d2b6129dd3508018ba34a6287e02e99eacc4135d992af153896656 +size 1366052 diff --git a/oscar/mr.sp.vocab b/oscar/mr.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..fba2e7a6999e8b581c5fb701a6ae73851ec6ad35 --- /dev/null +++ b/oscar/mr.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:cf96ab55b126ce759e348ea724315910387b8eccee129c56e0795026314498c0 +size 1164166 diff --git a/oscar/pt.arpa.bin b/oscar/pt.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..c85a796c63f111eae5c17c6b73702b9125cadc46 --- /dev/null +++ b/oscar/pt.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e56b42e488023ebf49ceabe355268dc5095421e71d2c82491e918a13c5433e32 +size 19366111307 diff --git a/oscar/pt.sp.model b/oscar/pt.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..d5f0dc8ad16a2987d1959102f1b230aa5cc367e2 --- /dev/null +++ b/oscar/pt.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:be19bbb4209c7631a4f6399b3b4fb5e066e8d32a281741676c37319c5332a675 +size 969085 diff --git a/oscar/pt.sp.vocab b/oscar/pt.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..f212cea6ea4aa317703632a6c2c88dc05e18edb7 --- /dev/null +++ b/oscar/pt.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:3c0739faee270544af79d120ac251d26d70dd29b97367d96c0fc50e1448cccd6 +size 766930 diff --git a/oscar/sw.arpa.bin b/oscar/sw.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..e7116a74ed924639552cb7f2e5b9eda697228799 --- /dev/null +++ b/oscar/sw.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0a23bdecb335b989e245a824245b4db3de6d007f66da1129324b122fce2eb77c +size 90715311 diff --git a/oscar/sw.sp.model b/oscar/sw.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..c68ca1e31aabde2dd8ef6767e6d1f515115c6f7b --- /dev/null +++ b/oscar/sw.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7c09895923bc10d6396c49a0876b692e152e6fd8a4ae060d48bdc5f51fb38199 +size 901280 diff --git a/oscar/sw.sp.vocab b/oscar/sw.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..23c4b9bfec426b63bffb386a589be3510e9f6e49 --- /dev/null +++ b/oscar/sw.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9855dede77751446eb61e92758c392ff9b989699ad7cca3572328c5846b9f3ce +size 699555 diff --git a/oscar/te.arpa.bin b/oscar/te.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..e10a7efa5a98b5b49bb36e4bb22fe67fe74e09c7 --- /dev/null +++ b/oscar/te.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2994526917f0ef27dc093e724896b1e783aeb177edf555ff57fde2b7a6410cbc +size 6404163397 diff --git a/oscar/te.sp.model b/oscar/te.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..23144a53fd017738d4b98939ae39384d6158b37c --- /dev/null +++ b/oscar/te.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8d732b270553f0948fb2e34628f6d1de4ea9b1f6a01aa78d0d957e8e6c60bb9d +size 1424762 diff --git a/oscar/te.sp.vocab b/oscar/te.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..4f105c4d87318ddc64c1d5cacd9ec7c6bc734fbe 
--- /dev/null +++ b/oscar/te.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:47c2976c977317564316305db592af74d19f5a08036775459e19bab04cc97104 +size 1222616 diff --git a/oscar/ur.arpa.bin b/oscar/ur.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..dc1e7fc5d643c056e730cc02d78dc89149927172 --- /dev/null +++ b/oscar/ur.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c5fbe546cfb43e9b458e7f3901fac74ea3e43493620cd6217a496ac7a8fea802 +size 495308934 diff --git a/oscar/ur.sp.model b/oscar/ur.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..e989160fbca123364bec39d6e7ca4a60d6357acf --- /dev/null +++ b/oscar/ur.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7f4be5d42c2223ecc1411b48fd97e42738174633ff983aab78bec3c0cae552d9 +size 1041320 diff --git a/oscar/ur.sp.vocab b/oscar/ur.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..6762a7eec6becb4dd632e9d4ea4047a15b17cd5b --- /dev/null +++ b/oscar/ur.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9ade1608d8ddcfdbb311a48336396e581e300a808ba3a22e865a39486a6b4d1f +size 839343 diff --git a/oscar/vi.arpa.bin b/oscar/vi.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..867c72e0edeb0e4b165e9ff1ed8b7b1dcf46d207 --- /dev/null +++ b/oscar/vi.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:538af9edb1a367fe68efe1af642981f70c38236fd569d4df733112fc487ce44b +size 2121892252 diff --git a/oscar/vi.sp.model b/oscar/vi.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..7f3d205f3bff4db24fa4701206bf9284973252de --- /dev/null +++ b/oscar/vi.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4723afd4735ba691e65bf12eecb65e17b884f38d5a097fe74bf7e6090780d5f4 +size 877199 diff --git a/oscar/vi.sp.vocab b/oscar/vi.sp.vocab new 
file mode 100644 index 0000000000000000000000000000000000000000..9009cc8c5e88add0e668d7bcadef925daa0e8131 --- /dev/null +++ b/oscar/vi.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:26cd29f33f9925833ea76a04de13ed15f0c21da84558f4698f2ee6b6a147b6ae +size 675089 diff --git a/oscar/zh.arpa.bin b/oscar/zh.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..39d39488716ae15656b8b2577ae82a7bdcf3f53a --- /dev/null +++ b/oscar/zh.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ef2e80b4126db36b4fcdbd0d634b2f7ee3106caf83b300f83fae606945114ac0 +size 8846556453 diff --git a/oscar/zh.sp.model b/oscar/zh.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..66a9d578934bbdbf0731d2dd52bf62b15885db76 --- /dev/null +++ b/oscar/zh.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:12db7b5ffe942ba3b05424f12ee742a85ed294ec1eac01889f4d4385d0c9dc2b +size 862816 diff --git a/oscar/zh.sp.vocab b/oscar/zh.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..6ac27bc109cc65b16722ad7cbc859732b084157a --- /dev/null +++ b/oscar/zh.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b8080d4ee148251704bb5701b18c3b45a7fd9d94f5fc0dd7b118b8dd8c195217 +size 660710 diff --git a/wikipedia/af.arpa.bin b/wikipedia/af.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..7025231e46c65181fb0f79c8c85d1587e8403590 --- /dev/null +++ b/wikipedia/af.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:59a052343c45c83892b62bb49229b4a3c282386c6e33c225bd566a1692d55ffe +size 425344923 diff --git a/wikipedia/af.sp.model b/wikipedia/af.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..555ac41192905f23c886cc714778f683223ef9cf --- /dev/null +++ b/wikipedia/af.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:ea0c0a9c656190eebc9a75618dda128044480ed596a9586e119705793bf0c288 +size 963253 diff --git a/wikipedia/af.sp.vocab b/wikipedia/af.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..79e6a1a3c79f7184dd871f475002c1cdf75170bd --- /dev/null +++ b/wikipedia/af.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a076a1df43cc9cd2c16df17c5199913a3f4f96314650087081bc3f86c3ace125 +size 761249 diff --git a/wikipedia/ar.arpa.bin b/wikipedia/ar.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..02336a2d9efaac09e6aaab90fffbf78a68741770 --- /dev/null +++ b/wikipedia/ar.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e5ad5fe3355e9775d0045ac38ee24ef585b373c99350bc612e5bda9cbdd701fe +size 2824717990 diff --git a/wikipedia/ar.sp.model b/wikipedia/ar.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..706e096c46ca062d1447fc0d244561d65319da7f --- /dev/null +++ b/wikipedia/ar.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:281e3d75365a1801a8fe5def0b89dd0e5bb73ac0a2451be8bc5a55495760e115 +size 1070890 diff --git a/wikipedia/ar.sp.vocab b/wikipedia/ar.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..f3ca432480105a1fa70f4fd972facac46041788d --- /dev/null +++ b/wikipedia/ar.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d67edf7b7a404e5e381153bd13f394899268848ce7df6e832ea7cddab22ccbc4 +size 868550 diff --git a/wikipedia/arz.arpa.bin b/wikipedia/arz.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..797b13eaf65a3ca7a2f19a1847d9d81f31a041df --- /dev/null +++ b/wikipedia/arz.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:37eead92ad7aae94e6e6ba0652073e19dc6c052e314364eeeed059c7b20ed1b6 +size 358578027 diff --git a/wikipedia/arz.sp.model b/wikipedia/arz.sp.model new file mode 100644 index 
0000000000000000000000000000000000000000..feeecd918316f29c4df4f3f404bc7a1888009048 --- /dev/null +++ b/wikipedia/arz.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a88f8508bc163b055dd36e2efb660ca527887009522ef075bef816bc8309267d +size 1024148 diff --git a/wikipedia/arz.sp.vocab b/wikipedia/arz.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..a82462aeaf2822d6f770a389cd0272165df175d0 --- /dev/null +++ b/wikipedia/arz.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f6b2788a17e037da07f7368179da97bae1a2f9fd9226e6ecda02df308a11bfcf +size 822247 diff --git a/wikipedia/as.arpa.bin b/wikipedia/as.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..e2b76df92a6640ac967f74a37abb9ceab2eca2b0 --- /dev/null +++ b/wikipedia/as.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2bd609abe5f749d6afc6be4bfff497546fa9cd62dc97c7cc3cc04cca4d3e003f +size 82081501 diff --git a/wikipedia/as.sp.model b/wikipedia/as.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..796d47b2b32d988519047fc543a37b0c0ea914e6 --- /dev/null +++ b/wikipedia/as.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6594e4bf5a52244f97725148e9e2fc509c7209d43c6434a1089954b33d9abc5f +size 1253635 diff --git a/wikipedia/as.sp.vocab b/wikipedia/as.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..c1b02ee51c3b54b4a997a2fce56fee4297e8062a --- /dev/null +++ b/wikipedia/as.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:30eca87f359b9fc5656927f99e81b3bc76c6df26ed75103d3c5b415df160e7c0 +size 1051292 diff --git a/wikipedia/bn.arpa.bin b/wikipedia/bn.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..a17d9d0c066e92df78405c9b52c2e18eb8cbda4d --- /dev/null +++ b/wikipedia/bn.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 
+oid sha256:9d2148db7af960f9468adc5c0b2c39f75d969b816014098c45de93517ac1c555 +size 612069451 diff --git a/wikipedia/bn.sp.model b/wikipedia/bn.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..a78256fcccf0325b434319af4634eaff1cc46979 --- /dev/null +++ b/wikipedia/bn.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e4e90d25ca0c465b9c8da8c93b6591be30dbb8c123b24d52177c2027f81a264d +size 1366364 diff --git a/wikipedia/bn.sp.vocab b/wikipedia/bn.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..3dddd1e462aff8bee6050b571e90815344dac388 --- /dev/null +++ b/wikipedia/bn.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:33f918c7300e965c8c2f74ddc9ca6ca741c0b06779216a8eed36b7a269b25754 +size 1164806 diff --git a/wikipedia/ca.arpa.bin b/wikipedia/ca.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..f34ded13f981c30eabb0d5a71ed76f4edf2c0660 --- /dev/null +++ b/wikipedia/ca.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2ece1e503d4b44409069ea9c5c5125b74792b575143169e08cf9a27248f9a78e +size 2809368958 diff --git a/wikipedia/ca.sp.model b/wikipedia/ca.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..192debd00465dd3bb3a4481ac509de34e5ca7f26 --- /dev/null +++ b/wikipedia/ca.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:abc6936e2ff5dcdc86962ffaeef48ef66f567d568ef7090d28123ed6618b455c +size 946977 diff --git a/wikipedia/ca.sp.vocab b/wikipedia/ca.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..dacbee1e658ddc12212f174f4f753ee106b56dbe --- /dev/null +++ b/wikipedia/ca.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b6565aa2f52125563e36de113de83c2deb2febb7ab762925e8ded4de9d51dec5 +size 744547 diff --git a/wikipedia/en.arpa.bin b/wikipedia/en.arpa.bin new file mode 100644 index 
0000000000000000000000000000000000000000..b74834cdda59ef28c35b172721256d427086ddff --- /dev/null +++ b/wikipedia/en.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:04923fccbb4e63005c40f01d66112659416de01accd80d16e366a592289ee07a +size 4444690658 diff --git a/wikipedia/en.sp.model b/wikipedia/en.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..d5cd3c4f88420f22d0a8a7123311ce894baec8ac --- /dev/null +++ b/wikipedia/en.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:cf8147a573770b4e6c0d4df1dcb75453baa88190706dab406be7711b84f059de +size 931348 diff --git a/wikipedia/en.sp.vocab b/wikipedia/en.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..5f102f5b65ce8a218a9ad678dd68f242ca540611 --- /dev/null +++ b/wikipedia/en.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a9c3c51a7736d736cc620cbe9a4c9430533469e57a54bc29546067a252f7d872 +size 729017 diff --git a/wikipedia/es.arpa.bin b/wikipedia/es.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..56aac55a53d45f8a0c118fb05c8783ea926c4240 --- /dev/null +++ b/wikipedia/es.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:26964ff8185eb105021fc0e9eaa0a1de590c4a12f8aa3fe12112b29d42281cf3 +size 3828418653 diff --git a/wikipedia/es.sp.model b/wikipedia/es.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..55c14b0e2e5e0cc9f64a3603fad0727387e9c725 --- /dev/null +++ b/wikipedia/es.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:aae545566a995d3374fbc8ac1d4e0c7073008da8ae32acfe7f176136a8efcf37 +size 961535 diff --git a/wikipedia/es.sp.vocab b/wikipedia/es.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..293ed725feeea5e7eeaf1b33c51f4ec446f69157 --- /dev/null +++ b/wikipedia/es.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 
+oid sha256:512a75ae6d672411d7b88e5d4a21fb4f6e0692251aca611a06dcd0b09fcfd309 +size 758934 diff --git a/wikipedia/eu.arpa.bin b/wikipedia/eu.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..4e78c2b4e4b86eeb7a87ac32cd061809848762a9 --- /dev/null +++ b/wikipedia/eu.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2d04c4d1233b40044e2facc978987ecd4a6d4f84032f2af3f85f7079676fa08b +size 774011873 diff --git a/wikipedia/eu.sp.model b/wikipedia/eu.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..eea2d3ff9efbbd0b2b3c4998e0db8d38284a110c --- /dev/null +++ b/wikipedia/eu.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:447cbd1714e51e6a7b4dd8ff55b7bd975fdb7f6ba873cb6f8a1fe36b5867dbb6 +size 955869 diff --git a/wikipedia/eu.sp.vocab b/wikipedia/eu.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..ad1ade04add487e9bb0bb33edd9f609dcc418b30 --- /dev/null +++ b/wikipedia/eu.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c1c3804a145aa2e8c6e402aa545995f6b7cb25a88fcd30941e93f4468e9445b2 +size 754016 diff --git a/wikipedia/fr.arpa.bin b/wikipedia/fr.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..ce37085daf3a67245cbf521688bcaab8e6e3f1f7 --- /dev/null +++ b/wikipedia/fr.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:301c82d52a8e34f63937afc12970794c8783244c8c0b085a8bbfb0d54dcb9374 +size 2829042764 diff --git a/wikipedia/fr.sp.model b/wikipedia/fr.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..cca05bbf3f6c79cbcabfae11ca5b082bc7891ac8 --- /dev/null +++ b/wikipedia/fr.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b1b70d5e6556ad245e02ac76919a714ad0b7d288955df65ecd3831a42950b653 +size 942639 diff --git a/wikipedia/fr.sp.vocab b/wikipedia/fr.sp.vocab new file mode 100644 index 
0000000000000000000000000000000000000000..2edd93266676586536047f8b83fa329df299a760 --- /dev/null +++ b/wikipedia/fr.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b756874e5a729ce6a5fa9ab6be9b9fd128f8bae8df11b5182eebf4a43be217a0 +size 740441 diff --git a/wikipedia/gu.arpa.bin b/wikipedia/gu.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..bdd6a84ec3dcea4a9f0d729e78ba822f0b96a461 --- /dev/null +++ b/wikipedia/gu.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:014a37ee3a33b292c6adaca1b215805d3e01f16a3a805744a5734c4d64226a42 +size 73964184 diff --git a/wikipedia/gu.sp.model b/wikipedia/gu.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..b32f109e0c6b49d21798d73d850f10a961024bbd --- /dev/null +++ b/wikipedia/gu.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:14ed9cacbf0af4de675008bb91e092f87000ce4da695d4ab746b5e2791a3b0b4 +size 1239566 diff --git a/wikipedia/gu.sp.vocab b/wikipedia/gu.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..89c5a8b1a0fa5b60eb0820f7376b38a23b132bcd --- /dev/null +++ b/wikipedia/gu.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0def0145c6f6fd6f4ddab6ad220372c380591f2a1460ebb9d8d9dfc8b4baa35c +size 1037538 diff --git a/wikipedia/hi.arpa.bin b/wikipedia/hi.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..4134e9ea58c544af483abada0672570c70237696 --- /dev/null +++ b/wikipedia/hi.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f76e9238ccab63fc175ed40786888c0078cc7bb1de9519536a89473a60a17f8d +size 547247715 diff --git a/wikipedia/hi.sp.model b/wikipedia/hi.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..b62f61e983e2558f2551863fb535315f970e567b --- /dev/null +++ b/wikipedia/hi.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 
+oid sha256:bd2408405c7884b129600c427c5ccb919a8f5a5597437e4127ee20b85a70ab4f +size 1256555 diff --git a/wikipedia/hi.sp.vocab b/wikipedia/hi.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..65d88f445b4943cc74eac785f95913842c2621ee --- /dev/null +++ b/wikipedia/hi.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4c7900bdbdbe9d44a5de3c6fd6675d1230484750e7967d739d214edba17d1f95 +size 1054438 diff --git a/wikipedia/id.arpa.bin b/wikipedia/id.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..78faacd90dc0d4b5e419e54c019b6f48870b7d4c --- /dev/null +++ b/wikipedia/id.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6e099b6216a558d6c6f6108895e2e13fbc6ffd00b59791d16d6a5f85103ac0be +size 1847280248 diff --git a/wikipedia/id.sp.model b/wikipedia/id.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..24ae7910761c798089f0da99992cb0679423ee95 --- /dev/null +++ b/wikipedia/id.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b217615a7b185e5e0c967ea5b7156fe149145221e32a54b96dfed15d98b3c807 +size 926624 diff --git a/wikipedia/id.sp.vocab b/wikipedia/id.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..e596d1a9857b77cb04a12c8b8fc5a826bfaa2f11 --- /dev/null +++ b/wikipedia/id.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:61fa5857ba776aae9d8258297b8bba4d3ca6f05aaddc0d18a8a6dab32cb15d37 +size 724181 diff --git a/wikipedia/kn.arpa.bin b/wikipedia/kn.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..adad1a31974d5f05792b8e0bfd61a16954ed685c --- /dev/null +++ b/wikipedia/kn.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:83b27c4e22da4ec1e8750eb20a68e06db6edb793d2f8af66b7e0177b07f36f7b +size 190340463 diff --git a/wikipedia/kn.sp.model b/wikipedia/kn.sp.model new file mode 100644 index 
0000000000000000000000000000000000000000..3340a5649f537db2c4d74c92daa08e850d637b56 --- /dev/null +++ b/wikipedia/kn.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ac335a4fa39266be5e116db3d93a1f650237e310bd5056144d8def2cd86f8512 +size 1481373 diff --git a/wikipedia/kn.sp.vocab b/wikipedia/kn.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..cb91c818fc2e8cf37d3809bb8da26cc85db0756b --- /dev/null +++ b/wikipedia/kn.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:46fcfb39272b4d831a97cc137525458394a808819cf1453a7838ff08aed6ba17 +size 1279698 diff --git a/wikipedia/ml.arpa.bin b/wikipedia/ml.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..2d9c8f2ace32069401688a833b9a7dd118d150f9 --- /dev/null +++ b/wikipedia/ml.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1f36870e83f406ebfb0431100cdc00bab4ea07a6f46ee75e7f213877af6d81c8 +size 494747731 diff --git a/wikipedia/ml.sp.model b/wikipedia/ml.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..3ea0a4c2953c9f63c9cc510576047f96b88a3ead --- /dev/null +++ b/wikipedia/ml.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f4c4d8a67471b2bf0cb02cabb6fb3fedea0ee8e1df4b7ce0fd45f639e48d39cc +size 1468485 diff --git a/wikipedia/ml.sp.vocab b/wikipedia/ml.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..eeeae57498d97f771b9e5edafc6de0575f7abc52 --- /dev/null +++ b/wikipedia/ml.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:35bd1fe3528eb34e4022ae7b76e42245d7555a80234687986193f56b9f265ad5 +size 1266139 diff --git a/wikipedia/mr.arpa.bin b/wikipedia/mr.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..3fec467c4a43364843bd26392cf5cc0290f6e6eb --- /dev/null +++ b/wikipedia/mr.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 
+oid sha256:d4e28d84af8bf9d47670ba153d951db6c61af06194485cb4f354ef7cf39d3a0f +size 213245201 diff --git a/wikipedia/mr.sp.model b/wikipedia/mr.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..6a64b4b1f901a95c120e0b04b08be7fa5d7481f9 --- /dev/null +++ b/wikipedia/mr.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b299918fd297d7ee8330bf43dae3d55bdf62c321d43e775f8f111dea325b34ae +size 1282288 diff --git a/wikipedia/mr.sp.vocab b/wikipedia/mr.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..c4d1cbbb3e28a90cc67c69328ab0133c7261c529 --- /dev/null +++ b/wikipedia/mr.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a056b935c9f7976b2f74617795d8e6d01f2d6392569e7f2292b3ebd23c12bb47 +size 1078130 diff --git a/wikipedia/pt.arpa.bin b/wikipedia/pt.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..1ed3b02dab66efb87a4da18948d7ec7f4a5ffa90 --- /dev/null +++ b/wikipedia/pt.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ad7241c4b11d902fa092506b731f61e5f67177897c2598b750d1a2e519be87ad +size 3220168756 diff --git a/wikipedia/pt.sp.model b/wikipedia/pt.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..3c2ab113c5644ebf7b1d8d23790b90b16c964d75 --- /dev/null +++ b/wikipedia/pt.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1707a7517b61ca9d4d333dabcc5ec7024e44c6466ff6faea9ccc95a0f1b2737c +size 958101 diff --git a/wikipedia/pt.sp.vocab b/wikipedia/pt.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..9b660da272d1f32f7103415105640eff7d056c4a --- /dev/null +++ b/wikipedia/pt.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0f40ce4ecfe48b4a79f9364eb9e1de08b31e2ef3e7a0175c6b9fb89db8615e31 +size 755542 diff --git a/wikipedia/sw.arpa.bin b/wikipedia/sw.arpa.bin new file mode 100644 index 
0000000000000000000000000000000000000000..c0b3fe775babd1093135f578f5f44861e3e14489 --- /dev/null +++ b/wikipedia/sw.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5245f3df623fad2d6299b3f825b0b4c95f4ca74048827b532991a2381cbafb53 +size 145248567 diff --git a/wikipedia/sw.sp.model b/wikipedia/sw.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..95ce77e713121b01bace14cae8d57af7480825e5 --- /dev/null +++ b/wikipedia/sw.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e99a6ea3302bcf847b33fce951f9a619f48035bf4c0322be8f39d57d22beadfb +size 918517 diff --git a/wikipedia/sw.sp.vocab b/wikipedia/sw.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..1cccb7b576ad2760cdd6cee9f66727c6e68ef0a9 --- /dev/null +++ b/wikipedia/sw.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6025606bcb2bd8ebf37f7184e0f1f7f6fbc57544d2b10332d8c26ab9fe28904f +size 716917 diff --git a/wikipedia/ta.arpa.bin b/wikipedia/ta.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..4506d74cdf26d4d66312ca4cf38424a158190f7a --- /dev/null +++ b/wikipedia/ta.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1e749160961748c6db53ff28af3e5effcb7a6f5e8aa9d36b97761dee0a2bb54f +size 646855589 diff --git a/wikipedia/ta.sp.model b/wikipedia/ta.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..9a5cb673c658049776b2ba187268dc59141b2a4f --- /dev/null +++ b/wikipedia/ta.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d98497fc2ddabad9df591ac4ace4fbed5e2c39c68760020c2e1ffa151bad6cea +size 1493561 diff --git a/wikipedia/ta.sp.vocab b/wikipedia/ta.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..1a9ba740252a62c47486151e59cee30b18fccadd --- /dev/null +++ b/wikipedia/ta.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 
+oid sha256:87ede7e88cad486355b7058a6f95173df825cbeadf3f88fae8048de1b5562fe2 +size 1291326 diff --git a/wikipedia/te.arpa.bin b/wikipedia/te.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..7f69c1a8e7c00397f8c26b54b23f420bc5058f76 --- /dev/null +++ b/wikipedia/te.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:56f58f1871ca160d51ef1a6384811d52d84de66ac257cec552981cbf01387df3 +size 243049387 diff --git a/wikipedia/te.sp.model b/wikipedia/te.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..016ee21efdd2689fb44eac5c0f41542be1527395 --- /dev/null +++ b/wikipedia/te.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:423cb3a352e79011c29f050706fa5ef89291245537dbadb8e669e1dd46de6477 +size 1461816 diff --git a/wikipedia/te.sp.vocab b/wikipedia/te.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..e5a352596d5b7c4e07cfefd2666124e2d1f961b2 --- /dev/null +++ b/wikipedia/te.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8b1f23a34067398e923c0c3da9143920ba7ad26d0176050345ddcb57f49def05 +size 1260131 diff --git a/wikipedia/ur.arpa.bin b/wikipedia/ur.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..fac408af367f11efad8728e5aaa93db496960201 --- /dev/null +++ b/wikipedia/ur.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:cda435e6d899daa71aba3cffccd0558ef9eb8a00f2b9ae8ba7b69326dc535511 +size 396138774 diff --git a/wikipedia/ur.sp.model b/wikipedia/ur.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..577aaf5a3930a6f95f17988c266de438f850aced --- /dev/null +++ b/wikipedia/ur.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:97b15e17d55fa19c6254bf3955744bfb3e19084a603ecddf1fb405f72d2f93e1 +size 1001211 diff --git a/wikipedia/ur.sp.vocab b/wikipedia/ur.sp.vocab new file mode 100644 index 
0000000000000000000000000000000000000000..972bd242f1f4a137cfa2a4838abb915fb152175e --- /dev/null +++ b/wikipedia/ur.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:921bb69dc9d7c692a738cb08fef503ff6a10423741cce5345a95f3fe82516107 +size 799327 diff --git a/wikipedia/vi.arpa.bin b/wikipedia/vi.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..16f13009f85108afd3906c8d6e65343855c6e0ed --- /dev/null +++ b/wikipedia/vi.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:983460dc00aaaec7325139cd87e89e937fcf5ac0cba4b16f23241fcc52d3c0ca +size 1414396214 diff --git a/wikipedia/vi.sp.model b/wikipedia/vi.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..34212b522db66b664b21513fbc9cbbff2f80a3ae --- /dev/null +++ b/wikipedia/vi.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b1393f7ca703337a5b94f86ddb8e17e3171fc1ca388ca035942f594e0f0d958d +size 906762 diff --git a/wikipedia/vi.sp.vocab b/wikipedia/vi.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..f19d062ed46b42bef36d62e36364c74d10ffb904 --- /dev/null +++ b/wikipedia/vi.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:792406a617a2c2ca2ba247fbc413266014fa22c344a08f379af817e9ce05e340 +size 704830 diff --git a/wikipedia/yo.arpa.bin b/wikipedia/yo.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..c05bce01b28f1878d4de44bb3acc0af453782b9a --- /dev/null +++ b/wikipedia/yo.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f270e49fd8ac5b84c8bfbbec7a76d8dd37bd24e655ee0e24f055bb20dc93a4b4 +size 42746747 diff --git a/wikipedia/yo.sp.model b/wikipedia/yo.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..3ba785cab7277d4d8f00498f2ea8ea42d0c01451 --- /dev/null +++ b/wikipedia/yo.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 
+oid sha256:4ead51e5cad5bd7d27b7292d86054a85ec26d36172d3109bd4177176ac768ce4 +size 872629 diff --git a/wikipedia/yo.sp.vocab b/wikipedia/yo.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..67f5d24bd7678e8090a8b0bf428f4a3b1c6636b8 --- /dev/null +++ b/wikipedia/yo.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:204c6cde6e263592a8c10d76ef4c912e2bf830887bd0d4a66828a78245591e41 +size 667494 diff --git a/wikipedia/zh.arpa.bin b/wikipedia/zh.arpa.bin new file mode 100644 index 0000000000000000000000000000000000000000..2766dcde40e268c30e301b11067de07b10108fff --- /dev/null +++ b/wikipedia/zh.arpa.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:240f156d70a4b04cb078b4f127ae0103378454143a77442c18e5e24b93404e56 +size 3635106545 diff --git a/wikipedia/zh.sp.model b/wikipedia/zh.sp.model new file mode 100644 index 0000000000000000000000000000000000000000..1bd32c5fd51660d7f14be1e636862b1a9d197e3b --- /dev/null +++ b/wikipedia/zh.sp.model @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ff2189b2cc84a513a76d24f9a0154e52f0afaf3010dc5fd1034ed37c9d2b5970 +size 876286 diff --git a/wikipedia/zh.sp.vocab b/wikipedia/zh.sp.vocab new file mode 100644 index 0000000000000000000000000000000000000000..76ae3adfb6b8b530c87a8a82a418eeed74eaef4c --- /dev/null +++ b/wikipedia/zh.sp.vocab @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:48b2823a974418c5ce8bbd7da987926a87d67af2891afbace68779732698a5a7 +size 674108