Do you have the tokenised file for that, and the PyLaia config files? Thanks
I might do, because I trained it some time ago.
Thank you for your reply. I have now successfully trained on the KHATT data and got an acceptable CER, but the LM is not improving the results. What tool are you using to create the n-gram LM? KenLM? Another question: how do you prepare the data for building the n-gram? Do you use <s> and </s> to indicate the start and end of a sequence? I suspect I am not creating the correct n-gram LM.
I use KenLM.
For the ARPA plain-text model: kenlm/build/bin/lmplz --order 6 --text corpus_characters.txt --arpa language_model.arpa --discount_fallback
For the binary (faster) model: kenlm/build/bin/build_binary language_model.arpa language_model.binary
(venv) incognito@DESKTOP-NHKR7QL:~ $ head /home/incognito/pylaia-dev/KHATT_v1.0_dataset/corpus_characters.txt
ر ف ا ظ <space> ق ي ا ر <space> ي ؤ ل <space> ن ب <space> ف و ؤ ر <space> ه ب ح ص ب <space> م ا غ ر ض <space> ر ف ظ م <space> ح و ن <space> ب ه ذ
ل ف ا و ق <space> ت أ د ب <space> . <space> ج ح ل ل <space> ف ي ف ع <space> ن ز ا خ <space> ل ا ل ه و <space> ط و ع ط ع
ا ن ي ع س و <space> ا ن ف ط <space> ا ن ل و ص و <space> د ن ع <space> . <space> ي ب ل ي <space> ج ا ح <space> ر ث ا <space> ج ا ح <space> ج ي ج ح ل ا
ت ا م ل ك ب <space> م ئ ا ن <space> و ه و <space> م ل ك ت ي <space> ه م ي خ ل ا <space> ي ف <space> ي ر ا ج <space> ن ا ك <space> . خ ي ش <space> ع م
. ك ت م ز ل <space> ط ب ا ظ ل ا <space> ه ل <space> س ل ـ غ ب <space> ض ق ن ا <space> ل ث م <space> ا ه م ه ف ا <space> ا ل
ش <space> س <space> ب <space> ، ض <space> خ <space> ث <space> ، <space> ك <space> ع <space> ظ <space> ا ن ب ا ح ص ا <space> غ ل ب <space> ل ه <space> ح ج ا ر <space> ت ل أ س
: ص ن ل ا <space> ا ذ ه ل <space> ه ي ل ا ت ل ا <space> ه د ئ ا ف <space> م ل ع ت <space> ل ه <space> . ج ح ل ا <space> ي ف <space> ا ن ن ا <space> ـ ه <space> غ <space> ص
ر س ن <space> ، <space> ث ب <space> ، <space> ء ا ن <space> ، ط ي غ <space> ، ق ا ر د <space> ، ش م ش م
غ ل ب <space> ل ه <space> ح ج ا ر <space> ت ل أ س <space> . ك ت م ز ل <space> ط ب ا ض ل ا <space> ه ل <space> س ل ـ غ ب <space> ض ق ن ا <space> ل ث م <space> ا ه م ه ف أ <space> ا ل <space> ت ا م ل ك ب <space> م ئ ا ن <space> و ه و <space> م ل ك ت ي
ة ي ل ا ت ل ا <space> ت ا م ل ك ل ا <space> ة د ئ ا ف <space> م ل ع ت <space> ل ه <space> ج ح ل ا <space> ي ف <space> ا ن ن أ <space> ـ ه <space> غ <space> ص <space> ، ش <space> س <space> ب <space> ، ض <space> خ <space> ث <space> ، <space> ك <space> ع <space> ظ <space> ا ن ب ا ح ص أ
NB: The corpus_characters.txt file in this example is in LTR order, from before PyLaia had implemented the RTL switch! On the latest PyLaia it needs to be in normal RTL order.
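For reference, here is a minimal Python sketch of how such a character-level corpus can be produced from plain transcript lines, keeping the text in its natural (logical RTL) order; the input filename transcripts.txt is an assumption, and the <space> token simply mirrors the format shown above.

# Minimal sketch (assumed input file transcripts.txt, one transcript per line):
# every character becomes a token and literal spaces become the <space> token.
with open("transcripts.txt", encoding="utf-8") as fin, \
     open("corpus_characters.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        tokens = ["<space>" if ch == " " else ch for ch in line]
        fout.write(" ".join(tokens) + "\n")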
I am using the latest PyLaia with the RTL switch, but the predicted text I am getting is reversed, and I don't know why (which may contribute to the LM not improving the results). When you used the LTR text, how did you process the images? Did you use them as they are, or mirror them?
Train the images as they are, don't flip them! Just use the RTL switch. When predicting, use the RTL switch again. If you run the prediction in a Linux terminal you'll see the output as LTR, but if you copy and paste it into a UTF-8 capable editor you'll see it in RTL (it is RTL, just displayed incorrectly by the terminal).
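A quick way to convince yourself that only the display differs, not the stored text, is to inspect the codepoints directly; the short sample string below is just an illustration and not taken from the dataset.

# Illustration: the stored codepoint order of an RTL string never changes;
# a non-bidi terminal and a bidi-aware editor only render it differently.
s = "ذهب نوح"  # logical (reading) order
print(s)  # may appear left-to-right in a plain terminal
print([f"U+{ord(c):04X}" for c in s])  # the underlying order is the same either way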
I am doing that; the only problem now is that the LM doesn't improve the results, it actually makes them worse. So how do we deal with the RTL properties when creating the LM?
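One way to sanity-check this is to score the same character-tokenised line in both reading orders with the kenlm Python bindings; if the reversed order scores much better, the LM was built in the opposite reading order from what the decoder emits. This is only a diagnostic sketch: it assumes the kenlm Python module is installed and that language_model.binary is the model built with build_binary above.

# Diagnostic sketch (assumes the kenlm Python bindings and the binary LM built above).
import kenlm

model = kenlm.Model("language_model.binary")

# Take one line from the training corpus and reverse its token order.
with open("corpus_characters.txt", encoding="utf-8") as f:
    line = f.readline().strip()
reversed_line = " ".join(reversed(line.split()))

# If the reversed order scores noticeably better, the LM's reading order
# does not match the order of the decoder's output.
print("as built :", model.score(line, bos=True, eos=True))
print("reversed :", model.score(reversed_line, bos=True, eos=True))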
I'm not working with Teklia, nor am I affiliated with them:
https://huggingface.co/Teklia
https://support.teklia.com/