Do you have the tokenised file for that, and the PyLaia config files? Thanks
I might do, because I trained it some time ago.
Thank you for your reply. I have now successfully trained on the KHATT data and got an acceptable CER, but the LM is not improving the results. What tool are you using to create the n-gram LM? KenLM? Another question: how do you prepare the data for building the n-gram? Do you use <s> and </s> to indicate the start and end of a sequence? I suspect I am not creating the correct n-gram LM.
I use KenLM.
For the ARPA plain-text model: kenlm/build/bin/lmplz --order 6 --text corpus_characters.txt --arpa language_model.arpa --discount_fallback
For the binary (faster) model: kenlm/build/bin/build_binary language_model.arpa language_model.binary
(venv) incognito@DESKTOP-NHKR7QL:~ $ head /home/incognito/pylaia-dev/KHATT_v1.0_dataset/corpus_characters.txt
ر ف ا ظ <space> ق ي ا ر <space> ي ؤ ل <space> ن ب <space> ف و ؤ ر <space> ه ب ح ص ب <space> م ا غ ر ض <space> ر ف ظ م <space> ح و ن <space> ب ه ذ
ل ف ا و ق <space> ت أ د ب <space> . <space> ج ح ل ل <space> ف ي ف ع <space> ن ز ا خ <space> ل ا ل ه و <space> ط و ع ط ع
ا ن ي ع س و <space> ا ن ف ط <space> ا ن ل و ص و <space> د ن ع <space> . <space> ي ب ل ي <space> ج ا ح <space> ر ث ا <space> ج ا ح <space> ج ي ج ح ل ا
ت ا م ل ك ب <space> م ئ ا ن <space> و ه و <space> م ل ك ت ي <space> ه م ي خ ل ا <space> ي ف <space> ي ر ا ج <space> ن ا ك <space> . خ ي ش <space> ع م
. ك ت م ز ل <space> ط ب ا ظ ل ا <space> ه ل <space> س ل ـ غ ب <space> ض ق ن ا <space> ل ث م <space> ا ه م ه ف ا <space> ا ل
ش <space> س <space> ب <space> ، ض <space> خ <space> ث <space> ، <space> ك <space> ع <space> ظ <space> ا ن ب ا ح ص ا <space> غ ل ب <space> ل ه <space> ح ج ا ر <space> ت ل أ س
: ص ن ل ا <space> ا ذ ه ل <space> ه ي ل ا ت ل ا <space> ه د ئ ا ف <space> م ل ع ت <space> ل ه <space> . ج ح ل ا <space> ي ف <space> ا ن ن ا <space> ـ ه <space> غ <space> ص
ر س ن <space> ، <space> ث ب <space> ، <space> ء ا ن <space> ، ط ي غ <space> ، ق ا ر د <space> ، ش م ش م
غ ل ب <space> ل ه <space> ح ج ا ر <space> ت ل أ س <space> . ك ت م ز ل <space> ط ب ا ض ل ا <space> ه ل <space> س ل ـ غ ب <space> ض ق ن ا <space> ل ث م <space> ا ه م ه ف أ <space> ا ل <space> ت ا م ل ك ب <space> م ئ ا ن <space> و ه و <space> م ل ك ت ي
ة ي ل ا ت ل ا <space> ت ا م ل ك ل ا <space> ة د ئ ا ف <space> م ل ع ت <space> ل ه <space> ج ح ل ا <space> ي ف <space> ا ن ن أ <space> ـ ه <space> غ <space> ص <space> ، ش <space> س <space> ب <space> ، ض <space> خ <space> ث <space> ، <space> ك <space> ع <space> ظ <space> ا ن ب ا ح ص أ
NB: The corpus_characters.txt file in this example is in LTR order, from before PyLaia had implemented the RTL switch! On the latest PyLaia it needs to be in normal RTL order.
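For reference, here is a minimal Python sketch of how such a character-level corpus can be produced from plain transcript lines, keeping the text in its natural (logical RTL) order; the input filename transcripts.txt is an assumption, and the <space> token simply mirrors the format shown above.

# Minimal sketch (assumed input file transcripts.txt, one transcript per line):
# every character becomes a token and literal spaces become the <space> token.
with open("transcripts.txt", encoding="utf-8") as fin, \
     open("corpus_characters.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        tokens = ["<space>" if ch == " " else ch for ch in line]
        fout.write(" ".join(tokens) + "\n")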
I am using the latest PyLaia with the RTL switch, but the predicted text I am getting is reversed, and I don't know why (which may contribute to the LM not improving the results). When you used the LTR text, how did you process the images? Did you use them as they are, or mirror them?
Train the images as they are, don't flip them! Just use the RTL switch. When predicting, use the RTL switch again. If you run the prediction in a Linux terminal you'll see the output as LTR, but if you copy and paste it into a UTF-8 capable editor you'll see it in RTL (it is RTL, just displayed incorrectly by the terminal).
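A quick way to convince yourself that only the display differs, not the stored text, is to inspect the codepoints directly; the short sample string below is just an illustration and not taken from the dataset.

# Illustration: the stored codepoint order of an RTL string never changes;
# a non-bidi terminal and a bidi-aware editor only render it differently.
s = "ذهب نوح"  # logical (reading) order
print(s)  # may appear left-to-right in a plain terminal
print([f"U+{ord(c):04X}" for c in s])  # the underlying order is the same either way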
I am doing that; the only problem now is that the LM doesn't improve the results, it actually makes them worse. So how do we deal with the RTL properties when creating the LM?
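One way to sanity-check this is to score the same character-tokenised line in both reading orders with the kenlm Python bindings; if the reversed order scores much better, the LM was built in the opposite reading order from what the decoder emits. This is only a diagnostic sketch: it assumes the kenlm Python module is installed and that language_model.binary is the model built with build_binary above.

# Diagnostic sketch (assumes the kenlm Python bindings and the binary LM built above).
import kenlm

model = kenlm.Model("language_model.binary")

# Take one line from the training corpus and reverse its token order.
with open("corpus_characters.txt", encoding="utf-8") as f:
    line = f.readline().strip()
reversed_line = " ".join(reversed(line.split()))

# If the reversed order scores noticeably better, the LM's reading order
# does not match the order of the decoder's output.
print("as built :", model.score(line, bos=True, eos=True))
print("reversed :", model.score(reversed_line, bos=True, eos=True))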
I'm not working with Teklia, nor am I affiliated with them:
https://huggingface.co/Teklia
https://support.teklia.com/