---
language: dv
---
|
|
|
# dv-labse |
|
|
|
This is an experiment in cross-lingual transfer learning: inserting Dhivehi word and word-piece tokens into Google's LaBSE model.
|
|
|
- Original model weights: https://huggingface.co/setu4993/LaBSE |
|
- Original model announcement: https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html |
|
|
|
This currently outperforms dv-wave and dv-muril (a similar transfer-learning model) on the Maldivian News Classification task (https://github.com/Sofwath/DhivehiDatasets):
|
|
|
- mBERT: 52% |
|
- dv-wave (ELECTRA): 89% |
|
- dv-muril: 90.7% |
|
- dv-labse: 91.5% (may continue training) |
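
The model loads like any BERT checkpoint in Hugging Face `transformers`. A minimal usage sketch; the Hub repo ID below is an assumption, not given in this card, so substitute the actual ID:

```python
# Minimal usage sketch; the repo ID is a placeholder, not confirmed by this card.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/dv-labse")
model = AutoModel.from_pretrained("monsoon-nlp/dv-labse")

# Embed a Dhivehi sentence; LaBSE uses the pooled [CLS] output as the sentence vector.
inputs = tokenizer("ދިވެހިރާއްޖެ", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
sentence_embedding = outputs.pooler_output  # shape: (1, 768)
```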
|
|
|
## Training |
|
|
|
- Start with LaBSE (similar to mBERT), which has no Thaana-script vocabulary

- Based on PanLex dictionaries, attach 1,100 Dhivehi words to the embeddings of their Sinhalese or English translations (first sketch below)

- Add the remaining words and word-pieces from dv-wave's vocabulary to vocab.txt

- Continue BERT pretraining on Dhivehi text (second sketch, at the end of this section)
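
The token-attachment step might look like the following sketch. It assumes the PanLex lookups have already been reduced to a simple dict from each Dhivehi word to an anchor word in LaBSE's existing vocabulary; the dict entry and save path are placeholders, and the notebook may edit vocab.txt directly rather than using `add_tokens`:

```python
# Sketch only: extend LaBSE's vocabulary with Dhivehi tokens and initialize
# their embeddings from Sinhalese/English anchor words (via PanLex lookups).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("setu4993/LaBSE")
model = AutoModel.from_pretrained("setu4993/LaBSE")

# Placeholder mapping; the real data comes from PanLex dictionaries.
panlex_pairs = {"ދިވެހި": "Dhivehi"}

# Grow the tokenizer and the embedding matrix to fit the new tokens.
tokenizer.add_tokens(list(panlex_pairs))
model.resize_token_embeddings(len(tokenizer))

# Copy each anchor word's embedding onto its new Dhivehi token, averaging
# over subword pieces when the anchor splits into more than one.
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for dv_word, anchor in panlex_pairs.items():
        dv_id = tokenizer.convert_tokens_to_ids(dv_word)
        anchor_ids = tokenizer(anchor, add_special_tokens=False)["input_ids"]
        embeddings[dv_id] = embeddings[anchor_ids].mean(dim=0)

model.save_pretrained("labse-dv-init")
tokenizer.save_pretrained("labse-dv-init")
```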
|
|
|
Colab notebook: https://colab.research.google.com/drive/1CUn44M2fb4Qbat2pAvjYqsPvWLt1Novi
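
The continued-pretraining step is, roughly, the standard masked-language-modeling recipe. A hedged sketch, where the corpus path and hyperparameters are placeholders rather than the notebook's actual settings:

```python
# Sketch only: continue BERT pretraining (masked language modeling) on Dhivehi
# text, starting from the vocabulary-extended checkpoint saved above.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("labse-dv-init")
# Note: if the starting checkpoint has no MLM head, one is freshly initialized here.
model = AutoModelForMaskedLM.from_pretrained("labse-dv-init")

dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="dv_corpus.txt", block_size=128
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dv-labse",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("dv-labse")
```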
|
|