
DictaBERT-splinter: Splintering Nonconcatenative Languages for Better Tokenization

DictaBERT-splinter is a BERT-style language model for Hebrew, introduced in the paper Splintering Nonconcatenative Languages for Better Tokenization (https://arxiv.org/abs/2503.14433).

This is the base model pretrained with the masked-language-modeling objective.

Sample usage:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-splinter', trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained('dicta-il/dictabert-splinter')

model.eval()

sentence = '讘砖谞转 1948 讛砖诇讬诐 讗驻专讬诐 拽讬砖讜谉 讗转 [MASK] 讘驻讬住讜诇 诪转讻转 讜讘转讜诇讚讜转 讛讗诪谞讜转 讜讛讞诇 诇驻专住诐 诪讗诪专讬诐 讛讜诪讜专讬住讟讬讬诐'

output = model(tokenizer.encode(sentence, return_tensors='pt'))

# the [MASK] sits at position 7 of the encoded sequence ([CLS] is position 0)
top_2 = torch.topk(output.logits[0, 7, :], 2).indices
print('\n'.join(tokenizer.batch_decode(top_2)))  # should print 讛转诪讞讜转讜 / 诇讬诪讜讚讬讜
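For quick experiments, the same prediction can also be obtained through the transformers fill-mask pipeline. The following is a minimal sketch, not taken from the model card; it assumes the model's custom tokenizer (loaded with trust_remote_code=True) is compatible with the standard pipeline API:

from transformers import pipeline

# sketch only: assumes the custom tokenizer works with the standard fill-mask pipeline
fill_mask = pipeline('fill-mask', model='dicta-il/dictabert-splinter', trust_remote_code=True)

sentence = '讘砖谞转 1948 讛砖诇讬诐 讗驻专讬诐 拽讬砖讜谉 讗转 [MASK] 讘驻讬住讜诇 诪转讻转 讜讘转讜诇讚讜转 讛讗诪谞讜转 讜讛讞诇 诇驻专住诐 诪讗诪专讬诐 讛讜诪讜专讬住讟讬讬诐'
for prediction in fill_mask(sentence, top_k=2):
    print(prediction['token_str'], round(prediction['score'], 3))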

Citation

If you use DictaBERT-splinter in your research, please cite Splintering Nonconcatenative Languages for Better Tokenization:

BibTeX:

@misc{gazit2025splinteringnonconcatenativelanguagesbetter,
      title={Splintering Nonconcatenative Languages for Better Tokenization}, 
      author={Bar Gazit and Shaltiel Shmidman and Avi Shmidman and Yuval Pinter},
      year={2025},
      eprint={2503.14433},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.14433}, 
}

License

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
