PrahokBART (big model)

PrahokBART is a pre-trained sequence-to-sequence model trained from scratch for Khmer on carefully curated Khmer and English corpora. Training took the linguistic characteristics of Khmer into account by incorporating components such as word segmentation and text normalization. The model can be finetuned to build natural language generation applications for Khmer, such as English<->Khmer translation, summarization, and headline generation, and it is more efficient than mBART50. For details, see the paper cited below.

Basic Usage

Preprocessing: Input texts should be normalized (character encodings) and word-segmented. The example below performs only word segmentation and assumes the text has already been normalized; please apply normalization yourself beforehand.

from khmernltk import word_tokenize

def word_segment(text):
    # Separate words with single spaces. An original whitespace character
    # becomes three consecutive spaces after joining, which we mark with "β–‚".
    return " ".join(word_tokenize(text)).replace("   ", " β–‚ ")

def word_unsegment(text):
    # Reverse the segmentation: drop separator spaces and restore "β–‚" to a space.
    return text.replace(" ", "").replace("β–‚", " ")
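
As a quick sanity check, segmentation and unsegmentation should round-trip, assuming khmernltk's word_tokenize preserves every input character and keeps whitespace as its own token (which the three-space replacement above relies on); this snippet is a sketch, not part of the original card:

text = "αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“ αž“αŸ…αž”αŸ’αžšαž‘αŸαžŸαž‡αž”αŸ‰αž»αž“"  # contains a real space, which becomes "β–‚"
assert word_unsegment(word_segment(text)) == text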

Load the tokenizer and model using the Auto classes

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "nict-astrec-att/prahokbart_big"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

I/O format: PrahokBART was trained on corpora formatted as Sentence </s> <2xx> for the input and <2yy> Sentence </s> for the output, where xx and yy are language codes (e.g., km for Khmer, en for English).
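
To make the convention concrete, here is a small helper that builds a tagged training pair (make_pair is illustrative only, not part of the released code):

def make_pair(src, src_lang, tgt, tgt_lang):
    # Input: source sentence, end-of-sentence tag, then the source language tag
    enc_input = f"{src} </s> <2{src_lang}>"
    # Output: target language tag first, then the sentence and end-of-sentence tag
    dec_output = f"<2{tgt_lang}> {tgt} </s>"
    return enc_input, dec_output

make_pair("αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“", "km", "I go to school", "en")
# ('αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“ </s> <2km>', '<2en> I go to school </s>')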

Forward pass

# Encoder input: source sentence, end-of-sentence tag, then source language tag
inp = tokenizer(
    word_segment("αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“ </s> <2km>"),  # "I go to school"
    add_special_tokens=False,  # the tags are supplied explicitly above
    return_tensors="pt",
    padding=True
)

# Decoder target: target language tag first, then sentence and end-of-sentence tag
out = tokenizer(
    "<2en> I go to school </s>",
    add_special_tokens=False,
    return_tensors="pt",
    padding=True
).input_ids

model_output = model(
    input_ids=inp.input_ids,
    attention_mask=inp.attention_mask,
    labels=out,
)  # forward pass

# For loss
model_output.loss  # plain cross-entropy; not label smoothed

# For logits
model_output.logits
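
Translation: the same tagged format works with model.generate. Below is a minimal Khmer->English sketch; the beam size and maximum length are illustrative choices, not the paper's evaluation settings:

# Encode the Khmer source with its language tag
inp = tokenizer(
    word_segment("αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“ </s> <2km>"),
    add_special_tokens=False,
    return_tensors="pt"
)

# Start decoding from the English language tag <2en>
generated = model.generate(
    inp.input_ids,
    attention_mask=inp.attention_mask,
    decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2en>"),
    max_length=32,
    num_beams=4,
)

print(tokenizer.decode(generated[0], skip_special_tokens=True, clean_up_tokenization_spaces=False))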

Mask prediction: Let's ask the model to predict the [MASK] parts of an input sentence.

text = "αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“[MASK] </s> <2km>"  # "I go to school [MASK]"
inp = tokenizer(
    word_segment(text),
    add_special_tokens=False,
    return_tensors="pt"
).input_ids

# Start decoding from the Khmer language tag <2km>
model_output = model.generate(
    inp,
    decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2km>")
)

decoded_output = tokenizer.decode(
    model_output[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(word_unsegment(decoded_output))
# Output: αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“αž“αŸ…αž”αŸ’αžšαž‘αŸαžŸαž‡αž”αŸ‰αž»αž“

αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“αž“αŸ…αž”αŸ’αžšαž‘αŸαžŸαž‡αž”αŸ‰αž»αž“ = I go to school in Japan

Finetuning

Finetuning code is available on GitHub. A minimal sketch of the training loop is shown below.
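
The following is a minimal sketch of one-directional (Khmer->English) finetuning with plain PyTorch, reusing word_segment from above. The pairs list is toy data and the hyperparameters are illustrative assumptions; refer to the GitHub code for the actual recipe:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "nict-astrec-att/prahokbart_big"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Toy parallel data (hypothetical)
pairs = [("αžαŸ’αž‰αž»αŸ†αž‘αŸ…αžŸαžΆαž›αžΆαžšαŸ€αž“", "I go to school")]

model.train()
for km, en in pairs:
    # Build the tagged input/output strings described under "I/O format"
    inp = tokenizer(
        word_segment(km + " </s> <2km>"),
        add_special_tokens=False,
        return_tensors="pt"
    )
    labels = tokenizer(
        "<2en> " + en + " </s>",
        add_special_tokens=False,
        return_tensors="pt"
    ).input_ids
    loss = model(
        input_ids=inp.input_ids,
        attention_mask=inp.attention_mask,
        labels=labels
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()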

Citation

@inproceedings{kaing2025prahokbart,
  title={PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation},
  author={Kaing, Hour and Dabre, Raj and Song, Haiyue and Tran, Van-Hien and Tanaka, Hideki and Utiyama, Masao},
  booktitle={Proceedings of the 31st International Conference on Computational Linguistics},
  pages={1309--1322},
  year={2025}
}