Khmer mT5 Summarization Model (1024 Tokens) - V2

Introduction

This repository contains an improved version of the Khmer mT5 summarization model, songhieng/khmer-mt5-summarization-1024tk-V2. It was fine-tuned on an expanded dataset, including data from kimleang123/rfi_news, which improves summarization quality on Khmer text.

Model Details

  • Base Model: google/mt5-small
  • Fine-tuned for: Khmer text summarization with extended input length
  • Training Dataset: kimleang123/rfi_news + previous dataset
  • Framework: Hugging Face transformers
  • Task Type: Sequence-to-Sequence (Seq2Seq)
  • Input: Khmer text (articles, paragraphs, or documents) up to 1024 tokens (see the token-length check after this list)
  • Output: Summarized Khmer text
  • Training Hardware: GPU (Tesla T4)
  • Evaluation Metric: ROUGE Score
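
Because anything beyond 1024 input tokens is truncated, it can help to check a document's token count before summarizing. A minimal sketch (the tokenizer is the same one loaded in the setup section below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")

text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
num_tokens = len(tokenizer(text)["input_ids"])
print(f"{num_tokens} tokens (inputs longer than 1024 tokens will be truncated)")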

Installation & Setup

1️⃣ Install Dependencies

Ensure you have transformers, torch, and datasets installed:

pip install transformers torch datasets
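
Depending on which tokenizer files are published in the repository, loading mT5's SentencePiece tokenizer may additionally require the sentencepiece and protobuf packages; if AutoTokenizer raises an error mentioning them, install:

pip install sentencepiece protobuf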

2️⃣ Load the Model

To load and use the fine-tuned model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization-1024tk-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
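
Inference runs fine on CPU, but since the model was trained on a Tesla T4 you may want to use a GPU when one is available. A small optional sketch; if you move the model, the tokenized inputs in the examples below must be moved to the same device (inputs = inputs.to(device)) before calling generate:

import torch

# Optional: use a GPU when available; CPU also works
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()  # disable dropout for inference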

How to Use

1️⃣ Using Python Code

def summarize_khmer(text, max_length=150):
    # Prepend the task prefix the model expects
    input_text = f"summarize: {text}"
    # Truncate to the model's 1024-token input limit
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
    # Generate the summary with beam search (num_beams=5)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)

2️⃣ Using Hugging Face Pipeline

from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk-V2")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])

3️⃣ Deploy as an API using FastAPI

from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and tokenizer once at startup
model_name = "songhieng/khmer-mt5-summarization-1024tk-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):
    # `text` is received as a query parameter, e.g. POST /summarize/?text=...
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload
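
Once the server is running (for example, saved as app.py and started with uvicorn app:app --reload, which serves on port 8000 by default), the endpoint can be called from Python; a small sketch using requests:

import requests

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
# The endpoint reads `text` from the query string
response = requests.post("http://127.0.0.1:8000/summarize/", params={"text": khmer_text})
print(response.json()["summary"])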

Model Evaluation

The model was evaluated using ROUGE scores, which measure the similarity between the generated summaries and the reference summaries.

# datasets.load_metric has been removed from recent versions of the datasets
# library; the evaluate library provides the same ROUGE metric
# (pip install evaluate rouge_score)
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # If padding positions in the labels were masked with -100, restore the
    # pad token id so they can be decoded
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` refers to the Seq2SeqTrainer used during fine-tuning,
# with compute_metrics=compute_metrics passed in
trainer.evaluate()
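
For a quick standalone check without the original Seq2SeqTrainer, the published model can be scored directly on a few article/summary pairs. A minimal sketch reusing summarize_khmer from above; the pair below is illustrative only, not taken from the training data:

import evaluate

rouge = evaluate.load("rouge")

# Illustrative article/reference pair; substitute your own evaluation data
articles = ["កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"]
references = ["កម្ពុជាជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍ មានប្រជាជនប្រមាណ ១៦ លាននាក់។"]

predictions = [summarize_khmer(article) for article in articles]
print(rouge.compute(predictions=predictions, references=references))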

Saving & Uploading the Model

After fine-tuning, the model can be uploaded to the Hugging Face Hub:

model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")

To download it later:

model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")

Summary

  • Base Model: google/mt5-small
  • Task: Summarization
  • Language: Khmer (ខ្មែរ)
  • Dataset: kimleang123/rfi_news + previous dataset
  • Framework: Hugging Face Transformers
  • Evaluation Metric: ROUGE Score
  • Deployment: Hugging Face Model Hub, API (FastAPI), Python Code

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests if you have any improvements or suggestions.

Contact

If you have any questions, feel free to reach out via Hugging Face Discussions or create an issue in the repository.

Built for the Khmer NLP Community
