Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)

This model is an enhanced fine-tuned version of Helsinki-NLP/opus-mt-id-en with domain-specific adaptation for meeting and business contexts.

🎯 Model Highlights

  • Domain Adaptation: Specialized for meeting and business translation
  • Enhanced Dataset: TED2020 plus more than 2,000 meeting-specific sentence pairs
  • Improved Performance: BLEU of 3.736 versus 1.467 for the base model on meeting contexts
  • Robust Training: Uses 80% of the cleaned TED2020 data, mixed with meeting-domain data
  • Production Ready: Optimized for real-world meeting scenarios

📊 Performance Metrics

| Metric | Base Model | This Model | Improvement |
| --- | --- | --- | --- |
| BLEU Score | 1.467 | 3.736 | +154.6% |
| Translation Speed | 1.2 s | 0.14 s | -88.2% |
| Meeting Context | Standard | Enhanced | Domain Adapted |
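
The Improvement column is a relative change against the base model. The sketch below shows that arithmetic on the rounded figures from the table above; the helper name `relative_change` is ours, not part of the model's tooling. With rounded inputs it yields roughly +154.7% and -88.3%, so the last digit can differ from the table, which was presumably computed from unrounded measurements (an assumption).

```python
def relative_change(base: float, new: float) -> float:
    """Return the change from `base` to `new` as a percentage of `base`."""
    if base == 0:
        raise ValueError("base must be non-zero")
    return (new - base) / base * 100.0


# Rounded values copied from the metrics table.
bleu_gain = relative_change(1.467, 3.736)
latency_change = relative_change(1.2, 0.14)

print(f"BLEU: {bleu_gain:+.1f}%")
print(f"Time per sentence: {latency_change:+.1f}%")
```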

🚀 Model Details

  • Base Model: Helsinki-NLP/opus-mt-id-en
  • Training Dataset: TED2020 (80%) + Meeting Domain (10%)
  • Training Strategy: Domain adaptation with enhanced learning
  • Specialization: Business meetings, technical discussions, formal conversations
  • Training Date: 2025-05-27
  • Languages: Indonesian (id) → English (en)
  • License: Apache 2.0

🛠️ Usage

from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "dhintech/marian-ted2020-id-en-lg"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate Indonesian to English
def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=3,
        early_stopping=True,
        do_sample=False
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
indonesian_text = "Tim marketing akan bertanggung jawab untuk strategi ini."
english_translation = translate(indonesian_text)
print(english_translation)
# Output: "The marketing team will be responsible for this strategy."
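
For meeting notes with many sentences, it may be more efficient to translate several sentences per `generate` call. The following is a minimal sketch, assuming `tokenizer` and `model` have been loaded as in the snippet above; `chunks`, `translate_batch`, and the default `batch_size=8` are illustrative choices, not part of the released model.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(texts: Sequence[str], tokenizer, model,
                    batch_size: int = 8, max_length: int = 128) -> List[str]:
    """Translate `texts` in batches, reusing the generation settings shown above."""
    translations: List[str] = []
    for batch in chunks(texts, batch_size):
        # Pad to the longest sentence in the batch; truncate at max_length tokens.
        inputs = tokenizer(list(batch), return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        outputs = model.generate(**inputs, max_length=max_length, num_beams=3,
                                 early_stopping=True, do_sample=False)
        # batch_decode returns one string per generated sequence, in input order.
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

A call such as `translate_batch(["Selamat pagi semuanya.", "Rapat sudah selesai."], tokenizer, model)` returns a list of English strings in the same order as the inputs. Inputs longer than 128 tokens are still truncated, so long paragraphs are better split into sentences first.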

📝 Example Translations

Meeting Context Examples

| Indonesian | English | Context |
| --- | --- | --- |
| Selamat pagi semuanya, mari kita mulai rapat hari ini. | Good morning everyone, let's start today's meeting. | Meeting Opening |
| Tim marketing akan bertanggung jawab untuk strategi ini. | The marketing team will be responsible for this strategy. | Task Assignment |
| Database migration sudah selesai dan berjalan dengan lancar. | Database migration is complete and running smoothly. | Technical Update |
| Budget yang disetujui adalah 500 juta rupiah. | The approved budget is 500 million rupiah. | Financial Discussion |

🎯 Intended Use Cases

  • Business Meeting Translation: Real-time translation during meetings
  • Technical Documentation: Translating technical meeting notes
  • Corporate Communication: Formal business correspondence
  • Project Management: Translating project updates and reports
  • Training Materials: Educational and training content translation

📊 Training Configuration

  • Dataset Size: 118,626 sentence pairs
  • TED2020 Data: 80% of cleaned dataset
  • Meeting Domain Data: 10% specialized meeting content
  • Max Sequence Length: 128 tokens
  • Training Epochs: 12
  • Learning Rate: 1e-05
  • Batch Size: 12 (effective)
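
To get a feel for the scale of this configuration, the sketch below estimates the number of optimizer steps. It assumes that all 118,626 pairs are used for training (no held-out split) and that the final partial batch is kept; neither is stated above, and `optimizer_steps` is a hypothetical helper.

```python
import math
from typing import Tuple


def optimizer_steps(num_examples: int, effective_batch_size: int,
                    epochs: int) -> Tuple[int, int]:
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Values from the training configuration: 118,626 pairs, batch size 12, 12 epochs.
steps_per_epoch, total_steps = optimizer_steps(118_626, 12, 12)
print(steps_per_epoch, total_steps)
```

Under these assumptions one epoch is 9,886 steps and the full run is 118,632 steps.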

🔧 Technical Specifications

  • Model Architecture: MarianMT (Transformer-based)
  • Parameters: ~74M (with selective fine-tuning)
  • Max Input/Output Length: 128 tokens
  • Inference Time: ~0.14s per sentence
  • Memory Requirements:
    • GPU: 3GB VRAM minimum
    • CPU: 4GB RAM minimum
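
As a rough cross-check of the memory figures, the weights alone can be estimated from the parameter count. This sketch assumes about 74M parameters stored as 32-bit floats (4 bytes each); the storage precision is an assumption, and `weight_memory_mib` is an illustrative helper.

```python
def weight_memory_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Estimate the memory taken by model weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


fp32_mib = weight_memory_mib(74_000_000)      # assumed float32 storage
fp16_mib = weight_memory_mib(74_000_000, 2)   # hypothetical half-precision copy

print(f"float32 weights: {fp32_mib:.0f} MiB")
print(f"float16 weights: {fp16_mib:.0f} MiB")
```

The float32 estimate comes to roughly 282 MiB, well under the stated minimums. The remaining headroom presumably covers activations, beam search state, and framework overhead.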

🚨 Limitations

  • Domain Specificity: Optimized for formal business/meeting contexts
  • Informal Language: May not perform optimally on very casual Indonesian
  • Regional Dialects: Trained primarily on standard Indonesian
  • Cultural Context: Some cultural nuances may be lost in translation

📚 Citation

@misc{enhanced-marian-id-en-2025,
  title={Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)},
  author={DhinTech},
  year={2025},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/dhintech/marian-id-en-enhanced}},
  note={Enhanced with TED2020 and meeting-specific domain adaptation}
}

🙏 Acknowledgments

  • Base Model: Helsinki-NLP team for the original opus-mt-id-en model
  • Dataset: TED2020 corpus and custom meeting domain data
  • Framework: Hugging Face Transformers team

This model is specifically enhanced for Indonesian business meeting translation scenarios with domain adaptation techniques.
