Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)
This model is a fine-tuned version of Helsinki-NLP/opus-mt-id-en, adapted to meeting and business contexts.
Model Highlights
- Domain Adaptation: Specialized for meeting and business translation
- Enhanced Dataset: TED2020 plus more than 2,000 meeting-specific sentence pairs
- Improved Performance: Better BLEU scores on meeting contexts
- Robust Training: Uses 80% of the cleaned TED2020 dataset, mixed with meeting-domain data
- Production Ready: Optimized for real-world meeting scenarios
Performance Metrics
Metric | Base Model | This Model | Improvement |
---|---|---|---|
BLEU Score | 1.467 | 3.736 | +154.6% |
Inference Time (per sentence) | 1.2 s | 0.14 s | -88.2% |
Meeting Context | Standard | Enhanced | Domain Adapted |
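The percentage column is a relative change against the base model. A minimal sketch of that arithmetic, using the rounded values from the table above (the helper name `relative_change` is illustrative and not part of any library). Because the inputs are already rounded, the last digit can differ slightly from the reported figures.

```python
def relative_change(base: float, new: float) -> float:
    """Return the change from base to new as a percentage of base."""
    if base == 0:
        raise ValueError("base must be non-zero")
    return (new - base) / base * 100.0


# Rounded values copied from the table above.
bleu_change = relative_change(1.467, 3.736)
latency_change = relative_change(1.2, 0.14)

print(f"BLEU: {bleu_change:+.1f}%")
print(f"Time per sentence: {latency_change:+.1f}%")
```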
Model Details
- Base Model: Helsinki-NLP/opus-mt-id-en
- Training Dataset: TED2020 (80%) + Meeting Domain (10%)
- Training Strategy: Domain adaptation with enhanced learning
- Specialization: Business meetings, technical discussions, formal conversations
- Training Date: 2025-05-27
- Languages: Indonesian (id) → English (en)
- License: Apache 2.0
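The card does not spell out how the two data sources were combined. One possible reading, shown purely as an illustrative sketch: keep a random 80% of the cleaned general corpus, then append the meeting-domain pairs and shuffle. The function name `mix_corpora`, the `keep_fraction` parameter, the seed, and the toy data are all hypothetical and are not taken from the author's training code.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (Indonesian, English)


def mix_corpora(general: List[Pair], domain: List[Pair],
                keep_fraction: float = 0.8, seed: int = 42) -> List[Pair]:
    """Sample a fraction of the general corpus, add all domain pairs, shuffle."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_keep = int(len(general) * keep_fraction)
    mixed = rng.sample(general, n_keep) + list(domain)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the TED2020 and meeting-domain data (assumed, not real data).
general = [(f"kalimat umum {i}", f"general sentence {i}") for i in range(10)]
domain = [("Mari kita mulai rapat.", "Let's start the meeting.")]

mixed = mix_corpora(general, domain)
print(len(mixed))
```

Fixing the seed keeps the sampled subset reproducible between runs, which makes it easier to compare training runs on the same mix.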
Usage
```python
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "dhintech/marian-ted2020-id-en-lg"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)


# Translate Indonesian to English
def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=3,
        early_stopping=True,
        do_sample=False,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example usage
indonesian_text = "Tim marketing akan bertanggung jawab untuk strategi ini."
english_translation = translate(indonesian_text)
print(english_translation)
# Output: "The marketing team will be responsible for this strategy."
```
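For more than one sentence, the same generation settings can be applied to batches. The sketch below is an extension of the snippet above and not part of the original card: `chunked` and `translate_batch` are hypothetical helper names, `batch_size=8` is an arbitrary default, and the tokenizer and model are passed in as arguments so the helpers can reuse the objects loaded earlier.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_batch(texts: Sequence[str], tokenizer, model,
                    batch_size: int = 8) -> List[str]:
    """Translate texts in batches with the settings used in `translate`."""
    translations: List[str] = []
    for batch in chunked(texts, batch_size):
        # Pad to the longest sentence in the batch; truncate at 128 tokens.
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=128)
        outputs = model.generate(**inputs, max_length=128, num_beams=3,
                                 early_stopping=True, do_sample=False)
        translations.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

With the `tokenizer` and `model` from the snippet above, `translate_batch(sentences, tokenizer, model)` returns one English string per input sentence, in the same order.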
Example Translations
Meeting Context Examples
Indonesian | English | Context |
---|---|---|
Selamat pagi semuanya, mari kita mulai rapat hari ini. | Good morning everyone, let's start today's meeting. | Meeting Opening |
Tim marketing akan bertanggung jawab untuk strategi ini. | The marketing team will be responsible for this strategy. | Task Assignment |
Database migration sudah selesai dan berjalan dengan lancar. | Database migration is complete and running smoothly. | Technical Update |
Budget yang disetujui adalah 500 juta rupiah. | The approved budget is 500 million rupiah. | Financial Discussion |
Intended Use Cases
- Business Meeting Translation: Real-time translation during meetings
- Technical Documentation: Translating technical meeting notes
- Corporate Communication: Formal business correspondence
- Project Management: Translating project updates and reports
- Training Materials: Educational and training content translation
Training Configuration
- Dataset Size: 118,626 sentence pairs
- TED2020 Data: 80% of cleaned dataset
- Meeting Domain Data: 10% specialized meeting content
- Max Sequence Length: 128 tokens
- Training Epochs: 12
- Learning Rate: 1e-05
- Batch Size: 12 (effective)
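These hyperparameters imply a rough number of optimizer steps. A back-of-the-envelope sketch, assuming (the card does not state this) that all 118,626 pairs form the training split and that the effective batch size of 12 already accounts for any gradient accumulation. The `TrainingConfig` dataclass is a hypothetical container, not the author's code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    dataset_size: int = 118_626      # sentence pairs, from the list above
    effective_batch_size: int = 12
    epochs: int = 12
    learning_rate: float = 1e-5
    max_seq_length: int = 128

    @property
    def steps_per_epoch(self) -> int:
        # The last, partially filled batch still costs one step.
        return math.ceil(self.dataset_size / self.effective_batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


config = TrainingConfig()
print(config.steps_per_epoch, config.total_steps)
```

If part of the data was held out for validation, the real step count would be lower than this estimate.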
Technical Specifications
- Model Architecture: MarianMT (Transformer-based)
- Parameters: ~74M (with selective fine-tuning)
- Max Input/Output Length: 128 tokens
- Inference Time: ~0.14s per sentence
- Memory Requirements:
  - GPU: 3 GB VRAM minimum
  - CPU: 4 GB RAM minimum
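Because inputs are truncated at 128 tokens, one option for longer passages such as meeting notes is to split them into sentences before translating. A minimal sketch of such a pre-processing step, assuming a simple punctuation-based splitter is adequate for the text at hand (the `split_sentences` helper is hypothetical, and abbreviations such as "No." would need a more careful rule):

```python
import re
from typing import List

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences on terminal punctuation followed by whitespace."""
    parts = _SENTENCE_END.split(text.strip())
    return [part for part in parts if part]


notes = ("Selamat pagi semuanya, mari kita mulai rapat hari ini. "
         "Budget yang disetujui adalah 500 juta rupiah.")
for sentence in split_sentences(notes):
    print(sentence)
```

Each resulting sentence can then be passed to the `translate` function from the Usage section, and the outputs joined back together.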
Limitations
- Domain Specificity: Optimized for formal business/meeting contexts
- Informal Language: May not perform optimally on very casual Indonesian
- Regional Dialects: Trained primarily on standard Indonesian
- Cultural Context: Some cultural nuances may be lost in translation
Citation
@misc{enhanced-marian-id-en-2025,
  title={Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)},
  author={DhinTech},
  year={2025},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/dhintech/marian-id-en-enhanced}},
  note={Enhanced with TED2020 and meeting-specific domain adaptation}
}
Acknowledgments
- Base Model: Helsinki-NLP team for the original opus-mt-id-en model
- Dataset: TED2020 corpus and custom meeting domain data
- Framework: Hugging Face Transformers team
This model is specifically enhanced for Indonesian business meeting translation scenarios with domain adaptation techniques.