Model Card: Japanese-English Academic Translator [Sentence-Level]
Model Details
Model name: ywc1/marian-finetuned-ja-en
Developed by: Susie Xu and Kenneth Zhang
Languages: Japanese → English
Finetuned from: Helsinki-NLP/opus-mt-ja-en
Architecture: MarianMT (Transformer encoder–decoder)
Model Description
This model is fine-tuned from Helsinki-NLP/opus-mt-ja-en (a MarianMT encoder–decoder) for sentence-level translation of Japanese academic text into English. It is designed to preserve technical vocabulary, proper nouns, and factual accuracy in the scientific domain, and was trained on the ASPEC (Asian Scientific Paper Excerpt Corpus). Because it was trained on single sentences, it may underperform on multi-sentence or paragraph inputs. For longer academic passages, use ywc1/mbart-finetuned-ja-en-para.
Intended Uses & Limitations
Intended uses:
Translating individual sentences from Japanese academic papers to English.
Assisting researchers in quickly understanding scientific literature written in Japanese.
Limitations:
Not optimized for conversational, literary, or informal text.
May produce less fluent results for multi-sentence inputs.
Output fluency can occasionally lag behind factual accuracy, so a translation may be faithful yet read awkwardly.
How to Use
```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and its tokenizer
model = MarianMTModel.from_pretrained("ywc1/marian-finetuned-ja-en")
tokenizer = MarianTokenizer.from_pretrained("ywc1/marian-finetuned-ja-en")

# Translate a single Japanese sentence
text = "DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```
Web interface: Hugging Face Spaces
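Because the model was trained on single sentences, longer passages are best split on sentence boundaries before translation. The following is a minimal sketch of that idea; the translate_paragraph helper and the naive split on 「。」 are illustrative assumptions, not part of the released code:

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "ywc1/marian-finetuned-ja-en"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

def translate_paragraph(paragraph: str) -> str:
    # Naive split on the Japanese full stop; a dedicated sentence
    # segmenter may give better boundaries for real documents.
    sentences = [s + "。" for s in paragraph.split("。") if s.strip()]
    inputs = tokenizer(sentences, return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model.generate(**inputs)
    # Translate each sentence independently and rejoin the results.
    return " ".join(
        tokenizer.decode(o, skip_special_tokens=True) for o in outputs
    )
```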
Training Details
Training Data
Dataset: ASPEC (Asian Scientific Paper Excerpt Corpus) – Japanese-English subset.
Size used: 300,000 sentence pairs (~30% of total dataset).
Domain: Academic papers in science and technology (pre-2010).
Preprocessing:
Removed dataset IDs.
Kept only Japanese-English aligned pairs.
Tokenized source text (Japanese) and target text (English) for encoder-decoder input (see the preprocessing sketch below).
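A minimal sketch of these preprocessing steps, assuming the "|||"-delimited ASPEC text files and the tokenizer of the base checkpoint; the parse_aspec_line helper and the field positions are assumptions and should be adjusted to the actual file layout:

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ja-en")

def parse_aspec_line(line: str):
    # ASPEC lines carry metadata fields (e.g. dataset IDs) before the
    # sentence pair; keep only the last two fields: Japanese and English.
    fields = [f.strip() for f in line.split("|||")]
    return fields[-2], fields[-1]

def preprocess(lines, max_length=128):
    pairs = [parse_aspec_line(l) for l in lines if l.strip()]
    ja = [p[0] for p in pairs]
    en = [p[1] for p in pairs]
    # Tokenize source and target together for the encoder-decoder;
    # the English targets become the labels.
    return tokenizer(ja, text_target=en,
                     max_length=max_length, truncation=True)
```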
Training Procedure
Compute: Google Cloud Vertex AI, NVIDIA L1 GPU (40GB RAM) for most training; occasional NVIDIA T4 on Colab.
Hyperparameters (see the training-arguments sketch below):
Learning Rate: 0.0003
Batch Size: 2
Weight Decay: 1.0
Gradient Accumulation Steps: 4
Epochs: 6
Training time: ~15 hours
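For reference, the hyperparameters above could be expressed as transformers Seq2SeqTrainingArguments roughly as follows. This is a hedged sketch rather than the exact configuration used, and output_dir is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="marian-finetuned-ja-en",  # placeholder path
    learning_rate=3e-4,                   # 0.0003
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,        # effective batch size of 8
    weight_decay=1.0,
    num_train_epochs=6,
    predict_with_generate=True,
)
# `args` would then be passed to a Seq2SeqTrainer together with the
# tokenized ASPEC dataset and a DataCollatorForSeq2Seq.
```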
Evaluation
Testing Data
Official ASPEC Japanese-English test split (sentence level).
Metrics
| Metric | Base | Fine-tuned | % Improvement |
| ------ | ---- | ---------- | ------------- |
| BLEU   | 12.19 | 32.73 | +169% |
| METEOR | 0.42 | 0.66 | +58% |
| COMET  | 0.77 | 0.85 | +11% |
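The scores above can be reproduced with the Hugging Face evaluate library. The sketch below is an assumption about the evaluation setup (sacreBLEU for BLEU and the unbabel-comet backend for COMET), not the exact script used; the example lists simply reuse the sentence shown in Example Outputs:

```python
import evaluate

# In practice these lists come from the ASPEC Japanese-English test split.
sources = ["DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。"]
predictions = ['Using the DERS software, the dose rate of "Fugen power plant" can be calculated in detail.']
references = ['Details of dose rate of "Fugen Power Plant" can be calculated by using DERS software.']

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")
comet = evaluate.load("comet")  # requires the unbabel-comet package

print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references])["score"])
print(meteor.compute(predictions=predictions,
                     references=references)["meteor"])
print(comet.compute(sources=sources, predictions=predictions,
                    references=references)["mean_score"])
```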
Example Outputs
Input: DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。
Reference: Details of dose rate of "Fugen Power Plant" can be calculated by using DERS software.
Model Output: Using the DERS software, the dose rate of "Fugen power plant" can be calculated in detail.
Environmental Impact
Hardware: 1× NVIDIA L1 GPU (40GB RAM)
Training time: ~15 hours
Cloud provider: Google Cloud Vertex AI
Citation
BibTeX:
```bibtex
@misc{xu2025marianmtacademic,
  title={Japanese-English Academic Translator [Sentence-Level]},
  author={Xu, Yifan and Zhang, Kenneth},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/susiexyf/marian-finetuned-ja-en}}
}
```