Model Card: Japanese-English Academic Translator [Sentence-Level]

Model Details

Model name: ywc1/marian-finetuned-ja-en

Developed by: Susie Xu and Kenneth Zhang

Languages: Japanese → English

Finetuned from: Helsinki-NLP/opus-mt-ja-en

Architecture: MarianMT (Transformer encoder–decoder)

Model Description

This model is fine-tuned from MarianMT for sentence-level translation of academic text from Japanese to English. It was trained on the ASPEC (Asian Scientific Paper Excerpt Corpus) and is designed to preserve technical vocabulary, proper nouns, and factual accuracy in the scientific domain. Because it was trained on single sentences, it may underperform on multi-sentence or paragraph inputs. For longer academic passages, use ywc1/mbart-finetuned-ja-en-para.

Intended Uses & Limitations

Intended uses:

Translating individual sentences from Japanese academic papers to English.

Assisting researchers in quickly understanding scientific literature written in Japanese.

Limitations:

Not optimized for conversational, literary, or informal text.

May produce less fluent results for multi-sentence inputs.

Sentence-level fluency can sometimes lag behind factual accuracy.

How to Use

from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = MarianMTModel.from_pretrained("ywc1/marian-finetuned-ja-en")
tokenizer = MarianTokenizer.from_pretrained("ywc1/marian-finetuned-ja-en")

# Translate a single Japanese sentence.
text = "DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))

Web interface: Hugging Face Spaces

Training Details

Training Data

Dataset: ASPEC (Asian Scientific Paper Excerpt Corpus) – Japanese-English subset.

Size used: 300,000 sentence pairs (~30% of total dataset).

Domain: Academic papers in science and technology (pre-2010).

Preprocessing:

Removed dataset IDs.

Kept only Japanese-English aligned pairs.

Tokenized source text (Japanese) as encoder input and target text (English) as decoder labels (see the sketch after this list).
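The snippet below is a minimal sketch of the preprocessing described above. The exact raw ASPEC field layout the authors handled is not documented in this card; the " ||| " delimiter, field order, and max_length value are assumptions for illustration.

# Hedged sketch of the preprocessing steps listed above (not the authors' exact script).
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ja-en")

def parse_aspec_line(line):
    # Drop leading ID/score fields and keep only the aligned Japanese-English pair.
    fields = line.rstrip("\n").split(" ||| ")
    japanese, english = fields[-2], fields[-1]
    return japanese, english

def tokenize_pair(japanese, english, max_length=128):
    # Tokenize the Japanese source for the encoder and the English target as labels.
    return tokenizer(japanese, text_target=english, max_length=max_length, truncation=True)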

Training Procedure

Compute: Google Cloud Vertex AI, NVIDIA L1 GPU (40GB RAM) for most training; occasional NVIDIA T4 on Colab.

Hyperparameters (reflected in the training sketch below):

Learning Rate: 0.0003

Batch Size: 2

Weight Decay: 1.0

Gradient Accumulation Steps: 4

Epochs: 6

Training time: ~15 hours
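The following is a minimal fine-tuning sketch consistent with the hyperparameters listed above, using the transformers Seq2SeqTrainer. The trainer wiring and dataset plumbing are assumptions for illustration, not the authors' exact training script.

# Hedged sketch of a fine-tuning setup matching the reported hyperparameters.
from transformers import (
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

base_model = "Helsinki-NLP/opus-mt-ja-en"
tokenizer = MarianTokenizer.from_pretrained(base_model)
model = MarianMTModel.from_pretrained(base_model)

args = Seq2SeqTrainingArguments(
    output_dir="marian-finetuned-ja-en",
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    weight_decay=1.0,
    num_train_epochs=6,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    # train_dataset / eval_dataset would be the tokenized ASPEC pairs
    # produced by the preprocessing step above.
)
# trainer.train()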

Evaluation

Testing Data

Official ASPEC Japanese-English test split (sentence level).

Metrics

Metric    Base     Fine-tuned    % Improvement
BLEU      12.19    32.73         +169%
METEOR    0.42     0.66          +58%
COMET     0.77     0.85          +11%

Example Outputs

Input: DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。
Reference: Details of dose rate of "Fugen Power Plant" can be calculated by using DERS software.
Model Output: Using the DERS software, the dose rate of "Fugen power plant" can be calculated in detail.
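For reference, the snippet below is a hedged sketch of scoring a translation with the metrics reported above, using the Hugging Face evaluate library and the example output shown (COMET additionally requires the unbabel-comet package). It is an illustration, not the authors' exact evaluation script.

# Hedged sketch: compute BLEU, METEOR, and COMET for one prediction.
import evaluate

sources = ["DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。"]
predictions = ['Using the DERS software, the dose rate of "Fugen power plant" can be calculated in detail.']
references = ['Details of dose rate of "Fugen Power Plant" can be calculated by using DERS software.']

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")
comet = evaluate.load("comet")

print(bleu.compute(predictions=predictions, references=[[r] for r in references])["score"])
print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(comet.compute(predictions=predictions, references=references, sources=sources)["mean_score"])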

Environmental Impact

Hardware: 1× NVIDIA L1 GPU (40GB RAM)

Training time: ~15 hours

Cloud provider: Google Cloud Vertex AI

Citation

BibTeX:

@misc{xu2025marianmtacademic,
  title={Japanese-English Academic Translator [Sentence-Level]},
  author={Xu, Yifan and Zhang, Kenneth},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/susiexyf/marian-finetuned-ja-en}}
}
