
# quickmt-zh-en Neural Machine Translation Model

## Usage

### Install quickmt

```bash
git clone https://github.com/quickmt/quickmt.git
pip install ./quickmt/
```

### Download model

```bash
quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
```
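
Alternatively, since the model is hosted on the Hugging Face Hub, the files can be fetched directly with `huggingface_hub` (a minimal sketch using the standard `snapshot_download` API rather than the quickmt CLI):

```python
# Sketch: download the model folder straight from the Hugging Face Hub.
# Assumes the `huggingface_hub` package is installed.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="quickmt/quickmt-zh-en", local_dir="./quickmt-zh-en")
```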

### Use model

Inference with quickmt:

```python
from quickmt import Translator

# Auto-detects GPU, set to "cpu" to force CPU inference
t = Translator("./quickmt-zh-en/", device="auto")

# Translate - set beam size to 5 for higher quality (but slower speed)
t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], beam_size=1)

# Get alternative translations by sampling
# You can pass any cTranslate2 `translate_batch` arguments
t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
```

The model is in CTranslate2 format and the tokenizers are SentencePiece models, so you can use the model files directly if you want. It would be fairly easy to get them to work with e.g. LibreTranslate, which also uses CTranslate2 and SentencePiece.
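
For illustration, here is a minimal sketch of direct inference with the `ctranslate2` and `sentencepiece` Python packages, bypassing the quickmt wrapper. The tokenizer file names (`src.spm.model`, `tgt.spm.model`) are taken from the training configuration below, and the sketch assumes the CTranslate2 model files sit at the top level of the downloaded folder; the actual layout may differ:

```python
# Sketch: load the CTranslate2 model and SentencePiece tokenizers directly.
# File names/paths below are assumptions, not guaranteed by this README.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("./quickmt-zh-en", device="cpu")
sp_src = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/src.spm.model")
sp_tgt = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/tgt.spm.model")

# Tokenize the source, translate, then detokenize the best hypothesis.
tokens = sp_src.encode("他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”", out_type=str)
results = translator.translate_batch([tokens], beam_size=5)
print(sp_tgt.decode(results[0].hypotheses[0]))
```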

## Model Information

### Metrics

BLEU and CHRF2 calculated with sacrebleu on the Flores200 devtest test set ("zho_Hans"->"eng_Latn").

"Time" is the time to translate the following input with a single CPU core:

```
2019冠状病毒病(英語:Coronavirus disease 2019,缩写:COVID-19[17][18]),是一種由嚴重急性呼吸系統綜合症冠狀病毒2型(縮寫:SARS-CoV-2)引發的傳染病,导致了一场持续的疫情,成为人類歷史上致死人數最多的流行病之一。
```
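
The exact timing harness is not shown here, but a rough sketch of an equivalent measurement looks like this (pinning the process to one core, e.g. with `taskset -c 0`, is left to the shell, and the decoding settings used for the published timings are not stated):

```python
# Sketch: time a single CPU translation of the sentence above with quickmt.
import time
from quickmt import Translator

t = Translator("./quickmt-zh-en/", device="cpu")
src = "2019冠状病毒病(英語:Coronavirus disease 2019,缩写:COVID-19[17][18]),是一種由嚴重急性呼吸系統綜合症冠狀病毒2型(縮寫:SARS-CoV-2)引發的傳染病,导致了一场持续的疫情,成为人類歷史上致死人數最多的流行病之一。"

start = time.perf_counter()
t([src], beam_size=1)
print(f"{time.perf_counter() - start:.3f} s")
```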

| Model | BLEU | CHRF2 | Time (s) |
| ----- | ---- | ----- | -------- |
| quickmt/quickmt-zh-en | 28.58 | 57.46 | 0.670 |
| Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 | 0.838 |
| facebook/m2m100_418M | 18.96 | 50.06 | 11.5 |
| facebook/nllb-200-distilled-600M | 26.22 | 55.17 | 13.2 |
| facebook/nllb-200-distilled-1.3B | 28.54 | 57.34 | 23.6 |
| facebook/m2m100_1.2B | 24.68 | 54.68 | 25.7 |
| google/madlad400-3b-mt | 28.74 | 58.01 | ??? |
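
The scores above can be reproduced along these lines with sacrebleu (a minimal sketch; the Flores200 devtest source and eng_Latn references must be obtained separately, and the single-sentence lists below are only placeholders):

```python
# Sketch: corpus-level BLEU and chrF2 with sacrebleu.
# `hypotheses` are the model's translations of the Flores200 devtest source;
# `references` are the matching English reference sentences (placeholders here).
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["This is a placeholder system translation."]
references = ["This is a placeholder reference translation."]

print(BLEU().corpus_score(hypotheses, [references]))  # BLEU
print(CHRF().corpus_score(hypotheses, [references]))  # chrF2 (sacrebleu default)
```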

quickmt-zh-en is the fastest model in this comparison while delivering quality close to the best.

Helsinki-NLP/opus-mt-zh-en is one of the most downloaded machine translation models on Hugging Face; quickmt-zh-en is considerably more accurate and somewhat faster.

## Training Configuration

```yaml
### Vocab
src_vocab_size: 20000
tgt_vocab_size: 20000
share_vocab: False

data:
    corpus_1:
        path_src: hf://quickmt/quickmt-train-zh-en/zh
        path_tgt: hf://quickmt/quickmt-train-zh-en/en
        path_sco: hf://quickmt/quickmt-train-zh-en/sco
    valid:
        path_src: zh-en/dev.zho
        path_tgt: zh-en/dev.eng

transforms: [sentencepiece, filtertoolong]
transforms_configs:
  sentencepiece:
    src_subword_model: "zh-en/src.spm.model"
    tgt_subword_model: "zh-en/tgt.spm.model"
  filtertoolong:
    src_seq_length: 512
    tgt_seq_length: 512

training:
    # Run configuration
    model_path: quickmt-zh-en
    keep_checkpoint: 4
    save_checkpoint_steps: 1000
    train_steps: 104000
    valid_steps: 1000
    
    # Train on a single GPU
    world_size: 1
    gpu_ranks: [0]

    # Batching
    batch_type: "tokens"
    batch_size: 13312
    valid_batch_size: 13312
    batch_size_multiple: 8
    accum_count: [4]
    accum_steps: [0]

    # Optimizer & Compute
    compute_dtype: "bfloat16"
    optim: "pagedadamw8bit"
    learning_rate: 1.0
    warmup_steps: 10000
    decay_method: "noam"
    adam_beta2: 0.998

    # Data loading
    bucket_size: 262144
    num_workers: 4
    prefetch_factor: 100

    # Hyperparams
    dropout_steps: [0]
    dropout: [0.1]
    attention_dropout: [0.1]
    max_grad_norm: 0
    label_smoothing: 0.1
    average_decay: 0.0001
    param_init_method: xavier_uniform
    normalization: "tokens"

model:
    architecture: "transformer"
    layer_norm: standard
    share_embeddings: false
    share_decoder_embeddings: true
    add_ffnbias: true
    mlp_activation_fn: gated-silu
    add_estimator: false
    add_qkvbias: false
    norm_eps: 1e-6
    hidden_size: 1024
    encoder:
        layers: 8
    decoder:
        layers: 2
    heads: 16
    transformer_ff: 4096
    embeddings:
        word_vec_size: 1024
        position_encoding_type: "SinusoidalInterleaved"
```