(The English version follows the Japanese version.)

byBERT-JP-V2 100M

バイト単位のtokenizerを採用した，日本語 BERT モデルです。

利用方法

transformers version 4.56.1 において、動作確認をしています。

import argparse

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MASK_PLACEHOLDER = "<mask>"
SAMPLE_INPUT_TEXTS = [
    f"東北大学は宮城県{MASK_PLACEHOLDER * 6}市にある大学です。", # 6 bytes mask
    f"日本一高い山は{MASK_PLACEHOLDER * 9}です。", # 9 bytes mask
]


def main(args):
    torch.manual_seed(args.seed)
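    # Fall back to CPU when no GPU is available instead of failing outright.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")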
    
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path,
        trust_remote_code=True,
    )
    model = AutoModelForMaskedLM.from_pretrained(
        args.model_name_or_path,
        dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    model.to(device)
    model.eval()
    
    input_texts = [
        s.replace(MASK_PLACEHOLDER, tokenizer.mask_token)
        for s in SAMPLE_INPUT_TEXTS
    ]
    batch = tokenizer(input_texts, return_tensors="pt", padding="longest")
    
    batch = batch.to(device)
    # Inference only: skip gradient tracking to save memory.
    with torch.no_grad():
        outputs = model(**batch)
    decoded_ids = torch.argmax(outputs.logits, dim=-1)
    is_pad = batch.input_ids == tokenizer.pad_token_id
    decoded_ids[is_pad] = tokenizer.pad_token_id
    decoded_texts = tokenizer.batch_decode(decoded_ids, skip_special_tokens=False)
    
    
    for input_ids, decoded_text in zip(batch.input_ids, decoded_texts):
        input_text = tokenizer.decode(input_ids, skip_special_tokens=False)
        print("===")
        print(f"Input: {input_text}")
        print(f"Decoded: {decoded_text}")
    

if __name__ == "__main__":
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument(
        "--model_name_or_path",
        "-m",
        type=str,
        default="tohoku-nlp/bybert-jp-v2-100m",
        help="Path to the model or model identifier from huggingface.co/models."
    )
    parser.add_argument("--seed", "-s", type=int, help="Random seed", default=42)
    args = parser.parse_args()
    main(args)

モデルアーキテクチャ

Llama アーキテクチャをベースとし、Causal Attention Mask を取り除くことで、Encoder 型言語モデルとして利用しています。
具体的には、以下のモジュールを採用しています。

学習データ

llm-jp-corpus-v3 の日本語コーパスのサブセット (ja_cc, ja_warp_html, ja_warp_pdf, ja_wiki, kaken) を使用しました。
また、学習時には Whole Word Masking を実施しています。
Whole Word Masking 単語分割器には、vibrato を利用しました。
辞書は bccwj-suw+unidic-cwj-3_1_1 を用いています。

学習時の設定

モデルの重みを初期化した Llama アーキテクチャベースの Encoder モデルを from scratch で学習させています。
各モデルの学習設定は以下の通りです。

Model	Params.	Tokens	Steps	Batch Size (tokens)
tohoku-nlp/bybert-jp-100m	107 M	623 B	198,000	3,145,728
tohoku-nlp/bybert-jp-200m	205 M	637 B	270,000	2,359,296
tohoku-nlp/bybert-jp-400m	397 M	1.23 T	308,000	3,981,312
tohoku-nlp/bybert-jp-v2-100m	114 M	2.76 T	330,000	8,388,608

学習には、Masked Language Modeling (MLM) のみ実施し、Next Sentence Prediction (NSP) は実施していません。また，tohoku-nlp/bybert-jp-v2-100mでは

学習データ量を2.85T tokensに増やす
unicodeのencodingに独自形式を採用
マスク率を50%で学習．その後30%に減少
QKVの線形変換にバイアス項を追加
batch sizeのwarmupを導入

により，小規模モデルながら比較的高い性能を達成しています．

学習設定の詳細

Setting	bybert-jp-100m, 200m, 400m	bybert-jp-v2-100m
Max Learning Rate	1.0E-3	1.0E-3
Min Learning Rate	1.0E-6	1.0E-6
Learning Rate Warmup Steps	2,000	2,000
Scheduler	cosine	cosine
Optimizer	AdamW	AdamW
Optimizer Config	beta_1 = 0.9, beta_2 = 0.999, eps = 1.0E-8	beta_1 = 0.9, beta_2 = 0.999, eps = 1.0E-8
Weight Decay	0.01	0.01
Gradient Clipping	1.0	1.0
Sequence Length	3,072	4,096
MLM Probability	0.3	0.5 -> 0.3
Replace Masked-token Probability	0.8	0.8
Replace Random-token Probability	0.1	0.1

学習にはMegatron-LMをベースに，独自の変更を加えたコードベースを使用しています。

評価

評価指標として、単語のマスクされた単語の予測正解率を用いた。実験設定の詳細は工藤 et al. (2025) を参照してください。評価結果は以下の通りです。

Model	ichikara	wiki
tohoku-nlp/bybert-jp-100m	58.0	26.3
tohoku-nlp/bybert-jp-200m	60.5	33.0
tohoku-nlp/bybert-jp-400m	67.4	38.5
tohoku-nlp/bybert-jp-v2-100m	63.4	40.5

その他，

モデルアーキテクチャ探索
ハイパーパラメータ探索
内部機序等のパフォーマンス以外の側面からの分析についても工藤 et al. (2025) を参照してください。

ライセンス

このモデルは Apache License 2.0 の下で配布しています。

免責事項

本モデルの作者は本モデルを作成するにあたって、その内容、機能等について細心の注意を払っておりますが、モデルの出力が正確であるかどうか、安全なものであるか等について保証をするものではなく、何らの責任を負うものではありません。
本モデルの利用により、万一、利用者に何らかの不都合や損害が発生したとしても、モデルやデータセットの作者や作者の所属組織は何らの責任を負うものではありません。

謝辞

このモデルの学習にあたり様々な面でご協力いただきました Tohoku NLP Group の皆様に感謝いたします。

作成者

byBERT-JP-V2 100M

A Japanese BERT model that adopts a byte-level tokenizer.

Usage

import argparse

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MASK_PLACEHOLDER = "<mask>"
SAMPLE_INPUT_TEXTS = [
    f"東北大学は宮城県{MASK_PLACEHOLDER * 6}市にある大学です。", # 6 bytes mask
    f"日本一高い山は{MASK_PLACEHOLDER * 9}です。", # 9 bytes mask
]


def main(args):
    torch.manual_seed(args.seed)
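    # Fall back to CPU when no GPU is available instead of failing outright.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")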
    
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path,
        trust_remote_code=True,
    )
    model = AutoModelForMaskedLM.from_pretrained(
        args.model_name_or_path,
        dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    model.to(device)
    model.eval()
    
    input_texts = [
        s.replace(MASK_PLACEHOLDER, tokenizer.mask_token)
        for s in SAMPLE_INPUT_TEXTS
    ]
    batch = tokenizer(input_texts, return_tensors="pt", padding="longest")
    
    batch = batch.to(device)
    # Inference only: skip gradient tracking to save memory.
    with torch.no_grad():
        outputs = model(**batch)
    decoded_ids = torch.argmax(outputs.logits, dim=-1)
    is_pad = batch.input_ids == tokenizer.pad_token_id
    decoded_ids[is_pad] = tokenizer.pad_token_id
    decoded_texts = tokenizer.batch_decode(decoded_ids, skip_special_tokens=False)
    
    
    for input_ids, decoded_text in zip(batch.input_ids, decoded_texts):
        input_text = tokenizer.decode(input_ids, skip_special_tokens=False)
        print("===")
        print(f"Input: {input_text}")
        print(f"Decoded: {decoded_text}")
    

if __name__ == "__main__":
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument(
        "--model_name_or_path",
        "-m",
        type=str,
        default="tohoku-nlp/bybert-jp-v2-100m",
        help="Path to the model or model identifier from huggingface.co/models."
    )
    parser.add_argument("--seed", "-s", type=int, help="Random seed", default=42)
    args = parser.parse_args()
    main(args)

The code above has been tested with transformers version 4.56.1.
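
The sample inputs use 6 and 9 mask tokens, which matches the UTF-8 byte lengths of "仙台" and "富士山". The sketch below shows one way to work out how many mask tokens to insert for a candidate answer. It assumes one mask token per UTF-8 byte; because this model uses its own Unicode encoding format, the count may differ for other characters. The helper names `mask_count` and `build_masked_text` are hypothetical and not part of the model's API.

```python
def mask_count(answer: str) -> int:
    """Number of UTF-8 bytes in `answer` (assumption: one mask token per byte)."""
    return len(answer.encode("utf-8"))


def build_masked_text(template: str, answer: str, mask_token: str = "<mask>") -> str:
    """Fill the `{}` slot in `template` with one mask token per byte of `answer`."""
    return template.format(mask_token * mask_count(answer))


print(mask_count("仙台"))    # 2 characters x 3 bytes each
print(mask_count("富士山"))  # 3 characters x 3 bytes each
print(build_masked_text("日本一高い山は{}です。", "富士山"))
```

The resulting string can be used in place of an entry in `SAMPLE_INPUT_TEXTS`.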

Model Architecture

The model is based on the Llama architecture. Removing the causal attention mask lets it be used as an encoder-type language model.
Specifically, we adopt the following modules:
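
The effect of removing the causal attention mask can be illustrated with a toy example. The sketch below is not the model's implementation; it only computes softmax attention weights over hand-written scores, with and without a causal mask, to show that each position can attend to later positions once the mask is removed. The function name `attention_weights` is hypothetical.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over `scores`; with causal=True, position i ignores j > i."""
    weights = []
    for i, row in enumerate(scores):
        # Masked positions get -inf so that they receive zero weight.
        visible = [s if (not causal or j <= i) else -math.inf for j, s in enumerate(row)]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform toy attention scores
causal = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)
print(causal[0])         # the first token attends only to itself
print(bidirectional[0])  # the first token attends to every position
```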

Training Data

We used a subset of the Japanese corpora in llm-jp-corpus-v3 (ja_cc, ja_warp_html, ja_warp_pdf, ja_wiki, kaken).
We also applied Whole Word Masking during training.
Word segmentation for Whole Word Masking was done with vibrato, using the bccwj-suw+unidic-cwj-3_1_1 dictionary.
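
With a byte-level tokenizer, Whole Word Masking means that all bytes of a selected word are masked together. The following is a minimal sketch of that idea, not the training code. It assumes the text has already been segmented into words (hand-segmented here rather than by vibrato), that bytes are UTF-8, and that each word is masked independently with a fixed probability; the actual selection procedure may differ. The function name `whole_word_mask` is hypothetical.

```python
import random


def whole_word_mask(words, mask_prob, rng):
    """Return per-byte mask flags where each word is masked entirely or not at all."""
    flags = []
    for word in words:
        n_bytes = len(word.encode("utf-8"))
        masked = rng.random() < mask_prob  # one decision per word, not per byte
        flags.extend([masked] * n_bytes)
    return flags


words = ["東北", "大学", "は", "仙台", "市", "に", "ある"]  # hand-segmented for illustration
flags = whole_word_mask(words, mask_prob=0.3, rng=random.Random(0))

# Show which words ended up masked.
offset = 0
for word in words:
    n_bytes = len(word.encode("utf-8"))
    print(word, "masked" if flags[offset] else "kept")
    offset += n_bytes
```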

Training Configuration

We trained the Llama-based encoder model from scratch, starting from freshly initialized weights.
The training configuration for each model is as follows:

Model	Params.	Tokens	Steps	Batch Size (tokens)
tohoku-nlp/bybert-jp-100m	107 M	623 B	198,000	3,145,728
tohoku-nlp/bybert-jp-200m	205 M	637 B	270,000	2,359,296
tohoku-nlp/bybert-jp-400m	397 M	1.23 T	308,000	3,981,312
tohoku-nlp/bybert-jp-v2-100m	114 M	2.76 T	330,000	8,388,608
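
The columns of this table are related by simple arithmetic: the token count is approximately steps times batch size. The sketch below checks this and also derives the number of sequences per batch, using the sequence lengths listed under Detailed Training Configuration (3,072 for the first three models and 4,096 for bybert-jp-v2-100m). The `summarize` helper is hypothetical.

```python
RUNS = {
    # name: (steps, batch size in tokens, sequence length)
    "bybert-jp-100m": (198_000, 3_145_728, 3_072),
    "bybert-jp-200m": (270_000, 2_359_296, 3_072),
    "bybert-jp-400m": (308_000, 3_981_312, 3_072),
    "bybert-jp-v2-100m": (330_000, 8_388_608, 4_096),
}


def summarize(steps, batch_tokens, seq_len):
    """Return (total tokens at a constant batch size, sequences per batch)."""
    return steps * batch_tokens, batch_tokens // seq_len


for name, run in RUNS.items():
    total_tokens, sequences = summarize(*run)
    print(f"{name}: {total_tokens / 1e12:.3f}T tokens, {sequences} sequences per batch")
```

For the first three models the product matches the Tokens column after rounding. For bybert-jp-v2-100m the product is about 2.77T, slightly above the 2.76T listed; one possible explanation is batch size warmup, which would use smaller batches early in training.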

Training used only Masked Language Modeling (MLM); Next Sentence Prediction (NSP) was not used. In addition, tohoku-nlp/bybert-jp-v2-100m incorporates the following changes:

Increased the training data to 2.85T tokens
Adopted a custom format for Unicode encoding
Trained with a 50% mask rate, later reduced to 30%
Added bias terms to the QKV linear projections
Introduced batch size warmup

With these changes, the model achieves relatively high performance despite its small size.
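
Two of the changes above are schedules over training steps. The sketch below shows what a two-stage mask rate and a batch size warmup could look like. Only the 50% and 30% mask rates and the final batch size of 8,388,608 tokens come from this document; the switch step, the warmup length, the starting batch size, and the linear shape of the warmup are all hypothetical values chosen for illustration.

```python
SWITCH_STEP = 300_000         # hypothetical: step at which the mask rate drops
WARMUP_STEPS = 10_000         # hypothetical: length of the batch size warmup
START_BATCH = 1_048_576       # hypothetical: initial batch size in tokens
FINAL_BATCH = 8_388_608       # from the training configuration table


def mask_rate(step, switch_step=SWITCH_STEP, initial=0.5, final=0.3):
    """Two-stage mask rate: `initial` before `switch_step`, `final` afterwards."""
    return initial if step < switch_step else final


def warmup_batch_size(step, warmup_steps=WARMUP_STEPS,
                      start_size=START_BATCH, final_size=FINAL_BATCH):
    """Grow the batch size linearly from `start_size` to `final_size` (assumed shape)."""
    if step >= warmup_steps:
        return final_size
    return start_size + (final_size - start_size) * step // warmup_steps


for step in (0, 5_000, 10_000, 299_999, 300_000):
    print(step, mask_rate(step), warmup_batch_size(step))
```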

Detailed Training Configuration

Setting	bybert-jp-100m, 200m, 400m	bybert-jp-v2-100m
Max Learning Rate	1.0E-3	1.0E-3
Min Learning Rate	1.0E-6	1.0E-6
Learning Rate Warmup Steps	2,000	2,000
Scheduler	cosine	cosine
Optimizer	AdamW	AdamW
Optimizer Config	beta_1 = 0.9, beta_2 = 0.999, eps = 1.0E-8	beta_1 = 0.9, beta_2 = 0.999, eps = 1.0E-8
Weight Decay	0.01	0.01
Gradient Clipping	1.0	1.0
Sequence Length	3,072	4,096
MLM Probability	0.3	0.5 -> 0.3
Replace Masked-token Probability	0.8	0.8
Replace Random-token Probability	0.1	0.1
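
The learning rate settings in this table can be read as a warmup followed by cosine decay. The sketch below is an illustration under assumptions, not the Megatron-LM scheduler itself: it assumes a linear warmup from zero over the first 2,000 steps and a cosine decay that spans all remaining steps. The function name `learning_rate` is hypothetical.

```python
import math

MAX_LR = 1.0e-3
MIN_LR = 1.0e-6
WARMUP_STEPS = 2_000


def learning_rate(step, total_steps):
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS  # linear warmup (assumed shape)
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 330_000  # steps listed for bybert-jp-v2-100m
for step in (0, 1_000, 2_000, 166_000, 330_000):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.3e}")
```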

Training used a codebase built on Megatron-LM with our own modifications.

Evaluation

We used the prediction accuracy of masked words as the evaluation metric. For details of the experimental setup, please refer to Kudo et al. (2025). The evaluation results are as follows:

Model	ichikara	wiki
tohoku-nlp/bybert-jp-100m	58.0	26.3
tohoku-nlp/bybert-jp-200m	60.5	33.0
tohoku-nlp/bybert-jp-400m	67.4	38.5
tohoku-nlp/bybert-jp-v2-100m	63.4	40.5
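
For a byte-level model, a word-level accuracy has to be derived from byte-level predictions. The exact protocol is described in Kudo et al. (2025) and is not reproduced here. The sketch below shows one plausible reading, assuming that a masked word counts as correct only when every predicted byte matches, and that the scores in the table are percentages. The function name `masked_word_accuracy` is hypothetical.

```python
def masked_word_accuracy(gold_words, predicted_words):
    """Percentage of masked words whose predicted bytes all match the gold bytes."""
    if len(gold_words) != len(predicted_words):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_words:
        return 0.0
    correct = sum(gold == pred for gold, pred in zip(gold_words, predicted_words))
    return 100.0 * correct / len(gold_words)


gold = ["仙台".encode("utf-8"), "富士山".encode("utf-8")]
# The second prediction differs in a single byte, so the whole word counts as wrong.
predicted = [gold[0], gold[1][:-1] + b"\x00"]
print(masked_word_accuracy(gold, predicted))
```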

Kudo et al. (2025) also cover the following topics:

Model architecture exploration
Hyperparameter exploration
Analysis of aspects other than performance, such as internal mechanisms

License

This model is distributed under the Apache License 2.0.

Disclaimer

While the authors have taken great care over the content and functionality of this model, they do not warrant that its output is accurate or safe, and they assume no responsibility whatsoever.
The authors of the model and datasets, and their affiliated organizations, assume no responsibility for any inconvenience or damage that users may incur through the use of this model.

Acknowledgments

We would like to thank everyone at Tohoku NLP Group for their cooperation in various aspects of training this model.

Creators
