makiart/multilingual-ModernBert-base-preview

This model was developed by the Algomatic team using computational resources provided by the ABCI Generative AI Hackathon.

  • Context Length: 8192
  • Vocabulary Size: 151,680
  • Total Training Tokens: Approximately 250B tokens
  • Parameter Count: 228M
  • Non-embedding Parameter Count: 110M
  • Training Data: fineweb and fineweb2
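These figures are roughly self-consistent: the gap between the total and non-embedding parameter counts is the embedding table. A quick back-of-the-envelope check, assuming a hidden size of 768 (the ModernBERT-base value; the card itself does not state it):

```python
# Hypothetical sanity check of the parameter counts above.
# hidden_size = 768 is an assumption (ModernBERT-base default), not stated in this card.
vocab_size = 151_680
hidden_size = 768

embedding_params = vocab_size * hidden_size
print(f"embedding table: {embedding_params / 1e6:.1f}M params")

# The card reports 228M total and 110M non-embedding parameters,
# implying roughly 118M in embeddings -- close to the estimate above.
implied = 228e6 - 110e6
print(f"implied by the card: {implied / 1e6:.0f}M params")
```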

How to Use

Install the required package using:

pip install -U "transformers>=4.48.0"

If your GPU supports FlashAttention, you can achieve more efficient inference by installing:

pip install flash-attn --no-build-isolation

Example Usage

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-base-preview", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base-preview")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)


results = fill_mask("우리의 대부분의 고뇌는 가능했을 또 다른 인생을 [MASK] 데서 시작된다.")

for result in results:
    print(result)

# {'score': 0.248046875, 'token': 128956, 'token_str': ' 하는', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 하는 데서 시작된다.'}
# {'score': 0.1328125, 'token': 61298, 'token_str': ' 한', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 한 데서 시작된다.'}
# {'score': 0.06689453125, 'token': 95002, 'token_str': ' 할', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 할 데서 시작된다.'}
# {'score': 0.055419921875, 'token': 130679, 'token_str': ' 위한', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 위한 데서 시작된다.'}
# {'score': 0.04052734375, 'token': 131582, 'token_str': ' 통해', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 통해 데서 시작된다.'}


results = fill_mask("Pinning our hopes on the unreliable notion of our potential is the root of all our [MASK].")

for result in results:
    print(result)

# {'score': 0.20703125, 'token': 5322, 'token_str': ' problems', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our problems.'}
# {'score': 0.09765625, 'token': 27850, 'token_str': ' failures', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our failures.'}
# {'score': 0.040771484375, 'token': 34565, 'token_str': ' troubles', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our troubles.'}
# {'score': 0.03173828125, 'token': 18707, 'token_str': ' dreams', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our dreams.'}
# {'score': 0.028076171875, 'token': 23209, 'token_str': ' fears', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our fears.'}


results = fill_mask("我们必须[MASK],我们只能成为此时此地的那个自己,而无法成为其他任何人。")

for result in results:
    print(result)

# {'score': 0.177734375, 'token': 99392, 'token_str': '知道', 'sequence': '我们必须知道,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.138671875, 'token': 104953, 'token_str': '承认', 'sequence': '我们必须承认,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.12255859375, 'token': 101265, 'token_str': '明白', 'sequence': '我们必须明白,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.07421875, 'token': 105712, 'token_str': '记住', 'sequence': '我们必须记住,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.0654296875, 'token': 106836, 'token_str': '认识到', 'sequence': '我们必须认识到,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
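Under the hood, the fill-mask pipeline takes the model's logits at the [MASK] position, applies a softmax over the vocabulary, and returns the top-k tokens. A minimal sketch of that ranking step, using an invented toy vocabulary and logits in place of real model output:

```python
import math

def top_k_fill(logits, vocab, k=3):
    # Softmax over the vocabulary logits at the masked position.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank tokens by probability and keep the k best.
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy stand-ins for the model's real vocabulary and logits.
vocab = [" problems", " failures", " dreams", " cats"]
logits = [2.1, 1.4, 0.9, -1.0]
for token, prob in top_k_fill(logits, vocab):
    print(token, round(prob, 3))
```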

Model Description

  • Training Approach: The model was trained with a two-stage Masked Language Modeling (MLM) process:
    • Masking Rate: 30%
    • Stage 1: approximately 200B tokens at a context length of 1024
    • Stage 2: approximately 50B tokens at a context length of 8192
  • Tokenizer: Based on the Qwen2.5 tokenizer, featuring:
    • A vocabulary of 151,680 tokens.
    • Customizations that distinguish indentation, enabling better handling of code.
  • Dataset:
    • Built on the fineweb and fineweb2 datasets.
    • High-resource languages were downsampled.
  • Computational Resources: Training was conducted using one node (H200 x 8) provided by ABCI, over the course of approximately 3 days.
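The 30% masking rate means roughly three tokens in ten are hidden in each training example, and the model is trained to recover them. A simplified sketch of the selection step (real MLM pipelines operate on token IDs and often mix in random or kept tokens; here every selected position simply becomes [MASK]):

```python
import random

def mask_tokens(tokens, rate=0.3, seed=0):
    # Replace ~`rate` of positions with [MASK]; remember originals as labels.
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            masked.append("[MASK]")
            labels.append(tok)   # the model must predict these
        else:
            masked.append(tok)
            labels.append(None)  # excluded from the loss
    return masked, labels

tokens = "we can only be who we are here and now".split()
masked, labels = mask_tokens(tokens)
print(masked)
```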

Additional Example (Japanese)

A Japanese fill-mask example, using the same model and pipeline as in Example Usage above:

results = fill_mask("大きな[MASK]を一人で切り分けて食べるというのは孤独の極地ですからね")

for result in results:
    print(result)

# {'score': 0.11865234375, 'token': 142732, 'token_str': 'ケーキ', 'sequence': '大きなケーキを一人で切り分けて食べるというのは孤独の極地ですからね'}
# {'score': 0.10498046875, 'token': 52853, 'token_str': '物', 'sequence': '大きな物を一人で切り分けて食べるというのは孤独の極地ですからね'}
# {'score': 0.08154296875, 'token': 108371, 'token_str': '魚', 'sequence': '大きな魚を一人で切り分けて食べるというのは孤独の極地ですからね'}
# {'score': 0.05615234375, 'token': 111974, 'token_str': '料理', 'sequence': '大きな料理を一人で切り分けて食べるというのは孤独の極地ですからね'}
# {'score': 0.043701171875, 'token': 115913, 'token_str': '動物', 'sequence': '大きな動物を一人で切り分けて食べるというのは孤独の極地ですからね'}

Evaluation

A comprehensive evaluation has not been performed yet 😭.

Given the total training token count, the model is expected to be less competitive than existing models.
