makiart/multilingual-ModernBert-large-preview
This model was developed by the Algomatic team using computational resources provided by the ABCI Generative AI Hackathon.
- Context Length: 8192
- Vocabulary Size: 151,680
- Total Training Tokens: Approximately 60B tokens (after inheriting weights from the base model)
- Parameter Count: 500M
- Non-embedding Parameter Count: 343M
- Training Data: fineweb and fineweb2
How to Use
Install the required package using:
pip install -U "transformers>=4.48.0"
If your GPU supports FlashAttention, you can achieve more efficient inference by installing:
pip install flash-attn --no-build-isolation
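Once flash-attn is installed, you can request the FlashAttention 2 backend when loading the model. A minimal sketch (this assumes a CUDA GPU, half-precision weights, and a transformers version whose ModernBERT implementation accepts the attn_implementation flag):
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load in bfloat16 and ask for the FlashAttention 2 kernels; transformers raises an
# error if the installed flash-attn / hardware combination does not support them.
model = AutoModelForMaskedLM.from_pretrained(
    "makiart/multilingual-ModernBert-large",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")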
Example Usage
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = fill_mask("우리의 대부분의 고뇌는 가능했을 또 다른 인생을 [MASK] 데서 시작된다.")
for result in results:
    print(result)
# {'score': 0.09716796875, 'token': 131582, 'token_str': ' 통해', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 통해 데서 시작된다.'}
# {'score': 0.058837890625, 'token': 61298, 'token_str': ' 한', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 한 데서 시작된다.'}
# {'score': 0.04296875, 'token': 128956, 'token_str': ' 하는', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 하는 데서 시작된다.'}
# {'score': 0.02783203125, 'token': 130039, 'token_str': ' 위해', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 위해 데서 시작된다.'}
# {'score': 0.026123046875, 'token': 134108, 'token_str': ' 만들어', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 만들어 데서 시작된다.'}
results = fill_mask("Pinning our hopes on the unreliable notion of our potential is the root of all our [MASK].")
for result in results:
    print(result)
# {'score': 0.1845703125, 'token': 5322, 'token_str': ' problems', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our problems.'}
# {'score': 0.08740234375, 'token': 27850, 'token_str': ' failures', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our failures.'}
# {'score': 0.06005859375, 'token': 23209, 'token_str': ' fears', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our fears.'}
# {'score': 0.0322265625, 'token': 34565, 'token_str': ' troubles', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our troubles.'}
# {'score': 0.0250244140625, 'token': 18707, 'token_str': ' dreams', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our dreams.'}
results = fill_mask("我们必须[MASK],我们只能成为此时此地的那个自己,而无法成为其他任何人。")
for result in results:
    print(result)
# {'score': 0.1904296875, 'token': 104953, 'token_str': '承认', 'sequence': '我们必须承认,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.1484375, 'token': 99392, 'token_str': '知道', 'sequence': '我们必须知道,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.1484375, 'token': 106836, 'token_str': '认识到', 'sequence': '我们必须认识到,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.10205078125, 'token': 101265, 'token_str': '明白', 'sequence': '我们必须明白,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.0703125, 'token': 105712, 'token_str': '记住', 'sequence': '我们必须记住,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
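results = fill_mask("たとえ[MASK]の中であっても鍋から的確に意中の具をつまみだせる技術")
for result in results:
    print(result)
# {'score': 0.5078125, 'token': 45629, 'token_str': '家', 'sequence': 'たとえ家の中であっても鍋から的確に意中の具をつまみだせる技術'}
# {'score': 0.11279296875, 'token': 116990, 'token_str': '鍋', 'sequence': 'たとえ鍋の中であっても鍋から的確に意中の具をつまみだせる技術'}
# {'score': 0.060546875, 'token': 105010, 'token_str': '厨房', 'sequence': 'たとえ厨房の中であっても鍋から的確に意中の具をつまみだせる技術'}
# {'score': 0.02685546875, 'token': 101064, 'token_str': '家庭', 'sequence': 'たとえ家庭の中であっても鍋から的確に意中の具をつまみだせる技術'}
# {'score': 0.0184326171875, 'token': 142935, 'token_str': 'キッチン', 'sequence': 'たとえキッチンの中であっても鍋から的確に意中の具をつまみだせる技術'}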
Model Description
- Training Approach:
  - The weights are inherited from the base model by tiling them from the middle.
  - Trained on approximately 60B tokens with a context length of 8192.
- Tokenizer: Based on Qwen2.5, with a vocabulary size of 151,680 tokens. It has been customized to distinguish indentation, making it better suited for code text (see the sketch after this list).
- Dataset:
  - Uses the fineweb and fineweb2 datasets.
  - For languages with abundant data, the volume was reduced.
- Computational Resources: Training was conducted on a single node (H200 x 8) provided by ABCI over approximately 2 days.
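As a quick check of the indentation-aware tokenization mentioned above, you can tokenize a small indented code snippet and inspect the pieces. This is only an illustrative sketch; the exact token strings depend on the released tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")

code = 'def greet(name):\n    if name:\n        return f"Hello, {name}!"\n    return "Hello!"\n'

# Inspect how leading whitespace is segmented; the customized vocabulary is meant
# to keep different indentation levels distinguishable in code text.
print(tokenizer.tokenize(code))

# Total vocabulary size including added tokens; reported above as 151,680.
print(len(tokenizer))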
Evaluation
A comprehensive evaluation has not been performed yet 😭.
Based on the total training token count, it is anticipated that the model may be less competitive than existing models.