makiart/multilingual-ModernBert-large-preview
This model was developed by the Algomatic team using computational resources provided by the ABCI Generative AI Hackathon.
- Context Length: 8192
- Vocabulary Size: 151,680
- Total Training Tokens: Approximately 60B tokens (after inheriting weights from the base model)
- Parameter Count: 500M
- Non-embedding Parameter Count: 343M
- Training Data: fineweb and fineweb2
How to Use
Install the required package using:
pip install -U "transformers>=4.48.0"
If your GPU supports FlashAttention, you can achieve more efficient inference by installing:
pip install flash-attn --no-build-isolation
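Once flash-attn is installed, you can request the FlashAttention 2 backend when loading the model. A minimal sketch (this assumes a CUDA GPU, half-precision weights, and a transformers version whose ModernBERT implementation accepts the attn_implementation flag):
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load in bfloat16 and ask for the FlashAttention 2 kernels; transformers raises an
# error if the installed flash-attn / hardware combination does not support them.
model = AutoModelForMaskedLM.from_pretrained(
    "makiart/multilingual-ModernBert-large",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")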
Example Usage
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = fill_mask("우리의 대부분의 고뇌는 가능했을 또 다른 인생을 [MASK] 데서 시작된다.")
for result in results:
    print(result)
# {'score': 0.09716796875, 'token': 131582, 'token_str': ' 통해', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 통해 데서 시작된다.'}
# {'score': 0.058837890625, 'token': 61298, 'token_str': ' 한', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 한 데서 시작된다.'}
# {'score': 0.04296875, 'token': 128956, 'token_str': ' 하는', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 하는 데서 시작된다.'}
# {'score': 0.02783203125, 'token': 130039, 'token_str': ' 위해', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 위해 데서 시작된다.'}
# {'score': 0.026123046875, 'token': 134108, 'token_str': ' 만들어', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 만들어 데서 시작된다.'}
results = fill_mask("Pinning our hopes on the unreliable notion of our potential is the root of all our [MASK].")
for result in results:
    print(result)
# {'score': 0.1845703125, 'token': 5322, 'token_str': ' problems', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our problems.'}
# {'score': 0.08740234375, 'token': 27850, 'token_str': ' failures', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our failures.'}
# {'score': 0.06005859375, 'token': 23209, 'token_str': ' fears', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our fears.'}
# {'score': 0.0322265625, 'token': 34565, 'token_str': ' troubles', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our troubles.'}
# {'score': 0.0250244140625, 'token': 18707, 'token_str': ' dreams', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our dreams.'}
results = fill_mask("我们必须[MASK],我们只能成为此时此地的那个自己,而无法成为其他任何人。")
for result in results:
    print(result)
# {'score': 0.1904296875, 'token': 104953, 'token_str': '承认', 'sequence': '我们必须承认,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.1484375, 'token': 99392, 'token_str': '知道', 'sequence': '我们必须知道,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.1484375, 'token': 106836, 'token_str': '认识到', 'sequence': '我们必须认识到,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.10205078125, 'token': 101265, 'token_str': '明白', 'sequence': '我们必须明白,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
# {'score': 0.0703125, 'token': 105712, 'token_str': '记住', 'sequence': '我们必须记住,我们只能成为此时此地的那个自己,而无法成为其他任何人。'}
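results = fill_mask("たとえ[MASK]の中であっても鍋から的確に意中の具をつまみだせる技術")
for result in results:
    print(result)
# {'score': 0.5078125, 'token': 45629, 'token_str': '家', 'sequence': 'たとえ家の中であっても鍋から的確に意中の具をつまみだせる技術'}
# {'score': 0.11279296875, 'token': 116990, 'token_str': '鍋', 'sequence': 'たとえ鍋の中であっても鍋から的確に意中の具をつまみだせる技術'}
# {'score': 0.060546875, 'token': 105010, 'token_str': '厨房', 'sequence': 'たとえ厨房の中であっても鍋から的確に意中の具をつまみだせる技術'}
# {'score': 0.02685546875, 'token': 101064, 'token_str': '家庭', 'sequence': 'たとえ家庭の中であっても鍋から的確に意中の具をつまみだせる技術'}
# {'score': 0.0184326171875, 'token': 142935, 'token_str': 'キッチン', 'sequence': 'たとえキッチンの中であっても鍋から的確に意中の具をつまみだせる技術'}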
Model Description
- Training Approach:
  - The weights are inherited from the base model by tiling them from the middle.
  - Trained on approximately 60B tokens with a context length of 8192.
- Tokenizer: Based on Qwen2.5, with a vocabulary size of 151,680 tokens. It has been customized to distinguish indentation, making it better suited for code text (see the sketch after this list).
- Dataset:
  - Uses the fineweb and fineweb2 datasets.
  - For languages with abundant data, the volume was reduced.
- Computational Resources: Training was conducted on a single node (H200 x 8) provided by ABCI over approximately 2 days.
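As a quick check of the indentation-aware tokenization mentioned above, you can tokenize a small indented code snippet and inspect the pieces. This is only an illustrative sketch; the exact token strings depend on the released tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")

code = 'def greet(name):\n    if name:\n        return f"Hello, {name}!"\n    return "Hello!"\n'

# Inspect how leading whitespace is segmented; the customized vocabulary is meant
# to keep different indentation levels distinguishable in code text.
print(tokenizer.tokenize(code))

# Total vocabulary size including added tokens; reported above as 151,680.
print(len(tokenizer))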
Evaluation
A comprehensive evaluation has not been performed yet 😭.
Based on the total training token count, it is anticipated that the model may be less competitive than existing models.