|
--- |
|
license: mit |
|
datasets: |
|
- HuggingFaceFW/fineweb |
|
- HuggingFaceFW/fineweb-2 |
|
pipeline_tag: fill-mask |
|
--- |
|
# makiart/multilingual-ModernBert-base-preview |
|
|
|
This model was developed by the [Algomatic](https://algomatic.jp/) team using computational resources provided by the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html). |
|
|
|
- **Context Length:** 8192 |
|
- **Vocabulary Size:** 151,680 |
|
- **Total Training Tokens:** Approximately 250B tokens |
|
- **Parameter Count:** 228M |
|
- **Non-embedding Parameter Count:** 110M |
|
- **Training Data:** fineweb and fineweb-2
|
|
|
## How to Use |
|
|
|
Install the required package using: |
|
|
|
```bash |
|
pip install -U "transformers>=4.48.0"
|
``` |
|
|
|
If your GPU supports FlashAttention, you can achieve more efficient inference by installing: |
|
|
|
```bash |
|
pip install flash-attn --no-build-isolation |
|
``` |
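
Depending on your transformers version, ModernBERT may use FlashAttention automatically once flash-attn is installed. If you want to request it explicitly, the standard `attn_implementation` argument of `from_pretrained` can be used; the sketch below assumes a CUDA GPU that supports FlashAttention 2:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "makiart/multilingual-ModernBert-base-preview"

# Explicitly request FlashAttention 2; drop this argument to let transformers
# choose the default attention implementation.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```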
|
|
|
## Example Usage |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline |
|
|
|
model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-base-preview", torch_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base-preview")
|
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer) |
|
|
|
|
|
results = fill_mask("우리의 대부분의 고뇌는 가능했을 또 다른 인생을 [MASK] 데서 시작된다.") |
|
|
|
for result in results: |
|
print(result) |
|
|
|
# {'score': 0.248046875, 'token': 128956, 'token_str': ' 하는', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 하는 데서 시작된다.'} |
|
# {'score': 0.1328125, 'token': 61298, 'token_str': ' 한', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 한 데서 시작된다.'} |
|
# {'score': 0.06689453125, 'token': 95002, 'token_str': ' 할', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 할 데서 시작된다.'} |
|
# {'score': 0.055419921875, 'token': 130679, 'token_str': ' 위한', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 위한 데서 시작된다.'} |
|
# {'score': 0.04052734375, 'token': 131582, 'token_str': ' 통해', 'sequence': '우리의 대부분의 고뇌는 가능했을 또 다른 인생을 통해 데서 시작된다.'} |
|
|
|
|
|
results = fill_mask("Pinning our hopes on the unreliable notion of our potential is the root of all our [MASK].") |
|
|
|
for result in results: |
|
print(result) |
|
|
|
# {'score': 0.20703125, 'token': 5322, 'token_str': ' problems', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our problems.'} |
|
# {'score': 0.09765625, 'token': 27850, 'token_str': ' failures', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our failures.'} |
|
# {'score': 0.040771484375, 'token': 34565, 'token_str': ' troubles', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our troubles.'} |
|
# {'score': 0.03173828125, 'token': 18707, 'token_str': ' dreams', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our dreams.'} |
|
# {'score': 0.028076171875, 'token': 23209, 'token_str': ' fears', 'sequence': 'Pinning our hopes on the unreliable notion of our potential is the root of all our fears.'} |
|
|
|
|
|
results = fill_mask("我们必须[MASK],我们只能成为此时此地的那个自己,而无法成为其他任何人。") |
|
|
|
for result in results: |
|
print(result) |
|
|
|
# {'score': 0.177734375, 'token': 99392, 'token_str': '知道', 'sequence': '我们必须知道,我们只能成为此时此地的那个自己,而无法成为其他任何人。'} |
|
# {'score': 0.138671875, 'token': 104953, 'token_str': '承认', 'sequence': '我们必须承认,我们只能成为此时此地的那个自己,而无法成为其他任何人。'} |
|
# {'score': 0.12255859375, 'token': 101265, 'token_str': '明白', 'sequence': '我们必须明白,我们只能成为此时此地的那个自己,而无法成为其他任何人。'} |
|
# {'score': 0.07421875, 'token': 105712, 'token_str': '记住', 'sequence': '我们必须记住,我们只能成为此时此地的那个自己,而无法成为其他任何人。'} |
|
# {'score': 0.0654296875, 'token': 106836, 'token_str': '认识到', 'sequence': '我们必须认识到,我们只能成为此时此地的那个自己,而无法成为其他任何人。'} |
|
``` |
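
Beyond fill-mask, the 8192-token context makes the encoder convenient for embedding long texts. The snippet below is a minimal sketch: mean pooling over non-padding tokens is an illustrative choice, and the example sentences are placeholders rather than anything prescribed by this card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "makiart/multilingual-ModernBert-base-preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

texts = [
    "ModernBERT handles long multilingual documents.",
    "このモデルは最大8192トークンの文脈を扱えます。",
]
batch = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding positions to get one vector per text.
mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```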
|
|
|
## Model Description |
|
|
|
- **Training Approach:** The model was trained using a two-stage Masked Language Modeling (MLM) process: |
|
- **Masking Rate:** 30% |
|
- **Training Data:** Approximately 200B tokens with a context length of 1024 and 50B tokens with a context length of 8192. |
|
- **Tokenizer:** Based on Qwen2.5, the tokenizer features: |
|
- A vocabulary size of 151,680 tokens. |
|
- Customizations that keep code indentation distinct, enabling better handling of programming text (see the tokenizer sketch after this list).
|
- **Dataset:** |
|
- Utilizes the fineweb and fineweb2 datasets. |
|
- For high-resource languages, the amount of data used was downsampled.
|
- **Computational Resources:** Training was conducted using one node (H200 x 8) provided by ABCI, over the course of approximately 3 days. |
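
Because the tokenizer keeps code indentation distinct, a quick sanity check is to tokenize a small snippet and inspect the whitespace tokens. This is only an illustration; the exact token strings returned depend on the tokenizer and are not guaranteed here:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base-preview")

code = (
    "def greet(name):\n"
    "    if name:\n"
    "        return f\"Hello, {name}\"\n"
    "    return \"Hello\"\n"
)

# Indentation should surface as dedicated whitespace tokens rather than being
# split into single spaces, which keeps code structure intact for MLM.
print(tokenizer.tokenize(code))
```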
|
|
|
## Evaluation |
|
|
|
A comprehensive evaluation has not been performed yet 😭. |
|
|
|
Given the total training token count, the model is expected to be less competitive than existing models.
|
|
|
--- |
|
|
|
This model was created by the [Algomatic](https://algomatic.jp/) team using computational resources provided by the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html).
|
|
|
- Context length: 8192

- Vocabulary size: 151,680

- Total training tokens: approximately 250B

- Parameters: 228M

- Non-embedding parameters: 110M

- Uses fineweb and fineweb-2
|
|
|
## How to Use |
|
|
|
```bash |
|
pip install -U "transformers>=4.48.0"
|
``` |
|
|
|
If your GPU supports FlashAttention, installing the following enables more efficient inference:
|
|
|
```bash |
|
pip install flash-attn --no-build-isolation |
|
``` |
|
|
|
## Example Usage |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline |
|
|
|
model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-base-preview", torch_dtype=torch.bfloat16) |
|
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-base-preview") |
|
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer) |
|
|
|
results = fill_mask("大きな[MASK]を一人で切り分けて食べるというのは孤独の極地ですからね") |
|
|
|
for result in results: |
|
print(result) |
|
|
|
# {'score': 0.11865234375, 'token': 142732, 'token_str': 'ケーキ', 'sequence': '大きなケーキを一人で切り分けて食べるというのは孤独の極地ですからね'} |
|
# {'score': 0.10498046875, 'token': 52853, 'token_str': '物', 'sequence': '大きな物を一人で切り分けて食べるというのは孤独の極地ですからね'} |
|
# {'score': 0.08154296875, 'token': 108371, 'token_str': '魚', 'sequence': '大きな魚を一人で切り分けて食べるというのは孤独の極地ですからね'} |
|
# {'score': 0.05615234375, 'token': 111974, 'token_str': '料理', 'sequence': '大きな料理を一人で切り分けて食べるというのは孤独の極地ですからね'} |
|
# {'score': 0.043701171875, 'token': 115913, 'token_str': '動物', 'sequence': '大きな動物を一人で切り分けて食べるというのは孤独の極地ですからね'} |
|
``` |
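
If you prefer not to go through the pipeline, the top candidates for the [MASK] position can be read directly from the logits. A minimal sketch, assuming (as the pipeline examples above imply) that the literal `[MASK]` string is mapped to the tokenizer's mask token:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "makiart/multilingual-ModernBert-base-preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

text = "大きな[MASK]を一人で切り分けて食べるというのは孤独の極地ですからね"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the five highest-scoring vocabulary entries.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = torch.topk(logits[0, mask_pos], k=5, dim=-1).indices[0]
print([tokenizer.decode(int(token_id)) for token_id in top5])
```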
|
|
|
## Model Description |
|
|
|
- MLM training was done in two stages:

  - Masking rate: 30%

  - Approximately 200B tokens at a context length of 1024

  - Approximately 50B tokens at a context length of 8192

- The tokenizer is based on Qwen2.5:

  - Vocabulary size: 151,680

  - Customized to keep indentation distinct so that code text is handled well

- Dataset:

  - Uses fineweb and fineweb-2

  - Data volume was reduced for high-resource languages

- Computational resources:

  - Trained for about 3 days on one node (H200 x 8) of the resources provided by ABCI
|
|
|
## Evaluation |
|
|
|
A proper evaluation has not been done yet 😭
|
|
|
Given the total number of training tokens, the model is expected to fall short of existing models.