Gerald Stanje (Gerald001)
AI & ML interests: None yet
Recent Activity
new activity · 27 days ago · microsoft/bitnet-b1.58-2B-4T: Will the fine-tuning code be provided?
new activity · about 1 month ago · answerdotai/ModernBERT-base: sagemaker not supporting modernBERT trained model with transformers 4.49.0
new activity · about 1 month ago · answerdotai/ModernBERT-base: gpu requirements
Gerald001's activity
Will the fine-tuning code be provided? · 2 · 4 · #6 opened 27 days ago by AXCXEPT
sagemaker not supporting modernBERT trained model with transformers 4.49.0 · 5 · #69 opened 3 months ago by devs9
gpu requirements · 1 · #73 opened 2 months ago by Gerald001
fine tune model and convert to onnx · 4 · #77 opened about 2 months ago by Gerald001
commented on Fine-tune ModernBERT for text classification using synthetic data · about 2 months ago
where do you see f1 score of 0.89 ?

commented on Fine-tune ModernBERT for text classification using synthetic data · about 2 months ago
hi @davidberenstein1957, this code doesn't seem to work with transformers 4.49.0. any idea? eval_f1 stays at 0.007867705980913528 for every epoch...
i get this output:
python3 train4.py
Parameter 'function'=<function tokenize at 0x7fec4c3b6b90> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 2251.77 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 2668.59 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.115044247787611e-05, 'epoch': 0.88}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 11.5539, 'eval_samples_per_second': 8.655, 'eval_steps_per_second': 1.125, 'epoch': 1.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.230088495575221e-05, 'epoch': 1.77}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3503, 'eval_samples_per_second': 285.465, 'eval_steps_per_second': 37.11, 'epoch': 2.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.345132743362832e-05, 'epoch': 2.65}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3496, 'eval_samples_per_second': 286.027, 'eval_steps_per_second': 37.184, 'epoch': 3.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.4601769911504426e-05, 'epoch': 3.54}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3529, 'eval_samples_per_second': 283.348, 'eval_steps_per_second': 36.835, 'epoch': 4.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.752212389380531e-06, 'epoch': 4.42}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3147, 'eval_samples_per_second': 317.753, 'eval_steps_per_second': 41.308, 'epoch': 5.0}
{'train_runtime': 149.6166, 'train_samples_per_second': 30.077, 'train_steps_per_second': 3.776, 'train_loss': 0.0, 'epoch': 5.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 565/565 [02:29<00:00, 3.78it/s]
Device set to use cuda:0
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] function: 'compiled_mlp' (/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/modernbert/modeling_modernbert.py:552)
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] last reason: ___check_global_state()
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
[{'label': 'business-and-industrial', 'score': nan}]
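my guess, not verified: since eval_loss and the final pipeline score are both nan, the logits coming out of the model are probably nan as well. np.argmax over an all-nan row just returns index 0, so every eval prediction collapses to class 0 and the weighted f1 sits at that tiny constant. tiny illustration of that argmax behaviour (my own snippet, nothing to do with the training run):

import numpy as np
# illustration only: argmax over rows of nan always picks index 0,
# so nan logits would turn every "prediction" into class 0
logits = np.full((3, 17), np.nan)  # 17 is just a placeholder class count
print(np.argmax(logits, axis=1))   # -> [0 0 0]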
full code:
from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from datasets.dataset_dict import DatasetDict, IterableDatasetDict
from datasets.iterable_dataset import IterableDataset
import os
import torch
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
torch.set_float32_matmul_precision('high')
# Dataset id from huggingface.co/dataset
dataset_id = "argilla/synthetic-domain-text-classification"
# Load raw dataset
train_dataset = load_dataset(dataset_id, split='train')
split_dataset = train_dataset.train_test_split(test_size=0.1)
split_dataset['train'][0]
# {'text': 'Recently, there has been an increase in property values within the suburban areas of several cities due to improvements in infrastructure and lifestyle amenities such as parks, retail stores, and educational institutions nearby. Additionally, new housing developments are emerging, catering to different family needs with varying sizes and price ranges. These changes have influenced investment decisions for many looking to buy or sell properties.', 'label': 14}
from transformers import AutoTokenizer
# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Tokenize helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, return_tensors="pt")
# Tokenize dataset
if "label" in split_dataset["train"].features.keys():
split_dataset = split_dataset.rename_column("label", "labels") # to match Trainer
tokenized_dataset = split_dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized_dataset["train"].features.keys()
# dict_keys(['labels', 'input_ids', 'attention_mask'])
from transformers import AutoModelForSequenceClassification
# Model id to load the model
model_id = "answerdotai/ModernBERT-base"
# Prepare model labels - useful for inference
labels = tokenized_dataset["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label,
)
import numpy as np
from sklearn.metrics import f1_score
# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    score = f1_score(
        labels, predictions, labels=labels, pos_label=1, average="weighted"
    )
    return {"f1": float(score) if score == 1 else score}
from huggingface_hub import HfFolder
from transformers import Trainer, TrainingArguments
# Define training args
training_args = TrainingArguments(
    output_dir="ModernBERT-domain-classifier",
    per_device_train_batch_size=8,  # 32,
    per_device_eval_batch_size=8,  # 16,
    learning_rate=5e-5,
    num_train_epochs=5,
    bf16=True,  # bfloat16 training
    optim="adamw_torch_fused",  # improved optimizer
    # logging & evaluation strategies
    logging_strategy="steps",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # use_mps_device=True,
    metric_for_best_model="f1",
    # push to hub parameters
    push_to_hub=False,
    hub_strategy="every_save",
    hub_token=HfFolder.get_token(),
)
# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
# {'train_runtime': 3642.7783, 'train_samples_per_second': 1.235, 'train_steps_per_second': 0.04, 'train_loss': 0.535627057634551, 'epoch': 5.0}
from transformers import pipeline
model_save_path = "ModernBERT-domain-classifier-save"
trainer.save_model(model_save_path)
# Save processor and create model card
tokenizer.save_pretrained(model_save_path)
# load model from huggingface.co/models using our repository id
classifier = pipeline(
    task="text-classification",
    model=model_save_path,
    device=0,
)
sample = "Smoking is bad for your health."
print(classifier(sample))
# [{'label': 'health', 'score': 0.6779336333274841}]
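and a minimal check i'd run next to narrow it down (sketch only, not verified; `reference_compile=False` and `attn_implementation="eager"` are assumptions on my side to rule out ModernBERT's compiled MLP path and flash-attention kernels): load the model in plain fp32, run a single forward pass, and see whether the logits are already nan before any bf16 training happens.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "answerdotai/ModernBERT-base"
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=17,                # placeholder; i'd reuse len(labels) from the script above
    reference_compile=False,      # assumption: skip the torch.compile'd MLP path
    attn_implementation="eager",  # assumption: avoid flash-attention kernels
    torch_dtype=torch.float32,    # plain fp32, no bf16
)
batch = tok("Smoking is bad for your health.", return_tensors="pt")
with torch.no_grad():
    out = mdl(**batch)
print(torch.isnan(out.logits).any())  # True would mean the forward pass itself produces nan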
How to get the probability score from Llama-Guard · 5 · 6 · #16 opened 12 months ago by ctdfuji
export fine tuned model to onnx · 1 · #76 opened 10 months ago by Gerald001
classification probability · 4 · #14 opened 10 months ago by Gerald001
Onnx model doesn't produce embeddings close enough to SentenceTransformer version · 6 · #67 opened 11 months ago by luciancap001
Upload ONNX weights exported via optimum with `library='sentence-transformers'` · 2 · #64 opened 12 months ago by Xenova
Adding ONNX file of this model · 1 · 1 · #40 opened over 1 year ago by TDK2434
Upload model.onnx · 2 · 1 · #19 opened almost 2 years ago by tkelmATlegends
GPU requirements · 10 · #29 opened about 1 year ago by Gerald001
Update generation_config.json · 35 · 14 · #4 opened about 1 year ago by abhi-db
how to output an answer without side chatter · 1 · 8 · #36 opened about 1 year ago by Gerald001
`meta-llama/Meta-Llama-3-8B-Instruct` model with sagemaker · 1 · #38 opened about 1 year ago by aak7912
Update tokenizer_config.json to prepend the bos token · 5 · 7 · #35 opened about 1 year ago by eduagarcia
models for inf2. · 5 · #33 opened about 1 year ago by AC2132