Gerald Stanje (Gerald001)

AI & ML interests: None yet

Organizations: Cisco, Hugging Face Discord Community

Gerald001's activity

New activity in microsoft/bitnet-b1.58-2B-4T 27 days ago
New activity in answerdotai/ModernBERT-base about 2 months ago

Fine-tune model and convert to ONNX

4
#77 opened about 2 months ago by Gerald001

Hi @davidberenstein1957, this code doesn't seem to work with transformers 4.49.0. Any idea? I see eval_f1 is 0.007867705980913528...

I get this output:

python3 train4.py
Parameter 'function'=<function tokenize at 0x7fec4c3b6b90> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 2251.77 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 2668.59 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.115044247787611e-05, 'epoch': 0.88}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 11.5539, 'eval_samples_per_second': 8.655, 'eval_steps_per_second': 1.125, 'epoch': 1.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.230088495575221e-05, 'epoch': 1.77}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3503, 'eval_samples_per_second': 285.465, 'eval_steps_per_second': 37.11, 'epoch': 2.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.345132743362832e-05, 'epoch': 2.65}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3496, 'eval_samples_per_second': 286.027, 'eval_steps_per_second': 37.184, 'epoch': 3.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.4601769911504426e-05, 'epoch': 3.54}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3529, 'eval_samples_per_second': 283.348, 'eval_steps_per_second': 36.835, 'epoch': 4.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.752212389380531e-06, 'epoch': 4.42}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3147, 'eval_samples_per_second': 317.753, 'eval_steps_per_second': 41.308, 'epoch': 5.0}
{'train_runtime': 149.6166, 'train_samples_per_second': 30.077, 'train_steps_per_second': 3.776, 'train_loss': 0.0, 'epoch': 5.0}                  
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 565/565 [02:29<00:00,  3.78it/s]
Device set to use cuda:0
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING]    function: 'compiled_mlp' (/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/modernbert/modeling_modernbert.py:552)
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING]    last reason: ___check_global_state()
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
[{'label': 'business-and-industrial', 'score': nan}]
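
Since loss is 0.0 and grad_norm is nan from the very first logged step, my guess is that the forward pass already produces non-finite logits under bf16. A quick check I'd try (just a sketch, assuming a CUDA device and the same base model; not part of the script below) is one forward pass with and without bf16 autocast:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2).to("cuda")

batch = tokenizer(["a short test sentence"], return_tensors="pt").to("cuda")

# forward pass under bf16 autocast, roughly what Trainer does with bf16=True
with torch.autocast("cuda", dtype=torch.bfloat16):
    bf16_logits = model(**batch).logits

# same forward pass in plain fp32
fp32_logits = model(**batch).logits

print("bf16 logits finite:", torch.isfinite(bf16_logits).all().item())
print("fp32 logits finite:", torch.isfinite(fp32_logits).all().item())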

full code:

from datasets import load_dataset
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"
# UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
torch.set_float32_matmul_precision('high')

# Dataset id from huggingface.co/dataset
dataset_id = "argilla/synthetic-domain-text-classification"
 
# Load raw dataset
train_dataset = load_dataset(dataset_id, split='train')

split_dataset = train_dataset.train_test_split(test_size=0.1)
split_dataset['train'][0]
# {'text': 'Recently, there has been an increase in property values within the suburban areas of several cities due to improvements in infrastructure and lifestyle amenities such as parks, retail stores, and educational institutions nearby. Additionally, new housing developments are emerging, catering to different family needs with varying sizes and price ranges. These changes have influenced investment decisions for many looking to buy or sell properties.', 'label': 14}
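
# Added inspection step (not in the original recipe): confirm the label column is a
# ClassLabel and see how many classes it declares; num_labels below is derived from it.
print(split_dataset["train"].features["label"])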

from transformers import AutoTokenizer
 
# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
 
# Tokenize helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, return_tensors="pt")
 
# Tokenize dataset
if "label" in split_dataset["train"].features.keys():
    split_dataset =  split_dataset.rename_column("label", "labels") # to match Trainer
tokenized_dataset = split_dataset.map(tokenize, batched=True, remove_columns=["text"])
 
tokenized_dataset["train"].features.keys()
# dict_keys(['labels', 'input_ids', 'attention_mask'])
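
# Added sanity check (not in the original recipe): every label should be an int in
# [0, num_labels); out-of-range or missing labels are a common cause of loss=0.0
# combined with grad_norm=nan.
print("unique training labels:", sorted(set(tokenized_dataset["train"]["labels"])))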

from transformers import AutoModelForSequenceClassification
 
# Model id to load the model
model_id = "answerdotai/ModernBERT-base"
 
# Prepare model labels - useful for inference
labels = tokenized_dataset["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
 
# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label,
)

import numpy as np
from sklearn.metrics import f1_score
 
# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    score = f1_score(
            labels, predictions, labels=labels, pos_label=1, average="weighted"
        )
    return {"f1": float(score) if score == 1 else score}

from huggingface_hub import HfFolder
from transformers import Trainer, TrainingArguments
 
# Define training args
training_args = TrainingArguments(
    output_dir="ModernBERT-domain-classifier",
    per_device_train_batch_size=8,  # reduced from 32
    per_device_eval_batch_size=8,   # reduced from 16
    learning_rate=5e-5,
    num_train_epochs=5,
    bf16=True, # bfloat16 training 
    optim="adamw_torch_fused", # improved optimizer 
    # logging & evaluation strategies
    logging_strategy="steps",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    #use_mps_device=True,
    metric_for_best_model="f1",
    # push to hub parameters
    push_to_hub=False,
    hub_strategy="every_save",
    hub_token=HfFolder.get_token(),
)
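
# Debugging variant (my own addition, not the original recipe): to test whether the
# nans come from bf16 and/or the fused optimizer, these args could be swapped in for
# training_args above to run one epoch in plain fp32 with the stock optimizer.
debug_args = TrainingArguments(
    output_dir="ModernBERT-domain-classifier-debug",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=1,
    bf16=False,               # full fp32
    optim="adamw_torch",      # non-fused optimizer
    logging_steps=10,
    eval_strategy="epoch",
    report_to="none",
)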
 
# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
# {'train_runtime': 3642.7783, 'train_samples_per_second': 1.235, 'train_steps_per_second': 0.04, 'train_loss': 0.535627057634551, 'epoch': 5.0}

from transformers import pipeline
 
model_save_path = "ModernBERT-domain-classifier-save"
trainer.save_model(model_save_path)
# Save the tokenizer alongside the model
tokenizer.save_pretrained(model_save_path)

# Load the fine-tuned model from the local save path
classifier = pipeline(
    task="text-classification", 
    model=model_save_path, 
    device=0,
)
 
sample = "Smoking is bad for your health."

print(classifier(sample))
# [{'label': 'health', 'score': 0.6779336333274841}]
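
# Post-mortem check (added, hypothetical): if training produced nan gradients, the
# saved classifier head itself contains non-finite weights, which would also explain
# the nan score from the pipeline above.
reloaded = AutoModelForSequenceClassification.from_pretrained(model_save_path)
for name, param in reloaded.named_parameters():
    if not torch.isfinite(param).all():
        print("non-finite parameter:", name)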
New activity in google-bert/bert-base-uncased 9 months ago
New activity in meta-llama/Llama-Guard-3-8B 9 months ago

classification probability

4
#14 opened 10 months ago by Gerald001
New activity in sentence-transformers/all-MiniLM-L6-v2 about 1 year ago

Adding ONNX file of this model

1
1
#40 opened over 1 year ago by TDK2434

Upload model.onnx

2
1
#19 opened almost 2 years ago by tkelmATlegends
New activity in aws-neuron/optimum-neuron-cache about 1 year ago

models for inf2.

5
#33 opened about 1 year ago by AC2132