SentenceTransformer based on neuralmind/bert-base-portuguese-cased

This is a sentence-transformers model finetuned from neuralmind/bert-base-portuguese-cased. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: neuralmind/bert-base-portuguese-cased
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
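
The Pooling module above averages BERT's token embeddings using the attention mask (pooling_mode_mean_tokens=True). For reference, here is a minimal sketch of the equivalent computation with plain transformers; it assumes the Hub repo exposes the underlying BertModel weights, as Sentence Transformers checkpoints normally do:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/bertimbau-embed-hard-neg")
bert = AutoModel.from_pretrained("wilsonmarciliojr/bertimbau-embed-hard-neg")

inputs = tokenizer(["Uma frase de exemplo."], padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = bert(**inputs).last_hidden_state  # (batch, seq_len, 768)

# Attention-masked mean pooling, matching the Pooling configuration above.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embedding.shape)  # torch.Size([1, 768])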

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("wilsonmarciliojr/bertimbau-embed-hard-neg")
# Run inference
sentences = [
    'O 1.º troféu disputado em Portugal foi ganho pelo Sporting e o Sporting é líder do campeonato com o FC Porto .',
    'O primeiro troféu que se disputou em Portugal foi ganho pelo Sporting.',
    'Alexandre Pato recebeu em posição legal, fez o gol, mas o impedimento foi marcado.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
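
The same embeddings can also rank a small corpus against a query. A short sketch using util.semantic_search, reusing sentences from the example above as an illustrative corpus (the query string is made up for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("wilsonmarciliojr/bertimbau-embed-hard-neg")
corpus = [
    "O primeiro troféu que se disputou em Portugal foi ganho pelo Sporting.",
    "Alexandre Pato recebeu em posição legal, fez o gol, mas o impedimento foi marcado.",
]
query = "Quem ganhou o primeiro troféu disputado em Portugal?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")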

Evaluation

Metrics

Semantic Similarity

Metric           sts-dev  sts-test
pearson_cosine   0.797    0.756
spearman_cosine  0.7938   0.7401
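
The metric names above match the output of Sentence Transformers' EmbeddingSimilarityEvaluator, which embeds sentence pairs and correlates their cosine similarities with gold scores. A minimal sketch of such an evaluation; the pairs and gold scores below are placeholders, since the actual STS dev/test splits are not included in this card:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("wilsonmarciliojr/bertimbau-embed-hard-neg")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["O gato dorme no sofá.", "Ele comprou um carro novo."],
    sentences2=["Um gato está dormindo no sofá.", "O tempo estava chuvoso."],
    scores=[0.95, 0.10],  # placeholder gold similarities in [0, 1]
    name="sts-dev",
)
print(evaluator(model))  # includes pearson_cosine and spearman_cosine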

Training Details

Training Dataset

Unnamed Dataset

  • Size: 26,156 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
                 anchor    positive  negative
    type         string    string    string
    min tokens   13        10        9
    mean tokens  24.92     18.61     18.79
    max tokens   43        33        39
  • Samples (the three example triples share the same anchor and positive):
    anchor:   Quatro jovens foram assassinados na madrugada de hoje (19) em Carapicuíba, município da região metropolitana de São Paulo.
    positive: Quatro jovens foram assassinados em Carapicuíba.
    negatives:
      • O enterro ocorreu no Cemitério Municipal de Carapicuíba.
      • Esta madrugada (14) foi coroada a nova Miss EUA.
      • Há alguns de focos de incêndio na Região Metropolitana de Manaus.
  • Loss: MultipleNegativesRankingLoss (see the sketch below) with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
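
MultipleNegativesRankingLoss treats each anchor's positive as the correct match and uses every other positive and negative in the batch as additional negatives; scale=20.0 multiplies the cosine similarities before the softmax. A sketch of how the dataset and loss fit together in the Sentence Transformers v3+ training API, using one of the sample triples above as the single example row:

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim

model = SentenceTransformer("neuralmind/bert-base-portuguese-cased")

# Triplet dataset with the column names described above
train_dataset = Dataset.from_dict({
    "anchor": ["Quatro jovens foram assassinados na madrugada de hoje (19) em Carapicuíba, município da região metropolitana de São Paulo."],
    "positive": ["Quatro jovens foram assassinados em Carapicuíba."],
    "negative": ["O enterro ocorreu no Cemitério Municipal de Carapicuíba."],
})

loss = MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=cos_sim)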
    

Evaluation Dataset

Unnamed Dataset

  • Size: 5,520 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
                 anchor    positive  negative
    type         string    string    string
    min tokens   15        9         9
    mean tokens  25.03     20.03     19.4
    max tokens   47        40        40
  • Samples (the three example triples share the same anchor and positive):
    anchor:   Um novo rumor direto da Coréia do Sul nos dá uma ideia do material que será usado no próximo Galaxy S7, que será anunciado oficialmente em janeiro de 2016.
    positive: O novo Galaxy S7 deverá ser anunciado oficialmente em janeiro de 2016.
    negatives:
      • Comparado com o Galaxy S6 da Samsung, a diferença na bateria é muito grande.
      • Teremos um smartphone criado pela grande empresa de refrigerante Pepsi.
      • Recorde-se que a irmã de Kim Kardashian e o companheiro se separaram no passado mês de julho, depois de nove anos juntos.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 220
  • per_device_eval_batch_size: 220
  • num_train_epochs: 5
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates
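
These settings map directly onto SentenceTransformerTrainingArguments. A sketch of the full training call under the hyperparameters above; output_dir is an assumed placeholder, and model, train_dataset, and loss are as defined under Training Dataset, with eval_dataset assumed to be built the same way from the evaluation triples:

from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="bertimbau-embed-hard-neg",  # assumed output path
    num_train_epochs=5,
    per_device_train_batch_size=220,
    per_device_eval_batch_size=220,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # no duplicate samples within a batch
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # assumed: evaluation triples in the same format
    loss=loss,
)
trainer.train()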

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 220
  • per_device_eval_batch_size: 220
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss sts-dev_spearman_cosine sts-test_spearman_cosine
-1 -1 - - 0.6424 -
0.0840 10 - 0.1726 0.6642 -
0.1681 20 - 0.0523 0.7141 -
0.2521 30 - 0.0242 0.7580 -
0.3361 40 - 0.0160 0.7759 -
0.4202 50 - 0.0100 0.7848 -
0.5042 60 - 0.0069 0.7893 -
0.5882 70 - 0.0059 0.7904 -
0.6723 80 - 0.0059 0.7907 -
0.7563 90 - 0.0053 0.7908 -
0.8403 100 0.1681 0.0049 0.7921 -
0.9244 110 - 0.0049 0.7925 -
1.0084 120 - 0.0049 0.7929 -
1.0924 130 - 0.0050 0.7925 -
1.1765 140 - 0.0053 0.7922 -
1.2605 150 - 0.0052 0.7919 -
1.3445 160 - 0.0048 0.7922 -
1.4286 170 - 0.0046 0.7923 -
1.5126 180 - 0.0045 0.7928 -
1.5966 190 - 0.0045 0.7932 -
1.6807 200 0.0013 0.0047 0.7933 -
1.7647 210 - 0.0047 0.7929 -
1.8487 220 - 0.0047 0.7928 -
1.9328 230 - 0.0047 0.7928 -
2.0168 240 - 0.0046 0.7926 -
2.1008 250 - 0.0047 0.7927 -
2.1849 260 - 0.0047 0.7927 -
2.2689 270 - 0.0047 0.7929 -
2.3529 280 - 0.0045 0.7933 -
2.4370 290 - 0.0045 0.7934 -
2.5210 300 0.0007 0.0045 0.7932 -
2.6050 310 - 0.0045 0.7933 -
2.6891 320 - 0.0046 0.7932 -
2.7731 330 - 0.0046 0.7932 -
2.8571 340 - 0.0046 0.7933 -
2.9412 350 - 0.0047 0.7934 -
3.0252 360 - 0.0047 0.7934 -
3.1092 370 - 0.0046 0.7935 -
3.1933 380 - 0.0046 0.7936 -
3.2773 390 - 0.0047 0.7937 -
3.3613 400 0.0005 0.0046 0.7937 -
3.4454 410 - 0.0046 0.7937 -
3.5294 420 - 0.0046 0.7937 -
3.6134 430 - 0.0046 0.7937 -
3.6975 440 - 0.0046 0.7938 -
3.7815 450 - 0.0046 0.7938 -
3.8655 460 - 0.0047 0.7939 -
3.9496 470 - 0.0046 0.7940 -
4.0336 480 - 0.0046 0.7940 -
4.1176 490 - 0.0046 0.7940 -
4.2017 500 0.0005 0.0046 0.7940 -
4.2857 510 - 0.0046 0.7939 -
4.3697 520 - 0.0046 0.7938 -
4.4538 530 - 0.0046 0.7938 -
4.5378 540 - 0.0046 0.7938 -
4.6218 550 - 0.0046 0.7939 -
4.7059 560 - 0.0046 0.7939 -
4.7899 570 - 0.0046 0.7938 -
4.8739 580 - 0.0046 0.7938 -
4.9580 590 - 0.0046 0.7938 -
-1 -1 - - - 0.7401

Framework Versions

  • Python: 3.11.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.5.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}