SentenceTransformer based on yahyaabd/allstats-search-mini-v1-1-mnrl

This is a sentence-transformers model finetuned from yahyaabd/allstats-search-mini-v1-1-mnrl on the bps-pub-cosine-pairs dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: yahyaabd/allstats-search-mini-v1-1-mnrl
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: bps-pub-cosine-pairs

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
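
The Pooling module above mean-pools the BERT token embeddings into a single 384-dimensional sentence vector. As an illustration only (the real model handles this internally via SentenceTransformer), the equivalent computation with plain transformers looks roughly like this:

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative only: SentenceTransformer wraps exactly this Transformer + mean pooling.
tokenizer = AutoTokenizer.from_pretrained("yahyaabd/allstats-search-mini-v2")
bert = AutoModel.from_pretrained("yahyaabd/allstats-search-mini-v2")

batch = tokenizer(["Nilai Tukar Nelayan"], padding=True, truncation=True,
                  max_length=128, return_tensors="pt")
with torch.no_grad():
    token_embeddings = bert(**batch).last_hidden_state   # (batch, seq_len, 384)

# Mean pooling: average the token vectors, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()     # (batch, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                          # torch.Size([1, 384])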

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-mini-v2")
# Run inference
sentences = [
    'q-786',
    'Angka Kematian Bayi oper P#rovinsi',
    'f3b02f2b6706e104ea9d5b74',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
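
Since the model is trained on query-to-publication-title relevance, a more typical use is ranking candidate titles against a search query. A short sketch (the query and candidate titles below are illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yahyaabd/allstats-search-mini-v2")

query = "nilai tukar nelayan"
titles = [
    "Laporan Bulanan Data Sosial Ekonomi Desember 2021",
    "Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013",
    "Statistik Nilai Tukar Nelayan 2022",  # illustrative title
]

# Encode the query and candidates separately, then rank by cosine similarity.
query_embedding = model.encode([query])
title_embeddings = model.encode(titles)
scores = model.similarity(query_embedding, title_embeddings)[0]  # shape: (3,)

for score, title in sorted(zip(scores.tolist(), titles), reverse=True):
    print(f"{score:.4f}  {title}")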

Evaluation

Metrics

Semantic Similarity

Metric           sts-dev   sts-test
pearson_cosine   0.9041    0.9069
spearman_cosine  0.8335    0.8381
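
Metrics of this form are produced by Sentence Transformers' EmbeddingSimilarityEvaluator, which correlates predicted cosine similarities with gold scores. A minimal sketch, reusing gold pairs from the dataset samples shown later in this card (a real evaluation would use the full dev/test splits):

from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstats-search-mini-v2")

# Tiny illustrative set; the reported numbers used the full splits.
queries = ["sosek desember 2021", "nilai tukar nelayan", "NILAI TUKAR NELAYAN"]
titles = [
    "Laporan Bulanan Data Sosial Ekonomi Desember 2021",
    "Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013",
    "Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013",
]
gold_scores = [0.9, 0.1, 0.1]

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=queries,
    sentences2=titles,
    scores=gold_scores,
    main_similarity=SimilarityFunction.COSINE,
    name="sts-test",
)
print(evaluator(model))  # includes sts-test_pearson_cosine and sts-test_spearman_cosine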

Training Details

Training Dataset

bps-pub-cosine-pairs

  • Dataset: bps-pub-cosine-pairs at 038a9de
  • Size: 64,260 training samples
  • Columns: query_id, query, corpus_id, title, and score
  • Approximate statistics based on the first 1000 samples:

    Column     Type    Min        Mean           Max
    query_id   string  4 tokens   5.18 tokens    6 tokens
    query      string  4 tokens   13.33 tokens   38 tokens
    corpus_id  string  7 tokens   17.38 tokens   22 tokens
    title      string  5 tokens   13.13 tokens   30 tokens
    score      float   0.1        0.56           0.9
  • Samples:

    query_id  query                corpus_id                 title                                                     score
    q-1599    Nilai Tukar Nelayan  0b0da8fc2b6af9329a6d9cfe  Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013  0.1
    q-1599    nilai tukar nelayan  0b0da8fc2b6af9329a6d9cfe  Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013  0.1
    q-1599    NILAI TUKAR NELAYAN  0b0da8fc2b6af9329a6d9cfe  Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013  0.1
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
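
In effect, CosineSimilarityLoss embeds the query and the title, takes the cosine similarity of the two vectors, and regresses it onto the gold score with MSE. A self-contained illustration of the objective, with random tensors standing in for the two embedding batches:

import torch
import torch.nn.functional as F

# Stand-ins for a batch of 64 query and title embeddings (384-dim each).
query_emb = torch.randn(64, 384)
title_emb = torch.randn(64, 384)
gold = torch.rand(64) * 0.8 + 0.1   # gold scores lie in [0.1, 0.9]

predicted = F.cosine_similarity(query_emb, title_emb)  # shape: (64,)
loss = F.mse_loss(predicted, gold)                     # the MSELoss named above
print(loss.item())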
    

Evaluation Dataset

bps-pub-cosine-pairs

  • Dataset: bps-pub-cosine-pairs at 038a9de
  • Size: 8,067 evaluation samples
  • Columns: query_id, query, corpus_id, title, and score
  • Approximate statistics based on the first 1000 samples:

    Column     Type    Min        Mean           Max
    query_id   string  4 tokens   5.2 tokens     6 tokens
    query      string  4 tokens   12.77 tokens   33 tokens
    corpus_id  string  13 tokens  17.25 tokens   23 tokens
    title      string  5 tokens   13.37 tokens   38 tokens
    score      float   0.1        0.57           0.9
  • Samples:

    query_id  query                corpus_id                 title                                              score
    q-1273    Sosek Desember 2021  b7890a143bc751d1d84dcf4a  Laporan Bulanan Data Sosial Ekonomi Desember 2021  0.9
    q-1273    sosek desember 2021  b7890a143bc751d1d84dcf4a  Laporan Bulanan Data Sosial Ekonomi Desember 2021  0.9
    q-1273    SOSEK DESEMBER 2021  b7890a143bc751d1d84dcf4a  Laporan Bulanan Data Sosial Ekonomi Desember 2021  0.9
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • learning_rate: 1e-05
  • num_train_epochs: 2
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.01
  • eval_on_start: True
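
For reference, a hedged sketch of how these values plug into the Sentence Transformers v3 training API. The dataset repo id, split names, and column handling are assumptions; CosineSimilarityLoss consumes two text columns plus a float score column, so the id columns are dropped:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl")

# Assumed repo id and split names for bps-pub-cosine-pairs.
dataset = load_dataset("yahyaabd/bps-pub-cosine-pairs", revision="038a9de")
train_ds = dataset["train"].remove_columns(["query_id", "corpus_id"])
eval_ds = dataset["eval"].remove_columns(["query_id", "corpus_id"])

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-search-mini-v2",
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=1e-5,
    num_train_epochs=2,
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    label_smoothing_factor=0.01,
    eval_on_start=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss=CosineSimilarityLoss(model),
)
trainer.train()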

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.01
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch    Step   Training Loss   Validation Loss   sts-dev_spearman_cosine   sts-test_spearman_cosine
0        0      -               0.3848            0.8288                    -
0.0995   100    0.2360          0.0950            0.8396                    -
0.1990   200    0.0655          0.0487            0.8452                    -
0.2985   300    0.0407          0.0342            0.8437                    -
0.3980   400    0.0309          0.0291            0.8427                    -
0.4975   500    0.0247          0.0253            0.8427                    -
0.5970   600    0.0211          0.0235            0.8427                    -
0.6965   700    0.0198          0.0224            0.8395                    -
0.7960   800    0.0168          0.0212            0.8405                    -
0.8955   900    0.0166          0.0206            0.8384                    -
0.9950   1000   0.0145          0.0195            0.8388                    -
1.0945   1100   0.0119          0.0193            0.8395                    -
1.1940   1200   0.0113          0.0190            0.8376                    -
1.2935   1300   0.0108          0.0189            0.8330                    -
1.3930   1400   0.0119          0.0180            0.8364                    -
1.4925   1500   0.0105          0.0184            0.8338                    -
1.5920   1600   0.0092          0.0180            0.8355                    -
1.6915   1700   0.0090          0.0182            0.8319                    -
1.7910   1800   0.0096          0.0178            0.8337                    -
1.8905   1900   0.0099          0.0178            0.8326                    -
1.9900   2000   0.0094          0.0178            0.8335                    -
-1       -1     -               -                 -                         0.8381
  • The saved checkpoint is the row with the best validation loss (load_best_model_at_end: True).

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}