SentenceTransformer based on yahyaabd/allstats-search-mini-v1-1-mnrl

This is a sentence-transformers model finetuned from yahyaabd/allstats-search-mini-v1-1-mnrl on the bps-pub-cosine-pairs dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: yahyaabd/allstats-search-mini-v1-1-mnrl
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: bps-pub-cosine-pairs

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
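
The Pooling module above averages the BERT token embeddings (mean pooling) into a single 384-dimensional sentence vector. The following is a minimal sketch of that computation using the plain transformers API, assuming the repository's backbone weights load with AutoModel; it is not needed for normal use, since SentenceTransformer performs this step internally.

import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "yahyaabd/allstats-search-mini-v1-1-mnrl-v2"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
encoder = AutoModel.from_pretrained(repo_id)  # the BertModel backbone

batch = tokenizer(
    ["Berapa persentase rumah tangga dengan akses sanitasi layak?"],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 384)

# Mean pooling over non-padding tokens, matching pooling_mode_mean_tokens=True above
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embedding.shape)  # torch.Size([1, 384])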

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl-v2")
# Run inference
# Example inputs taken from the dataset columns (query_id, query, corpus_id)
sentences = [
    'q-4068',
    'Berapa persentase rumah tangga dengan akses sanitasi layak?',
    '43a5856225b1ff1cb95e319a',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
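
Beyond pairwise similarity, the model is intended for semantic search over BPS publication titles. The sketch below is an illustrative example only: the titles are sample rows from the dataset shown later in this card, not a full corpus, and retrieval uses the util.semantic_search helper from sentence_transformers.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl-v2")

# A toy corpus of publication titles taken from the dataset samples in this card
titles = [
    "Statistik Pertambangan Non Minyak dan Gas Bumi 2011-2015",
    "Keadaan Angkatan Kerja di Indonesia Februari 2021",
    "Laporan Bulanan Data Sosial Ekonomi Desember 2021",
]
query = "Bagaimana situasi angkatan kerja Indonesia di bulan Februari 2021?"

corpus_embeddings = model.encode(titles, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 titles by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(titles[hit["corpus_id"]], round(hit["score"], 4))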

Evaluation

Metrics

Semantic Similarity

Metric            sts-dev   sts-test
pearson_cosine    0.9259    0.9299
spearman_cosine   0.8465    0.8497
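
The scores above come from an embedding-similarity evaluation on held-out (query, title, score) pairs. Below is a minimal sketch of how such numbers can be computed with EmbeddingSimilarityEvaluator; the two example pairs are taken from the dataset samples later in this card, not the actual sts-dev/sts-test splits.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl-v2")

# Tiny illustrative pairs; the real splits come from bps-pub-cosine-pairs
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["Sosek Desember 2021", "Nilai Tukar Nelayan"],
    sentences2=[
        "Laporan Bulanan Data Sosial Ekonomi Desember 2021",
        "Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013",
    ],
    scores=[0.9, 0.1],
    name="sts-dev",
)
results = evaluator(model)
print(results)  # includes sts-dev_pearson_cosine and sts-dev_spearman_cosine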

Training Details

Training Dataset

bps-pub-cosine-pairs

  • Dataset: bps-pub-cosine-pairs at d58662e
  • Size: 8,082 training samples
  • Columns: query_id, query, corpus_id, title, and score
  • Approximate statistics based on the first 1000 samples:
    • query_id: string, min 4 / mean 5.21 / max 6 tokens
    • query: string, min 4 / mean 11.04 / max 30 tokens
    • corpus_id: string, min 4 / mean 17.4 / max 23 tokens
    • title: string, min 5 / mean 13.02 / max 43 tokens
    • score: float, min 0.1 / mean 0.55 / max 0.9
  • Samples:
    query_id | query | corpus_id | title | score
    q-1599 | Nilai Tukar Nelayan | 0b0da8fc2b6af9329a6d9cfe | Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013 | 0.1
    q-3595 | Berapa angka statistik pertambangan non migas Indonesia periode 2012? | 3c83610c3e2e5242177e2b11 | Statistik Pertambangan Non Minyak dan Gas Bumi 2011-2015 | 0.9
    q-9112 | Bagaimana situasi angkatan kerja Indonesia di bulan Februari 2021? | b547a5642aeb04d071cb83d4 | Keadaan Angkatan Kerja di Indonesia Februari 2021 | 0.9
  • Loss: CosineSimilarityLoss (a minimal usage sketch follows this list) with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
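
A minimal sketch of how this loss is set up is shown below. The two rows are sample pairs from this card; CosineSimilarityLoss computes the cosine similarity between the embeddings of the two texts and regresses it onto the float score using the MSELoss objective listed above.

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl")  # base model from this card

# Two illustrative (query, title, score) rows taken from the samples above
train_dataset = Dataset.from_dict({
    "query": [
        "Nilai Tukar Nelayan",
        "Berapa angka statistik pertambangan non migas Indonesia periode 2012?",
    ],
    "title": [
        "Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013",
        "Statistik Pertambangan Non Minyak dan Gas Bumi 2011-2015",
    ],
    "score": [0.1, 0.9],
})

loss = CosineSimilarityLoss(model)  # loss_fct defaults to torch.nn.MSELoss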
    

Evaluation Dataset

bps-pub-cosine-pairs

  • Dataset: bps-pub-cosine-pairs at d58662e
  • Size: 1,010 evaluation samples
  • Columns: query_id, query, corpus_id, title, and score
  • Approximate statistics based on the first 1000 samples:
    • query_id: string, min 4 / mean 5.22 / max 6 tokens
    • query: string, min 4 / mean 11.19 / max 31 tokens
    • corpus_id: string, min 7 / mean 17.25 / max 23 tokens
    • title: string, min 5 / mean 13.24 / max 44 tokens
    • score: float, min 0.1 / mean 0.56 / max 0.9
  • Samples:
    query_id | query | corpus_id | title | score
    q-1273 | Sosek Desember 2021 | b7890a143bc751d1d84dcf4a | Laporan Bulanan Data Sosial Ekonomi Desember 2021 | 0.9
    q-4882 | Ekspor Indonesia menurut SITC 2019-2020 | 9f3d9054c2f29bc478d56cd1 | Statistik Perdagangan Luar Negeri Indonesia Ekspor Menurut Kode SITC, 2019-2020 | 0.9
    q-7141 | Pengeluaran konsumsi penduduk Indonesia Maret 2018 | 4194e924ca33f087b68ab2de | Pengeluaran untuk Konsumsi Penduduk Indonesia, Maret 2018 | 0.9
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • learning_rate: 1e-05
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.01
  • eval_on_start: True
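
The non-default hyperparameters above map directly onto SentenceTransformerTrainingArguments. The sketch below shows that mapping; output_dir is a placeholder, and train_dataset, eval_dataset, and loss are assumed to be objects in the format of the CosineSimilarityLoss sketch in the Training Dataset section.

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl")  # base model

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-search-mini-v1-1-mnrl-v2",  # placeholder output directory
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    load_best_model_at_end=True,
    label_smoothing_factor=0.01,
    eval_on_start=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # (query, title, score) rows, as in the loss sketch above
    eval_dataset=eval_dataset,    # held-out split in the same format
    loss=loss,                    # CosineSimilarityLoss from the sketch above
)
trainer.train()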

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.01
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss sts-dev_spearman_cosine sts-test_spearman_cosine
0 0 - 0.3773 0.8467 -
0.0395 10 0.3676 0.3628 0.8469 -
0.0791 20 0.3166 0.3161 0.8474 -
0.1186 30 0.2743 0.2423 0.8483 -
0.1581 40 0.1933 0.1625 0.8494 -
0.1976 50 0.1473 0.1154 0.8507 -
0.2372 60 0.1046 0.1020 0.8514 -
0.2767 70 0.0839 0.0878 0.8519 -
0.3162 80 0.0839 0.0759 0.8519 -
0.3557 90 0.0756 0.0667 0.8521 -
0.3953 100 0.0611 0.0597 0.8522 -
0.4348 110 0.0562 0.0554 0.8520 -
0.4743 120 0.0557 0.0518 0.8517 -
0.5138 130 0.0411 0.0482 0.8514 -
0.5534 140 0.0481 0.0454 0.8510 -
0.5929 150 0.0474 0.0423 0.8500 -
0.6324 160 0.0433 0.0404 0.8498 -
0.6719 170 0.0389 0.0390 0.8502 -
0.7115 180 0.0423 0.0373 0.8503 -
0.7510 190 0.0348 0.0360 0.8495 -
0.7905 200 0.0404 0.0346 0.8492 -
0.8300 210 0.0285 0.0334 0.8494 -
0.8696 220 0.0322 0.0317 0.8482 -
0.9091 230 0.0311 0.0305 0.8469 -
0.9486 240 0.027 0.0298 0.8462 -
0.9881 250 0.03 0.0292 0.8462 -
1.0277 260 0.0245 0.0292 0.8458 -
1.0672 270 0.026 0.0290 0.8447 -
1.1067 280 0.0325 0.0279 0.8466 -
1.1462 290 0.0208 0.0274 0.8458 -
1.1858 300 0.0249 0.0271 0.8451 -
1.2253 310 0.026 0.0264 0.8444 -
1.2648 320 0.0234 0.0261 0.8469 -
1.3043 330 0.024 0.0267 0.8482 -
1.3439 340 0.0212 0.0254 0.8480 -
1.3834 350 0.033 0.0247 0.8473 -
1.4229 360 0.0246 0.0244 0.8473 -
1.4625 370 0.0241 0.0242 0.8477 -
1.5020 380 0.0187 0.0237 0.8473 -
1.5415 390 0.0228 0.0235 0.8474 -
1.5810 400 0.0169 0.0234 0.8475 -
1.6206 410 0.0249 0.0233 0.8470 -
1.6601 420 0.0223 0.0234 0.8475 -
1.6996 430 0.0174 0.0232 0.8477 -
1.7391 440 0.0249 0.0229 0.8480 -
1.7787 450 0.0243 0.0229 0.8483 -
1.8182 460 0.0203 0.0232 0.8485 -
1.8577 470 0.0198 0.0226 0.8477 -
1.8972 480 0.019 0.0223 0.8464 -
1.9368 490 0.0185 0.0218 0.8465 -
1.9763 500 0.0168 0.0218 0.8468 -
2.0158 510 0.019 0.0217 0.8472 -
2.0553 520 0.0194 0.0217 0.8476 -
2.0949 530 0.0192 0.0216 0.8475 -
2.1344 540 0.0175 0.0215 0.8473 -
2.1739 550 0.013 0.0214 0.8477 -
2.2134 560 0.017 0.0212 0.8478 -
2.2530 570 0.0157 0.0212 0.8478 -
2.2925 580 0.0169 0.0211 0.8473 -
2.3320 590 0.0192 0.0210 0.8475 -
2.3715 600 0.0116 0.0208 0.8472 -
2.4111 610 0.0151 0.0207 0.8473 -
2.4506 620 0.0182 0.0205 0.8472 -
2.4901 630 0.0143 0.0205 0.8471 -
2.5296 640 0.0193 0.0204 0.8470 -
2.5692 650 0.0194 0.0203 0.8469 -
2.6087 660 0.0132 0.0204 0.8469 -
2.6482 670 0.0208 0.0204 0.8464 -
2.6877 680 0.0155 0.0203 0.8461 -
2.7273 690 0.0142 0.0203 0.8461 -
2.7668 700 0.0162 0.0203 0.8460 -
2.8063 710 0.0198 0.0203 0.8461 -
2.8458 720 0.0138 0.0204 0.8465 -
2.8854 730 0.0145 0.0204 0.8465 -
2.9249 740 0.0129 0.0204 0.8466 -
2.9644 750 0.0108 0.0204 0.8465 -
-1 -1 - - - 0.8497
  • The saved checkpoint is the row with the lowest validation loss (per load_best_model_at_end: True).

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0
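
To reproduce this environment, the versions above can be pinned at install time (a hedged example; nearby versions will likely also work):

pip install sentence-transformers==3.4.0 transformers==4.48.1 torch==2.5.1 accelerate==1.3.0 datasets==3.2.0 tokenizers==0.21.0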

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}