SentenceTransformer based on yahyaabd/allstats-search-mini-v1-1-mnrl

This is a sentence-transformers model finetuned from yahyaabd/allstats-search-mini-v1-1-mnrl on the bps-sts-dataset-v1 dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: yahyaabd/allstats-search-mini-v1-1-mnrl
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: bps-sts-dataset-v1

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
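
The pooling layer above applies mean pooling (pooling_mode_mean_tokens: True): the sentence embedding is the attention-mask-weighted average of the BERT token embeddings. Purely as an illustration, the sketch below reproduces that step by hand with plain transformers; it assumes the checkpoint also loads as a regular BertModel (as the architecture printout indicates) and is not how you would normally use the model.

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "yahyaabd/allstats-search-mini-v1-1-mnrl-sts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
bert = AutoModel.from_pretrained(model_id)

encoded = tokenizer(["contoh kalimat"], padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
with torch.no_grad():
    token_embeddings = bert(**encoded).last_hidden_state   # (batch, seq_len, 384)

# Mean pooling: average the token vectors, ignoring padding positions.
mask = encoded["attention_mask"].unsqueeze(-1).float()     # (batch, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 384])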

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl-sts")
# Run inference
sentences = [
    'PDRB per kapita Provinsi Riau sangat dipengaruhi oleh harga minyak bumi dunia.',
    'The Riau Islands province is known for its beautiful beaches and marine tourism.',
    'Di wilayah perkotaan, angka kemiskinan pada Maret 2023 adalah 7,29%.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
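
The same embeddings also support semantic search, one of the use cases listed above. Below is a minimal sketch using util.semantic_search; the corpus and query strings are made up for illustration and are not from the training data.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl-sts")

# Hypothetical corpus; replace with your own documents.
corpus = [
    "Indeks Pembangunan Manusia menurut provinsi",
    "Statistik perdagangan luar negeri Indonesia",
    "Tingkat pengangguran terbuka menurut provinsi",
]
query = "data IPM per provinsi"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top 2 most similar corpus entries for the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))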

Evaluation

Metrics

Semantic Similarity

Metric           sts-dev  sts-test
pearson_cosine   0.8384   0.8686
spearman_cosine  0.8363   0.8631
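
These are the Pearson and Spearman correlations between the model's cosine similarities and the gold scores, the metrics that sentence-transformers' EmbeddingSimilarityEvaluator reports. A minimal sketch of reproducing such numbers; the three sentence pairs below are placeholders and should be replaced by the actual sts-dev split of bps-sts-dataset-v1.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl-sts")

# Placeholder pairs and scores; use the real evaluation split in practice.
sentences1 = ["kalimat pertama A", "kalimat pertama B", "kalimat pertama C"]
sentences2 = ["kalimat kedua A", "kalimat kedua B", "kalimat kedua C"]
scores = [1.0, 0.5, 0.0]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores, name="sts-dev")
results = evaluator(model)
print(results)  # includes sts-dev_pearson_cosine and sts-dev_spearman_cosine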

Training Details

Training Dataset

bps-sts-dataset-v1

  • Dataset: bps-sts-dataset-v1 at 5c8f96e
  • Size: 2,436 training samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 1000 samples:
              sentence1           sentence2           score
    type      string              string              float
    details   min: 6 tokens       min: 9 tokens       min: 0.0
              mean: 20.49 tokens  mean: 20.71 tokens  mean: 0.51
              max: 36 tokens      max: 45 tokens      max: 1.0
  • Samples:
    sentence1: bagaimana capaian Tujuan Pembangunan Berkelanjutan di Indonesia?
    sentence2: Laporan Pencapaian Indikator Tujuan Pembangunan Berkelanjutan (TPB/SDGs) Indonesia, Edisi 2024
    score:     0.8

    sentence1: Jumlah perpustakaan umum di Indonesia tahun 2022 sebanyak 170.000 unit.
    sentence2: Minat baca masyarakat Indonesia masih perlu ditingkatkan melalui berbagai program literasi.
    score:     0.4

    sentence1: Jumlah sekolah negeri jenjang SMP di Kota Bandar Lampung adalah 30 sekolah.
    sentence2: Laju deforestasi di Provinsi Kalimantan Tengah masih mengkhawatirkan.
    score:     0.0
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
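
CosineSimilarityLoss fits this column layout directly: it embeds sentence1 and sentence2, computes their cosine similarity, and regresses it against score with the MSELoss shown above. Below is a minimal sketch of loading the data and constructing the loss, assuming a hypothetical Hub repository id for bps-sts-dataset-v1 (the owner is not stated in this card).

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CosineSimilarityLoss

# Start from the base model named at the top of this card.
model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl")

# Hypothetical repo id; substitute the actual location of bps-sts-dataset-v1.
train_dataset = load_dataset("yahyaabd/bps-sts-dataset-v1", split="train", revision="5c8f96e")
print(train_dataset.column_names)  # expected: ['sentence1', 'sentence2', 'score']

# Cosine similarity of the two embeddings is regressed against the gold score via MSE.
loss = CosineSimilarityLoss(model)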
    

Evaluation Dataset

bps-sts-dataset-v1

  • Dataset: bps-sts-dataset-v1 at 5c8f96e
  • Size: 522 evaluation samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 522 samples:
              sentence1           sentence2           score
    type      string              string              float
    details   min: 9 tokens       min: 8 tokens       min: 0.0
              mean: 20.83 tokens  mean: 20.84 tokens  mean: 0.5
              max: 39 tokens      max: 44 tokens      max: 1.0
  • Samples:
    sentence1: Persentase desa yang memiliki fasilitas internet di Provinsi Y pada tahun 2021 adalah 85%.
    sentence2: Luas perkebunan kelapa sawit di Provinsi Y pada tahun 2021 adalah 500.000 hektar.
    score:     0.2

    sentence1: Kontribusi sektor UMKM terhadap PDRB Kota Malang pada tahun 2023 sebesar 60%.
    sentence2: Usaha Mikro, Kecil, dan Menengah menyumbang 60 persen terhadap total Produk Domestik Regional Bruto di kota pendidikan Malang pada tahun 2023.
    score:     1.0

    sentence1: Jumlah Industri Kecil dan Menengah (IKM) di Kabupaten Tegal, Jawa Tengah, bertambah 200 unit pada tahun 2024.
    sentence2: Di Tegal, sebuah kabupaten di Jateng, terjadi penambahan 200 unit IKM sepanjang tahun 2024.
    score:     1.0
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • learning_rate: 1e-05
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.01
  • eval_on_start: True
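
In sentence-transformers 3.x these values are passed through SentenceTransformerTrainingArguments, which mirrors transformers' TrainingArguments. The following is a minimal, illustrative training sketch combining them with the dataset and loss described above; the output directory and the tiny in-memory dataset are placeholders, and num_train_epochs comes from the full hyperparameter list below.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("yahyaabd/allstats-search-mini-v1-1-mnrl")

# Placeholder data with the expected column layout; use bps-sts-dataset-v1 in practice.
train_dataset = Dataset.from_dict({
    "sentence1": ["kalimat A", "kalimat B"],
    "sentence2": ["kalimat C", "kalimat D"],
    "score": [1.0, 0.0],
})
eval_dataset = train_dataset

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-search-mini-v1-1-mnrl-sts",  # illustrative path
    num_train_epochs=3,
    eval_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    label_smoothing_factor=0.01,
    eval_on_start=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=CosineSimilarityLoss(model),
)
trainer.train()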

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.01
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step  Training Loss  Validation Loss  sts-dev_spearman_cosine  sts-test_spearman_cosine
0       0     -              0.0588           0.7404                   -
0.1299  10    0.0535         0.0577           0.7454                   -
0.2597  20    0.046          0.0539           0.7614                   -
0.3896  30    0.0552         0.0497           0.7796                   -
0.5195  40    0.0442         0.0470           0.7947                   -
0.6494  50    0.0437         0.0450           0.8057                   -
0.7792  60    0.0425         0.0438           0.8123                   -
0.9091  70    0.0465         0.0423           0.8183                   -
1.0390  80    0.0384         0.0414           0.8223                   -
1.1688  90    0.0362         0.0405           0.8260                   -
1.2987  100   0.0309         0.0401           0.8276                   -
1.4286  110   0.0335         0.0397           0.8289                   -
1.5584  120   0.033          0.0394           0.8307                   -
1.6883  130   0.0272         0.0392           0.8317                   -
1.8182  140   0.032          0.0390           0.8324                   -
1.9481  150   0.0317         0.0387           0.8331                   -
2.0779  160   0.0322         0.0385           0.8338                   -
2.2078  170   0.0285         0.0383           0.8345                   -
2.3377  180   0.0299         0.0382           0.8349                   -
2.4675  190   0.0326         0.0381           0.8351                   -
2.5974  200   0.0258         0.0380           0.8356                   -
2.7273  210   0.0282         0.0379           0.8361                   -
2.8571  220   0.0286         0.0379           0.8363                   -
2.9870  230   0.025          0.0379           0.8363                   -
-1      -1    -              -                -                        0.8631
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.11.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}