metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:46957
  - loss:TripletLoss
base_model: sentence-transformers/all-MiniLM-L6-v2
widget:
  - source_sentence: How to load documents?
    sentences:
      - >-
        MapCity contains the geometries that are displayed on the interactive
        map on the frontend.
      - >-
        The maps app contains State, Region, Province, Company, City,
        Particella, Map, MapCity, Property, Group, PropertyHasMap, Palette,
        PaletteColor, and Geotiff models.
      - >-
        Use the load_documents command which creates document file instances
        from folders in ./files/2-Database-solare path.
  - source_sentence: What is the MapCity model?
    sentences:
      - >-
        The document app contains Document, DocumentFile, Type, Language, Theme,
        Keyword, and Oss models used in the document consultation section.
      - >-
        Document contains all the document metadata such as name, author, year,
        type, language used in the document consultation section.
      - >-
        MapCity contains the geometries that are displayed on the interactive
        map on the frontend.
  - source_sentence: What is the cleantables command?
    sentences:
      - >-
        Takes care of eliminating all instances of Palette, Group, MapCity, Map,
        Province, and Property models.
      - >-
        Set CORS_ALLOWED_ORIGINS in the environment file with allowed origins
        like localhost,127.0.0.1,http://localhost:3000.
      - |-
        from matplotlib import pyplot as plt

        colors = ['Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r',]
        ax = res_union. plot(cmap=colors[random. randint(0, len(colors))])
        ax = res_union. plot(cmap='Greens_r')
        gdf1. plot(ax=ax, facecolor='none', edgecolor='k')
        gdf2. plot(ax=ax, facecolor='none', edgecolor='k')
        plt. savefig("overlay. png")
        ```.
  - source_sentence: How to restore a database dump?
    sentences:
      - >-
        Use the generategeotiff command which generates Geotiff instances from a
        shapefile. Run with python manage.py generategeotiff <path>.
      - >-
        Use the generategeotiff command which generates Geotiff instances from a
        shapefile. Run with python manage.py generategeotiff <path>.
      - >-
        Copy the dump file to data/postgresql folder, then inside the database
        container run pg_restore -U $POSTGRES_USER -d $POSTGRES_DB --clean
        --if-exists /var/lib/postgresql/data/db_backup.dump
  - source_sentence: What is the State model?
    sentences:
      - >-
        State contains the geometries of the states, in our specific case it
        contains only the entire geometries of the Italian state.
      - >-
        Use the load_documents command which creates document file instances
        from folders in ./files/2-Database-solare path.
      - >-
        The maps app contains State, Region, Province, Company, City,
        Particella, Map, MapCity, Property, Group, PropertyHasMap, Palette,
        PaletteColor, and Geotiff models.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
model-index:
  - name: SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: val triplet eval
          type: val-triplet-eval
        metrics:
          - type: cosine_accuracy
            value: 1
            name: Cosine Accuracy

SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
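
The Pooling and Normalize() modules listed above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper names `mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are assumptions made for illustration, while the real model works on 384-dimensional token embeddings produced by the BertModel.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping positions whose mask value is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    count = max(count, 1)  # guard against an input that is all padding
    return [x / count for x in total]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Toy input: three token vectors, the last one is padding.
tokens = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)        # mean of the two unmasked tokens
sentence_vec = l2_normalize(pooled)     # unit-length sentence embedding
```

Because the final vector has unit length, the dot product of two such embeddings equals their cosine similarity.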

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gabrielegabellone/all-mini-mediterraneo-triplets-v4")
# Run inference
sentences = [
    'What is the State model?',
    'State contains the geometries of the states, in our specific case it contains only the entire geometries of the Italian state.',
    'The maps app contains State, Region, Province, Company, City, Particella, Map, MapCity, Property, Group, PropertyHasMap, Palette, PaletteColor, and Geotiff models.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
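
For semantic search, the embeddings returned by `model.encode` would typically be ranked by cosine similarity against a query embedding. The sketch below shows that ranking step only, using hand-written toy vectors in place of real embeddings; `cosine` and `rank_passages` are hypothetical helper names, not part of the Sentence Transformers API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored]


# Toy 3-dimensional stand-ins for real 384-dimensional embeddings.
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]
order = rank_passages(query, passages)
print(order)
```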

Evaluation

Metrics

Triplet

| Metric          | Value |
|-----------------|-------|
| cosine_accuracy | 1.0   |
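
As a rough illustration of what cosine_accuracy measures, the sketch below assumes a triplet counts as correct when the anchor is more similar (by cosine) to the positive than to the negative. The function names and the toy triplets are hypothetical; the reported value comes from the library's own triplet evaluator, not from this code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return correct / len(triplets)


# Two toy triplets: the first is ranked correctly, the second is not.
toy_triplets = [
    ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),
]
accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)
```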

Training Details

Training Dataset

Unnamed Dataset

  • Size: 46,957 training samples
  • Columns: sentence_0, sentence_1, and sentence_2
  • Approximate statistics based on the first 1000 samples:
    |      | sentence_0 | sentence_1 | sentence_2 |
    |------|------------|------------|------------|
    | type | string | string | string |
    | min  | 7 tokens | 21 tokens | 21 tokens |
    | mean | 9.62 tokens | 35.4 tokens | 73.85 tokens |
    | max  | 15 tokens | 68 tokens | 239 tokens |
  • Samples:
    Sample 1
      sentence_0: How to restore a database dump?
      sentence_1: Copy the dump file to data/postgresql folder, then inside the database container run pg_restore -U $POSTGRES_USER -d $POSTGRES_DB --clean --if-exists /var/lib/postgresql/data/db_backup.dump
      sentence_2:
        filter(id__in=ids_dataframe1)
        ids_dataframe2 = df2. split(',')
        maps = Map. objects. filter(id__in=ids_dataframe2)
        if not provinces or not maps:
        return Response('Provinces or maps not found', status=status. HTTP_404_NOT_FOUND)
        ```

        2. Then we use the **Geodataframe.

    Sample 2
      sentence_0: What is the Region model?
      sentence_1: Region contains the geometries of the regions, in our specific case it only contains the geometries of the Italian regions.
      sentence_2:
        The command allows loading data into the project based on a compiled excel file.
        Allows loading data on scenarios, shapefiles, palettes, software, particles and companies.
        1. Run the command::

        bash<br> python manage. py flow<br>

        2. Choose the type of data to load:
        ```
        Executing consistency checks.
        Load scenarios data. (y/n): n
        Load softwares data.

    Sample 3
      sentence_0: What is the generategeotiff command?
      sentence_1: This command generates Geotiff instances from a shapefile. For each Property present in the shapefile, a Geotiff instance will be created.
      sentence_2:
        MINIO_ROOT_USER=minio12345
        MINIO_ROOT_PASSWORD=minio12345
        MINIO_ENDPOINT=minio:9000
        MINIO_EXTERNAL_ENDPOINT=localhost:9000 #CDN
        MINIO_USE_HTTPS=False
        MINIO_EXTERNAL_ENDPOINT_USE_HTTPS=False #true online

        PGADMIN_DEFAULT_EMAIL=admin@admin. com
        PGADMIN_DEFAULT_PASSWORD=strongpassword

        VERSION=1. 0. 0

        SHAPEFILE_VERSION=gadm41_ITA_
        ```.
  • Loss: TripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
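
The loss parameters above correspond to the usual triplet formulation, max(0, d(anchor, positive) - d(anchor, negative) + margin), with Euclidean distance and a margin of 5. The following is a minimal sketch of that formula for a single triplet; the function names and toy vectors are assumptions for illustration, and the library computes the same quantity in batches on tensors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Best case for unit-length vectors: positive equals the anchor,
# negative points in the opposite direction (distance 2).
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negative = [-1.0, 0.0]
best_case = triplet_loss(anchor, positive, negative)
print(best_case)
```

If the embeddings are unit-normalized during training, as the Normalize() module in the architecture suggests, the distance between any two of them is at most 2, so with a margin of 5 the per-triplet loss cannot fall below 3. This would be consistent with the Training Loss column levelling off around 3.49 rather than approaching zero.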
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 4
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
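
With lr_scheduler_type: linear, learning_rate: 5e-05, and no warmup, the learning rate would decay in a straight line from 5e-05 to 0 over the run. The sketch below is an assumption-laden illustration of that shape, not the Trainer's scheduler code; `linear_lr` is a hypothetical helper, and the total of 11740 steps is taken from the last row of the training logs.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 11740  # final step reported in the training logs
for step in (0, 5870, 11740):
    print(step, linear_lr(step, total_steps))
```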

Training Logs

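| Epoch  | Step  | Training Loss | val-triplet-eval_cosine_accuracy |
|--------|-------|---------------|----------------------------------|
| 0.1704 | 500   | 4.5287        | -                                |
| 0.3407 | 1000  | 4.1121        | 1.0                              |
| 0.5111 | 1500  | 3.7883        | -                                |
| 0.6814 | 2000  | 3.6668        | 1.0                              |
| 0.8518 | 2500  | 3.6262        | -                                |
| 1.0    | 2935  | -             | 1.0                              |
| 1.0221 | 3000  | 3.586         | 1.0                              |
| 1.1925 | 3500  | 3.5752        | -                                |
| 1.3629 | 4000  | 3.5576        | 1.0                              |
| 1.5332 | 4500  | 3.556         | -                                |
| 1.7036 | 5000  | 3.5389        | 1.0                              |
| 1.8739 | 5500  | 3.526         | -                                |
| 2.0    | 5870  | -             | 1.0                              |
| 2.0443 | 6000  | 3.5228        | 1.0                              |
| 2.2147 | 6500  | 3.5234        | -                                |
| 2.3850 | 7000  | 3.5122        | 1.0                              |
| 2.5554 | 7500  | 3.517         | -                                |
| 2.7257 | 8000  | 3.5056        | 1.0                              |
| 2.8961 | 8500  | 3.5103        | -                                |
| 3.0    | 8805  | -             | 1.0                              |
| 3.0664 | 9000  | 3.5071        | 1.0                              |
| 3.2368 | 9500  | 3.4977        | -                                |
| 3.4072 | 10000 | 3.4929        | 1.0                              |
| 3.5775 | 10500 | 3.4964        | -                                |
| 3.7479 | 11000 | 3.4914        | 1.0                              |
| 3.9182 | 11500 | 3.491         | -                                |
| 4.0    | 11740 | -             | 1.0                              |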
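
The Epoch and Step columns can be cross-checked against the dataset size and batch size reported earlier. This short calculation assumes a single training device, gradient_accumulation_steps of 1, and that the last partial batch is kept (dataloader_drop_last: False), all of which match the listed hyperparameters.

```python
import math

train_samples = 46957   # size of the training dataset
batch_size = 16         # per_device_train_batch_size
epochs = 4              # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
epoch_at_step_500 = round(500 / steps_per_epoch, 4)

print(steps_per_epoch, total_steps, epoch_at_step_500)
```

The results (2935 steps per epoch, 11740 steps in total, epoch 0.1704 at step 500) agree with the corresponding rows of the training logs.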

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 4.1.0
  • Transformers: 4.53.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.8.1
  • Datasets: 4.0.0
  • Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}