tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:156
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
  - source_sentence: What was the typical context length accepted by most models last year?
    sentences:
      - >-
        Prompt injection is a natural consequence of this gulibility. I’ve seen
        precious little progress on tackling that problem in 2024, and we’ve
        been talking about it since September 2022.

        I’m beginning to see the most popular idea of “agents” as dependent on
        AGI itself. A model that’s robust against gulliblity is a very tall
        order indeed.

        Evals really matter

        Anthropic’s Amanda Askell (responsible for much of the work behind
        Claude’s Character):
      - >-
        Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased
        context lengths. Last year most models accepted 4,096 or 8,192 tokens,
        with the notable exception of Claude 2.1 which accepted 200,000. Today
        every serious provider has a 100,000+ token model, and Google’s Gemini
        series accepts up to 2 million.
      - >-
        Here’s the rest of the transcript. It’s bland and generic, but my phone
        can pitch bland and generic Christmas movies to Netflix now!

        LLM prices crashed, thanks to competition and increased efficiency

        The past twelve months have seen a dramatic collapse in the cost of
        running a prompt through the top tier hosted LLMs.

        In December 2023 (here’s the Internet Archive for the OpenAI pricing
        page) OpenAI were charging $30/million input tokens for GPT-4, $10/mTok
        for the then-new GPT-4 Turbo and $1/mTok for GPT-3.5 Turbo.
  - source_sentence: >-
      What challenges does the author face when trying to evaluate multiple
      LLMs?
    sentences:
      - >-
        We don’t yet know how to build GPT-4

        Frustratingly, despite the enormous leaps ahead we’ve had this year, we
        are yet to see an alternative model that’s better than GPT-4.

        OpenAI released GPT-4 in March, though it later turned out we had a
        sneak peak of it in February when Microsoft used it as part of the new
        Bing.

        This may well change in the next few weeks: Google’s Gemini Ultra has
        big claims, but isn’t yet available for us to try out.

        The team behind Mistral are working to beat GPT-4 as well, and their
        track record is already extremely strong considering their first public
        model only came out in September, and they’ve released two significant
        improvements since then.
      - >-
        I find I have to work with an LLM for a few weeks in order to get a good
        intuition for it’s strengths and weaknesses. This greatly limits how
        many I can evaluate myself!

        The most frustrating thing for me is at the level of individual
        prompting.

        Sometimes I’ll tweak a prompt and capitalize some of the words in it, to
        emphasize that I really want it to OUTPUT VALID MARKDOWN or similar. Did
        capitalizing those words make a difference? I still don’t have a good
        methodology for figuring that out.

        We’re left with what’s effectively Vibes Based Development. It’s vibes
        all the way down.

        I’d love to see us move beyond vibes in 2024!

        LLMs are really smart, and also really, really dumb
      - >-
        Except... you can run generated code to see if it’s correct. And with
        patterns like ChatGPT Code Interpreter the LLM can execute the code
        itself, process the error message, then rewrite it and keep trying until
        it works!

        So hallucination is a much lesser problem for code generation than for
        anything else. If only we had the equivalent of Code Interpreter for
        fact-checking natural language!

        How should we feel about this as software engineers?

        On the one hand, this feels like a threat: who needs a programmer if
        ChatGPT can write code for you?
  - source_sentence: >-
      What are some ways mentioned to run local, private large language models
      (LLMs) on personal devices?
    sentences:
      - >-
        A lot of people are excited about AI agents—an infuriatingly vague term
        that seems to be converging on “AI systems that can go away and act on
        your behalf”. We’ve been talking about them all year, but I’ve seen few
        if any examples of them running in production, despite lots of exciting
        prototypes.

        I think this is because of gullibility.

        Can we solve this? Honestly, I’m beginning to suspect that you can’t
        fully solve gullibility without achieving AGI. So it may be quite a
        while before those agent dreams can really start to come true!

        Code may be the best application

        Over the course of the year, it’s become increasingly clear that writing
        code is one of the things LLMs are most capable of.
      - >-
        I run a bunch of them on my laptop. I run Mistral 7B (a surprisingly
        great model) on my iPhone. You can install several different apps to get
        your own, local, completely private LLM. My own LLM project provides a
        CLI tool for running an array of different models via plugins.

        You can even run them entirely in your browser using WebAssembly and the
        latest Chrome!

        Hobbyists can build their own fine-tuned models

        I said earlier that building an LLM was still out of reach of hobbyists.
        That may be true for training from scratch, but fine-tuning one of those
        models is another matter entirely.
      - >-
        Prompt injection is a natural consequence of this gulibility. I’ve seen
        precious little progress on tackling that problem in 2024, and we’ve
        been talking about it since September 2022.

        I’m beginning to see the most popular idea of “agents” as dependent on
        AGI itself. A model that’s robust against gulliblity is a very tall
        order indeed.

        Evals really matter

        Anthropic’s Amanda Askell (responsible for much of the work behind
        Claude’s Character):
  - source_sentence: >-
      How has the value of prompt-driven app generation changed from 2023 to
      2024?
    sentences:
      - >-
        On paper, a 64GB Mac should be a great machine for running models due to
        the way the CPU and GPU can share the same memory. In practice, many
        models are released as model weights and libraries that reward NVIDIA’s
        CUDA over other platforms.

        The llama.cpp ecosystem helped a lot here, but the real breakthrough has
        been Apple’s MLX library, “an array framework for Apple Silicon”. It’s
        fantastic.

        Apple’s mlx-lm Python library supports running a wide range of
        MLX-compatible models on my Mac, with excellent performance.
        mlx-community on Hugging Face offers more than 1,000 models that have
        been converted to the necessary format.
      - >-
        The environmental impact got much, much worse

        The much bigger problem here is the enormous competitive buildout of the
        infrastructure that is imagined to be necessary for these models in the
        future.

        Companies like Google, Meta, Microsoft and Amazon are all spending
        billions of dollars rolling out new datacenters, with a very material
        impact on the electricity grid and the environment. There’s even talk of
        spinning up new nuclear power stations, but those can take decades.

        Is this infrastructure necessary? DeepSeek v3’s $6m training cost and
        the continued crash in LLM prices might hint that it’s not. But would
        you want to be the big tech executive that argued NOT to build out this
        infrastructure only to be proven wrong in a few years’ time?
      - >-
        These abilities are just a few weeks old at this point, and I don’t
        think their impact has been fully felt yet. If you haven’t tried them
        out yet you really should.

        Both Gemini and OpenAI offer API access to these features as well.
        OpenAI started with a WebSocket API that was quite challenging to use,
        but in December they announced a new WebRTC API which is much easier to
        get started with. Building a web app that a user can talk to via voice
        is easy now!

        Prompt driven app generation is a commodity already

        This was possible with GPT-4 in 2023, but the value it provides became
        evident in 2024.
  - source_sentence: >-
      What makes the prompt-driven custom interface feature powerful and easy to
      build despite the challenges of browser sandboxing?
    sentences:
      - >-
        This prompt-driven custom interface feature is so powerful and easy to
        build (once you’ve figured out the gnarly details of browser sandboxing)
        that I expect it to show up as a feature in a wide range of products in
        2025.

        Universal access to the best models lasted for just a few short months

        For a few short months this year all three of the best available
        models—GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro—were freely
        available to most of the world.
      - >-
        The environmental impact got much, much worse

        The much bigger problem here is the enormous competitive buildout of the
        infrastructure that is imagined to be necessary for these models in the
        future.

        Companies like Google, Meta, Microsoft and Amazon are all spending
        billions of dollars rolling out new datacenters, with a very material
        impact on the electricity grid and the environment. There’s even talk of
        spinning up new nuclear power stations, but those can take decades.

        Is this infrastructure necessary? DeepSeek v3’s $6m training cost and
        the continued crash in LLM prices might hint that it’s not. But would
        you want to be the big tech executive that argued NOT to build out this
        infrastructure only to be proven wrong in a few years’ time?
      - >-
        We don’t yet know how to build GPT-4

        Frustratingly, despite the enormous leaps ahead we’ve had this year, we
        are yet to see an alternative model that’s better than GPT-4.

        OpenAI released GPT-4 in March, though it later turned out we had a
        sneak peak of it in February when Microsoft used it as part of the new
        Bing.

        This may well change in the next few weeks: Google’s Gemini Ultra has
        big claims, but isn’t yet available for us to try out.

        The team behind Mistral are working to beat GPT-4 as well, and their
        track record is already extremely strong considering their first public
        model only came out in September, and they’ve released two significant
        improvements since then.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy@1
            value: 0.875
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 1
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 1
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.875
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.3333333333333333
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.20000000000000004
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.10000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.875
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 1
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 1
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.9538662191964322
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.9375
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.9375
            name: Cosine Map@100

SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-l. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-l
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
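The architecture above ends with CLS-token pooling followed by a Normalize layer. As a rough illustration of what those two stages do to the Transformer output, here is a minimal pure-Python sketch. The function names `cls_pool` and `l2_normalize` and the toy 4-dimensional token vectors are assumptions made for this example; they are not part of the library, and the real model works on 1024-dimensional hidden states.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Toy input: 3 tokens with 4-dimensional hidden states (hypothetical values).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] position
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because the output is unit length, the dot product of two such embeddings equals their cosine similarity, which matches the similarity function listed in the model description.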

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("manmah/legal-ft-2aefb51e-1a19-43c1-a5ff-7d28d65534da")
# Run inference
sentences = [
    'What makes the prompt-driven custom interface feature powerful and easy to build despite the challenges of browser sandboxing?',
    'This prompt-driven custom interface feature is so powerful and easy to build (once you’ve figured out the gnarly details of browser sandboxing) that I expect it to show up as a feature in a wide range of products in 2025.\nUniversal access to the best models lasted for just a few short months\nFor a few short months this year all three of the best available models—GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro—were freely available to most of the world.',
    'We don’t yet know how to build GPT-4\nFrustratingly, despite the enormous leaps ahead we’ve had this year, we are yet to see an alternative model that’s better than GPT-4.\nOpenAI released GPT-4 in March, though it later turned out we had a sneak peak of it in February when Microsoft used it as part of the new Bing.\nThis may well change in the next few weeks: Google’s Gemini Ultra has big claims, but isn’t yet available for us to try out.\nThe team behind Mistral are working to beat GPT-4 as well, and their track record is already extremely strong considering their first public model only came out in September, and they’ve released two significant improvements since then.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
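Since the model was trained with MatryoshkaLoss, the leading dimensions of each embedding are intended to remain useful on their own. The sketch below shows the general idea of truncating an embedding and re-normalizing it before comparison. The helper names and the toy 8-dimensional vectors are assumptions for illustration only; with real embeddings you would slice the output of `model.encode` instead. Recent Sentence Transformers releases also accept a `truncate_dim` argument when loading a model, but check the documentation for your installed version.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values, then rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0.0 else head


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dimensional stand-ins for real 1024-dimensional embeddings.
query = [0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0.0, 0.1]
passages = {
    "passage_a": [0.6, 0.3, 0.3, 0.1, 0.0, 0.2, 0.1, 0.0],
    "passage_b": [0.0, 0.1, 0.1, 0.6, 0.5, 0.0, 0.3, 0.2],
}

for dim in (8, 4):
    q = truncate_and_normalize(query, dim)
    scores = {
        name: dot(q, truncate_and_normalize(vec, dim))
        for name, vec in passages.items()
    }
    print(dim, scores)
```

Smaller dimensions trade some retrieval quality for less storage and faster search; how much quality is lost for this particular model is not reported in this card.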

Evaluation

Metrics

Information Retrieval

| Metric              | Value  |
|:--------------------|:-------|
| cosine_accuracy@1   | 0.875  |
| cosine_accuracy@3   | 1.0    |
| cosine_accuracy@5   | 1.0    |
| cosine_accuracy@10  | 1.0    |
| cosine_precision@1  | 0.875  |
| cosine_precision@3  | 0.3333 |
| cosine_precision@5  | 0.2    |
| cosine_precision@10 | 0.1    |
| cosine_recall@1     | 0.875  |
| cosine_recall@3     | 1.0    |
| cosine_recall@5     | 1.0    |
| cosine_recall@10    | 1.0    |
| cosine_ndcg@10      | 0.9539 |
| cosine_mrr@10       | 0.9375 |
| cosine_map@100      | 0.9375 |
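To make the relationship between these metrics concrete, the sketch below computes them from the rank of the relevant passage for each query, assuming exactly one relevant passage per query. The evaluation set is not described in this card, so the rank list used here (seven queries answered at rank 1 and one at rank 2) is a hypothetical example chosen because it reproduces the reported values; other rank lists with the same proportions would as well. The function name `retrieval_metrics` is likewise an assumption.

```python
import math


def retrieval_metrics(ranks, k):
    """Compute metrics at cutoff k.

    `ranks` holds the 1-based rank of the single relevant passage per query.
    """
    n = len(ranks)
    hits = [1 if r <= k else 0 for r in ranks]
    accuracy = sum(hits) / n
    # With one relevant passage, a hit contributes 1/k to precision.
    precision = sum(h / k for h in hits) / n
    # With one relevant passage, recall@k equals accuracy@k.
    recall = accuracy
    mrr = sum(1.0 / r for r in ranks if r <= k) / n
    # The ideal DCG is 1 when there is a single relevant passage.
    ndcg = sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "mrr": mrr,
        "ndcg": ndcg,
    }


# Hypothetical ranks consistent with the reported numbers.
ranks = [1, 1, 1, 1, 1, 1, 1, 2]

at_1 = retrieval_metrics(ranks, 1)
at_3 = retrieval_metrics(ranks, 3)
at_10 = retrieval_metrics(ranks, 10)
print(at_1["accuracy"], at_3["precision"], at_10["mrr"], at_10["ndcg"])
```

This also explains why precision@3 is 0.3333 even though accuracy@3 is 1.0: with a single relevant passage per query, at most one of the top three results can be relevant.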

Training Details

Training Dataset

Unnamed Dataset

  • Size: 156 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 156 samples:
    |         | sentence_0 | sentence_1 |
    |:--------|:-----------|:-----------|
    | type    | string     | string     |
    | details | min: 12 tokens, mean: 20.82 tokens, max: 32 tokens | min: 43 tokens, mean: 135.28 tokens, max: 214 tokens |
  • Samples:
    | sentence_0 | sentence_1 |
    |:-----------|:-----------|
    | What new feature does ChatGPT voice mode offer as of December? | The most recent twist, again from December (December was a lot) is live video. ChatGPT voice mode now provides the option to share your camera feed with the model and talk about what you can see in real time. Google Gemini have a preview of the same feature, which they managed to ship the day before ChatGPT did. |
    | Which company released a similar live video feature just before ChatGPT? | The most recent twist, again from December (December was a lot) is live video. ChatGPT voice mode now provides the option to share your camera feed with the model and talk about what you can see in real time. Google Gemini have a preview of the same feature, which they managed to ship the day before ChatGPT did. |
    | When did OpenAI make GPT-4o free for all users? | OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely available from its launch in June. This was a momentus change, because for the previous year free users had mostly been restricted to GPT-3.5 level models, meaning new users got a very inaccurate mental model of what a capable LLM could actually do. That era appears to have ended, likely permanently, with OpenAI’s launch of ChatGPT Pro. This $200/month subscription service is the only way to access their most capable model, o1 Pro. Since the trick behind the o1 series (and the future models it will undoubtedly inspire) is to expend more compute time to get better results, I don’t think those days of free access to the best available models are likely to return. |
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
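As a rough illustration of how these two losses combine, the sketch below implements the idea in pure Python: MultipleNegativesRankingLoss treats every other positive in the batch as a negative and applies cross-entropy over scaled cosine similarities, and MatryoshkaLoss repeats that loss on truncated embeddings and sums the weighted results. This is a simplified sketch, not the library's implementation. The function names, the toy 4-dimensional vectors, and the `scale=20.0` value are assumptions for this example; the real configuration uses the dimensions 768, 512, 256, 128 and 64 listed above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives loss: anchor i should score highest against positive i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)
        # Numerically stable log-sum-exp for the softmax denominator.
        log_z = top + math.log(sum(math.exp(v - top) for v in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss computed on truncated embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [a[:dim] for a in anchors]
        truncated_positives = [p[:dim] for p in positives]
        total += weight * mnr_loss(truncated_anchors, truncated_positives)
    return total


# Toy batch of two (question, passage) pairs with hypothetical embeddings.
anchors = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
positives = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]

matched = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
mismatched = matryoshka_loss(anchors, positives[::-1], dims=[4, 2], weights=[1, 1])
print(matched, mismatched)
```

Correctly paired embeddings should yield a much lower loss than swapped pairs, at every truncation length, which is what pushes the leading dimensions to carry most of the signal.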
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • num_train_epochs: 10
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

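| Epoch | Step | cosine_ndcg@10 |
|:------|:-----|:---------------|
| 1.0   | 16   | 0.9484         |
| 2.0   | 32   | 0.9539         |
| 3.0   | 48   | 0.9692         |
| 3.125 | 50   | 0.9846         |
| 4.0   | 64   | 0.9692         |
| 5.0   | 80   | 0.9692         |
| 6.0   | 96   | 0.9539         |
| 6.25  | 100  | 0.9385         |
| 7.0   | 112  | 0.9539         |
| 8.0   | 128  | 0.9539         |
| 9.0   | 144  | 0.9539         |
| 9.375 | 150  | 0.9539         |
| 10.0  | 160  | 0.9539         |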
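The step and epoch values in the log follow from the dataset size and batch size reported earlier. The short calculation below reproduces them, assuming a single training device, `gradient_accumulation_steps` of 1 and `dataloader_drop_last` set to False, as listed in the hyperparameters. The variable and function names are chosen for this example.

```python
import math

train_samples = 156   # from "Size: 156 training samples"
batch_size = 10       # per_device_train_batch_size
num_epochs = 10       # num_train_epochs

# The last, partial batch of 6 samples still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs


def epoch_at(step):
    """Fractional epoch reached after a given number of optimizer steps."""
    return step / steps_per_epoch


print(steps_per_epoch, total_steps)
for step in (50, 100, 150):
    print(step, epoch_at(step))
```

The fractional-epoch rows at steps 50, 100 and 150 are consistent with an additional evaluation every 50 steps on top of the per-epoch evaluations, although the evaluation interval itself is not stated in this card.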

Framework Versions

  • Python: 3.13.2
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.7.0
  • Accelerate: 1.6.0
  • Datasets: 3.5.1
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}