metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:156
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: What was the typical context length accepted by most models last year?
sentences:
- >-
Prompt injection is a natural consequence of this gulibility. I’ve seen
precious little progress on tackling that problem in 2024, and we’ve
been talking about it since September 2022.
I’m beginning to see the most popular idea of “agents” as dependent on
AGI itself. A model that’s robust against gulliblity is a very tall
order indeed.
Evals really matter
Anthropic’s Amanda Askell (responsible for much of the work behind
Claude’s Character):
- >-
Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased
context lengths. Last year most models accepted 4,096 or 8,192 tokens,
with the notable exception of Claude 2.1 which accepted 200,000. Today
every serious provider has a 100,000+ token model, and Google’s Gemini
series accepts up to 2 million.
- >-
Here’s the rest of the transcript. It’s bland and generic, but my phone
can pitch bland and generic Christmas movies to Netflix now!
LLM prices crashed, thanks to competition and increased efficiency
The past twelve months have seen a dramatic collapse in the cost of
running a prompt through the top tier hosted LLMs.
In December 2023 (here’s the Internet Archive for the OpenAI pricing
page) OpenAI were charging $30/million input tokens for GPT-4, $10/mTok
for the then-new GPT-4 Turbo and $1/mTok for GPT-3.5 Turbo.
- source_sentence: >-
What challenges does the author face when trying to evaluate multiple
LLMs?
sentences:
- >-
We don’t yet know how to build GPT-4
Frustratingly, despite the enormous leaps ahead we’ve had this year, we
are yet to see an alternative model that’s better than GPT-4.
OpenAI released GPT-4 in March, though it later turned out we had a
sneak peak of it in February when Microsoft used it as part of the new
Bing.
This may well change in the next few weeks: Google’s Gemini Ultra has
big claims, but isn’t yet available for us to try out.
The team behind Mistral are working to beat GPT-4 as well, and their
track record is already extremely strong considering their first public
model only came out in September, and they’ve released two significant
improvements since then.
- >-
I find I have to work with an LLM for a few weeks in order to get a good
intuition for it’s strengths and weaknesses. This greatly limits how
many I can evaluate myself!
The most frustrating thing for me is at the level of individual
prompting.
Sometimes I’ll tweak a prompt and capitalize some of the words in it, to
emphasize that I really want it to OUTPUT VALID MARKDOWN or similar. Did
capitalizing those words make a difference? I still don’t have a good
methodology for figuring that out.
We’re left with what’s effectively Vibes Based Development. It’s vibes
all the way down.
I’d love to see us move beyond vibes in 2024!
LLMs are really smart, and also really, really dumb
- >-
Except... you can run generated code to see if it’s correct. And with
patterns like ChatGPT Code Interpreter the LLM can execute the code
itself, process the error message, then rewrite it and keep trying until
it works!
So hallucination is a much lesser problem for code generation than for
anything else. If only we had the equivalent of Code Interpreter for
fact-checking natural language!
How should we feel about this as software engineers?
On the one hand, this feels like a threat: who needs a programmer if
ChatGPT can write code for you?
- source_sentence: >-
What are some ways mentioned to run local, private large language models
(LLMs) on personal devices?
sentences:
- >-
A lot of people are excited about AI agents—an infuriatingly vague term
that seems to be converging on “AI systems that can go away and act on
your behalf”. We’ve been talking about them all year, but I’ve seen few
if any examples of them running in production, despite lots of exciting
prototypes.
I think this is because of gullibility.
Can we solve this? Honestly, I’m beginning to suspect that you can’t
fully solve gullibility without achieving AGI. So it may be quite a
while before those agent dreams can really start to come true!
Code may be the best application
Over the course of the year, it’s become increasingly clear that writing
code is one of the things LLMs are most capable of.
- >-
I run a bunch of them on my laptop. I run Mistral 7B (a surprisingly
great model) on my iPhone. You can install several different apps to get
your own, local, completely private LLM. My own LLM project provides a
CLI tool for running an array of different models via plugins.
You can even run them entirely in your browser using WebAssembly and the
latest Chrome!
Hobbyists can build their own fine-tuned models
I said earlier that building an LLM was still out of reach of hobbyists.
That may be true for training from scratch, but fine-tuning one of those
models is another matter entirely.
- >-
Prompt injection is a natural consequence of this gulibility. I’ve seen
precious little progress on tackling that problem in 2024, and we’ve
been talking about it since September 2022.
I’m beginning to see the most popular idea of “agents” as dependent on
AGI itself. A model that’s robust against gulliblity is a very tall
order indeed.
Evals really matter
Anthropic’s Amanda Askell (responsible for much of the work behind
Claude’s Character):
- source_sentence: >-
How has the value of prompt-driven app generation changed from 2023 to
2024?
sentences:
- >-
On paper, a 64GB Mac should be a great machine for running models due to
the way the CPU and GPU can share the same memory. In practice, many
models are released as model weights and libraries that reward NVIDIA’s
CUDA over other platforms.
The llama.cpp ecosystem helped a lot here, but the real breakthrough has
been Apple’s MLX library, “an array framework for Apple Silicon”. It’s
fantastic.
Apple’s mlx-lm Python library supports running a wide range of
MLX-compatible models on my Mac, with excellent performance.
mlx-community on Hugging Face offers more than 1,000 models that have
been converted to the necessary format.
- >-
The environmental impact got much, much worse
The much bigger problem here is the enormous competitive buildout of the
infrastructure that is imagined to be necessary for these models in the
future.
Companies like Google, Meta, Microsoft and Amazon are all spending
billions of dollars rolling out new datacenters, with a very material
impact on the electricity grid and the environment. There’s even talk of
spinning up new nuclear power stations, but those can take decades.
Is this infrastructure necessary? DeepSeek v3’s $6m training cost and
the continued crash in LLM prices might hint that it’s not. But would
you want to be the big tech executive that argued NOT to build out this
infrastructure only to be proven wrong in a few years’ time?
- >-
These abilities are just a few weeks old at this point, and I don’t
think their impact has been fully felt yet. If you haven’t tried them
out yet you really should.
Both Gemini and OpenAI offer API access to these features as well.
OpenAI started with a WebSocket API that was quite challenging to use,
but in December they announced a new WebRTC API which is much easier to
get started with. Building a web app that a user can talk to via voice
is easy now!
Prompt driven app generation is a commodity already
This was possible with GPT-4 in 2023, but the value it provides became
evident in 2024.
- source_sentence: >-
What makes the prompt-driven custom interface feature powerful and easy to
build despite the challenges of browser sandboxing?
sentences:
- >-
This prompt-driven custom interface feature is so powerful and easy to
build (once you’ve figured out the gnarly details of browser sandboxing)
that I expect it to show up as a feature in a wide range of products in
2025.
Universal access to the best models lasted for just a few short months
For a few short months this year all three of the best available
models—GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro—were freely
available to most of the world.
- >-
The environmental impact got much, much worse
The much bigger problem here is the enormous competitive buildout of the
infrastructure that is imagined to be necessary for these models in the
future.
Companies like Google, Meta, Microsoft and Amazon are all spending
billions of dollars rolling out new datacenters, with a very material
impact on the electricity grid and the environment. There’s even talk of
spinning up new nuclear power stations, but those can take decades.
Is this infrastructure necessary? DeepSeek v3’s $6m training cost and
the continued crash in LLM prices might hint that it’s not. But would
you want to be the big tech executive that argued NOT to build out this
infrastructure only to be proven wrong in a few years’ time?
- >-
We don’t yet know how to build GPT-4
Frustratingly, despite the enormous leaps ahead we’ve had this year, we
are yet to see an alternative model that’s better than GPT-4.
OpenAI released GPT-4 in March, though it later turned out we had a
sneak peak of it in February when Microsoft used it as part of the new
Bing.
This may well change in the next few weeks: Google’s Gemini Ultra has
big claims, but isn’t yet available for us to try out.
The team behind Mistral are working to beat GPT-4 as well, and their
track record is already extremely strong considering their first public
model only came out in September, and they’ve released two significant
improvements since then.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.875
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 1
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.875
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.3333333333333333
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000004
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000002
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.875
name: Cosine Recall@1
- type: cosine_recall@3
value: 1
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9538662191964322
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9375
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9375
name: Cosine Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-l. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-l
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
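The three modules above amount to: take the first ([CLS]) token's vector as the sentence embedding, then scale it to unit length so that a plain dot product equals cosine similarity. A minimal pure-Python sketch of that idea follows; the helper names and the 4-dimensional toy vectors are hypothetical stand-ins for illustration, not the library's internals or real model output.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical token-level outputs: 3 tokens x 4 dimensions
# (the real model produces 1024 dimensions per token).
tokens_a = [[3.0, 4.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0], [1.0, 0.0, 0.0, 0.0]]
tokens_b = [[4.0, 3.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# Because both embeddings have unit length, the dot product is the cosine similarity.
similarity = dot(emb_a, emb_b)
print(similarity)
```

Only the first token row influences the result here, which is the practical difference between CLS pooling and mean pooling.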
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("manmah/legal-ft-2aefb51e-1a19-43c1-a5ff-7d28d65534da")
# Run inference
sentences = [
'What makes the prompt-driven custom interface feature powerful and easy to build despite the challenges of browser sandboxing?',
'This prompt-driven custom interface feature is so powerful and easy to build (once you’ve figured out the gnarly details of browser sandboxing) that I expect it to show up as a feature in a wide range of products in 2025.\nUniversal access to the best models lasted for just a few short months\nFor a few short months this year all three of the best available models—GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro—were freely available to most of the world.',
'We don’t yet know how to build GPT-4\nFrustratingly, despite the enormous leaps ahead we’ve had this year, we are yet to see an alternative model that’s better than GPT-4.\nOpenAI released GPT-4 in March, though it later turned out we had a sneak peak of it in February when Microsoft used it as part of the new Bing.\nThis may well change in the next few weeks: Google’s Gemini Ultra has big claims, but isn’t yet available for us to try out.\nThe team behind Mistral are working to beat GPT-4 as well, and their track record is already extremely strong considering their first public model only came out in September, and they’ve released two significant improvements since then.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.875 |
cosine_accuracy@3 | 1.0 |
cosine_accuracy@5 | 1.0 |
cosine_accuracy@10 | 1.0 |
cosine_precision@1 | 0.875 |
cosine_precision@3 | 0.3333 |
cosine_precision@5 | 0.2 |
cosine_precision@10 | 0.1 |
cosine_recall@1 | 0.875 |
cosine_recall@3 | 1.0 |
cosine_recall@5 | 1.0 |
cosine_recall@10 | 1.0 |
cosine_ndcg@10 | 0.9539 |
cosine_mrr@10 | 0.9375 |
cosine_map@100 | 0.9375 |
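The values in this table are mutually consistent with a simple scenario: each query has exactly one relevant passage, 7 out of every 8 queries rank it first, and the rest rank it second. The sketch below recomputes the metrics under that assumption. The list of ranks is hypothetical (the card does not state the evaluation set size or per-query ranks), and the functions are simplified single-relevant-document versions, not the evaluator's own code.

```python
import math

# Hypothetical 1-based rank of the single relevant passage for 8 queries:
# seven are retrieved first, one is retrieved second.
ranks = [1, 1, 1, 1, 1, 1, 1, 2]


def accuracy_at_k(ranks, k):
    """Share of queries whose relevant passage appears in the top k.
    With one relevant passage per query this also equals recall@k."""
    return sum(r <= k for r in ranks) / len(ranks)


def precision_at_k(ranks, k):
    """With one relevant passage per query, at most 1 of the k slots is a hit."""
    return sum((1 if r <= k else 0) / k for r in ranks) / len(ranks)


def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting only hits inside the top k."""
    return sum(1 / r if r <= k else 0.0 for r in ranks) / len(ranks)


def ndcg_at_k(ranks, k):
    """The ideal DCG is 1 (relevant passage at rank 1), so NDCG equals DCG."""
    return sum(1 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)


metrics = {
    "accuracy@1": accuracy_at_k(ranks, 1),
    "accuracy@3": accuracy_at_k(ranks, 3),
    "precision@3": precision_at_k(ranks, 3),
    "precision@10": precision_at_k(ranks, 10),
    "mrr@10": mrr_at_k(ranks, 10),
    "ndcg@10": ndcg_at_k(ranks, 10),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

This also explains why precision@3 is 0.3333 even though accuracy@3 is 1.0: with a single relevant passage, precision@k cannot exceed 1/k.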
Training Details
Training Dataset
Unnamed Dataset
- Size: 156 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 156 samples:
| sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 12 tokens, mean: 20.82 tokens, max: 32 tokens | min: 43 tokens, mean: 135.28 tokens, max: 214 tokens |
- Samples:
sentence_0 | sentence_1 |
---|---|
What new feature does ChatGPT voice mode offer as of December? | The most recent twist, again from December (December was a lot) is live video. ChatGPT voice mode now provides the option to share your camera feed with the model and talk about what you can see in real time. Google Gemini have a preview of the same feature, which they managed to ship the day before ChatGPT did. |
Which company released a similar live video feature just before ChatGPT? | The most recent twist, again from December (December was a lot) is live video. ChatGPT voice mode now provides the option to share your camera feed with the model and talk about what you can see in real time. Google Gemini have a preview of the same feature, which they managed to ship the day before ChatGPT did. |
When did OpenAI make GPT-4o free for all users? | OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely available from its launch in June. This was a momentus change, because for the previous year free users had mostly been restricted to GPT-3.5 level models, meaning new users got a very inaccurate mental model of what a capable LLM could actually do. That era appears to have ended, likely permanently, with OpenAI’s launch of ChatGPT Pro. This $200/month subscription service is the only way to access their most capable model, o1 Pro. Since the trick behind the o1 series (and the future models it will undoubtedly inspire) is to expend more compute time to get better results, I don’t think those days of free access to the best available models are likely to return. |
- Loss: MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
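To make the loss configuration more concrete, here is a toy pure-Python sketch of how a Matryoshka-style objective can wrap an in-batch-negatives ranking loss: the base loss is evaluated on truncated prefixes of each embedding and the results are combined with the configured weights. The 8-dimensional vectors, the dims [8, 4, 2], and the function names are hypothetical illustration values, and scale=20.0 is assumed to match the library default; this is a sketch of the idea, not the sentence-transformers implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(queries, passages, scale=20.0):
    """In-batch negatives: passages[i] is the positive for queries[i];
    every other passage in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, p) for p in passages]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(queries)


def matryoshka_loss(queries, passages, dims, weights):
    """Weighted sum of the base loss over truncated embedding prefixes."""
    return sum(
        weight * mnr_loss([q[:dim] for q in queries], [p[:dim] for p in passages])
        for dim, weight in zip(dims, weights)
    )


# Hypothetical toy batch of two (question, passage) pairs.
queries = [
    [1.0, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 0.3],
    [0.0, 1.0, 0.0, 0.2, 0.0, 0.1, 0.3, 0.0],
]
passages = [
    [0.9, 0.1, 0.3, 0.0, 0.0, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.0, 0.3, 0.1, 0.0, 0.2, 0.0],
]

loss = matryoshka_loss(queries, passages, dims=[8, 4, 2], weights=[1, 1, 1])
print(loss)
```

Training on prefixes in this way is what allows the leading 768, 512, 256, 128, or 64 dimensions of an embedding to be used on their own at inference time, typically after re-normalizing the truncated vector.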
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- num_train_epochs: 10
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- tp_size: 0
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | cosine_ndcg@10 |
---|---|---|
1.0 | 16 | 0.9484 |
2.0 | 32 | 0.9539 |
3.0 | 48 | 0.9692 |
3.125 | 50 | 0.9846 |
4.0 | 64 | 0.9692 |
5.0 | 80 | 0.9692 |
6.0 | 96 | 0.9539 |
6.25 | 100 | 0.9385 |
7.0 | 112 | 0.9539 |
8.0 | 128 | 0.9539 |
9.0 | 144 | 0.9539 |
9.375 | 150 | 0.9539 |
10.0 | 160 | 0.9539 |
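The Step and Epoch columns can be reproduced from the dataset size and batch size reported earlier in this card. The short calculation below assumes a single training device (the card does not say how many were used) and relies on gradient_accumulation_steps being 1 and dataloader_drop_last being False, as listed under All Hyperparameters.

```python
import math

train_samples = 156  # size of the training dataset
batch_size = 10      # per_device_train_batch_size
epochs = 10          # num_train_epochs

# dataloader_drop_last is False, so the final partial batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs


def epoch_at(step):
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch


print(steps_per_epoch, total_steps)             # 16 160
print([epoch_at(s) for s in (50, 100, 150)])    # [3.125, 6.25, 9.375]
```

The fractional rows at steps 50, 100, and 150 would be consistent with an additional evaluation every 50 steps on top of the end-of-epoch evaluations, although the card does not list an eval_steps value.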
Framework Versions
- Python: 3.13.2
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0
- Accelerate: 1.6.0
- Datasets: 3.5.1
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}