---
title: SemScore
tags:
- evaluate
- metric
description: 'SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment at the system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained sentence-transformer is used to calculate cosine similarities between them.'
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Metric Card for SemScore

## Metric Description

SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment at the system level when evaluating the instruction-following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained [sentence transformer](https://www.sbert.net) is used to calculate cosine similarities between them.

## How to Use

When loading SemScore, you can choose any pre-trained encoder-only model uploaded to the HF Hub to compute the score. The default model (if no `model_name` is specified) is `sentence-transformers/all-mpnet-base-v2`.

```python
import evaluate

semscore = evaluate.load("aynetdia/semscore", "model_name")
```

Here, `"model_name"` is a placeholder for the Hub ID of the sentence transformer you want to use.

SemScore takes 2 mandatory arguments in order to calculate the final score:

- `predictions`: a list of strings with model predictions (e.g. instruction completions) to score.
- `references`: a list of strings with "gold" references (e.g. target completions).

It also accepts optional arguments:

- `batch_size`: the batch size used when calculating the score (default value is `32`).
- `device`: the CPU/GPU device on which the score is calculated (default value is `None`, i.e. `cpu`).

```python
predictions = ['This is an example sentence', 'Each sentence is considered']
references = ['This is an example sentence', 'Each sentence is considered']
results = semscore.compute(predictions=predictions, references=references, batch_size=2, device="cuda:0")
```

### Output Values

The output of SemScore is a dictionary with the following values:

- `semscore`: the aggregated system-level SemScore.
- `similarities`: the cosine similarities between individual prediction-reference pairs.

#### Values from Popular Papers

The [SemScore paper](https://arxiv.org/abs/2401.17072) reports the correlation of SemScore with human ratings, in comparison to other popular metrics that rely on "gold" references as well as to reference-free LLM-based evaluation methods. The comparison is based on the evaluation of instruction-tuned LLMs.

## Limitations and Bias

One limitation of SemScore is its dependence on an underlying transformer model to compute semantic textual similarity between model and target outputs. By default, this implementation relies on the best-performing sentence transformer model as reported by the authors of the `sentence-transformers` library. However, better embedding models have become available since the publication of the SemScore paper (e.g. those listed on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard)), and such a model can be swapped in when loading the metric (see the sketch below).

A more general limitation is that SemScore requires at least one gold-standard target output against which to compare a generated response. This target output should be human-created or at least human-vetted.
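As a minimal sketch of this workaround, the snippet below follows the loading pattern shown above and passes a different encoder from the Hub as `model_name`. The specific model ID is only an illustration; any encoder-only model compatible with `sentence-transformers` could be substituted, and the example inputs and printed keys simply mirror the usage and output format documented above.

```python
import evaluate

# Illustrative only: load SemScore with a newer embedding model from the HF Hub
# instead of the default sentence-transformers/all-mpnet-base-v2.
semscore = evaluate.load("aynetdia/semscore", "BAAI/bge-base-en-v1.5")

# Toy prediction/reference pairs (hypothetical data, for demonstration only).
predictions = ["The cat sat on the mat.", "Paris is the capital of France."]
references = ["A cat is sitting on a mat.", "France's capital city is Paris."]

results = semscore.compute(predictions=predictions, references=references)
print(results["semscore"])       # aggregated system-level score
print(results["similarities"])   # per-pair cosine similarities
```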
## Citation

```bibtex
@misc{semscore,
      title={SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity},
      author={Ansar Aynetdinov and Alan Akbik},
      year={2024},
      eprint={2401.17072},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2401.17072},
}
```

## Further References

- [SemScore paper](https://arxiv.org/abs/2401.17072)