LLM Leaderboard for CRM

# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">LLM Leaderboard for CRM</h1>
<h3>Assess which LLMs are accurate enough or need fine-tuning, and weigh this versus tradeoffs of speed, costs, and trust and safety. This is based on human (manual) and automated evaluation with real operational CRM data per use case. Learn and explore more <a href="https://www.salesforceairesearch.com/crm-benchmark">here</a>.</h3>
"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """

"""

LLM_BENCHMARKS_TEXT = """
1) We consider models that are general instruction-tuned, not task-specific fine-tuned ones.
2) For GPT-4/GPT-4-Turbo Models, following tasks were evaluated:
  + GPT4:
    - Service: Conversation summary
    - Sales: Email Generation
    - Sales & Service: Update CRM Info
    - Service: Reply Recommendations

  + GPT4-Turbo:
    - Service: Live Chat Insights
    - Service: Email Summary
    - Sales: Call Summary
    - Service: Knowledge creation from Case Info
    - Service: Call Summary
    - Service: Live Chat Summary

3) Latency scores reflect the mean latency to receive the entire completion over a high-speed internet connection; response times for external APIs may vary and be impacted by internet speed, location, time, and other factors.
4) External APIs were hosted directly by the LLM provider (OpenAI, Google, AI21) or provided through Amazon Bedrock (Cohere, Anthropic). Other models were self-hosted using the DJL serving framework with vLLM engine. All 7B-sized models are hosted on G5.48xlarge with model sharding, and rolling batch config. For 70B model P4d.24xlarge instance were. used.
5) LLM annotations (manual/human evaluations) were performed on a subset of models under settings that did not necessarily control for ordering effects.
6) For the tests on latency: two cases were considered: (1) Length ~500 input and length ~250 output, and (2) length ~3000 input and ~250 output, reflecting common use cases for summarization and generation tasks.
7) Costs for all external APIs were based on the standard pricing of the provider (note that the pricing of Cohere/Anthropic via Bedrock is the same as directly through Cohere/Anthropic APIs).
8) Cost per request for self-hosted models assume a minimal frequency of calling the model, since the costs are per hour. All latencies / cost assume a single user at a time.
9) Trust & Safety was benchmarked on public datasets as well as bias perturbations on CRM datasets. For gender bias, person names and pronouns were perturbed. For company bias, company names were perturbed to competitors in the same sector. For the CRM Fairness metric, higher means less bias.
10) The current auto-evaluation is based on LLaMA-70B as Judge, which showed the highest correlation with human annotators; however, LLM judges may be less reliable than human annotators. This remains an active area of research.
11) For some of the manual evaluations, valid JSON outputs from LLMs were reformatted as plain text to be more readable, with an accompanying note acknowleding the reformatting. In these same evaluations, when a model generated invalid JSON that was parseable using a small set of heuristics, the JSON was similarly reformatted with a note indicating that the original JSON was invalid.

### Metric Definitions

|Metric   |Description   |
|---|---|
|**Accuracy**   |The average of Instruction-following, Completeness, Conciseness and Factuality. Note that coherence is was not measured since current LLMs consistently produce coherent results.    |
|**Instruction-following**   |Is the answer per the requested instructions, in terms of content and format?   |
|**Completeness**   |Is the response comprehensive, by including all relevant information?   |
|**Conciseness** |Is the response to the point and without repetition or elaboration?   |
|**Factuality**   |Is the response true and free of false information?   |
|**Accuracy Metric Scale**   |All accuracy metrics used the same scale for human/manual evaluation and for automated evaluations, which aids statistical analysis. <br>Very Good: As good as it gets given the information. A person with enough time would not do much better. <br>Good:  Done well with a little bit of room for improvements. <br>Poor: Not usable. Has issues. <br> Very Poor: Not usable, with very obvious critical issues."   |
|**Response Time (Sec)**   |The mean time it takes for the LLM to produce a full response.    |
|**Mean output tokens**  |The mean number of output tokens across a set of a prompts for a given use case. This is reference number to ensure a level measure for response time.   |
|**Cost Band**   |For public-API models, the cost is based on standard token-based pricing. For hosted models, typically open-sourced models, this is the hosting costs based on the use case type (with long or short inputs) and production-level volume for CRM.    |
|**Trust & Safety**   |This is the average of Safety, Privacy, Truthfulness, and CRM Fairness, as a percentage.    |
|**Safety**   |100 minus the percent of time a model refuses to provide an answer to an unsafe prompt.   |
|**Privacy**   |The average percent of the time privacy was respected across zero-shot and 5-shot attempts.    |
|**Truthfulness**   |The percent of times the model is able to correct wrong general information or facts in a prompt.   |
|**CRM Fairness**   |Average of CRM Gender Bias and CRM Account Bias.   |
|**CRM Account Bias**   |Percent of time Accuracy is maintained on CRM use cases, despite account-reference variations.   |
|**CRM Gender Bias**   |Percent of time Accuracy is maintained on CRM use cases, despite gender-reference variations.   |
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{crm-llm-leaderboard,
  author = {Salesforce AI},
  title = {LLM Leaderboard for CRM},
  year = {2024},
  publisher = {Salesforce AI},
  howpublished = "\url{https://https://huggingface.co/spaces/Salesforce/crm_llm_leaderboard}"
}
"""