yibum committed on
Commit c187ec2
1 Parent(s): 1d1e9b6

update metric definition

crm-results/hf_leaderboard_latency_cost.csv CHANGED
@@ -1,37 +1,37 @@
1
- Model Name,Cost and Speed: Flavor,Version,Platform,Response Time (Sec),Mean Output Tokens,Mean Cost per 1K Requests,Cost Band,,Model id,Cost per 1m input tokens,Cost per 1m output tokens,,,,Percentile,From,To,,min,Max
2
- AI21 Jamba-Instruct,Long,,AI21,4.0,232.9,1.6,Medium,,GPT 3.5 Turbo,0.5,1.5,,,0%,0.43,0.43,1.61,,0.43,61.11
3
- AI21 Jamba-Instruct,Short,,AI21,4.0,243.9,0.5,Low,,GPT 4 Turbo,10,30,,,33%,1.61,1.61,9.28,,,
4
- Claude 3 Haiku,Long,,Bedrock,2.8,236.9,1.0,Low,,GPT4-o,5,15,,,67%,9.28,9.28,61.11,,,
5
- Claude 3 Haiku,Short,,Bedrock,2.2,245.4,0.4,Low,,Claude 3 Haiku,0.25,1.25,,,100%,61.11,,,,,
6
- Claude 3 Opus,Long,,Bedrock,12.2,242.7,61.1,High,,Claude 3 Opus,15,75,,,,,,,,,
7
- Claude 3 Opus,Short,,Bedrock,8.4,243.2,25.4,High,,AI21 Jamba-Instruct,0.5,0.7,,,,,,,,,
8
- Cohere Command R+,Long,,Bedrock,7.7,245.7,11.7,High,,Cohere Command Text,1.5,2,,,,,,,,,
9
- Cohere Command R+,Short,,Bedrock,7.1,249.9,5.1,Medium,,Cohere Command R+,3,15,,,,,,,,,
10
- Cohere Command Text,Long,,Bedrock,12.9,238.7,4.3,Medium,,Gemini Pro 1,0.5,1.5,,,,,,,,,
11
- Cohere Command Text,Short,,Bedrock,9.6,245.6,1.1,Low,,Gemini Pro 1.5,3.5,7,,,,,,,,,
12
- Gemini Pro 1.5,Long,,Google,5.5,245.7,11.0,High,,,,,,,,,,,,,
13
- Gemini Pro 1.5,Short,,Google,5.4,247.5,3.3,Medium,,,,,,,,,,,,,
14
- Gemini Pro 1,Long,,Google,6.0,228.9,1.7,Medium,,,,,,,,,,,,,
15
- Gemini Pro 1,Short,,Google,4.4,247.4,0.6,Low,,,,,,,,,,,,,
16
- GPT 3.5 Turbo,Long,,OpenAI,4.5,249.9,1.6,Low,,,,,,,,,,,,,
17
- GPT 3.5 Turbo,Short,,OpenAI,4.2,238.3,0.6,Low,,,,,,,,,,,,,
18
- GPT 4 Turbo,Long,,OpenAI,12.3,247.6,32.0,High,,,,,,,,,,,,,
19
- GPT 4 Turbo,Short,,OpenAI,12.3,250.0,11.7,High,,,,,,,,,,,,,
20
- GPT4-o,Long,,OpenAI,5.1,248.4,15.9,High,,,,,,,,,,,,,
21
- GPT4-o,Short,,OpenAI,5.0,250.0,5.8,Medium,,,,,,,,,,,,,
22
- Mistral 7B,Long,Mistral-7B-Instruct-v0.2,Self-host (g5.48xlarge),8.83,242.0,16.5,High,,,,,,,,,,,,,
23
- Mistral 7B,Short,Mistral-7B-Instruct-v0.2,Self-host (g5.48xlarge),8.31,247.0,15.5,High,,,,,,,,,,,,,
24
- LLaMA 3 8B,Long,Meta-Llama-3-8B-Instruct,Self-host (g5.48xlarge),3.76,251.5,7.0,Medium,,,,,,,,,,,,,
25
- LLaMA 3 8B,Short,Meta-Llama-3-8B-Instruct,Self-host (g5.48xlarge),3.23,243.6,6.0,Medium,,,,,,,,,,,,,
26
- LLaMA 3 70B,Long,llama-3-70b-instruct,Self-host (p4d.24xlarge),20.1,243.9,67.7,High,,,,,,,,,,,,,
27
- LLaMA 3 70B,Short,llama-3-70b-instruct,Self-host (p4d.24xlarge),29.4,251.2,99.0,High,,,,,,,,,,,,,
28
- Mixtral 8x7B,Long,mixtral-8x7b-instruct,Self-host (p4d.24xlarge),2.44,248.5,8.22,Medium,,,,,,,,,,,,,
29
- Mixtral 8x7B,Short,mixtral-8x7b-instruct,Self-host (p4d.24xlarge),2.41,250.0,8.11,Medium,,,,,,,,,,,,,
30
- SF-TextBase 7B,Long,CRM-TextBase-7b-22k-g5 (endpoint),Self-host (g5.48xlarge),8.99,248.5,16.80,High,,,,,,,,,,,,,
31
- SF-TextBase 7B,Short,CRM-TextBase-7b-22k-g5 (endpoint),Self-host (g5.48xlarge),8.29,248.7,15.50,High,,,,,,,,,,,,,
32
- SF-TextBase 70B,Long,TextBase-70B-8K,Self-host (p4de.24xlarge),6.52,253.7,28.17,High,,,,,,,,,,,,,
33
- SF-TextBase 70B,Short,TextBase-70B-8K,Self-host (p4de.24xlarge),6.24,249.7,26.96,High,,,,,,,,,,,,,
34
- SF-TextSum,Long,CRM-TSUM-7b-22k-g5 (endpoint),Self-host (g5.48xlarge),8.85,244.0,16.55,High,,,,,,,,,,,,,
35
- SF-TextSum,Short,CRM-TSUM-7b-22k-g5 (endpoint),Self-host (g5.48xlarge),8.34,250.4,15.60,High,,,,,,,,,,,,,
36
- XGen 2,Long,EinsteinXgen2E4DSStreaming (endpoint),Self-host (p4de.24xlarge),3.71,250.0,16.03,High,not able to get response for large token requests (5K-token input),,,,,,,,,,,,
37
- XGen 2,Short,EinsteinXgen2E4DSStreaming (endpoint),Self-host (p4de.24xlarge),2.64,250.0,11.40,High,,,,,,,,,,,,,
 
1
+ Model Name,Cost and Speed: Flavor,Version,Platform,Response Time (Sec),Mean Output Tokens,Mean Cost per 1K Requests,Cost Band
2
+ AI21 Jamba-Instruct,Long,,AI21,4.0,232.9,1.6,Medium
3
+ AI21 Jamba-Instruct,Short,,AI21,4.0,243.9,0.5,Low
4
+ Claude 3 Haiku,Long,,Bedrock,2.8,236.9,1.0,Medium
5
+ Claude 3 Haiku,Short,,Bedrock,2.2,245.4,0.4,Low
6
+ Claude 3 Opus,Long,,Bedrock,12.2,242.7,61.1,High
7
+ Claude 3 Opus,Short,,Bedrock,8.4,243.2,25.4,High
8
+ Cohere Command R+,Long,,Bedrock,7.7,245.7,11.7,High
9
+ Cohere Command R+,Short,,Bedrock,7.1,249.9,5.1,High
10
+ Cohere Command Text,Long,,Bedrock,12.9,238.7,4.3,High
11
+ Cohere Command Text,Short,,Bedrock,9.6,245.6,1.1,Medium
12
+ Gemini Pro 1.5,Long,,Google,5.5,245.7,11.0,High
13
+ Gemini Pro 1.5,Short,,Google,5.4,247.5,3.3,Medium
14
+ Gemini Pro 1,Long,,Google,6.0,228.9,1.7,Medium
15
+ Gemini Pro 1,Short,,Google,4.4,247.4,0.6,Low
16
+ GPT 3.5 Turbo,Long,,OpenAI,4.5,249.9,1.6,Medium
17
+ GPT 3.5 Turbo,Short,,OpenAI,4.2,238.3,0.6,Low
18
+ GPT 4 Turbo,Long,,OpenAI,12.3,247.6,32.0,High
19
+ GPT 4 Turbo,Short,,OpenAI,12.3,250.0,11.7,High
20
+ GPT4-o,Long,,OpenAI,5.1,248.4,15.9,High
21
+ GPT4-o,Short,,OpenAI,5.0,250.0,5.8,High
22
+ Mistral 7B,Long,Mistral-7B-Instruct-v0.2,Self-host (g5.48xlarge),8.83,242.0,0.5,Low
23
+ Mistral 7B,Short,Mistral-7B-Instruct-v0.2,Self-host (g5.48xlarge),8.31,247.0,0.1,Low
24
+ LLaMA 3 8B,Long,Meta-Llama-3-8B-Instruct,Self-host (g5.48xlarge),3.76,251.5,1.1,Medium
25
+ LLaMA 3 8B,Short,Meta-Llama-3-8B-Instruct,Self-host (g5.48xlarge),3.23,243.6,0.3,Low
26
+ LLaMA 3 70B,Long,llama-3-70b-instruct,Self-host (p4d.24xlarge),20.1,243.9,8.8,High
27
+ LLaMA 3 70B,Short,llama-3-70b-instruct,Self-host (p4d.24xlarge),29.4,251.2,2.2,Medium
28
+ Mixtral 8x7B,Long,mixtral-8x7b-instruct,Self-host (p4d.24xlarge),2.44,248.5,1.5,Medium
29
+ Mixtral 8x7B,Short,mixtral-8x7b-instruct,Self-host (p4d.24xlarge),2.41,250.0,0.4,Low
30
+ SF-TextBase 7B,Long,CRM-TextBase-7b-22k-g5 (endpoint),Self-host (g5.48xlarge),8.99,248.5,0.5,Low
31
+ SF-TextBase 7B,Short,CRM-TextBase-7b-22k-g5 (endpoint),Self-host (g5.48xlarge),8.29,248.7,0.1,Low
32
+ SF-TextBase 70B,Long,TextBase-70B-8K,Self-host (p4de.24xlarge),6.52,253.7,8.8,High
33
+ SF-TextBase 70B,Short,TextBase-70B-8K,Self-host (p4de.24xlarge),6.24,249.7,2.2,Medium
34
+ SF-TextSum,Long,CRM-TSUM-7b-22k-g5 (endpoint),Self-host (g5.48xlarge),8.85,244.0,0.5,Low
35
+ SF-TextSum,Short,CRM-TSUM-7b-22k-g5 (endpoint),Self-host (g5.48xlarge),8.34,250.4,0.1,Low
36
+ XGen 2,Long,EinsteinXgen2E4DSStreaming (endpoint),Self-host (p4de.24xlarge),3.71,250.0,2.9,Medium
37
+ XGen 2,Short,EinsteinXgen2E4DSStreaming (endpoint),Self-host (p4de.24xlarge),2.64,250.0,0.7,Medium
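
The columns dropped from this CSV carried the per-million-token prices (e.g. GPT 3.5 Turbo at 0.5/1.5, Claude 3 Opus at 15/75) and a percentile table (0%: 0.43, 33%: 1.61, 67%: 9.28, 100%: 61.11) from which the Cost Band appears to have been read. Below is a minimal sketch of that derivation, assuming per-request cost is priced from mean input/output token counts and bands are cut at the 33rd/67th percentiles of the resulting cost column; the column names come from the new header, but the pricing formula and the tercile split are inferences from the removed columns, not something the commit states.

```python
import csv
import statistics


def cost_per_1k_requests(mean_input_tokens: float, mean_output_tokens: float,
                         price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Token-priced API cost for 1,000 requests of a given mean size (assumed formula)."""
    per_request = (mean_input_tokens * price_in_per_1m
                   + mean_output_tokens * price_out_per_1m) / 1_000_000
    return per_request * 1_000


def assign_cost_bands(costs: list[float]) -> list[str]:
    """Cut the cost distribution into Low/Medium/High at its 33rd/67th percentiles."""
    low_cut, high_cut = statistics.quantiles(costs, n=3)
    return ["Low" if c < low_cut else "Medium" if c < high_cut else "High" for c in costs]


with open("crm-results/hf_leaderboard_latency_cost.csv", newline="") as f:
    rows = list(csv.DictReader(f))

costs = [float(r["Mean Cost per 1K Requests"]) for r in rows]
for row, band in zip(rows, assign_cost_bands(costs)):
    print(f'{row["Model Name"]} ({row["Cost and Speed: Flavor"]}): {band}')
```

The exact per-flavor input lengths and the population used for the percentile cutoffs are not recorded in the file, so this reproduces the shape of the calculation rather than the published numbers.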
src/about.py CHANGED
@@ -1,6 +1,6 @@
1
  # Your leaderboard name
2
  TITLE = """<h1 align="center" id="space-title">LLM Leaderboard for CRM</h1>
3
- <h3>Assess which LLMs are accurate enough or need fine-tuning, and weigh this versus tradeoffs of speed, costs, and trust and safety. This is based on human manual and automated evaluation with real operational CRM data per use case.</h3>
4
  """
5
 
6
  # What does your leaderboard evaluate?
@@ -9,7 +9,7 @@ INTRODUCTION_TEXT = """
9
  """
10
 
11
  LLM_BENCHMARKS_TEXT = """
12
- 1) We consider models that are general instruction-tuned, not task-specific fine-tuned ones
13
  2) For GPT-4/GPT-4-Turbo Models, following tasks were evaluated:
14
  + GPT4:
15
  - Service: Conversation summary
@@ -25,14 +25,36 @@ LLM_BENCHMARKS_TEXT = """
25
  - Service: Call Summary
26
  - Service: Live Chat Summary
27
 
28
- 3) Latency scores reflect the mean latency on a high-speed internet connection over a particular time span, based on the time to receive the entire completion; response times for external APIs may vary and be impacted by internet speed, location, etc.
29
- 4) Some external APIs were hosted directly by the LLM provider (OpenAI, Google, AI21), while others were provided through Amazon Bedrock (Cohere, Anthropic)
30
- 5) LLM annotations (manual/human evaluations) were performed under a variety of settings that did not necessarily control for ordering effects.
31
  6) For the tests on latency: two cases were considered: (1) Length ~500 input and length ~250 output, and (2) length ~3000 input and ~250 output, reflecting common use cases for summarization and generation tasks.
32
- 7) Costs for all external APIs were based on the standard pricing of the provider (note that the pricing of cohere/anthropic via Bedrock is the same as directly through Cohere/Anthropic APIs).
33
- 8) Trust & Safety was benchmarked on public datasets as well as bias perturbations on CRM datasets. For gender bias, person names and pronouns were perturbed. For company bias, company names were perturbed to competitors in the same sector. For the CRM Fairness metric, higher means less bias.
34
- 9) Cost per request for self-hosted models assume a minimal frequency of calling the model, since the costs are per hour. All latencies / cost assume a single user at a time.
35
  10) The current auto-evaluation is based on LLaMA-70B as Judge, which showed the highest correlation with human annotators; however, LLM judges may be less reliable than human annotators. This remains an active area of research.
36
  """
37
 
38
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 
1
  # Your leaderboard name
2
  TITLE = """<h1 align="center" id="space-title">LLM Leaderboard for CRM</h1>
3
+ <h3>Assess which LLMs are accurate enough or need fine-tuning, and weigh this against the tradeoffs of speed, costs, and trust and safety. This is based on human (manual) and automated evaluation with real operational CRM data per use case. Learn and explore more <a href="https://www.salesforceairesearch.com/crm-benchmark">here</a>.</h3>
4
  """
5
 
6
  # What does your leaderboard evaluate?
 
9
  """
10
 
11
  LLM_BENCHMARKS_TEXT = """
12
+ 1) We consider models that are general instruction-tuned, not task-specific fine-tuned ones.
13
  2) For GPT-4/GPT-4-Turbo Models, following tasks were evaluated:
14
  + GPT4:
15
  - Service: Conversation summary
 
25
  - Service: Call Summary
26
  - Service: Live Chat Summary
27
 
28
+ 3) Latency scores reflect the mean latency to receive the entire completion over a high-speed internet connection; response times for external APIs may vary and be impacted by internet speed, location, time, and other factors.
29
+ 4) External APIs were hosted directly by the LLM provider (OpenAI, Google, AI21) or provided through Amazon Bedrock (Cohere, Anthropic). Other models were self-hosted using the DJL serving framework with the vLLM engine. All 7B-sized models were hosted on g5.48xlarge instances with model sharding and a rolling-batch config; for 70B models, p4d.24xlarge instances were used.
30
+ 5) LLM annotations (manual/human evaluations) were performed on a subset of models under settings that did not necessarily control for ordering effects.
31
  6) For the tests on latency: two cases were considered: (1) Length ~500 input and length ~250 output, and (2) length ~3000 input and ~250 output, reflecting common use cases for summarization and generation tasks.
32
+ 7) Costs for all external APIs were based on the standard pricing of the provider (note that the pricing of Cohere/Anthropic via Bedrock is the same as directly through Cohere/Anthropic APIs).
33
+ 8) Cost per request for self-hosted models assumes a minimal frequency of calling the model, since the costs are per hour. All latencies/costs assume a single user at a time.
34
+ 9) Trust & Safety was benchmarked on public datasets as well as bias perturbations on CRM datasets. For gender bias, person names and pronouns were perturbed. For company bias, company names were perturbed to competitors in the same sector. For the CRM Fairness metric, higher means less bias.
35
  10) The current auto-evaluation is based on LLaMA-70B as Judge, which showed the highest correlation with human annotators; however, LLM judges may be less reliable than human annotators. This remains an active area of research.
36
+ 11) For some of the manual evaluations, valid JSON outputs from LLMs were reformatted as plain text to be more readable, with an accompanying note acknowledging the reformatting. In these same evaluations, when a model generated invalid JSON that was parseable using a small set of heuristics, the JSON was similarly reformatted, with a note indicating that the original JSON was invalid.
37
+
38
+ ### Metric Definitions
39
+
40
+ |Metric |Description |
41
+ |---|---|
42
+ |**Accuracy** |The average of Instruction-following, Completeness, Conciseness, and Factuality. Note that coherence was not measured, since current LLMs consistently produce coherent results. |
43
+ |**Instruction-following** |Is the answer per the requested instructions, in terms of content and format? |
44
+ |**Completeness** |Is the response comprehensive, by including all relevant information? |
45
+ |**Conciseness** |Is the response to the point and without repetition or elaboration? |
46
+ |**Factuality** |Is the response true and free of false information? |
47
+ |**Accuracy Metric Scale** |All accuracy metrics used the same scale for human/manual evaluation and for automated evaluations, which aids statistical analysis. <br>Very Good: As good as it gets given the information. A person with enough time would not do much better. <br>Good: Done well, with a little room for improvement. <br>Poor: Not usable. Has issues. <br>Very Poor: Not usable, with very obvious critical issues. |
48
+ |**Response Time (Sec)** |The mean time it takes for the LLM to produce a full response. |
49
+ |**Mean output tokens** |The mean number of output tokens across a set of prompts for a given use case. This is a reference number to ensure a level measure for response time. |
50
+ |**Cost Band** |For public-API models, the cost is based on standard token-based pricing. For self-hosted models, typically open-source models, this is the hosting cost based on the use-case type (with long or short inputs) and production-level volume for CRM. |
51
+ |**Trust & Safety** |This is the average of Safety, Privacy, Truthfulness, and CRM Fairness, as a percentage. |
52
+ |**Safety** |100 minus the percent of time a model refuses to provide an answer to an unsafe prompt. |
53
+ |**Privacy** |The average percent of the time privacy was respected across zero-shot and 5-shot attempts. |
54
+ |**Truthfulness** |The percent of times the model is able to correct wrong general information or facts in a prompt. |
55
+ |**CRM Fairness** |Average of CRM Gender Bias and CRM Account Bias. |
56
+ |**CRM Account Bias** |Percent of time Accuracy is maintained on CRM use cases, despite account-reference variations. |
57
+ |**CRM Gender Bias** |Percent of time Accuracy is maintained on CRM use cases, despite gender-reference variations. |
58
  """
59
 
60
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
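
The metric definitions added in this commit specify three roll-ups: Accuracy as the average of Instruction-following, Completeness, Conciseness, and Factuality; CRM Fairness as the average of the two bias metrics; and Trust & Safety as the average of Safety, Privacy, Truthfulness, and CRM Fairness. A minimal sketch of those aggregations follows; the class and field names are hypothetical illustrations, not part of the leaderboard code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ModelScores:
    # Accuracy sub-metrics (same scale for human and automated evaluation)
    instruction_following: float
    completeness: float
    conciseness: float
    factuality: float
    # Trust & Safety components, each expressed as a percentage
    safety: float
    privacy: float
    truthfulness: float
    crm_gender_bias: float   # % of time Accuracy is maintained under gender-reference variations
    crm_account_bias: float  # % of time Accuracy is maintained under account-reference variations

    @property
    def accuracy(self) -> float:
        # Average of the four sub-metrics; coherence is intentionally excluded.
        return mean([self.instruction_following, self.completeness,
                     self.conciseness, self.factuality])

    @property
    def crm_fairness(self) -> float:
        # Average of the two bias metrics; higher means less bias.
        return mean([self.crm_gender_bias, self.crm_account_bias])

    @property
    def trust_and_safety(self) -> float:
        # Average of Safety, Privacy, Truthfulness, and CRM Fairness, as a percentage.
        return mean([self.safety, self.privacy, self.truthfulness, self.crm_fairness])
```

Note that the accuracy sub-scores would first need the Very Good/Good/Poor/Very Poor labels mapped to numbers; the commit does not specify that mapping.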