SpeechMap Compliant Responses
This dataset contains every prompt in the SpeechMap.AI dataset, paired with a high-quality, compliant response. The SpeechMap dataset is a set of challenging, contentious requests that test the boundaries of what sort of speech AI models will assist with. We hope this dataset will be useful for future model training efforts by teams concerned with ensuring their models enable free speech.
What is SpeechMap.AI?
SpeechMap.AI is a public research project that explores the boundaries of AI-generated speech.
We test how language models respond to sensitive and controversial prompts across different providers, countries, and topics. Most AI benchmarks measure what models can do. We focus on what they won’t: what they avoid, refuse, or shut down.
We're not arguing that every prompt deserves an answer. Some are offensive. Some are absurd. But without testing what gets filtered, we can’t see where the lines are drawn—or how they’re shifting over time.
Why This Matters
Language models are becoming infrastructure for public speech. They're embedded in how we write, search, learn, and argue. That makes them powerful speech-enabling technologies, but also potentially speech-limiting ones.
If models refuse to talk about certain topics, they shape the boundaries of expression. Some models block criticism of governments. Others avoid satire, protest, or controversial moral arguments. Often, the rules are unclear and inconsistently applied.
SpeechMap.AI helps reveal those boundaries.
Data
As of June 2025, when this dataset was created, our US-oriented speech testing dataset contained over 2,000 prompts on more than 500 topics, with responses from more than 120 models. In total, nearly 170,000 of those responses were judged by a judge model to honor the user's request. We filtered these down to a single, direct, high-quality response for each prompt.
The complete set of model responses is also available on GitHub.
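For convenience, here is a minimal sketch of loading and inspecting the data with the Hugging Face `datasets` library. The dataset path and the field names (`prompt`, `response`, `model`) are assumptions for illustration; check the dataset card and `ds.column_names` for the actual ones.

```python
from datasets import load_dataset

# Hypothetical dataset ID -- substitute the actual Hugging Face path.
ds = load_dataset("speechmap/compliant-responses", split="train")

for row in ds.select(range(3)):
    # Field names ("prompt", "response", "model") are assumptions
    # based on the description above; inspect ds.column_names first.
    print(row["prompt"][:80], "->", row["model"])
```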
Method
To select the best response for each prompt, we used DeepSeek-v3-0324 to perform pairwise comparisons under the following rubric (a selection sketch follows the list):
- Prefer the response that answers directly, without hedging, avoiding the question, contextualizing, challenging the user, or resisting in any way.
- If the two responses are equal on point 1, choose the one that most accurately accomplishes what the user requested.
- If still tied, choose the response that is overall higher quality (clarity, completeness, style).
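Below is a minimal sketch of this kind of pairwise winner-stays selection. The judge call is a stub: the actual prompt template, API details, and tournament structure used with DeepSeek-v3-0324 are not specified here, so treat everything in the sketch as illustrative.

```python
import random

RUBRIC = """Compare two responses (A and B) to the same prompt.
1. Prefer the response that answers directly, without hedging,
   avoiding, contextualizing, challenging the user, or resisting.
2. If tied on point 1, prefer the one that most accurately
   accomplishes what the user requested.
3. If still tied, prefer the overall higher-quality response
   (clarity, completeness, style).
Reply with exactly "A" or "B"."""

def judge(prompt: str, a: str, b: str) -> str:
    """Placeholder for a call to the judge model (DeepSeek-v3-0324).
    In practice this would send RUBRIC, the prompt, and both
    responses to the model and parse an "A"/"B" verdict; the exact
    API is an assumption, so a random stub stands in here."""
    return random.choice("AB")

def best_response(prompt: str, candidates: list[str]) -> str:
    """Reduce the candidate pool with pairwise comparisons, keeping
    the judged winner of each match (a single winner-stays pass;
    the project's exact tournament structure is not specified)."""
    winner = candidates[0]
    for challenger in candidates[1:]:
        if judge(prompt, winner, challenger) == "B":
            winner = challenger
    return winner
```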
Models used
The dataset includes annotations indicating which model generated each chosen response. The counts below show how often each model's response was selected across the complete dataset (a sketch for reproducing these counts follows the list).
185 deepseek/deepseek-r1-0528
176 openai/o3-2025-04-16
135 deepseek/deepseek-chat-v3-0324
118 qwen/qwen3-235b-a22b
113 google/gemini-2.5-flash-preview-05-20-thinking
103 mistralai/mistral-medium-3-2505
84 perplexity/r1-1776
80 qwen/qwen3-14b
80 openai/o4-mini-2025-04-16
63 qwen/qwq-32b
62 x-ai/grok-3-beta
60 google/gemini-2.5-flash-preview-05-20
54 deepseek/deepseek-r1
39 x-ai/grok-3-mini-beta
39 google/gemini-2.5-flash-preview-04-17-thinking
37 openai/gpt-4.1-2025-04-14
36 qwen/qwen3-32b
33 openrouter/optimus-alpha
32 deepseek/deepseek-chat
27 google/gemini-2.5-pro-preview-05-06
26 openai/gpt-4.5-preview
24 anthropic/claude-opus-4
23 thudm/glm-4-z1-32b-0414
23 openai/chatgpt-4o-latest
20 openrouter/quasar-alpha
20 google/gemini-2.5-pro-preview-03-25
19 x-ai/grok-2-1212
19 openai/gpt-3.5-turbo-0125
18 openai/gpt-4.1-mini-2025-04-14
13 thudm/glm-4-32b-0414
13 qwen/qwen-max
13 google/gemini-2.5-flash-preview-04-17
13 anthropic/claude-sonnet-4
12 mistralai/mixtral-8x7b-v0.1
11 x-ai/grok-beta
11 mistralai/mistral-small-2409
11 mistralai/mistral-saba-2502
11 google/gemini-1.5-flash-002
10 anthropic/claude-opus-4-thinking
9 tng-ai-research/DeepSeek-R1T-0528-Chimera_test
9 mistralai/mistral-nemo-2407
9 mistralai/mistral-large-2411
8 openai/o1-preview-2024-09-12
8 openai/gpt-3.5-turbo-1106
8 mistralai/mistral-large-2407
8 deepseek/deepseek-r1-zero
7 openai/o1
7 openai/gpt-4o-mini-2024-07-18
7 openai/gpt-4.1-nano-2025-04-14
7 openai/gpt-3.5-turbo-0613
7 openai/chatgpt-4o-latest-20250428
7 meta-llama/llama-3.1-405b-instruct
6 qwen/qwen-2.5-7b-instruct
6 mistralai/mistral-small-2503
6 mistralai/mistral-medium-2312
6 mistralai/mistral-7b-instruct-v0.1
6 anthropic/claude-3-7-sonnet-20250219-thinking
5 rekaai/reka-flash-3
5 qwen/qwen2.5-vl-72b-instruct
5 qwen/qwen-2.5-72b-instruct
5 openai/gpt-4-0314
5 google/gemini-2.0-flash-lite-001
5 anthropic/claude-3-7-sonnet-20250219
4 openai/gpt-4o-2024-11-20
4 openai/gpt-4o-2024-08-06
4 openai/gpt-4o-2024-05-13
4 openai/gpt-4-1106-preview
4 mistralai/mistral-7b-instruct-v0.2
4 mistralai/ministral-8b-2410
4 google/gemini-2.0-flash-001
4 google/gemini-1.5-pro-002
4 google/gemini-1.5-flash-8b-001
4 anthropic/claude-sonnet-4-thinking
3 openai/gpt-4-turbo
3 microsoft/phi-4-reasoning-plus
3 microsoft/mai-ds-r1-fp8
3 meta-llama/llama-3.2-90b-vision-instruct
3 meta-llama/llama-3.1-70b-instruct
3 google/gemini-1.0-pro-002
2 openai/o3-mini
2 openai/o1-mini-2024-09-12
2 openai/gpt-4-0613
2 mistralai/mixtral-8x22b-instruct-v0.1
2 meta-llama/llama-4-scout
2 meta-llama/llama-3.3-8b-instruct
2 meta-llama/llama-3.3-70b-instruct
2 meta-llama/llama-3-70b-instruct
2 amazon/nova-pro-v1.0
1 tngtech/deepseek-r1t-chimera
1 qwen/qwen3-30b-a3b
1 openai/gpt-4-turbo-preview
1 nvidia/llama-3_1-nemotron-ultra-253b-v1
1 nvidia/llama-3_1-nemotron-nano-8b-v1
1 mistralai/mistral-small-2501
1 mistralai/mistral-7b-instruct-v0.3
1 microsoft/phi-3.5-mini-instruct
1 microsoft/phi-3-mini-128k-instruct
1 microsoft/phi-3-medium-128k-instruct
1 meta-llama/llama-4-maverick
1 google/gemma-3n-e4b-it
1 google/gemma-3-27b-it
1 google/gemma-2-9b-it
1 anthropic/claude-3-sonnet-20240229
1 anthropic/claude-3-5-sonnet-20240620
1 amazon/nova-micro-v1.0
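Counts like these can be recomputed from the annotations themselves. Reusing `ds` from the loading sketch above, and again assuming the annotation field is named `model`:

```python
from collections import Counter

# "model" is the assumed name of the per-row annotation field.
counts = Counter(row["model"] for row in ds)
for model, n in counts.most_common():
    print(f"{n:>4} {model}")
```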
License
This dataset is released under the Apache 2.0 license.