Real Benchmark Data Fetch - In Progress
Status: RUNNING
Started: Now
ETA: 10-15 minutes
What's Happening
We're fetching REAL per-question MMLU success rates from the top 5 models on the Open LLM Leaderboard (a fetch sketch follows the lists below).
Models Being Queried
- meta-llama/Meta-Llama-3.1-70B-Instruct (~85% MMLU)
- Qwen/Qwen2.5-72B-Instruct (~85% MMLU)
- mistralai/Mixtral-8x22B-Instruct-v0.1 (~77% MMLU)
- google/gemma-2-27b-it (~75% MMLU)
- microsoft/Phi-3-medium-128k-instruct (~78% MMLU)
Data Being Collected
- 14,042 MMLU questions per model
- Per-question correctness (0 or 1)
- Aggregated success rate across all 5 models
- Difficulty classification based on real performance
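A minimal sketch of the fetch step, assuming each model's per-sample MMLU results are published as a Hugging Face dataset. The repo naming pattern, the config name, the `latest` split, and the `acc` field are assumptions that must be checked against the actual leaderboard details datasets; `fetch_model_results` is a hypothetical helper, not the script itself.

```python
# Sketch only: pull per-question MMLU correctness for each model.
# ASSUMPTIONS: per-sample "details" datasets exist under repos named like
# "open-llm-leaderboard/details_<org>__<model>", the MMLU config is
# "harness_hendrycksTest_5", and each row has an `acc` field (1.0 = correct).
from datasets import load_dataset

MODELS = [
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "Qwen/Qwen2.5-72B-Instruct",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "google/gemma-2-27b-it",
    "microsoft/Phi-3-medium-128k-instruct",
]

def fetch_model_results(model_id: str) -> dict[str, int]:
    """Return {question_id: 0 or 1} for one model's MMLU run."""
    repo = f"open-llm-leaderboard/details_{model_id.replace('/', '__')}"
    rows = load_dataset(repo, "harness_hendrycksTest_5", split="latest")
    return {f"mmlu_{i}": int(row["acc"]) for i, row in enumerate(rows)}

# {"<org>__<model>": {"mmlu_0": 1, "mmlu_1": 0, ...}, ...}
per_model = {m.replace("/", "__"): fetch_model_results(m) for m in MODELS}
```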
What We'll Get
Per-Question Data
```json
{
  "mmlu_42": {
    "question_text": "Statement 1 | Some abelian group...",
    "success_rate": 0.60,          // 3 out of 5 models got it right
    "num_models_tested": 5,
    "difficulty_tier": "medium",
    "difficulty_label": "Moderate",
    "model_results": {
      "meta-llama__Meta-Llama-3.1-70B-Instruct": 1,
      "Qwen__Qwen2.5-72B-Instruct": 1,
      "mistralai__Mixtral-8x22B-Instruct-v0.1": 0,
      "google__gemma-2-27b-it": 1,
      "microsoft__Phi-3-medium-128k-instruct": 0
    }
  }
}
```
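A sketch of how the per-model 0/1 results could be aggregated into the record format above. `per_model` is the dict built in the fetch sketch, `question_texts` (question_id → text, e.g. from the cais/mmlu dataset) is assumed available, and the "Hard"/"Easy" label wording is an assumption; thresholds follow the 30%/70% split described below.

```python
# Aggregate {model: {question_id: 0/1}} into per-question records.
def classify(rate: float) -> tuple[str, str]:
    """Map a success rate to (difficulty_tier, difficulty_label)."""
    if rate < 0.30:
        return "low", "Hard"        # label wording is an assumption
    if rate < 0.70:
        return "medium", "Moderate"
    return "high", "Easy"           # label wording is an assumption

def aggregate(per_model: dict[str, dict[str, int]],
              question_texts: dict[str, str]) -> dict:
    records = {}
    all_ids = set().union(*(results.keys() for results in per_model.values()))
    for qid in sorted(all_ids):
        model_results = {m: r[qid] for m, r in per_model.items() if qid in r}
        rate = sum(model_results.values()) / len(model_results)
        tier, label = classify(rate)
        records[qid] = {
            "question_text": question_texts.get(qid, ""),
            "success_rate": round(rate, 2),
            "num_models_tested": len(model_results),
            "difficulty_tier": tier,
            "difficulty_label": label,
            "model_results": model_results,
        }
    return records
```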
Expected Distribution
Based on top model performance:
- LOW success (0-30%): ~10-15% of questions (hard for even best models)
- MEDIUM success (30-70%): ~25-35% of questions (capability boundary)
- HIGH success (70-100%): ~50-65% of questions (mastered)
This gives us the full spectrum to understand LLM capability boundaries!
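Once the records exist, the split can be checked directly; a small helper, assuming the record format sketched above:

```python
# Count questions per difficulty tier to compare against the expected split.
from collections import Counter

def tier_distribution(records: dict) -> Counter:
    return Counter(rec["difficulty_tier"] for rec in records.values())

# e.g. Counter({"high": 8900, "medium": 3300, "low": 1800})  -- illustrative numbers only
```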
Why This Approach Is Better
What We Tried First
- Domain-level estimates: All questions in a domain get the same score
- Manual evaluation: Too slow and expensive
- Clustering: Groups questions but doesn't give individual scores
What We're Doing Now
Real per-question success rates from top models
Advantages:
- Granular: Each question has its own difficulty score
- Accurate: Based on actual model performance
- Current: Uses latest top models
- Explainable: "5 top models got this right" vs "estimated 45%"
Timeline
| Step | Status | Time |
|---|---|---|
| Fetch Model 1 (Llama 3.1 70B) | Running | ~3 min |
| Fetch Model 2 (Qwen 2.5 72B) | Queued | ~3 min |
| Fetch Model 3 (Mixtral 8x22B) | Queued | ~3 min |
| Fetch Model 4 (Gemma 2 27B) | Queued | ~3 min |
| Fetch Model 5 (Phi-3 Medium) | Queued | ~3 min |
| Aggregate Success Rates | Pending | ~1 min |
| Save Results | Pending | <1 min |
Total: ~10-15 minutes
Output Files
Main Output
./data/benchmark_results/mmlu_real_results.json
Contains:
- Metadata (models, fetch time, counts)
- Questions with real success rates
- Difficulty classifications
Statistics
- Total questions collected
- Difficulty tier distribution
- Success rate statistics (min, max, mean, median)
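A sketch of how this output file could be assembled; the top-level keys (`metadata`, `questions`, `statistics`) mirror the description above, but their exact names are assumptions.

```python
# Write metadata, per-question records, and summary statistics to disk.
import json
import statistics
from datetime import datetime, timezone
from pathlib import Path

def save_results(records: dict, models: list[str],
                 path: str = "./data/benchmark_results/mmlu_real_results.json") -> None:
    rates = [rec["success_rate"] for rec in records.values()]
    payload = {
        "metadata": {
            "models": models,
            "fetched_at": datetime.now(timezone.utc).isoformat(),
            "num_questions": len(records),
        },
        "questions": records,
        "statistics": {
            "min": min(rates),
            "max": max(rates),
            "mean": statistics.mean(rates),
            "median": statistics.median(rates),
        },
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, indent=2))
```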
Next Steps (After Fetch Completes)
Immediate
- Review fetched data quality
- Verify difficulty distribution makes sense
- Check for any data issues
Then
- Load into vector DB: Use real success rates (see the sketch after this list)
- Build embeddings: Generate for all questions
- Test queries: "Calculate quantum corrections..." → find similar hard questions
- Validate accuracy: Does it correctly identify hard vs easy prompts?
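The sketch referenced above, using ChromaDB and a sentence-transformers model as stand-ins; the project's actual vector DB, embedding model, and collection name may differ.

```python
# Load aggregated questions into a vector DB so an incoming prompt can be
# matched against benchmark questions of known difficulty.
# ASSUMPTIONS: ChromaDB and all-MiniLM-L6-v2 are illustrative choices only.
import json
import chromadb
from sentence_transformers import SentenceTransformer

records = json.load(open("./data/benchmark_results/mmlu_real_results.json"))["questions"]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./data/vector_db")
collection = client.get_or_create_collection("mmlu_difficulty")

ids = list(records.keys())
texts = [records[qid]["question_text"] for qid in ids]

batch = 1000  # add in batches to stay under Chroma's max batch size
for start in range(0, len(ids), batch):
    sl = slice(start, start + batch)
    collection.add(
        ids=ids[sl],
        documents=texts[sl],
        embeddings=encoder.encode(texts[sl]).tolist(),
        metadatas=[{"success_rate": records[qid]["success_rate"],
                    "difficulty_tier": records[qid]["difficulty_tier"]}
                   for qid in ids[sl]],
    )

# Example query: find benchmark questions similar to a new prompt.
hits = collection.query(
    query_embeddings=encoder.encode(["Calculate quantum corrections..."]).tolist(),
    n_results=5,
)
print(hits["metadatas"][0])  # success rates of the closest known questions
```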
Finally
- Integrate with MCP: `togmal_check_prompt_difficulty` uses real data
- Deploy to production: Ready for use in Claude Desktop
- Monitor performance: Track query speed and accuracy
Key Innovation
We're not estimating difficulty - we're measuring it directly from the world's best models.
This means:
- No guesswork: Real performance data
- Cross-model consensus: 5 top models agree/disagree
- Capability boundary detection: Find questions at 30-50% success (most interesting!)
- Actionable insights: "Similar to questions that 4/5 top models fail"
Expected Results
Difficulty Tiers
Based on top model performance patterns:
LOW Success (0-30%) - ~500-1000 questions
- Graduate-level reasoning
- Multi-step problem solving
- Domain-specific expertise
- These are the gold mine for detecting LLM limits!
MEDIUM Success (30-70%) - ~2000-3000 questions
- Capability boundary
- Requires careful reasoning
- Some models succeed, others fail
- Most interesting for adaptive prompting
HIGH Success (70-100%) - ~8000-10000 questions
- Within LLM capability
- Baseline knowledge
- Factual recall
- Good for validation
Success Metrics
Data Quality
- All 5 models fetched successfully
- 1000+ questions with complete data
- Difficulty distribution looks reasonable
- No major data anomalies
Performance
- Fetch completes in <20 minutes
- All questions have success rates
- Stratification works (low/medium/high)
- JSON file validates
Usability
- Data format ready for vector DB
- Metadata preserved (domains, questions)
- Can be post-processed easily
- Documented and reproducible
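A quick sanity-check sketch against the metrics above, assuming the output layout from the save sketch earlier:

```python
# Verify the JSON parses, every question has a success rate in [0, 1],
# and all three difficulty tiers are represented.
import json

data = json.load(open("./data/benchmark_results/mmlu_real_results.json"))
questions = data["questions"]

assert len(questions) >= 1000, "expected 1000+ questions with complete data"
assert all(0.0 <= q["success_rate"] <= 1.0 for q in questions.values())
tiers = {q["difficulty_tier"] for q in questions.values()}
assert tiers == {"low", "medium", "high"}, f"unexpected tiers: {tiers}"
print("data quality checks passed:", len(questions), "questions")
```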
Current Status: Script running, check back in ~15 minutes!
Run this to check progress:
tail -f <terminal_output>
Or check the output file:
ls -lh ./data/benchmark_results/mmlu_real_results.json