change name to SWE-Model-Arena
Files changed:
- .github/workflows/hf_sync.yml +1 -1
- README.md +8 -8
- app.py +3 -3
.github/workflows/hf_sync.yml
CHANGED

@@ -30,6 +30,6 @@ jobs:
       env:
         HF_TOKEN: ${{ secrets.HF_TOKEN }}
       run: |
-        git remote add huggingface https://user:${HF_TOKEN}@huggingface.co/spaces/SWE-Arena/
+        git remote add huggingface https://user:${HF_TOKEN}@huggingface.co/spaces/SWE-Arena/SWE-Model-Arena
         git fetch huggingface
         git push huggingface main --force
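The `run` step above is what keeps the Space in sync: it registers the Space's git repository as a remote, authenticates with the `HF_TOKEN` secret, and force-pushes `main` to it. As a minimal sketch of the same sequence, assuming `HF_TOKEN` is exported in the environment and the Space path `SWE-Arena/SWE-Model-Arena` matches the workflow, the commands could be reproduced locally like this (in CI they run directly in the shell):

```python
import os
import subprocess

# Assumption: HF_TOKEN holds a Hugging Face write token for the target Space.
token = os.environ["HF_TOKEN"]
remote_url = f"https://user:{token}@huggingface.co/spaces/SWE-Arena/SWE-Model-Arena"

# Register the Space as a git remote; ignore failure if it is already configured.
subprocess.run(["git", "remote", "add", "huggingface", remote_url], check=False)

# Fetch the Space's current history, then overwrite its main branch,
# mirroring the workflow's `git fetch` + `git push --force` step.
subprocess.run(["git", "fetch", "huggingface"], check=True)
subprocess.run(["git", "push", "huggingface", "main", "--force"], check=True)
```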
README.md
CHANGED

@@ -1,5 +1,5 @@
 ---
-title: SWE-Arena
+title: SWE-Model-Arena
 emoji: 🎯
 colorFrom: blue
 colorTo: purple
@@ -11,9 +11,9 @@ pinned: false
 short_description: Chatbot arena for software engineering tasks
 ---
 
-# SWE-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
+# SWE-Model-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
 
-Welcome to **SWE-Arena**, an open-source platform designed for evaluating software engineering-focused foundation models (FMs), particularly large language models (LLMs). SWE-Arena benchmarks models in iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.
+Welcome to **SWE-Model-Arena**, an open-source platform designed for evaluating software engineering-focused foundation models (FMs), particularly large language models (LLMs). SWE-Model-Arena benchmarks models in iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.
 
 ## Key Features
 
@@ -26,9 +26,9 @@ Welcome to **SWE-Arena**, an open-source platform designed for evaluating softwa
 - Consistency score: Quantify model determinism and reliability through self-play matches
 - **Transparent, Open-Source Leaderboard**: View real-time model rankings across diverse SE workflows with full transparency.
 
-## Why SWE-Arena?
+## Why SWE-Model-Arena?
 
-Existing evaluation frameworks (e.g. [LMArena](https://lmarena.ai)) often don't address the complex, iterative nature of SE tasks. SWE-Arena fills critical gaps by:
+Existing evaluation frameworks (e.g. [LMArena](https://lmarena.ai)) often don't address the complex, iterative nature of SE tasks. SWE-Model-Arena fills critical gaps by:
 
 - Supporting context-rich, multi-turn evaluations to capture iterative workflows
 - Integrating repository-level context through RepoChat to simulate real-world development scenarios
@@ -51,7 +51,7 @@ Existing evaluation frameworks (e.g. [LMArena](https://lmarena.ai)) often don't
 
 ### Usage
 
-1. Navigate to the [SWE-Model-Arena platform](https://huggingface.co/spaces/SE-Arena/
+1. Navigate to the [SWE-Model-Arena platform](https://huggingface.co/spaces/SE-Arena/SWE-Model-Arena)
 2. Sign in with your Hugging Face account
 3. Enter your SE task prompt (optionally include a repository URL for RepoChat)
 4. Engage in multi-round interactions and vote on model performance
@@ -66,7 +66,7 @@ We welcome contributions from the community! Here's how you can help:
 
 ## Privacy Policy
 
-Your interactions are anonymized and used solely for improving SWE-Arena and FM benchmarking. By using SWE-Arena, you agree to our Terms of Service.
+Your interactions are anonymized and used solely for improving SWE-Model-Arena and FM benchmarking. By using SWE-Model-Arena, you agree to our Terms of Service.
 
 ## Future Plans
@@ -78,4 +78,4 @@ Your interactions are anonymized and used solely for improving SWE-Arena and FM
 
 ## Contact
 
-For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/
+For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/SWE-Model-Arena/issues/new) in this repository. We welcome your contributions and suggestions!
app.py
CHANGED

@@ -676,7 +676,7 @@ with gr.Blocks(js=clickable_links_js) as app:
     leaderboard_intro = gr.Markdown(
         """
 # 🏆 FM4SE Leaderboard: Community-Driven Evaluation of Top Foundation Models (FMs) in Software Engineering (SE) Tasks
-The SWE-Arena is an open-source platform designed to evaluate foundation models through human preference, fostering transparency and collaboration. This platform aims to empower the SE community to assess and compare the performance of leading FMs in related tasks. For technical details, check out our [paper](https://arxiv.org/abs/2502.01860).
+The SWE-Model-Arena is an open-source platform designed to evaluate foundation models through human preference, fostering transparency and collaboration. This platform aims to empower the SE community to assess and compare the performance of leading FMs in related tasks. For technical details, check out our [paper](https://arxiv.org/abs/2502.01860).
         """,
         elem_classes="leaderboard-intro",
     )
@@ -717,7 +717,7 @@ with gr.Blocks(js=clickable_links_js) as app:
     # Add a citation block in Markdown
     citation_component = gr.Markdown(
         """
-Made with ❤️ for SWE-Arena. If this work is useful to you, please consider citing:
+Made with ❤️ for SWE-Model-Arena. If this work is useful to you, please consider citing:
 ```
 @inproceedings{zhao2025se,
 title={SWE-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering},
@@ -731,7 +731,7 @@ with gr.Blocks(js=clickable_links_js) as app:
     # Add title and description as a Markdown component
     arena_intro = gr.Markdown(
         f"""
-# ⚔️ SWE-Arena: Explore and Test Top FMs with SE Tasks by Community Voting
+# ⚔️ SWE-Model-Arena: Explore and Test Top FMs with SE Tasks by Community Voting
 
 ## 📜How It Works
 - **Blind Comparison**: Submit a SE-related query to two anonymous FMs randomly selected from up to {len(available_models)} top models from OpenAI, Gemini, Grok, Claude, Deepseek, Qwen, Llama, Mistral, and others.