SWE Arena
SWE Arena extends Chatbot Arena with powerful code execution capabilities, enabling evaluation of LLM-generated programs across a wide range of outputs, from simple computations to complex visual interfaces.
What is SWE Arena?
SWE Arena introduces a plug-and-play code execution environment for Chatbot Arena. It enables direct evaluation of LLM capabilities in:
- General-purpose code execution across multiple languages
- Output visualization ranging from text and images to interactive UIs
Why SWE Arena?
SWE Arena is designed to address the limitations of Chatbot Arena, particularly in terms of precise code evaluation. Human judgment of code generation is not always reliable [1, 2] and generally requires non-trivial knowledge of the language and its libraries. We consider this a significant limitation for the development of advanced AI systems.
There are several concurrent features and projects exploring the capabilities of LLMs to design complex programs with execution. Claude Artifacts by Anthropic was one of the first features in this space to let users interact with LLM-generated frontend applications. v0 by Vercel also lets users ship LLM-generated frontend applications built with frontend frameworks. Building on these, WebDev Arena by Chatbot Arena and Code Arena by Together AI focus on evaluating LLM-generated frontend applications. SWE Arena aims to extend this capability to a wider range of outputs: not just frontend applications, but also programs that run on backend servers and data-analysis workloads.
Supported Outputs
SWE Arena can visualize various types of code execution outputs:
- Documents (Markdown or Plain Text)
- Websites (single HTML webpage)
- Scalable Vector Graphics (SVG) images
- Plots
- Tables
- React/Vue components
- Gradio applications
- Streamlit applications
- PyGame
- Mermaid diagrams
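As a concrete illustration of one of these output types, consider a model response that programmatically generates an SVG image. The sketch below (illustrative only, not from the SWE Arena codebase) produces a self-contained SVG document that an arena-style renderer could display directly:

```python
# Minimal sketch: a program whose output is an SVG image, one of the
# output types SWE Arena can render. Illustrative only; the helper name
# and defaults are assumptions, not part of SWE Arena.

def make_svg_circle(radius: int = 40, color: str = "steelblue") -> str:
    """Return a self-contained SVG document drawing a single circle."""
    size = radius * 2 + 20  # add a small margin around the circle
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<circle cx="{size // 2}" cy="{size // 2}" r="{radius}" fill="{color}"/>'
        "</svg>"
    )

if __name__ == "__main__":
    print(make_svg_circle())
```

Because the program's standard output is itself a complete SVG document, a renderer only needs to capture stdout and embed it in the page.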
Technical Implementation
SWE Arena builds upon FastChat, the foundation of Chatbot Arena, providing seamless code execution capabilities. The implementation focuses on:
- Code Execution: Secure, sandboxed environments using E2B for executing code in supported languages (Python, JavaScript, etc.).
- Dependency Management: Automatic installation of various dependencies via npm and pip (backed by uv).
- Code Editing: On-the-fly code modification, testing, and re-execution.
- One-Sided Chat: In the side-by-side chat mode, the user can choose to interact with only one of the models to further probe that model's capabilities.
- Interaction Tracking: Comprehensive logging of user interactions on the rendered UI.
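The execute-then-capture loop at the heart of such a system can be sketched with the standard library. Note that SWE Arena itself uses E2B sandboxes for isolation; this subprocess-based stand-in only illustrates the control flow and is NOT sandboxed:

```python
# Hypothetical sketch of an execute-and-capture loop. SWE Arena runs code
# inside E2B sandboxes; this subprocess version is for illustration only
# and provides no isolation.
import subprocess
import sys

def run_python_snippet(code: str, timeout: float = 10.0) -> tuple[str, str, int]:
    """Run a Python snippet in a fresh interpreter and capture its output."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,  # guard against runaway code
    )
    return proc.stdout, proc.stderr, proc.returncode

out, err, rc = run_python_snippet("print(21 * 2)")
```

The captured stdout, stderr, and exit code are what a frontend would then route to the appropriate renderer (plain text, SVG, plot, and so on).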
Expected Outcomes
SWE Arena aims to deliver several key outcomes:
- Leaderboard: A dynamic Elo rating system tracking LLM performance in execution-based code generation, providing transparent comparisons across different models.
- Human Preference Data: Collection of high-quality human feedback on code outputs, creating a valuable dataset for improving code-generation capabilities.
- Interactions and Trajectories: Comprehensive logs of human-AI interactions during code development and user trajectories on the rendered GUI.
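To make the leaderboard mechanics concrete, here is a minimal sketch of a single Elo update after one pairwise battle. The K factor and initial ratings are illustrative assumptions, not SWE Arena's actual configuration:

```python
# Minimal sketch of an Elo rating update, as used in arena-style
# leaderboards. K factor and starting ratings are assumed values.

def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one battle.

    score_a is 1.0 for an A win, 0.0 for a B win, 0.5 for a tie.
    """
    # Expected score of A under the logistic Elo model
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1000; model A wins one battle.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

The update is zero-sum (the winner gains exactly what the loser drops), which keeps the rating pool stable as more battles are logged.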
Future Plans
SWE Arena is currently at an early stage of development. We plan to continuously add features toward the goal of Computer Intelligence.
Meanwhile, we are actively working with Chatbot Arena to integrate SWE Arena into their platform.