SWE Arena
SWE Arena extends Chatbot Arena with powerful code execution capabilities, enabling evaluation of LLM-generated programs across a wide range of outputs, from simple computations to complex visual interfaces.
What is SWE Arena?
SWE Arena introduces a plug-and-play code execution environment for Chatbot Arena. It enables direct evaluation of LLM capabilities in:
- General-purpose code execution across multiple languages
- Output visualization ranging from text and images to interactive UIs
Why SWE Arena?
SWE Arena is designed to address the limitations of Chatbot Arena, particularly in terms of precise code evaluation. Human judgment of code generation is not always reliable [1, 2] and generally requires non-trivial knowledge of the language and its libraries. We consider this a significant limitation for the development of advanced AI systems.
There are several concurrent features and projects exploring the capabilities of LLMs to design complex programs with execution. Claude Artifacts by Anthropic was one of the first features in this space to let users interact with LLM-generated frontend applications. v0 by Vercel also lets users ship LLM-generated frontend applications built with frontend frameworks. Building on these, WebDev Arena by Chatbot Arena and Code Arena by Together AI focus on evaluating LLM-generated frontend applications. SWE Arena aims to extend this capability to a wider range of outputs: not just frontend applications, but also programs that run on backend servers and data-analysis workloads.
Supported Outputs
SWE Arena can visualize various types of code execution outputs:
- Documents (Markdown or Plain Text)
- Websites (single HTML webpage)
- Scalable Vector Graphics (SVG) images
- Plots
- Tables
- React/Vue components
- Gradio applications
- Streamlit applications
- PyGame
- Mermaid diagrams
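As a concrete illustration of one of these output types, consider a model response that programmatically generates an SVG image. The sketch below (illustrative only, not from the SWE Arena codebase) produces a self-contained SVG document that an arena-style renderer could display directly:

```python
# Minimal sketch: a program whose output is an SVG image, one of the
# output types SWE Arena can render. Illustrative only; the helper name
# and defaults are assumptions, not part of SWE Arena.

def make_svg_circle(radius: int = 40, color: str = "steelblue") -> str:
    """Return a self-contained SVG document drawing a single circle."""
    size = radius * 2 + 20  # add a small margin around the circle
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<circle cx="{size // 2}" cy="{size // 2}" r="{radius}" fill="{color}"/>'
        "</svg>"
    )

if __name__ == "__main__":
    print(make_svg_circle())
```

Because the program's standard output is itself a complete SVG document, a renderer only needs to capture stdout and embed it in the page.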
Technical Implementation
SWE Arena builds upon FastChat, the foundation of Chatbot Arena, providing seamless code execution capabilities. The implementation focuses on:
- Code Execution: Secure, sandboxed environments using E2B for executing code in supported languages (Python, JavaScript, etc.).
- Dependency Management: Automatic installation of various dependencies via npm and pip (backed by uv).
- Code Editing: On-the-fly code modification, testing, and re-execution.
- One-Sided Chat: In the side-by-side chat mode, the user can choose to interact with only one of the models to further probe that model's capabilities.
- Interaction Tracking: Comprehensive logging of user interactions on the rendered UI.
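The execute-then-capture loop at the heart of such a system can be sketched with the standard library. Note that SWE Arena itself uses E2B sandboxes for isolation; this subprocess-based stand-in only illustrates the control flow and is NOT sandboxed:

```python
# Hypothetical sketch of an execute-and-capture loop. SWE Arena runs code
# inside E2B sandboxes; this subprocess version is for illustration only
# and provides no isolation.
import subprocess
import sys

def run_python_snippet(code: str, timeout: float = 10.0) -> tuple[str, str, int]:
    """Run a Python snippet in a fresh interpreter and capture its output."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,  # guard against runaway code
    )
    return proc.stdout, proc.stderr, proc.returncode

out, err, rc = run_python_snippet("print(21 * 2)")
```

The captured stdout, stderr, and exit code are what a frontend would then route to the appropriate renderer (plain text, SVG, plot, and so on).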
Expected Outcomes
SWE Arena aims to deliver several key outcomes:
- Leaderboard: A dynamic Elo rating system tracking LLM performance in execution-based code generation, providing transparent comparisons across different models.
- Human Preference Data: Collection of high-quality human feedback on code outputs, creating a valuable dataset for improving code-generation capabilities.
- Interactions and Trajectories: Comprehensive logs of human-AI interactions during code development and user trajectories on the rendered GUI.
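To make the leaderboard mechanics concrete, here is a minimal sketch of a single Elo update after one pairwise battle. The K factor and initial ratings are illustrative assumptions, not SWE Arena's actual configuration:

```python
# Minimal sketch of an Elo rating update, as used in arena-style
# leaderboards. K factor and starting ratings are assumed values.

def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one battle.

    score_a is 1.0 for an A win, 0.0 for a B win, 0.5 for a tie.
    """
    # Expected score of A under the logistic Elo model
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1000; model A wins one battle.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

The update is zero-sum (the winner gains exactly what the loser drops), which keeps the rating pool stable as more battles are logged.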
Future Plans
SWE Arena is currently at an early stage of development. We plan to continuously add features toward the goal of Computer Intelligence.
Meanwhile, we are actively working with Chatbot Arena to integrate SWE Arena into their platform.