SWE Arena

SWE Arena extends Chatbot Arena with code execution capabilities, enabling evaluation of LLM-generated programs across a wide range of outputs, from simple computations to complex visual interfaces.

What is SWE Arena?

SWE Arena introduces a plug-and-play code execution environment for Chatbot Arena. It enables direct evaluation of LLM capabilities in:

Why SWE Arena?

SWE Arena is designed to address the limitations of Chatbot Arena, particularly in terms of precise code evaluation. Human judgement on code generation is not always reliable [1, 2], and generally requires non-trivial knowledge of the language and its libraries. We consider this a significant limitation for the development of advanced AI systems.

Several concurrent features and projects explore how well LLMs can design complex programs when the code is actually executed. Claude Artifacts by Anthropic was one of the first features in this space to let users interact with LLM-generated frontend applications. v0 by Vercel also lets users ship LLM-generated applications built with frontend frameworks. Building on these, WebDev Arena by Chatbot Arena and Code Arena by Together AI focus on evaluating LLM-generated frontend applications. SWE Arena aims to extend this capability to a wider range of outputs: not only frontend applications, but also programs that run on backend servers and data analysis code.

Supported Outputs

SWE Arena can visualize various types of code execution outputs:

Technical Implementation

SWE Arena builds upon FastChat, the foundation of Chatbot Arena, providing seamless code execution capabilities. The implementation focuses on:

Expected Outcomes

SWE Arena aims to deliver several key outcomes:

Future Plans

SWE Arena is currently at an early stage of development. We plan to keep adding features toward the goal of Computer Intelligence.

Meanwhile, we are actively working with Chatbot Arena to integrate SWE Arena into their platform.

Frequently Asked Questions

Why is the code execution process of SWE Arena a bit slow?
Before executing code, SWE Arena parses it and installs the packages it needs so that the code can run. This setup step accounts for the delay.
What can SWE Arena not do?
Currently, SWE Arena supports only JavaScript, TypeScript, HTML, and Python. In addition, SWE Arena cannot execute code that uses desktop-level UIs (e.g., Tkinter, PyQt) or takes user input from the keyboard.
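
As a rough illustration of these limits for Python code, the sketch below flags desktop UI imports and calls to the built-in `input()`. It is an assumption-laden example rather than the check SWE Arena performs: the deny-list `DESKTOP_UI_MODULES` and the function `find_unsupported_features` are hypothetical names.

```python
import ast

# Hypothetical deny-list; the FAQ names Tkinter and PyQt only as examples.
DESKTOP_UI_MODULES = {"tkinter", "PyQt5", "PyQt6"}


def find_unsupported_features(source: str) -> list[str]:
    """Return a sorted list of reasons the code may not run in a sandbox without a desktop."""
    tree = ast.parse(source)
    problems = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            top = name.split(".")[0]
            if top in DESKTOP_UI_MODULES:
                problems.add(f"desktop UI import: {top}")
        # A direct call to the built-in input() waits for keyboard input.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "input"
        ):
            problems.add("keyboard input: input()")
    return sorted(problems)


snippet = "import tkinter as tk\nname = input('Your name: ')\n"
print(find_unsupported_features(snippet))
```

A static check like this only catches the obvious cases; code that imports a GUI toolkit dynamically or reads from `sys.stdin` would pass unnoticed.
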
How do I know whether SWE Arena will use my personally identifiable information (PII)?
SWE Arena collects user input, but we redact PII (e.g., API keys) using StarPII, an NER model trained on a large-scale code dataset to identify and mask PII.
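
The masking step can be sketched independently of the model. The example below assumes the detected entities arrive as dictionaries with `entity_group`, `start`, and `end` keys, which is the layout produced by a Hugging Face token-classification pipeline when an aggregation strategy is set. The function name `mask_pii`, the `KEY` label, and the placeholder format are illustrative assumptions, not SWE Arena's actual redaction code.

```python
def mask_pii(text: str, entities: list[dict]) -> str:
    """Replace each detected span with a placeholder such as <KEY>.

    Assumes the spans do not overlap.
    """
    # Process spans from right to left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{ent['entity_group']}>"
        text = text[: ent["start"]] + placeholder + text[ent["end"] :]
    return text


# In practice the spans would come from the NER model; here one is built by hand.
text = 'api_key = "sk-12345"'
start = text.index("sk-12345")
entities = [{"entity_group": "KEY", "start": start, "end": start + len("sk-12345")}]
print(mask_pii(text, entities))  # api_key = "<KEY>"
```

Replacing spans from the end of the string backward avoids recomputing offsets after each substitution.
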
Can I contribute to the project?
Yes! SWE Arena is an open-source project, and we welcome contributions. You can find our repository on GitHub and join our community through email. We appreciate help in various areas including development, testing, and documentation.