OpenSWE: Efficient SWE Environment Synthesis at Scale

arXiv Paper   |   Code   |   Environments & Scripts

OpenSWE is the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the project yields about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments.

This repository contains the official implementation of the OpenSWE pipeline—an extensible SWE-bench–like dataset generation framework that supports custom data schemas, parallel multi-machine building, and full evaluation integration with SWE-agent / SWE-bench-fork (with provided patches).

Highlights

  • Unprecedented Scale with Full Transparency: We release 45,320 executable environments from 12.8k repositories at a construction cost of $891K, with complete infrastructure including all Dockerfiles, evaluation scripts, and the distributed synthesis pipeline, enabling reproducibility and community-driven improvements.

  • Quality-Centric Filtering via Difficulty-Aware Curation: A filtering pipeline characterizes environment difficulty to filter out unsolvable and trivially simple instances (e.g., PR–Issue misalignment, triviality). With an additional $576K investment in trajectory sampling and curation, we obtain about 13,000 curated trajectories from roughly 9,000 high-quality environments.

  • Strong Empirical Validation: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among SFT-based methods built on the Qwen2.5 series. Models trained on OpenSWE consistently outperform those trained on SWE-rebench across all scales and scaffolds, with a log-linear data-scaling trend that shows no saturation. SWE-focused training also yields substantial out-of-domain improvements (e.g., up to 12 points on MATH-500 and over 5 points on science benchmarks) without degrading factual recall.

News

  • Paper: OpenSWE (daVinci-Env) introduces the largest fully transparent SWE environment synthesis framework, with multi-agent pipeline design and scaling/curation analysis.

  • SOTA: OpenSWE-32B / OpenSWE-72B set new SOTA among Qwen2.5 SFT methods on SWE-bench Verified (62.4% / 66.0%).

Performance

Environment scale comparison

Dataset                  # Repos   # Images   # Tasks   Source
R2E-Gym (Subset)         10        2.4k       4.6k      Synthetic
SWE-gym                  11        2.4k       2.4k      Real
SWE-rebench              3.5k      21.3k      21.3k     Real
SWE-rebench (filtered)   3.3k      18.8k      18.8k     Real
OpenSWE (ours)           12.8k     45.3k      45.3k     Real

SWE-bench Verified (Pass@1)

Model                  Backbone                  Scaffold    Score
SWE-Master-32B-RL      Qwen2.5-Coder-32B-Inst.   R2E-Gym     61.4
daVinci-Dev-32B        Qwen2.5-32B-Base          SWE-Agent   56.1
OpenSWE-32B (Ours)     Qwen2.5-32B-Base          OpenHands   59.8
OpenSWE-32B (Ours)     Qwen2.5-32B-Base          SWE-Agent   62.4
daVinci-Dev-72B        Qwen2.5-72B-Base          SWE-Agent   58.5
OpenSWE-72B (Ours)     Qwen2.5-72B-Base          OpenHands   65.0
OpenSWE-72B (Ours)     Qwen2.5-72B-Base          SWE-Agent   66.0

Impact of environment source (SWE-bench Verified Pass@1)

Training Data           SWE-Agent 32B   SWE-Agent 72B   CodeAct 32B   CodeAct 72B
SWE-rebench             50.2%           63.4%           51.4%         62.4%
OpenSWE                 62.4%           66.0%           59.8%         65.0%
SWE-rebench + OpenSWE   61.4%           68.0%           60.3%         65.5%

Training on OpenSWE alone yields large improvements over SWE-rebench across all model sizes and scaffolds; combining with SWE-rebench further improves 72B (e.g., 68.0% SWE-Agent). Data scaling analysis shows log-linear improvement with no saturation (see paper for curves). General capability evaluation shows gains on code (e.g., HumanEval +29), math (e.g., MATH-500 +12.2 for 72B), and science benchmarks without degrading factual recall.

Quick Start

1. Data schema

Collect your dataset in the following schema:

Field               Type   Description
instance_id         str    Unique identifier for the sample.
repo                str    Full GitHub repo name (e.g., psf/requests).
base_commit         str    SHA of the commit immediately before the PR's first change.
end_commit          str    SHA of the final commit in the PR.
problem_statement   str    Issue description or problem to solve.
patch               str    Diff of changes to functional (non-test) code.
test_patch          str    Diff of changes to the test suite.
language            str    Primary programming language of the repo.
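
A record in this schema is one JSON object per line of a JSONL file. The sketch below shows a hypothetical, schema-conformant record (the repo, commit SHAs, and diff contents are illustrative placeholders, not real dataset entries):

```python
import json

# Hypothetical example record following the OpenSWE data schema.
# All values are illustrative placeholders.
record = {
    "instance_id": "psf__requests-0001",
    "repo": "psf/requests",
    "base_commit": "0000000000000000000000000000000000000000",
    "end_commit": "1111111111111111111111111111111111111111",
    "problem_statement": "Session.get ignores the timeout parameter.",
    "patch": "diff --git a/src/requests/sessions.py b/src/requests/sessions.py\n...",
    "test_patch": "diff --git a/tests/test_sessions.py b/tests/test_sessions.py\n...",
    "language": "Python",
}

REQUIRED_FIELDS = {
    "instance_id", "repo", "base_commit", "end_commit",
    "problem_statement", "patch", "test_patch", "language",
}
assert set(record) == REQUIRED_FIELDS
assert all(isinstance(v, str) for v in record.values())

# One JSON object per line, as expected for a JSONL dataset file.
line = json.dumps(record)
```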

2. (Recommended) Prepare system

  • Download all git repositories into a repocache directory.
  • Build base Docker images with scripts/prepare_baseimg.py.

3. Apply patches for SWE-bench evaluation

Before running evaluation, apply the provided patches to your SWE-agent and SWE-bench-fork checkouts.

Replace /path/to/openswe with your OpenSWE repo root. On conflicts, use git apply --reject and resolve the resulting .rej files. Apply each patch only once per repository.
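
To illustrate the conflict-resolution mechanics, the sketch below applies a synthetic patch to a throwaway repo with `git apply --reject` (the repo and patch are invented for the demo and are not the actual OpenSWE patches; `git` must be on your PATH):

```python
import pathlib
import subprocess
import tempfile

# Throwaway demo repo and patch -- the real patches live in the OpenSWE repo.
tmp = pathlib.Path(tempfile.mkdtemp())
subprocess.run(["git", "init", "-q", str(tmp)], check=True)
(tmp / "a.txt").write_text("hello\n")

patch = """\
diff --git a/a.txt b/a.txt
--- a/a.txt
+++ b/a.txt
@@ -1 +1 @@
-hello
+hello world
"""
(tmp / "fix.patch").write_text(patch)

# --reject applies every hunk it can and writes conflicting hunks to *.rej files.
subprocess.run(["git", "apply", "--reject", "fix.patch"], cwd=tmp, check=False)

rejects = list(tmp.rglob("*.rej"))  # resolve these by hand if any appear
print((tmp / "a.txt").read_text())
```

Here the patch applies cleanly, so no `.rej` files appear; with the real patches, any rejected hunks must be merged manually before evaluation.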

4. Configure and run

Edit examples/run.sh (set OPENSWE_ROOT, DATA_PATH, OUTPUT_DIR, SETUP_DIR, RESULT_DIR, API keys, and DOCKER_REPOSITORY), then:

bash examples/run.sh

For multi-machine building, see Parallel Task Execution System.

Troubleshooting

  • Dataset missing: Ensure your dataset JSONL exists at the path set in DATA_PATH; check schema matches the table above.
  • Patch conflicts: Resolve .rej files after git apply --reject for swe-agent and swe-bench-fork.

Acknowledgement

OpenSWE is inspired by SWE-Rebench and SWE-Factory. We thank these teams for their open-source contributions.

License

This project is licensed under AGPL-3.0. See LICENSE for details.

Citation

If you find OpenSWE useful, please cite:

@article{openswe2026,
  title={daVinci-Env: Open SWE Environment Synthesis at Scale},
  author={Dayuan Fu and Shenyu Wu and Yunze Wu and Zerui Peng and Yaxing Huang and Jie Sun and Ji Zeng and Mohan Jiang and Lin Zhang and Yukun Li and Jiarui Hu and Liming Liu and Jinlong Hou and Pengfei Liu},
  journal={arXiv preprint},
  year={2026}
}