CoEnv: Driving Embodied Multi‑Agent Collaboration via Compositional Environment

Under Review

Li Kang1,2★, Yutao Fan2,3★, Rui Li2,3★, Heng Zhou2,4★,
Yiran Qin5, Zhemeng Zhang1, Songtao Huang6, Xiufeng Song1, Zaibin Zhang7,
Bruno N.Y. Chen8, Zhenfei Yin9, Dongzhan Zhou2†, Wangmeng Zuo3†, Lei Bai2†

1Shanghai Jiao Tong University   2Shanghai AI Laboratory   3Harbin Institute of Technology   4USTC
5CUHK-Shenzhen   6Fudan University   7Dalian University of Technology
8Carnegie Mellon University   9University of Oxford
★Equal contribution   †Corresponding author

Multi-agent embodied systems hold promise for complex collaborative manipulation, yet face critical challenges in spatial coordination, temporal reasoning, and shared workspace awareness. Inspired by human collaboration where cognitive planning occurs separately from physical execution, we introduce the concept of compositional environment—a synergistic integration of real-world and simulation components that enables multiple robotic agents to perceive intentions and operate within a unified decision-making space. Building on this concept, we present CoEnv, a framework that leverages simulation for safe strategy exploration while ensuring reliable real-world deployment. CoEnv operates through three stages: real-to-sim scene reconstruction that digitizes physical workspaces, VLM-driven action synthesis supporting both real-time planning with high-level interfaces and iterative planning with code-based trajectory generation, and validated sim-to-real transfer with collision detection for safe deployment. Extensive experiments on challenging multi-arm manipulation benchmarks demonstrate CoEnv's effectiveness in achieving high task success rates and execution efficiency, establishing a new paradigm for multi-agent embodied AI.


Paper    Code   

Introduction

Claude Code has recently emerged as one of the most capable agentic coding tools, demonstrating strong abilities in autonomous reasoning and complex program synthesis. A natural question arises: can such a code agent go beyond the digital world and drive coordination among physical robots?

Claude Code meets Robotics

In CoEnv, we explore this possibility. Beyond generating trajectory code, Claude Code observes, thinks, and adapts within simulation—adjusting camera viewpoints to see through occlusions, evaluating outcomes, and refining plans iteratively, just as a human would walk around a workspace for a better view. This closed-loop reasoning enables robust multi-arm coordination before any physical execution.

As embodied AI scales toward multi-agent scenarios, a key challenge emerges: how do we get multiple robots to collaborate safely in a shared workspace—avoiding collisions, reasoning about each other's actions, and staying coordinated over long horizons?

Our answer is Compositional Environment—inspired by how humans mentally rehearse before acting. CoEnv fuses a real-world workspace with its simulation twin into a unified decision-making space: robots think and plan in simulation, then act in the physical world. This simple idea turns out to be surprisingly effective for multi-agent coordination.

Motivation

Motivation of CoEnv. Physical-world execution offers high fidelity but risks costly collisions, while the digital world enables cost-effective and safe testing. CoEnv composes both worlds through pose and action alignment, forming a compositional environment that supports real-to-sim reconstruction, simulation-conditioned action synthesis, and safe real-world deployment.

Method

CoEnv operates through a three-stage pipeline that transforms real-world observations into coordinated multi-agent actions:

  1. Real-to-Sim Scene Reconstruction: Multi-view RGBD observations are converted into a simulator-ready scene. We use Grounded SAM2 for object detection, FoundationPose for 6-DoF pose estimation with multi-view fusion, and ManiSkill (built on SAPIEN) as the simulation backend with iterative camera calibration refinement.

  2. Simulation-Conditioned Action Synthesis: A VLM-based planner decomposes task goals into hierarchical sub-goals and execution plans. Actions are synthesized in two complementary modes: Interactive mode uses closed-loop VLM feedback with adaptive camera control and checkpoint verification for real-time re-planning; Iterative mode leverages a code agent to generate complete multi-agent trajectories in a single program, refined through simulation feedback.

  3. Sim-to-Real Transfer: Validated trajectories are transferred to real robots via trajectory interpolation with collision volume verification, ensuring collision-free multi-agent execution.
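The sim-to-real transfer step above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it linearly densifies each arm's waypoints and approximates the collision volumes with bounding spheres, flagging any timestep where the two arms' spheres overlap. The function names, sphere radii, and toy waypoints are illustrative assumptions.

```python
import numpy as np

def interpolate(waypoints: np.ndarray, steps_per_segment: int = 10) -> np.ndarray:
    """Densify an (N, D) waypoint sequence by linear interpolation."""
    dense = []
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        for t in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            dense.append((1 - t) * a + t * b)
    dense.append(waypoints[-1])
    return np.array(dense)

def check_collision(traj_a, traj_b, radius_a=0.08, radius_b=0.08):
    """Sphere-based collision check on two (T, 3) position trajectories.
    Returns the indices of timesteps where the collision volumes overlap."""
    dists = np.linalg.norm(traj_a - traj_b, axis=1)
    return np.flatnonzero(dists < radius_a + radius_b)

# Toy example: two arms crossing paths in Cartesian space.
arm1 = interpolate(np.array([[0.0, 0.0, 0.2], [0.5, 0.5, 0.2]]))
arm2 = interpolate(np.array([[0.5, 0.0, 0.2], [0.0, 0.5, 0.2]]))
bad = check_collision(arm1, arm2)
print(f"{len(bad)} colliding timesteps out of {len(arm1)}")
```

In a real deployment the check would cover every link of every arm (not just end-effector points), and a trajectory that fails verification would be sent back to the planner for refinement rather than executed.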

Pipeline

Overview of CoEnv. The framework consists of three stages: (1) Real-to-Sim Scene Reconstruction using 3D asset generation, multi-view localization, and iterative camera calibration; (2) Simulation-Conditioned Action Synthesis with hierarchical task planning and two complementary execution modes (interactive and iterative); (3) Sim-to-Real Transfer with trajectory interpolation and collision volume verification.
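To make the interactive execution mode concrete, here is a self-contained sketch of the closed loop (observe in simulation, query the planner, execute, verify checkpoints). A trivial stub stands in for the VLM and for the simulator; every class and function name here is an illustrative assumption, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class SimEnv:
    """Toy stand-in for the simulation twin of the workspace."""
    placed: list = field(default_factory=list)

    def render(self):
        # A real environment would return camera images here.
        return {"placed": list(self.placed)}

    def execute(self, action):
        # A real environment would run the sub-goal in simulation.
        self.placed.append(action)

def stub_planner(observation, goal):
    """Stand-in for the VLM: return the next unfinished sub-goal."""
    for sub_goal in goal:
        if sub_goal not in observation["placed"]:
            return sub_goal
    return None  # every checkpoint verified

def interactive_loop(env, goal, max_steps=10):
    for _ in range(max_steps):
        action = stub_planner(env.render(), goal)
        if action is None:
            return True   # checkpoint verification passed
        env.execute(action)
    return False          # budget exhausted; trigger re-planning

env = SimEnv()
done = interactive_loop(env, goal=["pick cube A", "place on cube B"])
print(done, env.placed)
```

The iterative mode differs only in where the loop sits: instead of re-querying the planner each step, the code agent emits a full trajectory program and the loop wraps whole simulation rollouts.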

Tasks

We evaluate CoEnv on five challenging multi-arm manipulation tasks with increasing coordination complexity, using two-agent (Franka Research 3 × 2) and three-agent (Franka Research 3 + AgileX Piper dual-arm) configurations.

Tasks

Task demonstrations. Five multi-agent manipulation tasks with keyframes from successful real-world executions. Two-agent tasks: Cube Stacking, Ball Pickup, and Transfer Cylinder. Three-agent tasks: Place Cucumber and Brush Box.

Results

CoEnv achieves an overall 49% success rate across five challenging multi-agent manipulation tasks. The two execution modes exhibit complementary strengths: iterative mode excels at tasks requiring precise trajectory control (e.g., Cube Stacking: 9/10), while interactive mode dominates in complex multi-stage coordination (e.g., Brush Box: 7/10).

Autonomous multi-agent manipulation demo (4× speed).

Si: Subtask success rate (x/10).     SR: Task success rate (x/10).     Overall: Average across both modes.

| Task | Mode | Subtask Milestones | SR (x/10) | Overall (%) |
|---|---|---|---|---|
| Cube Stacking | Interactive | S1: 7/10, S2: 6/10 | 6/10 | 75% |
| | Iterative | S1: 10/10, S2: 9/10 | 9/10 | |
| Ball Pickup | Interactive | S1: 9/10, S2: 4/10 | 4/10 | 50% |
| | Iterative | S1: 6/10, S2: 6/10 | 6/10 | |
| Transfer Cylinder | Interactive | S1: 9/10, S2: 4/10, S3: 4/10 | 4/10 | 25% |
| | Iterative | S1: 6/10, S2: 2/10, S3: 1/10 | 1/10 | |
| Place Cucumber | Interactive | S1: 9/10, S2: 7/10, S3: 4/10 | 4/10 | 35% |
| | Iterative | S1: 8/10, S2: 8/10, S3: 3/10 | 3/10 | |
| Brush Box | Interactive | S1: 10/10, S2: 9/10, S3: 7/10 | 7/10 | 60% |
| | Iterative | S1: 8/10, S2: 8/10, S3: 5/10 | 5/10 | |
Sim-to-Real

Simulation vs. Real-world execution. Side-by-side comparison demonstrating high visual correspondence between simulated planning and real-world execution across multiple tasks, validating the effectiveness of CoEnv's compositional environment approach.

Ablation Study

We ablate two key mechanisms in CoEnv's interactive mode: Adaptive Camera Control and Checkpoint Verification. Both are indispensable: removing adaptive camera control lowers the average success rate from 50% to 30% (a 40% relative drop), while removing checkpoint verification lowers it to 20% (a 60% relative drop), as early errors cascade through later subtasks.

| Variant | Cube Stacking | Ball Pickup | Transfer Cyl. | Place Cuc. | Brush Box | Average |
|---|---|---|---|---|---|---|
| Full CoEnv | 6/10 | 4/10 | 4/10 | 4/10 | 7/10 | 50% |
| w/o Adaptive Camera | 5/10 | 6/10 | 0/10 | 4/10 | 0/10 | 30% |
| w/o Checkpoint Verification | 4/10 | 2/10 | 0/10 | 4/10 | 0/10 | 20% |

Scalable Data Collection

Beyond task execution, CoEnv reveals a practical pathway toward scalable multi-agent data generation. The iterative mode produces substantially more episodes per session (up to 11.7× better throughput), as the code agent generates complete trajectories in a single program with lower per-episode token cost. Environment resets account for only ~10–32% of total token consumption, indicating the vast majority of the budget is devoted to productive reasoning.

| Task | Interactive (episodes/session) | Iterative (episodes/session) | Reset tokens (%) |
|---|---|---|---|
| Cube Stacking | 1.5 | 17.5 | 31.57 |
| Brush Box | 2.5 | 9.5 | 10.54 |
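The 11.7× throughput figure follows directly from the Cube Stacking row, assuming the two mode columns are episodes per session:

```python
interactive_eps = 1.5   # Cube Stacking, interactive mode (episodes/session)
iterative_eps = 17.5    # Cube Stacking, iterative mode (episodes/session)
speedup = iterative_eps / interactive_eps
print(f"{speedup:.1f}x")  # ≈ 11.7x
```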
Data Collection

Scalable data collection pipeline. CoEnv first synthesizes and validates manipulation strategies in simulation, then transfers them to real robots for physical execution, collecting real-world multi-agent trajectory data that can be used to train generalist policies, providing an alternative to manual teleoperation.

Citation

If you find this project useful, please consider citing us.

@article{kang2026coenv,
    title={CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment},
    author={Li Kang and Yutao Fan and Rui Li and Heng Zhou and Yiran Qin and Zhemeng Zhang and Songtao Huang and Xiufeng Song and Zaibin Zhang and Bruno N.Y. Chen and Zhenfei Yin and Dongzhan Zhou and Wangmeng Zuo and Lei Bai},
    journal={arXiv preprint arXiv:xxxx.xxxxx},
    year={2026}
}