CoEnv: Driving Embodied Multi‑Agent Collaboration via Compositional Environment

Under Review

Li Kang1,2★, Yutao Fan2,3★, Rui Li2,3★, Heng Zhou2,4★,
Yiran Qin5, Zhemeng Zhang1, Songtao Huang6, Xiufeng Song1, Zaibin Zhang7,
Bruno N.Y. Chen8, Zhenfei Yin9, Dongzhan Zhou2†, Wangmeng Zuo3†, Lei Bai2†

1Shanghai Jiao Tong University   2Shanghai AI Laboratory   3Harbin Institute of Technology   4USTC
5CUHK-Shenzhen   6Fudan University   7Dalian University of Technology
8Carnegie Mellon University   9University of Oxford
★Equal contribution   †Corresponding author

Multi-agent embodied systems hold promise for complex collaborative manipulation, yet face critical challenges in spatial coordination, temporal reasoning, and shared workspace awareness. Inspired by human collaboration where cognitive planning occurs separately from physical execution, we introduce the concept of compositional environment—a synergistic integration of real-world and simulation components that enables multiple robotic agents to perceive intentions and operate within a unified decision-making space. Building on this concept, we present CoEnv, a framework that leverages simulation for safe strategy exploration while ensuring reliable real-world deployment. CoEnv operates through three stages: real-to-sim scene reconstruction that digitizes physical workspaces, VLM-driven action synthesis supporting both real-time planning with high-level interfaces and iterative planning with code-based trajectory generation, and validated sim-to-real transfer with collision detection for safe deployment. Extensive experiments on challenging multi-arm manipulation benchmarks demonstrate CoEnv's effectiveness in achieving high task success rates and execution efficiency, establishing a new paradigm for multi-agent embodied AI.


Paper    Code   

Introduction

Claude Code has recently emerged as one of the most capable agentic coding tools, demonstrating strong abilities in autonomous reasoning and complex program synthesis. A natural question arises: can such a code agent go beyond the digital world and drive coordination among physical robots?

Claude Code meets Robotics

In CoEnv, we explore this possibility. Beyond generating trajectory code, Claude Code observes, thinks, and adapts within simulation—adjusting camera viewpoints to see through occlusions, evaluating outcomes, and refining plans iteratively, just as a human would walk around a workspace for a better view. This closed-loop reasoning enables robust multi-arm coordination before any physical execution.

As embodied AI scales toward multi-agent scenarios, a key challenge emerges: how do we get multiple robots to collaborate safely in a shared workspace—avoiding collisions, reasoning about each other's actions, and staying coordinated over long horizons?

Our answer is Compositional Environment—inspired by how humans mentally rehearse before acting. CoEnv fuses a real-world workspace with its simulation twin into a unified decision-making space: robots think and plan in simulation, then act in the physical world. This simple idea turns out to be surprisingly effective for multi-agent coordination.

Motivation

Motivation of CoEnv. Physical-world execution offers high fidelity but risks costly collisions, while the digital world enables cost-effective and safe testing. CoEnv composes both worlds through pose and action alignment, forming a compositional environment that supports real-to-sim reconstruction, simulation-conditioned action synthesis, and safe real-world deployment.

Method

CoEnv operates through a three-stage pipeline that transforms real-world observations into coordinated multi-agent actions:

  1. Real-to-Sim Scene Reconstruction: Multi-view RGBD observations are converted into a simulator-ready scene. We use Grounded SAM2 for object detection, FoundationPose for 6-DoF pose estimation with multi-view fusion, and ManiSkill (built on SAPIEN) as the simulation backend with iterative camera calibration refinement.

  2. Simulation-Conditioned Action Synthesis: A VLM-based planner decomposes task goals into hierarchical sub-goals and execution plans. Actions are synthesized in two complementary modes: Interactive mode uses closed-loop VLM feedback with adaptive camera control and checkpoint verification for real-time re-planning; Iterative mode leverages a code agent to generate complete multi-agent trajectories in a single program, refined through simulation feedback.

  3. Sim-to-Real Transfer: Validated trajectories are transferred to real robots via trajectory interpolation with collision volume verification, ensuring collision-free multi-agent execution.
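The sim-to-real transfer step above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it linearly densifies each arm's waypoints and approximates the collision volumes with bounding spheres, flagging any timestep where the two arms' spheres overlap. The function names, sphere radii, and toy waypoints are illustrative assumptions.

```python
import numpy as np

def interpolate(waypoints: np.ndarray, steps_per_segment: int = 10) -> np.ndarray:
    """Densify an (N, D) waypoint sequence by linear interpolation."""
    dense = []
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        for t in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            dense.append((1 - t) * a + t * b)
    dense.append(waypoints[-1])
    return np.array(dense)

def check_collision(traj_a, traj_b, radius_a=0.08, radius_b=0.08):
    """Sphere-based collision check on two (T, 3) position trajectories.
    Returns the indices of timesteps where the collision volumes overlap."""
    dists = np.linalg.norm(traj_a - traj_b, axis=1)
    return np.flatnonzero(dists < radius_a + radius_b)

# Toy example: two arms crossing paths in Cartesian space.
arm1 = interpolate(np.array([[0.0, 0.0, 0.2], [0.5, 0.5, 0.2]]))
arm2 = interpolate(np.array([[0.5, 0.0, 0.2], [0.0, 0.5, 0.2]]))
bad = check_collision(arm1, arm2)
print(f"{len(bad)} colliding timesteps out of {len(arm1)}")
```

In a real deployment the check would cover every link of every arm (not just end-effector points), and a trajectory that fails verification would be sent back to the planner for refinement rather than executed.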

Pipeline

Overview of CoEnv. The framework consists of three stages: (1) Real-to-Sim Scene Reconstruction using 3D asset generation, multi-view localization, and iterative camera calibration; (2) Simulation-Conditioned Action Synthesis with hierarchical task planning and two complementary execution modes (interactive and iterative); (3) Sim-to-Real Transfer with trajectory interpolation and collision volume verification.
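To make the interactive execution mode concrete, here is a self-contained sketch of the closed loop (observe in simulation, query the planner, execute, verify checkpoints). A trivial stub stands in for the VLM and for the simulator; every class and function name here is an illustrative assumption, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class SimEnv:
    """Toy stand-in for the simulation twin of the workspace."""
    placed: list = field(default_factory=list)

    def render(self):
        # A real environment would return camera images here.
        return {"placed": list(self.placed)}

    def execute(self, action):
        # A real environment would run the sub-goal in simulation.
        self.placed.append(action)

def stub_planner(observation, goal):
    """Stand-in for the VLM: return the next unfinished sub-goal."""
    for sub_goal in goal:
        if sub_goal not in observation["placed"]:
            return sub_goal
    return None  # every checkpoint verified

def interactive_loop(env, goal, max_steps=10):
    for _ in range(max_steps):
        action = stub_planner(env.render(), goal)
        if action is None:
            return True   # checkpoint verification passed
        env.execute(action)
    return False          # budget exhausted; trigger re-planning

env = SimEnv()
done = interactive_loop(env, goal=["pick cube A", "place on cube B"])
print(done, env.placed)
```

The iterative mode differs only in where the loop sits: instead of re-querying the planner each step, the code agent emits a full trajectory program and the loop wraps whole simulation rollouts.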

Tasks

We evaluate CoEnv on five challenging multi-arm manipulation tasks with increasing coordination complexity, using two-agent (Franka Research 3 × 2) and three-agent (Franka Research 3 + AgileX Piper dual-arm) configurations.

Tasks

Task demonstrations. Five multi-agent manipulation tasks with keyframes from successful real-world executions. Two-agent tasks: Cube Stacking, Ball Pickup, and Transfer Cylinder. Three-agent tasks: Place Cucumber and Brush Box.

Results

CoEnv achieves an overall 49% success rate across five challenging multi-agent manipulation tasks. The two execution modes exhibit complementary strengths: iterative mode excels at tasks requiring precise trajectory control (e.g., Cube Stacking: 9/10), while interactive mode dominates in complex multi-stage coordination (e.g., Brush Box: 7/10).

Autonomous multi-agent manipulation demo (4× speed).

Si: Subtask success rate (x/10).     SR: Task success rate (x/10).     Overall: Average across both modes.

| Task | Mode | Subtask Milestones | SR (x/10) | Overall (%) |
|---|---|---|---|---|
| Cube Stacking | Interactive | S1: 7/10, S2: 6/10 | 6/10 | 75% |
| | Iterative | S1: 10/10, S2: 9/10 | 9/10 | |
| Ball Pickup | Interactive | S1: 9/10, S2: 4/10 | 4/10 | 50% |
| | Iterative | S1: 6/10, S2: 6/10 | 6/10 | |
| Transfer Cylinder | Interactive | S1: 9/10, S2: 4/10, S3: 4/10 | 4/10 | 25% |
| | Iterative | S1: 6/10, S2: 2/10, S3: 1/10 | 1/10 | |
| Place Cucumber | Interactive | S1: 9/10, S2: 7/10, S3: 4/10 | 4/10 | 35% |
| | Iterative | S1: 8/10, S2: 8/10, S3: 3/10 | 3/10 | |
| Brush Box | Interactive | S1: 10/10, S2: 9/10, S3: 7/10 | 7/10 | 60% |
| | Iterative | S1: 8/10, S2: 8/10, S3: 5/10 | 5/10 | |
Sim-to-Real

Simulation vs. Real-world execution. Side-by-side comparison demonstrating high visual correspondence between simulated planning and real-world execution across multiple tasks, validating the effectiveness of CoEnv's compositional environment approach.

Ablation Study

We ablate two key mechanisms in CoEnv's interactive mode: Adaptive Camera Control and Checkpoint Verification. Both are indispensable: removing adaptive camera control lowers the average success rate from 50% to 30% (a 40% relative drop), while removing checkpoint verification lowers it to 20% (a 60% relative drop), as early errors cascade through later subtasks.

| Variant | Cube Stacking | Ball Pickup | Transfer Cyl. | Place Cuc. | Brush Box | Average |
|---|---|---|---|---|---|---|
| Full CoEnv | 6/10 | 4/10 | 4/10 | 4/10 | 7/10 | 50% |
| w/o Adaptive Camera | 5/10 | 6/10 | 0/10 | 4/10 | 0/10 | 30% |
| w/o Checkpoint Verification | 4/10 | 2/10 | 0/10 | 4/10 | 0/10 | 20% |

Scalable Data Collection

Beyond task execution, CoEnv reveals a practical pathway toward scalable multi-agent data generation. The iterative mode produces substantially more episodes per session (up to 11.7× better throughput), as the code agent generates complete trajectories in a single program with lower per-episode token cost. Environment resets account for only ~10–32% of total token consumption, indicating the vast majority of the budget is devoted to productive reasoning.

| Task | Interactive (episodes/session) | Iterative (episodes/session) | Reset tokens (%) |
|---|---|---|---|
| Cube Stacking | 1.5 | 17.5 | 31.57 |
| Brush Box | 2.5 | 9.5 | 10.54 |
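The 11.7× throughput figure follows directly from the Cube Stacking row, assuming the two mode columns are episodes per session:

```python
interactive_eps = 1.5   # Cube Stacking, interactive mode (episodes/session)
iterative_eps = 17.5    # Cube Stacking, iterative mode (episodes/session)
speedup = iterative_eps / interactive_eps
print(f"{speedup:.1f}x")  # ≈ 11.7x
```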
Data Collection

Scalable data collection pipeline. CoEnv first synthesizes and validates manipulation strategies in simulation, then transfers them to real robots for physical execution, collecting real-world multi-agent trajectory data that can be used to train generalist policies, providing an alternative to manual teleoperation.

Citation

If you find this project useful, please consider citing us.

@article{kang2026coenv,
    title={CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment},
    author={Li Kang and Yutao Fan and Rui Li and Heng Zhou and Yiran Qin and Zhemeng Zhang and Songtao Huang and Xiufeng Song and Zaibin Zhang and Bruno N.Y. Chen and Zhenfei Yin and Dongzhan Zhou and Wangmeng Zuo and Lei Bai},
    journal={arXiv preprint arXiv:xxxx.xxxxx},
    year={2026}
}