ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

Under Review

Yiran Qin1,6*, Jiahua Ma2*, Li Kang3*, Wenzhan Li2*,
Yihang Jiao2, Xin Wen2, Xiufeng Song3, Heng Zhou4, Jiwen Yu5,
Zhenfei Yin6, Xihui Liu5, Philip Torr6, Yilun Du7, Ruimao Zhang2†

1CUHK-Shenzhen   2Sun Yat-sen University   3Shanghai Jiao Tong University   4USTC
5The University of Hong Kong   6University of Oxford   7Harvard University
*Equal contribution   †Corresponding author

Recent advancements in foundation models have greatly enhanced the capabilities of robotics, yet acquiring large-scale, high-quality training data remains a challenge due to the extensive manual effort required and the limited coverage of real-world environments. We propose Compositional Simulation, a hybrid approach combining classical simulation and neural simulation to generate accurate action–video pairs while maintaining real-world consistency. Our method utilizes a closed-loop real–sim–real data augmentation pipeline, leveraging a small amount of real-world data to produce diverse, large-scale training datasets. A neural simulator transforms classical simulation videos into realistic representations, improving the accuracy of policy models trained for real-world deployment. Experiments demonstrate that our method significantly reduces the sim2real domain gap, resulting in higher success rates in real-world policy training and offering a scalable solution for bridging simulation and reality in robotics.


Paper    Code    Video   

Introduction

With the rapid advancements in foundation models—from large language models to video generation world models—robotics has entered an era where data-driven paradigms can enable increasingly autonomous and generalizable behaviors. Yet a fundamental bottleneck remains: how do we obtain sufficient high-quality training data for robot policies?

Collecting real-world demonstrations yields high-quality samples but is expensive and poorly scalable. Classical simulators (e.g., MuJoCo, Isaac Gym, SAPIEN) generate massive action-video pairs cheaply and at scale, but suffer from appearance and physics gaps that derail direct sim-to-real transfer. Neural simulators based on video generation can narrow visual gaps, but hallucinations and weak action consistency undermine their usefulness for policy training.

In this work, we advocate for Compositional Simulation—a hybrid paradigm that composes the precise action control of classical simulators with the photorealism of neural simulators. Through a closed-loop real–sim–real pipeline, as few as 10 real-world demonstrations suffice to bootstrap the generation of large-scale, photorealistic training data, substantially boosting real-world policy success and generalization.

Motivation

Motivation of Compositional Simulation. There are three main sources of real-world robotic data: (1) direct human collection, which yields high-quality samples but cannot scale; (2) classical simulators, which generate large datasets but suffer from appearance and physics gaps to reality; and (3) neural simulators trained on real data, which reduce these gaps but struggle with action-conditioned video generation. Our Compositional Simulation bridges classical and neural simulation via compositional dynamic video generation.

Method

Compositional Simulation operates through a closed-loop real–sim–real pipeline that transforms a small amount of real-world data into large-scale, photorealistic training data for robot policies:

  1. Real2Sim Alignment & Paired Data Collection: We strictly align the classical simulator with the real-world platform along three dimensions—background and object appearance, camera intrinsics/extrinsics, and object positions. Real-world trajectories are then replayed in simulation to form ($\mathcal{V}_{\text{sim}}$, $\mathcal{V}_{\text{real}}$, $\mathcal{A}$) tuples that share the same action sequence.

  2. Compositional Dynamic Video Generation: A DiT-based neural simulator is trained to translate simulation videos into real-world video representations. It estimates scores conditioned jointly on Control Dynamics (actions) and Visual Dynamics (simulated observations), which are composed during sampling through Dynamic Guidance to ensure both photorealism and action consistency.

  3. Rule-Based Large-Scale Data Synthesis: In RoboTwin we synthesize diverse trajectories via GPT-5-driven compositions of action primitives under semantic and physical constraints. The classical simulator produces clean ($\mathcal{V}_{\text{sim}}$, $\mathcal{A}$) pairs that span broad spatial and object variations.

  4. Real-World Policy Deployment: Every simulated video is passed through the trained neural simulator, producing Pseudo-Real data that retains the original actions but looks real. Combined with a handful of real demonstrations, these data train robot policies (Diffusion Policy) that achieve high success rates and strong spatial/object generalization.
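The four steps above can be sketched end to end. The snippet below is a minimal, self-contained Python mock-up: `replay_in_sim`, `neural_sim`, and `synthesize_trajectory` are hypothetical stand-ins (random arrays in place of a real simulator or the trained DiT), meant only to show how data flows through the closed loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# All names below are hypothetical stand-ins, not the paper's actual code.

def replay_in_sim(actions):
    """Step 1: replay an action sequence in the aligned classical
    simulator and render one frame per action (stubbed with noise)."""
    return rng.random((len(actions), 64, 64, 3))  # (T, H, W, C)

def neural_sim(sim_video):
    """Step 2: the trained DiT-based sim-to-real translator (stubbed);
    the real model re-renders sim frames photorealistically while the
    underlying action sequence is left untouched."""
    return np.clip(sim_video + 0.1 * rng.standard_normal(sim_video.shape), 0.0, 1.0)

def synthesize_trajectory(T=20, action_dim=14):
    """Step 3: rule-based synthesis of a dual-arm action sequence
    (the paper composes action primitives under GPT-5-driven rules)."""
    return rng.random((T, action_dim))

# A handful of real demos: (real_video, actions) pairs.
real_demos = [(rng.random((20, 64, 64, 3)), synthesize_trajectory())
              for _ in range(10)]

# Step 1: replay real actions in sim -> (V_sim, V_real, A) training tuples
# for the neural simulator.
paired = [(replay_in_sim(a), v_real, a) for v_real, a in real_demos]

# Steps 3-4: synthesize many sim trajectories and translate them into
# Pseudo-Real (real-looking video, original actions) pairs.
pseudo_real = []
for _ in range(200):
    actions = synthesize_trajectory()
    pseudo_real.append((neural_sim(replay_in_sim(actions)), actions))

# Step 4: policy training set = 10 real demos + 200 Pseudo-Real pairs.
train_set = real_demos + pseudo_real
print(len(train_set))  # 210
```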

Framework

Overview of Compositional Simulation. (Left) Real2Sim alignment: trajectories collected in the real world are replayed in simulation to generate paired video data for training the sim-to-real neural simulator. (Right) A DiT is used to estimate scores conditioned on different dynamics, including Control Dynamics (actions) and Visual Dynamics (simulated observations). These scores are composed during sampling to enable Dynamic Guidance methods.
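The score composition at the heart of Dynamic Guidance can be illustrated with a toy sketch. Assuming a classifier-free-guidance-style combination (the weights `w_c`, `w_v` and this exact formula are an illustrative reading, not the paper's verbatim rule), each sampling step mixes the unconditional score with the Control-Dynamics- and Visual-Dynamics-conditioned scores:

```python
import numpy as np

def composed_score(s_uncond, s_control, s_visual, w_c=2.0, w_v=2.0):
    """Compose per-condition scores at one sampling step.

    s_uncond:  score with all conditions dropped
    s_control: score conditioned on Control Dynamics (actions)
    s_visual:  score conditioned on Visual Dynamics (simulated frames)
    w_c, w_v:  guidance weights (illustrative values)
    """
    return s_uncond + w_c * (s_control - s_uncond) + w_v * (s_visual - s_uncond)

# Toy check: with unit weights the composition reduces to
# s_control + s_visual - s_uncond (standard compositional guidance).
s_u = np.array([0.0, 1.0])
s_c = np.array([1.0, 1.0])
s_v = np.array([0.0, 2.0])
print(composed_score(s_u, s_c, s_v, w_c=1.0, w_v=1.0))  # [1. 2.]
```

Raising `w_c` pushes samples toward tighter action consistency, while `w_v` emphasizes agreement with the simulated observations; the two can be traded off independently.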

Pipeline

Real-world deployment with Compositional Simulation. Large volumes of $(\mathcal{V}_{\text{sim}}, \mathcal{A})$ pairs are collected from the classical simulator and transformed into corresponding $(\mathcal{V}_{\text{real}}, \mathcal{A})$ pairs, referred to as Pseudo Real Data. These, together with a small amount of real-world data, are used to train policies with improved success rates and generalization.

Tasks

We evaluate Compositional Simulation on five representative dual-arm manipulation tasks spanning diverse interaction patterns: Shake Bottle, Stack Blocks Two, Move Playing-Card Away, Place Mouse Pad, and Handover Bottle. To rigorously test generalization, we further introduce two scene variants—Cluttered (with distractors) and Colored Background (altered desktop textures)—as well as out-of-distribution object substitutions (e.g., Coca-Cola, Sprite, Nongfu Spring Oriental Leaf Tea bottles; red playing cards) that never appear in the real-world training demonstrations.

Task demonstrations

Task demonstrations. Keyframes from real-world execution on Move Playing-Card Away (top) and Shake Bottle (bottom), showing both the training object (blue card / Fanta) and unseen test objects (red card / Coca-Cola), highlighting our pipeline's generalization to novel objects.

Results

We first evaluate the photorealism of Sim2Real neural simulation, then assess downstream policy performance under multiple data-mixture regimes. Our full pipeline (Ours-Full) achieves the best scores on all six video quality metrics, and policies trained with 10 Real + 200 Pseudo Real substantially outperform all baselines across every task and distribution.

Sim2Real Neural Simulation Quality

| Method | PSNR ↑ | SSIM ↑ | CLIP Score ↑ | LPIPS ↓ | FID ↓ | FVD ↓ |
|---|---|---|---|---|---|---|
| Classical Sim | 16.443 | 0.4342 | 0.7564 | 0.3629 | 187.40 | 61.048 |
| Baseline (SD 1.5) | 16.849 | 0.5129 | 0.7526 | 0.3494 | 254.59 | 50.369 |
| Zero-Shot | 13.093 | 0.5487 | 0.7308 | 0.4756 | 219.74 | 163.83 |
| Ours-CD | 8.464 | 0.1486 | 0.7216 | 0.8130 | 434.44 | 239.13 |
| Ours-VD | 18.153 | 0.5916 | 0.7884 | 0.2813 | 153.12 | 22.311 |
| **Ours-Full** | **19.577** | **0.6484** | **0.8102** | **0.2647** | **147.90** | **15.765** |
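For reference, the first metric in the table, PSNR, is computed from the mean squared error between a reference frame and a generated frame. The snippet below uses the standard definition; the paper's exact evaluation code may differ.

```python
import numpy as np

def psnr(ref, img, data_range=1.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

# Example: a uniform error of 0.1 on [0, 1] images gives MSE = 0.01,
# hence PSNR = 10 * log10(1 / 0.01) = 20 dB.
ref = np.zeros((64, 64))
img = np.full((64, 64), 0.1)
print(round(psnr(ref, img), 6))  # 20.0
```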
Sim2Real visual comparison

Visual comparison of Sim2Real generation. Across four representative tasks, conventional diffusion baselines suffer from hallucinations, while our full pipeline faithfully reproduces scene semantics (viewpoint, object instances, background) and action dynamics (gripper timing, arm motion), yielding photorealistic results that remain action-consistent with the simulated source.

Real-World Policy Execution

ID: In-domain spatial distribution.    OOD: Out-of-domain spatial distribution.    PR: Pseudo-Real (ours).
| Task | Dist. | 10 Real | 20 Real | 200 Sim Pre + 10 Real | 10 Real + 200 Sim | 10 Real + 200 PR | 200 PR (0-shot) |
|---|---|---|---|---|---|---|---|
| Shake Bottle | ID | 9/30 | 17/30 | 12/30 | 6/30 | 28/30 | 10/30 |
| Shake Bottle | OOD | 0/30 | 1/30 | 0/30 | 0/30 | 12/30 | 5/30 |
| Stack Blocks Two | ID | 5/30 | 13/30 | 8/30 | 2/30 | 15/30 | 7/30 |
| Stack Blocks Two | OOD | 0/30 | 0/30 | 0/30 | 0/30 | 6/30 | 3/30 |
| Move Card Away | ID | 12/30 | 24/30 | 15/30 | 6/30 | 29/30 | 18/30 |
| Move Card Away | OOD | 0/30 | 3/30 | 2/30 | 1/30 | 17/30 | 9/30 |
| Card (Cluttered) | ID | 7/30 | 18/30 | 11/30 | 3/30 | 25/30 | 15/30 |
| Card (Cluttered) | OOD | 0/30 | 1/30 | 1/30 | 1/30 | 16/30 | 8/30 |
| Card (Colored BG) | ID | 10/30 | 20/30 | 6/30 | 8/30 | 23/30 | 17/30 |
| Card (Colored BG) | OOD | 0/30 | 2/30 | 1/30 | 2/30 | 17/30 | 10/30 |
| Mouse Pad (Cluttered) | ID | 4/30 | 15/30 | 6/30 | 7/30 | 19/30 | 8/30 |
| Mouse Pad (Cluttered) | OOD | 0/30 | 0/30 | 1/30 | 1/30 | 10/30 | 5/30 |
| Mouse Pad (Color BG) | ID | 7/30 | 18/30 | 8/30 | 5/30 | 22/30 | 12/30 |
| Mouse Pad (Color BG) | OOD | 0/30 | 1/30 | 1/30 | 1/30 | 14/30 | 9/30 |
| Handover (Cluttered) | ID | 3/30 | 8/30 | 1/30 | 1/30 | 13/30 | 11/30 |
| Handover (Cluttered) | OOD | 0/30 | 0/30 | 0/30 | 0/30 | 5/30 | 3/30 |
Spatial generalization visualization

Spatial generalization on Move Playing-Card Away. Top two rows: in-domain initial positions. Middle two rows: out-of-domain positions. Bottom four rows: colored backgrounds with varying clutter. Policies trained with 10 Real + 200 Pseudo Real (ours) handle all regimes robustly, while policies trained only on real data fail to generalize beyond the demonstration region.

Videos

We present real-world execution demos and Sim2Real neural simulation comparisons. In the comparison videos, simulated inputs and pseudo-real outputs are shown side by side, demonstrating how our neural simulator transforms classical simulation into photorealistic data while preserving action consistency.

Real-World Execution Demos (3× speed)

Move Playing-Card Away (3× speed)

Place Mouse Pad (3× speed)

Sim2Real Neural Simulation Comparisons

Move Playing-Card Away — Simulated (left) | Pseudo-Real (middle) | Real (right)

Move Playing-Card Away (Colored Background) — Simulated (left) | Pseudo-Real (middle) | Real (right)

Place Mouse Pad — Simulated (left) | Pseudo-Real (middle) | Real (right)

Generalization

Compositional Simulation preserves real-world spatial statistics and object semantics, enabling Diffusion Policy to generalize well to unseen spatial layouts and object categories. Under new-object substitutions (Coca-Cola, Sprite, Nongfu Spring Oriental Leaf Tea, red playing cards), classical simulation brings essentially no improvement, whereas our Pseudo-Real data yields a substantial boost in success rate.

New-Object Generalization

| Task | 10 Real | 20 Real | 200 Sim Pre + 10 Real | 10 Real + 200 Sim | 10 Real + 200 PR | 200 PR (0-shot) |
|---|---|---|---|---|---|---|
| Shake Bottle | 0/30 | 0/30 | 0/30 | 0/30 | **15/30** | 9/30 |
| Move Playing-Card Away | 1/30 | 2/30 | 1/30 | 0/30 | **21/30** | 11/30 |
Object generalization

New-object generalization on Shake Bottle. Top: policy trained with 20 Real (Fanta only). Bottom: policy trained with 10 Real + 200 Pseudo Real, which successfully handles Coca-Cola, Sprite, and Nongfu Spring Oriental Leaf Tea bottles despite never seeing them in the real demonstration set.

More Visualizations

We present additional qualitative comparisons spanning Real2Sim alignment quality and Sim2Real neural simulation, further confirming that our framework preserves action semantics while faithfully reproducing real-world appearance across diverse manipulation tasks.

More Sim2Real results

Extended Sim2Real neural simulation results. Visual comparisons across Adjust Bottle, Move Playing-Card Away, Stack Blocks Three, and Ranking Blocks RGB demonstrate consistent temporal coherence and perceptual realism throughout generated sequences.

Real2Sim alignment on bottles

Real2Sim alignment on Move Playing-Card Away. From top to bottom: Nongfu Spring Oriental Leaf Tea, Coca-Cola, Sprite, and Fanta. Sub-millimetre geometric agreement and pixel-level texture consistency are achieved via a rule-based digital-twin alignment pipeline.

Real2Sim alignment on blocks

Real2Sim alignment on additional tasks. From top to bottom: Ranking Blocks RGB, Stack Blocks Three, and Stack Blocks Two.

Citation

If you find this project useful, please consider citing us:

@article{qin2026comsim,
    title={ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation},
    author={Yiran Qin and Jiahua Ma and Li Kang and Wenzhan Li and Yihang Jiao and Xin Wen and Xiufeng Song and Heng Zhou and Jiwen Yu and Zhenfei Yin and Xihui Liu and Philip Torr and Yilun Du and Ruimao Zhang},
    journal={arXiv preprint arXiv:2504.xxxxx},
    year={2026}
}