Learning Novel Manipulation Skills with
Agent-Agnostic Visual and Action Representations

*Equal contributors Corresponding author
1National Key Laboratory of General Artificial Intelligence, BIGAI.
2Department of Automation, THU. 3Institute for AI, PKU. 4UCLA. 5School of EECS, PKU.

Ag2Manip enables a variety of manipulation tasks without requiring domain-specific demonstrations.


Enabling robotic systems to autonomously acquire novel manipulation skills is vital for applications ranging from assembly lines to service robots. Existing methods (e.g., VIP, R3M) learn a generalized representation for manipulation tasks but overlook (i) the domain gap between distinct embodiments and (ii) the sparseness of successful task trajectories within the embodiment-specific action space, which leads to misaligned and ambiguous task representations and inferior learning efficiency. We address these challenges by introducing Ag2Manip (Agent-Agnostic representations for Manipulation) for learning novel manipulation skills. Our approach rests on two principal innovations: (i) a novel agent-agnostic visual representation trained on human manipulation videos with embodiments masked to ensure generalizability, and (ii) an agent-agnostic action representation that abstracts the robot's kinematic chain into an agent proxy with a universally applicable action space, focusing on the core interaction between the end-effector and the object. In our experiments, Ag2Manip improves markedly across a diverse array of manipulation tasks without requiring domain-specific demonstrations, achieving a 325% improvement in average success rate across 24 tasks from FrankaKitchen, ManiSkill, and PartManip. Ablation studies further underscore the critical role of both representations in achieving these improvements.

Pipeline of Ag2Manip


Our method consists of three parts: (left) learning an agent-agnostic visual representation; (middle) learning abstracted skills with an agent-agnostic action representation; and (right) retargeting the abstracted skills to a robot.
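To make the two agent-agnostic ideas concrete, here is a minimal Python sketch. It is not the authors' implementation. It only illustrates (a) masking the embodiment out of a video frame before it reaches a visual encoder and (b) replacing the robot's kinematic chain with a free-floating proxy driven by a simple displacement action. The names `mask_embodiment` and `AgentProxy`, the grayscale list-of-lists frame format, the per-pixel boolean agent mask, and the 5 cm step limit are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical stand-ins: a grayscale H x W frame and a same-shaped mask
# that is True wherever the pixel belongs to the agent (human hand or robot).
Frame = List[List[float]]
Mask = List[List[bool]]


def mask_embodiment(frame: Frame, agent_mask: Mask, fill: float = 0.0) -> Frame:
    """Return a copy of `frame` with every agent pixel replaced by `fill`.

    The idea (assumed here) is that an encoder trained on such frames
    cannot rely on what the agent looks like.
    """
    return [
        [fill if is_agent else pixel for pixel, is_agent in zip(row, mask_row)]
        for row, mask_row in zip(frame, agent_mask)
    ]


@dataclass
class AgentProxy:
    """Assumed free-floating proxy: a 3D position with no kinematic chain."""

    position: Tuple[float, float, float]
    max_step: float = 0.05  # assumed per-axis displacement limit, in metres

    def step(self, action: Sequence[float]) -> None:
        """Apply a 3D displacement, clipped per axis to +/- max_step."""
        clipped = [max(-self.max_step, min(self.max_step, a)) for a in action]
        self.position = tuple(p + d for p, d in zip(self.position, clipped))


if __name__ == "__main__":
    frame = [[1.0, 2.0], [3.0, 4.0]]
    agent_mask = [[True, False], [False, True]]
    print(mask_embodiment(frame, agent_mask))

    proxy = AgentProxy(position=(0.0, 0.0, 0.0))
    proxy.step([0.1, -0.02, 0.0])  # the x displacement exceeds the limit and is clipped
    print(proxy.position)
```

Under these assumptions, a skill learned with the proxy is expressed purely as end-effector motion relative to the object. That is why a separate retargeting stage is then needed to map the proxy trajectory onto a specific robot.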

Simulation Results

Real-world Results


    title={Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations},
    author={Li, Puhao and Liu, Tengyu and Li, Yuyang and Han, Muzhi and Geng, Haoran and Wang, Shu and Zhu, Yixin and Zhu, Song-Chun and Huang, Siyuan},
    journal={arXiv preprint arXiv:2404.17521},