RoboBERT: An End-to-end Multimodal Robotic Manipulation Model
Published on arXiv (v2, cs.RO / cs.LG), 2025
Recommended citation: Sicheng Wang, Sheng Liu, Weiheng Wang, Jianhua Shan, Bin Fang. (2025). "RoboBERT: An End-to-end Multimodal Robotic Manipulation Model." arXiv preprint arXiv:2502.07837v2. doi:10.48550/arXiv.2502.07837. https://arxiv.org/pdf/2502.07837v2
RoboBERT is an end-to-end multimodal robotic manipulation model that integrates vision, language, and action through a diffusion-based policy network. It uses a two-stage training paradigm: (1) freeze most of the vision encoder and train with a single standard instruction phrasing for stable policy learning; (2) unfreeze modules and introduce diverse natural-language variants to align language understanding with the learned policy without destabilizing performance. Systematic data augmentation improves robustness to visual perturbations.
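The sketch below illustrates what such a two-stage schedule could look like in PyTorch; it is a minimal, hedged example, not the authors' code. `RoboBERTPolicy`, its `vision_encoder` attribute, `diffusion_loss`, and the two datasets are hypothetical names used for illustration only.

```python
# Minimal sketch of a two-stage training schedule as described above.
# All model/dataset names are hypothetical placeholders, not the paper's API.
import torch
from torch.utils.data import DataLoader

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    # Toggle gradient updates for a submodule (e.g. the vision encoder).
    for p in module.parameters():
        p.requires_grad = trainable

def train_stage(policy, loader, epochs, lr):
    opt = torch.optim.AdamW(
        (p for p in policy.parameters() if p.requires_grad), lr=lr
    )
    for _ in range(epochs):
        for batch in loader:
            # batch: augmented camera views, a language instruction, expert actions
            loss = policy.diffusion_loss(
                images=batch["images"],
                instruction=batch["instruction"],
                actions=batch["actions"],
            )
            opt.zero_grad()
            loss.backward()
            opt.step()

policy = RoboBERTPolicy()  # hypothetical policy with vision encoder + diffusion head

# Stage 1: freeze most of the vision encoder; each demonstration uses one
# canonical instruction phrasing so the diffusion policy can stabilize first.
set_trainable(policy.vision_encoder, False)
train_stage(policy, DataLoader(canonical_dataset, batch_size=64, shuffle=True),
            epochs=20, lr=1e-4)

# Stage 2: unfreeze and fine-tune with diverse natural-language paraphrases of
# each instruction to align language understanding with the learned policy.
set_trainable(policy.vision_encoder, True)
train_stage(policy, DataLoader(paraphrased_dataset, batch_size=64, shuffle=True),
            epochs=10, lr=1e-5)
```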
- arXiv: 2502.07837 (v2)
- PDF (v2): https://arxiv.org/pdf/2502.07837v2
- DOI: 10.48550/arXiv.2502.07837
- Project page: Link
- Code: GitHub
BibTeX
```bibtex
@article{wang2025robobert,
  title   = {RoboBERT: An End-to-end Multimodal Robotic Manipulation Model},
  author  = {Wang, Sicheng and Liu, Sheng and Wang, Weiheng and Shan, Jianhua and Fang, Bin},
  journal = {arXiv preprint arXiv:2502.07837v2},
  year    = {2025},
  doi     = {10.48550/arXiv.2502.07837}
}
```