IROS 2026 Workshop
Multimodal Learning for Robotic Manipulation
Half-day Workshop • Pittsburgh, PA, USA • September 27, 2026
Foundation models trained on internet-scale data have achieved impressive generalization in vision and language domains, but robotic manipulation is fundamentally contact-rich. Success and safety often hinge on signals that vision alone cannot reliably infer, including touch, force/torque, proprioception, and audio. This mismatch creates a sensory gap: foundation models excel at semantics and geometry, yet we still lack a clear understanding of when vision-only policies break under contact, friction, compliance, occlusion, and real-time constraints.
This workshop brings together researchers in foundation models, tactile/haptic sensing, state estimation, and control to address a central question:
"Is vision all we need for manipulation, or what additional sensing and interaction modeling is minimally necessary?"
The workshop's technical scope spans three themes:

- Vision–language–action and foundation-model pipelines for manipulation; failure-mode taxonomies for vision-only policies under contact, friction, compliance, and occlusion; sensory-sufficiency evaluation via principled ablations and capability-centric metrics (success, robustness, recovery, and safety), as sketched after this list.
- Tactile and haptic sensing, force/torque, proprioception, and audio for contact-rich interaction; sensor design, simulation, and calibration; identifying when non-visual signals change capability, not just accuracy.
- Multimodal fusion and representation learning; contact-aware control interfaces; uncertainty estimation and recovery under limited real-world data; evaluation protocols, reproducible benchmarks, and negative/ablation results.
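To make the first theme concrete, below is a minimal sketch of how a sensory-sufficiency ablation might be scored with capability-centric metrics rather than raw accuracy. Every name in it (the `MODALITIES` list, the `policy`/`env` interface, and the `info` keys for success, recovery, and safety) is a hypothetical placeholder standing in for whatever benchmark a submission actually uses.

```python
# Minimal sketch of a per-modality ablation protocol for sensory-sufficiency
# evaluation. The policy, environment interface, modality names, and metric
# keys are hypothetical placeholders, not a prescribed benchmark.
from statistics import mean

MODALITIES = ["vision", "tactile", "force_torque", "proprioception", "audio"]


def mask_modality(observation: dict, dropped: str) -> dict:
    """Remove one modality so the policy must rely on the remaining signals."""
    masked = dict(observation)
    masked[dropped] = None  # assumes the policy tolerates missing inputs
    return masked


def evaluate(policy, env, dropped=None, n_episodes=50) -> dict:
    """Run episodes with one modality ablated and collect capability metrics."""
    successes, recoveries, violations = [], [], []
    for _ in range(n_episodes):
        obs, done, info = env.reset(), False, {}
        while not done:
            if dropped is not None:
                obs = mask_modality(obs, dropped)
            obs, done, info = env.step(policy.act(obs))
        successes.append(float(info.get("success", False)))
        recoveries.append(float(info.get("recovered_after_slip", False)))
        violations.append(float(info.get("safety_violation", False)))
    return {
        "dropped": dropped or "none",
        "success_rate": mean(successes),
        "recovery_rate": mean(recoveries),
        "violation_rate": mean(violations),
    }


def sensory_sufficiency_report(policy, env):
    """Compare the full-observation baseline against single-modality ablations."""
    for row in [evaluate(policy, env)] + [evaluate(policy, env, m) for m in MODALITIES]:
        print(f"{row['dropped']:>15}  success={row['success_rate']:.2f}  "
              f"recovery={row['recovery_rate']:.2f}  violations={row['violation_rate']:.2f}")
```

The point of such a protocol is that each ablation is reported in capability terms (success, recovery, safety) so that the contribution of a non-visual modality shows up as a change in what the policy can do, not just in prediction error.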
| Time | Event |
|---|---|
| 8:30 AM | Welcome |
| 8:45 AM | Invited Talk 1 |
| 9:15 AM | Invited Talk 2 |
| 9:45 AM | Poster Session & Coffee Break |
| 10:15 AM | Invited Talk 3 |
| 11:00 AM | Invited Talk 4 |
| 11:30 AM | Panel Debate: Is vision all we need? |
| 12:15 PM | Conclusion |
| 12:30 PM | Workshop Ends |
We invite submissions of short papers (4 pages + references) and extended abstracts (2 pages) on topics including but not limited to: multimodal sensing for manipulation, foundation models for contact-rich tasks, tactile/haptic perception, force/torque estimation, multimodal fusion, contact-aware control, and evaluation protocols for sensory sufficiency.
We welcome works in progress, ablation studies, negative results, and systems papers with reproducible artifacts (datasets, protocols, benchmarks). Accepted contributions will be presented as posters, with optional lightning talks.