V2A - Vision to Action: Learning robotic arm actions based on vision and language

Asian Conference on Computer Vision, 2020

Michal Nazarczuk and Krystian Mikolajczyk

[Paper]

Abstract

In this work, we present a new AI task - Vision to Action (V2A) - where an agent (robotic arm) is asked to perform a high-level task (e.g. stacking) with objects present in a scene. The agent has to suggest a plan consisting of primitive actions (e.g. simple movement, grasping) in order to successfully complete the given task. Queries are formulated in a way that forces the agent to perform visual reasoning over the presented scene before inferring the actions. We propose a novel approach based on multimodal attention for this task and demonstrate its performance on our new V2A dataset. We also propose a method for building the V2A dataset by generating task instructions for each scene and designing an engine capable of assessing whether a sequence of primitives leads to successful task completion.
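To make the task formulation concrete, below is a minimal sketch (not the authors' code) of how a V2A sample and its evaluation could be structured, based only on the abstract: a scene observation, a language instruction, a predicted sequence of action primitives, and an engine that judges task completion. All names (ActionPrimitive, V2ASample, SimulationEngine, evaluate_plan) are hypothetical placeholders introduced here for illustration.

  # Illustrative sketch only; interfaces are assumptions, not the V2A dataset API.
  from dataclasses import dataclass
  from typing import List, Protocol

  import numpy as np


  @dataclass
  class ActionPrimitive:
      """One low-level step, e.g. a simple movement or a grasp, with its arguments."""
      name: str                # e.g. "move_to", "grasp", "release"
      arguments: List[str]     # e.g. identifiers of target objects in the scene


  @dataclass
  class V2ASample:
      """One task instance: a scene observation and a natural-language instruction."""
      scene_image: np.ndarray  # RGB observation of the scene
      instruction: str         # e.g. "stack the small red cube on the large blue one"


  class SimulationEngine(Protocol):
      """Interface for an engine that rolls out a plan and reports task success."""
      def execute(self, sample: V2ASample, plan: List[ActionPrimitive]) -> bool:
          ...


  def evaluate_plan(engine: SimulationEngine,
                    sample: V2ASample,
                    plan: List[ActionPrimitive]) -> bool:
      # The engine executes the primitives in the scene and returns whether the
      # high-level task described by the instruction was completed.
      return engine.execute(sample, plan)

In this view, a model maps (scene_image, instruction) to a list of ActionPrimitive objects, and the assessment engine described in the abstract plays the role of SimulationEngine.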

Citation

@inproceedings{nazarczuk2020b,
  title={V2A - Vision to Action: Learning robotic arm actions based on vision and language},
  author={Nazarczuk, Michal and Mikolajczyk, Krystian},
  booktitle={Asian Conference on Computer Vision (ACCV)},
  year={2020}
}