Does computer vision matter for action?

Brady Zhou, Philipp Krähenbühl, Vladlen Koltun
Science Robotics 2019, Special Section on Computer Vision
[pdf] [code]

Computer vision produces representations of scene content. Much computer vision research is predicated on the assumption that these intermediate representations are useful for action. That is, a sensorimotor system equipped with such representations is better able to act in the physical world. Recent work at the intersection of machine learning and robotics calls this assumption into question. High-profile results show that sensorimotor systems can be trained directly for the task at hand. These systems are trained end-to-end, from pixels to actions, with no explicit intermediate representations as studied in computer vision research. These results cast doubt on the role of computer vision in sensorimotor control. Thus the central question of our work: Does computer vision matter for action? We probe this question and its offshoots via immersive simulation, which allows us to conduct controlled, reproducible experiments at scale. We instrument immersive three-dimensional environments to simulate challenges such as urban driving, off-road trail traversal, and battle. We train baseline pixels-to-actions models and compare them to corresponding models that are equipped with the kinds of intermediate representations studied in computer vision. Our main finding is that computer vision does matter. Models equipped with intermediate representations train faster, achieve higher task performance, and generalize better to previously unseen environments.
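
The comparison between pixels-to-actions models and models equipped with intermediate representations can be illustrated with a minimal sketch of how the two kinds of input might differ. This is not the authors' implementation. The abstract does not name specific representations or say how they are supplied to a model, so the use of depth and semantic segmentation, their concatenation as extra input channels, and the function names `pixels_only_input` and `vision_equipped_input` are all assumptions made for illustration.

```python
import numpy as np


def pixels_only_input(rgb):
    """Baseline input: only the RGB image, scaled to [0, 1]."""
    return rgb.astype(np.float32) / 255.0


def vision_equipped_input(rgb, depth, segmentation, num_classes):
    """RGB plus illustrative intermediate representations as extra channels.

    Assumption: depth (H x W floats) and a semantic label map (H x W ints)
    stand in for "intermediate representations"; the source names neither.
    """
    rgb_f = rgb.astype(np.float32) / 255.0                      # H x W x 3
    depth_f = depth.astype(np.float32)[..., None]               # H x W x 1
    seg_onehot = np.eye(num_classes, dtype=np.float32)[segmentation]  # H x W x C
    # Stack everything along the channel axis so a single network can consume it.
    return np.concatenate([rgb_f, depth_f, seg_onehot], axis=-1)


# Hypothetical toy observation: a 4 x 6 image with 5 semantic classes.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
depth = rng.random((4, 6))
segmentation = rng.integers(0, 5, size=(4, 6))

baseline = pixels_only_input(rgb)
equipped = vision_equipped_input(rgb, depth, segmentation, num_classes=5)
```

In this sketch, both inputs could feed the same downstream policy network. Only the number of input channels changes, which keeps the comparison between the two model families controlled.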