About the Event
The ability to perceive and understand people's actions enables humans to communicate and collaborate efficiently in society. Endowing machines with such an ability is an important step toward building assistive and socially aware robots. This dissertation drives progress on visual action understanding in the scope of human-object interactions, a major branch of human actions that dominates our everyday life. Specifically, we address the challenges of two important tasks: visual recognition and visual synthesis.
The first part of this dissertation considers the recognition task. The main bottleneck of current research is the lack of a proper benchmark, since existing action datasets contain only a small number of categories with limited diversity. To this end, we set out to construct a large-scale image benchmark by annotating web images through online crowdsourcing. The new "HICO" dataset surpasses prior datasets in terms of both the number of images and the number of action categories by an order of magnitude. The introduction of HICO enables us to benchmark state-of-the-art recognition approaches and also sheds light on new challenges in the realm of large-scale interaction recognition.
The second part of this dissertation considers the synthesis task and focuses particularly on the synthesis of body motion. The central goal is: given an image of a scene, synthesize the course of an action conditioned on the observed scene configuration. We investigate two types of synthesis tasks: semantic-driven synthesis and goal-driven synthesis. For the former, we propose a novel deep neural network architecture that extracts semantic information from the image and uses it to predict future body poses. For the latter, we propose a novel reinforcement learning framework that can synthesize motion from varying initial human-object configurations.