Harbin Engineering University
Visual scene understanding includes detecting and recognizing objects, reasoning about the visual relationships between the detected objects, and describing image regions with sentences. To achieve a more comprehensive and accurate understanding of scene images, we treat object detection, visual relationship detection, and image captioning as three visual tasks at different semantic levels of scene understanding. We propose an image understanding model based on multi-level semantic features that leverages the mutual connections across the three semantic levels to solve these tasks jointly. The model uses a message passing graph to iteratively and simultaneously update the semantic features of objects, relationship phrases, and image captions. The updated semantic features are used to classify objects and visual relationships and to generate scene graphs and captions, and a fusion attention mechanism is introduced to improve caption accuracy. Experimental results on the Visual Genome and COCO datasets show that the proposed method outperforms existing methods on the scene graph generation and image captioning tasks.
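To make the idea of message passing across the three semantic levels concrete, the following is a minimal sketch and not the model's actual update rule. It assumes each object, relationship phrase, and caption region is represented by a plain feature vector, that each phrase is linked to its subject and object, and that the caption region is linked to every phrase. The averaging update, the `alpha` blending weight, and all function names are hypothetical choices made for illustration; a learned, gated update would normally take their place.

```python
from typing import List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def blend(own: Vector, message: Vector, alpha: float = 0.5) -> Vector:
    """Mix a node's own feature with an incoming message (assumed fixed weight)."""
    return [(1 - alpha) * o + alpha * m for o, m in zip(own, message)]


def message_passing_step(
    objects: List[Vector],
    phrases: List[Vector],
    caption: Vector,
    triples: List[Tuple[int, int]],
) -> Tuple[List[Vector], List[Vector], Vector]:
    """One synchronous update of all three levels.

    triples[p] = (subject_index, object_index) for phrases[p].
    All messages are computed from the features of the previous step.
    """
    # Object level: receive messages from the phrases the object takes part in.
    new_objects = []
    for i, feat in enumerate(objects):
        incoming = [phrases[p] for p, (s, o) in enumerate(triples) if i in (s, o)]
        new_objects.append(blend(feat, mean(incoming)) if incoming else list(feat))

    # Phrase level: receive messages from its subject, its object, and the caption.
    new_phrases = []
    for p, (s, o) in enumerate(triples):
        incoming = [objects[s], objects[o], caption]
        new_phrases.append(blend(phrases[p], mean(incoming)))

    # Caption level: receive messages from all phrases.
    new_caption = blend(caption, mean(phrases)) if phrases else list(caption)
    return new_objects, new_phrases, new_caption


def refine(objects, phrases, caption, triples, steps: int = 2):
    """Run several update steps; the step count is an assumed hyperparameter."""
    for _ in range(steps):
        objects, phrases, caption = message_passing_step(
            objects, phrases, caption, triples
        )
    return objects, phrases, caption
```

In this sketch the refined object and phrase features would feed the object and relationship classifiers, and the refined caption feature would feed a caption decoder; those components, and the fusion attention mechanism, are not shown.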