[Journal Article][Regular Paper]


Spatial–temporal correlations learning and action-background jointed attention for weakly-supervised temporal action localization

Authors:
Huifen Xia; Yongzhao Zhan; Keyang Cheng

Year: 2022

Pages: 1529–1541
Publisher: Springer Nature


Abstract:

Weakly supervised temporal action localization (W-TAL) aims to detect and classify all action instances in an untrimmed video using only video-level labels. Because frame-level annotations are unavailable, learning the correlations among action snippets and separating action from background are the two key issues for accurate action localization. To mine the intrinsic spatial and temporal correlations embodied in the occurrences of actions in a video, and to identify action and background snippets, a novel method based on spatial–temporal correlations learning and action-background jointed attention for W-TAL is proposed. In this method, a graph convolution network and a 1-D temporal convolution network are constructed to learn the spatial and temporal features of the video, respectively; these features are then fused into a rich spatial–temporal correlative feature map, ensuring a more complete feature representation for action localization. Next, unlike other methods, an action-background jointed attention mechanism is presented to explicitly model background as well as action in a three-branch classification network. This network distinguishes action from background and achieves a better separation of the two, promoting more accurate action localization. Experiments on Thumos14 and ActivityNet1.3 show that our method outperforms state-of-the-art methods, especially at high t-IoU thresholds, which further validates its effectiveness.
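The pipeline the abstract outlines (a graph-convolution stream and a 1-D temporal-convolution stream fused into one feature map, followed by snippet-level action/background attention feeding a three-branch classifier with top-k temporal pooling) can be sketched roughly as below. This is a minimal NumPy illustration under assumed shapes: the features, weights, snippet-similarity graph, class count, and top-k value are all synthetic placeholders, not the paper's actual architecture or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, C_out, K = 8, 16, 16, 3       # snippets, feature dims, temporal kernel size

def graph_conv(X, A, W):
    """One graph-convolution layer over a snippet-similarity graph A (T x T)."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt          # symmetric normalization
    return np.maximum(A_hat @ X @ W, 0.0)        # propagate + project + ReLU

def temporal_conv(X, W):
    """1-D temporal convolution ('same' padding) along the snippet axis."""
    k = W.shape[0]
    Xp = np.pad(X, ((k // 2, k // 2), (0, 0)))
    out = np.stack([np.einsum('kc,kco->o', Xp[t:t + k], W)
                    for t in range(X.shape[0])])
    return np.maximum(out, 0.0)

# synthetic snippet features and a self-looped, symmetric similarity graph
X = rng.normal(size=(T, C))
A = np.eye(T) + (rng.random((T, T)) > 0.7).astype(float)
A = np.maximum(A, A.T)

Wg = rng.normal(size=(C, C_out)) * 0.1
Wt = rng.normal(size=(K, C, C_out)) * 0.1

# fuse the spatial (graph) and temporal (1-D conv) streams into one feature map
Z = graph_conv(X, A, Wg) + temporal_conv(X, Wt)   # (T, C_out)

# snippet-level action / background attention, jointly normalized per snippet
wa, wb = rng.normal(size=(C_out,)), rng.normal(size=(C_out,))
att = np.exp(np.stack([Z @ wa, Z @ wb]))
att /= att.sum(axis=0, keepdims=True)
a_act, a_bkg = att[0], att[1]                     # each sums with the other to 1

# three classification branches: raw, action-attended, background-attended CAS
n_cls = 4
Wc = rng.normal(size=(C_out, n_cls)) * 0.1
cas = Z @ Wc                                      # base class activation sequence
branches = [cas, a_act[:, None] * cas, a_bkg[:, None] * cas]

# video-level class scores per branch via top-k temporal mean pooling
k_top = 3
video_scores = [np.sort(b, axis=0)[-k_top:].mean(axis=0) for b in branches]
```

At training time each branch's pooled score would be supervised with the video-level labels (background treated as an extra, always-present pseudo-label in such schemes); the sketch stops at the forward pass.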



Keywords:

Spatial–temporal correlations; Graph convolution network; 1-D convolution network; Temporal action localization; Weakly-supervised


Journal:
Multimedia Systems
ISSN: 0942-4962
From: Springer Nature