[Journal article][Full-length article]


Domain adaptive person search via GAN-based scene synthesis for cross-scene videos

Authors:
Huibing Wang; Tianxiang Cui; Mingze Yao; Huijuan Pang; Yushan Du

Year: 2023

Article number: 104796
Publisher: Elsevier BV


Abstract:

Person search has recently become a challenging task in the computer vision domain, aiming to locate specific pedestrians in images from real cameras. Nevertheless, most surveillance videos contain only a handful of images of each pedestrian, often with identical backgrounds and clothing, so it is difficult to learn discriminative features for person search in real scenes. To tackle this challenge, we draw on Generative Adversarial Networks (GANs) to synthesize data from surveillance videos. GANs have thrived in computer vision because they produce high-quality images efficiently. We make only a small change to the popular Faster R-CNN model, which can process videos and yield accurate detection results. To relieve the pressure brought by the two-stage model, we design an Assisted-Identity Query Module (AIDQ) that provides positive images for the subsequent re-ID stage. Besides, we propose a novel GAN-based Scene Synthesis model that can synthesize high-quality cross-id person images for person search. To facilitate the feature learning of the GAN-based Scene Synthesis model, we adopt an online learning strategy that learns collaboratively from the synthesized and original images. Extensive experiments on two widely used person search benchmarks, CUHK-SYSU and PRW, show that our method achieves strong performance, and an extensive ablation study further confirms that our GAN-synthesized data effectively increases the variability of the datasets while remaining realistic. The code is available at https://github.com/crsm424/DA-GSS.

Introduction

Person search aims to find a specific pedestrian in images or videos taken in real-world scenes, and it is a challenging task in the computer vision domain. In recent years, person search has attracted increasing attention due to its practical applications, such as smart surveillance systems [1], activity analysis [2], [3], people tracking in criminal investigations [4], [5], and other fields. In general, existing person search methods adopt hand-cropped images, which make the pedestrian bounding boxes clean and less noisy. However, hand-cropping requires a lot of time and manpower, making it unsuitable for real-world scenarios. Person search therefore needs to process whole images containing many pedestrians from actual surveillance videos, rather than pre-processed images. Besides, sharing features between detection and re-identification may cause errors from each process to accumulate, which negatively impacts the effectiveness of person search. These two issues still make it hard for existing person search methods to perform real-time target search in large-scale smart surveillance systems.

Deep learning-based methods for person search have proposed two different strategies to address these issues, named two-stage and one-stage according to their framework differences. One-stage methods use a unified framework that combines person detection and person re-ID in an end-to-end model [6], [7], [8], [9]. These unified frameworks add an extra layer behind the detection network to adapt the person bounding boxes for the re-identification network. They use a combined training loss that consists of a person detection loss and a person identity classification loss.
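To make the one-stage formulation concrete, here is a minimal, hypothetical sketch of such a combined objective. The module name, the weighting scheme, and all signatures are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class OneStageSearchLoss(nn.Module):
        """Hypothetical combined objective for one-stage person search:
        a detection loss plus an identity classification loss on the
        detected boxes. The fixed weighting is an assumption."""
        def __init__(self, id_weight=1.0):
            super().__init__()
            self.id_weight = id_weight
            self.id_criterion = nn.CrossEntropyLoss()

        def forward(self, det_loss, id_logits, id_labels):
            # det_loss: scalar loss from the detector (e.g., RPN + box head)
            # id_logits: (N, num_ids) identity scores for N detected persons
            # id_labels: (N,) ground-truth identity indices
            return det_loss + self.id_weight * self.id_criterion(id_logits, id_labels)

Because both terms backpropagate through the same shared backbone, the detection and identity gradients can pull the features in conflicting directions, which is exactly the tension the next paragraph describes.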
The goals of the search task, however, conflict with those of detection and re-ID, so sharing features between the two tasks is inappropriate. In other words, the detection task aims to find features common to all pedestrians, while the re-ID task seeks features unique to a specific person. Jointly learning the two tasks may therefore hinder the optimization of the model, and some researchers use a two-stage framework that separates them into two independent networks. Two-stage frameworks for person search [10], [11], [12], [13] first locate multiple pedestrians in the whole image with a detection network and then feed the cropped pedestrians to a re-ID network to complete the re-identification task. Aided by the strong results of detection models, two-stage frameworks mainly focus on how to extract robust and discriminative representations effectively.

Existing two-stage person search methods have achieved strong performance, but they still overlook the contradictory requirements of the sub-tasks in person search. In the detection stage, gallery images are auto-detected by a general detector, which produces a large number of bounding boxes for each pedestrian. As a result, the detected gallery images ignore the objectness information needed by the re-ID task, which worsens the problem of missed detections on query targets. Additionally, not all detection results suit the re-ID framework: compared with existing re-ID datasets, the detected bounding boxes are more likely to suffer from misalignment, occlusion, and missing body parts, and some detected boxes may contain no person at all. Because of these issues, the re-ID framework cannot produce correct recognition results. This consistency problem between the detection stage and the re-ID stage reduces search performance and restricts practicability. To resolve the contradiction between the two stages, we must therefore refine the detection results.

Whether researchers adopt a one-stage framework to complete the two sub-tasks jointly or a two-stage framework to solve them separately, the accuracy of pedestrian detection and the retrieval performance of re-ID are believed to influence each other. Note that while some of the aforementioned approaches achieve strong performance, we find that if the accuracy of the pedestrian detection part improves, the re-ID framework can compare higher-quality candidate samples against the query, which improves person search performance. We therefore consider it more effective to enhance search performance by obtaining more pedestrian images. In real scenarios, including the two widely used person search datasets, monitoring devices are installed in diverse places, and the uncertain number of samples from each camera makes the pedestrians appearing in each image random, sparse, and unbalanced. In view of these problems, some researchers have proposed image-manipulation methods [14], [15], [16] for person re-identification and person search, which aim to improve performance by generating diverse images. Moreover, video retrieval technology [17], [18], [19], [20] has been maturing continually, which greatly assists our two-stage video processing.
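Before turning to the proposed approach, here is a hedged sketch of the generic two-stage pipeline discussed above (detect, crop, then re-identify). The `detector` and `reid_net` callables are hypothetical placeholders, and the confidence thresholding is a common convention rather than any specific paper's rule.

    import torch
    import torch.nn.functional as F

    def crop_and_resize(frame, box, size=(256, 128)):
        # frame: (3, H, W) float tensor; box: [x1, y1, x2, y2] in pixels
        x1, y1, x2, y2 = [int(v) for v in box]
        patch = frame[:, y1:y2, x1:x2].unsqueeze(0)
        return F.interpolate(patch, size=size, mode="bilinear",
                             align_corners=False)[0]

    def two_stage_search(frame, detector, reid_net, score_thresh=0.5):
        """Detect pedestrians in the whole frame, crop confident boxes,
        and extract L2-normalized re-ID embeddings from the crops."""
        boxes, scores = detector(frame)      # (M, 4) boxes, (M,) confidences
        keep = scores > score_thresh         # drop low-confidence boxes
        crops = torch.stack([crop_and_resize(frame, b) for b in boxes[keep]])
        feats = reid_net(crops)              # (K, D) embeddings
        return boxes[keep], F.normalize(feats, dim=1)

    def rank_gallery(query_feat, gallery_feats):
        # Cosine similarity of unit vectors; best matches first.
        return (gallery_feats @ query_feat).argsort(descending=True)

In this layout the re-ID network only ever sees what the detector passes along, which is why improving or filtering the detection output, as proposed next, directly affects retrieval quality.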
Hence, inspired by the above discussion, we propose a novel approach for person search with generative adversarial networks to eliminate the contradiction between the detection and re-ID stages, termed Domain Adaptive Person Search via GAN-based Scene Synthesis (DA-GSS) for cross-scene videos. It combines two stages: pedestrian detection and person re-identification. In the detection stage, an Assisted-Identity Query module (AIDQ) is devised to manage the video's detection results, cropping the images and retaining positive samples for the re-ID stage. Specifically, the positive samples are obtained by discarding background and unlabeled-identity images and keeping only the instances that are likely to play an active role in the re-ID task. Besides, to force our model to learn more discriminative features, we adopt a generative adversarial network for scene synthesis in our proposed model. Our GAN-based scene synthesis model synthesizes data for cross-scene videos, which can effectively generate high-quality images and overcome the challenge of real-scene person search. Specifically, it adopts encoders to separate appearance information and structure information from person images and uses decoders to synthesize person images with different appearance information. Moreover, we design a discriminative module in this model that learns discriminative features online from the synthetic and original images.

In summary, the major contributions of the proposed method are the following:
• We propose a framework for domain adaptive person search that uses GAN-based scene synthesis for cross-scene videos. It can synthesize high-quality images across different videos and learns discriminative features for person search.
• To relieve the pressure brought by the two-stage model, we design an Assisted-Identity Query module that crops person images from the whole image and provides positive images for the subsequent stage, which improves the overall performance of the person search model.
• To make the proposed model more discriminative and robust for person search, a GAN is used to synthesize cross-scene person data, which forces our network to learn more discriminative and finer-grained features. Besides, we conduct experiments on the widely used CUHK-SYSU and PRW datasets and find that the newly synthesized data helps improve the performance of the model.

The remainder of the paper is outlined as follows. Section 2 introduces the related work. Section 3 presents the proposed method for domain adaptive person search. Extensive experiments, including a complexity analysis and an ablation study, verify the proposed model in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Related work

We first review existing work on person search, which has drawn much interest recently. We also review recent work in two fields that form the components of our proposed framework: generative adversarial networks and person re-identification.

Domain adaptive person search method

As illustrated in Fig. 1, our proposed person search framework DA-GSS can generally be divided into two parts: the detection model with AIDQ and the GAN-based Scene Synthesis network. In this section, we first present an overview of the whole framework and then describe the proposed DA-GSS in more detail.
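To illustrate the appearance/structure decomposition described above, the following is a minimal sketch of a cross-id synthesis forward pass. The encoder and decoder architectures, dimensions, and names below are placeholder assumptions for exposition, not the paper's actual networks.

    import torch
    import torch.nn as nn

    class SceneSynthesisSketch(nn.Module):
        """Minimal sketch of an appearance/structure swap. E_app yields a
        global appearance code, E_str a spatial structure map, and G decodes
        a person image from a (structure, appearance) pair. All shapes and
        layer choices are illustrative assumptions."""
        def __init__(self, dim=128):
            super().__init__()
            self.E_app = nn.Sequential(nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1))
            self.E_str = nn.Sequential(nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU())
            self.G = nn.Sequential(nn.ConvTranspose2d(2 * dim, 3, 4, 2, 1),
                                   nn.Tanh())

        def forward(self, img_a, img_b):
            app_b = self.E_app(img_b)            # appearance code of person b
            str_a = self.E_str(img_a)            # structure (pose/scene) of a
            app_map = app_b.expand(-1, -1, *str_a.shape[2:])
            # Cross-id synthesis: person a's structure with b's appearance.
            return self.G(torch.cat([str_a, app_map], dim=1))

Under this reading, a synthesized image `model(img_a, img_b)` keeps person a's pose and scene while carrying person b's appearance, and training a re-ID head jointly on real and synthesized crops is one plausible way to realize the online learning strategy mentioned above.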
Experiments

In this part, we conduct experiments on the two benchmark datasets, CUHK-SYSU and PRW. Besides, we also perform ablation studies on the different parts of our proposed model.

Conclusion

In this paper, we observed that existing research does not solve the cross-domain problem in person search very well. In real scenes, factors such as weather and different types of cameras cause the performance of some models to drop off a cliff. To address this issue, we propose a GAN-based Scene Synthesis framework for domain adaptive person search. Specifically, this is the first time that a GAN has been introduced inside a person search framework. We design an …

CRediT authorship contribution statement

Huibing Wang: Conceptualization, Methodology, Writing - original draft. Tianxiang Cui: Methodology, Validation. Mingze Yao: Validation. Huijuan Pang: Software. Yushan Du: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China Grants 62002041 and 62176037, the Liaoning Fundamental Research Funds for Universities Grant LJKQZ2021010, the Liaoning Doctoral Research Startup Fund Project Grant 2021-BS-075, and the Dalian Science and Technology Innovation Funds 2021JJ12GX028 and 2022JJ12GX019.

References (74)

L. Wu et al., Robust hashing for multi-view data: Jointly learning low-rank kernelized similarity consensus and hash functions, Image Vis. Comput. (2017)
H. Wang et al., Towards adaptive consensus graph: Multi-view clustering via graph collaboration, IEEE Trans. Multimed. (2022)
J. Xiao et al., IAN: The individual aggregation network for person search, Pattern Recogn. (2019)
H. Wang et al., Discriminative feature and dictionary learning with part-aware model for vehicle re-identification, Neurocomputing (2021)
K. Zeng et al., Energy clustering for unsupervised person re-identification, Image Vis. Comput. (2020)
W. Shi et al., IRANet: Identity-relevance aware representation for cloth-changing person re-identification, Image Vis. Comput. (2022)
J. Lv et al., Person re-identification with expanded neighborhoods distance re-ranking, Image Vis. Comput. (2020)
X. Xu et al., Rethinking data collection for person re-identification: Active redundancy reduction, Pattern Recogn. (2021)
R. Yao et al., GAN-based person search via deep complementary classifier with center-constrained triplet loss, Pattern Recogn. (2020)
Y. Zhao et al., Learning deep part-aware embedding for person retrieval, Pattern Recogn. (2021)
W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S.C.H. Hoi, H. Ling, Learning unsupervised video object segmentation…
Z. Wang, Q. Shi, C. Shen, A. van den Hengel, Bilinear programming for human activity recognition with unknown MRF…
Z. Ma, W. Lu, J. Yin, X. Zhang, Robust visual tracking via hierarchical convolutional features-based sparse learning…
H. Wang et al., Kernelized multiview subspace analysis by self-weighted learning, IEEE Trans. Multimed. (2021)
Y. Yan, J. Li, J. Qin, S. Bai, S. Liao, L. Liu, F. Zhu, L. Shao, Anchor-free person search, in: Proceedings of the…
B. Munjal, S. Amin, F. Tombari, F. Galasso, Query-guided end-to-end person search, in: Proceedings of the IEEE/CVF…
D. Chen, S. Zhang, J. Yang, B. Schiele, Norm-aware embedding for efficient person search, in: Proceedings of the…
H. Yao et al., Joint person objectness and repulsion for person search, IEEE Trans. Image Process. (2021)
C. Han, J. Ye, Y. Zhong, X. Tan, C. Zhang, C. Gao, N. Sang, Re-ID driven localization refinement for person search, in:…
C. Wang, B. Ma, H. Chang, S. Shan, X. Chen, TCTS: A task-consistent two-stage framework for person search, in:…
D. Chen et al., Person search by separated modeling and a mask-guided two-stream CNN model, IEEE Trans. Image Process. (2020)
G. Jiang et al., Tensorial multi-view clustering via low-rank constrained high-order graph learning, IEEE Trans. Circuits Syst. Video Technol. (2022)
G. Wang, Y. Yang, J. Cheng, J. Wang, Z. Hou, Color-sensitive person re-identification, in: Proceedings of the 28th…
Z. Yu et al., Apparel-invariant feature learning for person re-identification, IEEE Trans. Multimed. (2021)
X. Yang, F. Feng, W. Ji, M. Wang, T.-S. Chua, Deconfounded video moment retrieval with causal intervention, in:…
X. Yang et al., Video moment retrieval with cross-modal neural architecture search, IEEE Trans. Image Process. (2022)
J. Dong et al., Dual encoding for video retrieval by text, IEEE Trans. Pattern Anal. Mach. Intell. (2021)
B. Qian, Y. Wang, R. Hong, M. Wang, Adaptive data-free quantization, arXiv preprint arXiv:2303.06869…
T. Xiao, S. Li, B. Wang, L. Lin, X. Wang, Joint detection and identification feature learning for person search, in:…
L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian, Person re-identification in the wild, in: Proceedings of…
Y. Yan, Q. Zhang, B. Ni, W. Zhang, M. Xu, X. Yang, Learning context graph for person search, in: Proceedings of the…
Z. Li, D. Miao, Sequential end-to-end network for efficient person search, in: Proceedings of the AAAI Conference on…
D. Chen, S. Zhang, W. Ouyang, J. Yang, Y. Tai, Person search via a mask-guided two-stream CNN model, in: Proceedings of…
Y. Yan, J. Li, S. Liao, J. Qin, B. Ni, K. Lu, X. Yang, Exploring visual context for weakly supervised person search…
S. Liao, Y. Hu, X. Zhu, S.Z. Li, Person re-identification by local maximal occurrence representation and metric…
M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-identification by symmetry-driven accumulation…
H. Liu, Y. Wang, M. Wang, Y. Rui, Delving globally into texture and structure for image inpainting, in: Proceedings of…



Keywords:

None provided


Journal
Image and Vision Computing
ISSN: 0262-8856
Publisher: Elsevier BV