[Journal article][Full-length article]


Visible-infrared person re-identification using high utilization mismatch amending triplet loss

Authors:
Jianqing Zhu; Hanxiao Wu; Qianqian Zhao; Huanqiang Zeng; Xiaobin Zhu; Jingchang Huang; Canhui Cai

Publication year: 2023

Pages: 104797 - 104797
Publisher: Elsevier BV


Abstract:

Visible-infrared person re-identification (VIPR) is a task of retrieving a specific pedestrian monitored by cameras in different spectra. A dilemma of VIPR is how to reasonably use intra-modal pairs. Fully discarding intra-modal pairs causes a low utilization of training data, while using intra-modal pairs brings a danger of distracting a VIPR model's concentration on handling cross-modal pairs, harming the cross-modal similarity metric learning. To address this, a high utilization mismatch amending (HUMA) triplet loss function is proposed for VIPR. The key of HUMA is the multi-modal matching regularization (MMMR), which restricts variations of the distance matrices calculated from cross- and intra-modal pairs to cohere the cross- and intra-modal similarity metrics, allowing for a high utilization of training data and amending the adverse distractions of intra-modal pairs. In addition, to avoid the risk of harming feature discrimination caused by MMMR's preference for coherent similarity metrics, a novel separated loss function assignment (SLFA) strategy is designed to arrange MMMR well. Experimental results show that the proposed method is superior to state-of-the-art approaches.

Introduction

Given a query person image of the visible (or infrared) modality, visible-infrared person re-identification (VIPR) aims to retrieve person images from a gallery set of the opposite infrared (or visible) modality that have the same identity as the query image, as shown in Fig. 1. VIPR has excellent potential for intelligent video surveillance, such as visual tracking [1,2], trajectory prediction [3,4], and pedestrian activity analysis [5,6]. Like single-modal visible person re-identification [[7], [8], [9], [10], [11], [12]], VIPR also faces many challenges, such as viewpoint variations, low resolution, poor illumination, and pose variations. Even worse, compared with single-modal visible person re-identification, VIPR has to deal with a large modal gap between visible images and infrared images. As a result, VIPR is more challenging than single-modal visible person re-identification.

For visible person re-identification, the common way of learning a discriminative similarity metric is to use a triplet loss function cooperating with a hard mining or re-weighting strategy to shrink the distances of hard positive pairs of the same identity and enlarge the distances of hard negative pairs of different identities. Although this common way can be applied directly to VIPR, as done in [[13], [14], [15], [16], [17]], VIPR has its own dilemma. As illustrated in Fig. 2(a), the distances of intra-modal negative pairs are usually smaller than those of cross-modal negative pairs. As a result, the hard mining or re-weighting triplet loss function [13,18] is prone to optimize intra-modal negative pairs, which deviates the cross-modal similarity metric for VIPR. We call this dilemma the modal-mismatch problem.

To quantitatively explain the modal-mismatch problem, we calculate modal-mismatch ratios. A modal-mismatch ratio is the number of mini-batches optimizing intra-modal pairs divided by the total number of mini-batches in an epoch. Furthermore, there are two types of modal-mismatch ratios, namely, the positive modal-mismatch ratio and the negative modal-mismatch ratio, which correspond to positive intra-modal pairs of the same identity and negative intra-modal pairs of different identities, respectively.
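As a rough illustration of how such ratios could be tallied, the following PyTorch-style sketch checks, for each anchor in a mini-batch, whether hard mining selects an intra-modal hardest positive or hardest negative. The tensor names and the batch-level counting rule (a batch counts as "optimizing intra-modal pairs" if any anchor mines an intra-modal pair) are assumptions for illustration only; the excerpt does not specify them.

```python
import torch

def modal_mismatch_flags(features, labels, modalities):
    """Sketch: does hard mining in this mini-batch pick intra-modal pairs?
    features:   (n, d) embeddings of a mini-batch
    labels:     (n,) identity labels
    modalities: (n,) 0 = visible, 1 = infrared
    Returns two booleans: any intra-modal hardest positive / negative mined.
    """
    dist = torch.cdist(features, features)                  # (n, n) pairwise L2 distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)
    same_mod = modalities.unsqueeze(0) == modalities.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)

    pos_mask = same_id & ~eye                               # positives of any modality
    neg_mask = ~same_id                                     # negatives of any modality

    # Hardest positive = farthest positive; hardest negative = closest negative.
    hardest_pos = dist.masked_fill(~pos_mask, float('-inf')).argmax(dim=1)
    hardest_neg = dist.masked_fill(~neg_mask, float('inf')).argmin(dim=1)

    anchors = torch.arange(len(labels), device=features.device)
    pos_intra = same_mod[anchors, hardest_pos]              # mined positive is intra-modal?
    neg_intra = same_mod[anchors, hardest_neg]              # mined negative is intra-modal?
    return pos_intra.any().item(), neg_intra.any().item()
```

Accumulating these flags over all mini-batches of an epoch and dividing by the number of mini-batches would yield the positive and negative modal-mismatch ratios described above.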
As shown in Fig. 2(b) and Fig. 2(c), on both the SYSU-MM01 [19] and RegDB [20] datasets, as training with the hard mining triplet loss progresses, the negative modal-mismatch ratio decreases and stabilizes at around 50%. That is, at least half of the effort is spent on negative intra-modal pairs, which seriously biases the similarity metric for cross-modal VIPR. Another surprise is that even though cross-modal positive pairs tend to be farther apart than intra-modal positive pairs, the positive modal-mismatch ratio stabilizes at around 20% on the SYSU-MM01 [19] dataset and 10% on the RegDB [20] dataset. Consequently, the modal-mismatch problem greatly distracts the similarity metric learning for cross-modal VIPR.

To alleviate the modal-mismatch problem, existing work follows two strategies: (1) cross-modal-only optimization [[21], [22], [23]] and (2) joint cross-modal and intra-modal optimization [[24], [25], [26], [27], [28], [29], [30]]. The first strategy is intuitive for VIPR, which is essentially a cross-modal retrieval task. For example, Dai et al. [21] used visible (or infrared) samples as anchors, pulling the distances between anchors and infrared (or visible) positive samples and pushing the distances between anchors and infrared (or visible) negative samples. Liu et al. [22] proposed a hetero-center triplet (HC-Triplet) loss function, which calculates the center of every identity in each modality and optimizes cross-modal centers rather than cross-modal samples. However, the first strategy has a low utilization of training data, because only cross-modal pairs are emphasized while intra-modal pairs are ignored, which is not conducive to VIPR performance in terms of using more training data.

Compared with the first strategy, the second strategy additionally considers intra-modal pairs. For example, Liu et al. [24] designed a bidirectional triplet constrained top-push ranking (BTTR) loss function. The BTTR loss function not only deals with cross-modal/intra-modal triplets but also forces positive cross-modal pairs to be closer than negative intra-modal pairs. Zhao et al. [25] proposed a global triplet loss function that optimizes pairs irrespective of whether they are cross-modal or intra-modal. However, they also noticed that cross-modal pairs are more important for VIPR, so they had to apply a triplet loss again to emphasize cross-modal pairs, bringing extra computation. Li et al. [27] also globally optimized pairs; the main difference is that they designed an un-nesting method to reduce the computation of global optimization. Cai et al. [26] calculated the center of each identity in each modality, and then pulled each identity's samples close to its visible center and infrared center, as well as pushed each identity's samples far from the visible and infrared centers of other identities. Although the second strategy has a higher utilization of training data than the first strategy, optimizing intra-modal pairs could distract a VIPR model's concentration on handling cross-modal pairs. Therefore, how to reasonably use intra-modal pairs for VIPR is still an open problem.

In this paper, our motivation is that since VIPR is a cross-modal task, regardless of modalities, it carries an expectation: features of different instances of the same individual should share the same clustering center so that features of different instances of the same individual have similar significance. If we simultaneously optimize the cross-modal similarity metric and the intra-modal similarity metric, and learn features for which the difference between the cross-modal and intra-modal similarity metrics is small, this helps meet the VIPR expectation and thus reduces the danger brought by intra-modal pairs. To this end, we propose a high utilization mismatch amending (HUMA) triplet loss function for VIPR. Our HUMA is also a cross-modal and intra-modal simultaneous optimization method. However, unlike existing methods [[24], [25], [26], [27]], we design a multi-modal matching regularization (MMMR) approach to maintain consistency between the cross-modal and intra-modal similarity metrics. The MMMR restricts variations of the distance matrices calculated from cross- and intra-modal pairs to amend the adverse distractions of intra-modal pairs. Furthermore, we consider that MMMR's preference for a modal-robust similarity metric may hinder learning discriminative features, so we design a separated loss function assignment (SLFA) strategy.
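The exact formulation of MMMR is not included in this excerpt. The following PyTorch-style sketch is only a guess at one possible instantiation, assuming the common P×K cross-modal sampling (each identity contributes the same number of visible and infrared images, ordered identically in the two tensors) and an L1 penalty between cross-modal and intra-modal distance matrices.

```python
import torch

def mmmr_penalty_sketch(feat_v, feat_t):
    """Hypothetical multi-modal matching regularization (MMMR) sketch.
    feat_v: (m, d) visible features; feat_t: (m, d) infrared features,
    assumed to be aligned so that row i of both tensors belongs to the same
    identity slot (an assumption, not the paper's exact protocol).
    The penalty discourages the cross-modal distance matrix from drifting
    away from the two intra-modal distance matrices.
    """
    d_vt = torch.cdist(feat_v, feat_t)   # cross-modal pairwise distances
    d_vv = torch.cdist(feat_v, feat_v)   # visible intra-modal distances
    d_tt = torch.cdist(feat_t, feat_t)   # infrared intra-modal distances
    # Cohere the similarity metrics by penalizing their element-wise gaps.
    return (d_vt - d_vv).abs().mean() + (d_vt - d_tt).abs().mean()
```

Whatever the exact form, the intended effect described above is the same: intra-modal pairs are still used (high data utilization), but they are constrained to agree with the cross-modal metric rather than pulling it away.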
The main contributions of this paper are summarized as follows.
• A high utilization mismatch amending (HUMA) triplet loss function is proposed for VIPR. The key of HUMA is a multi-modal matching regularization (MMMR), which fully utilizes the training data and reduces the adverse distractions from intra-modal pairs.
• A novel separated loss function assignment (SLFA) strategy is designed to avoid the risk of harming feature discrimination caused by MMMR's preference for coherent similarity metrics.
• Extensive experiments on two datasets (i.e., SYSU-MM01 [19] and RegDB [20]) show that the proposed approach is superior to state-of-the-art VIPR methods.

The key novelty of this paper lies in the multi-modal matching regularization (MMMR) of the high utilization mismatch amending (HUMA) triplet loss function. The MMMR plays a significant role in fully utilizing the training data and effectively addressing the adverse distractions caused by intra-modal pairs. By incorporating the MMMR, the HUMA loss function achieves good VIPR performance. For example, on the RegDB dataset, with the help of MMMR, R1 improves from 82.4% to 92.8% and from 79.2% to 91.2% in the visible-to-infrared and infrared-to-visible retrieval modes, respectively.

Section snippets

Loss function designs

Many VIPR methods [[14], [15], [16], [17]] apply triplet losses [13,18], but these triplet losses were originally designed for single-modal person re-identification. Directly using single-modal triplet losses therefore suffers from the modal-mismatch problem. Specifically, because intra-modal pairs are usually closer than cross-modal pairs, there is a substantial probability of optimizing intra-modal pairs. For example, regarding pushing the closest negative pairs, according to our statistics shown …

Short review of triplet losses

The most commonly used triplet loss for visible person re-identification is the hard mining triplet (HMT) [18] loss function. The HMT loss function pulls the farthest positive pairs and pushes the closest negative pairs within a mini-batch of training data. The HMT loss function is formulated as follows:

$L_{\mathrm{HMT}} = \frac{1}{n}\sum_{i=1}^{n}\left[\max_{x \in P_i}\lVert x_i - x\rVert_2 - \min_{x \in N_i}\lVert x_i - x\rVert_2 + \alpha\right]_+$,

where $n$ is the number of samples in a mini-batch; $[\cdot]_+$ equals $\max(\cdot, 0)$; $\alpha > 0$ is a manually set margin; and $P_i$ and $N_i$ respectively denote the positive set and the negative set of the anchor $x_i$.
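For concreteness, a minimal PyTorch-style sketch of this hard mining triplet loss is given below. It follows the formula above (batch-hard mining with Euclidean distances); the function name and the assumption that every anchor has at least one positive and one negative in the batch (e.g., P×K sampling) are mine, not the paper's.

```python
import torch
import torch.nn.functional as F

def hmt_loss(features, labels, margin=0.3):
    """Hard mining triplet (HMT) loss sketch following the formula above.
    features: (n, d) mini-batch embeddings; labels: (n,) identity labels.
    For every anchor, take its farthest positive and closest negative,
    then apply a hinge with margin alpha.
    """
    dist = torch.cdist(features, features)                  # (n, n) Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)

    pos_mask = same_id & ~eye                               # positives (exclude the anchor itself)
    neg_mask = ~same_id                                     # negatives

    # Farthest positive and closest negative per anchor (batch-hard mining).
    hardest_pos = dist.masked_fill(~pos_mask, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(~neg_mask, float('inf')).min(dim=1).values

    return F.relu(hardest_pos - hardest_neg + margin).mean()
```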
Experiment and analysis

In this section, we evaluate our HUMA method on two well-known visible-infrared pedestrian datasets, namely, the SYSU-MM01 [19] and RegDB [20] datasets. For performance metrics, the commonly used mean average precision (mAP) [[51], [52], [53]], mean inverse negative penalty (mINP) [13], and cumulative match characteristic (CMC) [9,54,55] curve are applied. R1 denotes the rank-1 identification rate on a CMC curve.
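As a reminder of what R1 measures, a bare-bones sketch is shown below. The official SYSU-MM01/RegDB evaluation protocols include additional rules (e.g., camera constraints and repeated random gallery splits) that are omitted here; only the basic rank-1 computation is illustrated.

```python
import torch

def cmc_rank1(query_feats, query_ids, gallery_feats, gallery_ids):
    """Rank-1 (R1) identification rate: the fraction of queries whose
    nearest gallery sample shares their identity."""
    dist = torch.cdist(query_feats, gallery_feats)   # (num_query, num_gallery)
    nearest = dist.argmin(dim=1)                     # index of the closest gallery sample
    hits = (gallery_ids[nearest] == query_ids).float()
    return hits.mean().item()
```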
Conclusion

In this paper, we design a high utilization mismatch amending (HUMA) triplet loss for visible-infrared person re-identification (VIPR). The HUMA triplet loss contains two components, namely, (1) a cross- and intra-modal triplet (CIMT) loss and (2) a multi-modal matching regularization (MMMR). The CIMT component acquires high data utilization by simultaneously optimizing cross-modal pairs and intra-modal pairs. The MMMR component effectively addresses the adverse distractions of intra-modal pairs, …
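The exact CIMT formulation is not shown in this excerpt. As a rough sketch of the idea of optimizing cross-modal and intra-modal pairs simultaneously, one could mine hardest pairs separately within each pair type and sum the resulting hinge terms; the splitting rule and equal weighting below are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def cimt_loss_sketch(features, labels, modalities, margin=0.3):
    """Assumed cross- and intra-modal triplet (CIMT) style loss: hard mining
    restricted to cross-modal pairs plus hard mining restricted to intra-modal
    pairs, so both kinds of training pairs contribute to the objective."""
    dist = torch.cdist(features, features)
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)
    same_mod = modalities.unsqueeze(0) == modalities.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)

    def hinge(pos_mask, neg_mask):
        # Farthest positive minus closest negative within the given pair type.
        pos = dist.masked_fill(~pos_mask, float('-inf')).max(dim=1).values
        neg = dist.masked_fill(~neg_mask, float('inf')).min(dim=1).values
        return F.relu(pos - neg + margin).mean()

    cross = hinge(same_id & ~same_mod, ~same_id & ~same_mod)      # cross-modal pairs only
    intra = hinge(same_id & same_mod & ~eye, ~same_id & same_mod)  # intra-modal pairs only
    return cross + intra
```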
CRediT authorship contribution statement

Jianqing Zhu: Conceptualization, Methodology, Investigation, Software, Validation, Writing – original draft, Funding acquisition. Hanxiao Wu: Methodology, Investigation, Software, Validation, Writing – original draft. Qianqian Zhao: Investigation, Software, Validation, Writing – review & editing. Huanqiang Zeng: Methodology, Writing – review & editing, Funding acquisition, Supervision, Investigation. Xiaobin Zhu: Software, Validation, Writing – review & editing. Jingchang Huang: Validation, …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant 61976098, in part by the National Key R&D Program of China under Grant 2021YFE0205400, in part by the Natural Science Foundation for Outstanding Young Scholars of Fujian Province under Grant 2022J06023, in part by the High-level Talent Innovation and Entrepreneurship Project of Quanzhou City under Grant 2023C013R, and in part by the Collaborative Innovation Platform Project of …

References (73)

A. Kerim et al., Using synthetic data for person tracking under adverse weather conditions, Image Vis. Comput. (2021)
R. Wang et al., Multi-information-based convolutional neural network with attention mechanism for pedestrian trajectory prediction, Image Vis. Comput. (2021)
Y. Chen et al., Pose-guided counterfactual inference for occluded person re-identification, Image Vis. Comput. (2022)
W. Shi et al., Iranet: identity-relevance aware representation for cloth-changing person re-identification, Image Vis. Comput. (2022)
X. Cai et al., Dual-modality hard mining triplet-center loss for visible infrared person re-identification, Knowl.-Based Syst. (2021)
W. Li et al., Unified batch all triplet loss for visible-infrared person re-identification
P. Zhang et al., Beyond modality alignment: learning part-level representation for visible-infrared person re-identification, Image Vis. Comput. (2021)
Z. Sun et al., A survey of multiple pedestrian tracking based on tracking-by-detection framework, IEEE Trans. Circ. Syst. Video Technol. (2020)
X. Liu et al., Trajectorycnn: a new spatio-temporal feature learning network for human motion prediction, IEEE Trans. Circ. Syst. Video Technol. (2020)
K. Chen et al., A semisupervised recurrent convolutional attention model for human activity recognition, IEEE Trans. Neural Netw. Learn. Syst. (2020)
C. Xiao et al., Deepseg: deep-learning-based activity segmentation framework for activity recognition using wifi, IEEE Internet Things J. (2021)
D. Tao et al., Deep multi-view feature learning for person re-identification, IEEE Trans. Circ. Syst. Video Technol. (2017)
J. Zhu et al., Deep hybrid similarity learning for person re-identification, IEEE Trans. Circ. Syst. Video Technol. (2018)
Z. Zeng et al., Illumination-adaptive person re-identification, IEEE Trans. Multimedia (2020)
P. Wang et al., Horeid: deep high-order mapping enhances pose alignment for person re-identification, IEEE Trans. Image Process. (2021)
M. Ye et al., Deep learning for person re-identification: a survey and outlook, IEEE Trans. Pattern Anal. Mach. Intell. (2022)
M. Ye et al., Dynamic dual-attentive aggregation learning for visible-infrared person re-identification
G.-A. Wang et al., Cross-modality paired-images generation for rgb-infrared person re-identification
C. Seokeon et al., Hi-cmd: hierarchical cross-modality disentanglement for visible-infrared person re-identification
W. Hu et al., Adversarial decoupling and modality-invariant representation learning for visible-infrared person re-identification, IEEE Trans. Circ. Syst. Video Technol. (2022)
A. Hermans et al., In defense of the triplet loss for person re-identification, arXiv preprint (2017)
A. Wu et al., Rgb-infrared cross-modality person re-identification
D.T. Nguyen et al., Person recognition system based on a combination of body images from visible light and thermal cameras, Sensors (2017)
P. Dai et al., Cross-modality person re-identification with generative adversarial training
H. Liu et al., Parameter sharing exploration and hetero-center triplet loss for visible-thermal person re-identification, IEEE Trans. Multimedia (2021)
Y. Feng et al., Llm: learning cross-modality person re-identification via low-rank local matching, IEEE Sign. Proc. Lett. (2021)
H. Liu et al., Sfanet: a spectrum-aware feature augmentation network for visible-infrared person reidentification, IEEE Trans. Neural Netw. Learn. Syst. (2023)
Y.-B. Zhao et al., Hpiln: a feature learning framework for cross-modality person re-identification, IET Image Process. (2019)
M. Ye et al., Visible thermal person re-identification via dual-constrained top-ranking
M. Ye et al., Bi-directional center-constrained top-ranking for visible thermal person re-identification, IEEE Trans. Inform. Forens. Security (2020)
D. Zhang et al., Dual mutual learning for cross-modality person re-identification, IEEE Trans. Circ. Syst. Video Technol. (2022)
M. Ye et al., Hierarchical discriminative learning for visible thermal person re-identification
Y. Lu et al., Cross-modality person re-identification with shared-specific feature transfer
M. Ye et al., Dynamic tri-level relation mining with attentive graph for visible infrared re-identification, IEEE Trans. Inform. Forens. Security (2022)
L. Zhang et al., Global-local multiple granularity learning for cross-modality visible-infrared person reidentification, IEEE Trans. Neural Netw. Learn. Syst. (2021)
Z. Wei et al., Flexible body partition-based adversarial learning for visible infrared person re-identification, IEEE Trans. Neural Netw. Learn. Syst. (2022)

(List truncated; the full article cites 73 references.)



Keywords:

None


Journal
Image and Vision Computing
ISSN: 0262-8856
From: Elsevier BV