Video-based Person Re-identification with Spatial and Temporal Memory Networks

ICCV 2021

Abstract

Video-based person re-identification (reID) aims to retrieve person videos with the same identity as a query person across multiple cameras. Spatial and temporal distractors in person videos, such as background clutter and partial occlusions over frames, respectively, make this task much more challenging than image-based person reID. We observe that (a) spatial distractors appear consistently in particular locations, and (b) temporal distractors show several typical patterns, e.g., partial occlusions occur in the first few frames; such patterns provide informative cues for predicting which frames to focus on (i.e., temporal attentions). Based on this, we introduce Spatial and Temporal Memory Networks (STMN). The spatial memory stores features of spatial distractors that frequently emerge across video frames, while the temporal memory saves attentions optimized for typical temporal patterns in person videos. We leverage the spatial and temporal memories to refine frame-level person representations and to aggregate the refined frame-level features into a sequence-level person representation, respectively, effectively handling spatial and temporal distractors in person videos. We also introduce a memory spread loss that prevents our model from attending to only a few items in the memories. Experimental results on standard benchmarks, including MARS, DukeMTMC-VideoReID, and LS-VID, demonstrate the effectiveness of our method.


Approach

STMN mainly consists of three components: an encoder, a spatial memory, and a temporal memory. For each frame, the encoder extracts a person representation and two query maps, one used to access the spatial memory and the other the temporal memory.
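The snippet below is a minimal sketch of this encoder interface, not the released implementation: the class name STMNEncoder, the query heads, and the feature dimensions are assumptions made for exposition, with a ResNet-50 backbone taken as an example per-frame feature extractor.

```python
# Illustrative encoder sketch: one CNN backbone per frame produces a person
# feature map plus two query maps (spatial / temporal memory addressing).
import torch
import torch.nn as nn
import torchvision.models as models

class STMNEncoder(nn.Module):
    def __init__(self, feat_dim=2048, query_dim=256):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Keep everything up to the last convolutional block (drop pool/fc).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # 1x1 convolutions producing the two query maps from the same feature map.
        self.spatial_query_head = nn.Conv2d(feat_dim, query_dim, kernel_size=1)
        self.temporal_query_head = nn.Conv2d(feat_dim, query_dim, kernel_size=1)

    def forward(self, frames):
        # frames: (B*T, 3, H, W) -- frames of a tracklet, flattened over time.
        feat_map = self.backbone(frames)                 # (B*T, 2048, h, w)
        q_spatial = self.spatial_query_head(feat_map)    # (B*T, 256, h, w)
        q_temporal = self.temporal_query_head(feat_map)  # (B*T, 256, h, w)
        return feat_map, q_spatial, q_temporal
```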
The spatial memory stores features for scene details that frequently appear across video frames, such as street lights, trees, and concrete pavers. We retrieve such features from the spatial memory using the corresponding query map and use them to refine the person representation, removing information that interferes with identifying persons.
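The following sketch shows one way such a spatial-memory read could look: each spatial location addresses the memory via a softmax over key similarities, and the retrieved scene feature is used to refine the frame feature. The memory size and the refinement rule (a simple subtraction here) are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a spatial-memory read followed by feature refinement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialMemory(nn.Module):
    def __init__(self, num_items=20, key_dim=256, val_dim=2048):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_items, key_dim))    # addressing keys
        self.values = nn.Parameter(torch.randn(num_items, val_dim))  # distractor features

    def forward(self, feat_map, q_spatial):
        # feat_map: (N, C, h, w), q_spatial: (N, K, h, w)
        N, K, h, w = q_spatial.shape
        q = q_spatial.flatten(2).transpose(1, 2)          # (N, h*w, K)
        attn = F.softmax(q @ self.keys.t(), dim=-1)       # (N, h*w, M) addressing weights
        retrieved = attn @ self.values                    # (N, h*w, C) retrieved scene features
        retrieved = retrieved.transpose(1, 2).reshape(N, -1, h, w)
        # Illustrative refinement: suppress scene information in the frame feature.
        return feat_map - retrieved, attn
```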
The temporal memory saves attentions optimized for typical temporal patterns that repeatedly occur in person videos. We access the temporal memory with the corresponding query map and use the output to aggregate the refined frame-level features into a sequence-level person representation.
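A minimal sketch of this temporal-memory read and aggregation is given below, assuming each memory item stores a temporal attention pattern over T frames and the query is a single sequence-level vector; the actual parameterization in the paper may differ.

```python
# Hedged sketch of a temporal-memory read producing frame attentions,
# followed by attention-weighted temporal aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalMemory(nn.Module):
    def __init__(self, num_items=5, key_dim=256, seq_len=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_items, key_dim))
        # Each memory item stores a temporal attention pattern over T frames.
        self.values = nn.Parameter(torch.randn(num_items, seq_len))

    def forward(self, refined_feats, q_temporal):
        # refined_feats: (B, T, C) frame-level features after spatial refinement
        # q_temporal:    (B, K)    sequence-level query (e.g., pooled over frames)
        match = F.softmax(q_temporal @ self.keys.t(), dim=-1)   # (B, M) item weights
        temporal_attn = F.softmax(match @ self.values, dim=-1)  # (B, T) frame attentions
        # Weighted sum over frames -> sequence-level person representation.
        return (temporal_attn.unsqueeze(-1) * refined_feats).sum(dim=1)
```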
We also propose a memory spread loss that prevents our model from repeatedly accessing only a few memory items, encouraging all items to be used. We train our model end-to-end using memory spread, triplet, and cross-entropy terms.
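As one way to realize this goal, the sketch below pushes the batch-averaged addressing distribution toward uniform; the exact memory spread loss in the paper may be defined differently, and the weight lambda_mem in the comment is hypothetical.

```python
# Hedged sketch of a loss discouraging the model from addressing only a few
# memory items: penalize deviation of average item usage from uniform.
import torch
import torch.nn.functional as F

def memory_spread_loss(attn):
    # attn: (N, M) addressing weights over M memory items (rows sum to 1).
    usage = attn.mean(dim=0)                               # average usage per item
    uniform = torch.full_like(usage, 1.0 / usage.numel())
    return F.kl_div(usage.log(), uniform, reduction='batchmean')

# Total training objective, as described above (lambda_mem is an assumed weight):
# loss = cross_entropy_loss + triplet_loss + lambda_mem * memory_spread_loss(attn)
```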

Experiment

We compare STMN with state-of-the-art methods on standard video-based reID benchmarks, including MARS, DukeMTMC-VideoReID, and LS-VID. For fair comparisons, we classify the methods into two groups, depending on whether they follow the RRS or all-frames strategy for evaluation (please refer to the paper for details). We set a new state of the art on these benchmarks. The results of STMN using RRS even surpass those of previous methods, e.g., COSAM, M3D, and GLTR, under the all-frames setting.

Citation

@inproceedings{eom2021video,
  title={Video-based Person Re-identification with Spatial and Temporal Memory Networks},
  author={Eom, Chanho and Lee, Geon and Lee, Junghyup and Ham, Bumsub},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={12036--12045},
  year={2021}
}

Acknowledgements

This research was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2019R1A2C2084816) and the Yonsei University Research Fund of 2021 (2021-22-0001).