Fast Self-Supervised Depth and Mask Aware Association for Multi-Object Tracking

The paper “Fast Self-Supervised Depth and Mask Aware Association for Multi-Object Tracking” by Milad Khanchi, Maria Amer, and Charalambos Poullis has been accepted for publication at the British Machine Vision Conference (BMVC) 2025.

TL;DR: SelfTrEncMOT is a novel multi-object tracking framework that integrates zero-shot monocular depth estimation and promptable segmentation masks into a compact self-supervised encoder. Instead of relying on computationally expensive segmentation-mask IoU, the encoder refines fused depth–segmentation embeddings into stable object representations, which are then combined with appearance and motion cues. This enables more reliable association in challenging scenarios involving occlusion, dense crowds, or visually similar objects.
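To make the association idea concrete, here is a minimal, hypothetical PyTorch sketch — not the authors' implementation. A tiny encoder fuses a per-object depth crop and its segmentation mask into a normalized embedding; embedding distances are then blended with a motion cost and resolved by Hungarian matching. The encoder architecture, the `w_emb` blending weight, and the cost formulation are all illustrative assumptions; refer to the official repository linked below for the actual method.

```python
# Illustrative sketch only (assumed design, not the paper's code): fuse a
# depth crop and a binary mask crop into a compact embedding, then blend
# embedding distance with a motion cost before Hungarian matching.
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


class DepthMaskEncoder(nn.Module):
    """Hypothetical fusion encoder: stacks a depth crop and a mask crop
    (2 input channels) and maps them to an L2-normalized embedding."""

    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, depth_mask_crops: torch.Tensor) -> torch.Tensor:
        # depth_mask_crops: (N, 2, H, W) -> (N, emb_dim), unit-normalized
        return F.normalize(self.net(depth_mask_crops), dim=-1)


def associate(track_emb, det_emb, motion_cost, w_emb: float = 0.5):
    """Blend cosine embedding distance with a motion/IoU cost and solve
    the assignment with the Hungarian algorithm (w_emb is an assumption)."""
    emb_cost = 1.0 - track_emb @ det_emb.T            # cosine distance
    cost = w_emb * emb_cost + (1.0 - w_emb) * motion_cost
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(rows.tolist(), cols.tolist()))


# Usage with random tensors standing in for real depth/mask crops:
enc = DepthMaskEncoder()
tracks = enc(torch.randn(4, 2, 64, 64))   # embeddings of 4 existing tracks
dets = enc(torch.randn(5, 2, 64, 64))     # embeddings of 5 new detections
motion = torch.rand(4, 5)                 # e.g. 1 - IoU of Kalman predictions
print(associate(tracks, dets, motion))    # matched (track, detection) pairs
```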

On SportsMOT and DanceTrack, SelfTrEncMOT achieves state-of-the-art results among tracking-by-detection methods, significantly outperforming prior approaches on identity-preservation metrics such as HOTA, IDF1, and AssA. On simpler benchmarks like MOT17, it remains competitive while requiring only a single GPU for training, in contrast to joint detection-ReID approaches that demand far greater computational resources.

Research paper: Coming soon
Source code: https://github.com/Milad-Khanchi/SelfTrEncMOT