Real-time tracking is one of the most challenging problems in computer vision. Most Transformer-based trackers require substantial computation and storage, which prevents these otherwise robust trackers from running in real time on resource-constrained devices. In this work, we propose a lightweight tracker, AnteaTrack. To localize the target more precisely, we develop a scaling-invariant max-filtering operator that combines a sliding window with local max-pooling; it filters the suspected target out of the feature map and augments its representation while suppressing the background. To obtain a tighter bounding box, we employ Pixel-Shuffle to increase the resolution of the feature map and obtain a more fine-grained representation of the target. AnteaTrack runs in real time at 47 frames per second (FPS) on a CPU. We evaluated AnteaTrack on five datasets, and extensive experiments show that it provides the most efficient solution among comparable CPU real-time trackers. The code will be available at https://github.com/cnchange/AnteaTrack.
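To make the two building blocks named above more concrete, the following is a minimal pure-Python sketch and not the AnteaTrack implementation. It assumes a plain single-scale max filter (a sliding window with local max-pooling that keeps only local peaks and zeroes everything else) and the common Pixel-Shuffle layout that rearranges a (C·r², H, W) feature map into (C, H·r, W·r). The function names `local_max_filter` and `pixel_shuffle`, the window size `k`, and the upscale factor `r` are hypothetical choices made for illustration; the scaling-invariant variant described in the abstract is not reproduced here.

```python
def local_max_filter(feat, k=3):
    """Keep entries equal to the maximum of their k x k neighbourhood; zero the rest.

    Illustrative single-scale filter (an assumption, not the paper's operator).
    `feat` is a 2D nested list of response values.
    """
    h, w = len(feat), len(feat[0])
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Sliding window clipped at the borders (same effect as -inf padding).
            window = [
                feat[y][x]
                for y in range(max(0, i - half), min(h, i + half + 1))
                for x in range(max(0, j - half), min(w, j + half + 1))
            ]
            # Local max-pooling: only a local peak survives, background is suppressed.
            if feat[i][j] == max(window):
                out[i][j] = feat[i][j]
    return out


def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Each group of r*r channels is spread over an r x r block of output pixels,
    trading channel depth for spatial resolution.
    """
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # Source channel is selected by the position inside the r x r block.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = feat[src][y // r][x // r]
    return out


if __name__ == "__main__":
    response = [[1, 2, 1], [0, 5, 0], [3, 0, 4]]
    print(local_max_filter(response, k=3))

    channels = [[[1]], [[2]], [[3]], [[4]]]  # 4 channels of size 1 x 1
    print(pixel_shuffle(channels, r=2))
```

In this sketch the max filter leaves a sparse map containing only peak responses, and the shuffle step doubles the spatial resolution along each axis for `r=2`, which is the general idea behind using a higher-resolution map to predict a tighter box.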
Keywords
real-time tracking; lightweight transformers; attention mechanism; deep learning
Subject
Computer Science and Mathematics, Computer Vision and Graphics
Copyright
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.