Abstract: Visual object tracking (VOT), which aims to track a target object through a continuous video, is a fundamental and critical task in computer vision. However, the reliance on third-party resources (e.g., datasets) for training poses hidden threats to the security of VOT models. In this paper, we reveal that VOT models are vulnerable to a poison-only, targeted backdoor attack in which the adversary can achieve arbitrary tracking predictions by manipulating only part of the training data. Specifically, we first define and formulate three variants of the targeted attack: size-manipulation, trajectory-manipulation, and hybrid attacks. To implement them, we introduce Random Video Poisoning (RVP), a novel poison-only strategy that exploits the temporal correlations within video data by poisoning entire video sequences. Extensive experiments demonstrate that RVP effectively injects controllable backdoors, enabling precise manipulation of tracking behavior upon trigger activation while maintaining high performance on benign data, thus ensuring stealthiness. Our findings not only expose significant vulnerabilities but also suggest that the underlying principles could be adapted for beneficial uses, such as dataset watermarking for copyright protection.
Keywords: visual object tracking; backdoor attack; computer vision; data security; AI safety
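To make the sequence-level, poison-only idea concrete, the following is a minimal sketch of what poisoning entire video sequences could look like. It is illustrative only: the trigger pattern (a corner patch), patch size, poisoning rate, and the box-shrinking rule standing in for the size-manipulation variant are all assumptions for exposition, not the paper's exact RVP construction.

```python
import numpy as np

def stamp_trigger(frame, size=16):
    """Stamp a hypothetical trigger (a white patch) in the frame's top-left corner."""
    frame = frame.copy()
    frame[:size, :size] = 255
    return frame

def poison_boxes(boxes, scale=0.5):
    """Illustrative size-manipulation target: shrink each ground-truth box
    about its center by a fixed factor. boxes: (T, 4) array of [x, y, w, h]."""
    poisoned = boxes.astype(float).copy()
    cx = poisoned[:, 0] + poisoned[:, 2] / 2
    cy = poisoned[:, 1] + poisoned[:, 3] / 2
    poisoned[:, 2] *= scale
    poisoned[:, 3] *= scale
    poisoned[:, 0] = cx - poisoned[:, 2] / 2
    poisoned[:, 1] = cy - poisoned[:, 3] / 2
    return poisoned

def poison_dataset(videos, labels, rate=0.1, rng=None):
    """Poison-only attack sketch: randomly pick a fraction of training videos
    and poison each selected sequence in full, so the trigger and the
    manipulated labels stay temporally consistent across every frame."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(videos), size=int(rate * len(videos)), replace=False)
    for i in idx:
        videos[i] = [stamp_trigger(f) for f in videos[i]]
        labels[i] = poison_boxes(labels[i])
    return videos, labels
```

Poisoning whole sequences rather than isolated frames is what lets the backdoor exploit the temporal correlations the abstract refers to; a trajectory-manipulation variant would replace the box-shrinking rule with a per-frame shift of the box centers.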