TransVisDrone: Spatio-Temporal Transformers for Drone-to-Drone Detection in Aerial Videos
Tushar Sangam
Ishan R. Dave
Waqas Sultani
Mubarak Shah
[Paper]
[GitHub]

Abstract

Drone-to-drone detection using a visual feed has crucial applications, such as detecting drone collisions, detecting drone attacks, or coordinating flight with other drones. However, existing methods are computationally costly, follow non-end-to-end optimization, and have complex multi-stage pipelines, making them less suitable for real-time deployment on edge devices. In this work, we propose a simple yet effective framework, TransVisDrone, that provides an end-to-end solution with higher computational efficiency. We utilize the CSPDarkNet-53 network to learn object-related spatial features and the VideoSwin model to improve drone detection in challenging scenarios by learning spatio-temporal dependencies of drone motion. Our method achieves state-of-the-art performance on three challenging real-world datasets (Average Precision@0.5IOU): NPS 0.95, FLDrones 0.75, and AOT 0.80, with higher throughput than previous methods. We also demonstrate its deployment capability on edge devices and its usefulness in detecting drone-collision (encounter) scenarios.



Qualitative Visualizations


[Slides]

Method Overview

  • The method works in an online fashion
  • A clip is sampled from continuous drone footage
  • Temporally consistent augmentations are then applied to the clip
  • Spatial features are extracted from each individual frame of the clip
  • Efficient spatio-temporal attention is applied using 3D Swin Transformer layers
  • Post-processing with NMS removes false positives and low-confidence detections
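The final post-processing step above can be sketched as standard non-maximum suppression: drop low-confidence boxes, then greedily keep the highest-scoring box and discard any remaining box that overlaps it too much. The sketch below is a minimal illustration of this idea, not the authors' exact implementation; the thresholds and box format (`[x1, y1, x2, y2]`) are assumptions for the example.

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_thresh=0.5, score_thresh=0.3):
    """detections: list of (box, score); returns the kept detections."""
    # Discard low-confidence detections first.
    dets = [d for d in detections if d[1] >= score_thresh]
    # Process boxes from highest to lowest confidence.
    dets.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in dets:
        # Keep a box only if it does not heavily overlap an already-kept box.
        if all(iou(box, k_box) < iou_thresh for k_box, _ in kept):
            kept.append((box, score))
    return kept
```

For example, two near-duplicate detections of the same drone collapse to the single higher-scoring one, while a detection of a second drone elsewhere in the frame is retained.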

 [GitHub]


Paper and Supplementary Material

Tushar Sangam, Ishan Dave, Waqas Sultani, Mubarak Shah.
TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos
In IEEE International Conference on Robotics and Automation (ICRA), 2023, London.
(hosted on ArXiv)


[Bibtex]


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.