Overview
Why UAV dual cognition is necessary
Multimodal large language models have made strong progress on general image understanding, video understanding, and visual question answering, yet systematic evaluation of UAV embodied intelligence remains limited. Unlike an agent in a ground-view setting, a UAV moves continuously in open 3D space and must reason not only about external targets, obstacles, and reachable directions, but also about its own position, current flight behavior, and future viewpoint changes. We therefore treat self-aware reasoning and environment-aware reasoning as two complementary cognitive prerequisites for aerial embodied intelligence, rather than as isolated downstream tasks.
UAV-DualCog is designed around this dual-cognition view. At the capability level, it organizes the benchmark into self-aware and environment-aware task lines; at the observation level, it spans both image and video settings. The benchmark is supported by a fully automated toolchain that starts from semantic point clouds and landmark assets, then builds multi-view image QA, hierarchical flight-behavior video QA, and unified experiment outputs. In the current stable release, UAV-DualCog covers 12 AirSim scenes, 512 valid landmarks, 4,096 image QA samples, and 2,048 video QA samples, and it exposes clear performance gaps across semantic answering, spatial grounding, behavior recognition, and temporal localization.
Reviewer note: for the clearest reading path, please start from this homepage overview, then proceed in order through Benchmark, Construction, Evaluation, Leaderboard, and Analysis. The Usage page serves as the primary entry point for reproduction-oriented inspection.