Spatiotemporal Reasoning Benchmark

UAV-DualCog

A benchmark for aerial multi-view spatiotemporal reasoning that explicitly evaluates self-state cognition and environment-state cognition under both image and video settings.

Paper & Release: Under peer review
Code
Dataset
ModelScope: Available
Hugging Face: Preparing

Overview

Why UAV dual cognition is necessary

Multimodal large language models have made strong progress on general image understanding, video understanding, and visual question answering, yet systematic evaluation remains limited for UAV embodied intelligence. Unlike ground-view settings, a UAV moves continuously in open 3D space and must reason not only about external targets, obstacles, and reachable directions, but also about its own position, current flight behavior, and future viewpoint changes. We therefore treat self-aware reasoning and environment-aware reasoning as two complementary cognitive prerequisites for aerial embodied intelligence, rather than as isolated downstream tasks.

UAV-DualCog is designed around that dual-cognition view. At the capability level, it organizes the benchmark into self-aware and environment-aware task lines; at the observation level, it spans both image and video settings. The benchmark is supported by a fully automated toolchain that starts from semantic point clouds and landmark assets, then builds multi-view image QA, hierarchical flight-behavior video QA, and unified experiment outputs. In the current stable release, UAV-DualCog covers 12 AirSim scenes, 512 released landmarks, 4,096 image QA samples, and 2,048 video QA samples, and it exposes clear gaps between semantic answering, spatial grounding, behavior recognition, and temporal localization.

Reviewer note: for the clearest reading path, please start from this homepage overview, then proceed in order through Benchmark, Construction, Evaluation, Leaderboard, and Analysis. The Usage page serves as the primary entry point for reproduction-oriented inspection.

Benchmark Snapshot

Benchmark Statistics

The current release contains 12 scenes, 512 released landmarks, 4,096 image samples, and 2,048 video samples, while the broader asset pool already spans 18 scenarios, 746 valid landmarks, and 166 fine-grained subcategories for future benchmark expansion.

Released Scenes: 12
Released Landmarks: 512
Image Samples: 4,096
Video Samples: 2,048
Source Images: 4,840
Video Duration: 5h 27m 13s
Benchmark Statistics. Released task balance, landmark distribution, and video-side trajectory statistics in the benchmark.
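
As a quick sanity check on the figures above, the per-landmark sample density and a mean clip length can be derived directly from the released counts. The sketch below assumes that the reported video duration is the total over all 2,048 video samples; this page does not state that explicitly, so the mean clip length should be read as an estimate rather than a published statistic.

```python
# Released counts copied from the statistics on this page.
LANDMARKS = 512
IMAGE_SAMPLES = 4096
VIDEO_SAMPLES = 2048

# Reported video duration (5h 27m 13s) converted to seconds.
TOTAL_VIDEO_SECONDS = 5 * 3600 + 27 * 60 + 13

# Average number of QA samples attached to each released landmark.
image_per_landmark = IMAGE_SAMPLES / LANDMARKS
video_per_landmark = VIDEO_SAMPLES / LANDMARKS

# ASSUMPTION: the duration covers exactly the 2,048 video samples,
# with one clip per sample. The page does not confirm this.
mean_clip_seconds = TOTAL_VIDEO_SECONDS / VIDEO_SAMPLES

print(image_per_landmark, video_per_landmark, round(mean_clip_seconds, 2))
# prints: 8.0 4.0 9.59
```

Under that assumption, each released landmark contributes on average 8 image samples and 4 video samples, and a video sample lasts a little under 10 seconds.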

Task Definitions

The released task matrix summarizes the six benchmark tasks, their modality, cognition axis, and structured output interface.

| Task | Description | Modality | Cognition | Output |
| --- | --- | --- | --- | --- |
| Landmark-Relative Position Reasoning | Infer the UAV position relative to the target landmark from a landmark-centric reference view and a current egocentric observation. | image | Self-Aware | option + bbox |
| Future Observation Prediction | Predict which candidate image is the correct future observation after a described orbit action and localize the landmark in that selected image. | image | Self-Aware | option |
| Self-Relative Position Reasoning | Judge where the target landmark lies relative to the UAV's current forward direction and ground it in the same observation. | image | Environment-Aware | option + bbox |
| Landmark-Driven Action Decision | Choose which direction the UAV should move to approach the target landmark and ground the landmark in the current observation. | image | Environment-Aware | option + bbox |
| Flight Behavior Recognition and Temporal Localization | Recognize the UAV's own flight behaviors from first-person video and localize the corresponding temporal intervals. | video | Self-Aware | behavior option(s) + intervals |
| Landmark Visibility Counting and Interval Reasoning | Count landmark appearances in flight video and localize every visible interval of the target landmark. | video | Environment-Aware | count + intervals |
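
The three output interfaces in the task matrix (option with an optional bbox, behavior options with intervals, and count with intervals) can be sketched as simple typed records. The class names, field names, bbox ordering `(x_min, y_min, x_max, y_max)`, and the use of seconds for intervals are all assumptions made for illustration; they are not the released schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

# ASSUMPTION: boxes are normalized (x_min, y_min, x_max, y_max) in [0, 1].
BBox = Tuple[float, float, float, float]
# ASSUMPTION: intervals are (start, end) in seconds from the clip start.
Interval = Tuple[float, float]


@dataclass
class ImageAnswer:
    """Image tasks: a selected option, plus a bbox where the task asks for one."""
    option: str
    bbox: Optional[BBox] = None  # None for the "option"-only interface


@dataclass
class BehaviorAnswer:
    """Flight behavior recognition: behavior option(s) plus their intervals."""
    options: List[str]
    intervals: List[Interval]


@dataclass
class VisibilityAnswer:
    """Landmark visibility: appearance count plus every visible interval."""
    count: int
    intervals: List[Interval]


Answer = Union[ImageAnswer, BehaviorAnswer, VisibilityAnswer]


def _valid_bbox(box: BBox) -> bool:
    x0, y0, x1, y1 = box
    return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0


def _valid_intervals(intervals: List[Interval]) -> bool:
    return all(0.0 <= start < end for start, end in intervals)


def is_well_formed(answer: Answer) -> bool:
    """Check structural validity only; this says nothing about correctness."""
    if isinstance(answer, ImageAnswer):
        return bool(answer.option) and (answer.bbox is None or _valid_bbox(answer.bbox))
    if isinstance(answer, BehaviorAnswer):
        return bool(answer.options) and _valid_intervals(answer.intervals)
    if isinstance(answer, VisibilityAnswer):
        return answer.count >= 0 and _valid_intervals(answer.intervals)
    return False


print(is_well_formed(ImageAnswer("A", (0.1, 0.2, 0.5, 0.6))))  # prints: True
```

Separating structural validation from scoring keeps malformed model outputs from being confused with wrong answers, which matters when each task pairs a semantic choice with spatial or temporal evidence.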

Task Examples

This homepage example browser mirrors the benchmark-page explorer so readers can inspect representative image and video tasks immediately after the task matrix.

Landmark-Relative Position Reasoning

env_7_27_237_self_shared_4way_000149_where

Image 1 shows the Front facade of light gray mid-rise building with red tiled roof and dormer windows in the landmark-centric coordinate frame. Based on that reference, what is your position relative to the landmark in image 2? Select one option and return the normalized bounding box of the landmark in image 2. Options: A. Right B. Back C. Left D. Front

2 images: 1 reference view (Reference Image) and 1 query observation (Query Observation).
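
The example above asks for both an option and a normalized bounding box. One common way to score such a box is intersection over union (IoU) against an annotated box. The sketch below is illustrative only: this page does not state which grounding metric or threshold the benchmark uses, the `(x_min, y_min, x_max, y_max)` ordering is an assumption, and the two boxes are made-up values rather than annotations from this sample.

```python
from typing import Tuple

# ASSUMPTION: normalized (x_min, y_min, x_max, y_max) with values in [0, 1].
BBox = Tuple[float, float, float, float]


def bbox_iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and reference boxes, for illustration only.
predicted = (0.2, 0.2, 0.6, 0.6)
reference = (0.4, 0.4, 0.8, 0.8)
print(round(bbox_iou(predicted, reference), 3))
```

Because the coordinates are normalized, the same function applies regardless of the source image resolution.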

Leaderboard

Current leading models

The current benchmark leaders are shown below as top-3 lists for combined, modality-specific (image, video), and cognition-specific (self-aware, environment-aware) performance. The `Acc` column reports overall accuracy.

Combined

# Model Acc
1 Gemini 3 Flash 48.4%
2 Qwen 3.5-Flash 38.9%
3 Qwen 3.5-35B-A3B 37.9%

Image

# Model Acc
1 Gemini 3 Flash 50.2%
2 GPT 5.3 Chat 47.8%
3 Qwen 3.5-27B 44.3%

Video

# Model Acc
1 Gemini 3 Flash 46.5%
2 Mimo v2 Omni 38.8%
3 InternVL 3.5-38B 37.8%

Self-Aware

# Model Acc
1 Gemini 3 Flash 44.4%
2 GLM 4.6V 30.9%
3 Qwen 3.5-Flash 28.7%

Environment-Aware

# Model Acc
1 Qwen 3.5-35B-A3B 54.9%
2 Gemini 3 Flash 54.2%
3 Qwen 3.5-27B 53.3%

Analysis

Dual-cognition analysis highlights both media sensitivity and cognition imbalance

The accompanying analysis shows that current MLLMs already exhibit partial dual-cognition competence but still fall short of a stable, unified capability. Dual cognition is the benchmark's main target; image and video are the two media used to probe it. Under both media, choosing the correct semantic answer remains easier than producing the spatial or temporal evidence (bounding boxes, time intervals) required to support it, so many seemingly correct decisions are not yet grounded in equally convincing evidence.
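
The gap between semantic success and supporting evidence can be made concrete by scoring each sample twice: once on the answer alone, and once requiring the evidence to pass a threshold as well. The sketch below does this for the video case with a temporal IoU over interval sets. The function names, the 0.5 threshold, and the toy records are assumptions for illustration; they are not the benchmark's official metric or results.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # ASSUMPTION: (start, end) in seconds
Record = Tuple[bool, List[Interval], List[Interval]]  # (answer_correct, predicted, reference)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so lengths are not double counted."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def temporal_iou(pred: List[Interval], ref: List[Interval]) -> float:
    """Covered-time IoU between two sets of intervals."""
    p, r = merge_intervals(pred), merge_intervals(ref)
    inter = sum(
        max(0.0, min(pe, re_) - max(ps, rs))
        for ps, pe in p
        for rs, re_ in r
    )
    total = sum(e - s for s, e in p) + sum(e - s for s, e in r)
    union = total - inter
    return inter / union if union > 0 else 0.0


def semantic_and_grounded_accuracy(
    records: List[Record], iou_threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (answer-only accuracy, accuracy requiring evidence IoU >= threshold)."""
    n = len(records)
    semantic = sum(1 for ok, _, _ in records if ok) / n
    grounded = sum(
        1 for ok, pred, ref in records if ok and temporal_iou(pred, ref) >= iou_threshold
    ) / n
    return semantic, grounded


# Made-up toy records, not benchmark data.
toy = [
    (True, [(0.0, 4.0)], [(0.0, 4.0)]),   # right answer, perfect intervals
    (True, [(0.0, 4.0)], [(2.0, 6.0)]),   # right answer, weak overlap (IoU = 1/3)
    (False, [(1.0, 2.0)], [(1.0, 2.0)]),  # wrong answer, perfect intervals
    (True, [(0.0, 1.0)], [(5.0, 6.0)]),   # right answer, no overlap
]
print(semantic_and_grounded_accuracy(toy))  # prints: (0.75, 0.25)
```

By construction the grounded score can never exceed the answer-only score, so the difference between the two is a direct measure of how many correct answers lack adequate evidence.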

More importantly, the released results show that the two cognition axes do not yet develop in a balanced way. Performance shifts visibly between the image and video settings, and environment-aware reasoning remains stronger overall than self-aware reasoning after aggregation across tasks: in the top-3 lists above, environment-aware accuracy ranges from 53.3% to 54.9%, whereas self-aware accuracy falls from 44.4% for the leader to 28.7% for third place. The benchmark therefore does not only score the two cognition axes separately; it also reveals how closely they co-develop, where they remain disconnected, and what these gaps imply for future MLLM development.
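
The imbalance between the two cognition axes can be read off the leaderboard numbers on this page. The short sketch below uses only the top-3 entries listed under Self-Aware and Environment-Aware, so it is a partial view: the two lists contain different models, and only models that appear in both lists can be compared directly.

```python
# Top-3 overall accuracy (%) copied from the leaderboard on this page.
self_aware = {"Gemini 3 Flash": 44.4, "GLM 4.6V": 30.9, "Qwen 3.5-Flash": 28.7}
env_aware = {"Qwen 3.5-35B-A3B": 54.9, "Gemini 3 Flash": 54.2, "Qwen 3.5-27B": 53.3}

# Mean of the three listed scores on each axis.
mean_self = sum(self_aware.values()) / len(self_aware)
mean_env = sum(env_aware.values()) / len(env_aware)

# Per-model gap, restricted to models present in both top-3 lists.
per_model_gap = {
    name: round(env_aware[name] - self_aware[name], 1)
    for name in self_aware
    if name in env_aware
}

print(round(mean_self, 2), round(mean_env, 2))  # prints: 34.67 54.13
print(per_model_gap)  # prints: {'Gemini 3 Flash': 9.8}
```

Even the model that leads the self-aware list scores about 10 points lower there than on environment-aware tasks, and the spread within the self-aware top 3 is far wider than within the environment-aware top 3.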

About

This is the official website for UAV-DualCog. The corresponding paper is currently under peer review, and this public release follows a single-blind policy.