Home | UAV-DualCog

Spatio-temporal Reasoning Benchmark

A benchmark for aerial multiview spatiotemporal reasoning that explicitly evaluates self-state cognition and environment-state cognition under both image and video settings.


Overview

Why UAV dual cognition is necessary

Multimodal large language models have made strong progress on general image understanding, video understanding, and visual question answering, yet systematic evaluation remains limited for UAV embodied intelligence. Unlike a ground-view agent, a UAV moves continuously in open 3D space and must reason not only about external targets, obstacles, and reachable directions, but also about its own position, current flight behavior, and future viewpoint changes. We therefore treat self-aware reasoning and environment-aware reasoning as two complementary cognitive prerequisites for aerial embodied intelligence, rather than as isolated downstream tasks.

UAV-DualCog is designed around that dual-cognition view. At the capability level, it organizes the benchmark into self-aware and environment-aware task lines; at the observation level, it spans both image and video settings. The benchmark is supported by a fully automated toolchain that starts from semantic point clouds and landmark assets, then builds multi-view image QA, hierarchical flight-behavior video QA, and unified experiment outputs. In the current stable release, UAV-DualCog covers 12 AirSim scenes, 512 released landmarks, 4,096 image QA samples, and 2,048 video QA samples. The results expose clear gaps between semantic answering, spatial grounding, behavior recognition, and temporal localization.

Reviewer note: for the clearest reading path, please start from this homepage overview, then proceed in order through Benchmark, Construction, Evaluation, Leaderboard, and Analysis. The Usage page serves as the primary entry point for reproduction-oriented inspection.

Benchmark: Task definition, prompt design, examples, statistics, landmark assets, and flight behavior library.
Construction: Stage-by-stage pipeline, internal substeps, intermediate artifacts, and toolchain logic.
Evaluation: Metric design, evaluation protocol, prompt templates, and experiment setup.
Leaderboard: Image-task, video-task, and capability-wise result tables with ranking cues.
Analysis: Capability gaps, cross-modality trends, representative models, and benchmark insights.
Usage: Download channels, codebase entry points, benchmark commands, and reproduction guidance.

Benchmark Snapshot

Benchmark Statistics

The current release contains 12 scenes, 512 released landmarks, 4,096 image samples, and 2,048 video samples, while the broader asset pool already spans 18 scenarios, 746 valid landmarks, and 166 fine-grained subcategories for future benchmark expansion.

Statistic	Value
Released Scenes	12
Released Landmarks	512
Image Samples	4,096
Video Samples	2,048
Source Images	4,840
Video Duration	5h 27m 13s


Figure: Benchmark Statistics. Released task balance, landmark distribution, and video-side trajectory statistics in the benchmark.

Task Definitions

The released task matrix summarizes the six benchmark tasks, their modality, cognition axis, and structured output interface.

Task	Description	Modality	Cognition	Output
Landmark-Relative Position Reasoning	Infer the UAV position relative to the target landmark from a landmark-centric reference view and a current egocentric observation.	image	Self-Aware	option + bbox
Future Observation Prediction	Predict which candidate image is the correct future observation after a described orbit action and localize the landmark in that selected image.	image	Self-Aware	option
Self-Relative Position Reasoning	Judge where the target landmark lies relative to the UAV's current forward direction and ground it in the same observation.	image	Environment-Aware	option + bbox
Landmark-Driven Action Decision	Choose which direction the UAV should move to approach the target landmark and ground the landmark in the current observation.	image	Environment-Aware	option + bbox
Flight Behavior Recognition and Temporal Localization	Recognize the UAV's own flight behaviors from first-person video and localize the corresponding temporal intervals.	video	Self-Aware	behavior option(s) + intervals
Landmark Visibility Counting and Interval Reasoning	Count landmark appearances in flight video and localize every visible interval of the target landmark.	video	Environment-Aware	count + intervals
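
The task matrix above can be mirrored as a small lookup structure, which makes it easy to slice tasks by modality or cognition axis. The following is only an illustrative sketch: the names `TaskSpec`, `TASKS`, and `select_tasks` are hypothetical and are not taken from the UAV-DualCog codebase; only the field values come from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaskSpec:
    """One row of the task matrix (hypothetical container, not the official schema)."""
    name: str
    modality: str   # "image" or "video"
    cognition: str  # "Self-Aware" or "Environment-Aware"
    output: str     # structured output interface as listed in the table


# Values copied from the task matrix on this page.
TASKS: List[TaskSpec] = [
    TaskSpec("Landmark-Relative Position Reasoning", "image", "Self-Aware", "option + bbox"),
    TaskSpec("Future Observation Prediction", "image", "Self-Aware", "option"),
    TaskSpec("Self-Relative Position Reasoning", "image", "Environment-Aware", "option + bbox"),
    TaskSpec("Landmark-Driven Action Decision", "image", "Environment-Aware", "option + bbox"),
    TaskSpec("Flight Behavior Recognition and Temporal Localization", "video", "Self-Aware",
             "behavior option(s) + intervals"),
    TaskSpec("Landmark Visibility Counting and Interval Reasoning", "video", "Environment-Aware",
             "count + intervals"),
]


def select_tasks(modality: Optional[str] = None, cognition: Optional[str] = None) -> List[TaskSpec]:
    """Return the tasks matching the given modality and/or cognition axis."""
    return [
        t for t in TASKS
        if (modality is None or t.modality == modality)
        and (cognition is None or t.cognition == cognition)
    ]
```

For example, `select_tasks(modality="video", cognition="Self-Aware")` returns only the flight behavior task, which matches the single self-aware video row in the table.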


Task Examples

This homepage example browser mirrors the benchmark-page explorer so readers can inspect representative image and video tasks immediately after the task matrix.

Landmark-Relative Position Reasoning

env_7_27_237_self_shared_4way_000149_where

Image 1 shows the Front facade of light gray mid-rise building with red tiled roof and dormer windows in the landmark-centric coordinate frame. Based on that reference, what is your position relative to the landmark in image 2? Select one option and return the normalized bounding box of the landmark in image 2. Options: A. Right B. Back C. Left D. Front

Inputs: 2 images, consisting of 1 reference view (Reference Image) and 1 query observation (Query Observation).

Ground Truth

Option: A (Right)

BBox: [0.398, 0.384, 0.562, 0.674]

Model	Option	BBox	IoU	Latency (ms)
Claude Sonnet 4.6	C ✕	[0.450, 0.350, 0.720, 0.650] ✕	0.302	5604
GPT 5.3 Chat	B ✕	[0.250, 0.450, 0.500, 0.800] ✕	0.203	3345
Gemini 3 Flash	D ✕	[1.000, 1.000, 1.000, 1.000] ✕	0.000	3696
Qwen 3.6-Plus	B ✕	[0.580, 0.290, 0.840, 0.460] ✕	0.000	1254
Kimi K2.5	B ✕	[0.172, 0.556, 0.562, 0.994] ✕	0.097	2608
GLM 4.6V	A ✓	[1.000, 1.000, 1.000, 1.000] ✕	0.000	16820
Mimo v2 Omni	B ✕	[0.530, 0.000, 0.950, 0.450] ✕	0.009	3990
InternVL 3.5-30B-A3B	C ✕	[0.420, 0.450, 0.580, 0.550] ✕	0.288	48156
SenseNova-SI-1.2	- ✕	- ✕	0.000	-
VST-7B-RL	C ✕	[0.110, 0.200, 0.300, 0.400] ✕	0.000	593
SpaceOm	C ✕	[0.600, 0.300, 0.800, 0.500] ✕	0.000	653
ViLaSR	D ✕	[1.000, 1.000, 1.000, 1.000] ✕	0.000	1223

Prediction Summary. In this qualitative example, 1 of 12 models selects the correct option, 0 of 12 ground the landmark correctly, and 0 of 12 pass the dual-cognition check. Even the one model that answers correctly (GLM 4.6V) returns a bounding box with zero overlap, which points to a coupling bottleneck between recognition and evidence validation. GLM 4.6V shows the only relative strength here, while the weaker runs include Claude Sonnet 4.6 and GPT 5.3 Chat.
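
The per-model IoU values in this example can be reproduced approximately from the listed boxes. The sketch below assumes that boxes are normalized `[x_min, y_min, x_max, y_max]` (an interpretation consistent with the reported numbers, not a documented format) and that a dual-cognition pass requires both a correct option and an IoU at or above a threshold. The threshold value of 0.5 and the names `box_iou` and `dual_cognition_pass` are assumptions for illustration; this page does not state the official criterion.

```python
from typing import Optional, Sequence

Box = Sequence[float]  # assumed layout: [x_min, y_min, x_max, y_max], normalized to [0, 1]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes; zero-area cases yield 0.0."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dual_cognition_pass(
    pred_option: Optional[str],
    pred_box: Optional[Box],
    gt_option: str,
    gt_box: Box,
    iou_threshold: float = 0.5,  # assumed value; the official threshold is not given here
) -> bool:
    """True only if the option matches AND the box overlaps the ground truth enough."""
    option_ok = pred_option == gt_option
    box_ok = pred_box is not None and box_iou(pred_box, gt_box) >= iou_threshold
    return option_ok and box_ok


# Ground truth and two predictions copied from the example above.
GT_BOX = [0.398, 0.384, 0.562, 0.674]
claude_iou = box_iou([0.450, 0.350, 0.720, 0.650], GT_BOX)
glm_pass = dual_cognition_pass("A", [1.000, 1.000, 1.000, 1.000], "A", GT_BOX)
```

Under these assumptions, `claude_iou` comes out close to the reported 0.302, and `glm_pass` is false because the degenerate box has no overlap even though the option is correct. Small differences from the listed IoU values are possible, since the boxes shown on the page are rounded to three decimals.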

Leaderboard

Current leading models

The current benchmark leaders are shown below as Top-3 views for combined, modality-specific, and dual-cognition-dimension performance. The Acc column reports overall accuracy.

Combined

#	Model	Acc
1	Gemini 3 Flash	48.4%
2	Qwen 3.5-Flash	38.9%
3	Qwen 3.5-35B-A3B	37.9%

Image

#	Model	Acc
1	Gemini 3 Flash	50.2%
2	GPT 5.3 Chat	47.8%
3	Qwen 3.5-27B	44.3%

Video

#	Model	Acc
1	Gemini 3 Flash	46.5%
2	Mimo v2 Omni	38.8%
3	InternVL 3.5-38B	37.8%

Self-Aware

#	Model	Acc
1	Gemini 3 Flash	44.4%
2	GLM 4.6V	30.9%
3	Qwen 3.5-Flash	28.7%

Environment-Aware

#	Model	Acc
1	Qwen 3.5-35B-A3B	54.9%
2	Gemini 3 Flash	54.2%
3	Qwen 3.5-27B	53.3%


Analysis

Dual-cognition analysis highlights both media sensitivity and cognition imbalance

The accompanying analysis shows that current MLLMs already exhibit partial dual-cognition competence, but they still fall short of a stable and unified capability. Dual cognition is the benchmark's main target, while image and video act as the two media used to probe it. Under both media, semantic success remains easier than the spatial or temporal evidence required to support it, which means that many seemingly correct decisions are not yet grounded with equally convincing evidence.

More importantly, the released results show that dual cognition does not yet develop in a balanced way. Performance shifts visibly between the two media settings, and environment-aware reasoning remains stronger overall than self-aware reasoning after aggregation across tasks. This is where the benchmark is especially valuable: it not only scores the two cognition axes separately, but also reveals how closely they co-develop, where they remain disconnected, and what these gaps imply for future MLLM development.


About

This is the official website for UAV-DualCog. The corresponding paper is currently under peer review, and this public release follows a single-blind policy.