A dual-cognition benchmark for aerial image and video reasoning with explicit grounding.
Dual-Cognition Formulation
Aerial embodied reasoning is formulated as joint self-understanding and world-understanding
UAV-DualCog is built around a dual-cognition formulation:
self-aware reasoning and environment-aware reasoning
define the benchmark's primary capability axis, while image and video media
act as the observation settings used to test that capability under different evidence
conditions. Across both cognition branches, the benchmark evaluates not only semantic
correctness, but also whether the answer is aligned with
explicit spatial or temporal evidence.
The formulation starts from the observation that a UAV navigating open 3D space must reason
about both itself and the world around it. Self-aware reasoning asks the agent to infer its
own landmark-relative position, predict the viewpoint change induced by motion, and
recognize the flight behavior it is executing over time. Environment-aware reasoning asks
the agent to infer where the target landmark lies relative to the UAV, decide what action
is appropriate under the current spatial situation, and reason about when or how often the
landmark becomes visible during flight.
The top-level formulation therefore separates reasoning about the agent itself
from reasoning about the external scene, instead of collapsing both into
one generic embodied score. Image and video are the two media through which the same dual-cognition
requirement is probed.
Dual-Cognition Formulation. The benchmark centers on self-aware and environment-aware reasoning, with image and video providing the two media through which this dual-cognition capability is evaluated.
Task Definition
Six tasks operationalize the dual-cognition formulation under image and video media
This dual-cognition split is paired with an evidence-aware evaluation design.
Image and video serve as complementary evaluation media: on the image side,
models must pair answer selection with landmark grounding, while on the video side they must
pair semantic recognition or counting with temporal localization.
The image branch contains two self-aware tasks and two
environment-aware tasks. Together, they test landmark-relative self-positioning, future
observation prediction, self-relative target positioning, and landmark-driven action
decision. Each image sample is released as a structured multiple-choice problem, and the
grounding-oriented tasks additionally require a normalized landmark bounding box.
The video branch contains one self-aware task for flight behavior
recognition and one environment-aware task for landmark visibility reasoning. These are
not free-form video description problems: models must return structured behavior
options or visibility counts together with interval predictions, so the benchmark
can measure semantic success and temporal evidence quality separately.
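As a concrete illustration of that separation, the sketch below scores the semantic choice and the temporal evidence independently; the greedy interval matching and the 0.5 IoU threshold are illustrative assumptions, not the benchmark's released protocol.

```python
def interval_iou(pred, gt):
    """IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def score_video_sample(pred_option, gt_option, pred_intervals, gt_intervals, iou_thr=0.5):
    """Score semantics and temporal evidence separately (illustrative protocol).

    Returns (semantic_ok, temporal_ok) so the two qualities can be aggregated
    independently, as the benchmark description requires.
    """
    semantic_ok = pred_option == gt_option
    # Greedy one-to-one matching of predicted intervals to ground-truth intervals.
    matched, used = 0, set()
    for p in pred_intervals:
        best_j, best_iou = None, iou_thr
        for j, g in enumerate(gt_intervals):
            if j in used:
                continue
            iou = interval_iou(p, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            used.add(best_j)
            matched += 1
    # Temporal evidence passes only if every interval is matched one-to-one.
    temporal_ok = matched == len(gt_intervals) == len(pred_intervals)
    return semantic_ok, temporal_ok
```

A model can therefore be right about what behavior occurred while still failing on when it occurred, and the two failures are reported separately.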
| Task | Description | Modality | Cognition | Interface | Output Schema |
| --- | --- | --- | --- | --- | --- |
| Landmark-Relative Position Reasoning | Infer the UAV position relative to the target landmark from a landmark-centric reference view and a current egocentric observation. | image | Self-Aware | Reference image + query observation | option + bbox |
| Future Observation Prediction | Predict which candidate image is the correct future observation after a described orbit action and localize the landmark in that selected image. | image | Self-Aware | Reference image + four future-view candidates | option |
| Self-Relative Position Reasoning | Judge where the target landmark lies relative to the UAV's current forward direction and ground it in the same observation. | image | Environment-Aware | Reference image + query observation | option + bbox |
| Landmark-Driven Action Decision | Choose which direction the UAV should move to approach the target landmark and ground the landmark in the current observation. | image | Environment-Aware | Reference image + query observation | option + bbox |
| Flight Behavior Recognition and Temporal Localization | Recognize the UAV's own flight behaviors from first-person video and localize the corresponding temporal intervals. | video | Self-Aware | Flight video | behavior option(s) + intervals |
| Landmark Visibility Counting and Interval Reasoning | Count landmark appearances in flight video and localize every visible interval of the target landmark. | video | Environment-Aware | Flight video + reference image + landmark description | count + intervals |
Task Definition. The released benchmark instantiates six tasks around the
dual-cognition split, with image and video acting as the two evaluation media through which
those capabilities are measured. Each task exposes a fixed input contract and a task-specific
structured output.
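For the counting-and-intervals output in particular, one natural well-formedness check (an assumed contract for illustration, not the released parsing rule) is that the predicted count agrees with the predicted visible intervals:

```python
def visibility_prediction_consistent(count: int, intervals: list[list[float]]) -> bool:
    """Check that a count + intervals prediction is internally consistent.

    Assumed contract: the count equals the number of intervals, and the
    intervals are valid ([start < end]) and sorted without overlap.
    """
    if count != len(intervals):
        return False
    prev_end = float("-inf")
    for start, end in intervals:
        if start >= end or start < prev_end:  # malformed, overlapping, or unsorted
            return False
        prev_end = end
    return True

print(visibility_prediction_consistent(2, [[3.0, 7.5], [12.0, 18.2]]))  # → True
```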
The benchmark defines complete prompts, constrained answer formats, and task-specific
parsing rules for all released tasks. The detailed prompt templates, JSON-style response
contracts, and aggregation protocol are collected on the Evaluation page.
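As a flavor of what such a contract looks like, the sketch below validates one hypothetical JSON-style response for a grounding task; the `option` and `bbox` field names are assumptions for illustration, and the authoritative templates remain on the Evaluation page.

```python
import json

def parse_grounded_answer(raw: str):
    """Parse a JSON-style response carrying an option letter and a normalized bbox.

    The `option`/`bbox` field names are hypothetical; the released contract
    is defined by the benchmark's own templates.
    """
    data = json.loads(raw)
    option = str(data["option"]).strip().upper()
    if option not in {"A", "B", "C", "D"}:
        raise ValueError(f"invalid option: {option!r}")
    x1, y1, x2, y2 = (float(v) for v in data["bbox"])
    # Normalized [x1, y1, x2, y2]: coordinates in [0, 1] and properly ordered.
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"bbox is not a normalized [x1, y1, x2, y2]: {data['bbox']}")
    return option, (x1, y1, x2, y2)
```

Constrained parsing of this kind is what lets the benchmark score option correctness and grounding quality as separate quantities.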
Examples are pulled from the current release split and keep the same media, prompt, answer, and prediction structure used in the benchmark.
Landmark-Relative Position Reasoning
env_7_27_237_self_shared_4way_000149_where
Image 1 shows the Front facade of light gray mid-rise building with red tiled roof and dormer windows in the landmark-centric coordinate frame. Based on that reference, what is your position relative to the landmark in image 2? Select one option and return the normalized bounding box of the landmark in image 2.
Options:
A. Right
B. Back
C. Left
D. Front
2 images: 1 reference view and 1 query observation.
Reference Image
Query Observation
Ground Truth
Option: A (Right)
BBox: [0.398, 0.384, 0.562, 0.674]
| Model | Option | BBox Prediction | IoU | Latency |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | C ✕ | [0.450, 0.350, 0.720, 0.650] ✕ | 0.302 | 5604 ms |
| GPT 5.3 Chat | B ✕ | [0.250, 0.450, 0.500, 0.800] ✕ | 0.203 | 3345 ms |
| Gemini 3 Flash | D ✕ | [1.000, 1.000, 1.000, 1.000] ✕ | 0.000 | 3696 ms |
| Qwen 3.6-Plus | B ✕ | [0.580, 0.290, 0.840, 0.460] ✕ | 0.000 | 1254 ms |
| Kimi K2.5 | B ✕ | [0.172, 0.556, 0.562, 0.994] ✕ | 0.097 | 2608 ms |
| GLM 4.6V | A ✓ | [1.000, 1.000, 1.000, 1.000] ✕ | 0.000 | 16820 ms |
| Mimo v2 Omni | B ✕ | [0.530, 0.000, 0.950, 0.450] ✕ | 0.009 | 3990 ms |
| InternVL 3.5-30B-A3B | C ✕ | [0.420, 0.450, 0.580, 0.550] ✕ | 0.288 | 48156 ms |
| SenseNova-SI-1.2 | – ✕ | – ✕ | 0.000 | – |
| VST-7B-RL | C ✕ | [0.110, 0.200, 0.300, 0.400] ✕ | 0.000 | 593 ms |
| SpaceOm | C ✕ | [0.600, 0.300, 0.800, 0.500] ✕ | 0.000 | 653 ms |
| ViLaSR | D ✕ | [1.000, 1.000, 1.000, 1.000] ✕ | 0.000 | 1223 ms |
Prediction Summary. In this qualitative example, 1 of 12 models selects the correct option, 0 of 12 produce an accepted grounding, and 0 of 12 achieve a dual-cognition pass, pointing to a coupling bottleneck between recognition and evidence validation. GLM 4.6V is the only model to choose the correct option, while models such as Claude Sonnet 4.6 and GPT 5.3 Chat miss both the option and the grounding.
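The IoU values in the example follow the standard intersection-over-union on normalized [x1, y1, x2, y2] boxes. The helper below is a generic reimplementation rather than the benchmark's released scorer, and it reproduces the 0.302 reported for Claude Sonnet 4.6 against the ground-truth box:

```python
def bbox_iou(a, b):
    """IoU of two normalized [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

gt = [0.398, 0.384, 0.562, 0.674]           # ground-truth landmark box
pred = [0.450, 0.350, 0.720, 0.650]         # Claude Sonnet 4.6 prediction above
print(round(bbox_iou(gt, pred), 3))         # → 0.302
```

A degenerate box such as [1.000, 1.000, 1.000, 1.000] has zero area, which is why several models score IoU = 0.000.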
Benchmark Statistics
Overall Statistics
The released benchmark split is carved out from a larger reviewed scene pool. The released statistics below describe the benchmark-facing subset, while the figure retains the paper-level overview of released scenes, landmarks, and task counts.
The released benchmark is intentionally balanced around the
two dual-cognition branches, and it exposes those branches under both
image and video media. The statistics in this section therefore move from
global scale to concrete dataset structure: task-count balance first, then scene coverage,
landmark taxonomy, and finally the image and video branches with their own acquisition and
generation characteristics.
Read together, these summaries show that UAV-DualCog is not only a collection of samples,
but a structured benchmark release built from reviewed scenes, multiview landmarks,
behavior-driven missions, and evidence-aware task interfaces. The cards below summarize
the released scale, while the figure and the following subsections unpack how that scale
is distributed across scenes, categories, task families, and media-specific design
choices.
12 Released Scenes · 512 Released Landmarks · 4096 Image Samples · 2048 Video Samples · 4840 Source Images · 5h 27m 13s Video Duration
Released task balance, landmark distribution, and video-side trajectory statistics in the benchmark.
Task Quantity Distribution
Task balance is enforced explicitly at release time rather than inferred after evaluation.
The released benchmark contains six task families with matched counts, so both cognition
axes and both media branches remain legible at the table level.
In the current release, each task contributes 1,024 samples, yielding 3,072
self-aware rows and 3,072 environment-aware rows overall. This makes
the benchmark easy to read quantitatively before any leaderboard aggregation is introduced.
The table makes that symmetry explicit: the four image tasks evenly cover the two cognition
axes, and the two video tasks mirror that same split at the temporal level rather than
introducing a modality-specific imbalance.
| Cognition | Task Type | Released Samples |
| --- | --- | --- |
| Self-Aware | Landmark-Relative Position Reasoning | 1024 |
| Self-Aware | Future Observation Prediction | 1024 |
| Environment-Aware | Self-Relative Position Reasoning | 1024 |
| Environment-Aware | Landmark-Driven Action Decision | 1024 |
| Self-Aware | Flight Behavior Recognition and Temporal Localization | 1024 |
| Environment-Aware | Landmark Visibility Counting and Interval Reasoning | 1024 |
Task Quantity Distribution. Each released task contributes 1,024 samples,
yielding balanced task counts across the two cognition axes and the six task families.
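That balance can be verified directly from the table by aggregating per-task counts along the cognition axis:

```python
# Released task counts from the table above.
released = {
    "Landmark-Relative Position Reasoning": ("Self-Aware", 1024),
    "Future Observation Prediction": ("Self-Aware", 1024),
    "Self-Relative Position Reasoning": ("Environment-Aware", 1024),
    "Landmark-Driven Action Decision": ("Environment-Aware", 1024),
    "Flight Behavior Recognition and Temporal Localization": ("Self-Aware", 1024),
    "Landmark Visibility Counting and Interval Reasoning": ("Environment-Aware", 1024),
}

totals = {}
for axis, n in released.values():
    totals[axis] = totals.get(axis, 0) + n

print(totals)  # → {'Self-Aware': 3072, 'Environment-Aware': 3072}
```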
Scene Coverage
Released scenes are first constructed as fused Stage 1 environments with aligned geometry,
poses, and scene-level observations, and only then filtered into the benchmark-facing split.
The current release keeps 12 test scenes active in the benchmark, rather
than collapsing task construction around a single environment.
The scene library below shows two benchmark-facing views per released test scene whenever
public thumbnails are available, while the chart summarizes how landmarks, image samples,
and video samples remain distributed across scenes. Even the largest released scene, ENV 16, contributes only 126 landmarks, so coverage remains broad rather than scene-dominated.
The scene-level bars show a deliberately spread release rather than a single dominant hub.
ENV 16 is the largest contributor, followed by scenes such as ENV 7 and ENV 20, but
smaller scenes like ENV 8, ENV 13, and ENV 17 are still preserved in the benchmark.
As a result, the split spans dense urban layouts, industrial zones, waterfront areas, and
smaller environments at the same time.
Scene Library. Two representative benchmark-facing snapshots are shown for
each released test scene with available public thumbnails.
The scene-level distribution remains broad rather than collapsing around a single city
block. Landmarks, image tasks, and video tasks all remain visible at the scene level in
the released split.
Scene Coverage. Every selected scene contributes landmarks together with released image and video tasks, so the released split remains spatially distributed rather than scene-collapsed.
Landmark Category
Released landmarks come from the Stage 2 review pipeline, where raw scene candidates are
filtered, assigned coarse and fine-grained semantic labels, and linked to multiview RGB
evidence before split selection. The released benchmark keeps 512
reviewed landmarks across eight coarse categories and 166 fine-grained subcategories.
The tables below summarize the dominant categories in the released split, and the landmark
library grounds those labels back into the same multiview evidence used by the image-task
branch. Building is currently the largest coarse category (34.4%), while Mid Rise Building is the single most common fine-grained subtype (15.6%).
The category distribution is broad but not flat. Buildings remain the largest coarse class,
and vegetation is also strongly represented through frequently occurring tree landmarks,
while public facilities and industrial infrastructure together account for a substantial
fraction of the release. At the fine-grained level, the long tail is preserved: mid-rise
buildings and deciduous trees dominate, but street furniture, signs, cranes, benches, and
other urban objects remain visible as benchmark targets.
Landmark Category Distribution. The released split preserves all eight
coarse categories while remaining long-tailed at the fine-grained subcategory level.
The landmark library below keeps the full eight side-view orbit for representative released
landmarks, so the category taxonomy is grounded in the same multiview evidence used by the
image-task pipeline.
Mid Rise Building
Building · Mid Rise Building · ENV 7
light gray mid-rise building with red tiled roof and dormer windows
Eight orbit views: Front, Front Right, Right, Back Right, Back, Back Left, Left, Front Left (two of the views are marked Occluded / Invalid for this landmark).
Landmark Library. Each showcased landmark preserves the full eight
landmark-centric side views used by the released image-task branch.
Image Tasks
Image-task construction begins with reviewed Stage 2 landmarks, preserves
their valid landmark-centric orbit views, and then adds
task-specific egocentric captures such as the query observations used by
environment-aware tasks. The image branch is therefore built from released
scene assets and released landmark views rather than from detached benchmark-only
screenshots.
In the current release, the image branch contains 4096 samples backed by
4840 source images, including 3816 valid landmark views and
1024 extra task-specific observation captures. Landmark
descriptions are kept concise while remaining discriminative, averaging 10.6 words across the released split.
Detailed rendering and prompt-constrained generation are described in the construction
page.
The two histograms show that image-task supervision remains visually and
linguistically well-conditioned. Most released landmarks retain
rich multiview support, with the largest mass concentrated at eight valid
side views, and the remaining landmarks still clustering around five to seven views.
Description lengths are similarly concentrated: the released split peaks around 9 to 11
words, so descriptions stay compact while still carrying enough semantic detail to
distinguish nearby landmarks.
4096 Image Samples (released image QA samples across four task families)
3816 Valid Landmark Views (all valid RGB views attached to released landmarks)
1024 Extra Observation Images (task-specific egocentric captures added during Stage 4 generation)
Mean Description Length: 10.6 words (average landmark-description length in the released split)
Image-Branch Summary. The released image branch comprises 4096
samples, 3816 valid landmark views, and 1024
task-specific egocentric observations.
The histograms below make two properties explicit: most released landmarks preserve a high
number of valid viewpoints, and landmark descriptions are intentionally
concise rather than paragraph-length annotations.
Valid Side-View Count Distribution. Distribution of valid landmark side-view counts in the released split; top views are excluded so the maximum remains eight orbit views.
Description Length Distribution. Distribution of released landmark-description lengths measured in words.
Video Tasks
Video-task construction starts from reviewed landmarks and a predefined
Stage 3 behavior library. In this library,
atomic behaviors are the reusable low-level flight primitives, such as
approach, orbit, rise, or mapping motions, each instantiated with controlled default
parameter ranges. Composite behaviors are higher-level inspection patterns
built by composing multiple atomic maneuvers into longer and more operationally meaningful
UAV routines.
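One way to sketch this two-layer library in code is shown below; the class names, field names, and numeric parameter ranges are illustrative assumptions rather than the released Stage 3 schema.

```python
from dataclasses import dataclass, field

@dataclass
class AtomicBehavior:
    """Reusable low-level flight primitive with default parameter ranges."""
    name: str  # e.g. "circular_orbit"
    param_ranges: dict[str, tuple[float, float]] = field(default_factory=dict)

@dataclass
class CompositeBehavior:
    """Higher-level inspection pattern composed of atomic maneuvers."""
    name: str
    steps: list[AtomicBehavior]

    def total_steps(self) -> int:
        return len(self.steps)

# Hypothetical instantiation: an inspection routine that approaches a
# landmark, orbits it, and then rises away (all values are made up).
approach = AtomicBehavior("gradual_approach", {"speed_mps": (2.0, 6.0)})
orbit = AtomicBehavior("circular_orbit", {"radius_m": (20.0, 80.0)})
rise = AtomicBehavior("sky_rise", {"delta_alt_m": (10.0, 40.0)})
inspection = CompositeBehavior("landmark_inspection", [approach, orbit, rise])
print(inspection.total_steps())  # → 3
```

Composites stay interpretable because every step points back to a named atomic primitive, mirroring how the benchmark anchors video predictions in explicit flight-mode semantics.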
The released video branch therefore does not treat flight videos as unconstrained motion
clips. Instead, it anchors both
Flight Behavior Recognition and Temporal Localization and
Landmark Visibility Counting and Interval Reasoning in the same
hierarchical behavior definition, so that benchmark predictions can be
interpreted against explicit flight-mode semantics rather than only against raw video
appearance.
Behavior hierarchy figure: composite relations over the atomic behavior classes Gradual Approach, Gradual Depart, Circular Orbit, Figure-Eight Orbit, Spiral Orbit, Square Orbit, Triangular Orbit, Surface Mapping, Comet Trajectory, and Sky Rise.
Behavior Hierarchy. The released video branch is grounded in a two-layer
behavior library where reusable atomic maneuvers are composed into higher-level inspection
classes.
The released video branch contains 2048 samples with 5h 27m 13s
of total released footage, built from 5 composite
templates and 10 atomic behavior classes. The
distributions below summarize how those missions are instantiated and populated in the
benchmark, while the detailed behavior library and mission-generation chain are documented
on the Construction page.
The statistics below expose three complementary properties of the video branch. First, the
behavior-mode counts are broadly even across the five composite inspection families, while
the atomic library remains distributed across ten lower-level maneuvers rather than being
dominated by only one or two motion primitives. Second, composite missions are physically
longer than atomic ones, with most composite trajectories concentrated between 200 and 600
meters, whereas atomic trajectories are concentrated below 400 meters and especially in the
0 to 200 meter range. Third, the video-duration distribution mirrors that pattern: most
atomic missions fall below 20 seconds, while most composite missions cluster between 10 and
30 seconds, with only a small long tail extending beyond 40 seconds.
The following distributions summarize how the released video branch is populated: the first
chart counts released missions per behavior mode, and the latter two compare atomic and
composite trajectories in terms of physical path length and video duration. In both
comparisons, atomic missions dominate the shortest bins, while composite missions shift toward
the mid-range bins that correspond to longer inspection routes and longer observation
windows. The underlying behavior hierarchy and default mission templates now live in the
Stage 3 construction section so that the benchmark page can stay focused on released
statistics and task-facing interpretation.
Behavior Mode Distribution. Released mission counts across the Stage 3 behavior library.
Trajectory Length Distribution. Each bin stacks released atomic and composite trajectories after recomputing waypoint-path lengths from mission artifacts.
Video Duration Distribution. Duration is measured from the released mission videos and shown separately for atomic and composite missions.