Benchmark

UAV-DualCog

A dual-cognition benchmark for aerial image and video reasoning with explicit grounding.

Dual-Cognition Formulation

Aerial embodied reasoning is formulated as joint self-understanding and world-understanding

UAV-DualCog is built around a dual-cognition formulation: self-aware reasoning and environment-aware reasoning define the benchmark's primary capability axis, while image and video media act as the observation settings used to test that capability under different evidence conditions. Across both cognition branches, the benchmark evaluates not only semantic correctness, but also whether the answer is aligned with explicit spatial or temporal evidence.

The formulation starts from the observation that a UAV navigating open 3D space must reason about both itself and the world around it. Self-aware reasoning asks the agent to infer its own landmark-relative position, predict the viewpoint change induced by motion, and recognize the flight behavior it is executing over time. Environment-aware reasoning asks the agent to infer where the target landmark lies relative to the UAV, decide what action is appropriate under the current spatial situation, and reason about when or how often the landmark becomes visible during flight.

The top-level formulation therefore separates reasoning about the agent itself from reasoning about the external scene, instead of collapsing both into one generic embodied score. Image and video are the two media through which the same dual-cognition requirement is probed.

Dual-Cognition Formulation. The benchmark centers on self-aware and environment-aware reasoning, with image and video providing the two media through which this dual-cognition capability is evaluated.

Task Definition

Six tasks operationalize the dual-cognition formulation under image and video media

This dual-cognition split is paired with an evidence-aware evaluation design. Image and video serve as complementary evaluation media: on the image side, models must pair answer selection with landmark grounding, while on the video side they must pair semantic recognition or counting with temporal localization.

The image branch contains two self-aware tasks and two environment-aware tasks. Together, they test landmark-relative self-positioning, future observation prediction, self-relative target positioning, and landmark-driven action decision. Each image sample is released as a structured multiple-choice problem, and the grounding-oriented tasks additionally require a normalized landmark bounding box.

The video branch contains one self-aware task for flight behavior recognition and one environment-aware task for landmark visibility reasoning. These are not free-form video description problems: models must return structured behavior options or visibility counts together with interval predictions, so the benchmark can measure semantic success and temporal evidence quality separately.
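Because each task reports semantic success and evidence quality as separate axes, a minimal scoring sketch helps make the split concrete. The snippet below is illustrative only, assuming normalized corner-format boxes and second-based intervals; the field names, thresholds, and interval-matching rule are placeholders rather than the released protocol.

```python
# Illustrative scoring sketch, not the released protocol: semantic
# correctness and evidence quality are computed as separate axes.
# Field names, thresholds, and the matching rule are assumptions.

def box_iou(a, b):
    """IoU of two normalized boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def temporal_iou(p, g):
    """IoU of two (start_s, end_s) intervals."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def score_image_sample(pred, gold, iou_thr=0.5):
    semantic = pred["option"] == gold["option"]
    grounded = box_iou(pred["bbox"], gold["bbox"]) >= iou_thr
    return semantic, grounded  # reported separately, never collapsed

def score_video_sample(pred, gold, tiou_thr=0.5):
    semantic = pred["answer"] == gold["answer"]  # behavior option or count
    localized = all(
        any(temporal_iou(p, g) >= tiou_thr for p in pred["intervals"])
        for g in gold["intervals"]
    )
    return semantic, localized
```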

Landmark-Relative Position Reasoning (image, Self-Aware)
Infer the UAV position relative to the target landmark from a landmark-centric reference view and a current egocentric observation.
Interface: reference image + query observation. Output: option + bbox.

Future Observation Prediction (image, Self-Aware)
Predict which candidate image is the correct future observation after a described orbit action.
Interface: reference image + four future-view candidates. Output: option.

Self-Relative Position Reasoning (image, Environment-Aware)
Judge where the target landmark lies relative to the UAV's current forward direction and ground it in the same observation.
Interface: reference image + query observation. Output: option + bbox.

Landmark-Driven Action Decision (image, Environment-Aware)
Choose which direction the UAV should move to approach the target landmark and ground the landmark in the current observation.
Interface: reference image + query observation. Output: option + bbox.

Flight Behavior Recognition and Temporal Localization (video, Self-Aware)
Recognize the UAV's own flight behaviors from first-person video and localize the corresponding temporal intervals.
Interface: flight video. Output: behavior option(s) + intervals.

Landmark Visibility Counting and Interval Reasoning (video, Environment-Aware)
Count landmark appearances in flight video and localize every visible interval of the target landmark.
Interface: flight video + reference image + landmark description. Output: count + intervals.

Task Definition. The released benchmark instantiates six tasks around the dual-cognition split, with image and video acting as the two evaluation media through which those capabilities are measured. Each task exposes a fixed input contract and a task-specific structured output.

The benchmark defines complete prompts, constrained answer formats, and task-specific parsing rules for all released tasks. The detailed prompt templates, JSON-style response contracts, and aggregation protocol are collected on the Evaluation page.
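As a rough illustration of what such a JSON-style contract can look like, the sketch below pairs constrained response shapes with strict parsing. The actual templates and field names are defined on the Evaluation page, so everything here is an assumption.

```python
# Plausible shapes for the JSON-style response contracts mentioned
# above; the real templates live on the Evaluation page, so every
# field name here is an assumption.
import json

image_grounding_response = json.dumps({
    "option": "D",                              # one of A/B/C/D
    "bbox": [0.41, 0.22, 0.68, 0.79],           # normalized x1, y1, x2, y2
})

video_visibility_response = json.dumps({
    "count": 2,                                 # visible appearances
    "intervals": [[3.2, 7.8], [14.0, 19.5]],    # seconds
})

def parse_strict(raw, required_keys):
    """Reject malformed model output instead of guessing at intent."""
    obj = json.loads(raw)
    missing = [k for k in required_keys if k not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return obj

parse_strict(image_grounding_response, ["option", "bbox"])
```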

Task Examples

Released Benchmark Examples

Examples are pulled from the current release split and keep the same media, prompt, answer, and prediction structure used in the benchmark.

Landmark-Relative Position Reasoning

env_7_27_237_self_shared_4way_000149_where

Image 1 shows the Front facade of light gray mid-rise building with red tiled roof and dormer windows in the landmark-centric coordinate frame. Based on that reference, what is your position relative to the landmark in image 2? Select one option and return the normalized bounding box of the landmark in image 2. Options: A. Right B. Back C. Left D. Front

2 images: 1 reference view and 1 query observation.

Reference Image

Query Observation
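The prompt above asks for a normalized bounding box of the landmark in image 2. A minimal sketch of that normalization, assuming pixel-space corner coordinates divided by image width and height (the released convention is specified on the Evaluation page):

```python
# Hypothetical normalization for the bbox requested in the prompt,
# assuming pixel-space corners divided by image width and height; the
# released convention is defined on the Evaluation page.
def normalize_bbox(x1, y1, x2, y2, width, height):
    return [x1 / width, y1 / height, x2 / width, y2 / height]

# e.g. for a 1920x1080 query observation:
print(normalize_bbox(768, 240, 1310, 860, 1920, 1080))
# -> [0.4, 0.222..., 0.682..., 0.796...]
```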

Benchmark Statistics

Overall Statistics

The released benchmark split is carved out from a larger reviewed scene pool. The released statistics below describe the benchmark-facing subset, while the figure retains the paper-level overview of released scenes, landmarks, and task counts.

The released benchmark is intentionally balanced around the two dual-cognition branches, and it exposes those branches under both image and video media. The statistics in this section therefore move from global scale to concrete dataset structure: task-count balance first, then scene coverage, landmark taxonomy, and finally the image and video branches with their own acquisition and generation characteristics.

Read together, these summaries show that UAV-DualCog is not only a collection of samples, but a structured benchmark release built from reviewed scenes, multiview landmarks, behavior-driven missions, and evidence-aware task interfaces. The cards below summarize the released scale, while the figure and the following subsections unpack how that scale is distributed across scenes, categories, task families, and media-specific design choices.

Released Scenes: 12
Released Landmarks: 512
Image Samples: 4096
Video Samples: 2048
Source Images: 4840
Video Duration: 5h 27m 13s
Released task balance, landmark distribution, and video-side trajectory statistics in the benchmark.

Task Quantity Distribution

Task balance is enforced explicitly at release time rather than inferred after evaluation. The released benchmark contains six task families with matched counts, so both cognition axes and both media branches remain legible at the table level.

In the current release, each task contributes 1024 samples, yielding 3072 self-aware and 3072 environment-aware samples overall. This makes the benchmark easy to read quantitatively before any leaderboard aggregation is introduced. The table makes that symmetry explicit: the four image tasks evenly cover the two cognition axes, and the two video tasks mirror that same split at the temporal level rather than introducing a modality-specific imbalance.

Cognition Task Type Released Samples
Self-Aware Landmark-Relative Position Reasoning 1024
Self-Aware Future Observation Prediction 1024
Environment-Aware Self-Relative Position Reasoning 1024
Environment-Aware Landmark-Driven Action Decision 1024
Self-Aware Flight Behavior Recognition and Temporal Localization 1024
Environment-Aware Landmark Visibility Counting and Interval Reasoning 1024

Task Quantity Distribution. Each released task contributes 1024 samples, yielding balanced task counts across the two cognition axes and the six task families.
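The symmetry is simple enough to verify by hand; the snippet below just restates the arithmetic from the table above.

```python
# Restating the balance arithmetic from the table above.
samples_per_task = 1024
self_aware_tasks = 3          # 2 image tasks + 1 video task
environment_aware_tasks = 3   # 2 image tasks + 1 video task

assert self_aware_tasks * samples_per_task == 3072
assert environment_aware_tasks * samples_per_task == 3072
assert 6 * samples_per_task == 4096 + 2048  # image + video release totals
```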

Scene Coverage

Released scenes are first constructed as fused Stage 1 environments with aligned geometry, poses, and scene-level observations, and only then filtered into the benchmark-facing split. The current release keeps 12 test scenes active in the benchmark, rather than collapsing task construction around a single environment.

The scene library below shows two benchmark-facing views per released test scene whenever public thumbnails are available, while the table below summarizes how landmarks, image samples, and video samples are distributed across scenes. Even the largest released scene, env_16, contributes only 126 landmarks, so coverage remains broad rather than scene-dominated.

The scene-level counts show a deliberately spread release rather than a single dominant hub. env_16 is the largest contributor, followed by scenes such as env_7 and env_20, but smaller scenes like env_8, env_13, and env_17 are still preserved in the benchmark. As a result, the split spans dense urban layouts, industrial zones, waterfront areas, and smaller environments at the same time.

Scene Library. Two representative benchmark-facing snapshots are shown for each released test scene with available public thumbnails.

The scene-level distribution remains broad rather than collapsing around a single city block. Landmarks, image tasks, and video tasks all remain visible at the scene level in the released split.

Scene Landmarks Image Samples Video Samples
env_7 86 688 344
env_8 4 32 16
env_9 48 384 192
env_10 48 384 192
env_11 56 448 224
env_13 8 64 32
env_16 126 1008 504
env_17 8 64 32
env_20 61 488 244
env_21 16 128 64
env_23 41 328 164
env_24 10 80 40

Scene Coverage. Every selected scene contributes landmarks together with released image and video tasks, so the released split remains spatially distributed rather than scene-collapsed.
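The per-scene rows can be cross-checked against the headline release counts; the snippet below just re-sums the table above.

```python
# Re-summing the per-scene table against the headline release counts.
scene_stats = {  # scene: (landmarks, image samples, video samples)
    "env_7": (86, 688, 344),    "env_8": (4, 32, 16),
    "env_9": (48, 384, 192),    "env_10": (48, 384, 192),
    "env_11": (56, 448, 224),   "env_13": (8, 64, 32),
    "env_16": (126, 1008, 504), "env_17": (8, 64, 32),
    "env_20": (61, 488, 244),   "env_21": (16, 128, 64),
    "env_23": (41, 328, 164),   "env_24": (10, 80, 40),
}
totals = [sum(col) for col in zip(*scene_stats.values())]
assert totals == [512, 4096, 2048]  # landmarks, image, video
```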

Landmark Category

Released landmarks come from the Stage 2 review pipeline, where raw scene candidates are filtered, assigned coarse and fine-grained semantic labels, and linked to multiview RGB evidence before split selection. The released benchmark keeps 512 reviewed landmarks across eight coarse categories and 166 fine-grained subcategories.

The tables below summarize the dominant categories in the released split, and the landmark library grounds those labels back into the same multiview evidence used by the image-task branch. Building is currently the largest coarse category (34.4%), while Mid Rise Building is the single most common fine-grained subtype (15.6%).

The category distribution is broad but not flat. Buildings remain the largest coarse class, and vegetation is also strongly represented through frequently occurring tree landmarks, while public facilities and industrial infrastructure together account for a substantial fraction of the release. At the fine-grained level, the long tail is preserved: mid-rise buildings and deciduous trees dominate, but street furniture, signs, cranes, benches, and other urban objects remain visible as benchmark targets.

Coarse Category Released Landmarks Share
Building 176 34.4%
Industrial Infrastructure 90 17.6%
Public Facility 81 15.8%
Vegetation 80 15.6%
Urban Landscape 36 7.0%
Other 32 6.3%
Transport Infrastructure 11 2.1%
Vehicle 6 1.2%

Fine-Grained Subcategory Category Count Share
Mid Rise Building Building 80 15.6%
Deciduous Tree Vegetation 72 14.1%
Street Lamp Post Public Facility 35 6.8%
Low Rise Building Building 24 4.7%
High Rise Building Building 19 3.7%
Signboard Urban Landscape 18 3.5%
Billboard Urban Landscape 14 2.7%
Public Bench Public Facility 14 2.7%

Landmark Category Distribution. The released split preserves all eight coarse categories while remaining long-tailed at the fine-grained subcategory level.

The landmark library below keeps the full eight side-view orbit for representative released landmarks, so the category taxonomy is grounded in the same multiview evidence used by the image-task pipeline.

Landmark Library. Each showcased landmark preserves the full eight landmark-centric side views used by the released image-task branch.

Image Tasks

Image-task construction begins with reviewed Stage 2 landmarks, preserves their valid landmark-centric orbit views, and then adds task-specific egocentric captures such as the query observations used by environment-aware tasks. The image branch is therefore built from released scene assets and released landmark views rather than from detached benchmark-only screenshots.

In the current release, the image branch contains 4096 samples backed by 4840 source images, including 3816 valid landmark views and 1024 extra task-specific observation captures. Landmark descriptions are kept concise while remaining discriminative, averaging 10.6 words across the released split. Detailed rendering and prompt-constrained generation are described on the Construction page.

The two distributions show that image-task supervision remains visually and linguistically well-conditioned. Most released landmarks retain rich multiview support, with the largest mass concentrated at eight valid side views and the remaining landmarks spread between four and seven views. Description lengths are similarly concentrated: the released split peaks around 9 to 11 words, so descriptions stay compact while still carrying enough semantic detail to distinguish nearby landmarks.

Image Tasks: 4096 (released image QA samples across four task families)
Valid Landmark Views: 3816 (all valid RGB views attached to released landmarks)
Extra Observation Images: 1024 (task-specific egocentric captures added during Stage 4 generation)
Mean Description Length: 10.6 words (average landmark-description length in the released split)

Image-Branch Summary. The released image branch comprises 4096 samples, 3816 valid landmark views, and 1024 task-specific egocentric observations.

The distributions below make two properties explicit: most released landmarks preserve a high number of valid viewpoints, and landmark descriptions are intentionally concise rather than paragraph-length annotations.

Valid Side Views Released Landmarks
4 73
5 93
6 73
7 75
8 198

Valid Side-View Count Distribution. Distribution of valid landmark side-view counts in the released split; top views are excluded so the maximum remains eight orbit views.

Description Length (words) Landmarks
5 1
6 9
7 24
8 45
9 103
10 101
11 81
12 48
13 38
14 31
15 18
16 5
17 6
18 2

Description Length Distribution. Distribution of released landmark-description lengths measured in words.
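The 10.6-word mean quoted above follows directly from this histogram; the snippet below recomputes it.

```python
# Recomputing the 10.6-word mean from the released length histogram.
length_hist = {5: 1, 6: 9, 7: 24, 8: 45, 9: 103, 10: 101, 11: 81,
               12: 48, 13: 38, 14: 31, 15: 18, 16: 5, 17: 6, 18: 2}
total = sum(length_hist.values())                        # 512 landmarks
mean = sum(w * n for w, n in length_hist.items()) / total
print(round(mean, 1))                                    # 10.6
```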

Video Tasks

Video-task construction starts from reviewed landmarks and a predefined Stage 3 behavior library. In this library, atomic behaviors are the reusable low-level flight primitives, such as approach, orbit, rise, or mapping motions, each instantiated with controlled default parameter ranges. Composite behaviors are higher-level inspection patterns built by composing multiple atomic maneuvers into longer and more operationally meaningful UAV routines.

The released video branch therefore does not treat flight videos as unconstrained motion clips. Instead, it anchors both Flight Behavior Recognition and Temporal Localization and Landmark Visibility Counting and Interval Reasoning in the same hierarchical behavior definition, so that benchmark predictions can be interpreted against explicit flight-mode semantics rather than only against raw video appearance.

Gradual Approach
Gradual Depart
Circular Orbit
Figure-Eight Orbit
Spiral Orbit
Square Orbit
Triangular Orbit
Surface Mapping
Comet Trajectory
Sky Rise

Behavior Hierarchy. The released video branch is grounded in a two-layer behavior library where reusable atomic maneuvers are composed into higher-level inspection classes.
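One way to picture the two-layer library is as a small data structure in which composite inspections are ordered lists of atomic maneuvers. The sketch below is a minimal illustration; the class names, parameter names, and ranges are placeholders rather than the released Stage 3 defaults.

```python
# Minimal sketch of a two-layer behavior library; class names, parameter
# names, and ranges are placeholders, not the released Stage 3 defaults.
from dataclasses import dataclass, field

@dataclass
class AtomicBehavior:
    name: str                      # e.g. "circular_orbit"
    param_ranges: dict = field(default_factory=dict)  # controlled defaults

@dataclass
class CompositeBehavior:
    name: str                      # e.g. "circular_inspection"
    steps: list[AtomicBehavior]    # ordered atomic maneuvers

approach = AtomicBehavior("gradual_approach", {"speed_mps": (2.0, 6.0)})
orbit = AtomicBehavior("circular_orbit", {"radius_m": (20.0, 60.0)})
depart = AtomicBehavior("gradual_depart", {"speed_mps": (2.0, 6.0)})

circular_inspection = CompositeBehavior(
    "circular_inspection", steps=[approach, orbit, depart]
)
```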

The released video branch contains 2048 samples with 5h 27m 13s of total released footage, built from 5 composite templates and 10 atomic behavior classes. The distributions below summarize how those missions are instantiated and populated in the benchmark, while the detailed behavior library and mission-generation chain are documented on the Construction page.

The statistics below expose three complementary properties of the video branch. First, the behavior-mode counts are broadly even across the five composite inspection families, while the atomic library remains distributed across ten lower-level maneuvers rather than being dominated by only one or two motion primitives. Second, composite missions are physically longer than atomic ones, with most composite trajectories concentrated between 200 and 600 meters, whereas atomic trajectories are concentrated below 400 meters and especially in the 0 to 200 meter range. Third, the video-duration distribution mirrors that pattern: most atomic missions fall below 20 seconds, while most composite missions cluster between 10 and 30 seconds, with only a small long tail extending beyond 40 seconds.

The following distributions summarize how the released video branch is populated: the first table counts released missions per behavior mode, and the latter two compare atomic and composite trajectories in terms of physical path length and video duration. In both of the latter views, atomic missions dominate the shortest bins, while composite missions shift toward the mid-range bins that correspond to longer inspection routes and longer observation windows. The underlying behavior hierarchy and default mission templates now live in the Stage 3 construction section so that the benchmark page can stay focused on released statistics and task-facing interpretation.

Behavior Mode Released Missions
Circular Inspection 109
Spiral Inspection 103
Square Inspection 103
Triangular Inspection 99
Surface-Mapping Inspection 98
Atomic Circular Orbit 56
Atomic Comet Trajectory 54
Atomic Figure-Eight Orbit 54
Atomic Gradual Approach 54
Atomic Gradual Depart 53
Atomic Sky Rise 53
Atomic Spiral Orbit 49
Atomic Square Orbit 49
Atomic Surface Mapping 45
Atomic Triangular Orbit 45

Behavior Mode Distribution. Released mission counts across the Stage 3 behavior library.

Trajectory Length Atomic Composite
0-200 m 263 37
200-400 m 105 182
400-600 m 77 169
600-800 m 38 87
800-1000 m 9 28
1000-1200 m 7 8
1200-1400 m 5 1
1400-1600 m 2 0
1600-1800 m 4 0
1800-2000 m 2 0

Trajectory Length Distribution. Each bin counts released atomic and composite trajectories after recomputing waypoint-path lengths from mission artifacts.
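The waypoint-path recomputation mentioned in the caption reduces to summing Euclidean distances between consecutive waypoints. A minimal sketch, assuming the mission artifacts expose (x, y, z) positions in meters:

```python
# Sketch of the waypoint-path recomputation: trajectory length as the
# summed Euclidean distance between consecutive waypoints. Assumes the
# mission artifacts expose (x, y, z) positions in meters.
import math

def path_length(waypoints):
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))

# e.g. a short climb-and-cruise segment:
print(path_length([(0, 0, 30), (0, 40, 30), (30, 40, 50)]))  # ~76.1 m
```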

Duration Atomic Composite
0-10s 276 4
10-20s 108 198
20-30s 71 182
30-40s 28 84
40-50s 14 35
50-60s 3 9
60-70s 5 0
70-80s 2 0
80-90s 2 0
90-100s 1 0
100-110s 2 0

Video Duration Distribution. Duration is measured from the released mission videos and shown separately for atomic and composite missions.