A dual-cognition benchmark for aerial image and video reasoning with explicit grounding.
Dual-Cognition Formulation
Aerial embodied reasoning is formulated as joint self-understanding and world-understanding
UAV-DualCog is built around a dual-cognition formulation:
self-aware reasoning and environment-aware reasoning
define the benchmark's primary capability axis, while image and video media
act as the observation settings used to test that capability under different evidence
conditions. Across both cognition branches, the benchmark evaluates not only semantic
correctness, but also whether the answer is aligned with
explicit spatial or temporal evidence.
The formulation starts from the observation that a UAV navigating open 3D space must reason
about both itself and the world around it. Self-aware reasoning asks the agent to infer its
own landmark-relative position, predict the viewpoint change induced by motion, and
recognize the flight behavior it is executing over time. Environment-aware reasoning asks
the agent to infer where the target landmark lies relative to the UAV, decide what action
is appropriate under the current spatial situation, and reason about when or how often the
landmark becomes visible during flight.
The top-level formulation therefore separates reasoning about the agent itself
from reasoning about the external scene, instead of collapsing both into
one generic embodied score. Image and video are the two media through which the same dual-cognition
requirement is probed.
Dual-Cognition Formulation. The benchmark centers on self-aware and environment-aware reasoning, with image and video providing the two media through which this dual-cognition capability is evaluated.
Task Definition
Six tasks operationalize the dual-cognition formulation under image and video media
This dual-cognition split is paired with an evidence-aware evaluation design.
Image and video serve as complementary evaluation media: on the image side,
models must pair answer selection with landmark grounding, while on the video side they must
pair semantic recognition or counting with temporal localization.
The image branch contains two self-aware tasks and two
environment-aware tasks. Together, they test landmark-relative self-positioning, future
observation prediction, self-relative target positioning, and landmark-driven action
decision. Each image sample is released as a structured multiple-choice problem, and the
grounding-oriented tasks additionally require a normalized landmark bounding box.
The video branch contains one self-aware task for flight behavior
recognition and one environment-aware task for landmark visibility reasoning. These are
not free-form video description problems: models must return structured behavior
options or visibility counts together with interval predictions, so the benchmark
can measure semantic success and temporal evidence quality separately.
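As a concrete illustration of that separation, the sketch below scores the semantic choice and the temporal evidence independently; the greedy interval matching and the 0.5 IoU threshold are illustrative assumptions, not the benchmark's released protocol.

```python
def interval_iou(pred, gt):
    """IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def score_video_sample(pred_option, gt_option, pred_intervals, gt_intervals, iou_thr=0.5):
    """Score semantics and temporal evidence separately (illustrative protocol).

    Returns (semantic_ok, temporal_ok) so the two qualities can be aggregated
    independently, as the benchmark description requires.
    """
    semantic_ok = pred_option == gt_option
    # Greedy one-to-one matching of predicted intervals to ground-truth intervals.
    matched, used = 0, set()
    for p in pred_intervals:
        best_j, best_iou = None, iou_thr
        for j, g in enumerate(gt_intervals):
            if j in used:
                continue
            iou = interval_iou(p, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            used.add(best_j)
            matched += 1
    # Temporal evidence passes only if every interval is matched one-to-one.
    temporal_ok = matched == len(gt_intervals) == len(pred_intervals)
    return semantic_ok, temporal_ok
```

A model can therefore be right about what behavior occurred while still failing on when it occurred, and the two failures are reported separately.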
| Task | Description | Modality | Cognition | Interface | Output Schema |
| --- | --- | --- | --- | --- | --- |
| Landmark-Relative Position Reasoning | Infer the UAV position relative to the target landmark from a landmark-centric reference view and a current egocentric observation. | image | Self-Aware | Reference image + query observation | option + bbox |
| Future Observation Prediction | Predict which candidate image is the correct future observation after a described orbit action and localize the landmark in that selected image. | image | Self-Aware | Reference image + four future-view candidates | option |
| Self-Relative Position Reasoning | Judge where the target landmark lies relative to the UAV's current forward direction and ground it in the same observation. | image | Environment-Aware | Reference image + query observation | option + bbox |
| Landmark-Driven Action Decision | Choose which direction the UAV should move to approach the target landmark and ground the landmark in the current observation. | image | Environment-Aware | Reference image + query observation | option + bbox |
| Flight Behavior Recognition and Temporal Localization | Recognize the UAV's own flight behaviors from first-person video and localize the corresponding temporal intervals. | video | Self-Aware | Flight video | behavior option(s) + intervals |
| Landmark Visibility Counting and Interval Reasoning | Count landmark appearances in flight video and localize every visible interval of the target landmark. | video | Environment-Aware | Flight video + reference image + landmark description | count + intervals |
Task Definition. The released benchmark instantiates six tasks around the
dual-cognition split, with image and video acting as the two evaluation media through which
those capabilities are measured. Each task exposes a fixed input contract and a task-specific
structured output.
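For the counting-and-intervals output in particular, one natural well-formedness check (an assumed contract for illustration, not the released parsing rule) is that the predicted count agrees with the predicted visible intervals:

```python
def visibility_prediction_consistent(count: int, intervals: list[list[float]]) -> bool:
    """Check that a count + intervals prediction is internally consistent.

    Assumed contract: the count equals the number of intervals, and the
    intervals are valid ([start < end]) and sorted without overlap.
    """
    if count != len(intervals):
        return False
    prev_end = float("-inf")
    for start, end in intervals:
        if start >= end or start < prev_end:  # malformed, overlapping, or unsorted
            return False
        prev_end = end
    return True

print(visibility_prediction_consistent(2, [[3.0, 7.5], [12.0, 18.2]]))  # → True
```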
The benchmark defines complete prompts, constrained answer formats, and task-specific
parsing rules for all released tasks. The detailed prompt templates, JSON-style response
contracts, and aggregation protocol are collected on the Evaluation page.
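As a flavor of what such a contract looks like, the sketch below validates one hypothetical JSON-style response for a grounding task; the `option` and `bbox` field names are assumptions for illustration, and the authoritative templates remain on the Evaluation page.

```python
import json

def parse_grounded_answer(raw: str):
    """Parse a JSON-style response carrying an option letter and a normalized bbox.

    The `option`/`bbox` field names are hypothetical; the released contract
    is defined by the benchmark's own templates.
    """
    data = json.loads(raw)
    option = str(data["option"]).strip().upper()
    if option not in {"A", "B", "C", "D"}:
        raise ValueError(f"invalid option: {option!r}")
    x1, y1, x2, y2 = (float(v) for v in data["bbox"])
    # Normalized [x1, y1, x2, y2]: coordinates in [0, 1] and properly ordered.
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"bbox is not a normalized [x1, y1, x2, y2]: {data['bbox']}")
    return option, (x1, y1, x2, y2)
```

Constrained parsing of this kind is what lets the benchmark score option correctness and grounding quality as separate quantities.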
Examples are pulled from the current release split and keep the same media, prompt, answer, and prediction structure used in the benchmark.
Landmark-Relative Position Reasoning
env_7_27_237_self_shared_4way_000149_where
Image 1 shows the Front facade of light gray mid-rise building with red tiled roof and dormer windows in the landmark-centric coordinate frame. Based on that reference, what is your position relative to the landmark in image 2? Select one option and return the normalized bounding box of the landmark in image 2.
Options:
A. Right
B. Back
C. Left
D. Front
2 images: 1 reference view and 1 query observation.
Reference Image
Query Observation
Ground Truth
Option: A (Right)
BBox: [0.398, 0.384, 0.562, 0.674]
| Model | Option | BBox Prediction | IoU | Latency |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | C ✕ | [0.450, 0.350, 0.720, 0.650] ✕ | 0.302 | 5604 ms |
| GPT 5.3 Chat | B ✕ | [0.250, 0.450, 0.500, 0.800] ✕ | 0.203 | 3345 ms |
| Gemini 3 Flash | D ✕ | [1.000, 1.000, 1.000, 1.000] ✕ | 0.000 | 3696 ms |
| Qwen 3.6-Plus | B ✕ | [0.580, 0.290, 0.840, 0.460] ✕ | 0.000 | 1254 ms |
| Kimi K2.5 | B ✕ | [0.172, 0.556, 0.562, 0.994] ✕ | 0.097 | 2608 ms |
| GLM 4.6V | A ✓ | [1.000, 1.000, 1.000, 1.000] ✕ | 0.000 | 16820 ms |
| Mimo v2 Omni | B ✕ | [0.530, 0.000, 0.950, 0.450] ✕ | 0.009 | 3990 ms |
| InternVL 3.5-30B-A3B | C ✕ | [0.420, 0.450, 0.580, 0.550] ✕ | 0.288 | 48156 ms |
| SenseNova-SI-1.2 | – ✕ | – ✕ | 0.000 | – |
| VST-7B-RL | C ✕ | [0.110, 0.200, 0.300, 0.400] ✕ | 0.000 | 593 ms |
| SpaceOm | C ✕ | [0.600, 0.300, 0.800, 0.500] ✕ | 0.000 | 653 ms |
| ViLaSR | D ✕ | [1.000, 1.000, 1.000, 1.000] ✕ | 0.000 | 1223 ms |
Prediction Summary. In this qualitative example, 1 of 12 models selects the correct option, 0 of 12 produce an accepted grounding, and 0 of 12 achieve a dual-cognition pass, pointing to a coupling bottleneck between recognition and evidence validation. GLM 4.6V is the only model to choose the correct option, while models such as Claude Sonnet 4.6 and GPT 5.3 Chat miss both the option and the grounding.
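The IoU values in the example follow the standard intersection-over-union on normalized [x1, y1, x2, y2] boxes. The helper below is a generic reimplementation rather than the benchmark's released scorer, and it reproduces the 0.302 reported for Claude Sonnet 4.6 against the ground-truth box:

```python
def bbox_iou(a, b):
    """IoU of two normalized [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

gt = [0.398, 0.384, 0.562, 0.674]           # ground-truth landmark box
pred = [0.450, 0.350, 0.720, 0.650]         # Claude Sonnet 4.6 prediction above
print(round(bbox_iou(gt, pred), 3))         # → 0.302
```

A degenerate box such as [1.000, 1.000, 1.000, 1.000] has zero area, which is why several models score IoU = 0.000.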
Benchmark Statistics
Overall Statistics
The released benchmark split is carved out from a larger reviewed scene pool. The released statistics below describe the benchmark-facing subset, while the figure retains the paper-level overview of released scenes, landmarks, and task counts.
The released benchmark is intentionally balanced around the
two dual-cognition branches, and it exposes those branches under both
image and video media. The statistics in this section therefore move from
global scale to concrete dataset structure: task-count balance first, then scene coverage,
landmark taxonomy, and finally the image and video branches with their own acquisition and
generation characteristics.
Read together, these summaries show that UAV-DualCog is not only a collection of samples,
but a structured benchmark release built from reviewed scenes, multiview landmarks,
behavior-driven missions, and evidence-aware task interfaces. The cards below summarize
the released scale, while the figure and the following subsections unpack how that scale
is distributed across scenes, categories, task families, and media-specific design
choices.
12 Released Scenes · 512 Released Landmarks · 4096 Image Samples · 2048 Video Samples · 4840 Source Images · 5h 27m 13s Video Duration
Released task balance, landmark distribution, and video-side trajectory statistics in the benchmark.
Task Quantity Distribution
Task balance is enforced explicitly at release time rather than inferred after evaluation.
The released benchmark contains six task families with matched counts, so both cognition
axes and both media branches remain legible at the table level.
In the current release, each task contributes 1,024 samples, yielding 3,072
self-aware rows and 3,072 environment-aware rows overall. This makes
the benchmark easy to read quantitatively before any leaderboard aggregation is introduced.
The table makes that symmetry explicit: the four image tasks evenly cover the two cognition
axes, and the two video tasks mirror that same split at the temporal level rather than
introducing a modality-specific imbalance.
| Cognition | Task Type | Released Samples |
| --- | --- | --- |
| Self-Aware | Landmark-Relative Position Reasoning | 1024 |
| Self-Aware | Future Observation Prediction | 1024 |
| Environment-Aware | Self-Relative Position Reasoning | 1024 |
| Environment-Aware | Landmark-Driven Action Decision | 1024 |
| Self-Aware | Flight Behavior Recognition and Temporal Localization | 1024 |
| Environment-Aware | Landmark Visibility Counting and Interval Reasoning | 1024 |
Task Quantity Distribution. Each released task contributes 1,024 samples,
yielding balanced task counts across the two cognition axes and the six task families.
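That balance can be verified directly from the table by aggregating per-task counts along the cognition axis:

```python
# Released task counts from the table above.
released = {
    "Landmark-Relative Position Reasoning": ("Self-Aware", 1024),
    "Future Observation Prediction": ("Self-Aware", 1024),
    "Self-Relative Position Reasoning": ("Environment-Aware", 1024),
    "Landmark-Driven Action Decision": ("Environment-Aware", 1024),
    "Flight Behavior Recognition and Temporal Localization": ("Self-Aware", 1024),
    "Landmark Visibility Counting and Interval Reasoning": ("Environment-Aware", 1024),
}

totals = {}
for axis, n in released.values():
    totals[axis] = totals.get(axis, 0) + n

print(totals)  # → {'Self-Aware': 3072, 'Environment-Aware': 3072}
```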
Scene Coverage
Released scenes are first constructed as fused Stage 1 environments with aligned geometry,
poses, and scene-level observations, and only then filtered into the benchmark-facing split.
The current release keeps 12 test scenes active in the benchmark, rather
than collapsing task construction around a single environment.
The scene library below shows two benchmark-facing views per released test scene whenever
public thumbnails are available, while the chart summarizes how landmarks, image samples,
and video samples remain distributed across scenes. Even the largest released scene, ENV 16, contributes only 126 landmarks, so coverage remains broad rather than scene-dominated.
The scene-level bars show a deliberately spread release rather than a single dominant hub.
ENV 16 is the largest contributor, followed by scenes such as ENV 7 and ENV 20, but
smaller scenes like ENV 8, ENV 13, and ENV 17 are still preserved in the benchmark.
As a result, the split spans dense urban layouts, industrial zones, waterfront areas, and
smaller environments at the same time.
Scene Library. Two representative benchmark-facing snapshots are shown for
each released test scene with available public thumbnails.
The scene-level distribution remains broad rather than collapsing around a single city
block. Landmarks, image tasks, and video tasks all remain visible at the scene level in
the released split.
Scene Coverage. Every selected scene contributes landmarks together with released image and video tasks, so the released split remains spatially distributed rather than scene-collapsed.
Landmark Category
Released landmarks come from the Stage 2 review pipeline, where raw scene candidates are
filtered, assigned coarse and fine-grained semantic labels, and linked to multiview RGB
evidence before split selection. The released benchmark keeps 512
reviewed landmarks across eight coarse categories and 166 fine-grained subcategories.
The tables below summarize the dominant categories in the released split, and the landmark
library grounds those labels back into the same multiview evidence used by the image-task
branch. Building is currently the largest coarse category (34.4%), while Mid Rise Building is the single most common fine-grained subtype (15.6%).
The category distribution is broad but not flat. Buildings remain the largest coarse class,
and vegetation is also strongly represented through frequently occurring tree landmarks,
while public facilities and industrial infrastructure together account for a substantial
fraction of the release. At the fine-grained level, the long tail is preserved: mid-rise
buildings and deciduous trees dominate, but street furniture, signs, cranes, benches, and
other urban objects remain visible as benchmark targets.
Landmark Category Distribution. The released split preserves all eight
coarse categories while remaining long-tailed at the fine-grained subcategory level.
The landmark library below keeps the full eight side-view orbit for representative released
landmarks, so the category taxonomy is grounded in the same multiview evidence used by the
image-task pipeline.
Mid Rise Building
Building · Mid Rise Building · ENV 7
light gray mid-rise building with red tiled roof and dormer windows
Eight orbit views: Front, Front Right, Right, Back Right, Back, Back Left, Left, Front Left (two of the views are marked Occluded / Invalid for this landmark).
Landmark Library. Each showcased landmark preserves the full eight
landmark-centric side views used by the released image-task branch.
Image Tasks
Image-task construction begins with reviewed Stage 2 landmarks, preserves
their valid landmark-centric orbit views, and then adds
task-specific egocentric captures such as the query observations used by
environment-aware tasks. The image branch is therefore built from released
scene assets and released landmark views rather than from detached benchmark-only
screenshots.
In the current release, the image branch contains 4096 samples backed by
4840 source images, including 3816 valid landmark views and
1024 extra task-specific observation captures. Landmark
descriptions are kept concise while remaining discriminative, averaging 10.6 words across the released split.
Detailed rendering and prompt-constrained generation are described in the construction
page.
The two histograms show that image-task supervision remains visually and
linguistically well-conditioned. Most released landmarks retain
rich multiview support, with the largest mass concentrated at eight valid
side views, and the remaining landmarks still clustering around five to seven views.
Description lengths are similarly concentrated: the released split peaks around 9 to 11
words, so descriptions stay compact while still carrying enough semantic detail to
distinguish nearby landmarks.
4096 Image Samples (released image QA samples across four task families)
3816 Valid Landmark Views (all valid RGB views attached to released landmarks)
1024 Extra Observation Images (task-specific egocentric captures added during Stage 4 generation)
Mean Description Length: 10.6 words (average landmark-description length in the released split)
Image-Branch Summary. The released image branch comprises 4096
samples, 3816 valid landmark views, and 1024
task-specific egocentric observations.
The histograms below make two properties explicit: most released landmarks preserve a high
number of valid viewpoints, and landmark descriptions are intentionally
concise rather than paragraph-length annotations.
Valid Side-View Count Distribution. Distribution of valid landmark side-view counts in the released split; top views are excluded so the maximum remains eight orbit views.
Description Length Distribution. Distribution of released landmark-description lengths measured in words.
Video Tasks
Video-task construction starts from reviewed landmarks and a predefined
Stage 3 behavior library. In this library,
atomic behaviors are the reusable low-level flight primitives, such as
approach, orbit, rise, or mapping motions, each instantiated with controlled default
parameter ranges. Composite behaviors are higher-level inspection patterns
built by composing multiple atomic maneuvers into longer and more operationally meaningful
UAV routines.
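One way to sketch this two-layer library in code is shown below; the class names, field names, and numeric parameter ranges are illustrative assumptions rather than the released Stage 3 schema.

```python
from dataclasses import dataclass, field

@dataclass
class AtomicBehavior:
    """Reusable low-level flight primitive with default parameter ranges."""
    name: str  # e.g. "circular_orbit"
    param_ranges: dict[str, tuple[float, float]] = field(default_factory=dict)

@dataclass
class CompositeBehavior:
    """Higher-level inspection pattern composed of atomic maneuvers."""
    name: str
    steps: list[AtomicBehavior]

    def total_steps(self) -> int:
        return len(self.steps)

# Hypothetical instantiation: an inspection routine that approaches a
# landmark, orbits it, and then rises away (all values are made up).
approach = AtomicBehavior("gradual_approach", {"speed_mps": (2.0, 6.0)})
orbit = AtomicBehavior("circular_orbit", {"radius_m": (20.0, 80.0)})
rise = AtomicBehavior("sky_rise", {"delta_alt_m": (10.0, 40.0)})
inspection = CompositeBehavior("landmark_inspection", [approach, orbit, rise])
print(inspection.total_steps())  # → 3
```

Composites stay interpretable because every step points back to a named atomic primitive, mirroring how the benchmark anchors video predictions in explicit flight-mode semantics.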
The released video branch therefore does not treat flight videos as unconstrained motion
clips. Instead, it anchors both
Flight Behavior Recognition and Temporal Localization and
Landmark Visibility Counting and Interval Reasoning in the same
hierarchical behavior definition, so that benchmark predictions can be
interpreted against explicit flight-mode semantics rather than only against raw video
appearance.
Behavior hierarchy figure: composite relations over the atomic behavior classes Gradual Approach, Gradual Depart, Circular Orbit, Figure-Eight Orbit, Spiral Orbit, Square Orbit, Triangular Orbit, Surface Mapping, Comet Trajectory, and Sky Rise.
Behavior Hierarchy. The released video branch is grounded in a two-layer
behavior library where reusable atomic maneuvers are composed into higher-level inspection
classes.
The released video branch contains 2048 samples with 5h 27m 13s
of total released footage, built from 5 composite
templates and 10 atomic behavior classes. The
distributions below summarize how those missions are instantiated and populated in the
benchmark, while the detailed behavior library and mission-generation chain are documented
on the Construction page.
The statistics below expose three complementary properties of the video branch. First, the
behavior-mode counts are broadly even across the five composite inspection families, while
the atomic library remains distributed across ten lower-level maneuvers rather than being
dominated by only one or two motion primitives. Second, composite missions are physically
longer than atomic ones, with most composite trajectories concentrated between 200 and 600
meters, whereas atomic trajectories are concentrated below 400 meters and especially in the
0 to 200 meter range. Third, the video-duration distribution mirrors that pattern: most
atomic missions fall below 20 seconds, while most composite missions cluster between 10 and
30 seconds, with only a small long tail extending beyond 40 seconds.
The following distributions summarize how the released video branch is populated: the first
chart counts released missions per behavior mode, and the latter two compare atomic and
composite trajectories in terms of physical path length and video duration. In both
comparisons, atomic missions dominate the shortest bins, while composite missions shift toward
the mid-range bins that correspond to longer inspection routes and longer observation
windows. The underlying behavior hierarchy and default mission templates now live in the
Stage 3 construction section so that the benchmark page can stay focused on released
statistics and task-facing interpretation.
Behavior Mode Distribution. Released mission counts across the Stage 3 behavior library.
Trajectory Length Distribution. Each bin stacks released atomic and composite trajectories after recomputing waypoint-path lengths from mission artifacts.
Video Duration Distribution. Duration is measured from the released mission videos and shown separately for atomic and composite missions.