Construction

Dataset Construction Pipeline

Scene assets are transformed into reviewed landmarks, then into image and video benchmark tasks through a stage-by-stage pipeline with explicit handoff files and quality-control points.

Release Funnel

The released split is obtained through explicit filtering stages.

These counts come from the current release statistics and summarize how raw scene assets are narrowed into the released split.

The funnel below should be read as a quality-controlled release path rather than a simple downsampling process. UAV-DualCog starts from a broader reviewed scene pool, then progressively restricts that pool through landmark validation, task construction constraints, and benchmark split selection until the public release retains only the rows that satisfy the intended evaluation contract.

In practice, the largest reductions do not come from rendering image or video tasks themselves, but from the earlier stages that determine whether a scene is suitable for release, whether a landmark is sufficiently stable and recognizable, and whether a mission or observation should be exposed as benchmark-facing media. This is why the left-hand funnel is best interpreted as evidence of curation rigor: each retained count already reflects review, filtering, and release-boundary decisions rather than raw generation volume alone.

The Stage Snapshot on the right complements these counts by summarizing the media contract that the released benchmark actually exposes. Together, the funnel and the snapshot explain both how much data survives each stage and what form that surviving data takes once it reaches public use, including image resolution, frame sampling, aspect-ratio consistency, and compression policy. Unlike many earlier embodied QA benchmarks that release only relatively low-resolution observations, UAV-DualCog preserves 4K-grade source imagery and adopts a DCI 4K-width 4:3 capture setting (4096×3072), which is also compatible with mainstream UAV capture configurations such as the DJI Mavic 4 Pro. This allows the same release to support evaluation under multiple resolution settings: researchers may retain the original high-resolution assets for fine spatial reasoning tests or derive lower-resolution variants for controlled efficiency comparisons without changing benchmark semantics.

Construction Funnel

Counts exported from the current release statistics.

Raw Landmark Candidates 2626
Stage 1 + Stage 2 candidate pool
Reviewed Valid Landmarks 746
After review keep/discard and semantic annotation
Released Split Landmarks 512
Landmarks selected into UAV-DualCog
Stage 4 Samples 4096
Image QA rows exported from the released split
Stage 3 Samples 2048
Video benchmark rows exported from the released split

Construction Funnel. The largest reductions happen before the benchmark rows are finalized: scene review narrows the raw environment pool first, then landmark validation and task-side filtering further trim the candidate space before public export. By the time the release reaches 512 landmarks, 4096 image rows, and 2048 video rows, each retained item has already passed multiple rounds of geometric, semantic, and release-boundary screening rather than simple random subsampling.
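As a sanity check, the funnel counts can be recomputed directly. The sketch below is hypothetical helper code (not release tooling) that derives the per-stage retention rates and the per-landmark row multipliers from the numbers reported above:

```python
# Hypothetical sketch: recompute funnel retention and row multipliers
# from the release statistics quoted in the funnel above.
FUNNEL = [
    ("raw_landmark_candidates", 2626),
    ("reviewed_valid_landmarks", 746),
    ("released_split_landmarks", 512),
]

def retention_rates(funnel):
    """Fraction of the previous stage retained at each later stage."""
    return {
        name: kept / prev
        for (_, prev), (name, kept) in zip(funnel, funnel[1:])
    }

rates = retention_rates(FUNNEL)
landmarks = FUNNEL[-1][1]
image_rows = landmarks * 2 * 4       # two difficulty settings x four task families
video_rows_per_landmark = 2048 // landmarks
```

Review (2626 to 746) removes roughly 72% of raw candidates, while split selection (746 to 512) removes about 31% more, confirming that the heaviest filtering happens before any benchmark row is rendered.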

Stage Snapshot

The construction chain stays compact because each stage has a narrow role and a concrete handoff artifact.

DCI 4K
Image & Frame Capture

4096×3072 for both video-task frame capture and image-task asset rendering

4:3
Aspect Ratio

used consistently across released media and aligned with common UAV capture settings

JPEG quality 80
Compression Policy

balances image quality and file size in the public release

1080P
Stage 3 Video

1440×1080 · 10 Mbps · H.264 / MP4 · YUV420p, with on-demand resolution support up to 4096×3072

10 FPS
Video Frame Rate

captures rapid action changes while remaining practical for MLLM temporal reasoning

Stage Pipeline

The construction story is best read as four linked stages rather than scattered implementation notes.

UAV-DualCog is constructed through a four-stage process that starts from scene-scale point cloud collection, moves through landmark review and structured annotation, then branches into behavior-driven video-task construction and landmark-centered image-task construction. The figure below gives the end-to-end pipeline view, while the summary cards that follow condense each stage into its role, inputs, outputs, and benchmark-facing handoff.

Construction Pipeline. The benchmark construction process links scene scanning, landmark review, behavior-driven mission generation, and image-task assembly into one explicit four-stage pipeline.

STAGE1

Scene Point-Cloud Collection and Semantic Fusion

Build a scene-level geometric backbone aligned with RGB, segmentation, LiDAR, and pose data.

Inputs. Scene config, sampled poses, simulator-side RGB/segmentation/LiDAR streams

Outputs. scene_data/<scene>/pcd_map/*.pcd; semantic_lidar_compact.npy; instance-aware fusion metadata

STAGE2

Landmark Review and Semantic Annotation

Turn the fused scene cloud into a reviewed, semantically named landmark pool.

Inputs. Fused semantic / instance clouds; multiview RGB evidence; review state

Outputs. landmarks_raw/; landmarks_review/; valid_instances.json

STAGE3

Behavior-Driven Video Task Generation

Compose missions, repair trajectories, render benchmark videos, and export video manifests and metrics.

Inputs. valid_instances.json; atomic/composite behavior templates; collision-aware render config

Outputs. stage3_tasks/missions/*/final_task; stage3 manifests; parsed predictions and metrics CSV

STAGE4

Image QA Generation, Render-Only Asset Refresh, and Evaluation

Sample structured image QA rows and keep assets refreshable without changing task semantics.

Inputs. valid_instances.json; view-definition rules; difficulty filters; render_requests sidecars

Outputs. qa/manifests; qa/render_requests; qa/assets; stage4 metrics CSV

Stage 1

Scene Point-Cloud Collection and Semantic Fusion

Build a scene-level geometric backbone aligned with RGB, segmentation, LiDAR, and pose data.

Stage 1 starts from the simulator scene itself together with the runtime configuration that defines sensor settings, sampling bounds, altitude, and pose coverage. The goal is not to generate benchmark questions directly, but to reconstruct a stable scene-level geometric backbone that later stages can mine repeatedly without returning to image space.

In the current implementation, this stage follows the same logic described in the paper: multi-pose sampling, raw chunk capture, and global semantic fusion. RGB, segmentation, LiDAR, and pose metadata are captured at every sampled pose, then transformed into a unified scene coordinate system so that semantic and instance identities remain attached to the fused point cloud. The result is a scene-scale representation that Stage 2 can split directly into landmark candidates rather than re-detecting objects from scratch.

The key output of Stage 1 is therefore not a benchmark row but a geometric substrate. By the time this stage finishes, every released scene already has a stable map boundary, fused point cloud, global coordinate system, and scan preview record. That shared spatial backbone is what lets later stages reason about landmark visibility, relative position, orbit feasibility, and scene coverage without repeatedly solving low-level mapping problems.

Inputs

Scene configs define MapBound, altitude, yaw sweep, and runtime capture parameters; each sampled pose records synchronized RGB, segmentation, LiDAR, and pose metadata.

Core Transform

Raw chunks are fused into a scene-scale semantic point cloud while preserving semantic and instance identities, so Stage 2 can aggregate landmark candidates directly from geometry-aware assets.
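The core transform above amounts to a pose-aligned concatenation that keeps per-point identities attached. A minimal sketch, assuming per-chunk rotation/translation metadata and field names of our own choosing (not the released data layout):

```python
import numpy as np

def pose_to_world(points_local, R, t):
    """Map (N, 3) sensor-frame points into the shared scene frame."""
    return np.asarray(points_local) @ np.asarray(R).T + np.asarray(t)

def fuse_chunks(chunks):
    """Concatenate pose-aligned chunks, carrying semantic/instance ids per point.

    Each chunk is a dict with keys "xyz", "R", "t", "sem_id", "inst_id";
    these names are illustrative, not the release format.
    """
    xyz = np.concatenate([pose_to_world(c["xyz"], c["R"], c["t"]) for c in chunks])
    sem = np.concatenate([c["sem_id"] for c in chunks])
    inst = np.concatenate([c["inst_id"] for c in chunks])
    return xyz, sem, inst
```

Because identities travel with the points, Stage 2 can aggregate landmark candidates by a simple group-by on `(sem_id, inst_id)` instead of re-detecting objects in image space.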

Outputs

The released handoff is a set of scene-scale assets such as fused point-cloud files, scan preview media, and scene metadata that keep Stage 1 coverage auditable before landmark mining begins.

Reproduce

python scripts/flightmvstg/stage1_collect_pcd.py --config configs/flightmvstg/task_airsim_env_7.yaml --mode all --scene-id env_7 --engine airsim

The scene scanning library below exposes benchmark-facing panoramas for the released test scenes while keeping Stage 1 metadata visible. Each entry now combines two released RGB scene snapshots with two segmented point-cloud panoramas rendered in top-down and oblique views, together with the configured scene boundary, mapped area, and the complete unscreened landmark-candidate count that Stage 2 begins from.

Scene Scanning Library. Two benchmark-facing RGB scene snapshots and two segmented point-cloud panoramas are shown for each released test environment together with Stage 1 boundary metadata and the unscreened Stage 2 candidate pool.

Stage 2

Landmark Review and Semantic Annotation

Turn the fused scene cloud into a reviewed, semantically named landmark pool.

Stage 2 consumes the fused semantic and instance-aware cloud from Stage 1 and converts it into reviewed landmark assets. The pipeline first groups points by semantic class and instance identity, then renders multiview RGB and point-cloud evidence around each candidate so that geometric structure, appearance, and view direction are all available before any benchmark task is built.

For each landmark candidate, the pipeline computes eight orbit-view capture poses from the 3D bounding box and instance center, covering the canonical directions front, front-right, right, back-right, back, back-left, left, and front-left. The associated segmented point-cloud views are then used to perform occlusion-aware validity checks, so heavily blocked or non-informative directions can be excluded before benchmark-facing landmark assets are finalized.
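The eight canonical directions are evenly spaced at 45° steps around the instance center. The sketch below illustrates that geometry under simplifying assumptions (a single fixed radius and a caller-supplied anchor yaw; the release derives both from the 3D bounding box):

```python
import math

# Canonical orbit order used for the eight capture directions.
DIRECTIONS = ["front", "front-right", "right", "back-right",
              "back", "back-left", "left", "front-left"]

def orbit_poses(center, radius, anchor_yaw_deg=0.0):
    """Return one camera position per canonical direction, all facing the center.

    `radius` and `anchor_yaw_deg` stand in for the bbox-derived values
    the pipeline computes; this is an illustrative sketch only.
    """
    cx, cy, cz = center
    poses = {}
    for i, name in enumerate(DIRECTIONS):
        yaw = math.radians(anchor_yaw_deg + 45.0 * i)
        poses[name] = {
            "position": (cx + radius * math.cos(yaw),
                         cy + radius * math.sin(yaw),
                         cz),
            "look_at": center,
        }
    return poses
```

Anchoring a single direction in review (Step 2 below) then fixes `anchor_yaw_deg`, and the remaining seven poses follow automatically in this fixed order.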

This is the stage where scene geometry becomes benchmark object assets. Review removes unstable or uninformative candidates, confirms the main view, and finalizes the reviewed landmark manifest used by both Stage 3 and Stage 4. After review, the pipeline runs a constrained auto-label prompt to add coarse category, fine-grained subcategory, and a short discriminative description that later prompts can reuse.

The important shift is that Stage 2 turns a fused but still machine-oriented scene map into benchmark-facing semantic objects. Candidate aggregation, multiview review, and prompt-constrained annotation are linked in one chain: only after geometry, visibility, and main-view consistency are confirmed does the pipeline allow a landmark to receive the semantic fields that later prompt templates depend on.

Step 1 · Candidate Aggregation. The pipeline first partitions the fused semantic cloud by semantic class and instance identity. Each resulting candidate is then packaged with multiview RGB evidence, point-cloud views, 3D center metadata, and 3D box support so that later review decisions are made against geometry-aware evidence rather than isolated screenshots. At this stage the pool is intentionally permissive: the goal is to surface all plausible landmark candidates before benchmark-facing filtering begins.

Step 2 · Review and Main-View Confirmation. Human review then removes unstable or uninformative instances, confirms the most representative main view, verifies one landmark-centric direction anchor, and checks bbox validity. The main view is not required to be the front image; it is simply the clearest and most benchmark-facing reference view. Once that anchor direction is confirmed, the remaining seven directions are derived automatically in the fixed orbit order rather than being edited independently. At this step, reviewers also manually remove unusable view images that remain visually ambiguous, severely occluded, or otherwise unsuitable even after the earlier geometric filtering. Candidates can be dropped immediately, but a landmark is only kept after the main-view and direction checks are complete. The accepted pool is frozen in the reviewed landmark manifest (valid_instances.json), which becomes the only stable Stage 2 handoff consumed by both later stages. In practice this step turns a broad candidate inventory into a controlled landmark repository whose direction labels, representative views, and spatial references can be trusted downstream.

Step 3 · Prompt-Constrained Semantic Annotation. After geometric review, the pipeline runs a constrained auto-label prompt to enrich each landmark with benchmark-facing semantics. This is not free-form captioning: the prompt explicitly constrains category choices, subtype naming, and JSON formatting so that landmark semantics remain short, discriminative, and machine-parseable. The result is a semantic layer that later prompt templates can reuse directly without re-describing each landmark from scratch.

In practice, Stage 2 Steps 2 and 3 are completed in the internal review web interface. That interface is where reviewers perform candidate screening, main-view confirmation, single-direction anchoring, invalid-view cleanup, auto-label launch, and final semantic audit. The command line remains useful for collection and standalone auto-label reruns, but the web interface is the recommended working surface whenever view quality and semantic correctness need to be judged together from visual evidence.

Layer 1 · Coarse Category. A closed eight-class ontology anchors every landmark before later task generation.

The prompt must choose exactly one coarse category from the fixed benchmark-facing list: building, vehicle, public_facility, urban_landscape, transport_infrastructure, industrial_infrastructure, vegetation, or other.

Layer 2 · Fine-Grained Subcategory. The second layer stays flexible but must remain category-consistent and generic.

Subcategories cannot be proper nouns or brands; they act as reusable landmark types such as mid-rise building, trash bin, shipping container stack, or deciduous tree.

Layer 3 · Discriminative Description. A short noun phrase captures the identifying cues that later prompts can reuse.

Descriptions must stay under twenty words while preserving subtype, color, shape or texture, and any local visual cue that helps distinguish nearby landmarks.
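The three layers above are machine-checkable. A hedged sketch of a validator for one auto-label row, enforcing the closed category list, the 20-word description limit, and a bounded confidence (helper name and error messages are ours):

```python
import json

# Closed eight-class ontology from Layer 1.
CATEGORIES = {"building", "vehicle", "public_facility", "urban_landscape",
              "transport_infrastructure", "industrial_infrastructure",
              "vegetation", "other"}

def validate_label(raw):
    """Parse one auto-label JSON row and check the three-layer constraints."""
    row = json.loads(raw)
    assert row["category"] in CATEGORIES, "category must come from the closed list"
    assert row["subcategory"].strip(), "subcategory must be non-empty"
    assert len(row["description"].split()) <= 20, "description must stay under 20 words"
    assert 0.0 <= row["confidence"] <= 1.0, "confidence must lie in [0, 1]"
    return row
```

Running a check like this after every auto-label call keeps the semantic layer machine-parseable before it is frozen into the reviewed landmark manifest.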

The reviewed landmark scanning library below makes that Stage 2 handoff concrete: it shows each retained landmark view together with the paired segmented point-cloud rendering captured from the same scanning pose. This is the evidence basis from which later prompt templates, mission planners, and image-task constructors all reuse Stage 2 landmark assets after review and semantic labeling are finalized.

Landmark Scanning Library. Reviewed landmark-centric RGB side views are shown together with their corresponding segmented point-cloud views, preserving the scan-level evidence that underlies later image and video task construction.

Auto-Label Prompt Package

System Prompt

You are an aerial landmark recognition expert. Only use image evidence.

Output Schema

{"category":"building|vehicle|public_facility|urban_landscape|transport_infrastructure|industrial_infrastructure|vegetation|other","subcategory":"...","description":"...","confidence":0.0}

User Prompt Template

- class_name: {class_name or '(empty)'} (user-filled weak hint from Step 2, optional)
- images: up to 4 views (front/back/left/right preferred, some directions may be missing)
- Each uploaded image has a red bounding box marking the landmark, and the side label (e.g., 'Visible Side: Front (Landmark-centric)') at the top-left corner shows the visible side of the landmark in the object-centric (landmark-centric) frame.

category candidates (must choose from list): [building, vehicle, public_facility, urban_landscape, transport_infrastructure, industrial_infrastructure, vegetation, other]

subcategory requirements:
- flexible generic subtype, not a proper noun or brand name
- must stay category-consistent
- examples below are illustrative only; use other common descriptive terms or phrases if they better match the landmark
  • building: low-rise building, mid-rise building, high-rise building, warehouse, pagoda, chapel, factory shed, rural farmhouse, glass skyscraper, brick schoolhouse
  • vehicle: sedan car, delivery van, city bus, cargo truck, motorcycle, construction excavator
  • public_facility: street lamp post, bus shelter, public bench, trash bin, fire hydrant, antenna
  • urban_landscape: sculpture, plaza, fountain, city square, urban garden, landscape installation, signboard, billboard, advertising board, wayfinding sign, landmark signage
  • transport_infrastructure: arch bridge, overpass, railway track, tunnel entrance, roundabout, pedestrian crosswalk, subway station entrance, traffic island, railway platform
  • industrial_infrastructure: shipping container stack, oil storage tank, grain silo, crane tower, pipeline
  • vegetation: deciduous tree, coniferous tree, palm tree, shrub cluster, hedge row, grassy lawn
  • other: temporary tent, rubble pile, playground slide, inflatable archway, construction barrier

description constraints (`description`):
- one noun phrase, <= 20 words
- must include: subcategory, color, shape/texture
- may include surrounding relation, visible text/pattern cues to distinguish the landmark
- positive examples:
  - dark red middle-rise building with white neon light featuring the word "HOTEL" on the top
  - gray pagoda-like tower with layered roof edges beside roadside trees
  - white arch bridge with curved span above a narrow river channel
  - dark stone obelisk with sharp top in open paved square

Output JSON (no extra explanation text):
{ "category": "building|vehicle|public_facility|urban_landscape|transport_infrastructure|industrial_infrastructure|vegetation|other", "subcategory": "...", "description": "...", "confidence": 0.0 }

Outputs and Bridge

The released Stage 2 outputs live in the raw and reviewed landmark bundles; reviewed RGB views, main-view tags, semantic layers, and short descriptions then become the common substrate reused by Stage 3 and Stage 4.

Reproduce

python scripts/flightmvstg/stage2_landmark_label.py --config configs/flightmvstg/task_airsim_env_7.yaml --scene-id env_7 --mode review_instances_web --port 20261

Stage 3

Behavior-Driven Video Task Generation

Compose missions, repair trajectories, render benchmark videos, and export video manifests and metrics.

Stage 3 begins from reviewed Stage 2 landmarks and an explicit two-layer behavior system. In the current single-landmark release, each valid landmark is bound to one atomic mission and one composite mission. Those missions are not just labels: they expand into executable trajectories, camera-control programs, temporal supervision tracks, and final benchmark manifest rows.

The flight-mode library is designed as a benchmark primitive rather than a loose collection of cinematic motions. The intent is to keep released trajectories close to common UAV operating scenarios, so that the resulting video tasks reflect recognizable drone behavior patterns instead of synthetic camera sweeps with little operational meaning. At the low level, the atomic motions borrow from the design logic behind DJI's automated “MasterShots”-style capture patterns, where approach, orbit, rise, and mapping maneuvers are defined as reusable camera-flight units. At the high level, the hierarchical flight-mode design organizes composite classes around common inspection and mapping needs, so that the benchmark's temporal reasoning tasks remain tied to realistic patrol, inspection, and surveying workflows.

Gradual Approach
Gradual Depart
Circular Orbit
Figure-Eight Orbit
Spiral Orbit
Square Orbit
Triangular Orbit
Surface Mapping
Comet Trajectory
Sky Rise

Behavior Hierarchy. Composite inspection classes are instantiated by chaining atomic maneuvers, and the highlighted links show which primitives each composite template reuses in the released Stage 3 library.

The implemented chain is mission generation, trajectory search, trajectory repair, video recording, and temporal task organization. Collision checks and safety buffers can trigger radius, height, or scan-width adjustments before a trajectory is accepted, so the final videos are filtered task assets rather than naïve recordings of an idealized path. Parallel rendering is also scheduled to keep missions spatially separated and reduce multi-UAV contamination in the frame.
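The repair step can be pictured as a bounded adjustment loop: widen or lift the trajectory until a collision check accepts it, otherwise reject the mission. The sketch below is illustrative (the predicate, step sizes, and retry budget are assumptions, not the released implementation):

```python
def repair_orbit(radius, altitude, is_clear,
                 radius_step=2.0, alt_step=2.0, max_tries=10):
    """Return an accepted (radius, altitude) pair, or None if repair fails.

    `is_clear(radius, altitude)` stands in for the pipeline's
    collision-and-safety-buffer check.
    """
    for _ in range(max_tries):
        if is_clear(radius, altitude):
            return radius, altitude
        radius += radius_step    # widen the orbit away from the obstacle
        altitude += alt_step     # and climb above low clutter
    return None                  # no valid adjustment: drop the mission
```

Because rejected missions are dropped rather than force-rendered, the released videos are filtered task assets, exactly as the paragraph above describes.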

Once rendering succeeds, Stage 3 exports both mission-level supervision bundles and the benchmark-facing datasets used for Flight Behavior Recognition and Temporal Localization, and for Landmark Visibility Counting and Interval Reasoning. Archived 4K frame captures and released 1080P H.264 MP4s are separated on purpose, which lets the benchmark keep high-resolution supervision upstream while distributing efficient evaluation videos downstream.

In other words, Stage 3 is where motion programs become auditable temporal tasks. The stage has to satisfy two goals at once: trajectories must remain executable under collision and safety constraints, and the final media must preserve enough temporal structure that interval localization remains meaningful for large multimodal models. That is why mission repair, behavior templating, and supervision export are treated as one connected pipeline rather than three independent utilities.

The media contract used at this stage is therefore worth exposing directly. Internally, missions still retain higher-resolution archived frame captures for supervision and refresh, while the released benchmark distributes a lighter 1080P H.264 video stream at a fixed frame rate. This separation keeps temporal evaluation efficient for public use while preserving enough upstream visual detail to support rerendering, inspection, and higher-resolution downstream analysis when needed.

The Stage 3 internal web is used here as an interactive mission and task workbench. It exposes pages for the behavior library, mission generation, candidate review, manifest browsing, experiments, result inspection, and metrics. This makes it well suited for qualitative verification and targeted reruns, whereas task_pipeline.py remains the recommended interface for batch selection, data export, render, and released-scale experiment phases.

DCI 4K
Capture Frames

4096×3072 archived frames are retained for upstream supervision and rerender flexibility

1080P
Released Video

1440×1080 · 10 Mbps · H.264 / MP4 · YUV420p

10 FPS
Frame Rate

keeps rapid UAV motion changes visible while remaining practical for MLLM temporal reasoning

Stage 3 Media Specs. Mission rendering retains 4K-grade archived frame captures upstream, while the released benchmark exposes 1080P H.264 videos at a fixed 10 FPS evaluation contract.
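Under the fixed 10 FPS contract, frame indices and timestamps are interconvertible, which is what keeps interval targets well defined. A minimal sketch of that mapping (helper names and the rounding choice are ours, not the release API):

```python
FPS = 10  # fixed Stage 3 evaluation frame rate

def frame_to_time(idx, fps=FPS):
    """Timestamp in seconds of a given frame index."""
    return idx / fps

def interval_to_frames(start_s, end_s, fps=FPS):
    """Inclusive frame-index range covering a temporal interval.

    Rounding guards against float artifacts like 1.4 * 10 != 14 exactly.
    """
    return range(int(round(start_s * fps)), int(round(end_s * fps)) + 1)
```

For example, a 1.0 s to 1.4 s interval maps to frames 10 through 14, so interval-localization answers can be scored in either unit without ambiguity.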

Mission Expansion and Repair

Composite and atomic behavior templates expand into executable trajectories, then pass through repair routines that can adjust radius, altitude, or scan width before render so that accepted missions remain physically valid and visually clean.

Format Constraints and Handoffs

Mission folders preserve a mission supervision bundle (task_data.json, frames_manifest.json, and frame_index_map.json) alongside the final video, while released datasets keep semantic answers and interval targets as separate fields so temporal quality can be scored independently.
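The three bundle file names come from the release layout; a hedged loader sketch (the field contents accessed downstream are an assumption, so this only reads the files):

```python
import json
from pathlib import Path

# File names taken from the mission supervision bundle described above.
BUNDLE_FILES = ("task_data.json", "frames_manifest.json", "frame_index_map.json")

def load_mission(mission_dir):
    """Load one mission folder's supervision bundle into a dict.

    Missing files map to None so partially exported folders stay inspectable.
    """
    d = Path(mission_dir)
    bundle = {}
    for name in BUNDLE_FILES:
        path = d / name
        bundle[name] = json.loads(path.read_text()) if path.exists() else None
    return bundle
```

Keeping semantic answers and interval targets in separate fields, as the release does, means a loader like this can score temporal quality independently of answer correctness.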

Outputs

The public handoff includes per-scene dataset manifests and mission folders; the scene-level dataset manifest is the benchmark-facing interface, while each mission-level task_data.json keeps the denser temporal supervision contract.

Reproduce

python scripts/flightmvstg/task_pipeline.py --spec configs/flightmvstg/task_pipeline/task_pipeline_uav_dualcog_v1.yaml --stage stage3 --phase render
Composite Classes 5
Atomic Classes 10
Set Name · Scope · Elements
Circular Inspection · Single Landmark · Gradual Approach, Circular Orbit, Circular Orbit, Gradual Depart
Spiral Inspection · Single Landmark · Gradual Approach, Spiral Orbit, Spiral Orbit, Gradual Depart
Square Inspection · Single Landmark · Gradual Approach, Square Orbit, Square Orbit, Gradual Depart
Triangular Inspection · Single Landmark · Gradual Approach, Triangular Orbit, Triangular Orbit, Gradual Depart
Surface-Mapping Inspection · Single Landmark · Gradual Approach, Surface Mapping, Gradual Depart
Atomic Circular Orbit · Single Landmark · Circular Orbit
Atomic Comet Trajectory · Single Landmark · Comet Trajectory
Atomic Figure-Eight Orbit · Single Landmark · Figure-Eight Orbit
Atomic Gradual Approach · Single Landmark · Gradual Approach
Atomic Gradual Depart · Single Landmark · Gradual Depart
Atomic Sky Rise · Single Landmark · Sky Rise
Atomic Spiral Orbit · Single Landmark · Spiral Orbit
Atomic Square Orbit · Single Landmark · Square Orbit
Atomic Surface Mapping · Single Landmark · Surface Mapping
Atomic Triangular Orbit · Single Landmark · Triangular Orbit

Circular Inspection

4 steps

Approach, perform two circular orbit segments, and depart.

Generation notes This set uses the released Stage 3 defaults and only applies explicit step-level overrides when the template specifies them.
Allow interleave repeat No
Max total atomic Library default
Step sequence Gradual Approach → Circular Orbit → Circular Orbit → Gradual Depart
Step 1
Gradual Approach

gradual_approach · Inspection

Track the landmark while moving forward, descending gradually, and approaching from a forward oblique direction.

Parameter · Default · Range · Step · Choices · Source
Travel Distance (m) · 40 · 30 to 120 · 10 · - · default
Descent (m) · 15 · 5 to 40 · 5 · - · default
Yaw Offset (deg) · 0 · -35 to 35 · 5 · - · default
Speed (m/s) · 20 · 15 to 25 · 1 · - · default
Gaze Pitch (deg) · -12 · -45 to 0 · 3 · - · default
Camera Mode · landmark_track · - · - · look_forward · override
Step 2
Circular Orbit

circular_orbit · Orbit

Orbit the landmark with a radius extension around the target center.

Parameter · Default · Range · Step · Choices · Source
Extension (m) · 12 · 4 to 36 · 2 · - · default
Arc (deg) · 180 · 45 to 720 · 90 · - · default
Direction · cw · - · - · cw, ccw · default
Altitude Offset (m) · 8 · -20 to 40 · 2 · - · default
Speed (m/s) · 20 · 15 to 25 · 1 · - · default
Camera Mode · look_forward · - · - · look_forward · override
Gaze Pitch (deg) · 0 · 0 to 0 · 1 · - · default
Step 3
Circular Orbit

circular_orbit · Orbit

Orbit the landmark with a radius extension around the target center.

Parameter · Default · Range · Step · Choices · Source
Extension (m) · 12 · 4 to 36 · 2 · - · default
Arc (deg) · 180 · 45 to 720 · 90 · - · default
Direction · cw · - · - · cw, ccw · default
Altitude Offset (m) · 8 · -20 to 40 · 2 · - · default
Speed (m/s) · 20 · 15 to 25 · 1 · - · default
Camera Mode · landmark_track · - · - · look_forward · override
Gaze Pitch (deg) · 0 · 0 to 0 · 1 · - · default
Step 4
Gradual Depart

gradual_depart · Inspection

Track the landmark while moving backward, rising gradually, and departing toward a rear oblique direction.

Parameter · Default · Range · Step · Choices · Source
Travel Distance (m) · 40 · 30 to 120 · 10 · - · default
Rise (m) · 15 · 5 to 40 · 5 · - · default
Yaw Offset (deg) · 0 · -35 to 35 · 5 · - · default
Speed (m/s) · 20 · 15 to 25 · 1 · - · default
Gaze Pitch (deg) · -10 · -45 to 0 · 3 · - · default
Camera Mode · look_forward · - · - · landmark_track · override

Behavior Library. The released Stage 3 defaults expose the composite set templates, atomic building blocks, and parameter defaults that drive mission generation.
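Structurally, a composite class is just a named chain of atomic maneuvers. A minimal sketch of that two-layer representation, with the step values copied from the Circular Inspection template above (the dataclass itself is illustrative, not the release schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompositeTemplate:
    """A composite behavior: an ordered chain of atomic maneuver names."""
    name: str
    scope: str
    steps: tuple

# Values copied from the released Circular Inspection template.
CIRCULAR_INSPECTION = CompositeTemplate(
    name="Circular Inspection",
    scope="Single Landmark",
    steps=("Gradual Approach", "Circular Orbit",
           "Circular Orbit", "Gradual Depart"),
)
```

Keeping composites as plain step chains is what lets mission generation reuse the same ten atomic primitives across all five released composite classes.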

Stage 4

Image QA Generation, Render-Only Asset Refresh, and Evaluation

Sample structured image QA rows and keep assets refreshable without changing task semantics.

Stage 4 reuses the reviewed Stage 2 landmark pool, but its target product is structured image QA rather than temporal supervision. In the current single-landmark core release, every valid landmark contributes four task families under two difficulty settings, which yields a fixed 512 × 2 × 4 = 4096-sample lattice before experiments begin.
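The fixed lattice can be enumerated directly. In the sketch below the task-family identifiers are shortened from the four family names used in this section, and the difficulty labels are placeholders for the release's two settings:

```python
from itertools import product

# Shortened identifiers for the four released image task families.
TASK_FAMILIES = ("landmark_relative_position", "future_observation_prediction",
                 "self_relative_position", "landmark_driven_action")
DIFFICULTIES = ("difficulty_a", "difficulty_b")  # placeholder names

def build_lattice(num_landmarks=512):
    """Enumerate the fixed 512 x 2 x 4 sample lattice as (landmark, difficulty, family)."""
    return list(product(range(num_landmarks), DIFFICULTIES, TASK_FAMILIES))
```

Because the lattice is fixed before experiments begin, sample counts per family and per difficulty are balanced by construction rather than by post hoc resampling.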

The image-task library is designed to make dual cognition visible at the level of question formulation rather than only at the level of evaluation metrics. The self-aware branch asks where the UAV is and what it will see after a described motion, while the environment-aware branch asks where the target lies and what action is appropriate under the current landmark-relative situation. This split ensures that image tasks are not just generic multiple-choice perception items: they are built to separate self-state reasoning from target-oriented situational reasoning under the same landmark-centered world model.

The image branch is not assembled by randomly pairing screenshots. Landmark-Relative Position Reasoning and Future Observation Prediction share landmark-centric reference families but ask different questions about current position and motion-induced future observation. Self-Relative Position Reasoning and Landmark-Driven Action Decision share the same egocentric observation so that target direction reasoning and action decision are grounded in identical evidence. For the environment-aware branch, the pipeline repeatedly resamples observation poses and camera yaw until the landmark stays fully visible and its final bbox is valid.
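The resampling step for the environment-aware branch is a rejection loop: propose a pose and yaw, keep it only if the landmark passes both visibility and bbox checks. An illustrative sketch with stand-in predicates (the release's actual checks run against rendered evidence):

```python
import random

def sample_observation(propose, fully_visible, bbox_valid,
                       max_tries=100, seed=None):
    """Rejection-sample an observation (pose, yaw) that passes both checks.

    `propose`, `fully_visible`, and `bbox_valid` are caller-supplied
    stand-ins for the pipeline's pose sampler and validity predicates.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        pose, yaw = propose(rng)
        if fully_visible(pose, yaw) and bbox_valid(pose, yaw):
            return pose, yaw
    return None  # give up after the retry budget; the sample is skipped
```

Bounding the retry budget keeps generation deterministic in cost: a landmark whose surroundings never yield a fully visible view is skipped rather than forced into the benchmark.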

Stage 4 therefore exports both semantic QA rows and rendering sidecars. The manifests keep the benchmark interface stable, while rerender request files preserve the capture recipe that allows assets to be refreshed later without changing sample ids, options, or answers.

This stage is also where the image benchmark becomes deliberately paired and evidence aware. Reference images, query observations, answer options, and normalized bbox targets are emitted together so that semantic selection and spatial grounding can be evaluated on the same sample instead of in disconnected protocols.

The release-side media specification stays explicit here as well. Stage 4 keeps image assets at the same DCI 4K capture standard used upstream, preserves a uniform 4:3 aspect ratio across the branch, and applies a bounded JPEG compression policy so that public QA assets remain compact without discarding the visual detail needed for fine landmark grounding.

The Stage 4 internal web plays a similar role on the image side: it is the place to inspect generated manifests, preview task rows, launch comparison runs, browse per-sample outputs, and read metric summaries. For small interactive checks, this web surface is the fastest way to verify that reference images, query observations, option layouts, and bbox targets are aligned; for released-scale generation and experiment batches, the recommended path is still task_pipeline.py under Stage 4 selection/data/render/experiment phases.

DCI 4K
Image Assets

4096×3072 landmark-centric renders preserve high-frequency spatial details for grounding

4:3
Aspect Ratio

kept consistent with mainstream UAV capture settings and the benchmark-wide media contract

JPEG 80
Compression

balances public-release file size and landmark-level visual fidelity in the exported QA assets

Stage 4 Media Specs. Released image tasks preserve DCI 4K landmark renders under a uniform 4:3 aspect ratio and a JPEG 80 export policy so that semantic QA and spatial grounding share the same high-resolution evidence contract.

Sample Construction Logic

The four image task families are generated from shared landmark-centered evidence pools: paired tasks deliberately reuse reference or query views so that self-aware and environment-aware reasoning remain directly comparable.

Format Constraints and Outputs

Released manifests package prompt text, answer options, target images, and normalized bbox targets in one stable contract, while render_requests keep assets refreshable without changing task semantics.
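The refresh contract described above is easy to state as a check: semantic fields must survive a rerender byte-for-byte while media paths may change. A hedged sketch (the field names are assumptions about the manifest layout, not the released schema):

```python
# Fields that a render-only refresh must never change (names assumed).
SEMANTIC_FIELDS = ("sample_id", "options", "answer")

def refresh_preserves_semantics(old_row, new_row, fields=SEMANTIC_FIELDS):
    """True iff the refreshed manifest row keeps every semantic field identical."""
    return all(old_row.get(f) == new_row.get(f) for f in fields)
```

Running such a comparison after every asset refresh is one way to verify that rerendering changed only media, never task semantics.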

Outputs

The benchmark-facing image interface is the per-scene QA manifest, while render_requests preserve the rerender recipe that allows released media to be refreshed later without changing ids, options, or answers.

Reproduce

python scripts/flightmvstg/task_pipeline.py --spec configs/flightmvstg/task_pipeline/task_pipeline_uav_dualcog_v1.yaml --stage stage4 --phase render

Reproducibility

The released benchmark remains reconstructable because handoff artifacts are explicit.

The pipeline exposes concrete handoff files between stages: scene point clouds, reviewed landmarks, mission folders, manifests, rerender requests, and metric exports. This keeps the released benchmark traceable as an inspectable file hierarchy rather than hidden runtime state, so benchmark media can be refreshed and evaluation can be rerun without guessing intermediate decisions.

The release is also practical to reuse: both the dataset assets and the public codebase are already published, so external users do not need to reconstruct the benchmark from scratch before they can inspect files, load manifests, or run evaluation. The Usage page collects the currently released download channels together with the official code repository and access notes.

Reviewer workflow. Read Benchmark for released scope and sample balance. Read Construction for stage-by-stage provenance and handoff files. Read Evaluation for exact prompt templates, JSON contracts, and scoring rules. Download leaderboard JSON / CSV files to verify tables or regenerate figures.

Dataset user workflow. Start from scene_data reviewed landmarks and task_pipeline manifests. Use Stage 4 manifests plus render_requests to reproduce image tasks without changing semantics. Use Stage 3 mission folders plus task_data.json to inspect video supervision and media specs. Use exported metric matrices when comparing model runs outside the website.

Maintainer workflow. Regenerate Stage 4 and Stage 3 media through task_pipeline render phases. Rerun experiment phases for the desired stage. Export latest metrics CSV files. Refresh site data and rebuild the static website.