Usage
Reproduction And Execution Guide
This page provides a complete, execution-oriented path for reviewers and users: release channels, full project structure, two reproduction modes, model invocation methods, commented config templates, and stage-by-stage commands consistent with the released code package.
Release Channels
Code and data are released through the official repository and dataset entries
The simulator environment package used for scene-side construction is distributed separately. Download AerialVLN simulator assets from Kaggle before running Mode A.
Project Structure
The full workspace contains code, simulator env files, configs, and benchmark artifacts
Reproduction Modes
UAV-DualCog supports both full construction and experiment-only workflows
Mode A reproduces Stage 1-4 construction and requires simulator environment files. Mode B
runs experiments directly on released artifacts and requires downloaded
scene_data/airsim_env_* plus
task_pipeline_data/UAV-DualCog-V1, but does not require simulator files.
Mode A: Data Construction
Build scene and task artifacts from stage scripts and pipeline phases
Mode B: Experiment Only
Evaluate models on released benchmark artifacts without simulator runtime
Config Templates
Four core configs control scene runtime, stage defaults, model routing, and pipeline scope
The release provides runnable sanitized versions and fully commented templates. Runnable
examples are listed below (env_7); the template files in
configs/uav_dualcog/templates/ are provided for custom runs.
Scene Config
Common Stage Config
API Runtime Config
Task Pipeline Spec
Model Routing
Experiment model suffixes control reasoning mode while routes stay on base model aliases
In --experiment-models, suffixes such as -Instant and
-Thinking are request-mode switches. Routing still resolves from
common_api_runtime.yaml using the suffix-stripped base model alias, while
family-specific controls are applied automatically.
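The suffix-stripping behavior described above can be sketched as follows. This is an illustrative sketch, not the released routing code; the suffix list and function name are assumptions based on the description.

```python
# Illustrative sketch (not the released implementation): separate a request-mode
# suffix such as "-Instant" or "-Thinking" from the base model alias that is
# actually looked up in common_api_runtime.yaml.
REQUEST_MODE_SUFFIXES = ("-Instant", "-Thinking")

def resolve_route(experiment_model):
    """Return (base_alias, request_mode) for an --experiment-models entry."""
    for suffix in REQUEST_MODE_SUFFIXES:
        if experiment_model.endswith(suffix):
            base = experiment_model[: -len(suffix)]
            return base, suffix.lstrip("-").lower()
    # No recognized suffix: route on the alias as-is, with no mode override.
    return experiment_model, None
```

For example, under this sketch an entry like `qwen-vl-Thinking` would route on the base alias `qwen-vl` while enabling the thinking request mode.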
vLLM Local Deployment
Use official vLLM installation guidance and serve OpenAI-compatible local endpoints
Follow the official vLLM quickstart and installation guide for environment setup.
After serving, point common_api_runtime.yaml to the local endpoint and ensure
the served model name matches your experiment aliasing convention.
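As an illustration of the wiring step above, a local endpoint entry might look like the fragment below. The key names and the example model are assumptions, not the released `common_api_runtime.yaml` schema; adapt them to the actual template in `configs/uav_dualcog/templates/`.

```yaml
# Hypothetical common_api_runtime.yaml fragment -- key names are illustrative.
# The alias should match your experiment aliasing convention, and model_name
# should match the name passed to `vllm serve`.
models:
  qwen2.5-vl-local:
    base_url: http://localhost:8000/v1
    api_key: EMPTY
    model_name: Qwen/Qwen2.5-VL-7B-Instruct
```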
Mode A Commands
Stage-by-stage construction workflow with internal web checkpoints
Stage 1 is typically preceded by scene-boundary probing
(probe_airsim_mapbound.py) so MapBound and surface anchors are stable.
Stage 2 Steps 2-4 are completed in the internal review web. Stage 3 and Stage 4 also expose
internal web workbenches, but batch generation and released-scale reruns are recommended
through task_pipeline.py.
Environment setup records and Stage 1-4 empirical run logs are included in the official code package under
logs/.
The command block follows the same sequence as the README: Step 0 environment setup, Step 1 scene fusion, Step 2 landmark construction/review/auto-label, Step 3 video tasks, and Step 4 image tasks.
Detailed page-by-page usage of the Stage 2/3/4 internal web workbenches is documented in the Internal Web section below.
If your server has no display device, install the required headless display packages before running AirSim.
Internal Web
Stage 2 uses the web for review and semantic finalization, while Stage 3 and Stage 4 use the web as interactive workbenches
Some functions exist in both command-line and web forms, but they are not equally suitable
for everyday use. In practice, Stage 2 Steps 2-4 should be completed in the internal
review web, because representative main-view confirmation, single-direction
anchoring, invalid-view
cleanup, auto-label auditing, and manual semantic repair all benefit from direct visual
inspection. For Stage 3 and Stage 4, the internal web is best used for
qualitative inspection, manifest browsing, prompt/debug checks, and experiment-result
browsing, while released-scale generation is still recommended through
task_pipeline.py.
Stage 2 Web
Landmark screening, representative main-view confirmation, single-direction anchoring, auto-labeling, and semantic review are completed in one review interface
Launch the Stage 2 web with
stage2_landmark_label.py --mode review_instances_web. The interface combines
a left-side landmark list, point-cloud evidence, RGB evidence, auto-label controls, and the
final reviewed semantic fields. The left list groups items by class_id and
class_name, so reviewers can process one semantic cluster at a time instead of
reviewing isolated candidates in random order.
The recommended workflow is: first perform landmark screening with the Keep and
Drop controls. Drop can be applied immediately to unstable or obviously unusable candidates.
Keep should only be applied after the reviewer has confirmed the
main RGB view and one correct landmark-centric direction anchor. The main
view does not need to be front; it is simply the most
representative benchmark-facing view. Once one direction anchor is confirmed, the
remaining directions in the fixed ring order front,
front_right, right, back_right, back,
back_left, left, front_left are derived
automatically. After that, mark strongly occluded or visually unusable views as invalid,
then run auto-label and review the generated semantics.
Evidence Panels
The point-cloud panel is used to confirm that the candidate is geometrically coherent, while the RGB panel is used to verify representative appearance. The web preserves the eight-view layout even when some directions are invalid, which makes direction auditing stable across landmarks.
Direction And Main View
The RGB panel is used to choose the most representative main view and to confirm one
direction anchor for that landmark. The main view is not necessarily the
front image. After the reviewer selects the main view and confirms one
correct direction, the other seven visible-side directions are computed automatically in
the fixed ring order. If a kept landmark has no valid main view, it should not be
approved.
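The automatic derivation of the remaining seven directions can be sketched as below. This is an assumed reconstruction of the fixed-ring logic for illustration, not the released Stage 2 code; the function name and heading convention are hypothetical.

```python
# Sketch of the fixed-ring derivation: given one confirmed direction anchor,
# the other seven directions follow at 45-degree steps in the fixed ring order.
RING = ["front", "front_right", "right", "back_right",
        "back", "back_left", "left", "front_left"]

def derive_directions(anchor_label, anchor_heading_deg):
    """Map every ring direction to a heading, from one confirmed anchor."""
    start = RING.index(anchor_label)
    return {
        RING[(start + i) % len(RING)]: (anchor_heading_deg + 45.0 * i) % 360.0
        for i in range(len(RING))
    }
```

Under this sketch, confirming `right` at heading 90 pins `back_right` to 135, and so on around the ring.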
Auto-Label Controls
Auto-label can be launched for the current landmark, the current class, or the full
retained pool. The generated fields are shown as
auto_label_category, auto_label_subcategory,
auto_label_description, and confidence. This step is best done in the web
because reviewers can immediately compare the proposed semantics against the multiview
evidence.
Manual Review Fields
The final published fields are landmark_category,
landmark_subcategory, and landmark_description. Use
Approve Auto Label when the proposal is already correct; otherwise revise
the final fields manually and save the human correction. In practice, this page
completes Stage 2 Steps 2-4 in one continuous review loop.
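The field mapping performed when a proposal is approved can be sketched as follows. The field names come from the page description above; the record shape and function name are assumptions for illustration, not the released code.

```python
# Hypothetical sketch of "Approve Auto Label": the auto_label_* proposals are
# copied into the final published landmark_* fields, keeping the rest of the
# record (e.g. confidence) intact.
def approve_auto_label(record):
    """Promote auto-label proposals to the final reviewed semantic fields."""
    return {
        **record,
        "landmark_category": record["auto_label_category"],
        "landmark_subcategory": record["auto_label_subcategory"],
        "landmark_description": record["auto_label_description"],
    }
```

A manual correction would instead write the three `landmark_*` fields directly, preserving the `auto_label_*` proposals for auditing.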
Stage 3 Web
The Stage 3 workbench supports behavior inspection, mission generation, candidate review, dataset browsing, experiments, and metrics
Launch the Stage 3 web with stage3_generate_traj.py --mode web. The workbench
exposes the pages Behavior Library, Missions,
Review, Generate, Dataset,
Experiments, Results, and Metrics. It is
intended as an interactive inspection and debugging surface rather than the primary route
for large released-scale generation.
Before judging whether a page has content, first switch the top-right scene, task, mission, or manifest selector. Several Stage 3 pages only populate after an active selection is made.
Behavior Library
Use this page to inspect the composite classes, atomic classes, parameter ranges, defaults, and composition rules before generating trajectories. It is the best place to confirm that a released behavior family matches the intended inspection or mapping pattern.
Missions And Review
The Missions page is used to select landmarks, configure composite or atomic mission generation, and produce panorama, preview, or final task videos. The Review page is then used to approve or reject candidates before they are converted into benchmark-facing task rows.
Generate And Dataset
The Generate page converts approved candidates into Stage 3 manifests
and lets you choose self-state and environmental task forms, sample count, seed, and
temporal-localization inclusion. It is useful for spot checks and small controlled
reruns, but released-scale Stage 3 generation is still recommended through
task_pipeline.py --stage stage3 --phase data/render. The
Dataset page is used to preview manifest rows, reference images,
overview images, keyframe boards, and interval targets.
Experiments, Results, And Metrics
The experiment pages support model selection, upload-size control, concurrency and
rate-limit settings, run tracking, per-sample report browsing, grouped bars, full metric
tables, and CSV export. These pages are ideal for targeted reruns and qualitative
diagnosis, but for released-scale experiment sweeps we still recommend
task_pipeline.py --stage stage3 --phase experiment.
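The concurrency control exposed by these pages follows a standard bounded-worker pattern, sketched below. This is a generic illustration, not the released experiment client; the function names are hypothetical.

```python
# Generic sketch of a concurrency cap: a semaphore limits in-flight requests,
# mirroring the concurrency setting on the experiment pages.
import asyncio

async def run_with_limit(jobs, worker, max_concurrency=4):
    """Run worker(job) for every job, with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(job):
        async with sem:
            return await worker(job)

    # gather preserves input order, so results align with the job list.
    return await asyncio.gather(*(bounded(j) for j in jobs))
```

Rate limiting would typically be layered on top of this, e.g. by having each worker sleep between retries on a 429 response.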
Stage 4 Web
The Stage 4 workbench supports image-task generation, manifest preview, model runs, and metric inspection
Launch the Stage 4 web with stage4_qa_generate_and_eval.py --mode web. The
workbench exposes five main pages: Generate, Dataset,
Experiments, Results, and Metrics. It is
especially useful when we want to verify that reference images, query observations, answer
options, and normalized bounding boxes are aligned before launching a larger batch run.
Before judging whether a page is empty, first switch the top-right scene, task type, manifest, or report selector. Several Stage 4 pages only render detailed content after a selection is made.
Generate
The Generate page controls task strategy, view definitions, task types, category
filters, difficulty, sample counts, and per-landmark sampling density. The estimator is
useful for checking the expected scale of the current sampling plan before writing a new
manifest. It is best used for interactive validation and spot checks; released-scale
Stage 4 generation is still recommended through
task_pipeline.py --stage stage4 --phase data/render.
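The scale check performed by the estimator can be approximated as below. The formula is an assumption for illustration (landmarks x per-landmark density x enabled task types, with an optional cap), not the released estimator.

```python
# Hypothetical sampling-plan estimator: expected manifest size before a new
# Stage 4 manifest is written. All names and the formula are illustrative.
def estimate_samples(n_landmarks, per_landmark, n_task_types, cap=None):
    """Estimate total samples for the current sampling plan."""
    total = n_landmarks * per_landmark * n_task_types
    return min(total, cap) if cap is not None else total
```

A quick check like this catches plans that would produce far more samples than intended before any rendering starts.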
Dataset
The Dataset page loads a manifest summary and sample preview. Use it to verify that the paired image layout, option ordering, answer target, and bbox overlays are all visually consistent. For Stage 4, this page is usually the fastest way to catch sampling or render issues before experiments begin.
Experiments And Results
The Experiments page is used to choose a manifest, select one or more models, set upload
resolution/quality, concurrency, and limits, then launch jobs. The Results page shows
run-level summaries and per-sample outputs, which makes it suitable for prompt debugging
and qualitative failure analysis. For released-scale benchmark runs and comparative
sweeps, we still recommend task_pipeline.py --stage stage4 --phase experiment.
Metrics
The Metrics page summarizes option accuracy, BBox Acc@50IoU, mean IoU, grouped
comparisons, the full experiment matrix, and progress tables. Use this page to inspect
small-to-medium comparison runs, while released-scale Stage 4 generation and experiments
should still be driven by task_pipeline.py --stage stage4.
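The two box metrics named above can be sketched as follows, assuming axis-aligned `[x1, y1, x2, y2]` boxes (normalized or pixel coordinates both work); the released metric code may differ in box convention.

```python
# Sketch of the Stage 4 box metrics: pairwise IoU and BBox Acc@50IoU
# (fraction of predictions whose IoU with ground truth reaches 0.5).
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def bbox_acc_at_50(preds, gts):
    """BBox Acc@50IoU over aligned prediction/ground-truth lists."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(preds, gts))
    return hits / len(gts) if gts else 0.0
```

Mean IoU is then simply the average of `iou(p, g)` over the same pairs.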
Mode B Commands
Experiment workflow on downloaded benchmark artifacts
For model invocation, this release supports both API routing and local deployment. Configure
common_api_runtime.yaml first, then run stage-specific experiment phases.
If you only want to validate wiring before real API/model calls, run the smoke-test commands below first.
Smoke Tests