Usage
Reproduction And Execution Guide
This page provides a complete, execution-oriented path for reviewers and users: release channels, full project structure, two reproduction modes, model invocation methods, commented config templates, and stage-by-stage commands consistent with the released code package.
Release Channels
Code and data are released through the official repository and dataset entries
The simulator environment package used for scene-side construction is distributed separately. Download AerialVLN simulator assets from Kaggle before running Mode A.
Project Structure
The full workspace contains code, simulator env files, configs, and benchmark artifacts
Reproduction Modes
UAV-DualCog supports both full construction and experiment-only workflows
Mode A reproduces Stage 1-4 construction and requires simulator environment files. Mode B
runs experiments directly on released artifacts and requires downloaded
scene_data/airsim_env_* plus
task_pipeline_data/UAV-DualCog-V1, but does not require simulator files.
Mode A: Data Construction
Build scene and task artifacts from stage scripts and pipeline phases
Mode B: Experiment Only
Evaluate models on released benchmark artifacts without simulator runtime
Config Templates
Four core configs control scene runtime, stage defaults, model routing, and pipeline scope
The release provides runnable sanitized versions and fully commented templates. Runnable
examples are listed below (env_7); the template files in
configs/uav_dualcog/templates/ are provided for custom runs.
Scene Config
Common Stage Config
API Runtime Config
Task Pipeline Spec
Model Routing
Experiment model suffixes control reasoning mode while routes stay on base model aliases
In --experiment-models, suffixes such as -Instant and
-Thinking are request-mode switches. Routing still resolves from
common_api_runtime.yaml using the suffix-stripped base model alias, while
family-specific controls are applied automatically.
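The suffix-stripping behavior described above can be sketched as follows. This is an illustrative sketch, not the released routing code; the suffix list and function name are assumptions based on the description.

```python
# Illustrative sketch (not the released implementation): separate a request-mode
# suffix such as "-Instant" or "-Thinking" from the base model alias that is
# actually looked up in common_api_runtime.yaml.
REQUEST_MODE_SUFFIXES = ("-Instant", "-Thinking")

def resolve_route(experiment_model):
    """Return (base_alias, request_mode) for an --experiment-models entry."""
    for suffix in REQUEST_MODE_SUFFIXES:
        if experiment_model.endswith(suffix):
            base = experiment_model[: -len(suffix)]
            return base, suffix.lstrip("-").lower()
    # No recognized suffix: route on the alias as-is, with no mode override.
    return experiment_model, None
```

For example, under this sketch an entry like `qwen-vl-Thinking` would route on the base alias `qwen-vl` while enabling the thinking request mode.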
vLLM Local Deployment
Use official vLLM installation guidance and serve OpenAI-compatible local endpoints
Follow the official vLLM quickstart and installation guide for environment setup.
After serving, point common_api_runtime.yaml to the local endpoint and ensure
the served model name matches your experiment aliasing convention.
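As an illustration of the wiring step above, a local endpoint entry might look like the fragment below. The key names and the example model are assumptions, not the released `common_api_runtime.yaml` schema; adapt them to the actual template in `configs/uav_dualcog/templates/`.

```yaml
# Hypothetical common_api_runtime.yaml fragment -- key names are illustrative.
# The alias should match your experiment aliasing convention, and model_name
# should match the name passed to `vllm serve`.
models:
  qwen2.5-vl-local:
    base_url: http://localhost:8000/v1
    api_key: EMPTY
    model_name: Qwen/Qwen2.5-VL-7B-Instruct
```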
Mode A Commands
Stage-by-stage construction workflow with internal web checkpoints
Stage 1 is typically preceded by scene-boundary probing
(probe_airsim_mapbound.py) so MapBound and surface anchors are stable.
Stage 2 Steps 2-4 are completed in the internal review web. Stage 3 and Stage 4 also expose
internal web workbenches, but batch generation and released-scale reruns are recommended
through task_pipeline.py.
Environment setup records and Stage 1-4 empirical run logs are included in the official code package under
logs/.
The command block follows the same sequence as the README: Step 0 environment setup, Step 1 scene fusion, Step 2 landmark construction/review/auto-label, Step 3 video tasks, and Step 4 image tasks.
Detailed page-by-page usage of the Stage 2/3/4 internal web workbenches is documented in the Internal Web section below.
If your server has no display device, install the required headless display packages before running AirSim.
Internal Web
Stage 2 uses the web for review and semantic finalization, while Stage 3 and Stage 4 use the web as interactive workbenches
Some functions exist in both command-line and web forms, but they are not equally suitable
for everyday use. In practice, Stage 2 Steps 2-4 should be completed in the internal
review web, because representative main-view confirmation, single-direction
anchoring, invalid-view
cleanup, auto-label auditing, and manual semantic repair all benefit from direct visual
inspection. For Stage 3 and Stage 4, the internal web is best used for
qualitative inspection, manifest browsing, prompt/debug checks, and experiment-result
browsing, while released-scale generation is still recommended through
task_pipeline.py.
Stage 2 Web
Landmark screening, representative main-view confirmation, single-direction anchoring, auto-labeling, and semantic review are completed in one review interface
Launch the Stage 2 web with
stage2_landmark_label.py --mode review_instances_web. The interface combines
a left-side landmark list, point-cloud evidence, RGB evidence, auto-label controls, and the
final reviewed semantic fields. The left list groups items by class_id and
class_name, so reviewers can process one semantic cluster at a time instead of
reviewing isolated candidates in random order.
The recommended workflow is: first perform landmark screening with the Keep and
Drop controls. Drop can be applied immediately to unstable or obviously unusable candidates.
Keep should only be applied after the reviewer has confirmed the
main RGB view and one correct landmark-centric direction anchor. The main
view does not need to be front; it is simply the most
representative benchmark-facing view. Once one direction anchor is confirmed, the
remaining directions in the fixed ring order front,
front_right, right, back_right, back,
back_left, left, front_left are derived
automatically. After that, mark strongly occluded or visually unusable views as invalid,
then run auto-label and review the generated semantics.
Evidence Panels
The point-cloud panel is used to confirm that the candidate is geometrically coherent, while the RGB panel is used to verify representative appearance. The web preserves the eight-view layout even when some directions are invalid, which makes direction auditing stable across landmarks.
Direction And Main View
The RGB panel is used to choose the most representative main view and to confirm one
direction anchor for that landmark. The main view is not necessarily the
front image. After the reviewer selects the main view and confirms one
correct direction, the other seven visible-side directions are computed automatically in
the fixed ring order. If a kept landmark has no valid main view, it should not be
approved.
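The automatic derivation of the remaining seven directions can be sketched as below. This is an assumed reconstruction of the fixed-ring logic for illustration, not the released Stage 2 code; the function name and heading convention are hypothetical.

```python
# Sketch of the fixed-ring derivation: given one confirmed direction anchor,
# the other seven directions follow at 45-degree steps in the fixed ring order.
RING = ["front", "front_right", "right", "back_right",
        "back", "back_left", "left", "front_left"]

def derive_directions(anchor_label, anchor_heading_deg):
    """Map every ring direction to a heading, from one confirmed anchor."""
    start = RING.index(anchor_label)
    return {
        RING[(start + i) % len(RING)]: (anchor_heading_deg + 45.0 * i) % 360.0
        for i in range(len(RING))
    }
```

Under this sketch, confirming `right` at heading 90 pins `back_right` to 135, and so on around the ring.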
Auto-Label Controls
Auto-label can be launched for the current landmark, the current class, or the full
retained pool. The generated fields are shown as
auto_label_category, auto_label_subcategory,
auto_label_description, and confidence. This step is best done in the web
because reviewers can immediately compare the proposed semantics against the multiview
evidence.
Manual Review Fields
The final published fields are landmark_category,
landmark_subcategory, and landmark_description. Use
Approve Auto Label when the proposal is already correct; otherwise revise
the final fields manually and save the human correction. In practice, this page
completes Stage 2 Steps 2-4 in one continuous review loop.
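The field mapping performed when a proposal is approved can be sketched as follows. The field names come from the page description above; the record shape and function name are assumptions for illustration, not the released code.

```python
# Hypothetical sketch of "Approve Auto Label": the auto_label_* proposals are
# copied into the final published landmark_* fields, keeping the rest of the
# record (e.g. confidence) intact.
def approve_auto_label(record):
    """Promote auto-label proposals to the final reviewed semantic fields."""
    return {
        **record,
        "landmark_category": record["auto_label_category"],
        "landmark_subcategory": record["auto_label_subcategory"],
        "landmark_description": record["auto_label_description"],
    }
```

A manual correction would instead write the three `landmark_*` fields directly, preserving the `auto_label_*` proposals for auditing.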
Stage 3 Web
The Stage 3 workbench supports behavior inspection, mission generation, candidate review, dataset browsing, experiments, and metrics
Launch the Stage 3 web with stage3_generate_traj.py --mode web. The workbench
exposes the pages Behavior Library, Missions,
Review, Generate, Dataset,
Experiments, Results, and Metrics. It is
intended as an interactive inspection and debugging surface rather than the primary route
for large released-scale generation.
Before judging whether a page has content, first switch the top-right scene, task, mission, or manifest selector. Several Stage 3 pages only populate after an active selection is made.
Behavior Library
Use this page to inspect the composite classes, atomic classes, parameter ranges, defaults, and composition rules before generating trajectories. It is the best place to confirm that a released behavior family matches the intended inspection or mapping pattern.
Missions And Review
The Missions page is used to select landmarks, configure composite or atomic mission generation, and produce panorama, preview, or final task videos. The Review page is then used to approve or reject candidates before they are converted into benchmark-facing task rows.
Generate And Dataset
The Generate page converts approved candidates into Stage 3 manifests
and lets you choose self-state and environmental task forms, sample count, seed, and
temporal-localization inclusion. It is useful for spot checks and small controlled
reruns, but released-scale Stage 3 generation is still recommended through
task_pipeline.py --stage stage3 --phase data/render. The
Dataset page is used to preview manifest rows, reference images,
overview images, keyframe boards, and interval targets.
Experiments, Results, And Metrics
The experiment pages support model selection, upload-size control, concurrency and
rate-limit settings, run tracking, per-sample report browsing, grouped bars, full metric
tables, and CSV export. These pages are ideal for targeted reruns and qualitative
diagnosis, but for released-scale experiment sweeps we still recommend
task_pipeline.py --stage stage3 --phase experiment.
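The concurrency control exposed by these pages follows a standard bounded-worker pattern, sketched below. This is a generic illustration, not the released experiment client; the function names are hypothetical.

```python
# Generic sketch of a concurrency cap: a semaphore limits in-flight requests,
# mirroring the concurrency setting on the experiment pages.
import asyncio

async def run_with_limit(jobs, worker, max_concurrency=4):
    """Run worker(job) for every job, with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(job):
        async with sem:
            return await worker(job)

    # gather preserves input order, so results align with the job list.
    return await asyncio.gather(*(bounded(j) for j in jobs))
```

Rate limiting would typically be layered on top of this, e.g. by having each worker sleep between retries on a 429 response.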
Stage 4 Web
The Stage 4 workbench supports image-task generation, manifest preview, model runs, and metric inspection
Launch the Stage 4 web with stage4_qa_generate_and_eval.py --mode web. The
workbench exposes five main pages: Generate, Dataset,
Experiments, Results, and Metrics. It is
especially useful when we want to verify that reference images, query observations, answer
options, and normalized bounding boxes are aligned before launching a larger batch run.
Before judging whether a page is empty, first switch the top-right scene, task type, manifest, or report selector. Several Stage 4 pages only render detailed content after a selection is made.
Generate
The Generate page controls task strategy, view definitions, task types, category
filters, difficulty, sample counts, and per-landmark sampling density. The estimator is
useful for checking the expected scale of the current sampling plan before writing a new
manifest. It is best used for interactive validation and spot checks; released-scale
Stage 4 generation is still recommended through
task_pipeline.py --stage stage4 --phase data/render.
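The scale check performed by the estimator can be approximated as below. The formula is an assumption for illustration (landmarks x per-landmark density x enabled task types, with an optional cap), not the released estimator.

```python
# Hypothetical sampling-plan estimator: expected manifest size before a new
# Stage 4 manifest is written. All names and the formula are illustrative.
def estimate_samples(n_landmarks, per_landmark, n_task_types, cap=None):
    """Estimate total samples for the current sampling plan."""
    total = n_landmarks * per_landmark * n_task_types
    return min(total, cap) if cap is not None else total
```

A quick check like this catches plans that would produce far more samples than intended before any rendering starts.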
Dataset
The Dataset page loads a manifest summary and sample preview. Use it to verify that the paired image layout, option ordering, answer target, and bbox overlays are all visually consistent. For Stage 4, this page is usually the fastest way to catch sampling or render issues before experiments begin.
Experiments And Results
The Experiments page is used to choose a manifest, select one or more models, set upload
resolution/quality, concurrency, and limits, then launch jobs. The Results page shows
run-level summaries and per-sample outputs, which makes it suitable for prompt debugging
and qualitative failure analysis. For released-scale benchmark runs and comparative
sweeps, we still recommend task_pipeline.py --stage stage4 --phase experiment.
Metrics
The Metrics page summarizes option accuracy, BBox Acc@50IoU, mean IoU, grouped
comparisons, the full experiment matrix, and progress tables. Use this page to inspect
small-to-medium comparison runs, while released-scale Stage 4 generation and experiments
should still be driven by task_pipeline.py --stage stage4.
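The two box metrics named above can be sketched as follows, assuming axis-aligned `[x1, y1, x2, y2]` boxes (normalized or pixel coordinates both work); the released metric code may differ in box convention.

```python
# Sketch of the Stage 4 box metrics: pairwise IoU and BBox Acc@50IoU
# (fraction of predictions whose IoU with ground truth reaches 0.5).
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def bbox_acc_at_50(preds, gts):
    """BBox Acc@50IoU over aligned prediction/ground-truth lists."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(preds, gts))
    return hits / len(gts) if gts else 0.0
```

Mean IoU is then simply the average of `iou(p, g)` over the same pairs.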
Mode B Commands
Experiment workflow on downloaded benchmark artifacts
For model invocation, this release supports both API routing and local deployment. Configure
common_api_runtime.yaml first, then run stage-specific experiment phases.
If you only want to validate wiring before real API/model calls, run the smoke-test commands below first.
Smoke Tests