Evaluation

Evaluation Protocol

The benchmark uses strict JSON outputs, explicit spatial grounding for image tasks, and explicit interval supervision for video tasks. The sections below show the exact contracts and prompt templates used by the current implementation.

Evaluation Scope

The protocol is shared across all exported leaderboard rows, but image-side and video-side outputs are intentionally different.

  • 31 Image Models: models with Stage 4 leaderboard coverage
  • 28 Video Models: models with Stage 3 leaderboard coverage
  • 4 Image Tasks: answer + bbox evaluation
  • 2 Video Tasks: semantic + interval evaluation

Protocol

Semantic correctness and evidence alignment are evaluated separately

UAV-DualCog does not treat answer choice as the full task. Image tasks must also ground the target landmark, and video tasks must also place events or visibility into the right temporal intervals. This protocol design is what allows the benchmark to reveal answer-versus-grounding and semantic-versus-temporal gaps.

Parsing & Validity Rules

All tasks are parsed under strict JSON rules before scoring

Rules

  • All tasks require strict JSON outputs. Free-form text, Markdown code fences, or extra explanations are treated as invalid formatting.
  • Image tasks are only considered fully correct when answer selection and landmark grounding agree.
  • Video tasks separate semantic success from temporal success so recognition without interval localization is not treated as complete success.
  • The website leaderboard uses convenience aggregate scores for browsing; the raw task-specific metrics remain the primary result source.

Field-to-Metric Mapping

Output Field            | Used By                | Drives Metrics
answer_option_id        | Image tasks            | Option Accuracy
bbox_xyxy_norm          | Image tasks            | BBox Acc@50IoU, BBox Mean IoU
answers[].option_id     | Self-Aware video tasks | Semantic Accuracy / F1
answers[].intervals_sec | Self-Aware video tasks | Temporal F1@0.5, Mean tIoU
visible_count           | Visibility video task  | Count Accuracy
visible_intervals_sec   | Visibility video task  | Segment F1@0.5, Mean tIoU

Image Task Protocols

Image tasks score answer selection together with landmark grounding

Even when the answer is semantically correct, localization quality remains part of the final judgment.

Self-Aware

Landmark-Relative Position Reasoning

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.

Metrics

  • Option Accuracy. Correct relative-position choice.
  • BBox Acc@50IoU. Predicted box overlaps the GT landmark with IoU >= 0.5.
  • BBox Mean IoU. Continuous localization quality on the query image.
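The bbox metrics above reduce to a standard IoU computation over normalized [x1, y1, x2, y2] boxes; the sketch below uses illustrative helper names (`bbox_iou`, `bbox_hit_at_50`) rather than the benchmark's own code, with the 0.5 threshold taken from the Acc@50IoU definition.

```python
def bbox_iou(a: list[float], b: list[float]) -> float:
    """IoU between two normalized [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def bbox_hit_at_50(pred: list[float], gt: list[float]) -> bool:
    """BBox Acc@50IoU: a prediction counts as correct at IoU >= 0.5."""
    return bbox_iou(pred, gt) >= 0.5
```

BBox Mean IoU is then the average of `bbox_iou` over all valid predictions, so it rewards partial overlap that the 0.5-threshold accuracy discards.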

Self-Aware

Future Observation Prediction

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.

Metrics

  • Option Accuracy. Correct future-view selection under the orbit action.
  • BBox Acc@50IoU. Localization quality in the selected candidate image.
  • BBox Mean IoU. Continuous grounding quality after view prediction.

Environment-Aware

Self-Relative Position Reasoning

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.

Metrics

  • Option Accuracy. Correct egocentric direction judgment.
  • BBox Acc@50IoU. Localization on the current observation image.
  • BBox Mean IoU. Continuous target grounding fidelity.

Environment-Aware

Landmark-Driven Action Decision

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.

Metrics

  • Option Accuracy. Correct action direction toward the landmark.
  • BBox Acc@50IoU. Whether the decision is grounded on the correct landmark.
  • BBox Mean IoU. Continuous localization quality while making the action decision.

Video Task Protocols

Video tasks separate semantic recognition from interval quality

This is why the benchmark can expose models that recognize the right behavior or count but still misplace the corresponding temporal intervals.

Composite Behavior

Self-Aware Video Evaluation

Output Schema

{"answers":[{"option_id":"A","intervals_sec":[[0.0,1.2]]},{"option_id":"C","intervals_sec":[[3.4,5.0]]}]}

Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.

Metrics

  • Semantic Accuracy / F1. Semantic recognition of the composite behaviors present in the clip.
  • Temporal F1@0.5. Interval-level success at the 0.5 threshold.
  • Mean tIoU. Continuous temporal localization quality.
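The temporal metrics can be sketched with an interval IoU plus a matching step; greedy one-to-one matching between predicted and ground-truth intervals is an assumption of this sketch, not a detail taken from the benchmark code.

```python
def t_iou(pred: list[float], gt: list[float]) -> float:
    """Temporal IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_f1(preds: list, gts: list, thr: float = 0.5) -> float:
    """Temporal F1@thr: greedily match each predicted interval to an
    unused ground-truth interval at tIoU >= thr, then combine
    precision and recall."""
    used, tp = set(), 0
    for p in preds:
        best, best_iou = None, thr
        for j, g in enumerate(gts):
            if j not in used and t_iou(p, g) >= best_iou:
                best, best_iou = j, t_iou(p, g)
        if best is not None:
            used.add(best)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Mean tIoU averages `t_iou` over matched pairs, so a model can score well on Semantic Accuracy yet poorly here when its intervals drift.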

Atomic Behavior

Self-Aware Video Evaluation

Output Schema

{"answers":[{"option_id":"A","intervals_sec":[[0.0,1.2]]},{"option_id":"C","intervals_sec":[[3.4,5.0]]}]}

Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.

Metrics

  • Semantic Accuracy / F1. Semantic recognition of the atomic behaviors present in the clip.
  • Temporal F1@0.5. Interval-level success at the 0.5 threshold.
  • Mean tIoU. Continuous temporal localization quality.

Landmark Visibility

Environment-Aware Video Evaluation

Output Schema

{"visible_count":2,"visible_intervals_sec":[[0.0,9.1],[17.7,26.8]]}

Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.

Metrics

  • Count Accuracy. Exact visible-count correctness.
  • Segment F1@0.5. Interval-level visibility success at the 0.5 threshold.
  • Mean tIoU. Continuous temporal overlap quality for visible intervals.
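Scoring a visibility prediction combines an exact count check with temporal overlap; in this sketch, mean tIoU averages the best-overlapping predicted interval for each ground-truth interval, which is an assumption rather than the benchmark's documented matching rule, and `score_visibility` is a hypothetical helper name.

```python
def t_iou(pred: list[float], gt: list[float]) -> float:
    """Temporal IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def score_visibility(pred: dict, gt: dict) -> tuple[int, float]:
    """Return (count accuracy, mean tIoU) for one visibility prediction
    shaped like {"visible_count": int, "visible_intervals_sec": [[s, e], ...]}."""
    count_ok = int(pred["visible_count"] == gt["visible_count"])
    gts = gt["visible_intervals_sec"]
    preds = pred["visible_intervals_sec"]
    mean_tiou = (sum(max((t_iou(p, g) for p in preds), default=0.0) for g in gts)
                 / len(gts)) if gts else 0.0
    return count_ok, mean_tiou
```

Note that `visible_count` is scored independently of the intervals, so a correct count with misplaced segments still fails Segment F1@0.5 and mean tIoU.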

Output Contracts

These task-level output fields summarize the structured response contract defined above

This summary brings the image-side and video-side definitions into one compact view so users can see, task by task, which fields are expected and which scores they contribute to.

Task                                 | Modality | Required Fields                              | Primary Scores
Landmark-Relative Position Reasoning | Image    | answer_option_id, bbox_xyxy_norm             | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU
Future Observation Prediction        | Image    | answer_option_id, bbox_xyxy_norm             | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU
Self-Relative Position Reasoning     | Image    | answer_option_id, bbox_xyxy_norm             | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU
Landmark-Driven Action Decision      | Image    | answer_option_id, bbox_xyxy_norm             | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU
Composite Behavior                   | Video    | answers[].option_id, answers[].intervals_sec | Semantic Accuracy / F1, Temporal F1@0.5, Mean tIoU
Atomic Behavior                      | Video    | answers[].option_id, answers[].intervals_sec | Semantic Accuracy / F1, Temporal F1@0.5, Mean tIoU
Landmark Visibility                  | Video    | visible_count, visible_intervals_sec         | Count Accuracy, Segment F1@0.5, Mean tIoU
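The contract table lends itself to a mechanical field check; the three task-kind names below (`image`, `video_behavior`, `video_visibility`) are this sketch's own grouping, not identifiers from the benchmark.

```python
# Illustrative grouping of the required-fields column above.
REQUIRED_FIELDS = {
    "image": {"answer_option_id", "bbox_xyxy_norm"},
    "video_behavior": {"answers"},
    "video_visibility": {"visible_count", "visible_intervals_sec"},
}

def check_contract(task_kind: str, obj: dict) -> bool:
    """Return True when a parsed JSON object carries every top-level
    field the output contract requires for its task kind."""
    return not (REQUIRED_FIELDS[task_kind] - obj.keys())
```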

LLM Invocation Sources

Model calls are grouped into local deployment and API channels

The runtime model map follows configs/flightmvstg/common_api_runtime.yaml. The API routes include Siliconflow, OpenRouter, Bailian, Zhipu AI, and Xiaomi Mimo.

Local

13 model routes

Model            | Source
InternVL 3.5-14B | Local
InternVL 3.5-38B | Local
InternVL 3.5-4B  | Local
InternVL 3.5-8B  | Local
Qwen 3.5-4B      | Local
Qwen 3.5-9B      | Local
SenseNova-SI-1.2 | Local
SpaceOm          | Local
SpaceR           | Local
SpaceThinker     | Local
ViLaSR           | Local
VST-7B-RL        | Local
VST-7B-SFT       | Local

API

17 model routes

Model                  | Source
Gemini 3 Flash         | OpenRouter
Gemini 3.1 Flash Lite  | OpenRouter
GLM 4.6V               | Zhipu AI
GPT 5.3 Chat           | OpenRouter
Grok 4.1 Fast          | OpenRouter
Intern S1-Pro          | SH-AILab
InternVL 3.5-241B-A28B | SH-AILab
InternVL 3.5-30B-A3B   | Siliconflow
Kimi K2.5              | Bailian / Siliconflow
Mimo v2 Omni           | Xiaomi Mimo
Qwen 3.5-122B-A10B     | Siliconflow
Qwen 3.5-27B           | Siliconflow
Qwen 3.5-35B-A3B       | Siliconflow
Qwen 3.5-397B-A17B     | Siliconflow
Qwen 3.5-Flash         | Bailian
Qwen 3.5-Plus          | Bailian
Qwen 3.6-Plus          | Bailian

Suffix Handling

Instant/Thinking suffixes are interpreted per model family

The evaluation runtime strips -Instant, -Thinking, or -Reasoning to resolve the base route in common_api_runtime.yaml. It then applies family-compatible request controls (for example enable_thinking, reasoning, or chat_template_kwargs) according to provider capabilities. This keeps the routing contract stable across local deployment and API invocation while preserving a unified experiment naming convention.
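The suffix-resolution step can be sketched as follows; the suffix list matches the description above, but the control flags attached per family are illustrative examples only, since the real mapping lives in common_api_runtime.yaml and depends on provider capabilities.

```python
# Suffixes the runtime strips to recover the base route.
SUFFIXES = ("-Instant", "-Thinking", "-Reasoning")

def resolve_route(model_name: str) -> tuple[str, dict]:
    """Strip a reasoning suffix to find the base route, then attach
    family-compatible request controls. The controls dict here is an
    illustrative stand-in for the provider-specific flags (for example
    enable_thinking, reasoning, or chat_template_kwargs)."""
    for suffix in SUFFIXES:
        if model_name.endswith(suffix):
            base = model_name[: -len(suffix)]
            # Example only: -Instant disables thinking, the others enable it.
            return base, {"enable_thinking": suffix != "-Instant"}
    return model_name, {}
```

This keeps experiment names like "Qwen 3.5-Plus-Thinking" distinct while routing them through the single "Qwen 3.5-Plus" entry.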

Full Evaluation Prompt Templates

These templates are exported from the active prompt configuration and reproduced verbatim.

Image Task Prompts

Landmark-Relative Position Reasoning

Infer the UAV position relative to the target landmark from a landmark-centric reference view and a current egocentric observation.

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Protocol Note

Self-Aware image task evaluated with option prediction plus grounding.

System Prompt

[TASK]: You are a UAV agent. Compare a reference view including a landmark and your current first-person view. Decide your position relative to the landmark.
[INPUT]:
- You are given two first-person view images.
- Image 1 is the reference view, showing the landmark and its bounding box.
- Image 2 is your current view.
- Image 1 tells you which facade of the landmark is visible in the landmark-centric coordinate frame.
- The user prompt gives the question and answer options.
[OUTPUT]:
1. Return exactly one valid JSON object.
2. Do not output markdown, explanation, or extra text.
3. Use exactly one option_id from the user prompt.
4. The bounding box must use normalized [x1,y1,x2,y2] coordinates.
5. Strictly follow this JSON schema: {"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}
[NOTE]:
1. Decide your position relative to the landmark in image 2.
2. Also localize the landmark in image 2.
3. The visible facade in image 1 defines the landmark's forward-facing reference.

User Prompt

Image 1 shows the {{reference_object_view}} facade of {{landmark_description}} in the landmark-centric coordinate frame. Based on that reference, what is your position relative to the landmark in image 2? Select one option and return the normalized bounding box of the landmark in image 2. Options: {{options_text}}

Invocation Structure

system: full task instruction
user: natural-language question with image blocks
parser: strict JSON schema