Evaluation

Evaluation Protocol

The benchmark uses strict JSON outputs, explicit spatial grounding for image tasks, and explicit interval supervision for video tasks. The sections below show the exact contracts and prompt templates used by the current implementation.

Evaluation Scope

The protocol is shared across all exported leaderboard rows, but image-side and video-side outputs are intentionally different.

  • 31 Image Models: models with Stage 4 leaderboard coverage
  • 28 Video Models: models with Stage 3 leaderboard coverage
  • 4 Image Tasks: answer + bbox evaluation
  • 2 Video Tasks: semantic + interval evaluation

Protocol

Semantic correctness and evidence alignment are evaluated separately

UAV-DualCog does not treat answer choice as the full task. Image tasks must also ground the target landmark, and video tasks must also place events or visibility into the right temporal intervals. This protocol design is what allows the benchmark to reveal answer-versus-grounding and semantic-versus-temporal gaps.

Parsing & Validity Rules

All tasks are parsed under strict JSON rules before scoring

Rules

  • All tasks require strict JSON outputs. Free-form text, Markdown code fences, or extra explanations are treated as invalid formatting.
  • Image tasks are only considered fully correct when answer selection and landmark grounding agree.
  • Video tasks separate semantic success from temporal success so recognition without interval localization is not treated as complete success.
  • The website leaderboard uses convenience aggregate scores for browsing; the raw task-specific metrics remain the primary result source.

Field-to-Metric Mapping

Output Field            | Used By                | Drives Metrics
answer_option_id        | Image tasks            | Option Accuracy
bbox_xyxy_norm          | Image tasks            | BBox Acc@50IoU, BBox Mean IoU
answers[].option_id     | Self-Aware video tasks | Semantic Accuracy / F1
answers[].intervals_sec | Self-Aware video tasks | Temporal F1@0.5, Mean tIoU
visible_count           | Visibility video task  | Count Accuracy
visible_intervals_sec   | Visibility video task  | Segment F1@0.5, Mean tIoU

Image Task Protocols

Image tasks score answer selection together with landmark grounding

Even when the answer is semantically correct, localization quality remains part of the final judgment.

Self-Aware

Landmark-Relative Position Reasoning

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.

Metrics

  • Option Accuracy. Correct relative-position choice.
  • BBox Acc@50IoU. Predicted box overlaps the GT landmark with IoU >= 0.5.
  • BBox Mean IoU. Continuous localization quality on the query image.
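The bbox metrics above reduce to a standard IoU computation over normalized [x1, y1, x2, y2] boxes; the sketch below uses illustrative helper names (`bbox_iou`, `bbox_hit_at_50`) rather than the benchmark's own code, with the 0.5 threshold taken from the Acc@50IoU definition.

```python
def bbox_iou(a: list[float], b: list[float]) -> float:
    """IoU between two normalized [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def bbox_hit_at_50(pred: list[float], gt: list[float]) -> bool:
    """BBox Acc@50IoU: a prediction counts as correct at IoU >= 0.5."""
    return bbox_iou(pred, gt) >= 0.5
```

BBox Mean IoU is then the average of `bbox_iou` over all valid predictions, so it rewards partial overlap that the 0.5-threshold accuracy discards.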

Self-Aware

Future Observation Prediction

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.

Metrics

  • Option Accuracy. Correct future-view selection under the orbit action.
  • BBox Acc@50IoU. Localization quality in the selected candidate image.
  • BBox Mean IoU. Continuous grounding quality after view prediction.

Environment-Aware

Self-Relative Position Reasoning

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.

Metrics

  • Option Accuracy. Correct egocentric direction judgment.
  • BBox Acc@50IoU. Localization on the current observation image.
  • BBox Mean IoU. Continuous target grounding fidelity.

Environment-Aware

Landmark-Driven Action Decision

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.

Metrics

  • Option Accuracy. Correct action direction toward the landmark.
  • BBox Acc@50IoU. Whether the decision is grounded on the correct landmark.
  • BBox Mean IoU. Continuous localization quality while making the action decision.

Video Task Protocols

Video tasks separate semantic recognition from interval quality

This is why the benchmark can expose models that recognize the right behavior or count but still misplace the corresponding temporal intervals.

Composite Behavior

Self-Aware Video Evaluation

Output Schema

{"answers":[{"option_id":"A","intervals_sec":[[0.0,1.2]]},{"option_id":"C","intervals_sec":[[3.4,5.0]]}]}

Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.

Metrics

  • Semantic Accuracy / F1. Semantic recognition of the composite behaviors present in the clip.
  • Temporal F1@0.5. Interval-level success at the 0.5 threshold.
  • Mean tIoU. Continuous temporal localization quality.
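The temporal metrics can be sketched with an interval IoU plus a matching step; greedy one-to-one matching between predicted and ground-truth intervals is an assumption of this sketch, not a detail taken from the benchmark code.

```python
def t_iou(pred: list[float], gt: list[float]) -> float:
    """Temporal IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_f1(preds: list, gts: list, thr: float = 0.5) -> float:
    """Temporal F1@thr: greedily match each predicted interval to an
    unused ground-truth interval at tIoU >= thr, then combine
    precision and recall."""
    used, tp = set(), 0
    for p in preds:
        best, best_iou = None, thr
        for j, g in enumerate(gts):
            if j not in used and t_iou(p, g) >= best_iou:
                best, best_iou = j, t_iou(p, g)
        if best is not None:
            used.add(best)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Mean tIoU averages `t_iou` over matched pairs, so a model can score well on Semantic Accuracy yet poorly here when its intervals drift.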

Atomic Behavior

Self-Aware Video Evaluation

Output Schema

{"answers":[{"option_id":"A","intervals_sec":[[0.0,1.2]]},{"option_id":"C","intervals_sec":[[3.4,5.0]]}]}

Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.

Metrics

  • Semantic Accuracy / F1. Semantic recognition of the atomic behaviors present in the clip.
  • Temporal F1@0.5. Interval-level success at the 0.5 threshold.
  • Mean tIoU. Continuous temporal localization quality.

Landmark Visibility

Environment-Aware Video Evaluation

Output Schema

{"visible_count":2,"visible_intervals_sec":[[0.0,9.1],[17.7,26.8]]}

Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.

Metrics

  • Count Accuracy. Exact visible-count correctness.
  • Segment F1@0.5. Interval-level visibility success at the 0.5 threshold.
  • Mean tIoU. Continuous temporal overlap quality for visible intervals.
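Scoring a visibility prediction combines an exact count check with temporal overlap; in this sketch, mean tIoU averages the best-overlapping predicted interval for each ground-truth interval, which is an assumption rather than the benchmark's documented matching rule, and `score_visibility` is a hypothetical helper name.

```python
def t_iou(pred: list[float], gt: list[float]) -> float:
    """Temporal IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def score_visibility(pred: dict, gt: dict) -> tuple[int, float]:
    """Return (count accuracy, mean tIoU) for one visibility prediction
    shaped like {"visible_count": int, "visible_intervals_sec": [[s, e], ...]}."""
    count_ok = int(pred["visible_count"] == gt["visible_count"])
    gts = gt["visible_intervals_sec"]
    preds = pred["visible_intervals_sec"]
    mean_tiou = (sum(max((t_iou(p, g) for p in preds), default=0.0) for g in gts)
                 / len(gts)) if gts else 0.0
    return count_ok, mean_tiou
```

Note that `visible_count` is scored independently of the intervals, so a correct count with misplaced segments still fails Segment F1@0.5 and mean tIoU.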

Output Contracts

These task-level output fields summarize the structured response contract defined above

This summary brings the image-side and video-side definitions into one compact view so users can see, task by task, which fields are expected and which scores they contribute to.

Task                                 | Modality | Required Fields                              | Primary Scores
Landmark-Relative Position Reasoning | Image    | answer_option_id, bbox_xyxy_norm             | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU
Future Observation Prediction        | Image    | answer_option_id, bbox_xyxy_norm             | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU
Self-Relative Position Reasoning     | Image    | answer_option_id, bbox_xyxy_norm             | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU
Landmark-Driven Action Decision      | Image    | answer_option_id, bbox_xyxy_norm             | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU
Composite Behavior                   | Video    | answers[].option_id, answers[].intervals_sec | Semantic Accuracy / F1, Temporal F1@0.5, Mean tIoU
Atomic Behavior                      | Video    | answers[].option_id, answers[].intervals_sec | Semantic Accuracy / F1, Temporal F1@0.5, Mean tIoU
Landmark Visibility                  | Video    | visible_count, visible_intervals_sec         | Count Accuracy, Segment F1@0.5, Mean tIoU
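The contract table lends itself to a mechanical field check; the three task-kind names below (`image`, `video_behavior`, `video_visibility`) are this sketch's own grouping, not identifiers from the benchmark.

```python
# Illustrative grouping of the required-fields column above.
REQUIRED_FIELDS = {
    "image": {"answer_option_id", "bbox_xyxy_norm"},
    "video_behavior": {"answers"},
    "video_visibility": {"visible_count", "visible_intervals_sec"},
}

def check_contract(task_kind: str, obj: dict) -> bool:
    """Return True when a parsed JSON object carries every top-level
    field the output contract requires for its task kind."""
    return not (REQUIRED_FIELDS[task_kind] - obj.keys())
```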

LLM Invocation Sources

Model calls are grouped into local deployment and API channels

The runtime model map follows configs/flightmvstg/common_api_runtime.yaml. The API routes include Siliconflow, OpenRouter, Bailian, Zhipu AI, and Xiaomi Mimo.

Local

13 model routes

Model            | Source
InternVL 3.5-14B | Local
InternVL 3.5-38B | Local
InternVL 3.5-4B  | Local
InternVL 3.5-8B  | Local
Qwen 3.5-4B      | Local
Qwen 3.5-9B      | Local
SenseNova-SI-1.2 | Local
SpaceOm          | Local
SpaceR           | Local
SpaceThinker     | Local
ViLaSR           | Local
VST-7B-RL        | Local
VST-7B-SFT       | Local

API

17 model routes

Model                  | Source
Gemini 3 Flash         | OpenRouter
Gemini 3.1 Flash Lite  | OpenRouter
GLM 4.6V               | Zhipu AI
GPT 5.3 Chat           | OpenRouter
Grok 4.1 Fast          | OpenRouter
Intern S1-Pro          | SH-AILab
InternVL 3.5-241B-A28B | SH-AILab
InternVL 3.5-30B-A3B   | Siliconflow
Kimi K2.5              | Bailian / Siliconflow
Mimo v2 Omni           | Xiaomi Mimo
Qwen 3.5-122B-A10B     | Siliconflow
Qwen 3.5-27B           | Siliconflow
Qwen 3.5-35B-A3B       | Siliconflow
Qwen 3.5-397B-A17B     | Siliconflow
Qwen 3.5-Flash         | Bailian
Qwen 3.5-Plus          | Bailian
Qwen 3.6-Plus          | Bailian

Suffix Handling

Instant/Thinking suffixes are interpreted per model family

The evaluation runtime strips -Instant, -Thinking, or -Reasoning to resolve the base route in common_api_runtime.yaml. It then applies family-compatible request controls (for example enable_thinking, reasoning, or chat_template_kwargs) according to provider capabilities. This keeps the routing contract stable across local deployment and API invocation while preserving a unified experiment naming convention.
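The suffix-resolution step can be sketched as follows; the suffix list matches the description above, but the control flags attached per family are illustrative examples only, since the real mapping lives in common_api_runtime.yaml and depends on provider capabilities.

```python
# Suffixes the runtime strips to recover the base route.
SUFFIXES = ("-Instant", "-Thinking", "-Reasoning")

def resolve_route(model_name: str) -> tuple[str, dict]:
    """Strip a reasoning suffix to find the base route, then attach
    family-compatible request controls. The controls dict here is an
    illustrative stand-in for the provider-specific flags (for example
    enable_thinking, reasoning, or chat_template_kwargs)."""
    for suffix in SUFFIXES:
        if model_name.endswith(suffix):
            base = model_name[: -len(suffix)]
            # Example only: -Instant disables thinking, the others enable it.
            return base, {"enable_thinking": suffix != "-Instant"}
    return model_name, {}
```

This keeps experiment names like "Qwen 3.5-Plus-Thinking" distinct while routing them through the single "Qwen 3.5-Plus" entry.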

Full Evaluation Prompt Templates

These templates are exported from the active prompt configuration and reproduced verbatim.

Image Task Prompts

Landmark-Relative Position Reasoning

Infer the UAV position relative to the target landmark from a landmark-centric reference view and a current egocentric observation.

Output Schema

{"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}

Protocol Note

Self-Aware image task evaluated with option prediction plus grounding.

System Prompt

[TASK]: You are a UAV agent. Compare a reference view including a landmark and your current first-person view. Decide your position relative to the landmark.
[INPUT]:
- You are given two first-person view images.
- Image 1 is the reference view, showing the landmark and its bounding box.
- Image 2 is your current view.
- Image 1 tells you which facade of the landmark is visible in the landmark-centric coordinate frame.
- The user prompt gives the question and answer options.
[OUTPUT]:
1. Return exactly one valid JSON object.
2. Do not output markdown, explanation, or extra text.
3. Use exactly one option_id from the user prompt.
4. The bounding box must use normalized [x1,y1,x2,y2] coordinates.
5. Strictly follow this JSON schema: {"answer_option_id":"A","bbox_xyxy_norm":[0.1,0.2,0.3,0.4]}
[NOTE]:
1. Decide your position relative to the landmark in image 2.
2. Also localize the landmark in image 2.
3. The visible facade in image 1 defines the landmark's forward-facing reference.

User Prompt

Image 1 shows the {{reference_object_view}} facade of {{landmark_description}} in the landmark-centric coordinate frame. Based on that reference, what is your position relative to the landmark in image 2? Select one option and return the normalized bounding box of the landmark in image 2. Options: {{options_text}}

Invocation Structure

system: full task instruction
user: natural-language question with image blocks
parser: strict JSON schema