Evaluation Protocol
The benchmark uses strict JSON outputs, explicit spatial grounding for image tasks, and explicit interval supervision for video tasks. The sections below show the exact contracts and prompt templates used by the current implementation.
Evaluation Scope
The protocol is shared across all exported leaderboard rows, but image-side and video-side outputs are intentionally different.
- Models with Stage 3 leaderboard coverage
- Image side: answer + bbox evaluation
- Video side: semantic + interval evaluation
Protocol
Semantic correctness and evidence alignment are evaluated separately
UAV-DualCog does not treat answer choice as the full task. Image tasks must also ground the target landmark, and video tasks must also place events or visibility into the right temporal intervals. This protocol design is what allows the benchmark to reveal answer-versus-grounding and semantic-versus-temporal gaps.
Parsing & Validity Rules
All tasks are parsed under strict JSON rules before scoring
Rules
- All tasks require strict JSON outputs. Free-form text, Markdown code fences, or extra explanations are treated as invalid formatting.
- Image tasks are only considered fully correct when answer selection and landmark grounding agree.
- Video tasks separate semantic success from temporal success so recognition without interval localization is not treated as complete success.
- The website leaderboard uses convenience aggregate scores for browsing; the raw task-specific metrics remain the primary result source.
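The strict-JSON gate described above can be sketched as follows. This is an illustrative helper under assumed rules (bare JSON object only, no fences or prose), not the benchmark's actual parser:

```python
import json

def parse_strict_json(raw: str):
    """Illustrative strict-JSON gate: accept only a bare JSON object.

    Markdown code fences, leading prose, or non-object payloads are
    treated as invalid formatting and yield None.
    """
    text = raw.strip()
    if text.startswith("```") or not text.startswith("{"):
        return None  # fences or free-form text invalidate the response
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None
```

Under this gate, a fenced or prose-wrapped answer fails validity even when the embedded JSON is correct.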
Field-to-Metric Mapping
| Output Field | Used By | Drives Metrics |
|---|---|---|
| answer_option_id | Image tasks | Option Accuracy |
| bbox_xyxy_norm | Image tasks | BBox Acc@50IoU, BBox Mean IoU |
| answers[].option_ids | Self-Aware video tasks | Semantic Accuracy / F1 |
| answers[].intervals_sec | Self-Aware video tasks | Temporal F1@0.5, mean tIoU |
| visible_count | Visibility video task | Count Accuracy |
| visible_intervals_sec | Visibility video task | Segment F1@0.5, mean tIoU |
Image Task Protocols
Image tasks score answer selection together with landmark grounding
Even when the answer is semantically correct, localization quality remains part of the final judgment.
Self-Aware
Landmark-Relative Position Reasoning
Output Schema
Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.
Metrics
- Option Accuracy. Correct relative-position choice.
- BBox Acc@50IoU. Predicted box overlaps the GT landmark with IoU >= 0.5.
- BBox Mean IoU. Continuous localization quality on the query image.
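The two bbox metrics share one building block: IoU between the predicted and ground-truth boxes. A minimal sketch, assuming normalized [x1, y1, x2, y2] order as the field name bbox_xyxy_norm suggests:

```python
def bbox_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes in normalized coordinates.

    BBox Acc@50IoU counts a hit when this value is >= 0.5;
    BBox Mean IoU averages the raw value over samples.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```
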
Self-Aware
Future Observation Prediction
Output Schema
Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.
Metrics
- Option Accuracy. Correct future-view selection under the orbit action.
- BBox Acc@50IoU. Localization quality in the selected candidate image.
- BBox Mean IoU. Continuous grounding quality after view prediction.
Environment-Aware
Self-Relative Position Reasoning
Output Schema
Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.
Metrics
- Option Accuracy. Correct egocentric direction judgment.
- BBox Acc@50IoU. Localization on the current observation image.
- BBox Mean IoU. Continuous target grounding fidelity.
Environment-Aware
Landmark-Driven Action Decision
Output Schema
Valid predictions must provide exactly one option id and one normalized bbox for the target landmark.
Metrics
- Option Accuracy. Correct action direction toward the landmark.
- BBox Acc@50IoU. Whether the decision is grounded on the correct landmark.
- BBox Mean IoU. Continuous localization quality while making the action decision.
Video Task Protocols
Video tasks separate semantic recognition from interval quality
This is why the benchmark can expose models that recognize the right behavior or count but still misplace the corresponding temporal intervals.
Composite Behavior
Self-Aware Video Evaluation
Output Schema
Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.
Metrics
- Semantic Accuracy / F1. Semantic recognition of the composite behavior.
- Temporal F1@0.5. Interval-level success at the 0.5 threshold.
- Mean tIoU. Continuous temporal localization quality.
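Both temporal metrics rest on interval IoU over [start_sec, end_sec] pairs. A minimal sketch of that building block (Temporal F1@0.5 thresholds it at 0.5; Mean tIoU averages it):

```python
def t_iou(pred, gt):
    """Temporal IoU of two [start_sec, end_sec] intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

Note that disjoint intervals score 0.0 regardless of how close they are, which is why recognition without localization earns no temporal credit.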
Atomic Behavior
Self-Aware Video Evaluation
Output Schema
Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.
Metrics
- Semantic Accuracy / F1. Semantic recognition of the atomic behaviors.
- Temporal F1@0.5. Interval-level success at the 0.5 threshold.
- Mean tIoU. Continuous temporal localization quality.
Landmark Visibility
Environment-Aware Video Evaluation
Output Schema
Valid predictions must return strict JSON; interval-based scoring is computed after semantic parsing.
Metrics
- Count Accuracy. Exact visible-count correctness.
- Segment F1@0.5. Interval-level visibility success at the 0.5 threshold.
- Mean tIoU. Continuous temporal overlap quality for visible intervals.
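Segment F1@0.5 can be sketched as one-to-one matching of predicted and ground-truth intervals at tIoU >= 0.5, followed by precision/recall/F1. The greedy matching below is an assumption for illustration, not the reference scorer:

```python
def _tiou(p, g):
    """Temporal IoU of two [start_sec, end_sec] intervals."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_f1_at_05(preds, gts):
    """Greedy one-to-one interval matching at tIoU >= 0.5, then F1."""
    matched, tp = set(), 0
    for p in preds:
        for j, g in enumerate(gts):
            if j not in matched and _tiou(p, g) >= 0.5:
                matched.add(j)  # each GT interval may match at most once
                tp += 1
                break
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

A correct visible_count with misplaced visible_intervals_sec therefore scores well on Count Accuracy but poorly here.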
Output Contracts
These task-level output fields summarize the structured response contract defined above
This summary brings the image-side and video-side definitions into one compact view so users can see, task by task, which fields are expected and which scores they contribute to.
| Task | Modality | Required Fields | Primary Scores |
|---|---|---|---|
| Landmark-Relative Position Reasoning | Image | answer_option_id, bbox_xyxy_norm | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU |
| Future Observation Prediction | Image | answer_option_id, bbox_xyxy_norm | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU |
| Self-Relative Position Reasoning | Image | answer_option_id, bbox_xyxy_norm | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU |
| Landmark-Driven Action Decision | Image | answer_option_id, bbox_xyxy_norm | Option Accuracy, BBox Acc@50IoU, BBox Mean IoU |
| Composite Behavior | Video | answers[].option_ids, answers[].intervals_sec | Semantic Accuracy / F1, Temporal F1@0.5, Mean tIoU |
| Atomic Behavior | Video | answers[].option_ids, answers[].intervals_sec | Semantic Accuracy / F1, Temporal F1@0.5, Mean tIoU |
| Landmark Visibility | Video | visible_count, visible_intervals_sec | Count Accuracy, Segment F1@0.5, Mean tIoU |
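To make the contracts concrete, here are hypothetical predictions for an image task and the Landmark Visibility task. Field names come from the table above; all values are made up for illustration:

```python
import json

# Hypothetical image-task response: one option id, one normalized xyxy box.
image_pred = {
    "answer_option_id": "B",
    "bbox_xyxy_norm": [0.12, 0.30, 0.48, 0.77],
}

# Hypothetical visibility-task response: a count plus matching intervals.
video_pred = {
    "visible_count": 2,
    "visible_intervals_sec": [[3.0, 9.5], [21.0, 28.0]],
}

# Responses must serialize as bare JSON objects: no prose, no fences.
print(json.dumps(image_pred))
print(json.dumps(video_pred))
```
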
LLM Invocation Sources
Model calls are grouped into local deployment and API channels
The runtime model map follows `configs/flightmvstg/common_api_runtime.yaml`.
The API routes include Siliconflow, OpenRouter, Bailian, Zhipu AI, and Xiaomi Mimo.
Local
13 model routes
| Model | Source |
|---|---|
| InternVL 3.5-14B | Local |
| InternVL 3.5-38B | Local |
| InternVL 3.5-4B | Local |
| InternVL 3.5-8B | Local |
| Qwen 3.5-4B | Local |
| Qwen 3.5-9B | Local |
| SenseNova-SI-1.2 | Local |
| SpaceOm | Local |
| SpaceR | Local |
| SpaceThinker | Local |
| ViLaSR | Local |
| VST-7B-RL | Local |
| VST-7B-SFT | Local |
API
17 model routes
| Model | Source |
|---|---|
| Gemini 3 Flash | OpenRouter |
| Gemini 3.1 Flash Lite | OpenRouter |
| GLM 4.6V | Zhipu AI |
| GPT 5.3 Chat | OpenRouter |
| Grok 4.1 Fast | OpenRouter |
| Intern S1-Pro | SH-AILab |
| InternVL 3.5-241B-A28B | SH-AILab |
| InternVL 3.5-30B-A3B | Siliconflow |
| Kimi K2.5 | Bailian / Siliconflow |
| Mimo v2 Omni | Xiaomi Mimo |
| Qwen 3.5-122B-A10B | Siliconflow |
| Qwen 3.5-27B | Siliconflow |
| Qwen 3.5-35B-A3B | Siliconflow |
| Qwen 3.5-397B-A17B | Siliconflow |
| Qwen 3.5-Flash | Bailian |
| Qwen 3.5-Plus | Bailian |
| Qwen 3.6-Plus | Bailian |
Suffix Handling
Instant/Thinking suffixes are interpreted per model family
The evaluation runtime strips `-Instant`, `-Thinking`, or `-Reasoning` to resolve the base route in `common_api_runtime.yaml`. It then applies family-compatible request controls (for example `enable_thinking`, `reasoning`, or `chat_template_kwargs`) according to provider capabilities. This keeps the routing contract stable across local deployment and API invocation while preserving a unified experiment naming convention.
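The suffix resolution step described above can be sketched as a simple string operation. The function name is hypothetical; only the three suffixes come from the text:

```python
def resolve_base_route(model_name):
    """Strip a trailing -Instant / -Thinking / -Reasoning suffix to
    recover the base route keyed in common_api_runtime.yaml (sketch)."""
    for suffix in ("-Instant", "-Thinking", "-Reasoning"):
        if model_name.endswith(suffix):
            # Return the base route plus the mode to map onto
            # family-compatible request controls.
            return model_name[: -len(suffix)], suffix[1:]
    return model_name, None
```

The experiment name thus stays suffixed for reporting while the request is routed through the unsuffixed base entry.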
Full Evaluation Prompt Templates
These templates are exported from the active prompt configuration and reproduced verbatim.
Image Task Prompts
Landmark-Relative Position Reasoning
Infer the UAV position relative to the target landmark from a landmark-centric reference view and a current egocentric observation.
Output Schema
Protocol Note
Self-Aware image task evaluated with option prediction plus grounding.
System Prompt
User Prompt
Invocation Structure