Analysis

Result Analysis

UAV-DualCog evaluates whether multimodal large language models can reason about the UAV itself and the surrounding environment while grounding each decision in spatial or temporal evidence. The analysis first examines different observation media, then compares different self-aware and environment-aware abilities, then performs cross-media, cross-cognition, and model-family analysis before summarizing the main findings and future capability directions.

Analysis Framework

UAV-DualCog separates answer correctness from evidence reliability and separates self-aware cognition from environment-aware cognition. The analysis is organized around media-specific results, cognition-specific results, and cross-analysis that links the observed gaps to future UAV embodied-reasoning capabilities.

1. Media Analysis

Image and video results reveal how dual cognition changes across spatial and temporal evidence.

2. Cognition Analysis

Self-aware and environment-aware results expose whether UAV self-modeling and environment modeling are balanced.

3. Cross Analysis

Cross-media, cross-cognition, and model-family comparisons summarize the main findings and future capability directions.

UAV-DualCog does not treat UAV understanding as ordinary visual question answering. It evaluates whether a multimodal large language model can reason about the UAV itself, reason about the external environment, and support those judgments with spatial or temporal evidence. Image results are interpreted through spatial grounding, video results through temporal localization, and aggregate comparisons through cross-media transfer and cognition balance.

Two questions are central: whether dual-cognition ability transfers between image and video media, and whether self-aware and environment-aware cognition have developed in a balanced way. A single score can obscure these distinctions, because a model may answer correctly without evidence, perform well in videos but not images, or rank highly only because environment-aware tasks compensate for weak self-aware reasoning.

Cross-Media Transfer Analysis

Cross-media transfer is real but insufficient

UAV-DualCog measures the same dual-cognition formulation through image and video evidence. Image tasks require answer choices to be supported by spatial evidence, while video tasks require semantic decisions to be supported by temporal intervals.

Image Task Results and Analysis

Image results reveal a broad gap between semantic choice and spatial evidence

Table 1 reports the complete image-task results. The four image tasks include Landmark-Relative Position Reasoning, Future Observation Prediction, Self-Relative Position Reasoning, and Landmark-Driven Action Decision. Together, they cover the self-aware and environment-aware cognition axes. Because the table reports option accuracy, BBox Acc@0.5, and mIoU, it directly shows whether a model only selects the correct option or can also provide credible spatial evidence.
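As a minimal sketch of how the two spatial-evidence columns relate to each other, the Python snippet below computes box IoU and then derives Acc@0.5 and mIoU from a list of predictions. The `(x1, y1, x2, y2)` box format, the function names, and the choice to score a missing box as IoU 0 are assumptions for illustration, not the benchmark's released evaluation code.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def box_iou(pred: Optional[Box], gt: Box) -> float:
    """Intersection over union of two axis-aligned boxes.

    A missing prediction is scored as 0.0 (an assumption of this sketch).
    """
    if pred is None:
        return 0.0
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def spatial_metrics(
    preds: List[Optional[Box]], gts: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (Acc@threshold, mIoU) over paired predictions and ground truths."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    acc = sum(iou >= threshold for iou in ious) / len(ious)
    miou = sum(ious) / len(ious)
    return acc, miou


# Hypothetical boxes: one exact hit, one partial overlap, one missing box.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), None]
gts = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]
acc_at_50, mean_iou = spatial_metrics(preds, gts)
print(f"Acc@0.5={acc_at_50:.3f}  mIoU={mean_iou:.3f}")
```

Because Acc@0.5 only counts boxes that clear the threshold while mIoU averages every overlap, a model can show a nonzero mIoU with an Acc@0.5 near zero, which is the pattern several rows of Table 1 exhibit.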

Table 1. Complete results for image tasks. The table is copied from the leaderboard and reports overall image accuracy together with per-task answer accuracy, BBox Acc@0.5, and mIoU.

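Columns: Model, Overall Acc, then Acc, IoU@50, and mIoU for each of the four tasks in this order: Landmark-Relative Position Reasoning and Future Observation Prediction (Self-Aware), Self-Relative Position Reasoning and Landmark-Driven Action Decision (Environment-Aware). IoU@50 is the BBox Acc@0.5 metric. A #N before the overall score, or a bare integer from 1 to 3 after a value, marks that value's rank within its model group.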
Closed-source Models
Claude Sonnet 4.6 #3 42.7% 3 37.7% 3 4.9% 18.2% 23.6% 3.4% 14.2% 48.5% 9.6% 3 20.2% 2 61.0% 2 8.6% 19.9%
GPT 5.3 Chat #2 47.8% 2 35.2% 11.9% 23.5% 37.3% 2 12.4% 25.2% 3 56.5% 1 14.9% 1 22.5% 1 62.3% 1 15.4% 1 22.5% 1
Gemini 3 Flash #1 50.2% 1 47.6% 1 0.7% 1.2% 45.9% 1 0.7% 0.9% 56.2% 2 0.1% 0.9% 51.3% 0.5% 1.2%
Gemini 3.1 Flash Lite 34.5% 39.1% 2 0.0% 0.0% 28.6% 2.5% 3.5% 39.9% 0.4% 0.8% 30.4% 0.4% 0.6%
Grok 4.1 Fast 27.4% 21.1% 3.6% 16.4% 22.9% 2.2% 17.5% 33.0% 1.9% 8.2% 32.6% 1.6% 7.5%
Qwen 3.6-Plus 39.3% 32.8% 16.6% 27.1% 3 33.5% 3 14.0% 3 26.7% 2 48.4% 7.4% 19.0% 42.5% 6.0% 19.9%
Qwen 3.5-Plus 38.5% 28.6% 21.4% 3 27.1% 29.0% 13.4% 23.5% 47.6% 9.7% 2 19.3% 48.9% 8.3% 20.0% 3
Qwen 3.5-Flash 40.9% 27.8% 21.6% 2 29.8% 2 28.9% 15.9% 2 23.3% 52.4% 3 8.8% 19.4% 3 54.5% 3 8.9% 3 20.1% 2
Mimo v2 Omni 31.2% 28.7% 38.0% 1 34.3% 1 21.3% 29.8% 1 29.7% 1 36.5% 8.8% 17.4% 38.2% 9.2% 2 17.3%
Open-source Models
GLM 4.6V 32.8% 30.8% 9.4% 8.7% 27.1% 8.6% 8.7% 29.2% 3.3% 6.7% 44.2% 2.5% 5.0%
Kimi K2.5 34.1% 34.3% 1 27.1% 30.4% 32.7% 15.0% 21.7% 39.4% 5.0% 15.6% 30.1% 4.7% 14.4%
Qwen 3.5-397B-A17B 39.4% 27.4% 3.3% 12.6% 35.8% 3 10.0% 21.4% 47.9% 2.8% 13.0% 46.2% 4.4% 13.9%
Qwen 3.5-122B-A10B #2 42.3% 2 32.6% 3 29.1% 2 31.6% 3 37.8% 1 20.6% 1 28.4% 3 49.7% 6.1% 19.4% 3 49.0% 6.8% 18.8% 3
Qwen 3.5-35B-A3B #3 41.9% 3 29.5% 27.6% 30.5% 27.4% 0.5% 0.8% 53.2% 2 6.1% 14.5% 57.5% 1 7.2% 3 18.5%
Qwen 3.5-27B #1 44.3% 1 31.6% 42.8% 1 39.9% 1 37.1% 2 17.1% 25.8% 57.8% 1 7.3% 2 20.4% 2 50.6% 3 7.5% 2 21.0% 2
Qwen 3.5-9B 40.9% 29.0% 15.9% 25.8% 29.0% 17.8% 3 29.8% 1 53.2% 2 10.3% 1 20.9% 1 52.2% 2 11.5% 1 22.1% 1
Qwen 3.5-4B 39.0% 30.9% 12.8% 23.1% 30.2% 11.2% 21.7% 47.5% 6.7% 3 17.7% 47.5% 6.3% 18.0%
Intern S1-Pro 28.4% 29.4% 17.4% 26.8% 26.2% 9.6% 21.3% 27.8% 2.7% 13.7% 30.3% 3.2% 14.0%
InternVL 3.5-241B-A28B 37.7% 32.9% 2 28.4% 3 31.2% 26.1% 19.5% 2 25.3% 50.1% 3 5.4% 15.8% 41.8% 5.1% 15.0%
InternVL 3.5-30B-A3B 32.0% 27.3% 4.9% 19.9% 24.0% 7.0% 15.5% 35.4% 0.6% 7.6% 41.3% 1.3% 7.3%
InternVL 3.5-14B 31.1% 22.8% 7.6% 22.2% 31.9% 6.6% 22.0% 28.5% 4.7% 12.9% 41.3% 5.0% 11.9%
InternVL 3.5-8B 27.1% 28.5% 19.7% 34.2% 2 24.5% 8.1% 28.7% 2 24.7% 4.2% 12.3% 30.8% 2.5% 11.4%
InternVL 3.5-4B 27.8% 25.6% 3.1% 19.8% 26.4% 10.8% 23.2% 22.4% 1.0% 8.0% 36.7% 1.0% 7.5%
Fine-tuned Models
SpaceR #3 34.0% 3 20.6% 2.4% 3 6.0% 3 24.0% 1.2% 3 1.9% 3 42.4% 1.8% 3 5.4% 3 48.8% 2 2.3% 2 6.8% 2
SpaceThinker 28.8% 22.6% 1.2% 3.0% 25.4% 1 1.9% 2 5.7% 2 32.8% 0.6% 5.1% 34.3% 0.7% 4.7%
SpaceOm 29.3% 24.7% 3 2.8% 2 6.7% 2 24.5% 2 3.6% 1 10.9% 1 35.2% 2.4% 2 8.6% 1 32.8% 1.8% 3 6.5% 3
SenseNova-SI-1.2 19.6% 17.5% 10.7% 1 10.6% 1 0.1% 0.0% 0.1% 15.3% 3.2% 1 5.8% 2 45.4% 6.0% 1 8.5% 1
ViLaSR 33.4% 18.8% 0.0% 0.0% 22.1% 0.0% 0.0% 43.1% 3 0.0% 0.0% 49.7% 1 0.0% 0.0%
VST-7B-RL #1 36.1% 1 29.2% 1 0.0% 0.0% 24.1% 3 0.0% 0.0% 47.2% 1 0.0% 0.1% 44.0% 0.0% 0.0%
VST-7B-SFT #2 36.1% 2 28.1% 2 0.0% 0.0% 23.0% 0.0% 0.0% 46.5% 2 0.0% 0.2% 46.6% 3 0.0% 0.2%

As shown in Table 1, Gemini 3 Flash, GPT 5.3 Chat, and several Qwen 3.5 models lead the image leaderboard, but their strengths are structurally different. Gemini 3 Flash reaches the highest image overall accuracy at 50.22%, indicating strong semantic option selection. However, its four BBox Acc@0.5 scores are only 0.68%, 0.68%, 0.10%, and 0.49%, so strong semantic judgment does not automatically yield precise spatial evidence. GPT 5.3 Chat obtains 47.83% image overall accuracy and exceeds 11% on all four BBox Acc@0.5 metrics, giving it more balanced spatial evidence quality. Qwen 3.5-27B reaches 44.29% image overall accuracy, and its BBox Acc@0.5 on Landmark-Relative Position Reasoning reaches 42.77%, showing promising local spatial grounding potential. In contrast, spatially specialized or reinforcement-trained models such as VST-7B-RL, VST-7B-SFT, and ViLaSR keep moderate option accuracy (33% to 36% overall), but nearly all of their BBox metrics sit at or near 0%. General spatial-reasoning tuning therefore does not directly solve the joint output requirement of option judgment plus explicit evidence in UAV-DualCog.

The central image-branch question is whether a model that selects the correct answer for self-aware or environment-aware cognition can also recover the spatial evidence supporting that answer. Figure 1 uses representative models from major model families: Claude Sonnet 4.6, GPT 5.3 Chat, Gemini 3 Flash, Qwen 3.6-Plus, Kimi K2.5, Mimo v2 Omni, InternVL 3.5-30B-A3B, SenseNova-SI-1.2, VST-7B-RL, SpaceOm, and ViLaSR. Across these 11 models, the mean semantic score reaches 35.98%, but the mean BBox Acc@0.5 is only 7.02%. Current models have therefore acquired a limited level of dual-cognition semantic judgment under images, but this judgment is far from reliably grounded in verifiable object locations.

More specifically, semantic correctness without sufficient evidence is highly common in the image branch. GPT 5.3 Chat, Claude Sonnet 4.6, and Gemini 3 Flash remain competitive in option accuracy, yet still lose substantial quality on bounding-box alignment. Mimo v2 Omni is not the top model in image overall accuracy, but it shows relatively stronger spatial evidence recovery, suggesting that different model families emphasize correct judgment and evidence alignment differently. The Qwen 3.5 family is comparatively stable, combining semantic judgment with some degree of spatial localization across multiple image tasks.

Figure 1 data (values in %).
Model Semantic (Option) Spatial Acc@0.5
Claude Sonnet 4.6 42.72 6.62
GPT 5.3 Chat 47.83 13.67
Gemini 3 Flash 50.22 0.49
Qwen 3.6-Plus 39.31 10.99
Kimi K2.5 34.11 12.94
Mimo v2 Omni 31.18 21.44
InternVL 3.5-30B-A3B 32.01 3.44
SenseNova-SI-1.2 19.58 4.98
VST-7B-RL 36.13 0.00
SpaceOm 29.30 2.66
ViLaSR 33.42 0.00

Figure 1. Selected image models compared by semantic option accuracy and spatial localization Acc@0.5. The overall trend indicates that image-based dual cognition still mainly reaches the level of answering correctly, rather than answering correctly with credible spatial evidence.
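The subset means quoted for Figure 1 can be re-derived from the bar values themselves. The following Python sketch does so for the 11 models whose bars appear in the figure and also lists the per-model gap between semantic accuracy and spatial Acc@0.5. The dictionary layout and variable names are illustrative choices; the numbers are the ones shown in Figure 1.

```python
from statistics import mean

# (semantic option accuracy %, spatial Acc@0.5 %) per model, read from Figure 1.
figure1 = {
    "Claude Sonnet 4.6": (42.72, 6.62),
    "GPT 5.3 Chat": (47.83, 13.67),
    "Gemini 3 Flash": (50.22, 0.49),
    "Qwen 3.6-Plus": (39.31, 10.99),
    "Kimi K2.5": (34.11, 12.94),
    "Mimo v2 Omni": (31.18, 21.44),
    "InternVL 3.5-30B-A3B": (32.01, 3.44),
    "SenseNova-SI-1.2": (19.58, 4.98),
    "VST-7B-RL": (36.13, 0.00),
    "SpaceOm": (29.30, 2.66),
    "ViLaSR": (33.42, 0.00),
}

semantic_mean = mean(sem for sem, _ in figure1.values())
spatial_mean = mean(spa for _, spa in figure1.values())

# Semantic-minus-spatial gap in percentage points: larger means the model
# answers correctly far more often than it localizes the evidence.
gaps = {name: sem - spa for name, (sem, spa) in figure1.items()}
largest_gap = max(gaps, key=gaps.get)
smallest_gap = min(gaps, key=gaps.get)

print(f"mean semantic = {semantic_mean:.2f}%, mean Acc@0.5 = {spatial_mean:.2f}%")
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} gap = {gap:5.2f} pts")
```

Sorting by gap rather than by accuracy puts the answer-without-evidence cases at the top of the list, which a ranking by option accuracy alone would hide.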

At the task level, the evidence bottleneck is not uniform across image tasks. The semantic-spatial correlation is highest for Landmark-Driven Action Decision at Pearson r = 0.44, indicating that models that choose the correct action are more likely to recover reasonable spatial evidence. By contrast, Landmark-Relative Position Reasoning has a correlation of only 0.02, almost no stable coupling. This means that a model can often answer a relative-position question correctly while failing to provide a consistent target box in the same image. Table 2 shows that all four image tasks have incomplete synchronization between semantic decisions and spatial evidence, with slightly higher correlations for environment-aware tasks and the weakest coupling when the model must explicitly reason about the UAV relative to a landmark. In UAV scenes, a correct option may therefore come from language priors, local appearance cues, or answer-choice bias. It cannot be treated as reliable evidence of completed spatial reasoning.

Task Pearson r Semantic Avg. Spatial Avg.
Landmark-Relative Position Reasoning +0.02 30.35% 10.69%
Future Observation Prediction +0.15 26.29% 7.81%
Self-Relative Position Reasoning +0.28 41.97% 4.73%
Landmark-Driven Action Decision +0.44 45.33% 4.85%

Table 2. Task-level correlation analysis for the image branch. Pearson correlations are computed from the selected image-model subset and compare semantic accuracy with spatial evidence quality.
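For readers who want to reproduce this kind of task-level coupling on their own results, the sketch below implements the Pearson correlation coefficient in plain Python. The function name and the toy score lists are hypothetical; the per-model, per-task values behind Table 2 are not reproduced here.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model scores for one task (semantic Acc vs. Acc@0.5).
semantic = [30.0, 35.0, 40.0, 45.0]
spatial = [5.0, 15.0, 10.0, 20.0]
print(f"r = {pearson_r(semantic, spatial):+.2f}")
```

A value near zero, as reported for Landmark-Relative Position Reasoning, means that knowing a model's option accuracy says almost nothing about its box quality on that task.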

Video Task Results and Analysis

Video results show tighter evidence coupling, but long-horizon behavior semantics remain difficult

Table 3 reports the complete video-task results. The video tasks include Composite Behavior Recognition with temporal localization, Atomic Behavior Recognition with temporal localization, and Landmark Visibility Counting and Interval Reasoning. They examine the model's understanding of the UAV's own flight behavior and the visibility process of external targets. Unlike image tasks, where evidence is a target box, video tasks use temporal intervals. This tests whether a semantic judgment can be grounded in concrete temporal evidence.
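To make the temporal evidence metrics concrete, here is a minimal Python sketch of temporal IoU between two intervals and an F1@0.5 score over sets of predicted and ground-truth intervals. The `(start, end)` representation in seconds and the greedy one-to-one matching rule are assumptions of this sketch; the benchmark's exact matching procedure may differ.

```python
from typing import List, Tuple

# Assumed interval format: (start, end) in seconds, with end >= start.
Interval = Tuple[float, float]


def temporal_iou(a: Interval, b: Interval) -> float:
    """Overlap duration divided by the duration covered by either interval."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def interval_f1(
    preds: List[Interval], gts: List[Interval], threshold: float = 0.5
) -> float:
    """F1 where a prediction is a true positive if it matches an unused
    ground-truth interval with tIoU >= threshold (greedy matching)."""
    unmatched = list(gts)
    true_pos = 0
    for pred in preds:
        best = max(unmatched, key=lambda gt: temporal_iou(pred, gt), default=None)
        if best is not None and temporal_iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(preds)
    recall = true_pos / len(gts)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two good intervals plus one spurious prediction.
preds = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
gts = [(1.0, 9.0), (22.0, 34.0)]
print(f"F1@0.5 = {interval_f1(preds, gts):.2f}")
```

Under this formulation a spurious extra interval lowers precision and a missed interval lowers recall, so fragmented or incomplete interval predictions are penalized even when the count answer is right.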

Table 3. Complete results for video tasks. The table is copied from the leaderboard and reports overall video accuracy together with semantic and temporal evidence metrics for each released video task.

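Columns: Model, Overall Acc, then three metrics per task in this order: Composite Behavior Recognition (Acc, tIoU@50, mtIoU), Atomic Behavior Recognition (Acc, F1@50, mtIoU), Landmark Visibility Reasoning (Acc, F1@50, mtIoU). A #N before the overall score, or a bare integer from 1 to 3 after a value, marks that value's rank within its model group.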
Closed-source Models
Gemini 3 Flash #1 46.5% 1 49.3% 1 51.9% 1 54.3% 1 35.0% 1 52.6% 2 46.6% 55.2% 2 44.9% 44.3%
Gemini 3.1 Flash Lite 31.9% 22.3% 29.5% 32.2% 19.7% 40.0% 40.6% 53.7% 3 43.1% 41.6%
Mimo v2 Omni #2 38.8% 2 27.8% 3 33.1% 3 35.8% 3 29.1% 3 53.6% 1 46.9% 3 59.4% 1 52.5% 3 53.7% 1
Qwen 3.5-Flash #3 36.8% 3 32.6% 2 35.8% 2 40.1% 2 25.6% 51.7% 3 42.4% 52.3% 57.8% 1 51.3% 2
Qwen 3.5-Plus 28.6% 2.4% 9.0% 9.2% 30.0% 2 46.6% 47.7% 1 53.2% 53.9% 2 46.8% 3
Qwen 3.6-Plus 29.6% 9.8% 16.5% 16.9% 28.3% 45.9% 47.3% 2 50.8% 51.5% 46.1%
Open-source Models
GLM 4.6V 32.4% 32.4% 2 1.3% 6.7% 33.4% 2 21.6% 22.6% 31.3% 16.7% 33.8%
Kimi K2.5 30.9% 11.9% 19.1% 19.9% 30.2% 3 54.6% 2 49.2% 50.5% 49.7% 3 45.9%
Qwen 3.5-397B-A17B 23.4% 5.2% 1.7% 2.2% 24.6% 29.9% 30.0% 40.4% 20.4% 31.8%
Qwen 3.5-122B-A10B 30.6% 21.2% 28.4% 1 31.9% 1 22.1% 39.0% 34.6% 48.4% 46.4% 49.7% 2
Qwen 3.5-35B-A3B 34.0% 21.6% 21.5% 3 23.1% 3 26.2% 54.4% 50.3% 3 54.1% 3 49.7% 2 48.6%
Qwen 3.5-27B 31.3% 14.6% 21.9% 2 23.3% 2 28.0% 54.5% 3 50.7% 2 51.4% 51.7% 1 48.7% 3
Qwen 3.5-9B 29.2% 9.9% 10.1% 11.0% 27.2% 55.9% 1 51.3% 1 50.7% 33.5% 36.8%
Qwen 3.5-4B 16.5% 6.4% 1.3% 1.4% 9.4% 3.0% 2.8% 33.9% 44.3% 52.0% 1
InternVL 3.5-38B #1 37.8% 1 15.2% 11.1% 12.8% 34.2% 1 33.0% 33.1% 64.0% 1 20.0% 27.5%
InternVL 3.5-30B-A3B 28.0% 15.3% 5.8% 8.6% 24.8% 24.2% 24.6% 43.9% 7.9% 12.6%
InternVL 3.5-14B #2 36.6% 2 33.1% 1 3.9% 11.3% 22.4% 15.8% 17.4% 54.5% 2 13.9% 22.7%
InternVL 3.5-8B #3 34.1% 3 27.4% 8.5% 14.1% 23.0% 25.0% 26.1% 51.9% 16.4% 21.3%
InternVL 3.5-4B 28.9% 31.0% 3 3.5% 7.4% 21.1% 11.7% 12.1% 34.6% 16.5% 32.3%
Fine-tuned Models
SpaceR #2 28.0% 2 24.0% 2 0.0% 1 2.6% 2 26.3% 2 15.3% 3 13.9% 3 33.7% 2 15.7% 2 30.8% 2
SpaceThinker #3 25.0% 3 19.0% 3 0.0% 1 1.6% 3 22.4% 3 18.1% 1 14.1% 1 33.7% 2 13.6% 3 27.1% 3
SpaceOm #1 30.9% 1 33.2% 1 0.0% 1 3.2% 1 27.0% 1 16.7% 2 14.0% 2 32.3% 3 17.0% 1 34.2% 1
ViLaSR 17.0% 5.3% 0.0% 1 0.8% 9.7% 3.4% 5.6% 35.9% 1 10.1% 20.6%
VST-7B-RL 7.6% 0.0% 0.0% 1 0.0% 0.0% 0.0% 0.0% 22.8% 0.0% 0.0%
VST-7B-SFT 7.8% 0.0% 0.0% 1 0.0% 0.0% 0.0% 0.0% 23.4% 4.1% 8.1%

Table 3 shows that Gemini 3 Flash, Mimo v2 Omni, InternVL 3.5-38B, and Qwen 3.5-Flash are strong video models, but their advantages come from different sources. Gemini 3 Flash reaches 46.49% video overall accuracy and performs strongly on composite-behavior semantics, composite-behavior F1@0.5, atomic-behavior F1@0.5, and visibility-count accuracy, indicating a relatively balanced relationship between video semantics and temporal evidence. Mimo v2 Omni reaches 38.76% video overall accuracy and is especially strong on visibility counting at 59.38%, visibility-interval F1@0.5 at 52.52%, and visibility-interval mTIoU at 53.71%, suggesting that it is particularly good at using continuous temporal cues for target-visibility reasoning. InternVL 3.5-38B reaches 63.95% visibility-count accuracy but only 11.11% composite-behavior F1@0.5, revealing a separation between recognizing salient target occurrences and understanding complex flight behavior. Qwen 3.5-4B has only 16.54% video overall accuracy, yet its visibility-interval F1@0.5 reaches 44.33% and mTIoU reaches 51.99%, which shows that a low aggregate score can still contain exploitable local evidence ability.

The video branch asks the same kind of evidence question as the image branch, but replaces spatial localization with temporal localization. Figure 2 selects representative models from major families, including Gemini 3 Flash, Kimi K2.5, GLM 4.6V, Mimo v2 Omni, Qwen 3.6-Plus, InternVL 3.5-30B-A3B, SpaceOm, and ViLaSR. In this subset, mean video semantic performance reaches 33.20%, and the mean Temporal F1@0.5 score reaches 29.02%. The gap between semantics and evidence is smaller than in the image branch, but this does not mean the problem has been solved. Instead, the video branch exposes a more complex instability in mapping evidence to semantics and semantics back to evidence.

Some models score higher on temporal localization than on semantic recognition, meaning that they can roughly detect where key evidence occurs without reliably summarizing those segments as stable behavior or visibility semantics. Other models show the opposite pattern, with semantic scores above temporal localization scores, meaning that they can form coarse high-level judgments but cannot precisely delimit the supporting interval. The first case reflects insufficient integration from evidence to semantics, while the second case reflects insufficient backtracking from semantics to evidence.

Figure 2 data (values in %).
Model Semantic Temporal F1@0.5
Gemini 3 Flash 46.02 49.45
Kimi K2.5 34.12 44.92
GLM 4.6V 32.35 15.26
Mimo v2 Omni 40.65 48.70
Qwen 3.6-Plus 33.26 41.89
InternVL 3.5-30B-A3B 30.21 13.82
SpaceOm 30.47 13.05
ViLaSR 18.48 5.08

Figure 2. Selected video models compared by semantic score and temporal localization F1@0.5. Compared with the image branch, the two metric types are closer, but many models still fail to improve them consistently.
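The two failure patterns described for the video branch can be read directly off the Figure 2 values. The sketch below, with illustrative variable names, recomputes the subset means and splits the eight models into those whose temporal F1@0.5 exceeds their semantic score and those showing the opposite pattern.

```python
from statistics import mean

# (semantic score %, temporal F1@0.5 %) per model, read from Figure 2.
figure2 = {
    "Gemini 3 Flash": (46.02, 49.45),
    "Kimi K2.5": (34.12, 44.92),
    "GLM 4.6V": (32.35, 15.26),
    "Mimo v2 Omni": (40.65, 48.70),
    "Qwen 3.6-Plus": (33.26, 41.89),
    "InternVL 3.5-30B-A3B": (30.21, 13.82),
    "SpaceOm": (30.47, 13.05),
    "ViLaSR": (18.48, 5.08),
}

semantic_mean = mean(sem for sem, _ in figure2.values())
temporal_mean = mean(tmp for _, tmp in figure2.values())

# Evidence-leading: locates intervals better than it names the behavior
# (weak integration from evidence to semantics).
evidence_leading = [m for m, (sem, tmp) in figure2.items() if tmp > sem]
# Semantics-leading: names the behavior but cannot delimit the interval
# (weak backtracking from semantics to evidence).
semantics_leading = [m for m, (sem, tmp) in figure2.items() if sem > tmp]

print(f"mean semantic = {semantic_mean:.2f}%, mean F1@0.5 = {temporal_mean:.2f}%")
print("evidence-leading: ", evidence_leading)
print("semantics-leading:", semantics_leading)
```

In this subset the split is even, four models on each side, so similar subset means conceal two opposite kinds of mismatch.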

Task-level correlations in Table 4 show clear internal differences within the video branch. Landmark Visibility Counting and Interval Reasoning has the strongest semantic-temporal correlation at Pearson r = 0.84, meaning that a model that correctly counts visibility events is often more likely to produce reasonable intervals. Atomic Behavior Recognition follows at r = 0.68. Composite Behavior Recognition has the weakest correlation at r = 0.53, showing that high-level flight-pattern recognition depends more heavily on cross-segment semantic integration and remains one of the hardest parts of the video branch. The complete video leaderboard further shows that InternVL 3.5-38B, Mimo v2 Omni, and Gemini 3 Flash are often strong on different task types, but no model yet forms a unified capability that works reliably across all video tasks.

Overall, temporal evidence recovery in videos is stronger than spatial grounding in images, but composite behavior recognition still shows a clear semantic-interval disconnect. Models can more easily identify how many times a target appears or where certain atomic actions occur than integrate multiple local segments into a complete high-level flight behavior. Long-horizon behavior semantics therefore remains a major bottleneck for current multimodal large language models.

Task Pearson r Semantic Avg. Temporal Avg.
Atomic Flight Behavior Recognition +0.68 27.19% 34.09%
Composite Flight Behavior Recognition +0.53 23.13% 15.95%
Landmark Visibility Counting and Interval Reasoning +0.84 44.91% 31.29%

Table 4. Task-level correlation analysis for the video branch. Pearson correlations compare semantic scores with temporal evidence quality.

Cross-Media Capability Transfer

Aggregate results show transfer, but not a stable cross-media representation

To compare stability across media, Table 5 summarizes image performance, video performance, self-aware cognition, and environment-aware cognition. It is both a combined leaderboard and a shared reference for cross-media transfer and dual-cognition imbalance.

Table 5. Combined experimental results. The table reports paired models with valid image and video results and aggregates performance by media and by cognition axis.

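Columns: Model, Overall Acc, By Media (Image, Video), By Cognition (Self-Aware, Environment-Aware). A #N before the overall score, or a bare integer from 1 to 3 after a value, marks that value's rank within its model group.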
Closed-source Models
Gemini 3 Flash #1 48.4% 1 50.2% 1 46.5% 1 44.4% 1 54.2% 1
Gemini 3.1 Flash Lite 33.2% 34.5% 31.9% 27.5% 3 41.3%
Qwen 3.6-Plus 34.5% 39.3% 3 29.6% 26.1% 47.2%
Qwen 3.5-Plus 33.5% 38.5% 28.6% 22.5% 49.9% 3
Qwen 3.5-Flash #2 38.9% 2 40.9% 2 36.8% 3 28.7% 2 53.1% 2
Mimo v2 Omni #3 35.0% 3 31.2% 38.8% 2 26.7% 44.7%
Open-source Models
GLM 4.6V 32.6% 32.8% 32.4% 30.9% 1 34.9%
Kimi K2.5 32.5% 34.1% 30.9% 27.3% 40.0%
Qwen 3.5-397B-A17B 31.4% 39.4% 23.4% 23.3% 44.9%
Qwen 3.5-122B-A10B #3 36.4% 3 42.3% 2 30.6% 28.4% 2 49.1%
Qwen 3.5-35B-A3B #1 37.9% 1 41.9% 3 34.0% 3 26.2% 54.9% 1
Qwen 3.5-27B #2 37.8% 2 44.3% 1 31.3% 27.8% 3 53.3% 2
Qwen 3.5-9B 35.1% 40.9% 29.2% 23.8% 52.1% 3
Qwen 3.5-4B 27.8% 39.0% 16.5% 19.2% 42.9%
InternVL 3.5-30B-A3B 30.0% 32.0% 28.0% 22.9% 40.2%
InternVL 3.5-14B 33.9% 31.1% 36.6% 1 27.5% 41.4%
InternVL 3.5-8B 30.6% 27.1% 34.1% 2 25.9% 35.8%
InternVL 3.5-4B 28.3% 27.8% 28.9% 26.0% 31.2%
Fine-tuned Models
SpaceR #1 31.0% 1 34.0% 3 28.0% 2 23.7% 2 41.6% 2
SpaceThinker #3 26.9% 3 28.8% 25.0% 3 22.3% 3 33.6%
SpaceOm #2 30.1% 2 29.3% 30.9% 1 27.4% 1 33.4%
ViLaSR 25.2% 33.4% 17.0% 14.0% 42.9% 1
VST-7B-RL 21.9% 36.1% 1 7.6% 13.3% 38.0%
VST-7B-SFT 21.9% 36.1% 2 7.8% 12.8% 38.8% 3

Table 5 shows that Gemini 3 Flash, Qwen 3.5-Flash, Qwen 3.5-35B-A3B, and Qwen 3.5-27B occupy leading positions in the combined leaderboard. This indicates that stronger current models can achieve relatively high dual-cognition accuracy under both image and video media. At the same time, similar combined scores can hide very different media structures. Mimo v2 Omni reaches 31.18% image overall accuracy and 38.76% video overall accuracy, making it more video-oriented. Qwen 3.5-4B reaches 38.99% image accuracy but only 16.54% video accuracy, showing a large media gap. InternVL 3.5-14B and InternVL 3.5-8B both score higher on video than on image, suggesting that the InternVL family has a video-side advantage. Aggregate scores therefore cannot replace media-wise analysis.
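A short Python sketch makes the point about hidden media structure concrete. It takes a few rows from Table 5 and computes the video-minus-image gap next to the combined score; the helper name `media_gap` and the tuple layout are illustrative choices.

```python
# (overall, image, video) accuracy in %, copied from Table 5.
table5 = {
    "Mimo v2 Omni": (35.0, 31.2, 38.8),
    "Qwen 3.5-9B": (35.1, 40.9, 29.2),
    "Qwen 3.5-4B": (27.8, 39.0, 16.5),
    "InternVL 3.5-14B": (33.9, 31.1, 36.6),
    "InternVL 3.5-8B": (30.6, 27.1, 34.1),
}


def media_gap(row):
    """Video accuracy minus image accuracy, in percentage points."""
    _, image, video = row
    return video - image


for name, row in table5.items():
    gap = media_gap(row)
    leaning = "video" if gap > 0 else "image"
    print(f"{name:18s} overall={row[0]:.1f}%  gap={gap:+.1f} pts  ({leaning}-leaning)")
```

Mimo v2 Omni and Qwen 3.5-9B differ by only 0.1 points in combined accuracy, yet their gaps have opposite signs, so the combined column alone cannot tell them apart.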

Cross-media correlation supports the same conclusion. The Pearson correlation between the image score and the video score reported in Table 6 and plotted in Figure 3 is 0.59, a moderate positive relationship. Among 23 paired models, 18 score higher on video than on image, while only 5 score higher on image. These scores are distinct from the overall accuracy columns of Table 5, where most models are more accurate on image than on video. The result shows that cross-media transfer is real. Models that are strong on the image side are usually not weak on the video side, so image and video are not completely separate capability spaces. However, the transfer remains far from stable.

Comparison Paired Models Pearson r Image Avg. Video Avg. Gap Video Higher Image Higher
Overall Image vs Video 23 +0.59 19.48% 23.09% +3.62 pts 18 5

Table 6. Cross-media correlation analysis computed from the paired model pool used by the combined leaderboard.

Figure 3 and Table 6 show that the transfer is not yet stable or consistent. Points do not cluster tightly along the diagonal. Instead, they are visibly scattered and shifted. The video mean is 23.09%, higher than the image mean of 19.48%, with a gap of 3.62 percentage points. This suggests that current models can more easily form some coarse judgment when continuous temporal cues are available. However, the video-side advantage does not mean that dual cognition has been solved. It more likely means that dynamic cues provide shortcuts. A reliable UAV agent must remain consistent across single-frame spatial localization, cross-view matching, long-horizon behavior understanding, and temporal interval evidence.

Image Score vs Video Score (Figure 3 data, values in %)
Model Family Image Score Video Score
InternVL 3.5-14B InternVL 18.13 24.13
InternVL 3.5-30B-A3B InternVL 16.01 18.54
InternVL 3.5-4B InternVL 15.45 18.85
InternVL 3.5-8B InternVL 19.14 22.17
Qwen 3.5-122B-A10B Qwen 27.49 28.52
Qwen 3.5-27B Qwen 29.90 26.39
Qwen 3.5-35B-A3B Qwen 22.79 30.19
Qwen 3.5-397B-A17B Qwen 19.89 18.00
Qwen 3.5-4B Qwen 22.80 15.87
Qwen 3.5-9B Qwen 26.47 23.56
SpaceOm SpaceOm 13.38 19.32
SpaceR SpaceR 13.64 21.64
SpaceThinker SpaceThinker 11.48 16.77
VST-7B-SFT VST 12.06 4.36
ViLaSR ViLaSR 11.14 13.96
Gemini 3 Flash Gemini 17.25 33.96
Gemini 3.1 Flash Lite Gemini 12.18 25.91
Kimi K2.5 Kimi 22.52 26.24
Qwen 3.5-Flash Qwen 25.95 34.78
Qwen 3.5-Plus Qwen 24.73 25.69
Qwen 3.6-Plus Qwen 24.48 25.89
Mimo v2 Omni Mimo 25.77 32.20
GLM 4.6V GLM 15.35 24.21

Figure 3. Image score versus video score for the 23 paired models. The scatter shows a positive trend but does not collapse tightly onto the diagonal, indicating that dual cognition can transfer between image and video media but has not yet become a stable cross-media representation.

Qualitative evidence in Figure 4 supports the same conclusion. The left case is a self-aware image task, where the model must judge the UAV's position relative to a landmark-centered coordinate system and return the landmark box in the current view. GPT 5.3 Chat and Claude Sonnet 4.6 roughly locate the target landmark, but they do not correctly transform the reference frame from the landmark-centered coordinate system to the current UAV view, so they choose the wrong direction. Gemini 3.1 Flash Lite also fails to complete this coordinate-frame conversion and does not output a valid localization box. Qwen 3.5-Plus selects the correct direction, but its predicted box is still too loose and shifted, showing that semantic correctness does not guarantee precise spatial grounding.

The right case is an environment-aware video task, where the model must count how many times the landmark appears and recover the full visible interval. Qwen 3.5-Plus predicts the correct count, but its interval covers only part of the visible process. GPT 5.3 Chat and Gemini 3.1 Flash Lite fail to continuously localize the landmark across the flight video, which leads to incorrect visibility counts and fragmented intervals. Claude Sonnet 4.6 gives the best response in this example because it tracks the landmark continuously and provides a more complete interval. Together, the two examples show that UAV-DualCog requires both the semantic decision and the evidence path behind it: self-aware cognition depends on reference-frame transformation, while environment-aware cognition depends on continuous target tracking and temporal evidence completeness.

Figure 4. Qualitative analysis examples. The left case is a self-aware image task that requires landmark-centered reference-frame transformation and spatial evidence. The right case is an environment-aware video task that requires landmark visibility counting and complete temporal interval evidence.

Dual-Cognition Difference Analysis

Dual cognition is biased toward environment-aware reasoning

Cross-media analysis asks whether the same capability remains stable between image and video evidence. Dual-cognition difference analysis asks whether models understand the UAV itself and the external environment equally well.

Self-Aware Cognition Results and Analysis

Self-aware cognition remains the weaker branch

Table 7 reports the self-aware cognition leaderboard. The self-aware tasks include Landmark-Relative Position Reasoning and Future Observation Prediction on the image side, and Composite Behavior Recognition and Atomic Behavior Recognition on the video side. Together, they require a model to understand where the UAV itself is, what it will observe, and how it is moving. This is more demanding than simply recognizing external objects because it requires subject-state modeling, reference-frame transformation, and view-change reasoning.

Table 7. Complete leaderboard for self-aware cognition. The table groups self-aware image and video tasks and reports both semantic and evidence metrics.

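Columns: Model, Overall Acc, then three metrics per task in this order: Landmark-Relative Position Reasoning (Acc, IoU@50, mIoU) and Future Observation Prediction (Acc, IoU@50, mIoU) as image tasks, Composite Behavior Recognition (Acc, tIoU@50, mtIoU) and Atomic Behavior Recognition (Acc, F1@50, mtIoU) as video tasks. A #N before the overall score, or a bare integer from 1 to 3 after a value, marks that value's rank within its model group.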
Closed-source Models
Gemini 3 Flash #1 44.4% 1 47.6% 1 0.7% 1.2% 45.9% 1 0.7% 0.9% 49.3% 1 51.9% 1 54.3% 1 35.0% 1 52.6% 2 46.6%
Gemini 3.1 Flash Lite #3 27.5% 3 39.1% 2 0.0% 0.0% 28.6% 2.5% 3.5% 22.3% 29.5% 32.2% 19.7% 40.0% 40.6%
Qwen 3.6-Plus 26.1% 32.8% 3 16.6% 27.1% 3 33.5% 2 14.0% 3 26.7% 2 9.8% 16.5% 16.9% 28.3% 45.9% 47.3% 2
Qwen 3.5-Plus 22.5% 28.6% 21.4% 3 27.1% 29.0% 3 13.4% 23.5% 3 2.4% 9.0% 9.2% 30.0% 2 46.6% 47.7% 1
Qwen 3.5-Flash #2 28.7% 2 27.8% 21.6% 2 29.8% 2 28.9% 15.9% 2 23.3% 32.6% 2 35.8% 2 40.1% 2 25.6% 51.7% 3 42.4%
Mimo v2 Omni 26.7% 28.7% 38.0% 1 34.3% 1 21.3% 29.8% 1 29.7% 1 27.8% 3 33.1% 3 35.8% 3 29.1% 3 53.6% 1 46.9% 3
Open-source Models
GLM 4.6V #1 30.9% 1 30.8% 9.4% 8.7% 27.1% 8.6% 8.7% 32.4% 2 1.3% 6.7% 33.4% 1 21.6% 22.6%
Kimi K2.5 27.3% 34.3% 1 27.1% 30.4% 32.7% 15.0% 21.7% 11.9% 19.1% 19.9% 30.2% 2 54.6% 2 49.2%
Qwen 3.5-397B-A17B 23.3% 27.4% 3.3% 12.6% 35.8% 3 10.0% 21.4% 5.2% 1.7% 2.2% 24.6% 29.9% 30.0%
Qwen 3.5-122B-A10B #2 28.4% 2 32.6% 2 29.1% 2 31.6% 3 37.8% 1 20.6% 1 28.4% 3 21.2% 28.4% 1 31.9% 1 22.1% 39.0% 34.6%
Qwen 3.5-35B-A3B 26.2% 29.5% 27.6% 3 30.5% 27.4% 0.5% 0.8% 21.6% 21.5% 3 23.1% 3 26.2% 54.4% 50.3% 3
Qwen 3.5-27B #3 27.8% 3 31.6% 3 42.8% 1 39.9% 1 37.1% 2 17.1% 3 25.8% 14.6% 21.9% 2 23.3% 2 28.0% 3 54.5% 3 50.7% 2
Qwen 3.5-9B 23.8% 29.0% 15.9% 25.8% 29.0% 17.8% 2 29.8% 1 9.9% 10.1% 11.0% 27.2% 55.9% 1 51.3% 1
Qwen 3.5-4B 19.2% 30.9% 12.8% 23.1% 30.2% 11.2% 21.7% 6.4% 1.3% 1.4% 9.4% 3.0% 2.8%
InternVL 3.5-30B-A3B 22.9% 27.3% 4.9% 19.9% 24.0% 7.0% 15.5% 15.3% 5.8% 8.6% 24.8% 24.2% 24.6%
InternVL 3.5-14B 27.5% 22.8% 7.6% 22.2% 31.9% 6.6% 22.0% 33.1% 1 3.9% 11.3% 22.4% 15.8% 17.4%
InternVL 3.5-8B 25.9% 28.5% 19.7% 34.2% 2 24.5% 8.1% 28.7% 2 27.4% 8.5% 14.1% 23.0% 25.0% 26.1%
InternVL 3.5-4B 26.0% 25.6% 3.1% 19.8% 26.4% 10.8% 23.2% 31.0% 3 3.5% 7.4% 21.1% 11.7% 12.1%
Fine-tuned Models
SpaceR #2 23.7% 2 20.6% 2.4% 2 6.0% 2 24.0% 1.2% 3 1.9% 3 24.0% 2 0.0% 1 2.6% 2 26.3% 2 15.3% 3 13.9% 3
SpaceThinker #3 22.3% 3 22.6% 1.2% 3 3.0% 3 25.4% 1 1.9% 2 5.7% 2 19.0% 3 0.0% 1 1.6% 3 22.4% 3 18.1% 1 14.1% 1
SpaceOm #1 27.4% 1 24.7% 3 2.8% 1 6.7% 1 24.5% 2 3.6% 1 10.9% 1 33.2% 1 0.0% 1 3.2% 1 27.0% 1 16.7% 2 14.0% 2
ViLaSR 14.0% 18.8% 0.0% 0.0% 22.1% 0.0% 0.0% 5.3% 0.0% 1 0.8% 9.7% 3.4% 5.6%
VST-7B-RL 13.3% 29.2% 1 0.0% 0.0% 24.1% 3 0.0% 0.0% 0.0% 0.0% 1 0.0% 0.0% 0.0% 0.0%
VST-7B-SFT 12.8% 28.1% 2 0.0% 0.0% 23.0% 0.0% 0.0% 0.0% 0.0% 1 0.0% 0.0% 0.0% 0.0%

Table 7 shows that self-aware scores are low overall. Gemini 3 Flash ranks first with 44.44%, but this is already much lower than its environment-aware global accuracy of 54.20%. The second-ranked GLM 4.6V reaches only 30.91%, and many other models cluster in the 20% range. Current models therefore have clearly insufficient modeling of the UAV's own state. The difficulty is that self-aware tasks do not merely ask a model to identify objects. They require it to understand the UAV's position relative to landmarks, what it will see after a view change, what flight behavior is occurring, and which time interval supports that behavior. These problems are closer to subject modeling in embodied intelligence and require simultaneous reasoning over reference views, current views, motion changes, and temporal processes.
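The imbalance between the two cognition axes can be quantified from the By Cognition columns of Table 5. The sketch below computes the environment-aware minus self-aware gap for every model in that table; the variable names are illustrative, and the values are copied from the table.

```python
# (self-aware %, environment-aware %) per model, copied from Table 5.
cognition = {
    "Gemini 3 Flash": (44.4, 54.2),
    "Gemini 3.1 Flash Lite": (27.5, 41.3),
    "Qwen 3.6-Plus": (26.1, 47.2),
    "Qwen 3.5-Plus": (22.5, 49.9),
    "Qwen 3.5-Flash": (28.7, 53.1),
    "Mimo v2 Omni": (26.7, 44.7),
    "GLM 4.6V": (30.9, 34.9),
    "Kimi K2.5": (27.3, 40.0),
    "Qwen 3.5-397B-A17B": (23.3, 44.9),
    "Qwen 3.5-122B-A10B": (28.4, 49.1),
    "Qwen 3.5-35B-A3B": (26.2, 54.9),
    "Qwen 3.5-27B": (27.8, 53.3),
    "Qwen 3.5-9B": (23.8, 52.1),
    "Qwen 3.5-4B": (19.2, 42.9),
    "InternVL 3.5-30B-A3B": (22.9, 40.2),
    "InternVL 3.5-14B": (27.5, 41.4),
    "InternVL 3.5-8B": (25.9, 35.8),
    "InternVL 3.5-4B": (26.0, 31.2),
    "SpaceR": (23.7, 41.6),
    "SpaceThinker": (22.3, 33.6),
    "SpaceOm": (27.4, 33.4),
    "ViLaSR": (14.0, 42.9),
    "VST-7B-RL": (13.3, 38.0),
    "VST-7B-SFT": (12.8, 38.8),
}

# Positive gap: environment-aware accuracy exceeds self-aware accuracy.
gaps = {name: env - self_ for name, (self_, env) in cognition.items()}
env_stronger = sum(gap > 0 for gap in gaps.values())
most_balanced = min(gaps, key=gaps.get)
least_balanced = max(gaps, key=gaps.get)

print(f"{env_stronger}/{len(gaps)} models score higher on environment-aware tasks")
print(f"smallest gap: {most_balanced} ({gaps[most_balanced]:.1f} pts)")
print(f"largest gap:  {least_balanced} ({gaps[least_balanced]:.1f} pts)")
```

Every model in Table 5 has a positive gap, so the environment-aware bias holds across closed-source, open-source, and fine-tuned groups rather than being driven by a few outliers.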

Different models also show internal splits within self-aware cognition. Qwen 3.5-27B reaches 42.77% BBox Acc@0.5 on Landmark-Relative Position Reasoning, showing strong evidence in part of spatial self-localization. However, its composite-behavior semantic score is only 14.62%, indicating that spatial self-localization does not naturally transfer to complex behavior understanding. Qwen 3.5-9B reaches 55.92% F1@0.5 on Atomic Behavior Recognition but only 9.87% on composite-behavior semantics, suggesting that it can detect local motion segments without combining them into high-level flight behavior. Self-aware cognition is therefore not a single skill. It consists of cross-view spatial mapping, post-motion observation prediction, behavior recognition, and temporal localization.

Environment-Aware Cognition Results and Analysis

Environment-aware cognition is higher, but still not fully evidence-grounded

Table 8 reports the environment-aware cognition leaderboard. The environment-aware tasks include Self-Relative Position Reasoning and Landmark-Driven Action Decision on the image side, and Landmark Visibility Counting and Interval Reasoning on the video side. Compared with self-aware cognition, environment-aware cognition focuses on where external targets are, how to act according to target position, and when targets become visible in a video.

Table 8. Complete leaderboard for environment-aware cognition. The table groups environment-aware image and video tasks and reports both semantic and evidence metrics.

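Columns: Model | Overall Acc | Image Tasks, Self-Relative Position Reasoning (Acc, IoU@50, mIoU) | Image Tasks, Landmark-Driven Action Decision (Acc, IoU@50, mIoU) | Video Tasks, Landmark Visibility Reasoning (Acc, F1@50, mtIoU)
Rank markers: "#n" after a model name and a bare 1, 2, or 3 after a value mark the top three within each model group. IoU@50 and F1@50 correspond to BBox Acc@0.5 and F1@0.5 in the text.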
Closed-source Models
Gemini 3 Flash #1 54.2% 1 56.2% 1 0.1% 0.9% 51.3% 2 0.5% 1.2% 55.2% 2 44.9% 44.3%
Gemini 3.1 Flash Lite 41.3% 39.9% 0.4% 0.8% 30.4% 0.4% 0.6% 53.7% 3 43.1% 41.6%
Qwen 3.6-Plus 47.2% 48.4% 3 7.4% 3 19.0% 3 42.5% 6.0% 19.9% 3 50.8% 51.5% 46.1%
Qwen 3.5-Plus #3 49.9% 3 47.6% 9.7% 1 19.3% 2 48.9% 3 8.3% 3 20.0% 2 53.2% 53.9% 2 46.8% 3
Qwen 3.5-Flash #2 53.1% 2 52.4% 2 8.8% 2 19.4% 1 54.5% 1 8.9% 2 20.1% 1 52.3% 57.8% 1 51.3% 2
Mimo v2 Omni 44.7% 36.5% 8.8% 2 17.4% 38.2% 9.2% 1 17.3% 59.4% 1 52.5% 3 53.7% 1
Open-source Models
GLM 4.6V 34.9% 29.2% 3.3% 6.7% 44.2% 2.5% 5.0% 31.3% 16.7% 33.8%
Kimi K2.5 40.0% 39.4% 5.0% 15.6% 30.1% 4.7% 14.4% 50.5% 49.7% 3 45.9%
Qwen 3.5-397B-A17B 44.9% 47.9% 2.8% 13.0% 46.2% 4.4% 13.9% 40.4% 20.4% 31.8%
Qwen 3.5-122B-A10B 49.1% 49.7% 3 6.1% 19.4% 3 49.0% 6.8% 18.8% 3 48.4% 46.4% 49.7% 2
Qwen 3.5-35B-A3B #1 54.9% 1 53.2% 2 6.1% 14.5% 57.5% 1 7.2% 3 18.5% 54.1% 2 49.7% 2 48.6%
Qwen 3.5-27B #2 53.3% 2 57.8% 1 7.3% 2 20.4% 2 50.6% 3 7.5% 2 21.0% 2 51.4% 51.7% 1 48.7% 3
Qwen 3.5-9B #3 52.1% 3 53.2% 2 10.3% 1 20.9% 1 52.2% 2 11.5% 1 22.1% 1 50.7% 33.5% 36.8%
Qwen 3.5-4B 42.9% 47.5% 6.7% 3 17.7% 47.5% 6.3% 18.0% 33.9% 44.3% 52.0% 1
InternVL 3.5-30B-A3B 40.2% 35.4% 0.6% 7.6% 41.3% 1.3% 7.3% 43.9% 7.9% 12.6%
InternVL 3.5-14B 41.4% 28.5% 4.7% 12.9% 41.3% 5.0% 11.9% 54.5% 1 13.9% 22.7%
InternVL 3.5-8B 35.8% 24.7% 4.2% 12.3% 30.8% 2.5% 11.4% 51.9% 3 16.4% 21.3%
InternVL 3.5-4B 31.2% 22.4% 1.0% 8.0% 36.7% 1.0% 7.5% 34.6% 16.5% 32.3%
Fine-tuned Models
SpaceR #2 41.6% 2 42.4% 1.8% 2 5.4% 2 48.8% 2 2.3% 1 6.8% 1 33.7% 2 15.7% 2 30.8% 2
SpaceThinker 33.6% 32.8% 0.6% 3 5.1% 3 34.3% 0.7% 3 4.7% 3 33.7% 2 13.6% 3 27.1% 3
SpaceOm 33.4% 35.2% 2.4% 1 8.6% 1 32.8% 1.8% 2 6.5% 2 32.3% 3 17.0% 1 34.2% 1
ViLaSR #1 42.9% 1 43.1% 3 0.0% 0.0% 49.7% 1 0.0% 0.0% 35.9% 1 10.1% 20.6%
VST-7B-RL 38.0% 47.2% 1 0.0% 0.1% 44.0% 0.0% 0.0% 22.8% 0.0% 0.0%
VST-7B-SFT #3 38.8% 3 46.5% 2 0.0% 0.2% 46.6% 3 0.0% 0.2% 23.4% 4.1% 8.1%

Table 8 shows that environment-aware cognition is substantially higher than self-aware cognition. The top five models all exceed 52% in environment-aware overall accuracy. Qwen 3.5-35B-A3B reaches 54.95%, Gemini 3 Flash reaches 54.20%, and Qwen 3.5-27B and Qwen 3.5-Flash reach 53.26% and 53.09%, respectively. Current models therefore handle external target position, action decisions, and visibility processes more easily than the UAV's own state.

However, high environment-aware scores still require fine-grained interpretation. Gemini 3 Flash has high environment-aware overall accuracy, but its image-side BBox Acc@0.5 is only 0.10% on Self-Relative Position Reasoning (env_where) and 0.49% on Landmark-Driven Action Decision (env_how). Its environment-aware advantage mainly comes from option judgment and video visibility, not precise spatial evidence. Qwen 3.5-Flash is more balanced, with stronger metrics on both image tasks and on visibility intervals, including a visibility-interval F1@0.5 of 57.84%. Mimo v2 Omni is not strong on image-side environment tasks, but it reaches 59.38% visibility-count accuracy, 52.52% visibility-interval F1@0.5, and 53.71% visibility-interval mTIoU. Its advantage therefore mainly comes from continuous video cues. That environment-aware cognition is higher than self-aware cognition does not mean that models fully understand the external world. It means that external object recognition, visibility counting, and coarse directional judgment are relatively easier for current models.
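The gap between option judgment and spatial evidence can be illustrated with a short sketch that scores the same set of answers in two ways. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` pixel format and treats a box as correct when its IoU with the ground truth is at least 0.5. The sample records and the names `box_iou` and `score_samples` are hypothetical and are not taken from the benchmark's evaluation code.

```python
from typing import Dict, List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score_samples(samples: List[Dict], thr: float = 0.5) -> Tuple[float, float]:
    """Return (option accuracy, box accuracy at the IoU threshold)."""
    option_hits = 0
    box_hits = 0
    for s in samples:
        option_hits += s["pred_option"] == s["gt_option"]
        pred_box: Optional[Box] = s["pred_box"]  # None when no box was produced
        if pred_box is not None and box_iou(pred_box, s["gt_box"]) >= thr:
            box_hits += 1
    return option_hits / len(samples), box_hits / len(samples)


# Hypothetical samples: mostly correct options, mostly unusable boxes.
gt = (10.0, 10.0, 50.0, 50.0)
samples = [
    {"pred_option": "A", "gt_option": "A", "pred_box": (10.0, 10.0, 50.0, 50.0), "gt_box": gt},
    {"pred_option": "B", "gt_option": "B", "pred_box": (30.0, 30.0, 70.0, 70.0), "gt_box": gt},
    {"pred_option": "C", "gt_option": "C", "pred_box": None, "gt_box": gt},
    {"pred_option": "D", "gt_option": "A", "pred_box": (200.0, 200.0, 240.0, 240.0), "gt_box": gt},
]
option_acc, bbox_acc = score_samples(samples)
print(f"option accuracy = {option_acc:.2f}, BBox Acc@0.5 = {bbox_acc:.2f}")
```

Reporting the two numbers side by side, as the leaderboard does, keeps a model that picks the right option without a usable box from looking evidence-grounded.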

Dual-Cognition Difference Analysis

Self-aware and environment-aware cognition are only weakly balanced

After presenting the two cognition branches separately, we can compare whether they are balanced. Table 9 reports paired statistics between self-aware overall accuracy and environment-aware overall accuracy, and Figure 5 visualizes the same paired relationship.

Comparison Paired Models Pearson r Self Avg. Environment Avg. Gap Environment Higher Self Higher
Overall Self vs Environment 24 +0.39 25.5% 43.92% +18.42 pts 24 0

Table 9. Correlation and balance analysis between self-aware and environment-aware cognition.

Self-Aware Score vs Environment-Aware Score

[Scatter plot: one point per model, colored by model family (InternVL, Qwen, SpaceOm, SpaceR, SpaceThinker, VST, ViLaSR, Gemini, Kimi, Mimo, GLM). x-axis: Self-Aware Score, 0 to 44%. y-axis: Environment-Aware Score, 0 to 55%.]

Figure 5. Self-aware overall accuracy versus environment-aware overall accuracy. The scatter shows a positive trend, but all 24 points lie in the region where environment-aware cognition is higher, indicating a persistent imbalance between the two cognition axes.

Table 9 shows that the Pearson correlation between self-aware overall accuracy and environment-aware overall accuracy is 0.39, indicating only a limited positive relationship. At the same time, the mean self-aware score is 25.50%, while the mean environment-aware score is 43.92%, a gap of 18.42 percentage points. All 24 paired models perform better on environment-aware cognition than on self-aware cognition, and none is stronger on self-aware cognition in aggregate.
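The paired statistics can be recomputed with a few lines of Python. The sketch below uses the per-model values plotted in Figure 5 as input. The variable names are illustrative, and the Pearson coefficient is computed by hand, which is an assumption about how the table was produced, not a description of the authors' code.

```python
import math

# (model, self-aware score %, environment-aware score %), as plotted in Figure 5.
points = [
    ("InternVL 3.5-14B", 27.53, 44.70), ("InternVL 3.5-30B-A3B", 22.86, 41.14),
    ("InternVL 3.5-4B", 26.00, 32.06), ("InternVL 3.5-8B", 25.86, 39.80),
    ("Qwen 3.5-122B-A10B", 28.44, 48.90), ("Qwen 3.5-27B", 27.84, 52.78),
    ("Qwen 3.5-35B-A3B", 26.19, 54.74), ("Qwen 3.5-397B-A17B", 23.27, 43.75),
    ("Qwen 3.5-4B", 19.19, 40.67), ("Qwen 3.5-9B", 23.76, 51.71),
    ("SpaceOm", 27.36, 33.15), ("SpaceR", 23.73, 39.65),
    ("SpaceThinker", 22.32, 33.62), ("VST-7B-RL", 26.66, 45.61),
    ("VST-7B-SFT", 12.79, 34.98), ("ViLaSR", 13.99, 41.16),
    ("Gemini 3 Flash", 44.44, 54.44), ("Gemini 3.1 Flash Lite", 27.45, 44.43),
    ("Kimi K2.5", 27.26, 42.62), ("Qwen 3.5-Flash", 28.73, 52.90),
    ("Qwen 3.5-Plus", 22.52, 50.73), ("Qwen 3.6-Plus", 26.11, 48.12),
    ("Mimo v2 Omni", 26.72, 48.36), ("GLM 4.6V", 30.91, 33.99),
]

xs = [p[1] for p in points]  # self-aware scores
ys = [p[2] for p in points]  # environment-aware scores
n = len(points)

mean_self = sum(xs) / n
mean_env = sum(ys) / n
gap = mean_env - mean_self

# Pearson r from centered sums of products and squares.
cov = sum((x - mean_self) * (y - mean_env) for x, y in zip(xs, ys))
var_x = sum((x - mean_self) ** 2 for x in xs)
var_y = sum((y - mean_env) ** 2 for y in ys)
pearson_r = cov / math.sqrt(var_x * var_y)

env_higher = sum(y > x for x, y in zip(xs, ys))
self_higher = sum(x > y for x, y in zip(xs, ys))

print(f"n={n} r={pearson_r:+.2f} self={mean_self:.2f}% env={mean_env:.2f}% "
      f"gap={gap:+.2f} pts env_higher={env_higher} self_higher={self_higher}")
```

A moderate correlation combined with a gap that has the same sign for every model is what separates "weakly balanced" from "uncorrelated": models that are better on one axis tend to be somewhat better on the other, yet the environment-aware axis leads throughout.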

This result reveals the central experimental finding of UAV-DualCog. Current multimodal large language models more easily learn to understand external targets and environments than to understand the UAV itself. Environment-aware tasks rely more on object appearance, directional relations, and visibility judgment, while self-aware tasks require reference frames, UAV viewpoint changes, flight-action patterns, and behavior phases. This is closer to subject modeling in embodied intelligence. Self-aware cognition is therefore the most visible weakness in current models.

Table 5 further shows that leading combined models have the same structural imbalance. Gemini 3 Flash scores 44.44% on self-aware cognition and 54.20% on environment-aware cognition. Qwen 3.5-Flash scores 28.73% and 53.09%. Qwen 3.5-35B-A3B scores 26.19% and 54.95%. A high combined score therefore does not imply balanced dual cognition. Many models rank highly because environment-aware scores support the aggregate. Without a dual-cognition framework, the weakness in understanding the UAV itself would be easy to hide behind the overall score.

Deeper Cross Analysis

Model-family differences and future capability directions

The two analyses above also yield broader conclusions about model families, architecture, fine-tuning differences, and the capability gaps future UAV multimodal models need to address.

First, model families show different strengths and weaknesses, so aggregate scores cannot replace capability-structure analysis. Among closed-source models, Gemini 3 Flash ranks first with 48.35% combined accuracy and remains strong on both image at 50.22% and video at 46.49%. However, several of its image BBox Acc@0.5 values are close to 0%, showing that a strong semantic model can still lack precise spatial evidence. GPT 5.3 Chat and Claude Sonnet 4.6 are strong in image option accuracy, but closed-source models as a group do not automatically solve evidence alignment. Among open-source models, the Qwen 3.5 family is the most stable. Qwen 3.5-35B-A3B and Qwen 3.5-27B, together with the closed-source Qwen 3.5-Flash, all exceed 53% in environment-aware overall accuracy, and Qwen 3.5-27B reaches 42.77% BBox Acc@0.5 on Landmark-Relative Position Reasoning, showing local spatial grounding potential. InternVL models are more video-oriented. InternVL 3.5-14B and InternVL 3.5-8B score higher on video than on image, and InternVL 3.5-38B is competitive on the video leaderboard. Mimo v2 Omni is strong in video overall accuracy and visibility-interval metrics, indicating that some models are especially good at using continuous temporal cues. Model-family differences are therefore not only about which model is stronger. They also indicate whether a model is strong in semantics, spatial evidence, temporal evidence, or media adaptation.

Second, model scale, MoE architecture, and existing spatial-reasoning fine-tuning do not automatically translate into reliable dual cognition on UAV-DualCog. In the Qwen family, MoE models do not improve monotonically with scale. Qwen 3.5-35B-A3B obtains 37.95% combined accuracy, almost the same as the dense Qwen 3.5-27B at 37.81%, while the larger Qwen 3.5-122B-A10B and Qwen 3.5-397B-A17B obtain 36.44% and 31.38%. Routing capacity and general knowledge scale therefore do not directly bridge reference frames, evidence localization, and temporal integration in UAV dual cognition. Spatial-reasoning fine-tuned models show a similar pattern. ViLaSR and VST-7B-RL retain some accuracy on option tasks, but many of their image BBox metrics remain at 0. General spatial reasoning or vision-language post-training is therefore not fully aligned with the structure that UAV-DualCog requires, namely option judgment plus explicit evidence. This is a capability mismatch rather than a simple scale effect.

The comparison between zero-shot and fine-tuned models sharpens this conclusion. Strong general models can already obtain relatively high option accuracy in a zero-shot setting, but their spatial or temporal evidence is unstable. Existing spatial-reasoning or vision-language post-trained models have been strengthened for specific abilities, but they do not automatically gain an advantage on UAV-DualCog, especially on explicit evidence metrics such as BBox Acc@0.5 where many results remain close to 0. UAV-DualCog therefore does not only require general spatial reasoning or generic visual question answering. It measures a more specific capability: aligned UAV-scene reasoning, dual-cognition interpretation, and structured evidence production.

The experimental analysis points to clear directions for future capability improvement. The main issue is not a single missing component, but the combination of three gaps: semantic answers are often stronger than spatial or temporal evidence, self-aware cognition lags behind environment-aware cognition, and image-video transfer remains incomplete. Future work should therefore pay attention to evidence-verifiable reasoning, matched development between self-aware and environment-aware cognition, and cross-media stability instead of treating higher aggregate accuracy as sufficient.

Conclusion

UAV-DualCog shows why dual cognition is essential for embodied UAV reasoning

The analysis supports one central conclusion: current multimodal large language models can capture partial UAV targets, directions, and dynamic cues, but they have not yet formed stable, balanced, and evidence-verifiable UAV dual cognition. For UAV embodied reasoning, this dual cognition is fundamental. A model must understand the UAV's own state, viewpoint, and motion process while also understanding external landmarks, visibility changes, and action-relevant environmental relations.

The media analysis shows that answer correctness alone is insufficient. Image tasks reveal a large semantic-spatial gap, video tasks reveal unstable mapping between temporal evidence and high-level behavior semantics, and the paired media comparison shows real but incomplete transfer. The cognition-axis comparison is even more diagnostic: environment-aware reasoning is consistently stronger than self-aware reasoning, which means that high combined scores can hide weak modeling of the UAV's own state. The model-family analysis further shows that scaling, MoE routing, or generic spatial-reasoning tuning does not automatically close these gaps.

The experimental results of UAV-DualCog show that future UAV multimodal systems should move toward matched and coordinated development of self-aware and environment-aware capabilities. A reliable UAV agent must align both forms of cognition with spatial or temporal evidence and keep them stable when the observation medium changes.