Analysis

Result Analysis

This page analyzes the released results through the benchmark's dual-cognition formulation: how current MLLMs perform on self-aware and environment-aware reasoning, how those capabilities behave when tested under image and video media, and how tightly the two cognition branches actually develop together.

Analysis Framework

This page first examines dual-cognition performance when the benchmark is delivered through image and video media, where correct answers must still be supported by explicit spatial or temporal evidence. It then asks whether the same dual-cognition capability remains stable across media and whether self-aware and environment-aware reasoning develop in a balanced way after all released task families are aggregated.

Image Media

Overall dual-cognition performance under image input, together with the spatial evidence needed to support it.

Video Media

Overall dual-cognition performance under video input, together with the temporal evidence needed to support it.

Cross-Media

Whether dual-cognition performance remains consistent when the benchmark shifts from image to video.

Cognition Balance

Whether self-aware and environment-aware reasoning remain associated and balanced after aggregating across all media.

The analysis uses the released results to ask the benchmark's central question: how far do current MLLMs already satisfy the dual-cognition requirement of reasoning about both the UAV itself and the external world when that requirement is tested under image and video observation settings? The emerging picture is clear: current models exhibit partial competence on both branches, yet neither capability is strong in its own right, and the two do not develop in a coordinated, consistent way. Current models therefore do not yet behave as if self-aware and environment-aware reasoning were one unified capability.

The sections below therefore move from media-specific evidence diagnosis to aggregate dual-cognition comparison. The first half asks how much dual-cognition performance survives once image reasoning must be grounded spatially and video reasoning must be localized temporally. The second half then asks whether overall dual-cognition ability remains stable across media and whether the two cognition branches stay associated and balanced after all released task families are aggregated.

Capability Gaps

Dual-Cognition Evidence Gaps

The plots below examine how current MLLMs acquire explicit evidence for dual-cognition reasoning under the benchmark's two observation media. The comparison pool is restricted to the model lists currently used by the task-example browser, so the resulting gaps match the same systems that readers inspect on the benchmark pages.

Image Evidence

Dual-Cognition Reasoning Under Image Evidence

The first subsection studies dual-cognition reasoning under image input. In this setting, both self-aware and environment-aware image tasks require the model not only to answer correctly but also to recover the spatial evidence that grounds that answer in the current observation.

Under image evidence, the central question is whether a model that can reason correctly about the UAV or the target landmark can also retrieve the spatial evidence that makes that reasoning verifiable. In the current selected-model subset, semantic option accuracy exceeds spatial Acc@0.5 across nearly the entire comparison pool, indicating that dual-cognition reasoning under images is still easier at the answer level than at the grounding level.

This distinction matters because the image branch asks the model to support both self-aware and environment-aware decisions with explicit spatial evidence. Some systems remain competitive so long as only option selection is measured, yet lose a substantial fraction of their apparent advantage once dual-cognition reasoning must be grounded by a verifiable bounding box in the current view.
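The grounding requirement above can be made concrete with a minimal sketch of an Acc@0.5-style metric, assuming it counts a prediction as correct when its box overlaps the reference box with IoU of at least 0.5. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the sample boxes are illustrative assumptions, not the benchmark's released evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], golds: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of samples whose predicted box reaches the IoU threshold."""
    if not golds:
        return 0.0
    hits = sum(box_iou(p, g) >= thr for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical sample: the first box is well aligned, the second is shifted.
golds = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 8), (5, 5, 15, 15)]
score = acc_at_iou(preds, golds)  # first IoU is 0.8 (hit), second is 25/175 (miss)
```

Under this reading, a model can select the right option and still score zero on a sample whenever its box only loosely covers the landmark, which is one way the semantic and spatial scores can diverge.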

The bar chart below is constructed from the same image-model subset used by the benchmark examples, with one bar showing semantic option accuracy and the other showing spatial Acc@0.5 for each model. Read in that way, the figure is not only a ranking comparison: it summarizes how much of each model's apparent dual-cognition competence survives once image reasoning must be backed by explicit spatial grounding.

Even inside the selected model subset, semantic choice remains easier than precise landmark grounding at IoU 0.5.

Across the selected image-model list, average semantic option accuracy reaches 35.98%, while spatial Acc@0.5 remains far lower at 7.02%, a gap of roughly 29 points. In other words, the present model pool can often infer the right dual-cognition answer under image input without yet matching that answer with equally reliable spatial evidence. Judgments that cannot recover sufficient supporting evidence remain harder to verify and less persuasive, even when the semantic choice itself is correct.

The task-level correlation pattern is also uneven. Landmark-Driven Action Decision shows the strongest semantic-grounding coupling in the current model subset (0.44), whereas Landmark-Relative Position Reasoning remains the weakest (0.02), indicating that the image branch does not expose one uniform grounding bottleneck: the ability to obtain spatial evidence for dual-cognition reasoning still varies from task to task.

Task	Pearson r	Semantic Avg.	Spatial Avg.
Landmark-Relative Position Reasoning	+0.02	30.35%	10.69%
Future Observation Prediction	+0.15	26.29%	7.81%
Self-Relative Position Reasoning	+0.28	41.97%	4.73%
Landmark-Driven Action Decision	+0.44	45.33%	4.85%

Image Correlation Analysis. Pearson correlations computed from the selected image-model subset, together with mean semantic and spatial scores for each released image task.
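The per-task Pearson values above relate each model's semantic score to its spatial score on the same task. A minimal sketch of that computation follows, assuming one (semantic, spatial) pair per model; the function name and the scores are hypothetical illustrations, not released results.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant list
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores (percent) for one image task.
semantic = [30.0, 40.0, 50.0, 60.0]
spatial = [5.0, 4.0, 8.0, 7.0]
r = pearson_r(semantic, spatial)  # 50 / sqrt(500 * 10), about 0.707
```

A value near zero, as reported for Landmark-Relative Position Reasoning, would mean that ranking models by semantic accuracy says almost nothing about their ranking by grounding accuracy on that task.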

Video Evidence

Dual-Cognition Reasoning Under Video Evidence

The second subsection asks the parallel question under video input. Here, dual-cognition reasoning is evaluated through temporal evidence: the model must not only recognize the correct behavior or visibility state, but also recover the interval that supports that judgment.

Under video evidence, the relevant question is whether a model that reaches the correct self-aware or environment-aware interpretation can also recover the temporal support that makes that interpretation inspectable. The current selected-model subset exhibits a related but not identical structure: semantic success and temporal localization remain more closely coupled than image semantics and spatial grounding, yet the interval metric still trails the semantic metric for most models.

This means that recognizing the correct behavior or visibility state is still not equivalent to delimiting the corresponding temporal support with sufficient precision. The resulting ranking changes are analytically useful because they reveal which models preserve temporal evidence quality after semantic recognition has already been achieved, and which models mainly gain from coarse recognition without matching boundary quality.
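The difference between recognizing a state and delimiting its interval can be illustrated with a sketch of an interval-level F1 at a 0.5 overlap threshold. The source does not spell out the exact protocol, so the greedy one-to-one matching, the `(start, end)` interval format, and all names below are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

# Assumed interval format: (start, end) in seconds, with end > start.
Interval = Tuple[float, float]


def interval_iou(a: Interval, b: Interval) -> float:
    """Temporal intersection over union of two intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: Sequence[Interval], golds: Sequence[Interval], thr: float = 0.5) -> float:
    """F1 over intervals, greedily matching each prediction to one reference."""
    unmatched: List[Interval] = list(golds)
    true_pos = 0
    for p in preds:
        best = max(unmatched, key=lambda g: interval_iou(p, g), default=None)
        if best is not None and interval_iou(p, best) >= thr:
            unmatched.remove(best)  # each reference interval is used at most once
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(preds)
    recall = true_pos / len(golds)
    return 2 * precision * recall / (precision + recall)


# Hypothetical clip: two reference intervals, three predicted intervals.
golds = [(0.0, 10.0), (20.0, 30.0)]
preds = [(1.0, 9.0), (22.0, 40.0), (50.0, 60.0)]
f1 = f1_at_iou(preds, golds)  # one match: precision 1/3, recall 1/2
```

In this example the second prediction overlaps the right event but stretches too far (IoU 0.4), so it counts as a miss even though the underlying behavior was found, which mirrors the coarse-recognition pattern described above.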

The bar chart below is constructed from the same video-model subset used by the benchmark examples, with one bar showing weighted semantic performance and the other showing temporal F1@0.5 for each model. The figure should therefore be read as a summary of how much dual-cognition performance remains once video reasoning is required to produce explicit interval evidence rather than only the correct semantic judgment.

The video comparison uses the same model list as the benchmark examples and compares semantic success against interval quality at the 0.5 overlap threshold.

Across the selected video-model list, mean semantic performance reaches 33.2%, whereas temporal F1@0.5 remains lower at 29.02%. The gap is smaller than a semantic-only reading might suggest, but it remains large enough to reorder models once dual-cognition reasoning must be supported by explicit temporal evidence. A small subset of models even scores higher on temporal localization than on semantic analysis, which suggests that they can often detect where the relevant evidence occurs in time without yet converting that evidence into the correct semantic interpretation; this points to a weakness in mapping temporal support to stable behavior or visibility semantics. By contrast, models whose semantic scores remain clearly above their temporal scores appear able to form a coarse high-level judgment while still failing to delimit the supporting interval precisely, which indicates that semantic recognition and evidence grounding remain only partially integrated.

The task-level coupling is strongest on Landmark Visibility Counting and Interval Reasoning (0.84) and weakest on Composite Flight Behavior Recognition (0.53). This makes the video branch look more coherent than the image branch overall, but it still shows that the temporal evidence needed for dual cognition is not acquired equally well by behavior recognition and landmark visibility reasoning.

Task	Pearson r	Semantic Avg.	Temporal Avg.
Atomic Flight Behavior Recognition	+0.68	27.19%	34.09%
Composite Flight Behavior Recognition	+0.53	23.13%	15.95%
Landmark Visibility Counting and Interval Reasoning	+0.84	44.91%	31.29%

Video Correlation Analysis. Pearson correlations computed from the selected video-model subset, together with mean semantic and temporal scores for each released video task.

Cross Analysis

Dual-Cognition Transfer and Balance

This section asks two complementary dual-cognition questions. First, does the same cognition axis remain stable when the evidence medium changes from image to video? Second, do self-aware and environment-aware capability remain balanced once the corresponding task families are aggregated? Both scatter plots and correlation summaries use the same paired model pool as the combined leaderboard, namely models with valid results on both the image and video benchmarks.

Cross-Media Transfer

Dual Cognition Across Media

This subsection asks whether the same dual-cognition capability remains stable when the evidence medium changes from image to video. The scatter gives the overall transfer pattern, and the table below summarizes the same relationship with paired-model means, average gap, and overall correlation under the shared paired-model filter.

The overall image-versus-video scatter shows a clear positive relation, but not a tight diagonal collapse. Across the paired models retained here, the aggregate cross-media relation remains only moderate (Pearson r = 0.59), so stronger image performance often carries into video reasoning, but far from uniformly.

The imbalance around the diagonal is equally informative. Among the displayed paired models, 18 lie above the diagonal and only 5 lie below it, so most of the present model pool scores higher in aggregate under video evidence than under image evidence. Cross-media transfer is therefore real, but it remains directional rather than symmetric.

Legend (model families): InternVL, Qwen, SpaceOm, SpaceR, SpaceThinker, VST, ViLASR, Gemini, Kimi, Mimo, GLM.

Image Score vs Video Score. Only models retained by the combined leaderboard filter are shown here, i.e. systems with valid non-zero image and video aggregate results. Point colors denote model families so that cross-media clustering can be compared at the family level rather than only model by model.
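The above-versus-below-diagonal counts quoted for this scatter can be reproduced from paired scores with a few lines. The sketch below assumes the image score is plotted on the horizontal axis and the video score on the vertical axis, so a point above the diagonal is a model whose video score is higher; the function name and the scores are hypothetical.

```python
from typing import Dict, Sequence


def diagonal_split(x: Sequence[float], y: Sequence[float], tol: float = 0.0) -> Dict[str, int]:
    """Count points above, below, and on the y = x diagonal."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    above = sum((b - a) > tol for a, b in zip(x, y))
    below = sum((a - b) > tol for a, b in zip(x, y))
    return {"above": above, "below": below, "on": len(x) - above - below}


# Hypothetical aggregate scores (percent) for five paired models.
image_scores = [12.0, 18.5, 21.0, 25.0, 30.0]
video_scores = [15.0, 17.0, 26.5, 25.0, 33.0]
split = diagonal_split(image_scores, video_scores)
```

Reporting the `on` count separately avoids silently assigning ties to either side, which matters when the split is used as evidence of a directional shift.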

The correlation summary confirms the same pattern numerically. Across the 23 paired models retained here, the overall image-video relation remains moderately positive (Pearson r = 0.59), which is strong enough to rule out a random pairing but still far from a fully stable cross-media transfer regime.

The aggregate means also sharpen the interpretation: image performance averages 19.48%, whereas video performance rises to 23.09%, leaving a positive gap of +3.62 points. Read together with the scatter, this indicates that the current model pool does not simply preserve one fixed level of dual-cognition quality across media; instead, many systems shift to a visibly different operating point once the evidence becomes temporally extended.

Comparison	Paired Models	Pearson r	Image Avg.	Video Avg.	Gap
Overall Image vs Video	23	+0.59	19.48%	23.09%	+3.62 pts

Correlation Analysis. Overall image-versus-video summary computed from the same paired model pool used by the combined leaderboard.
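The reported gap (+3.62) differs by 0.01 from the difference of the two rounded means (23.09 - 19.48 = 3.61). One plausible explanation, which the source does not confirm, is that the gap is computed from unrounded means and rounded only afterwards. The sketch below shows that effect with hypothetical per-model scores chosen so that the rounded means match the table; the function name and the data are assumptions.

```python
from typing import Dict, Sequence


def paired_summary(image: Sequence[float], video: Sequence[float]) -> Dict[str, float]:
    """Means and gap for a paired model pool, rounding only at the end."""
    if len(image) != len(video) or not image:
        raise ValueError("need two non-empty lists of equal length")
    mean_image = sum(image) / len(image)
    mean_video = sum(video) / len(video)
    return {
        "image_avg": round(mean_image, 2),
        "video_avg": round(mean_video, 2),
        "gap": round(mean_video - mean_image, 2),  # gap from unrounded means
    }


# Hypothetical scores whose means are 19.476 and 23.094 before rounding.
image_scores = [10.0, 20.0, 28.428]
video_scores = [15.0, 22.0, 32.282]
summary = paired_summary(image_scores, video_scores)

# Subtracting the already rounded means gives a slightly different figure.
naive_gap = round(summary["video_avg"] - summary["image_avg"], 2)
```

Here `summary["gap"]` is 3.62 while `naive_gap` is 3.61, so a 0.01 discrepancy of this kind does not by itself indicate an error in the table.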

Cognition Balance

Self-Aware vs Environment-Aware Balance

This subsection asks whether self-aware and environment-aware reasoning already behave like balanced aspects of one capability after averaging across their image and video task families, or whether the benchmark still exposes a persistent dual-cognition asymmetry within that same paired model pool.

The overall self-versus-environment scatter again shows a positive but incomplete alignment (Pearson r = 0.39). Current models therefore do not split into two wholly unrelated branches, but neither do they collapse into one balanced capability.

The more striking feature is the direction of that spread. Among the displayed paired models, all 24 lie above the diagonal and none lie below it. The dominant pattern is therefore not random variance but a systematic tilt toward stronger environment-aware scores, which suggests that current MLLMs still find self-aware reasoning harder to consolidate at the same level once both media are taken into account.

Legend (model families): InternVL, Qwen, SpaceOm, SpaceR, SpaceThinker, VST, ViLASR, Gemini, Kimi, Mimo, GLM.

Self-Aware Score vs Environment-Aware Score. Only models retained by the combined leaderboard filter are shown here, and the self-aware and environment-aware scores are aggregated within that same paired model pool. Point colors denote model families so that cognition balance can be read together with family-level grouping.

The correlation summary makes the same imbalance precise. Across the 24 paired models, self-aware and environment-aware scores remain positively related (Pearson r = 0.39), so the two branches do rise together to some extent rather than behaving as unrelated skills.

Yet the mean values show that this coupling is not balanced. Self-aware performance averages only 25.5%, whereas environment-aware performance reaches 43.92%, yielding a large positive gap of +18.42 points. The result is therefore not a unified dual-cognition competence, but a partially coupled system in which environment-aware reasoning is generally more mature than self-aware reasoning.

Comparison	Paired Models	Pearson r	Self Avg.	Environment Avg.	Gap
Overall Self vs Environment	24	+0.39	25.5%	43.92%	+18.42 pts

Correlation Analysis. Overall self-versus-environment summary computed from the same paired model pool used by the combined leaderboard.

Summary

Dual-cognition analysis shows that current MLLMs remain uneven across cognition axes and media

The preceding sections can be read as a progression from branch-level evidence diagnosis to dual-cognition comparison. The summary below therefore first consolidates what the gap analyses and the cross analyses reveal about current MLLM behavior and then turns to what those results imply about the benchmark itself.

The capability-gap results show that current models still solve dual cognition more readily at the semantic level than at the evidence level. In the selected image-model subset, option accuracy averages 35.98%, whereas spatial Acc@0.5 averages only 7.02%. In the selected video-model subset, semantic success averages 33.2%, whereas temporal F1@0.5 remains lower at 29.02%. These are not incidental failures confined to a few weak systems; they persist across the same comparison pool used by the example browser and therefore indicate that dual cognition is still only partially realized when explicit grounding or localization is required.

The cross analyses sharpen that diagnosis in two directions. First, transfer across image and video settings is real but incomplete: across the 23 paired models, overall image and video scores retain only a moderate positive relation (+0.59), and the aggregate mean rises from 19.48% under image input to 23.09% under video input. This means that a model's dual-cognition level does not simply carry across media unchanged; instead, many systems move to a different operating point once the benchmark shifts from static observations to temporally extended evidence. The positive relationship shows that some shared competence exists, but the persistent spread and the +3.62-point video advantage show that this competence is still medium sensitive rather than medium invariant.

The cross-cognition view reveals an even stronger structural asymmetry. After aggregating over all media, self-aware and environment-aware scores remain only moderately coupled (+0.39), while the mean level of environment-aware reasoning reaches 43.92% compared with only 25.5% for self-aware reasoning. The resulting +18.42-point gap is large enough to show that the two cognition branches are not merely noisy variants of one common skill. Instead, they develop unevenly: current MLLMs more often consolidate reasoning about the external world than reasoning about the UAV itself, so dual cognition remains only partially integrated even before one asks whether the supporting spatial or temporal evidence is correct.

Taken together, these findings clarify the value of UAV-DualCog as a dual-cognition benchmark rather than as a simple leaderboard generator. Its value lies in exposing whether a model can sustain self-aware and environment-aware reasoning simultaneously, whether that behavior survives when the evidence medium changes, and whether semantic success is supported by the grounding or localization evidence that operational UAV tasks actually require. A benchmark organized around only one of these axes could still rank models, but it would obscure the central failure patterns revealed here: environment understanding that is not matched by comparable self-awareness, correct semantics without supporting evidence, and aggregate performance that shifts when the observation medium changes. The present analysis therefore shows that UAV-DualCog is most valuable precisely because it turns hidden dual-cognition asymmetries into measurable, comparable, and interpretable model behavior.

Key Findings

Current MLLMs still do not collapse dual cognition into one shared competence: self-aware and environment-aware reasoning remain meaningfully separable even after averaging across image and video tasks.

Cross-media transfer is present but incomplete: overall image and video scores remain only moderately coupled (0.59), so dual-cognition quality still changes materially when the observation medium shifts.

Evidence gaps persist in both branches: selected image models average 35.98% on semantic choice but only 7.02% on spatial grounding, while selected video models average 33.2% on semantic recognition but 29.02% on temporal localization.

Dual-cognition balance is likewise only partial: self-aware and environment-aware scores rise together only moderately (0.39), while the aggregate model pool remains visibly tilted toward stronger environment-aware performance.

Future Directions

Future multimodal models should treat self-aware and environment-aware reasoning as related but non-identical objectives, rather than assuming that one shared visual-language policy will close both branches automatically.

Improving dual-cognition performance will require stronger media robustness: models need to preserve the same cognition axis more consistently when the benchmark shifts between image and video observation settings.

Progress will also require tighter evidence alignment, because stronger dual-cognition semantics will not be convincing unless they are supported by more reliable spatial grounding and temporal localization.