The tables below preserve the full benchmark reporting structure while making it easier to
compare models across image tasks, video tasks, self-aware reasoning, and
environment-aware reasoning.
Leaderboard
Current leading models
The current benchmark leaders are shown below as top-3 views of combined performance, of
per-modality performance (image and video), and of performance along the two cognition
dimensions (self-aware and environment-aware). The `Acc` column in each view reports
Overall Acc.
Image

| # | Model | Acc |
| --- | --- | --- |
| 1 | Gemini 3 Flash | 50.2% |
| 2 | GPT 5.3 Chat | 47.8% |
| 3 | Qwen 3.5-27B | 44.3% |

Video

| # | Model | Acc |
| --- | --- | --- |
| 1 | Gemini 3 Flash | 46.5% |
| 2 | Mimo v2 Omni | 38.8% |
| 3 | InternVL 3.5-38B | 37.8% |

Self-Aware

| # | Model | Acc |
| --- | --- | --- |
| 1 | Gemini 3 Flash | 44.4% |
| 2 | GLM 4.6V | 30.9% |
| 3 | Qwen 3.5-Flash | 28.7% |

Environment-Aware

| # | Model | Acc |
| --- | --- | --- |
| 1 | Qwen 3.5-35B-A3B | 54.9% |
| 2 | Gemini 3 Flash | 54.2% |
| 3 | Qwen 3.5-27B | 53.3% |

Combined

| # | Model | Acc |
| --- | --- | --- |
| 1 | Gemini 3 Flash | 48.4% |
| 2 | Qwen 3.5-Flash | 38.9% |
| 3 | Qwen 3.5-35B-A3B | 37.9% |
Image Tasks
Image-task leaderboard
This table evaluates image-based dual cognition across the four released image tasks. It
reports overall image performance together with per-task answer accuracy and spatial
grounding quality, so readers can inspect not only whether a model selects the right
option, but also whether that decision is supported by reliable landmark localization.
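The spatial-grounding columns can be read concretely. IoU@50 and mIoU are not formally defined in this excerpt; under the usual reading (the fraction of questions whose predicted landmark box overlaps the gold box with IoU ≥ 0.5, and the mean IoU over all questions), they can be sketched as follows, assuming one predicted box per question:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_scores(pred_boxes, gold_boxes, thresh=0.5):
    """Return (IoU@thresh, mIoU) over paired predicted/gold boxes.

    NOTE: illustrative only -- the benchmark's exact definitions are
    not stated in this report.
    """
    ious = [box_iou(p, g) for p, g in zip(pred_boxes, gold_boxes)]
    at_thresh = sum(i >= thresh for i in ious) / len(ious)
    mean_iou = sum(ious) / len(ious)
    return at_thresh, mean_iou
```

Under this reading, a model can post a modest mIoU yet a near-zero IoU@50 if its boxes are consistently close to, but never tightly aligned with, the gold landmarks.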
Within each model group, superscripts (¹ ² ³) mark the top three values in each column;
rankings are computed independently inside each group, and tied values share the same
displayed rank. A #N suffix on a model name gives its overall rank within its group.
Task abbreviations: LRPR = Landmark-Relative Position Reasoning and FOP = Future
Observation Prediction (the self-aware tasks); SRPR = Self-Relative Position Reasoning and
LDAD = Landmark-Driven Action Decision (the environment-aware tasks).

| Model | Overall Acc | LRPR Acc | LRPR IoU@50 | LRPR mIoU | FOP Acc | FOP IoU@50 | FOP mIoU | SRPR Acc | SRPR IoU@50 | SRPR mIoU | LDAD Acc | LDAD IoU@50 | LDAD mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-source Models** | | | | | | | | | | | | | |
| Claude Sonnet 4.6 #3 | 42.7%³ | 37.7%³ | 4.9% | 18.2% | 23.6% | 3.4% | 14.2% | 48.5% | 9.6%³ | 20.2%² | 61.0%² | 8.6% | 19.9% |
| GPT 5.3 Chat #2 | 47.8%² | 35.2% | 11.9% | 23.5% | 37.3%² | 12.4% | 25.2%³ | 56.5%¹ | 14.9%¹ | 22.5%¹ | 62.3%¹ | 15.4%¹ | 22.5%¹ |
| Gemini 3 Flash #1 | 50.2%¹ | 47.6%¹ | 0.7% | 1.2% | 45.9%¹ | 0.7% | 0.9% | 56.2%² | 0.1% | 0.9% | 51.3% | 0.5% | 1.2% |
| Gemini 3.1 Flash Lite | 34.5% | 39.1%² | 0.0% | 0.0% | 28.6% | 2.5% | 3.5% | 39.9% | 0.4% | 0.8% | 30.4% | 0.4% | 0.6% |
| Grok 4.1 Fast | 27.4% | 21.1% | 3.6% | 16.4% | 22.9% | 2.2% | 17.5% | 33.0% | 1.9% | 8.2% | 32.6% | 1.6% | 7.5% |
| Qwen 3.6-Plus | 39.3% | 32.8% | 16.6% | 27.1%³ | 33.5%³ | 14.0%³ | 26.7%² | 48.4% | 7.4% | 19.0% | 42.5% | 6.0% | 19.9% |
| Qwen 3.5-Plus | 38.5% | 28.6% | 21.4%³ | 27.1% | 29.0% | 13.4% | 23.5% | 47.6% | 9.7%² | 19.3% | 48.9% | 8.3% | 20.0%³ |
| Qwen 3.5-Flash | 40.9% | 27.8% | 21.6%² | 29.8%² | 28.9% | 15.9%² | 23.3% | 52.4%³ | 8.8% | 19.4%³ | 54.5%³ | 8.9%³ | 20.1%² |
| Mimo v2 Omni | 31.2% | 28.7% | 38.0%¹ | 34.3%¹ | 21.3% | 29.8%¹ | 29.7%¹ | 36.5% | 8.8% | 17.4% | 38.2% | 9.2%² | 17.3% |
| **Open-source Models** | | | | | | | | | | | | | |
| GLM 4.6V | 32.8% | 30.8% | 9.4% | 8.7% | 27.1% | 8.6% | 8.7% | 29.2% | 3.3% | 6.7% | 44.2% | 2.5% | 5.0% |
| Kimi K2.5 | 34.1% | 34.3%¹ | 27.1% | 30.4% | 32.7% | 15.0% | 21.7% | 39.4% | 5.0% | 15.6% | 30.1% | 4.7% | 14.4% |
| Qwen 3.5-397B-A17B | 39.4% | 27.4% | 3.3% | 12.6% | 35.8%³ | 10.0% | 21.4% | 47.9% | 2.8% | 13.0% | 46.2% | 4.4% | 13.9% |
| Qwen 3.5-122B-A10B #2 | 42.3%² | 32.6%³ | 29.1%² | 31.6%³ | 37.8%¹ | 20.6%¹ | 28.4%³ | 49.7% | 6.1% | 19.4%³ | 49.0% | 6.8% | 18.8%³ |
| Qwen 3.5-35B-A3B #3 | 41.9%³ | 29.5% | 27.6% | 30.5% | 27.4% | 0.5% | 0.8% | 53.2%² | 6.1% | 14.5% | 57.5%¹ | 7.2%³ | 18.5% |
| Qwen 3.5-27B #1 | 44.3%¹ | 31.6% | 42.8%¹ | 39.9%¹ | 37.1%² | 17.1% | 25.8% | 57.8%¹ | 7.3%² | 20.4%² | 50.6%³ | 7.5%² | 21.0%² |
| Qwen 3.5-9B | 40.9% | 29.0% | 15.9% | 25.8% | 29.0% | 17.8%³ | 29.8%¹ | 53.2%² | 10.3%¹ | 20.9%¹ | 52.2%² | 11.5%¹ | 22.1%¹ |
| Qwen 3.5-4B | 39.0% | 30.9% | 12.8% | 23.1% | 30.2% | 11.2% | 21.7% | 47.5% | 6.7%³ | 17.7% | 47.5% | 6.3% | 18.0% |
| Intern S1-Pro | 28.4% | 29.4% | 17.4% | 26.8% | 26.2% | 9.6% | 21.3% | 27.8% | 2.7% | 13.7% | 30.3% | 3.2% | 14.0% |
| InternVL 3.5-241B-A28B | 37.7% | 32.9%² | 28.4%³ | 31.2% | 26.1% | 19.5%² | 25.3% | 50.1%³ | 5.4% | 15.8% | 41.8% | 5.1% | 15.0% |
| InternVL 3.5-30B-A3B | 32.0% | 27.3% | 4.9% | 19.9% | 24.0% | 7.0% | 15.5% | 35.4% | 0.6% | 7.6% | 41.3% | 1.3% | 7.3% |
| InternVL 3.5-14B | 31.1% | 22.8% | 7.6% | 22.2% | 31.9% | 6.6% | 22.0% | 28.5% | 4.7% | 12.9% | 41.3% | 5.0% | 11.9% |
| InternVL 3.5-8B | 27.1% | 28.5% | 19.7% | 34.2%² | 24.5% | 8.1% | 28.7%² | 24.7% | 4.2% | 12.3% | 30.8% | 2.5% | 11.4% |
| InternVL 3.5-4B | 27.8% | 25.6% | 3.1% | 19.8% | 26.4% | 10.8% | 23.2% | 22.4% | 1.0% | 8.0% | 36.7% | 1.0% | 7.5% |
| **Fine-tuned Models** | | | | | | | | | | | | | |
| SpaceR #3 | 34.0%³ | 20.6% | 2.4%³ | 6.0%³ | 24.0% | 1.2%³ | 1.9%³ | 42.4% | 1.8%³ | 5.4%³ | 48.8%² | 2.3%² | 6.8%² |
| SpaceThinker | 28.8% | 22.6% | 1.2% | 3.0% | 25.4%¹ | 1.9%² | 5.7%² | 32.8% | 0.6% | 5.1% | 34.3% | 0.7% | 4.7% |
| SpaceOm | 29.3% | 24.7%³ | 2.8%² | 6.7%² | 24.5%² | 3.6%¹ | 10.9%¹ | 35.2% | 2.4%² | 8.6%¹ | 32.8% | 1.8%³ | 6.5%³ |
| SenseNova-SI-1.2 | 19.6% | 17.5% | 10.7%¹ | 10.6%¹ | 0.1% | 0.0% | 0.1% | 15.3% | 3.2%¹ | 5.8%² | 45.4% | 6.0%¹ | 8.5%¹ |
| ViLaSR | 33.4% | 18.8% | 0.0% | 0.0% | 22.1% | 0.0% | 0.0% | 43.1%³ | 0.0% | 0.0% | 49.7%¹ | 0.0% | 0.0% |
| VST-7B-RL #1 | 36.1%¹ | 29.2%¹ | 0.0% | 0.0% | 24.1%³ | 0.0% | 0.0% | 47.2%¹ | 0.0% | 0.1% | 44.0% | 0.0% | 0.0% |
| VST-7B-SFT #2 | 36.1%² | 28.1%² | 0.0% | 0.0% | 23.0% | 0.0% | 0.0% | 46.5%² | 0.0% | 0.2% | 46.6%³ | 0.0% | 0.2% |
Video Tasks
Video-task leaderboard
This table evaluates video-based dual cognition across composite behavior recognition,
atomic behavior recognition, and landmark visibility reasoning. It reports overall video
performance together with semantic and temporal columns for each task, so readers can judge
whether a model recognizes the right flight event or landmark state and whether it also
localizes the corresponding interval correctly.
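The temporal columns admit a similarly concrete reading. tIoU is the interval analogue of box IoU, and F1@50 plausibly scores one-to-one matching of predicted against gold intervals at tIoU ≥ 0.5; neither metric is formally defined in this excerpt, so the sketch below is illustrative:

```python
def t_iou(a, b):
    """Temporal IoU of two intervals (start, end), in seconds or frames."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def interval_f1(pred, gold, thresh=0.5):
    """F1 over greedy one-to-one interval matching at tIoU >= thresh.

    NOTE: illustrative only -- the benchmark's exact matching rule is
    not stated in this report.
    """
    matched, tp = set(), 0
    for p in pred:
        best_j, best_iou = None, thresh
        for j, g in enumerate(gold):
            if j in matched:
                continue
            iou = t_iou(p, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    if not pred or not gold:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Read this way, a high Acc paired with a low F1@50 or mtIoU means the model names the right behavior or landmark state but places it at the wrong time in the clip.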
Within each model group, superscripts (¹ ² ³) mark the top three values in each column;
rankings are computed independently inside each group, and tied values share the same
displayed rank. A #N suffix on a model name gives its overall rank within its group.
Task abbreviations: CBR = Composite Behavior Recognition, ABR = Atomic Behavior
Recognition, LVR = Landmark Visibility Reasoning.

| Model | Overall Acc | CBR Acc | CBR tIoU@50 | CBR mtIoU | ABR Acc | ABR F1@50 | ABR mtIoU | LVR Acc | LVR F1@50 | LVR mtIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-source Models** | | | | | | | | | | |
| Gemini 3 Flash #1 | 46.5%¹ | 49.3%¹ | 51.9%¹ | 54.3%¹ | 35.0%¹ | 52.6%² | 46.6% | 55.2%² | 44.9% | 44.3% |
| Gemini 3.1 Flash Lite | 31.9% | 22.3% | 29.5% | 32.2% | 19.7% | 40.0% | 40.6% | 53.7%³ | 43.1% | 41.6% |
| Mimo v2 Omni #2 | 38.8%² | 27.8%³ | 33.1%³ | 35.8%³ | 29.1%³ | 53.6%¹ | 46.9%³ | 59.4%¹ | 52.5%³ | 53.7%¹ |
| Qwen 3.5-Flash #3 | 36.8%³ | 32.6%² | 35.8%² | 40.1%² | 25.6% | 51.7%³ | 42.4% | 52.3% | 57.8%¹ | 51.3%² |
| Qwen 3.5-Plus | 28.6% | 2.4% | 9.0% | 9.2% | 30.0%² | 46.6% | 47.7%¹ | 53.2% | 53.9%² | 46.8%³ |
| Qwen 3.6-Plus | 29.6% | 9.8% | 16.5% | 16.9% | 28.3% | 45.9% | 47.3%² | 50.8% | 51.5% | 46.1% |
| **Open-source Models** | | | | | | | | | | |
| GLM 4.6V | 32.4% | 32.4%² | 1.3% | 6.7% | 33.4%² | 21.6% | 22.6% | 31.3% | 16.7% | 33.8% |
| Kimi K2.5 | 30.9% | 11.9% | 19.1% | 19.9% | 30.2%³ | 54.6%² | 49.2% | 50.5% | 49.7%³ | 45.9% |
| Qwen 3.5-397B-A17B | 23.4% | 5.2% | 1.7% | 2.2% | 24.6% | 29.9% | 30.0% | 40.4% | 20.4% | 31.8% |
| Qwen 3.5-122B-A10B | 30.6% | 21.2% | 28.4%¹ | 31.9%¹ | 22.1% | 39.0% | 34.6% | 48.4% | 46.4% | 49.7%² |
| Qwen 3.5-35B-A3B | 34.0% | 21.6% | 21.5%³ | 23.1%³ | 26.2% | 54.4% | 50.3%³ | 54.1%³ | 49.7%² | 48.6% |
| Qwen 3.5-27B | 31.3% | 14.6% | 21.9%² | 23.3%² | 28.0% | 54.5%³ | 50.7%² | 51.4% | 51.7%¹ | 48.7%³ |
| Qwen 3.5-9B | 29.2% | 9.9% | 10.1% | 11.0% | 27.2% | 55.9%¹ | 51.3%¹ | 50.7% | 33.5% | 36.8% |
| Qwen 3.5-4B | 16.5% | 6.4% | 1.3% | 1.4% | 9.4% | 3.0% | 2.8% | 33.9% | 44.3% | 52.0%¹ |
| InternVL 3.5-38B #1 | 37.8%¹ | 15.2% | 11.1% | 12.8% | 34.2%¹ | 33.0% | 33.1% | 64.0%¹ | 20.0% | 27.5% |
| InternVL 3.5-30B-A3B | 28.0% | 15.3% | 5.8% | 8.6% | 24.8% | 24.2% | 24.6% | 43.9% | 7.9% | 12.6% |
| InternVL 3.5-14B #2 | 36.6%² | 33.1%¹ | 3.9% | 11.3% | 22.4% | 15.8% | 17.4% | 54.5%² | 13.9% | 22.7% |
| InternVL 3.5-8B #3 | 34.1%³ | 27.4% | 8.5% | 14.1% | 23.0% | 25.0% | 26.1% | 51.9% | 16.4% | 21.3% |
| InternVL 3.5-4B | 28.9% | 31.0%³ | 3.5% | 7.4% | 21.1% | 11.7% | 12.1% | 34.6% | 16.5% | 32.3% |
| **Fine-tuned Models** | | | | | | | | | | |
| SpaceR #2 | 28.0%² | 24.0%² | 0.0%¹ | 2.6%² | 26.3%² | 15.3%³ | 13.9%³ | 33.7%² | 15.7%² | 30.8%² |
| SpaceThinker #3 | 25.0%³ | 19.0%³ | 0.0%¹ | 1.6%³ | 22.4%³ | 18.1%¹ | 14.1%¹ | 33.7%² | 13.6%³ | 27.1%³ |
| SpaceOm #1 | 30.9%¹ | 33.2%¹ | 0.0%¹ | 3.2%¹ | 27.0%¹ | 16.7%² | 14.0%² | 32.3%³ | 17.0%¹ | 34.2%¹ |
| ViLaSR | 17.0% | 5.3% | 0.0%¹ | 0.8% | 9.7% | 3.4% | 5.6% | 35.9%¹ | 10.1% | 20.6% |
| VST-7B-RL | 7.6% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% | 22.8% | 0.0% | 0.0% |
| VST-7B-SFT | 7.8% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% | 23.4% | 4.1% | 8.1% |
Self-Aware
Self-aware capability leaderboard
This table reorganizes the benchmark by capability rather than by medium and focuses on
self-aware reasoning as one coherent axis. It places the two self-aware image tasks
together with composite and atomic flight-behavior recognition from video, so readers can
inspect how well each model reasons about UAV self-state across both spatial and temporal
evidence channels.
Within each model group, superscripts (¹ ² ³) mark the top three values in each column;
rankings are computed independently inside each group, and tied values share the same
displayed rank. A #N suffix on a model name gives its overall rank within its group.
Task abbreviations: LRPR = Landmark-Relative Position Reasoning and FOP = Future
Observation Prediction (image tasks); CBR = Composite Behavior Recognition and ABR =
Atomic Behavior Recognition (video tasks).

| Model | Overall Acc | LRPR Acc | LRPR IoU@50 | LRPR mIoU | FOP Acc | FOP IoU@50 | FOP mIoU | CBR Acc | CBR tIoU@50 | CBR mtIoU | ABR Acc | ABR F1@50 | ABR mtIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-source Models** | | | | | | | | | | | | | |
| Gemini 3 Flash #1 | 44.4%¹ | 47.6%¹ | 0.7% | 1.2% | 45.9%¹ | 0.7% | 0.9% | 49.3%¹ | 51.9%¹ | 54.3%¹ | 35.0%¹ | 52.6%² | 46.6% |
| Gemini 3.1 Flash Lite #3 | 27.5%³ | 39.1%² | 0.0% | 0.0% | 28.6% | 2.5% | 3.5% | 22.3% | 29.5% | 32.2% | 19.7% | 40.0% | 40.6% |
| Qwen 3.6-Plus | 26.1% | 32.8%³ | 16.6% | 27.1%³ | 33.5%² | 14.0%³ | 26.7%² | 9.8% | 16.5% | 16.9% | 28.3% | 45.9% | 47.3%² |
| Qwen 3.5-Plus | 22.5% | 28.6% | 21.4%³ | 27.1% | 29.0%³ | 13.4% | 23.5%³ | 2.4% | 9.0% | 9.2% | 30.0%² | 46.6% | 47.7%¹ |
| Qwen 3.5-Flash #2 | 28.7%² | 27.8% | 21.6%² | 29.8%² | 28.9% | 15.9%² | 23.3% | 32.6%² | 35.8%² | 40.1%² | 25.6% | 51.7%³ | 42.4% |
| Mimo v2 Omni | 26.7% | 28.7% | 38.0%¹ | 34.3%¹ | 21.3% | 29.8%¹ | 29.7%¹ | 27.8%³ | 33.1%³ | 35.8%³ | 29.1%³ | 53.6%¹ | 46.9%³ |
| **Open-source Models** | | | | | | | | | | | | | |
| GLM 4.6V #1 | 30.9%¹ | 30.8% | 9.4% | 8.7% | 27.1% | 8.6% | 8.7% | 32.4%² | 1.3% | 6.7% | 33.4%¹ | 21.6% | 22.6% |
| Kimi K2.5 | 27.3% | 34.3%¹ | 27.1% | 30.4% | 32.7% | 15.0% | 21.7% | 11.9% | 19.1% | 19.9% | 30.2%² | 54.6%² | 49.2% |
| Qwen 3.5-397B-A17B | 23.3% | 27.4% | 3.3% | 12.6% | 35.8%³ | 10.0% | 21.4% | 5.2% | 1.7% | 2.2% | 24.6% | 29.9% | 30.0% |
| Qwen 3.5-122B-A10B #2 | 28.4%² | 32.6%² | 29.1%² | 31.6%³ | 37.8%¹ | 20.6%¹ | 28.4%³ | 21.2% | 28.4%¹ | 31.9%¹ | 22.1% | 39.0% | 34.6% |
| Qwen 3.5-35B-A3B | 26.2% | 29.5% | 27.6%³ | 30.5% | 27.4% | 0.5% | 0.8% | 21.6% | 21.5%³ | 23.1%³ | 26.2% | 54.4% | 50.3%³ |
| Qwen 3.5-27B #3 | 27.8%³ | 31.6%³ | 42.8%¹ | 39.9%¹ | 37.1%² | 17.1%³ | 25.8% | 14.6% | 21.9%² | 23.3%² | 28.0%³ | 54.5%³ | 50.7%² |
| Qwen 3.5-9B | 23.8% | 29.0% | 15.9% | 25.8% | 29.0% | 17.8%² | 29.8%¹ | 9.9% | 10.1% | 11.0% | 27.2% | 55.9%¹ | 51.3%¹ |
| Qwen 3.5-4B | 19.2% | 30.9% | 12.8% | 23.1% | 30.2% | 11.2% | 21.7% | 6.4% | 1.3% | 1.4% | 9.4% | 3.0% | 2.8% |
| InternVL 3.5-30B-A3B | 22.9% | 27.3% | 4.9% | 19.9% | 24.0% | 7.0% | 15.5% | 15.3% | 5.8% | 8.6% | 24.8% | 24.2% | 24.6% |
| InternVL 3.5-14B | 27.5% | 22.8% | 7.6% | 22.2% | 31.9% | 6.6% | 22.0% | 33.1%¹ | 3.9% | 11.3% | 22.4% | 15.8% | 17.4% |
| InternVL 3.5-8B | 25.9% | 28.5% | 19.7% | 34.2%² | 24.5% | 8.1% | 28.7%² | 27.4% | 8.5% | 14.1% | 23.0% | 25.0% | 26.1% |
| InternVL 3.5-4B | 26.0% | 25.6% | 3.1% | 19.8% | 26.4% | 10.8% | 23.2% | 31.0%³ | 3.5% | 7.4% | 21.1% | 11.7% | 12.1% |
| **Fine-tuned Models** | | | | | | | | | | | | | |
| SpaceR #2 | 23.7%² | 20.6% | 2.4%² | 6.0%² | 24.0% | 1.2%³ | 1.9%³ | 24.0%² | 0.0%¹ | 2.6%² | 26.3%² | 15.3%³ | 13.9%³ |
| SpaceThinker #3 | 22.3%³ | 22.6% | 1.2%³ | 3.0%³ | 25.4%¹ | 1.9%² | 5.7%² | 19.0%³ | 0.0%¹ | 1.6%³ | 22.4%³ | 18.1%¹ | 14.1%¹ |
| SpaceOm #1 | 27.4%¹ | 24.7%³ | 2.8%¹ | 6.7%¹ | 24.5%² | 3.6%¹ | 10.9%¹ | 33.2%¹ | 0.0%¹ | 3.2%¹ | 27.0%¹ | 16.7%² | 14.0%² |
| ViLaSR | 14.0% | 18.8% | 0.0% | 0.0% | 22.1% | 0.0% | 0.0% | 5.3% | 0.0%¹ | 0.8% | 9.7% | 3.4% | 5.6% |
| VST-7B-RL | 13.3% | 29.2%¹ | 0.0% | 0.0% | 24.1%³ | 0.0% | 0.0% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% |
| VST-7B-SFT | 12.8% | 28.1%² | 0.0% | 0.0% | 23.0% | 0.0% | 0.0% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% |
Environment-Aware
Environment-aware capability leaderboard
This table groups the environment-aware tasks across image and video, reading the
benchmark from the perspective of external-world understanding. It makes it easy to
compare self-relative position reasoning, landmark-driven action decision, and landmark
visibility reasoning under one shared environment-state perspective, together with the
grounding and localization evidence attached to those decisions.
Within each model group, superscripts (¹ ² ³) mark the top three values in each column;
rankings are computed independently inside each group, and tied values share the same
displayed rank. A #N suffix on a model name gives its overall rank within its group.
Task abbreviations: SRPR = Self-Relative Position Reasoning and LDAD = Landmark-Driven
Action Decision (image tasks); LVR = Landmark Visibility Reasoning (video task).

| Model | Overall Acc | SRPR Acc | SRPR IoU@50 | SRPR mIoU | LDAD Acc | LDAD IoU@50 | LDAD mIoU | LVR Acc | LVR F1@50 | LVR mtIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-source Models** | | | | | | | | | | |
| Gemini 3 Flash #1 | 54.2%¹ | 56.2%¹ | 0.1% | 0.9% | 51.3%² | 0.5% | 1.2% | 55.2%² | 44.9% | 44.3% |
| Gemini 3.1 Flash Lite | 41.3% | 39.9% | 0.4% | 0.8% | 30.4% | 0.4% | 0.6% | 53.7%³ | 43.1% | 41.6% |
| Qwen 3.6-Plus | 47.2% | 48.4%³ | 7.4%³ | 19.0%³ | 42.5% | 6.0% | 19.9%³ | 50.8% | 51.5% | 46.1% |
| Qwen 3.5-Plus #3 | 49.9%³ | 47.6% | 9.7%¹ | 19.3%² | 48.9%³ | 8.3%³ | 20.0%² | 53.2% | 53.9%² | 46.8%³ |
| Qwen 3.5-Flash #2 | 53.1%² | 52.4%² | 8.8%² | 19.4%¹ | 54.5%¹ | 8.9%² | 20.1%¹ | 52.3% | 57.8%¹ | 51.3%² |
| Mimo v2 Omni | 44.7% | 36.5% | 8.8%² | 17.4% | 38.2% | 9.2%¹ | 17.3% | 59.4%¹ | 52.5%³ | 53.7%¹ |
| **Open-source Models** | | | | | | | | | | |
| GLM 4.6V | 34.9% | 29.2% | 3.3% | 6.7% | 44.2% | 2.5% | 5.0% | 31.3% | 16.7% | 33.8% |
| Kimi K2.5 | 40.0% | 39.4% | 5.0% | 15.6% | 30.1% | 4.7% | 14.4% | 50.5% | 49.7%³ | 45.9% |
| Qwen 3.5-397B-A17B | 44.9% | 47.9% | 2.8% | 13.0% | 46.2% | 4.4% | 13.9% | 40.4% | 20.4% | 31.8% |
| Qwen 3.5-122B-A10B | 49.1% | 49.7%³ | 6.1% | 19.4%³ | 49.0% | 6.8% | 18.8%³ | 48.4% | 46.4% | 49.7%² |
| Qwen 3.5-35B-A3B #1 | 54.9%¹ | 53.2%² | 6.1% | 14.5% | 57.5%¹ | 7.2%³ | 18.5% | 54.1%² | 49.7%² | 48.6% |
| Qwen 3.5-27B #2 | 53.3%² | 57.8%¹ | 7.3%² | 20.4%² | 50.6%³ | 7.5%² | 21.0%² | 51.4% | 51.7%¹ | 48.7%³ |
| Qwen 3.5-9B #3 | 52.1%³ | 53.2%² | 10.3%¹ | 20.9%¹ | 52.2%² | 11.5%¹ | 22.1%¹ | 50.7% | 33.5% | 36.8% |
| Qwen 3.5-4B | 42.9% | 47.5% | 6.7%³ | 17.7% | 47.5% | 6.3% | 18.0% | 33.9% | 44.3% | 52.0%¹ |
| InternVL 3.5-30B-A3B | 40.2% | 35.4% | 0.6% | 7.6% | 41.3% | 1.3% | 7.3% | 43.9% | 7.9% | 12.6% |
| InternVL 3.5-14B | 41.4% | 28.5% | 4.7% | 12.9% | 41.3% | 5.0% | 11.9% | 54.5%¹ | 13.9% | 22.7% |
| InternVL 3.5-8B | 35.8% | 24.7% | 4.2% | 12.3% | 30.8% | 2.5% | 11.4% | 51.9%³ | 16.4% | 21.3% |
| InternVL 3.5-4B | 31.2% | 22.4% | 1.0% | 8.0% | 36.7% | 1.0% | 7.5% | 34.6% | 16.5% | 32.3% |
| **Fine-tuned Models** | | | | | | | | | | |
| SpaceR #2 | 41.6%² | 42.4% | 1.8%² | 5.4%² | 48.8%² | 2.3%¹ | 6.8%¹ | 33.7%² | 15.7%² | 30.8%² |
| SpaceThinker | 33.6% | 32.8% | 0.6%³ | 5.1%³ | 34.3% | 0.7%³ | 4.7%³ | 33.7%² | 13.6%³ | 27.1%³ |
| SpaceOm | 33.4% | 35.2% | 2.4%¹ | 8.6%¹ | 32.8% | 1.8%² | 6.5%² | 32.3%³ | 17.0%¹ | 34.2%¹ |
| ViLaSR #1 | 42.9%¹ | 43.1%³ | 0.0% | 0.0% | 49.7%¹ | 0.0% | 0.0% | 35.9%¹ | 10.1% | 20.6% |
| VST-7B-RL | 38.0% | 47.2%¹ | 0.0% | 0.1% | 44.0% | 0.0% | 0.0% | 22.8% | 0.0% | 0.0% |
| VST-7B-SFT #3 | 38.8%³ | 46.5%² | 0.0% | 0.2% | 46.6%³ | 0.0% | 0.2% | 23.4% | 4.1% | 8.1% |
Combined
Combined leaderboard
This table reports only models with valid results on both the image and video benchmarks.
It summarizes overall cross-media performance together with media-wise and capability-wise
aggregate scores, so readers can see whether a model's benchmark ranking comes from broad
consistency across both media and both cognition branches or from strength concentrated in
only one part of the benchmark.
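The combined figures in the Top-3 view above agree, to within rounding of the published percentages, with a simple unweighted mean of a model's image and video Overall Acc (e.g. Gemini 3 Flash: (50.2% + 46.5%) / 2 = 48.35% ≈ 48.4%). The official aggregation rule is not stated in this excerpt, so the sketch below is a plausible reconstruction rather than a confirmed definition:

```python
def combined_acc(image_acc: float, video_acc: float) -> float:
    """Unweighted (macro) mean of image and video Overall Acc, in percent.

    NOTE: assumption -- this matches the published Top-3 combined scores
    to within rounding, but the benchmark's exact rule is not given here.
    """
    return (image_acc + video_acc) / 2
```

A question-count-weighted (micro) average would instead favor whichever medium contributes more questions, which the published numbers do not let us distinguish here.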
Rankings are computed independently inside each model group; tied values share the same
displayed rank.