Leaderboard

Leaderboard Tables

The tables below preserve the full benchmark reporting structure while making it easier to compare models across image tasks, video tasks, self-aware reasoning, and environment-aware reasoning.

Leaderboard

Current leading models

The current benchmark leaders are shown below as Top-3 views of combined performance, of each modality (image, video), and of each cognition dimension (self-aware, environment-aware). The `Acc` column in each view reports Overall Acc.

Image

| # | Model | Acc |
|---|-------|-----|
| 1 | Gemini 3 Flash | 50.2% |
| 2 | GPT 5.3 Chat | 47.8% |
| 3 | Qwen 3.5-27B | 44.3% |

Video

| # | Model | Acc |
|---|-------|-----|
| 1 | Gemini 3 Flash | 46.5% |
| 2 | Mimo v2 Omni | 38.8% |
| 3 | InternVL 3.5-38B | 37.8% |

Self-Aware

| # | Model | Acc |
|---|-------|-----|
| 1 | Gemini 3 Flash | 44.4% |
| 2 | GLM 4.6V | 30.9% |
| 3 | Qwen 3.5-Flash | 28.7% |

Environment-Aware

| # | Model | Acc |
|---|-------|-----|
| 1 | Qwen 3.5-35B-A3B | 54.9% |
| 2 | Gemini 3 Flash | 54.2% |
| 3 | Qwen 3.5-27B | 53.3% |

Combined

| # | Model | Acc |
|---|-------|-----|
| 1 | Gemini 3 Flash | 48.4% |
| 2 | Qwen 3.5-Flash | 38.9% |
| 3 | Qwen 3.5-35B-A3B | 37.9% |

Image Tasks

Image-task leaderboard

This table evaluates image-based dual cognition across the four released image tasks. It reports overall image performance together with per-task answer accuracy and spatial grounding quality, so readers can inspect not only whether a model selects the right option, but also whether that decision is supported by reliable landmark localization.
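For readers who want to reproduce the grounding columns: the conventional reading of these metrics is that `mIoU` averages the box IoU over all samples and `IoU@50` is the fraction of samples whose IoU reaches 0.5. The benchmark's exact protocol (box format, matching rules) is not specified here, so the sketch below only illustrates that conventional definition.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_scores(pred_boxes, gt_boxes, thresh=0.5):
    """Return (IoU@thresh, mIoU) over paired predicted/ground-truth boxes."""
    ious = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    iou_at = sum(i >= thresh for i in ious) / len(ious)
    miou = sum(ious) / len(ious)
    return iou_at, miou
```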

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Self-Aware tasks: LRPR = Landmark-Relative Position Reasoning, FOP = Future Observation Prediction. Environment-Aware tasks: SRPR = Self-Relative Position Reasoning, LDAD = Landmark-Driven Action Decision. Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | LRPR Acc | LRPR IoU@50 | LRPR mIoU | FOP Acc | FOP IoU@50 | FOP mIoU | SRPR Acc | SRPR IoU@50 | SRPR mIoU | LDAD Acc | LDAD IoU@50 | LDAD mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | | | | |
| Claude Sonnet 4.6 | 42.7%³ | 37.7%³ | 4.9% | 18.2% | 23.6% | 3.4% | 14.2% | 48.5% | 9.6%³ | 20.2%² | 61.0%² | 8.6% | 19.9% |
| GPT 5.3 Chat | 47.8%² | 35.2% | 11.9% | 23.5% | 37.3%² | 12.4% | 25.2%³ | 56.5%¹ | 14.9%¹ | 22.5%¹ | 62.3%¹ | 15.4%¹ | 22.5%¹ |
| Gemini 3 Flash | 50.2%¹ | 47.6%¹ | 0.7% | 1.2% | 45.9%¹ | 0.7% | 0.9% | 56.2%² | 0.1% | 0.9% | 51.3% | 0.5% | 1.2% |
| Gemini 3.1 Flash Lite | 34.5% | 39.1%² | 0.0% | 0.0% | 28.6% | 2.5% | 3.5% | 39.9% | 0.4% | 0.8% | 30.4% | 0.4% | 0.6% |
| Grok 4.1 Fast | 27.4% | 21.1% | 3.6% | 16.4% | 22.9% | 2.2% | 17.5% | 33.0% | 1.9% | 8.2% | 32.6% | 1.6% | 7.5% |
| Qwen 3.6-Plus | 39.3% | 32.8% | 16.6% | 27.1%³ | 33.5%³ | 14.0%³ | 26.7%² | 48.4% | 7.4% | 19.0% | 42.5% | 6.0% | 19.9% |
| Qwen 3.5-Plus | 38.5% | 28.6% | 21.4%³ | 27.1% | 29.0% | 13.4% | 23.5% | 47.6% | 9.7%² | 19.3% | 48.9% | 8.3% | 20.0%³ |
| Qwen 3.5-Flash | 40.9% | 27.8% | 21.6%² | 29.8%² | 28.9% | 15.9%² | 23.3% | 52.4%³ | 8.8% | 19.4%³ | 54.5%³ | 8.9%³ | 20.1%² |
| Mimo v2 Omni | 31.2% | 28.7% | 38.0%¹ | 34.3%¹ | 21.3% | 29.8%¹ | 29.7%¹ | 36.5% | 8.8% | 17.4% | 38.2% | 9.2%² | 17.3% |
| **Open-source Models** | | | | | | | | | | | | | |
| GLM 4.6V | 32.8% | 30.8% | 9.4% | 8.7% | 27.1% | 8.6% | 8.7% | 29.2% | 3.3% | 6.7% | 44.2% | 2.5% | 5.0% |
| Kimi K2.5 | 34.1% | 34.3%¹ | 27.1% | 30.4% | 32.7% | 15.0% | 21.7% | 39.4% | 5.0% | 15.6% | 30.1% | 4.7% | 14.4% |
| Qwen 3.5-397B-A17B | 39.4% | 27.4% | 3.3% | 12.6% | 35.8%³ | 10.0% | 21.4% | 47.9% | 2.8% | 13.0% | 46.2% | 4.4% | 13.9% |
| Qwen 3.5-122B-A10B | 42.3%² | 32.6%³ | 29.1%² | 31.6%³ | 37.8%¹ | 20.6%¹ | 28.4%³ | 49.7% | 6.1% | 19.4%³ | 49.0% | 6.8% | 18.8%³ |
| Qwen 3.5-35B-A3B | 41.9%³ | 29.5% | 27.6% | 30.5% | 27.4% | 0.5% | 0.8% | 53.2%² | 6.1% | 14.5% | 57.5%¹ | 7.2%³ | 18.5% |
| Qwen 3.5-27B | 44.3%¹ | 31.6% | 42.8%¹ | 39.9%¹ | 37.1%² | 17.1% | 25.8% | 57.8%¹ | 7.3%² | 20.4%² | 50.6%³ | 7.5%² | 21.0%² |
| Qwen 3.5-9B | 40.9% | 29.0% | 15.9% | 25.8% | 29.0% | 17.8%³ | 29.8%¹ | 53.2%² | 10.3%¹ | 20.9%¹ | 52.2%² | 11.5%¹ | 22.1%¹ |
| Qwen 3.5-4B | 39.0% | 30.9% | 12.8% | 23.1% | 30.2% | 11.2% | 21.7% | 47.5% | 6.7%³ | 17.7% | 47.5% | 6.3% | 18.0% |
| Intern S1-Pro | 28.4% | 29.4% | 17.4% | 26.8% | 26.2% | 9.6% | 21.3% | 27.8% | 2.7% | 13.7% | 30.3% | 3.2% | 14.0% |
| InternVL 3.5-241B-A28B | 37.7% | 32.9%² | 28.4%³ | 31.2% | 26.1% | 19.5%² | 25.3% | 50.1%³ | 5.4% | 15.8% | 41.8% | 5.1% | 15.0% |
| InternVL 3.5-30B-A3B | 32.0% | 27.3% | 4.9% | 19.9% | 24.0% | 7.0% | 15.5% | 35.4% | 0.6% | 7.6% | 41.3% | 1.3% | 7.3% |
| InternVL 3.5-14B | 31.1% | 22.8% | 7.6% | 22.2% | 31.9% | 6.6% | 22.0% | 28.5% | 4.7% | 12.9% | 41.3% | 5.0% | 11.9% |
| InternVL 3.5-8B | 27.1% | 28.5% | 19.7% | 34.2%² | 24.5% | 8.1% | 28.7%² | 24.7% | 4.2% | 12.3% | 30.8% | 2.5% | 11.4% |
| InternVL 3.5-4B | 27.8% | 25.6% | 3.1% | 19.8% | 26.4% | 10.8% | 23.2% | 22.4% | 1.0% | 8.0% | 36.7% | 1.0% | 7.5% |
| **Fine-tuned Models** | | | | | | | | | | | | | |
| SpaceR | 34.0%³ | 20.6% | 2.4%³ | 6.0%³ | 24.0% | 1.2%³ | 1.9%³ | 42.4% | 1.8%³ | 5.4%³ | 48.8%² | 2.3%² | 6.8%² |
| SpaceThinker | 28.8% | 22.6% | 1.2% | 3.0% | 25.4%¹ | 1.9%² | 5.7%² | 32.8% | 0.6% | 5.1% | 34.3% | 0.7% | 4.7% |
| SpaceOm | 29.3% | 24.7%³ | 2.8%² | 6.7%² | 24.5%² | 3.6%¹ | 10.9%¹ | 35.2% | 2.4%² | 8.6%¹ | 32.8% | 1.8%³ | 6.5%³ |
| SenseNova-SI-1.2 | 19.6% | 17.5% | 10.7%¹ | 10.6%¹ | 0.1% | 0.0% | 0.1% | 15.3% | 3.2%¹ | 5.8%² | 45.4% | 6.0%¹ | 8.5%¹ |
| ViLaSR | 33.4% | 18.8% | 0.0% | 0.0% | 22.1% | 0.0% | 0.0% | 43.1%³ | 0.0% | 0.0% | 49.7%¹ | 0.0% | 0.0% |
| VST-7B-RL | 36.1%¹ | 29.2%¹ | 0.0% | 0.0% | 24.1%³ | 0.0% | 0.0% | 47.2%¹ | 0.0% | 0.1% | 44.0% | 0.0% | 0.0% |
| VST-7B-SFT | 36.1%² | 28.1%² | 0.0% | 0.0% | 23.0% | 0.0% | 0.0% | 46.5%² | 0.0% | 0.2% | 46.6%³ | 0.0% | 0.2% |
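As the tied entries suggest (e.g. two 53.2% values both marked second, with 50.1% third), the within-group Top-3 marking behaves like dense ranking: tied values share a rank and the next distinct value takes the next rank. A minimal sketch (the function name is illustrative, not from the benchmark's code):

```python
def top3_ranks(values):
    """Map each value to its dense rank (1 = best) if it is among the top 3
    distinct values of the group, else None."""
    distinct = sorted(set(values), reverse=True)[:3]
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [rank_of.get(v) for v in values]
```

For example, `top3_ranks([57.8, 53.2, 53.2, 50.1, 47.5])` yields `[1, 2, 2, 3, None]`.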

Video Tasks

Video-task leaderboard

This table evaluates video-based dual cognition across composite behavior recognition, atomic behavior recognition, and landmark visibility reasoning. It reports overall video performance together with semantic and temporal columns for each task, so readers can judge whether a model recognizes the right flight event or landmark state and whether it also localizes the corresponding interval correctly.
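Assuming the conventional definitions, `tIoU@50` and `F1@50` threshold the temporal IoU between predicted and ground-truth intervals at 0.5, while `mtIoU` averages the temporal IoU itself. A sketch under those assumptions (the interval format and the greedy matching are illustrative, not the benchmark's documented protocol):

```python
def t_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def f1_at(preds, gts, thresh=0.5):
    """Detection-style F1: greedily match predicted intervals to ground-truth
    intervals at tIoU >= thresh; each ground truth may be used at most once."""
    if not preds or not gts:
        return 0.0
    used = set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in used and t_iou(p, g) >= thresh:
                used.add(i)
                break
    matched = len(used)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(preds), matched / len(gts)
    return 2 * precision * recall / (precision + recall)
```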

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Tasks: CBR = Composite Behavior Recognition, ABR = Atomic Behavior Recognition, LVR = Landmark Visibility Reasoning. Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | CBR Acc | CBR tIoU@50 | CBR mtIoU | ABR Acc | ABR F1@50 | ABR mtIoU | LVR Acc | LVR F1@50 | LVR mtIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | |
| Gemini 3 Flash | 46.5%¹ | 49.3%¹ | 51.9%¹ | 54.3%¹ | 35.0%¹ | 52.6%² | 46.6% | 55.2%² | 44.9% | 44.3% |
| Gemini 3.1 Flash Lite | 31.9% | 22.3% | 29.5% | 32.2% | 19.7% | 40.0% | 40.6% | 53.7%³ | 43.1% | 41.6% |
| Mimo v2 Omni | 38.8%² | 27.8%³ | 33.1%³ | 35.8%³ | 29.1%³ | 53.6%¹ | 46.9%³ | 59.4%¹ | 52.5%³ | 53.7%¹ |
| Qwen 3.5-Flash | 36.8%³ | 32.6%² | 35.8%² | 40.1%² | 25.6% | 51.7%³ | 42.4% | 52.3% | 57.8%¹ | 51.3%² |
| Qwen 3.5-Plus | 28.6% | 2.4% | 9.0% | 9.2% | 30.0%² | 46.6% | 47.7%¹ | 53.2% | 53.9%² | 46.8%³ |
| Qwen 3.6-Plus | 29.6% | 9.8% | 16.5% | 16.9% | 28.3% | 45.9% | 47.3%² | 50.8% | 51.5% | 46.1% |
| **Open-source Models** | | | | | | | | | | |
| GLM 4.6V | 32.4% | 32.4%² | 1.3% | 6.7% | 33.4%² | 21.6% | 22.6% | 31.3% | 16.7% | 33.8% |
| Kimi K2.5 | 30.9% | 11.9% | 19.1% | 19.9% | 30.2%³ | 54.6%² | 49.2% | 50.5% | 49.7%³ | 45.9% |
| Qwen 3.5-397B-A17B | 23.4% | 5.2% | 1.7% | 2.2% | 24.6% | 29.9% | 30.0% | 40.4% | 20.4% | 31.8% |
| Qwen 3.5-122B-A10B | 30.6% | 21.2% | 28.4%¹ | 31.9%¹ | 22.1% | 39.0% | 34.6% | 48.4% | 46.4% | 49.7%² |
| Qwen 3.5-35B-A3B | 34.0% | 21.6% | 21.5%³ | 23.1%³ | 26.2% | 54.4% | 50.3%³ | 54.1%³ | 49.7%² | 48.6% |
| Qwen 3.5-27B | 31.3% | 14.6% | 21.9%² | 23.3%² | 28.0% | 54.5%³ | 50.7%² | 51.4% | 51.7%¹ | 48.7%³ |
| Qwen 3.5-9B | 29.2% | 9.9% | 10.1% | 11.0% | 27.2% | 55.9%¹ | 51.3%¹ | 50.7% | 33.5% | 36.8% |
| Qwen 3.5-4B | 16.5% | 6.4% | 1.3% | 1.4% | 9.4% | 3.0% | 2.8% | 33.9% | 44.3% | 52.0%¹ |
| InternVL 3.5-38B | 37.8%¹ | 15.2% | 11.1% | 12.8% | 34.2%¹ | 33.0% | 33.1% | 64.0%¹ | 20.0% | 27.5% |
| InternVL 3.5-30B-A3B | 28.0% | 15.3% | 5.8% | 8.6% | 24.8% | 24.2% | 24.6% | 43.9% | 7.9% | 12.6% |
| InternVL 3.5-14B | 36.6%² | 33.1%¹ | 3.9% | 11.3% | 22.4% | 15.8% | 17.4% | 54.5%² | 13.9% | 22.7% |
| InternVL 3.5-8B | 34.1%³ | 27.4% | 8.5% | 14.1% | 23.0% | 25.0% | 26.1% | 51.9% | 16.4% | 21.3% |
| InternVL 3.5-4B | 28.9% | 31.0%³ | 3.5% | 7.4% | 21.1% | 11.7% | 12.1% | 34.6% | 16.5% | 32.3% |
| **Fine-tuned Models** | | | | | | | | | | |
| SpaceR | 28.0%² | 24.0%² | 0.0%¹ | 2.6%² | 26.3%² | 15.3%³ | 13.9%³ | 33.7%² | 15.7%² | 30.8%² |
| SpaceThinker | 25.0%³ | 19.0%³ | 0.0%¹ | 1.6%³ | 22.4%³ | 18.1%¹ | 14.1%¹ | 33.7%² | 13.6%³ | 27.1%³ |
| SpaceOm | 30.9%¹ | 33.2%¹ | 0.0%¹ | 3.2%¹ | 27.0%¹ | 16.7%² | 14.0%² | 32.3%³ | 17.0%¹ | 34.2%¹ |
| ViLaSR | 17.0% | 5.3% | 0.0%¹ | 0.8% | 9.7% | 3.4% | 5.6% | 35.9%¹ | 10.1% | 20.6% |
| VST-7B-RL | 7.6% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% | 22.8% | 0.0% | 0.0% |
| VST-7B-SFT | 7.8% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% | 23.4% | 4.1% | 8.1% |

Self-Aware

Self-aware capability leaderboard

This table reorganizes the benchmark by capability rather than by medium and focuses on self-aware reasoning as one coherent axis. It places the two self-aware image tasks together with composite and atomic flight-behavior recognition from video, so readers can inspect how well each model reasons about UAV self-state across both spatial and temporal evidence channels.

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Image tasks: LRPR = Landmark-Relative Position Reasoning, FOP = Future Observation Prediction. Video tasks: CBR = Composite Behavior Recognition, ABR = Atomic Behavior Recognition. Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | LRPR Acc | LRPR IoU@50 | LRPR mIoU | FOP Acc | FOP IoU@50 | FOP mIoU | CBR Acc | CBR tIoU@50 | CBR mtIoU | ABR Acc | ABR F1@50 | ABR mtIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | | | | |
| Gemini 3 Flash | 44.4%¹ | 47.6%¹ | 0.7% | 1.2% | 45.9%¹ | 0.7% | 0.9% | 49.3%¹ | 51.9%¹ | 54.3%¹ | 35.0%¹ | 52.6%² | 46.6% |
| Gemini 3.1 Flash Lite | 27.5%³ | 39.1%² | 0.0% | 0.0% | 28.6% | 2.5% | 3.5% | 22.3% | 29.5% | 32.2% | 19.7% | 40.0% | 40.6% |
| Qwen 3.6-Plus | 26.1% | 32.8%³ | 16.6% | 27.1%³ | 33.5%² | 14.0%³ | 26.7%² | 9.8% | 16.5% | 16.9% | 28.3% | 45.9% | 47.3%² |
| Qwen 3.5-Plus | 22.5% | 28.6% | 21.4%³ | 27.1% | 29.0%³ | 13.4% | 23.5%³ | 2.4% | 9.0% | 9.2% | 30.0%² | 46.6% | 47.7%¹ |
| Qwen 3.5-Flash | 28.7%² | 27.8% | 21.6%² | 29.8%² | 28.9% | 15.9%² | 23.3% | 32.6%² | 35.8%² | 40.1%² | 25.6% | 51.7%³ | 42.4% |
| Mimo v2 Omni | 26.7% | 28.7% | 38.0%¹ | 34.3%¹ | 21.3% | 29.8%¹ | 29.7%¹ | 27.8%³ | 33.1%³ | 35.8%³ | 29.1%³ | 53.6%¹ | 46.9%³ |
| **Open-source Models** | | | | | | | | | | | | | |
| GLM 4.6V | 30.9%¹ | 30.8% | 9.4% | 8.7% | 27.1% | 8.6% | 8.7% | 32.4%² | 1.3% | 6.7% | 33.4%¹ | 21.6% | 22.6% |
| Kimi K2.5 | 27.3% | 34.3%¹ | 27.1% | 30.4% | 32.7% | 15.0% | 21.7% | 11.9% | 19.1% | 19.9% | 30.2%² | 54.6%² | 49.2% |
| Qwen 3.5-397B-A17B | 23.3% | 27.4% | 3.3% | 12.6% | 35.8%³ | 10.0% | 21.4% | 5.2% | 1.7% | 2.2% | 24.6% | 29.9% | 30.0% |
| Qwen 3.5-122B-A10B | 28.4%² | 32.6%² | 29.1%² | 31.6%³ | 37.8%¹ | 20.6%¹ | 28.4%³ | 21.2% | 28.4%¹ | 31.9%¹ | 22.1% | 39.0% | 34.6% |
| Qwen 3.5-35B-A3B | 26.2% | 29.5% | 27.6%³ | 30.5% | 27.4% | 0.5% | 0.8% | 21.6% | 21.5%³ | 23.1%³ | 26.2% | 54.4% | 50.3%³ |
| Qwen 3.5-27B | 27.8%³ | 31.6%³ | 42.8%¹ | 39.9%¹ | 37.1%² | 17.1%³ | 25.8% | 14.6% | 21.9%² | 23.3%² | 28.0%³ | 54.5%³ | 50.7%² |
| Qwen 3.5-9B | 23.8% | 29.0% | 15.9% | 25.8% | 29.0% | 17.8%² | 29.8%¹ | 9.9% | 10.1% | 11.0% | 27.2% | 55.9%¹ | 51.3%¹ |
| Qwen 3.5-4B | 19.2% | 30.9% | 12.8% | 23.1% | 30.2% | 11.2% | 21.7% | 6.4% | 1.3% | 1.4% | 9.4% | 3.0% | 2.8% |
| InternVL 3.5-30B-A3B | 22.9% | 27.3% | 4.9% | 19.9% | 24.0% | 7.0% | 15.5% | 15.3% | 5.8% | 8.6% | 24.8% | 24.2% | 24.6% |
| InternVL 3.5-14B | 27.5% | 22.8% | 7.6% | 22.2% | 31.9% | 6.6% | 22.0% | 33.1%¹ | 3.9% | 11.3% | 22.4% | 15.8% | 17.4% |
| InternVL 3.5-8B | 25.9% | 28.5% | 19.7% | 34.2%² | 24.5% | 8.1% | 28.7%² | 27.4% | 8.5% | 14.1% | 23.0% | 25.0% | 26.1% |
| InternVL 3.5-4B | 26.0% | 25.6% | 3.1% | 19.8% | 26.4% | 10.8% | 23.2% | 31.0%³ | 3.5% | 7.4% | 21.1% | 11.7% | 12.1% |
| **Fine-tuned Models** | | | | | | | | | | | | | |
| SpaceR | 23.7%² | 20.6% | 2.4%² | 6.0%² | 24.0% | 1.2%³ | 1.9%³ | 24.0%² | 0.0%¹ | 2.6%² | 26.3%² | 15.3%³ | 13.9%³ |
| SpaceThinker | 22.3%³ | 22.6% | 1.2%³ | 3.0%³ | 25.4%¹ | 1.9%² | 5.7%² | 19.0%³ | 0.0%¹ | 1.6%³ | 22.4%³ | 18.1%¹ | 14.1%¹ |
| SpaceOm | 27.4%¹ | 24.7%³ | 2.8%¹ | 6.7%¹ | 24.5%² | 3.6%¹ | 10.9%¹ | 33.2%¹ | 0.0%¹ | 3.2%¹ | 27.0%¹ | 16.7%² | 14.0%² |
| ViLaSR | 14.0% | 18.8% | 0.0% | 0.0% | 22.1% | 0.0% | 0.0% | 5.3% | 0.0%¹ | 0.8% | 9.7% | 3.4% | 5.6% |
| VST-7B-RL | 13.3% | 29.2%¹ | 0.0% | 0.0% | 24.1%³ | 0.0% | 0.0% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% |
| VST-7B-SFT | 12.8% | 28.1%² | 0.0% | 0.0% | 23.0% | 0.0% | 0.0% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% |

Environment-Aware

Environment-aware capability leaderboard

This table groups the environment-aware tasks across image and video, and therefore reads the benchmark from the perspective of external-world understanding. It makes it easier to compare self-relative position reasoning, landmark-driven action decision, and landmark visibility reasoning under one shared environment-state perspective, together with the grounding and localization evidence attached to those decisions.

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Image tasks: SRPR = Self-Relative Position Reasoning, LDAD = Landmark-Driven Action Decision. Video task: LVR = Landmark Visibility Reasoning. Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | SRPR Acc | SRPR IoU@50 | SRPR mIoU | LDAD Acc | LDAD IoU@50 | LDAD mIoU | LVR Acc | LVR F1@50 | LVR mtIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | |
| Gemini 3 Flash | 54.2%¹ | 56.2%¹ | 0.1% | 0.9% | 51.3%² | 0.5% | 1.2% | 55.2%² | 44.9% | 44.3% |
| Gemini 3.1 Flash Lite | 41.3% | 39.9% | 0.4% | 0.8% | 30.4% | 0.4% | 0.6% | 53.7%³ | 43.1% | 41.6% |
| Qwen 3.6-Plus | 47.2% | 48.4%³ | 7.4%³ | 19.0%³ | 42.5% | 6.0% | 19.9%³ | 50.8% | 51.5% | 46.1% |
| Qwen 3.5-Plus | 49.9%³ | 47.6% | 9.7%¹ | 19.3%² | 48.9%³ | 8.3%³ | 20.0%² | 53.2% | 53.9%² | 46.8%³ |
| Qwen 3.5-Flash | 53.1%² | 52.4%² | 8.8%² | 19.4%¹ | 54.5%¹ | 8.9%² | 20.1%¹ | 52.3% | 57.8%¹ | 51.3%² |
| Mimo v2 Omni | 44.7% | 36.5% | 8.8%² | 17.4% | 38.2% | 9.2%¹ | 17.3% | 59.4%¹ | 52.5%³ | 53.7%¹ |
| **Open-source Models** | | | | | | | | | | |
| GLM 4.6V | 34.9% | 29.2% | 3.3% | 6.7% | 44.2% | 2.5% | 5.0% | 31.3% | 16.7% | 33.8% |
| Kimi K2.5 | 40.0% | 39.4% | 5.0% | 15.6% | 30.1% | 4.7% | 14.4% | 50.5% | 49.7%³ | 45.9% |
| Qwen 3.5-397B-A17B | 44.9% | 47.9% | 2.8% | 13.0% | 46.2% | 4.4% | 13.9% | 40.4% | 20.4% | 31.8% |
| Qwen 3.5-122B-A10B | 49.1% | 49.7%³ | 6.1% | 19.4%³ | 49.0% | 6.8% | 18.8%³ | 48.4% | 46.4% | 49.7%² |
| Qwen 3.5-35B-A3B | 54.9%¹ | 53.2%² | 6.1% | 14.5% | 57.5%¹ | 7.2%³ | 18.5% | 54.1%² | 49.7%² | 48.6% |
| Qwen 3.5-27B | 53.3%² | 57.8%¹ | 7.3%² | 20.4%² | 50.6%³ | 7.5%² | 21.0%² | 51.4% | 51.7%¹ | 48.7%³ |
| Qwen 3.5-9B | 52.1%³ | 53.2%² | 10.3%¹ | 20.9%¹ | 52.2%² | 11.5%¹ | 22.1%¹ | 50.7% | 33.5% | 36.8% |
| Qwen 3.5-4B | 42.9% | 47.5% | 6.7%³ | 17.7% | 47.5% | 6.3% | 18.0% | 33.9% | 44.3% | 52.0%¹ |
| InternVL 3.5-30B-A3B | 40.2% | 35.4% | 0.6% | 7.6% | 41.3% | 1.3% | 7.3% | 43.9% | 7.9% | 12.6% |
| InternVL 3.5-14B | 41.4% | 28.5% | 4.7% | 12.9% | 41.3% | 5.0% | 11.9% | 54.5%¹ | 13.9% | 22.7% |
| InternVL 3.5-8B | 35.8% | 24.7% | 4.2% | 12.3% | 30.8% | 2.5% | 11.4% | 51.9%³ | 16.4% | 21.3% |
| InternVL 3.5-4B | 31.2% | 22.4% | 1.0% | 8.0% | 36.7% | 1.0% | 7.5% | 34.6% | 16.5% | 32.3% |
| **Fine-tuned Models** | | | | | | | | | | |
| SpaceR | 41.6%² | 42.4% | 1.8%² | 5.4%² | 48.8%² | 2.3%¹ | 6.8%¹ | 33.7%² | 15.7%² | 30.8%² |
| SpaceThinker | 33.6% | 32.8% | 0.6%³ | 5.1%³ | 34.3% | 0.7%³ | 4.7%³ | 33.7%² | 13.6%³ | 27.1%³ |
| SpaceOm | 33.4% | 35.2% | 2.4%¹ | 8.6%¹ | 32.8% | 1.8%² | 6.5%² | 32.3%³ | 17.0%¹ | 34.2%¹ |
| ViLaSR | 42.9%¹ | 43.1%³ | 0.0% | 0.0% | 49.7%¹ | 0.0% | 0.0% | 35.9%¹ | 10.1% | 20.6% |
| VST-7B-RL | 38.0% | 47.2%¹ | 0.0% | 0.1% | 44.0% | 0.0% | 0.0% | 22.8% | 0.0% | 0.0% |
| VST-7B-SFT | 38.8%³ | 46.5%² | 0.0% | 0.2% | 46.6%³ | 0.0% | 0.2% | 23.4% | 4.1% | 8.1% |

Combined

Combined leaderboard

This table reports only models with valid results on both the image and video benchmarks. It summarizes overall cross-media performance together with media-wise and capability-wise aggregate scores, so readers can see whether a model's benchmark ranking comes from broad consistency across both media and both cognition branches or from strength concentrated in only one part of the benchmark.
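The Overall Acc column is consistent with an unweighted mean of the Image and Video scores (e.g. Mimo v2 Omni: (31.2% + 38.8%) / 2 = 35.0%). This is inferred from the numbers rather than a documented formula; the official aggregation may weight by sample count:

```python
def combined_overall(image_acc, video_acc):
    """Unweighted mean of the two media-wise scores; matches the reported
    Overall Acc to within display rounding (an inference, not a spec)."""
    return (image_acc + video_acc) / 2
```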

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | Image | Video | Self-Aware | Environment-Aware |
|---|---|---|---|---|---|
| **Closed-source Models** | | | | | |
| Gemini 3 Flash | 48.4%¹ | 50.2%¹ | 46.5%¹ | 44.4%¹ | 54.2%¹ |
| Gemini 3.1 Flash Lite | 33.2% | 34.5% | 31.9% | 27.5%³ | 41.3% |
| Qwen 3.6-Plus | 34.5% | 39.3%³ | 29.6% | 26.1% | 47.2% |
| Qwen 3.5-Plus | 33.5% | 38.5% | 28.6% | 22.5% | 49.9%³ |
| Qwen 3.5-Flash | 38.9%² | 40.9%² | 36.8%³ | 28.7%² | 53.1%² |
| Mimo v2 Omni | 35.0%³ | 31.2% | 38.8%² | 26.7% | 44.7% |
| **Open-source Models** | | | | | |
| GLM 4.6V | 32.6% | 32.8% | 32.4% | 30.9%¹ | 34.9% |
| Kimi K2.5 | 32.5% | 34.1% | 30.9% | 27.3% | 40.0% |
| Qwen 3.5-397B-A17B | 31.4% | 39.4% | 23.4% | 23.3% | 44.9% |
| Qwen 3.5-122B-A10B | 36.4%³ | 42.3%² | 30.6% | 28.4%² | 49.1% |
| Qwen 3.5-35B-A3B | 37.9%¹ | 41.9%³ | 34.0%³ | 26.2% | 54.9%¹ |
| Qwen 3.5-27B | 37.8%² | 44.3%¹ | 31.3% | 27.8%³ | 53.3%² |
| Qwen 3.5-9B | 35.1% | 40.9% | 29.2% | 23.8% | 52.1%³ |
| Qwen 3.5-4B | 27.8% | 39.0% | 16.5% | 19.2% | 42.9% |
| InternVL 3.5-30B-A3B | 30.0% | 32.0% | 28.0% | 22.9% | 40.2% |
| InternVL 3.5-14B | 33.9% | 31.1% | 36.6%¹ | 27.5% | 41.4% |
| InternVL 3.5-8B | 30.6% | 27.1% | 34.1%² | 25.9% | 35.8% |
| InternVL 3.5-4B | 28.3% | 27.8% | 28.9% | 26.0% | 31.2% |
| **Fine-tuned Models** | | | | | |
| SpaceR | 31.0%¹ | 34.0%³ | 28.0%² | 23.7%² | 41.6%² |
| SpaceThinker | 26.9%³ | 28.8% | 25.0%³ | 22.3%³ | 33.6% |
| SpaceOm | 30.1%² | 29.3% | 30.9%¹ | 27.4%¹ | 33.4% |
| ViLaSR | 25.2% | 33.4% | 17.0% | 14.0% | 42.9%¹ |
| VST-7B-RL | 21.9% | 36.1%¹ | 7.6% | 13.3% | 38.0% |
| VST-7B-SFT | 21.9% | 36.1%² | 7.8% | 12.8% | 38.8%³ |