Leaderboard

Leaderboard Tables

The tables below preserve the full benchmark reporting structure while making it easier to compare models across image tasks, video tasks, self-aware reasoning, and environment-aware reasoning.

Leaderboard

Current leading models

The current benchmark leaders are shown below as Top-3 views of combined performance, of each modality (image, video), and of each cognition dimension (self-aware, environment-aware). The `Acc` column in each view reports Overall Acc.

Image

| # | Model | Acc |
|---|-------|-----|
| 1 | Gemini 3 Flash | 50.2% |
| 2 | GPT 5.3 Chat | 47.8% |
| 3 | Qwen 3.5-27B | 44.3% |

Video

| # | Model | Acc |
|---|-------|-----|
| 1 | Gemini 3 Flash | 46.5% |
| 2 | Mimo v2 Omni | 38.8% |
| 3 | InternVL 3.5-38B | 37.8% |

Self-Aware

| # | Model | Acc |
|---|-------|-----|
| 1 | Gemini 3 Flash | 44.4% |
| 2 | GLM 4.6V | 30.9% |
| 3 | Qwen 3.5-Flash | 28.7% |

Environment-Aware

| # | Model | Acc |
|---|-------|-----|
| 1 | Qwen 3.5-35B-A3B | 54.9% |
| 2 | Gemini 3 Flash | 54.2% |
| 3 | Qwen 3.5-27B | 53.3% |

Combined

| # | Model | Acc |
|---|-------|-----|
| 1 | Gemini 3 Flash | 48.4% |
| 2 | Qwen 3.5-Flash | 38.9% |
| 3 | Qwen 3.5-35B-A3B | 37.9% |

Image Tasks

Image-task leaderboard

This table evaluates image-based dual cognition across the four released image tasks. It reports overall image performance together with per-task answer accuracy and spatial grounding quality, so readers can inspect not only whether a model selects the right option, but also whether that decision is supported by reliable landmark localization.
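For readers who want to reproduce the grounding columns: the conventional reading of these metrics is that `mIoU` averages the box IoU over all samples and `IoU@50` is the fraction of samples whose IoU reaches 0.5. The benchmark's exact protocol (box format, matching rules) is not specified here, so the sketch below only illustrates that conventional definition.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_scores(pred_boxes, gt_boxes, thresh=0.5):
    """Return (IoU@thresh, mIoU) over paired predicted/ground-truth boxes."""
    ious = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    iou_at = sum(i >= thresh for i in ious) / len(ious)
    miou = sum(ious) / len(ious)
    return iou_at, miou
```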

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Self-Aware tasks: LRPR = Landmark-Relative Position Reasoning, FOP = Future Observation Prediction. Environment-Aware tasks: SRPR = Self-Relative Position Reasoning, LDAD = Landmark-Driven Action Decision. Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | LRPR Acc | LRPR IoU@50 | LRPR mIoU | FOP Acc | FOP IoU@50 | FOP mIoU | SRPR Acc | SRPR IoU@50 | SRPR mIoU | LDAD Acc | LDAD IoU@50 | LDAD mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | | | | |
| Claude Sonnet 4.6 | 42.7%³ | 37.7%³ | 4.9% | 18.2% | 23.6% | 3.4% | 14.2% | 48.5% | 9.6%³ | 20.2%² | 61.0%² | 8.6% | 19.9% |
| GPT 5.3 Chat | 47.8%² | 35.2% | 11.9% | 23.5% | 37.3%² | 12.4% | 25.2%³ | 56.5%¹ | 14.9%¹ | 22.5%¹ | 62.3%¹ | 15.4%¹ | 22.5%¹ |
| Gemini 3 Flash | 50.2%¹ | 47.6%¹ | 0.7% | 1.2% | 45.9%¹ | 0.7% | 0.9% | 56.2%² | 0.1% | 0.9% | 51.3% | 0.5% | 1.2% |
| Gemini 3.1 Flash Lite | 34.5% | 39.1%² | 0.0% | 0.0% | 28.6% | 2.5% | 3.5% | 39.9% | 0.4% | 0.8% | 30.4% | 0.4% | 0.6% |
| Grok 4.1 Fast | 27.4% | 21.1% | 3.6% | 16.4% | 22.9% | 2.2% | 17.5% | 33.0% | 1.9% | 8.2% | 32.6% | 1.6% | 7.5% |
| Qwen 3.6-Plus | 39.3% | 32.8% | 16.6% | 27.1%³ | 33.5%³ | 14.0%³ | 26.7%² | 48.4% | 7.4% | 19.0% | 42.5% | 6.0% | 19.9% |
| Qwen 3.5-Plus | 38.5% | 28.6% | 21.4%³ | 27.1% | 29.0% | 13.4% | 23.5% | 47.6% | 9.7%² | 19.3% | 48.9% | 8.3% | 20.0%³ |
| Qwen 3.5-Flash | 40.9% | 27.8% | 21.6%² | 29.8%² | 28.9% | 15.9%² | 23.3% | 52.4%³ | 8.8% | 19.4%³ | 54.5%³ | 8.9%³ | 20.1%² |
| Mimo v2 Omni | 31.2% | 28.7% | 38.0%¹ | 34.3%¹ | 21.3% | 29.8%¹ | 29.7%¹ | 36.5% | 8.8% | 17.4% | 38.2% | 9.2%² | 17.3% |
| **Open-source Models** | | | | | | | | | | | | | |
| GLM 4.6V | 32.8% | 30.8% | 9.4% | 8.7% | 27.1% | 8.6% | 8.7% | 29.2% | 3.3% | 6.7% | 44.2% | 2.5% | 5.0% |
| Kimi K2.5 | 34.1% | 34.3%¹ | 27.1% | 30.4% | 32.7% | 15.0% | 21.7% | 39.4% | 5.0% | 15.6% | 30.1% | 4.7% | 14.4% |
| Qwen 3.5-397B-A17B | 39.4% | 27.4% | 3.3% | 12.6% | 35.8%³ | 10.0% | 21.4% | 47.9% | 2.8% | 13.0% | 46.2% | 4.4% | 13.9% |
| Qwen 3.5-122B-A10B | 42.3%² | 32.6%³ | 29.1%² | 31.6%³ | 37.8%¹ | 20.6%¹ | 28.4%³ | 49.7% | 6.1% | 19.4%³ | 49.0% | 6.8% | 18.8%³ |
| Qwen 3.5-35B-A3B | 41.9%³ | 29.5% | 27.6% | 30.5% | 27.4% | 0.5% | 0.8% | 53.2%² | 6.1% | 14.5% | 57.5%¹ | 7.2%³ | 18.5% |
| Qwen 3.5-27B | 44.3%¹ | 31.6% | 42.8%¹ | 39.9%¹ | 37.1%² | 17.1% | 25.8% | 57.8%¹ | 7.3%² | 20.4%² | 50.6%³ | 7.5%² | 21.0%² |
| Qwen 3.5-9B | 40.9% | 29.0% | 15.9% | 25.8% | 29.0% | 17.8%³ | 29.8%¹ | 53.2%² | 10.3%¹ | 20.9%¹ | 52.2%² | 11.5%¹ | 22.1%¹ |
| Qwen 3.5-4B | 39.0% | 30.9% | 12.8% | 23.1% | 30.2% | 11.2% | 21.7% | 47.5% | 6.7%³ | 17.7% | 47.5% | 6.3% | 18.0% |
| Intern S1-Pro | 28.4% | 29.4% | 17.4% | 26.8% | 26.2% | 9.6% | 21.3% | 27.8% | 2.7% | 13.7% | 30.3% | 3.2% | 14.0% |
| InternVL 3.5-241B-A28B | 37.7% | 32.9%² | 28.4%³ | 31.2% | 26.1% | 19.5%² | 25.3% | 50.1%³ | 5.4% | 15.8% | 41.8% | 5.1% | 15.0% |
| InternVL 3.5-30B-A3B | 32.0% | 27.3% | 4.9% | 19.9% | 24.0% | 7.0% | 15.5% | 35.4% | 0.6% | 7.6% | 41.3% | 1.3% | 7.3% |
| InternVL 3.5-14B | 31.1% | 22.8% | 7.6% | 22.2% | 31.9% | 6.6% | 22.0% | 28.5% | 4.7% | 12.9% | 41.3% | 5.0% | 11.9% |
| InternVL 3.5-8B | 27.1% | 28.5% | 19.7% | 34.2%² | 24.5% | 8.1% | 28.7%² | 24.7% | 4.2% | 12.3% | 30.8% | 2.5% | 11.4% |
| InternVL 3.5-4B | 27.8% | 25.6% | 3.1% | 19.8% | 26.4% | 10.8% | 23.2% | 22.4% | 1.0% | 8.0% | 36.7% | 1.0% | 7.5% |
| **Fine-tuned Models** | | | | | | | | | | | | | |
| SpaceR | 34.0%³ | 20.6% | 2.4%³ | 6.0%³ | 24.0% | 1.2%³ | 1.9%³ | 42.4% | 1.8%³ | 5.4%³ | 48.8%² | 2.3%² | 6.8%² |
| SpaceThinker | 28.8% | 22.6% | 1.2% | 3.0% | 25.4%¹ | 1.9%² | 5.7%² | 32.8% | 0.6% | 5.1% | 34.3% | 0.7% | 4.7% |
| SpaceOm | 29.3% | 24.7%³ | 2.8%² | 6.7%² | 24.5%² | 3.6%¹ | 10.9%¹ | 35.2% | 2.4%² | 8.6%¹ | 32.8% | 1.8%³ | 6.5%³ |
| SenseNova-SI-1.2 | 19.6% | 17.5% | 10.7%¹ | 10.6%¹ | 0.1% | 0.0% | 0.1% | 15.3% | 3.2%¹ | 5.8%² | 45.4% | 6.0%¹ | 8.5%¹ |
| ViLaSR | 33.4% | 18.8% | 0.0% | 0.0% | 22.1% | 0.0% | 0.0% | 43.1%³ | 0.0% | 0.0% | 49.7%¹ | 0.0% | 0.0% |
| VST-7B-RL | 36.1%¹ | 29.2%¹ | 0.0% | 0.0% | 24.1%³ | 0.0% | 0.0% | 47.2%¹ | 0.0% | 0.1% | 44.0% | 0.0% | 0.0% |
| VST-7B-SFT | 36.1%² | 28.1%² | 0.0% | 0.0% | 23.0% | 0.0% | 0.0% | 46.5%² | 0.0% | 0.2% | 46.6%³ | 0.0% | 0.2% |
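As the tied entries suggest (e.g. two 53.2% values both marked second, with 50.1% third), the within-group Top-3 marking behaves like dense ranking: tied values share a rank and the next distinct value takes the next rank. A minimal sketch (the function name is illustrative, not from the benchmark's code):

```python
def top3_ranks(values):
    """Map each value to its dense rank (1 = best) if it is among the top 3
    distinct values of the group, else None."""
    distinct = sorted(set(values), reverse=True)[:3]
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [rank_of.get(v) for v in values]
```

For example, `top3_ranks([57.8, 53.2, 53.2, 50.1, 47.5])` yields `[1, 2, 2, 3, None]`.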

Video Tasks

Video-task leaderboard

This table evaluates video-based dual cognition across composite behavior recognition, atomic behavior recognition, and landmark visibility reasoning. It reports overall video performance together with semantic and temporal columns for each task, so readers can judge whether a model recognizes the right flight event or landmark state and whether it also localizes the corresponding interval correctly.
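Assuming the conventional definitions, `tIoU@50` and `F1@50` threshold the temporal IoU between predicted and ground-truth intervals at 0.5, while `mtIoU` averages the temporal IoU itself. A sketch under those assumptions (the interval format and the greedy matching are illustrative, not the benchmark's documented protocol):

```python
def t_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def f1_at(preds, gts, thresh=0.5):
    """Detection-style F1: greedily match predicted intervals to ground-truth
    intervals at tIoU >= thresh; each ground truth may be used at most once."""
    if not preds or not gts:
        return 0.0
    used = set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in used and t_iou(p, g) >= thresh:
                used.add(i)
                break
    matched = len(used)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(preds), matched / len(gts)
    return 2 * precision * recall / (precision + recall)
```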

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Tasks: CBR = Composite Behavior Recognition, ABR = Atomic Behavior Recognition, LVR = Landmark Visibility Reasoning. Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | CBR Acc | CBR tIoU@50 | CBR mtIoU | ABR Acc | ABR F1@50 | ABR mtIoU | LVR Acc | LVR F1@50 | LVR mtIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | |
| Gemini 3 Flash | 46.5%¹ | 49.3%¹ | 51.9%¹ | 54.3%¹ | 35.0%¹ | 52.6%² | 46.6% | 55.2%² | 44.9% | 44.3% |
| Gemini 3.1 Flash Lite | 31.9% | 22.3% | 29.5% | 32.2% | 19.7% | 40.0% | 40.6% | 53.7%³ | 43.1% | 41.6% |
| Mimo v2 Omni | 38.8%² | 27.8%³ | 33.1%³ | 35.8%³ | 29.1%³ | 53.6%¹ | 46.9%³ | 59.4%¹ | 52.5%³ | 53.7%¹ |
| Qwen 3.5-Flash | 36.8%³ | 32.6%² | 35.8%² | 40.1%² | 25.6% | 51.7%³ | 42.4% | 52.3% | 57.8%¹ | 51.3%² |
| Qwen 3.5-Plus | 28.6% | 2.4% | 9.0% | 9.2% | 30.0%² | 46.6% | 47.7%¹ | 53.2% | 53.9%² | 46.8%³ |
| Qwen 3.6-Plus | 29.6% | 9.8% | 16.5% | 16.9% | 28.3% | 45.9% | 47.3%² | 50.8% | 51.5% | 46.1% |
| **Open-source Models** | | | | | | | | | | |
| GLM 4.6V | 32.4% | 32.4%² | 1.3% | 6.7% | 33.4%² | 21.6% | 22.6% | 31.3% | 16.7% | 33.8% |
| Kimi K2.5 | 30.9% | 11.9% | 19.1% | 19.9% | 30.2%³ | 54.6%² | 49.2% | 50.5% | 49.7%³ | 45.9% |
| Qwen 3.5-397B-A17B | 23.4% | 5.2% | 1.7% | 2.2% | 24.6% | 29.9% | 30.0% | 40.4% | 20.4% | 31.8% |
| Qwen 3.5-122B-A10B | 30.6% | 21.2% | 28.4%¹ | 31.9%¹ | 22.1% | 39.0% | 34.6% | 48.4% | 46.4% | 49.7%² |
| Qwen 3.5-35B-A3B | 34.0% | 21.6% | 21.5%³ | 23.1%³ | 26.2% | 54.4% | 50.3%³ | 54.1%³ | 49.7%² | 48.6% |
| Qwen 3.5-27B | 31.3% | 14.6% | 21.9%² | 23.3%² | 28.0% | 54.5%³ | 50.7%² | 51.4% | 51.7%¹ | 48.7%³ |
| Qwen 3.5-9B | 29.2% | 9.9% | 10.1% | 11.0% | 27.2% | 55.9%¹ | 51.3%¹ | 50.7% | 33.5% | 36.8% |
| Qwen 3.5-4B | 16.5% | 6.4% | 1.3% | 1.4% | 9.4% | 3.0% | 2.8% | 33.9% | 44.3% | 52.0%¹ |
| InternVL 3.5-38B | 37.8%¹ | 15.2% | 11.1% | 12.8% | 34.2%¹ | 33.0% | 33.1% | 64.0%¹ | 20.0% | 27.5% |
| InternVL 3.5-30B-A3B | 28.0% | 15.3% | 5.8% | 8.6% | 24.8% | 24.2% | 24.6% | 43.9% | 7.9% | 12.6% |
| InternVL 3.5-14B | 36.6%² | 33.1%¹ | 3.9% | 11.3% | 22.4% | 15.8% | 17.4% | 54.5%² | 13.9% | 22.7% |
| InternVL 3.5-8B | 34.1%³ | 27.4% | 8.5% | 14.1% | 23.0% | 25.0% | 26.1% | 51.9% | 16.4% | 21.3% |
| InternVL 3.5-4B | 28.9% | 31.0%³ | 3.5% | 7.4% | 21.1% | 11.7% | 12.1% | 34.6% | 16.5% | 32.3% |
| **Fine-tuned Models** | | | | | | | | | | |
| SpaceR | 28.0%² | 24.0%² | 0.0%¹ | 2.6%² | 26.3%² | 15.3%³ | 13.9%³ | 33.7%² | 15.7%² | 30.8%² |
| SpaceThinker | 25.0%³ | 19.0%³ | 0.0%¹ | 1.6%³ | 22.4%³ | 18.1%¹ | 14.1%¹ | 33.7%² | 13.6%³ | 27.1%³ |
| SpaceOm | 30.9%¹ | 33.2%¹ | 0.0%¹ | 3.2%¹ | 27.0%¹ | 16.7%² | 14.0%² | 32.3%³ | 17.0%¹ | 34.2%¹ |
| ViLaSR | 17.0% | 5.3% | 0.0%¹ | 0.8% | 9.7% | 3.4% | 5.6% | 35.9%¹ | 10.1% | 20.6% |
| VST-7B-RL | 7.6% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% | 22.8% | 0.0% | 0.0% |
| VST-7B-SFT | 7.8% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% | 23.4% | 4.1% | 8.1% |

Self-Aware

Self-aware capability leaderboard

This table reorganizes the benchmark by capability rather than by medium and focuses on self-aware reasoning as one coherent axis. It places the two self-aware image tasks together with composite and atomic flight-behavior recognition from video, so readers can inspect how well each model reasons about UAV self-state across both spatial and temporal evidence channels.

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Image tasks: LRPR = Landmark-Relative Position Reasoning, FOP = Future Observation Prediction. Video tasks: CBR = Composite Behavior Recognition, ABR = Atomic Behavior Recognition. Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | LRPR Acc | LRPR IoU@50 | LRPR mIoU | FOP Acc | FOP IoU@50 | FOP mIoU | CBR Acc | CBR tIoU@50 | CBR mtIoU | ABR Acc | ABR F1@50 | ABR mtIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | | | | |
| Gemini 3 Flash | 44.4%¹ | 47.6%¹ | 0.7% | 1.2% | 45.9%¹ | 0.7% | 0.9% | 49.3%¹ | 51.9%¹ | 54.3%¹ | 35.0%¹ | 52.6%² | 46.6% |
| Gemini 3.1 Flash Lite | 27.5%³ | 39.1%² | 0.0% | 0.0% | 28.6% | 2.5% | 3.5% | 22.3% | 29.5% | 32.2% | 19.7% | 40.0% | 40.6% |
| Qwen 3.6-Plus | 26.1% | 32.8%³ | 16.6% | 27.1%³ | 33.5%² | 14.0%³ | 26.7%² | 9.8% | 16.5% | 16.9% | 28.3% | 45.9% | 47.3%² |
| Qwen 3.5-Plus | 22.5% | 28.6% | 21.4%³ | 27.1% | 29.0%³ | 13.4% | 23.5%³ | 2.4% | 9.0% | 9.2% | 30.0%² | 46.6% | 47.7%¹ |
| Qwen 3.5-Flash | 28.7%² | 27.8% | 21.6%² | 29.8%² | 28.9% | 15.9%² | 23.3% | 32.6%² | 35.8%² | 40.1%² | 25.6% | 51.7%³ | 42.4% |
| Mimo v2 Omni | 26.7% | 28.7% | 38.0%¹ | 34.3%¹ | 21.3% | 29.8%¹ | 29.7%¹ | 27.8%³ | 33.1%³ | 35.8%³ | 29.1%³ | 53.6%¹ | 46.9%³ |
| **Open-source Models** | | | | | | | | | | | | | |
| GLM 4.6V | 30.9%¹ | 30.8% | 9.4% | 8.7% | 27.1% | 8.6% | 8.7% | 32.4%² | 1.3% | 6.7% | 33.4%¹ | 21.6% | 22.6% |
| Kimi K2.5 | 27.3% | 34.3%¹ | 27.1% | 30.4% | 32.7% | 15.0% | 21.7% | 11.9% | 19.1% | 19.9% | 30.2%² | 54.6%² | 49.2% |
| Qwen 3.5-397B-A17B | 23.3% | 27.4% | 3.3% | 12.6% | 35.8%³ | 10.0% | 21.4% | 5.2% | 1.7% | 2.2% | 24.6% | 29.9% | 30.0% |
| Qwen 3.5-122B-A10B | 28.4%² | 32.6%² | 29.1%² | 31.6%³ | 37.8%¹ | 20.6%¹ | 28.4%³ | 21.2% | 28.4%¹ | 31.9%¹ | 22.1% | 39.0% | 34.6% |
| Qwen 3.5-35B-A3B | 26.2% | 29.5% | 27.6%³ | 30.5% | 27.4% | 0.5% | 0.8% | 21.6% | 21.5%³ | 23.1%³ | 26.2% | 54.4% | 50.3%³ |
| Qwen 3.5-27B | 27.8%³ | 31.6%³ | 42.8%¹ | 39.9%¹ | 37.1%² | 17.1%³ | 25.8% | 14.6% | 21.9%² | 23.3%² | 28.0%³ | 54.5%³ | 50.7%² |
| Qwen 3.5-9B | 23.8% | 29.0% | 15.9% | 25.8% | 29.0% | 17.8%² | 29.8%¹ | 9.9% | 10.1% | 11.0% | 27.2% | 55.9%¹ | 51.3%¹ |
| Qwen 3.5-4B | 19.2% | 30.9% | 12.8% | 23.1% | 30.2% | 11.2% | 21.7% | 6.4% | 1.3% | 1.4% | 9.4% | 3.0% | 2.8% |
| InternVL 3.5-30B-A3B | 22.9% | 27.3% | 4.9% | 19.9% | 24.0% | 7.0% | 15.5% | 15.3% | 5.8% | 8.6% | 24.8% | 24.2% | 24.6% |
| InternVL 3.5-14B | 27.5% | 22.8% | 7.6% | 22.2% | 31.9% | 6.6% | 22.0% | 33.1%¹ | 3.9% | 11.3% | 22.4% | 15.8% | 17.4% |
| InternVL 3.5-8B | 25.9% | 28.5% | 19.7% | 34.2%² | 24.5% | 8.1% | 28.7%² | 27.4% | 8.5% | 14.1% | 23.0% | 25.0% | 26.1% |
| InternVL 3.5-4B | 26.0% | 25.6% | 3.1% | 19.8% | 26.4% | 10.8% | 23.2% | 31.0%³ | 3.5% | 7.4% | 21.1% | 11.7% | 12.1% |
| **Fine-tuned Models** | | | | | | | | | | | | | |
| SpaceR | 23.7%² | 20.6% | 2.4%² | 6.0%² | 24.0% | 1.2%³ | 1.9%³ | 24.0%² | 0.0%¹ | 2.6%² | 26.3%² | 15.3%³ | 13.9%³ |
| SpaceThinker | 22.3%³ | 22.6% | 1.2%³ | 3.0%³ | 25.4%¹ | 1.9%² | 5.7%² | 19.0%³ | 0.0%¹ | 1.6%³ | 22.4%³ | 18.1%¹ | 14.1%¹ |
| SpaceOm | 27.4%¹ | 24.7%³ | 2.8%¹ | 6.7%¹ | 24.5%² | 3.6%¹ | 10.9%¹ | 33.2%¹ | 0.0%¹ | 3.2%¹ | 27.0%¹ | 16.7%² | 14.0%² |
| ViLaSR | 14.0% | 18.8% | 0.0% | 0.0% | 22.1% | 0.0% | 0.0% | 5.3% | 0.0%¹ | 0.8% | 9.7% | 3.4% | 5.6% |
| VST-7B-RL | 13.3% | 29.2%¹ | 0.0% | 0.0% | 24.1%³ | 0.0% | 0.0% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% |
| VST-7B-SFT | 12.8% | 28.1%² | 0.0% | 0.0% | 23.0% | 0.0% | 0.0% | 0.0% | 0.0%¹ | 0.0% | 0.0% | 0.0% | 0.0% |

Environment-Aware

Environment-aware capability leaderboard

This table groups the environment-aware tasks across image and video, and therefore reads the benchmark from the perspective of external-world understanding. It makes it easier to compare self-relative position reasoning, landmark-driven action decision, and landmark visibility reasoning under one shared environment-state perspective, together with the grounding and localization evidence attached to those decisions.

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Image tasks: SRPR = Self-Relative Position Reasoning, LDAD = Landmark-Driven Action Decision. Video task: LVR = Landmark Visibility Reasoning. Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | SRPR Acc | SRPR IoU@50 | SRPR mIoU | LDAD Acc | LDAD IoU@50 | LDAD mIoU | LVR Acc | LVR F1@50 | LVR mtIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-source Models** | | | | | | | | | | |
| Gemini 3 Flash | 54.2%¹ | 56.2%¹ | 0.1% | 0.9% | 51.3%² | 0.5% | 1.2% | 55.2%² | 44.9% | 44.3% |
| Gemini 3.1 Flash Lite | 41.3% | 39.9% | 0.4% | 0.8% | 30.4% | 0.4% | 0.6% | 53.7%³ | 43.1% | 41.6% |
| Qwen 3.6-Plus | 47.2% | 48.4%³ | 7.4%³ | 19.0%³ | 42.5% | 6.0% | 19.9%³ | 50.8% | 51.5% | 46.1% |
| Qwen 3.5-Plus | 49.9%³ | 47.6% | 9.7%¹ | 19.3%² | 48.9%³ | 8.3%³ | 20.0%² | 53.2% | 53.9%² | 46.8%³ |
| Qwen 3.5-Flash | 53.1%² | 52.4%² | 8.8%² | 19.4%¹ | 54.5%¹ | 8.9%² | 20.1%¹ | 52.3% | 57.8%¹ | 51.3%² |
| Mimo v2 Omni | 44.7% | 36.5% | 8.8%² | 17.4% | 38.2% | 9.2%¹ | 17.3% | 59.4%¹ | 52.5%³ | 53.7%¹ |
| **Open-source Models** | | | | | | | | | | |
| GLM 4.6V | 34.9% | 29.2% | 3.3% | 6.7% | 44.2% | 2.5% | 5.0% | 31.3% | 16.7% | 33.8% |
| Kimi K2.5 | 40.0% | 39.4% | 5.0% | 15.6% | 30.1% | 4.7% | 14.4% | 50.5% | 49.7%³ | 45.9% |
| Qwen 3.5-397B-A17B | 44.9% | 47.9% | 2.8% | 13.0% | 46.2% | 4.4% | 13.9% | 40.4% | 20.4% | 31.8% |
| Qwen 3.5-122B-A10B | 49.1% | 49.7%³ | 6.1% | 19.4%³ | 49.0% | 6.8% | 18.8%³ | 48.4% | 46.4% | 49.7%² |
| Qwen 3.5-35B-A3B | 54.9%¹ | 53.2%² | 6.1% | 14.5% | 57.5%¹ | 7.2%³ | 18.5% | 54.1%² | 49.7%² | 48.6% |
| Qwen 3.5-27B | 53.3%² | 57.8%¹ | 7.3%² | 20.4%² | 50.6%³ | 7.5%² | 21.0%² | 51.4% | 51.7%¹ | 48.7%³ |
| Qwen 3.5-9B | 52.1%³ | 53.2%² | 10.3%¹ | 20.9%¹ | 52.2%² | 11.5%¹ | 22.1%¹ | 50.7% | 33.5% | 36.8% |
| Qwen 3.5-4B | 42.9% | 47.5% | 6.7%³ | 17.7% | 47.5% | 6.3% | 18.0% | 33.9% | 44.3% | 52.0%¹ |
| InternVL 3.5-30B-A3B | 40.2% | 35.4% | 0.6% | 7.6% | 41.3% | 1.3% | 7.3% | 43.9% | 7.9% | 12.6% |
| InternVL 3.5-14B | 41.4% | 28.5% | 4.7% | 12.9% | 41.3% | 5.0% | 11.9% | 54.5%¹ | 13.9% | 22.7% |
| InternVL 3.5-8B | 35.8% | 24.7% | 4.2% | 12.3% | 30.8% | 2.5% | 11.4% | 51.9%³ | 16.4% | 21.3% |
| InternVL 3.5-4B | 31.2% | 22.4% | 1.0% | 8.0% | 36.7% | 1.0% | 7.5% | 34.6% | 16.5% | 32.3% |
| **Fine-tuned Models** | | | | | | | | | | |
| SpaceR | 41.6%² | 42.4% | 1.8%² | 5.4%² | 48.8%² | 2.3%¹ | 6.8%¹ | 33.7%² | 15.7%² | 30.8%² |
| SpaceThinker | 33.6% | 32.8% | 0.6%³ | 5.1%³ | 34.3% | 0.7%³ | 4.7%³ | 33.7%² | 13.6%³ | 27.1%³ |
| SpaceOm | 33.4% | 35.2% | 2.4%¹ | 8.6%¹ | 32.8% | 1.8%² | 6.5%² | 32.3%³ | 17.0%¹ | 34.2%¹ |
| ViLaSR | 42.9%¹ | 43.1%³ | 0.0% | 0.0% | 49.7%¹ | 0.0% | 0.0% | 35.9%¹ | 10.1% | 20.6% |
| VST-7B-RL | 38.0% | 47.2%¹ | 0.0% | 0.1% | 44.0% | 0.0% | 0.0% | 22.8% | 0.0% | 0.0% |
| VST-7B-SFT | 38.8%³ | 46.5%² | 0.0% | 0.2% | 46.6%³ | 0.0% | 0.2% | 23.4% | 4.1% | 8.1% |

Combined

Combined leaderboard

This table reports only models with valid results on both the image and video benchmarks. It summarizes overall cross-media performance together with media-wise and capability-wise aggregate scores, so readers can see whether a model's benchmark ranking comes from broad consistency across both media and both cognition branches or from strength concentrated in only one part of the benchmark.
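The Overall Acc column is consistent with an unweighted mean of the Image and Video scores (e.g. Mimo v2 Omni: (31.2% + 38.8%) / 2 = 35.0%). This is inferred from the numbers rather than a documented formula; the official aggregation may weight by sample count:

```python
def combined_overall(image_acc, video_acc):
    """Unweighted mean of the two media-wise scores; matches the reported
    Overall Acc to within display rounding (an inference, not a spec)."""
    return (image_acc + video_acc) / 2
```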

Top-3 rank markers are computed independently within each model group; tied values share the same displayed rank.

Superscripts (¹ ² ³) mark the top-3 values within each model group.

| Model | Overall Acc | Image | Video | Self-Aware | Environment-Aware |
|---|---|---|---|---|---|
| **Closed-source Models** | | | | | |
| Gemini 3 Flash | 48.4%¹ | 50.2%¹ | 46.5%¹ | 44.4%¹ | 54.2%¹ |
| Gemini 3.1 Flash Lite | 33.2% | 34.5% | 31.9% | 27.5%³ | 41.3% |
| Qwen 3.6-Plus | 34.5% | 39.3%³ | 29.6% | 26.1% | 47.2% |
| Qwen 3.5-Plus | 33.5% | 38.5% | 28.6% | 22.5% | 49.9%³ |
| Qwen 3.5-Flash | 38.9%² | 40.9%² | 36.8%³ | 28.7%² | 53.1%² |
| Mimo v2 Omni | 35.0%³ | 31.2% | 38.8%² | 26.7% | 44.7% |
| **Open-source Models** | | | | | |
| GLM 4.6V | 32.6% | 32.8% | 32.4% | 30.9%¹ | 34.9% |
| Kimi K2.5 | 32.5% | 34.1% | 30.9% | 27.3% | 40.0% |
| Qwen 3.5-397B-A17B | 31.4% | 39.4% | 23.4% | 23.3% | 44.9% |
| Qwen 3.5-122B-A10B | 36.4%³ | 42.3%² | 30.6% | 28.4%² | 49.1% |
| Qwen 3.5-35B-A3B | 37.9%¹ | 41.9%³ | 34.0%³ | 26.2% | 54.9%¹ |
| Qwen 3.5-27B | 37.8%² | 44.3%¹ | 31.3% | 27.8%³ | 53.3%² |
| Qwen 3.5-9B | 35.1% | 40.9% | 29.2% | 23.8% | 52.1%³ |
| Qwen 3.5-4B | 27.8% | 39.0% | 16.5% | 19.2% | 42.9% |
| InternVL 3.5-30B-A3B | 30.0% | 32.0% | 28.0% | 22.9% | 40.2% |
| InternVL 3.5-14B | 33.9% | 31.1% | 36.6%¹ | 27.5% | 41.4% |
| InternVL 3.5-8B | 30.6% | 27.1% | 34.1%² | 25.9% | 35.8% |
| InternVL 3.5-4B | 28.3% | 27.8% | 28.9% | 26.0% | 31.2% |
| **Fine-tuned Models** | | | | | |
| SpaceR | 31.0%¹ | 34.0%³ | 28.0%² | 23.7%² | 41.6%² |
| SpaceThinker | 26.9%³ | 28.8% | 25.0%³ | 22.3%³ | 33.6% |
| SpaceOm | 30.1%² | 29.3% | 30.9%¹ | 27.4%¹ | 33.4% |
| ViLaSR | 25.2% | 33.4% | 17.0% | 14.0% | 42.9%¹ |
| VST-7B-RL | 21.9% | 36.1%¹ | 7.6% | 13.3% | 38.0% |
| VST-7B-SFT | 21.9% | 36.1%² | 7.8% | 12.8% | 38.8%³ |