InfraResolution Bench

Leaderboard

Model rankings on InfraResolution Bench

Composite scores weight the gold slice at 60% and the synthetic slice at 40%. Runs are hosted on Prime Lab and scored deterministically, with no LLM judge.
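The weighting above can be checked against the table with a few lines of Python. This is a minimal sketch, not the benchmark's own scoring code: the function name `composite_score`, the constant names, and the rounding to one decimal place are assumptions made for illustration.

```python
# Hypothetical helper: reproduces the stated 60% gold + 40% synthetic weighting.
GOLD_WEIGHT = 0.6        # weight of the gold slice, per the leaderboard description
SYNTHETIC_WEIGHT = 0.4   # weight of the synthetic slice


def composite_score(gold_pct: float, synthetic_pct: float) -> float:
    """Return the weighted composite, in percent, rounded to one decimal (assumed)."""
    return round(GOLD_WEIGHT * gold_pct + SYNTHETIC_WEIGHT * synthetic_pct, 1)


# claude-sonnet-4.6: Gold 83.1%, Synthetic 92.2%
print(composite_score(83.1, 92.2))  # 86.7

# grok-4.20: a zero-score gold slice caps the composite at 40% of synthetic
print(composite_score(0.0, 85.3))   # 34.1
```

Because the Gold and Synthetic columns are themselves rounded, a value recomputed this way can differ from the published Overall figure by 0.1 in the last digit.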

Top models

Model	Overall
claude-sonnet-4.6	86.7%
gemini-3.1-pro-preview	86.4%
claude-opus-4.6	85.0%
gpt-5.3-codex	84.5%
gpt-5.1	84.0%
claude-opus-4.5	84.0%
claude-sonnet-4.5	83.8%
gpt-5.2-codex	83.5%
gpt-5	83.4%
glm-5-turbo	83.4%

Score vs Cost to Run

[Chart: overall score vs. cost to run. Legend (providers): anthropic, arcee-ai, deepseek, google, minimax, moonshotai, nvidia, openai, prime-intellect, qwen, stepfun, x-ai, xiaomi, z-ai]

All models

#	Model	Status	Overall	Cost to Run	Gold	Synthetic	Samples
1	claude-sonnet-4.6	stable	86.7%	$5.32	83.1%	92.2%	255
2	gemini-3.1-pro-preview	stable	86.4%	$1.74	87.7%	84.4%	255
3	claude-opus-4.6	stable	85.0%	$26.71	84.9%	85.2%	255
4	gpt-5.3-codex	stable	84.5%	$1.73	84.9%	83.9%	255
5	gpt-5.1	stable	84.0%	$1.75	85.5%	81.9%	255
6	claude-opus-4.5	stable	84.0%	$25.93	83.5%	84.7%	255
7	claude-sonnet-4.5	stable	83.8%	$4.25	83.8%	83.8%	255
8	gpt-5.2-codex	stable	83.5%	$1.29	83.0%	84.3%	255
9	gpt-5	stable	83.4%	$13.45	83.4%	83.5%	255
10	glm-5-turbo	stable	83.4%	$0.32	83.9%	82.7%	255
11	glm-5	stable	82.7%	$0.79	82.7%	82.8%	255
12	qwen3.5-397b-a17b	stable	82.4%	$1.37	81.2%	84.3%	255
13	grok-4	stable	82.3%	$7.20	80.4%	85.1%	255
14	gpt-5.4-mini	stable	82.0%	$0.26	80.3%	84.6%	255
15	qwen3.5-27b	stable	81.5%	$0.29	80.1%	83.5%	255
16	qwen3.6-plus:free	stable	81.3%	Free	80.0%	83.2%	255
17	gpt-5.4	stable	80.9%	$1.28	78.5%	84.6%	255
18	mimo-v2.5	stable	80.8%	-	80.9%	80.6%	255
19	glm-5.1	stable	80.8%	$3.37	80.3%	81.5%	255
20	kimi-k2.5	stable	80.2%	$0.64	78.5%	82.8%	255
21	gemini-3-flash-preview	stable	79.6%	$0.10	79.7%	79.5%	255
22	minimax-m2.7	stable	78.5%	$0.66	77.5%	80.2%	255
23	gpt-5.2	stable	78.5%	$1.30	79.0%	77.6%	255
24	mimo-v2-pro	stable	78.3%	$0.79	77.3%	79.8%	255
25	INTELLECT-3.1	stable	78.1%	$0.57	76.5%	80.7%	255
26	gpt-5.4-nano	stable	77.2%	$0.13	75.8%	79.4%	255
27	minimax-m2	stable	77.2%	$0.66	77.1%	77.3%	255
28	minimax-m2.5	stable	76.0%	$0.66	76.4%	75.4%	255
29	step-3.5-flash	stable	75.6%	$0.13	76.4%	74.5%	255
30	gemma-4-26b-a4b-it	stable	74.1%	$0.16	70.4%	79.7%	255
31	trinity-large-thinking	stable	74.1%	$0.16	73.4%	75.2%	255
32	qwen3.5-35b-a3b	stable	73.7%	$0.29	77.4%	68.1%	255
33	intellect-3	partial	72.4%	$7.25	-	72.4%	255
34	minimax-m2.1	stable	70.5%	$0.66	66.8%	76.0%	255
35	mimo-v2-omni	stable	70.1%	$0.79	73.3%	65.2%	255
36	mimo-v2-flash	stable	64.9%	$0.13	64.1%	66.0%	255
37	deepseek-v3.2	stable	62.5%	$0.36	58.7%	68.0%	255
38	gemma-4-31b-it	stable	61.4%	$0.16	62.0%	60.6%	255
39	nemotron-3-super-120b-a12b	stable	47.7%	$0.23	46.0%	50.3%	255
40	grok-4.20	unstable	34.1%	$3.60	0.0%	85.3%	150
41	kimi-k2.6	unstable	33.6%	-	34.2%	32.7%	255
42	qwen3-max-thinking	unstable	33.1%	$3.30	0.0%	82.9%	255
43	ling-2.6-flash:free	unstable	0.8%	-	0.0%	2.0%	255
44	hy3-preview:free	stable	0.6%	-	0.3%	0.9%	255
-	claude-opus-4.1	not_run	-	-	-	-	-
-	gemini-3-pro-preview	not_run	-	-	-	-	-
-	gpt-5-codex	not_run	-	-	-	-	-
-	gpt-5-mini	not_run	-	-	-	-	-
-	gpt-5.1-codex	partial	-	-	-	90.1%	105
-	mimo-v2.5-pro	partial	-	-	54.5%	-	255

Sweep Status

Pending and blocked runs

Model	Status	Eval ID	Reason
PrimeIntellect/INTELLECT-3.1	stable	intellect31-adapter-gold-tools-15x10-2026-04-06 / intellect31-adapter-synthetic-tools-21x5-2026-04-06	-
prime-intellect/intellect-3	partial	lmb2bjp3gbrempnw475fa7fu	Gold eval is still running; only the synthetic score is shown.
x-ai/grok-4.20	unstable	b37mxyhieooooj87bbh4ud9n / zz0oog1a8rgpi6ilbfzeyqfs	Completed pair includes a zero-score slice that appears to be a provider or tool-loop failure. A synthetic 21x5 rerun is in progress.
qwen/qwen3-max-thinking	unstable	zasjjusmsuh50eerjhp9fgvw / awym3p6nzb2sptgm57rgklfz	Completed pair includes a zero-score slice that appears to be a provider or tool-loop failure.
moonshotai/kimi-k2.6	unstable	uhuzxzryxli6q1jx35lum0ti / toq29oqf3yz4yvmdqb3e1j9l	Published pair is not trustworthy: both slices show widespread APIStatusError failures, and a fresh 3x2 gold smoke failed 6/6 with OpenRouter 402 insufficient credits.
inclusionai/ling-2.6-flash:free	unstable	dd70fxe7zq36e1r3uae3src6 / xh3yn5fd17owymevvkssp4zw	Completed pair includes a zero-score gold slice that appears to be a provider or output-loop failure.
google/gemini-3-pro-preview	not_run	-	No clean high-signal coverage in imported set.
openai/gpt-5-codex	not_run	-	No clean high-signal coverage in imported set.
openai/gpt-5.1-codex	partial	v0gaq4ouwft4aqh822b1yv3u	Only one high-signal slice is complete, or the latest run failed or stalled.
anthropic/claude-opus-4.1	not_run	-	No clean high-signal coverage in imported set.
openai/gpt-5-mini	not_run	-	No clean high-signal coverage in imported set.
xiaomi/mimo-v2.5-pro	partial	lokgz5mffd4bbaq3ybbrnygc	Synthetic rerun finished with OpenRouter 402 insufficient-credits errors, so there is no clean synthetic 21x5 slice to rank.