InfraResolution Bench

Leaderboard

Model rankings on InfraResolution Bench

Composite scores weight gold results at 60% and synthetic results at 40%. Evaluations are hosted on Prime Lab and scored deterministically, with no LLM judge.
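As a rough illustration of the 60/40 weighting, the sketch below recomputes an overall score from a gold score and a synthetic score. The function name `composite_score` and its signature are hypothetical, not part of the benchmark's tooling. It also assumes the published percentages are combined linearly and rounded to one decimal place. Because the published inputs are already rounded, a recomputed value can differ from the listed overall score by 0.1.

```python
# Hypothetical helper: a minimal sketch of the 60% gold + 40% synthetic
# weighting described above. Names and rounding are assumptions.
GOLD_WEIGHT = 0.6
SYNTHETIC_WEIGHT = 0.4


def composite_score(gold: float, synthetic: float) -> float:
    """Return the weighted composite of two percentage scores."""
    return GOLD_WEIGHT * gold + SYNTHETIC_WEIGHT * synthetic


if __name__ == "__main__":
    # Gold and synthetic values taken from the leaderboard table.
    print(round(composite_score(83.1, 92.2), 1))  # claude-sonnet-4.6 -> 86.7
    print(round(composite_score(0.0, 85.3), 1))   # grok-4.20 -> 34.1
```

The second call shows why a zero-score gold slice pulls an otherwise strong model to the bottom of the ranking: only the 40% synthetic share contributes.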

Top models

claude-sonnet-4.6: 86.7%
gemini-3.1-pro-preview: 86.4%
claude-opus-4.6: 85.0%
gpt-5.3-codex: 84.5%
gpt-5.1: 84.0%
claude-opus-4.5: 84.0%
claude-sonnet-4.5: 83.8%
gpt-5.2-codex: 83.5%
gpt-5: 83.4%
glm-5-turbo: 83.4%

Score vs Cost to Run

[Chart: Score (y-axis, 30% to 90%) plotted against Cost to Run (x-axis, USD, log scale, $0.1 to $20), one point per model, colored by provider: anthropic, arcee-ai, deepseek, google, minimax, moonshotai, nvidia, openai, prime-intellect, qwen, stepfun, x-ai, xiaomi, z-ai.]

All models

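| # | Model | Status | Overall | Cost to Run | Gold | Synthetic | Samples |
|---|-------|--------|---------|-------------|------|-----------|---------|
| 1 | claude-sonnet-4.6 | stable | 86.7% | $5.32 | 83.1% | 92.2% | 255 |
| 2 | gemini-3.1-pro-preview | stable | 86.4% | $1.74 | 87.7% | 84.4% | 255 |
| 3 | claude-opus-4.6 | stable | 85.0% | $26.71 | 84.9% | 85.2% | 255 |
| 4 | gpt-5.3-codex | stable | 84.5% | $1.73 | 84.9% | 83.9% | 255 |
| 5 | gpt-5.1 | stable | 84.0% | $1.75 | 85.5% | 81.9% | 255 |
| 6 | claude-opus-4.5 | stable | 84.0% | $25.93 | 83.5% | 84.7% | 255 |
| 7 | claude-sonnet-4.5 | stable | 83.8% | $4.25 | 83.8% | 83.8% | 255 |
| 8 | gpt-5.2-codex | stable | 83.5% | $1.29 | 83.0% | 84.3% | 255 |
| 9 | gpt-5 | stable | 83.4% | $13.45 | 83.4% | 83.5% | 255 |
| 10 | glm-5-turbo | stable | 83.4% | $0.32 | 83.9% | 82.7% | 255 |
| 11 | glm-5 | stable | 82.7% | $0.79 | 82.7% | 82.8% | 255 |
| 12 | qwen3.5-397b-a17b | stable | 82.4% | $1.37 | 81.2% | 84.3% | 255 |
| 13 | grok-4 | stable | 82.3% | $7.20 | 80.4% | 85.1% | 255 |
| 14 | gpt-5.4-mini | stable | 82.0% | $0.26 | 80.3% | 84.6% | 255 |
| 15 | qwen3.5-27b | stable | 81.5% | $0.29 | 80.1% | 83.5% | 255 |
| 16 | qwen3.6-plus:free | stable | 81.3% | Free | 80.0% | 83.2% | 255 |
| 17 | gpt-5.4 | stable | 80.9% | $1.28 | 78.5% | 84.6% | 255 |
| 18 | kimi-k2.5 | stable | 80.2% | $0.64 | 78.5% | 82.8% | 255 |
| 19 | gemini-3-flash-preview | stable | 79.6% | $0.10 | 79.7% | 79.5% | 255 |
| 20 | minimax-m2.7 | stable | 78.5% | $0.66 | 77.5% | 80.2% | 255 |
| 21 | gpt-5.2 | stable | 78.5% | $1.30 | 79.0% | 77.6% | 255 |
| 22 | mimo-v2-pro | stable | 78.3% | $0.79 | 77.3% | 79.8% | 255 |
| 23 | INTELLECT-3.1 | stable | 78.1% | $0.57 | 76.5% | 80.7% | 255 |
| 24 | gpt-5.4-nano | stable | 77.2% | $0.13 | 75.8% | 79.4% | 255 |
| 25 | minimax-m2 | stable | 77.2% | $0.66 | 77.1% | 77.3% | 255 |
| 26 | minimax-m2.5 | stable | 76.0% | $0.66 | 76.4% | 75.4% | 255 |
| 27 | step-3.5-flash | stable | 75.6% | $0.13 | 76.4% | 74.5% | 255 |
| 28 | gemma-4-26b-a4b-it | stable | 74.1% | $0.16 | 70.4% | 79.7% | 255 |
| 29 | trinity-large-thinking | stable | 74.1% | $0.16 | 73.4% | 75.2% | 255 |
| 30 | qwen3.5-35b-a3b | stable | 73.7% | $0.29 | 77.4% | 68.1% | 255 |
| 31 | intellect-3 | partial | 72.4% | $7.25 | - | 72.4% | 255 |
| 32 | minimax-m2.1 | stable | 70.5% | $0.66 | 66.8% | 76.0% | 255 |
| 33 | mimo-v2-omni | stable | 70.1% | $0.79 | 73.3% | 65.2% | 255 |
| 34 | mimo-v2-flash | stable | 64.9% | $0.13 | 64.1% | 66.0% | 255 |
| 35 | deepseek-v3.2 | stable | 62.5% | $0.36 | 58.7% | 68.0% | 255 |
| 36 | gemma-4-31b-it | stable | 61.4% | $0.16 | 62.0% | 60.6% | 255 |
| 37 | nemotron-3-super-120b-a12b | stable | 47.7% | $0.23 | 46.0% | 50.3% | 255 |
| 38 | grok-4.20 | unstable | 34.1% | $3.60 | 0.0% | 85.3% | 150 |
| 39 | qwen3-max-thinking | unstable | 33.1% | $3.30 | 0.0% | 82.9% | 255 |
| - | claude-opus-4.1 | not_run | - | - | - | - | - |
| - | glm-5.1 | blocked | - | - | - | - | - |
| - | gemini-3-pro-preview | not_run | - | - | - | - | - |
| - | gpt-5-codex | not_run | - | - | - | - | - |
| - | gpt-5-mini | not_run | - | - | - | - | - |
| - | gpt-5.1-codex | partial | - | - | - | 90.1% | 105 |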

Sweep Status

Pending and blocked runs

| Model | Status | Eval ID | Reason |
|-------|--------|---------|--------|
| x-ai/grok-4.20 | unstable | b37mxyhieooooj87bbh4ud9n / zz0oog1a8rgpi6ilbfzeyqfs | Completed pair includes a zero-score slice that appears to be provider or tool-loop failure. Synthetic 21x5 rerun is in progress. |
| qwen/qwen3-max-thinking | unstable | zasjjusmsuh50eerjhp9fgvw / awym3p6nzb2sptgm57rgklfz | Completed pair includes a zero-score slice that appears to be provider or tool-loop failure. |
| prime-intellect/intellect-3 | partial | lmb2bjp3gbrempnw475fa7fu | Gold eval still running, showing synthetic only |
| PrimeIntellect/INTELLECT-3.1 | stable | intellect31-adapter-gold-tools-15x10-2026-04-06 / intellect31-adapter-synthetic-tools-21x5-2026-04-06 | - |
| google/gemini-3-pro-preview | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5-codex | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5.1-codex | partial | v0gaq4ouwft4aqh822b1yv3u | Only one high-signal slice is complete or latest run is failed/stalled. |
| anthropic/claude-opus-4.1 | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5-mini | not_run | - | No clean high-signal coverage in imported set. |
| glm-5.1 | blocked | - | General API denied access to glm-5.1 (403), coding endpoint was unsuitable for benchmark load. |