Leaderboard
Model rankings on InfraResolution Bench
Composite scores weight the gold set at 60% and the synthetic set at 40%. Hosted on Prime Lab with deterministic scoring and no LLM judge.
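The 60/40 weighting can be checked against the Gold and Synthetic columns of the leaderboard. The following is a minimal sketch, assuming both inputs are percentages and that rounding to one decimal happens only for display. The names `GOLD_WEIGHT`, `SYNTHETIC_WEIGHT`, and `composite_score` are illustrative and not taken from the benchmark's code, and rows with a missing slice (status `partial`) are not handled here.

```python
# Weights stated on the leaderboard page: 60% gold, 40% synthetic.
GOLD_WEIGHT = 0.6
SYNTHETIC_WEIGHT = 0.4


def composite_score(gold: float, synthetic: float) -> float:
    """Return the weighted composite of two percentage scores.

    Assumption: both slices are present; partial rows are out of scope.
    """
    return GOLD_WEIGHT * gold + SYNTHETIC_WEIGHT * synthetic


if __name__ == "__main__":
    # claude-sonnet-4.6: Gold 83.1%, Synthetic 92.2%
    print(f"{composite_score(83.1, 92.2):.1f}%")  # 86.7%
    # grok-4.20: Gold 0.0%, Synthetic 85.3%
    print(f"{composite_score(0.0, 85.3):.1f}%")  # 34.1%
```

The second call shows why a zero-score gold slice pulls an otherwise strong synthetic result down to the bottom of the ranking.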
Top models
Score vs Cost to Run
Chart legend (providers): anthropic, arcee-ai, deepseek, google, minimax, moonshotai, nvidia, openai, prime-intellect, qwen, stepfun, x-ai, xiaomi, z-ai
All models
| # | Model | Status | Overall | Cost to Run | Gold | Synthetic | Samples |
|---|---|---|---|---|---|---|---|
| 1 | claude-sonnet-4.6 | stable | 86.7% | $5.32 | 83.1% | 92.2% | 255 |
| 2 | gemini-3.1-pro-preview | stable | 86.4% | $1.74 | 87.7% | 84.4% | 255 |
| 3 | claude-opus-4.6 | stable | 85.0% | $26.71 | 84.9% | 85.2% | 255 |
| 4 | gpt-5.3-codex | stable | 84.5% | $1.73 | 84.9% | 83.9% | 255 |
| 5 | gpt-5.1 | stable | 84.0% | $1.75 | 85.5% | 81.9% | 255 |
| 6 | claude-opus-4.5 | stable | 84.0% | $25.93 | 83.5% | 84.7% | 255 |
| 7 | claude-sonnet-4.5 | stable | 83.8% | $4.25 | 83.8% | 83.8% | 255 |
| 8 | gpt-5.2-codex | stable | 83.5% | $1.29 | 83.0% | 84.3% | 255 |
| 9 | gpt-5 | stable | 83.4% | $13.45 | 83.4% | 83.5% | 255 |
| 10 | glm-5-turbo | stable | 83.4% | $0.32 | 83.9% | 82.7% | 255 |
| 11 | glm-5 | stable | 82.7% | $0.79 | 82.7% | 82.8% | 255 |
| 12 | qwen3.5-397b-a17b | stable | 82.4% | $1.37 | 81.2% | 84.3% | 255 |
| 13 | grok-4 | stable | 82.3% | $7.20 | 80.4% | 85.1% | 255 |
| 14 | gpt-5.4-mini | stable | 82.0% | $0.26 | 80.3% | 84.6% | 255 |
| 15 | qwen3.5-27b | stable | 81.5% | $0.29 | 80.1% | 83.5% | 255 |
| 16 | qwen3.6-plus:free | stable | 81.3% | Free | 80.0% | 83.2% | 255 |
| 17 | gpt-5.4 | stable | 80.9% | $1.28 | 78.5% | 84.6% | 255 |
| 18 | kimi-k2.5 | stable | 80.2% | $0.64 | 78.5% | 82.8% | 255 |
| 19 | gemini-3-flash-preview | stable | 79.6% | $0.10 | 79.7% | 79.5% | 255 |
| 20 | minimax-m2.7 | stable | 78.5% | $0.66 | 77.5% | 80.2% | 255 |
| 21 | gpt-5.2 | stable | 78.5% | $1.30 | 79.0% | 77.6% | 255 |
| 22 | mimo-v2-pro | stable | 78.3% | $0.79 | 77.3% | 79.8% | 255 |
| 23 | INTELLECT-3.1 | stable | 78.1% | $0.57 | 76.5% | 80.7% | 255 |
| 24 | gpt-5.4-nano | stable | 77.2% | $0.13 | 75.8% | 79.4% | 255 |
| 25 | minimax-m2 | stable | 77.2% | $0.66 | 77.1% | 77.3% | 255 |
| 26 | minimax-m2.5 | stable | 76.0% | $0.66 | 76.4% | 75.4% | 255 |
| 27 | step-3.5-flash | stable | 75.6% | $0.13 | 76.4% | 74.5% | 255 |
| 28 | gemma-4-26b-a4b-it | stable | 74.1% | $0.16 | 70.4% | 79.7% | 255 |
| 29 | trinity-large-thinking | stable | 74.1% | $0.16 | 73.4% | 75.2% | 255 |
| 30 | qwen3.5-35b-a3b | stable | 73.7% | $0.29 | 77.4% | 68.1% | 255 |
| 31 | intellect-3 | partial | 72.4% | $7.25 | - | 72.4% | 255 |
| 32 | minimax-m2.1 | stable | 70.5% | $0.66 | 66.8% | 76.0% | 255 |
| 33 | mimo-v2-omni | stable | 70.1% | $0.79 | 73.3% | 65.2% | 255 |
| 34 | mimo-v2-flash | stable | 64.9% | $0.13 | 64.1% | 66.0% | 255 |
| 35 | deepseek-v3.2 | stable | 62.5% | $0.36 | 58.7% | 68.0% | 255 |
| 36 | gemma-4-31b-it | stable | 61.4% | $0.16 | 62.0% | 60.6% | 255 |
| 37 | nemotron-3-super-120b-a12b | stable | 47.7% | $0.23 | 46.0% | 50.3% | 255 |
| 38 | grok-4.20 | unstable | 34.1% | $3.60 | 0.0% | 85.3% | 150 |
| 39 | qwen3-max-thinking | unstable | 33.1% | $3.30 | 0.0% | 82.9% | 255 |
| - | claude-opus-4.1 | not_run | - | - | - | - | - |
| - | glm-5.1 | blocked | - | - | - | - | - |
| - | gemini-3-pro-preview | not_run | - | - | - | - | - |
| - | gpt-5-codex | not_run | - | - | - | - | - |
| - | gpt-5-mini | not_run | - | - | - | - | - |
| - | gpt-5.1-codex | partial | - | - | - | 90.1% | 105 |
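One way to read score against cost is to look for models that no cheaper model beats. The sketch below computes that cost/score Pareto frontier for a hand-picked subset of six rows from the table; it is an illustration, not how the leaderboard's chart is produced. `Row` and `pareto_frontier` are hypothetical names, and the result would differ over the full table.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    model: str
    overall: float   # Overall score, percent
    cost_usd: float  # Cost to Run, US dollars


# Subset of leaderboard rows, copied from the table.
ROWS = [
    Row("claude-sonnet-4.6", 86.7, 5.32),
    Row("gemini-3.1-pro-preview", 86.4, 1.74),
    Row("claude-opus-4.6", 85.0, 26.71),
    Row("gpt-5.3-codex", 84.5, 1.73),
    Row("glm-5-turbo", 83.4, 0.32),
    Row("gemini-3-flash-preview", 79.6, 0.10),
]


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Keep rows whose score exceeds that of every cheaper-or-equal row."""
    frontier: List[Row] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; ties on cost put the higher score first.
    for row in sorted(rows, key=lambda r: (r.cost_usd, -r.overall)):
        if row.overall > best_score:
            frontier.append(row)
            best_score = row.overall
    return frontier


if __name__ == "__main__":
    for row in pareto_frontier(ROWS):
        print(f"{row.model}: {row.overall}% at ${row.cost_usd:.2f}")
```

Within this subset, claude-opus-4.6 is the only row dropped: it costs more than claude-sonnet-4.6 and scores lower.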
Sweep Status
Pending and blocked runs
| Model | Status | Eval ID | Reason |
|---|---|---|---|
| x-ai/grok-4.20 | unstable | b37mxyhieooooj87bbh4ud9n / zz0oog1a8rgpi6ilbfzeyqfs | Completed pair includes a zero-score slice that appears to be a provider or tool-loop failure. A synthetic 21x5 rerun is in progress. |
| qwen/qwen3-max-thinking | unstable | zasjjusmsuh50eerjhp9fgvw / awym3p6nzb2sptgm57rgklfz | Completed pair includes a zero-score slice that appears to be a provider or tool-loop failure. |
| prime-intellect/intellect-3 | partial | lmb2bjp3gbrempnw475fa7fu | Gold eval is still running; only the synthetic score is shown. |
| PrimeIntellect/INTELLECT-3.1 | stable | intellect31-adapter-gold-tools-15x10-2026-04-06 / intellect31-adapter-synthetic-tools-21x5-2026-04-06 | - |
| google/gemini-3-pro-preview | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5-codex | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5.1-codex | partial | v0gaq4ouwft4aqh822b1yv3u | Only one high-signal slice is complete, or the latest run failed or stalled. |
| anthropic/claude-opus-4.1 | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5-mini | not_run | - | No clean high-signal coverage in imported set. |
| glm-5.1 | blocked | - | The general API denied access to glm-5.1 (403), and the coding endpoint was unsuitable for benchmark load. |
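The INTELLECT-3.1 eval IDs end in `15x10` and `21x5`. Assuming these suffixes encode two counts whose product is the slice's sample count (an interpretation, not something the page states), the totals line up with the Samples column: 150 + 105 = 255. A small sketch of that arithmetic follows; `samples_from_eval_id` is a hypothetical helper.

```python
import re


def samples_from_eval_id(eval_id: str) -> int:
    """Multiply the two counts in an 'AxB' token found in an eval ID.

    Assumption: the 'AxB' token means A times B samples for that slice.
    """
    match = re.search(r"(\d+)x(\d+)", eval_id)
    if match is None:
        raise ValueError(f"no AxB token in eval ID: {eval_id!r}")
    first, second = (int(group) for group in match.groups())
    return first * second


if __name__ == "__main__":
    gold = samples_from_eval_id("intellect31-adapter-gold-tools-15x10-2026-04-06")
    synthetic = samples_from_eval_id("intellect31-adapter-synthetic-tools-21x5-2026-04-06")
    print(gold, synthetic, gold + synthetic)  # 150 105 255
```

Under the same reading, the 105 samples listed for gpt-5.1-codex would correspond to a completed synthetic slice only.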