Leaderboard
Model rankings on InfraResolution Bench
The composite (Overall) score weights the gold slice at 60% and the synthetic slice at 40%. Evals are hosted on Prime Lab and scored deterministically, with no LLM judge.
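As a rough illustration of the 60/40 weighting, the sketch below recomputes an Overall value from a Gold and a Synthetic percentage. The function name `composite_score` and the choice to return `None` when a slice is missing are assumptions made for this example, not the leaderboard's actual implementation.

```python
from typing import Optional

# Weights taken from the leaderboard description: 60% gold, 40% synthetic.
GOLD_WEIGHT = 0.6
SYNTHETIC_WEIGHT = 0.4


def composite_score(gold: Optional[float], synthetic: Optional[float]) -> Optional[float]:
    """Weighted composite in percent, rounded to one decimal place.

    Assumption: if either slice is missing, no composite is computed.
    """
    if gold is None or synthetic is None:
        return None
    return round(GOLD_WEIGHT * gold + SYNTHETIC_WEIGHT * synthetic, 1)


# Values copied from the claude-sonnet-4.6 row: Gold 83.1%, Synthetic 92.2%.
print(composite_score(83.1, 92.2))  # 86.7
```

Because the published Gold and Synthetic columns are already rounded, a recomputed value may differ from the displayed Overall in the last digit.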
Top models
Score vs Cost to Run (chart; series by provider: anthropic, arcee-ai, deepseek, google, minimax, moonshotai, nvidia, openai, prime-intellect, qwen, stepfun, x-ai, xiaomi, z-ai)
All models
| # | Model | Status | Overall | Cost to Run | Gold | Synthetic | Samples |
|---|---|---|---|---|---|---|---|
| 1 | claude-sonnet-4.6 | stable | 86.7% | $5.32 | 83.1% | 92.2% | 255 |
| 2 | gemini-3.1-pro-preview | stable | 86.4% | $1.74 | 87.7% | 84.4% | 255 |
| 3 | claude-opus-4.6 | stable | 85.0% | $26.71 | 84.9% | 85.2% | 255 |
| 4 | gpt-5.3-codex | stable | 84.5% | $1.73 | 84.9% | 83.9% | 255 |
| 5 | gpt-5.1 | stable | 84.0% | $1.75 | 85.5% | 81.9% | 255 |
| 6 | claude-opus-4.5 | stable | 84.0% | $25.93 | 83.5% | 84.7% | 255 |
| 7 | claude-sonnet-4.5 | stable | 83.8% | $4.25 | 83.8% | 83.8% | 255 |
| 8 | gpt-5.2-codex | stable | 83.5% | $1.29 | 83.0% | 84.3% | 255 |
| 9 | gpt-5 | stable | 83.4% | $13.45 | 83.4% | 83.5% | 255 |
| 10 | glm-5-turbo | stable | 83.4% | $0.32 | 83.9% | 82.7% | 255 |
| 11 | glm-5 | stable | 82.7% | $0.79 | 82.7% | 82.8% | 255 |
| 12 | qwen3.5-397b-a17b | stable | 82.4% | $1.37 | 81.2% | 84.3% | 255 |
| 13 | grok-4 | stable | 82.3% | $7.20 | 80.4% | 85.1% | 255 |
| 14 | gpt-5.4-mini | stable | 82.0% | $0.26 | 80.3% | 84.6% | 255 |
| 15 | qwen3.5-27b | stable | 81.5% | $0.29 | 80.1% | 83.5% | 255 |
| 16 | qwen3.6-plus:free | stable | 81.3% | Free | 80.0% | 83.2% | 255 |
| 17 | gpt-5.4 | stable | 80.9% | $1.28 | 78.5% | 84.6% | 255 |
| 18 | mimo-v2.5 | stable | 80.8% | - | 80.9% | 80.6% | 255 |
| 19 | glm-5.1 | stable | 80.8% | $3.37 | 80.3% | 81.5% | 255 |
| 20 | kimi-k2.5 | stable | 80.2% | $0.64 | 78.5% | 82.8% | 255 |
| 21 | gemini-3-flash-preview | stable | 79.6% | $0.10 | 79.7% | 79.5% | 255 |
| 22 | minimax-m2.7 | stable | 78.5% | $0.66 | 77.5% | 80.2% | 255 |
| 23 | gpt-5.2 | stable | 78.5% | $1.30 | 79.0% | 77.6% | 255 |
| 24 | mimo-v2-pro | stable | 78.3% | $0.79 | 77.3% | 79.8% | 255 |
| 25 | INTELLECT-3.1 | stable | 78.1% | $0.57 | 76.5% | 80.7% | 255 |
| 26 | gpt-5.4-nano | stable | 77.2% | $0.13 | 75.8% | 79.4% | 255 |
| 27 | minimax-m2 | stable | 77.2% | $0.66 | 77.1% | 77.3% | 255 |
| 28 | minimax-m2.5 | stable | 76.0% | $0.66 | 76.4% | 75.4% | 255 |
| 29 | step-3.5-flash | stable | 75.6% | $0.13 | 76.4% | 74.5% | 255 |
| 30 | gemma-4-26b-a4b-it | stable | 74.1% | $0.16 | 70.4% | 79.7% | 255 |
| 31 | trinity-large-thinking | stable | 74.1% | $0.16 | 73.4% | 75.2% | 255 |
| 32 | qwen3.5-35b-a3b | stable | 73.7% | $0.29 | 77.4% | 68.1% | 255 |
| 33 | intellect-3 | partial | 72.4% | $7.25 | - | 72.4% | 255 |
| 34 | minimax-m2.1 | stable | 70.5% | $0.66 | 66.8% | 76.0% | 255 |
| 35 | mimo-v2-omni | stable | 70.1% | $0.79 | 73.3% | 65.2% | 255 |
| 36 | mimo-v2-flash | stable | 64.9% | $0.13 | 64.1% | 66.0% | 255 |
| 37 | deepseek-v3.2 | stable | 62.5% | $0.36 | 58.7% | 68.0% | 255 |
| 38 | gemma-4-31b-it | stable | 61.4% | $0.16 | 62.0% | 60.6% | 255 |
| 39 | nemotron-3-super-120b-a12b | stable | 47.7% | $0.23 | 46.0% | 50.3% | 255 |
| 40 | grok-4.20 | unstable | 34.1% | $3.60 | 0.0% | 85.3% | 150 |
| 41 | kimi-k2.6 | unstable | 33.6% | - | 34.2% | 32.7% | 255 |
| 42 | qwen3-max-thinking | unstable | 33.1% | $3.30 | 0.0% | 82.9% | 255 |
| 43 | ling-2.6-flash:free | unstable | 0.8% | - | 0.0% | 2.0% | 255 |
| 44 | hy3-preview:free | stable | 0.6% | - | 0.3% | 0.9% | 255 |
| - | claude-opus-4.1 | not_run | - | - | - | - | - |
| - | gemini-3-pro-preview | not_run | - | - | - | - | - |
| - | gpt-5-codex | not_run | - | - | - | - | - |
| - | gpt-5-mini | not_run | - | - | - | - | - |
| - | gpt-5.1-codex | partial | - | - | - | 90.1% | 105 |
| - | mimo-v2.5-pro | partial | - | - | 54.5% | - | 255 |
Sweep Status
Pending and blocked runs
| Model | Status | Eval ID | Reason |
|---|---|---|---|
| PrimeIntellect/INTELLECT-3.1 | stable | intellect31-adapter-gold-tools-15x10-2026-04-06 / intellect31-adapter-synthetic-tools-21x5-2026-04-06 | - |
| prime-intellect/intellect-3 | partial | lmb2bjp3gbrempnw475fa7fu | Gold eval is still running; only the synthetic score is shown. |
| x-ai/grok-4.20 | unstable | b37mxyhieooooj87bbh4ud9n / zz0oog1a8rgpi6ilbfzeyqfs | Completed pair includes a zero-score slice that appears to be a provider or tool-loop failure. Synthetic 21x5 rerun is in progress. |
| qwen/qwen3-max-thinking | unstable | zasjjusmsuh50eerjhp9fgvw / awym3p6nzb2sptgm57rgklfz | Completed pair includes a zero-score slice that appears to be a provider or tool-loop failure. |
| moonshotai/kimi-k2.6 | unstable | uhuzxzryxli6q1jx35lum0ti / toq29oqf3yz4yvmdqb3e1j9l | Published pair is not trustworthy: both slices show widespread APIStatusError failures, and a fresh 3x2 gold smoke test failed 6/6 with OpenRouter 402 insufficient credits. |
| inclusionai/ling-2.6-flash:free | unstable | dd70fxe7zq36e1r3uae3src6 / xh3yn5fd17owymevvkssp4zw | Completed pair includes a zero-score gold slice that appears to be a provider or output-loop failure. |
| google/gemini-3-pro-preview | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5-codex | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5.1-codex | partial | v0gaq4ouwft4aqh822b1yv3u | Only one high-signal slice is complete, or the latest run failed or stalled. |
| anthropic/claude-opus-4.1 | not_run | - | No clean high-signal coverage in imported set. |
| openai/gpt-5-mini | not_run | - | No clean high-signal coverage in imported set. |
| xiaomi/mimo-v2.5-pro | partial | lokgz5mffd4bbaq3ybbrnygc | Synthetic rerun finished with OpenRouter 402 insufficient-credits errors, so there is no clean synthetic 21x5 slice to rank. |
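Several reasons above point to a zero-score slice as a sign of provider or tool-loop failure. A minimal sketch of how such rows could be spotted is shown here; the helper `zero_score_slices` is hypothetical, and the leaderboard does not state how the `unstable` status is actually assigned (kimi-k2.6, for example, is marked unstable for API errors rather than a zero slice).

```python
from typing import Dict, List, Optional, Tuple


def zero_score_slices(gold: Optional[float], synthetic: Optional[float]) -> List[str]:
    """Return the names of slices that scored exactly 0.0; missing slices are ignored."""
    slices = {"gold": gold, "synthetic": synthetic}
    return [name for name, score in slices.items() if score is not None and score == 0.0]


# (Gold %, Synthetic %) pairs copied from the "All models" table.
rows: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "grok-4.20": (0.0, 85.3),
    "qwen3-max-thinking": (0.0, 82.9),
    "kimi-k2.6": (34.2, 32.7),
}

# Keep only the models that have at least one zero-score slice.
flagged = {
    model: zero_score_slices(gold, synthetic)
    for model, (gold, synthetic) in rows.items()
    if zero_score_slices(gold, synthetic)
}
print(flagged)  # {'grok-4.20': ['gold'], 'qwen3-max-thinking': ['gold']}
```

A flag from this check is only a prompt to inspect the run logs; a genuine 0.0% result would look identical to a failed slice.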