InfraResolution Bench
A Prime Lab environment and benchmark for evaluating AI models & agents on revenue operations (RevOps) tasks in AI infrastructure.
155
Total Cases
86.7%
Top Score (claude-sonnet-4.6)
39
Models Ranked
7
Scenario Families
The Problem
Revenue Operations in AI infrastructure is messy
When a customer experiences an issue that could affect usage, billing, or service expectations, the relevant context is usually spread across multiple systems: CRM/account data, pricing and billing configuration, usage telemetry, incident notes, customer communications, and internal policy docs.
Let's call this collected evidence a case packet: a normalized bundle of all relevant context from these systems, assembled into a single input for an AI agent to reason over.
This benchmark tests whether an agent can take a case packet with messy, partially conflicting evidence and produce a correct, structured commercial resolution.
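A case packet can be pictured as a single normalized bundle. The sketch below is illustrative only: the field names and values are assumptions for exposition, not the benchmark's actual schema.

```python
# Illustrative case packet (field names are assumptions, not the
# benchmark's real schema): evidence from several systems normalized
# into a single bundle the agent reasons over.
case_packet = {
    "case_id": "case-001",
    "crm": {"account": "Acme AI", "plan": "Committed-100"},
    "billing_config": {"plan": "Committed-150", "rate_per_gpu_hour": 2.10},
    "usage_telemetry": [{"date": "2025-01-03", "gpu_hours": 412}],
    "incident_notes": ["2025-01-03: scheduler outage, ~2h degraded capacity"],
    "customer_emails": ["We were billed for hours the cluster was down."],
    "policy_docs": ["SLA credits exclude scheduled maintenance windows."],
}

# This packet already carries a cross-system discrepancy for the agent
# to catch: the CRM plan and the billing-config plan disagree.
assert case_packet["crm"]["plan"] != case_packet["billing_config"]["plan"]
```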
What It Measures
Given a case packet, the agent must:
Classify the issue
Into a bounded RevOps taxonomy: pricing mismatch, metering discrepancy, incident review, customer-caused failure, policy applicability, or ambiguous case.
Interpret messy evidence
Determine root cause, customer impact, and contractual applicability without inventing billing logic.
Detect cross-system discrepancies
For example, the CRM says Committed-100 while the billing config says Committed-150. The agent must catch the mismatch.
Route and recommend
Assign a bounded owner and next action across RevOps, Finance, and Engineering. Decide whether human review is needed.
Draft consistent communications
A customer-facing note and an internal ops note that don't contradict the structured output. Scored with consistency checks, not vibes.
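Putting the five tasks together, the agent's output can be sketched as one structured object plus two drafted notes. Everything below is a hypothetical illustration: the field names, the `TAXONOMY` constant, and the owner set are assumptions chosen to match the labels mentioned above, not the benchmark's actual output schema.

```python
# Hypothetical sketch of the structured resolution an agent returns
# (field names are assumptions; the benchmark scores 9 structured
# fields plus customer-facing and internal notes).
TAXONOMY = {
    "pricing_mismatch", "metering_discrepancy", "incident_review",
    "customer_caused_failure", "policy_applicability", "ambiguous_case",
}
OWNERS = {"RevOps", "Finance", "Engineering"}

resolution = {
    "classification": "pricing_mismatch",   # must come from the bounded taxonomy
    "root_cause": "billing config drifted from CRM contract terms",
    "owner": "RevOps",                      # bounded owner set
    "next_action": "correct_billing_config",
    "human_review_required": True,
    "customer_note": "We found a configuration error and will correct your invoice.",
    "internal_note": "CRM says Committed-100, billing says Committed-150; fix config.",
}

assert resolution["classification"] in TAXONOMY
assert resolution["owner"] in OWNERS
```

Bounding both the taxonomy and the owner set is what makes deterministic, judge-free scoring possible downstream.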
Where This Fits
The decision layer in a resolution workflow
Tickets, telemetry, billing alerts, customer emails, and policy documents flow in as raw signals; the benchmark evaluates the decision layer that turns them into a structured commercial resolution.
Scoring
Three-layer deterministic scoring, no LLM judge
Exact Match
Field-by-field comparison across 9 structured resolution fields. All-or-nothing per field.
Consistency
10+ keyword checks ensuring drafted notes don't contradict structured outputs. Catches plausible but internally contradictory results.
Rubric
Required-content checks for issue mention, owner, action, next step, and account context. Intentionally exact-heavy so note formatting can't flatten real model differences.
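The three layers can be sketched in a few lines. This is a minimal illustration of the approach, not the real scorer: the field names, the example keyword check, and the toy prediction are all assumptions.

```python
# Minimal sketch of deterministic, judge-free scoring (assumed field
# names and checks; the real scorer's rules are more extensive).

def exact_match_score(pred: dict, gold: dict, fields: list[str]) -> float:
    """Layer 1: all-or-nothing per field, averaged across fields."""
    return sum(pred.get(f) == gold.get(f) for f in fields) / len(fields)

def consistency_check(pred: dict) -> bool:
    """Layer 2: drafted notes must not contradict the structured output.
    Example rule: a customer-caused failure must not promise an SLA credit."""
    note = pred.get("customer_note", "").lower()
    if pred.get("classification") == "customer_caused_failure":
        return "sla credit" not in note
    return True

pred = {"classification": "pricing_mismatch", "owner": "RevOps",
        "customer_note": "We will correct the billing configuration."}
gold = {"classification": "pricing_mismatch", "owner": "Finance"}

print(exact_match_score(pred, gold, ["classification", "owner"]))  # 0.5
print(consistency_check(pred))                                     # True
```

Because every check is a plain comparison or keyword rule, scores are reproducible run-to-run with no LLM judge in the loop.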
Dataset
Gold cases + synthetic generation
Hand-authored adversarial and edge scenarios: CRM/billing mismatches, clean SLA breaches, maintenance exclusions, metering discrepancies, customer-caused failures, and ambiguous mixed evidence.
Synthetic cases span 7 generator families, each built from programmatic ground truth plus bounded noise injectors for wording variation. Ground truth is always deterministic, never LLM-generated.
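One generator family might look like the sketch below: the ground-truth label is fixed programmatically first, and a seeded noise injector varies only the surface wording. All names and phrasings here are assumptions for illustration, not the benchmark's actual generators.

```python
import random

# Hedged sketch of one generator family (hypothetical names and noise
# rules): truth is computed programmatically, then bounded noise
# injectors vary the wording while leaving the label untouched.
def generate_pricing_mismatch_case(seed: int) -> tuple[dict, dict]:
    rng = random.Random(seed)  # seeded -> fully deterministic output
    plans = (100, 150, 200)
    crm_plan = rng.choice(plans)
    billing_plan = rng.choice([p for p in plans if p != crm_plan])
    truth = {"classification": "pricing_mismatch", "owner": "RevOps"}
    # Bounded wording variation: same facts, different phrasing.
    evidence = rng.choice([
        f"CRM shows Committed-{crm_plan}; billing is set to Committed-{billing_plan}.",
        f"Account record lists Committed-{crm_plan} but invoices bill at Committed-{billing_plan}.",
    ])
    packet = {
        "crm_plan": f"Committed-{crm_plan}",
        "billing_plan": f"Committed-{billing_plan}",
        "evidence_text": evidence,
    }
    return packet, truth

packet, truth = generate_pricing_mismatch_case(seed=7)
```

Seeding the generator means the same seed always yields the same packet and label, which keeps ground truth deterministic by construction.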
Leaderboard
→39+ models ranked on hosted Prime evaluations. Current leader: claude-sonnet-4.6 at 86.7%.
Case Explorer
→Browse the 15 gold cases with structured evidence packets, ground truth labels, and scored model outputs.
Taxonomy
→The bounded classification framework and deterministic scoring methodology behind every evaluation.