InfraResolution Bench

A Prime Lab environment and benchmark for evaluating AI models & agents on revenue operations (RevOps) tasks in AI infrastructure.

155 total cases · 86.7% top score (claude-sonnet-4.6) · 39 models ranked · 7 scenario families


The Problem

Revenue Operations in AI infrastructure is messy

When a customer experiences an issue that could affect usage, billing, or service expectations, the relevant context is usually spread across multiple systems: CRM/account data, pricing and billing configuration, usage telemetry, incident notes, customer communications, and internal policy docs.

Let's call this collected evidence a case packet: a normalized bundle of all relevant context from these systems, assembled into a single input for an AI agent to reason over.

This benchmark tests whether an agent can take a case packet with messy, partially conflicting evidence and produce a correct, structured commercial resolution.
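A case packet can be pictured as a small container type. The sketch below is a hypothetical Python shape, not the benchmark's actual schema; every field name here is illustrative.

```python
from dataclasses import dataclass

# Hypothetical case-packet shape. Field names are illustrative only;
# the benchmark's real schema may differ.
@dataclass
class CasePacket:
    case_id: str
    crm_record: dict       # account tier, committed plan, contacts
    billing_config: dict   # configured plan, rates, invoices
    usage_telemetry: list  # metered usage events
    incident_notes: list   # internal incident/maintenance records
    communications: list   # customer emails and tickets
    policy_docs: list      # SLA and policy excerpts

packet = CasePacket(
    case_id="case-001",
    crm_record={"plan": "Committed-100"},
    billing_config={"plan": "Committed-150"},  # deliberately conflicting
    usage_telemetry=[],
    incident_notes=[],
    communications=[],
    policy_docs=[],
)
```

Note that the two plan fields already disagree here; that kind of cross-system conflict is exactly what the agent is expected to surface.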


What It Measures

Given a case packet, the agent must:

Classify the issue

Into a bounded RevOps taxonomy: pricing mismatch, metering discrepancy, incident review, customer-caused failure, policy applicability, or ambiguous case.
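Because the taxonomy is bounded, it can be encoded as a closed set of labels. This is an illustrative encoding; the benchmark's actual label strings may differ.

```python
from enum import Enum

# Illustrative encoding of the bounded RevOps taxonomy named above.
# The actual label strings used by the benchmark may differ.
class IssueType(Enum):
    PRICING_MISMATCH = "pricing_mismatch"
    METERING_DISCREPANCY = "metering_discrepancy"
    INCIDENT_REVIEW = "incident_review"
    CUSTOMER_CAUSED_FAILURE = "customer_caused_failure"
    POLICY_APPLICABILITY = "policy_applicability"
    AMBIGUOUS = "ambiguous"
```

A closed enum makes classification exactly matchable: the agent's answer either is one of these six values or it is wrong.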

Interpret messy evidence

Determine root cause, customer impact, and contractual applicability without inventing billing logic.

Detect cross-system discrepancies

For example, the CRM record says Committed-100 but the billing config says Committed-150. The agent must catch the mismatch.
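The kind of mismatch check involved can be sketched as a field-by-field comparison over records from two systems. This is a hypothetical helper, not the benchmark's implementation; the field names are illustrative.

```python
def find_mismatches(crm: dict, billing: dict) -> list[str]:
    """Compare fields present in both CRM and billing records.

    Hypothetical sketch: real case packets would need per-field
    normalization before comparing."""
    discrepancies = []
    for key in sorted(crm.keys() & billing.keys()):  # shared fields only
        if crm[key] != billing[key]:
            discrepancies.append(
                f"{key}: CRM says {crm[key]!r}, billing says {billing[key]!r}"
            )
    return discrepancies

issues = find_mismatches({"plan": "Committed-100"},
                         {"plan": "Committed-150"})
```

The point of the benchmark, of course, is that the agent must notice such conflicts from messy evidence, not from pre-aligned dictionaries.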

Route and recommend

Assign a bounded owner and next action across RevOps, Finance, and Engineering. Decide whether human review is needed.

Draft consistent communications

A customer-facing note and an internal ops note that don't contradict the structured output. Scored with consistency checks, not vibes.
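A deterministic consistency check of this kind can be as simple as requiring the drafted note to mention the structured fields it is supposed to reflect. The sketch below is a hypothetical keyword check, not the benchmark's actual check list; field names are illustrative.

```python
def note_consistent(note: str, resolution: dict) -> bool:
    """Hypothetical keyword check: the drafted note must mention the
    structured owner and next action. Mirrors the idea of deterministic
    consistency scoring, not the benchmark's exact checks."""
    text = note.lower()
    required = (resolution["owner"], resolution["next_action"])
    return all(str(v).lower().replace("_", " ") in text for v in required)

resolution = {"owner": "Finance", "next_action": "issue_credit"}
ok = note_consistent(
    "Routing to Finance to issue credit for the metering overage.",
    resolution,
)
```

A note that contradicts or omits the structured output ("No action needed.") fails the same check, which is how plausible-but-inconsistent drafts get caught.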


Where This Fits

The decision layer in a resolution workflow

Event Sources → Case Assembly → AI Agent → Execution → Eval Layer

At the event-source stage, tickets, telemetry, billing alerts, customer emails, and policy documents flow in as raw signals.

Scoring

Three-layer deterministic scoring, no LLM judge

70% weight

Exact Match

Field-by-field comparison across 9 structured resolution fields. All-or-nothing per field.

20% weight

Consistency

10+ keyword checks ensuring drafted notes don't contradict structured outputs. Catches plausible but internally contradictory results.

10% weight

Rubric

Required-content checks for issue mention, owner, action, next step, and account context. Intentionally exact-heavy so note formatting can't flatten real model differences.
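The three layers combine into one weighted score. A minimal sketch, assuming each layer produces a score in [0, 1] (e.g. exact match as the fraction of the 9 fields matched exactly); the layer names here are illustrative.

```python
# 70/20/10 weighting from the scoring section. Layer scores are
# hypothetical values in [0, 1]; layer names are illustrative.
WEIGHTS = {"exact_match": 0.70, "consistency": 0.20, "rubric": 0.10}

def total_score(layer_scores: dict) -> float:
    """Weighted sum of the three deterministic scoring layers."""
    return sum(WEIGHTS[name] * layer_scores[name] for name in WEIGHTS)

score = total_score({
    "exact_match": 8 / 9,  # 8 of 9 structured fields exact (all-or-nothing per field)
    "consistency": 1.0,    # notes agree with structured output
    "rubric": 0.5,         # half the required-content checks pass
})
```

Because exact match carries 70% of the weight, a single wrong structured field costs far more than an imperfect note, which is the stated design intent.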


Dataset

Gold cases + synthetic generation

15 gold cases

Hand-authored adversarial and edge scenarios: CRM/billing mismatches, clean SLA breaches, maintenance exclusions, metering discrepancies, customer-caused failures, and ambiguous mixed evidence.

140 synthetic cases

Across 7 generator families, each built from programmatic truth plus bounded noise injectors for wording variation. Ground truth is always deterministic, never LLM-generated.
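The generator pattern described above, programmatic truth plus bounded noise, can be sketched as follows. This is a hypothetical single-family generator, not the benchmark's code; every name and value in it is illustrative.

```python
import random

def generate_pricing_mismatch_case(seed: int) -> dict:
    """Hypothetical sketch of one generator family.

    The ground-truth label is fixed programmatically (never produced by
    an LLM); a seeded noise injector varies only the surface wording and
    the specific, always-mismatched plan values."""
    rng = random.Random(seed)
    committed = rng.choice([50, 100, 250])
    configured = committed + rng.choice([25, 50])  # guaranteed mismatch
    wording = rng.choice(["shows", "is set to", "reads"])  # bounded noise
    evidence = (f"CRM {wording} Committed-{committed}; "
                f"billing config {wording} Committed-{configured}.")
    return {
        "evidence": evidence,
        "truth": {"issue_type": "pricing_mismatch"},  # deterministic label
    }

case = generate_pricing_mismatch_case(seed=7)
```

Seeding the noise injector keeps generation reproducible: the same seed always yields the same case, while the truth label never depends on the noise at all.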