InfraResolution Bench

A Prime Lab environment and benchmark for evaluating AI models & agents on revenue operations (RevOps) tasks in AI infrastructure.

155 total cases · 86.7% top score (claude-sonnet-4.6) · 39 models ranked · 7 scenario families


The Problem

Revenue Operations in AI infrastructure is messy

When a customer experiences an issue that could affect usage, billing, or service expectations, the relevant context is usually spread across multiple systems: CRM/account data, pricing and billing configuration, usage telemetry, incident notes, customer communications, and internal policy docs.

Let's call this collected evidence a case packet: a normalized bundle of all relevant context from these systems, assembled into a single input for an AI agent to reason over.

This benchmark tests whether an agent can take a case packet with messy, partially conflicting evidence and produce a correct, structured commercial resolution.
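A case packet can be pictured as a small container type. The sketch below is a hypothetical Python shape, not the benchmark's actual schema; every field name here is illustrative.

```python
from dataclasses import dataclass

# Hypothetical case-packet shape. Field names are illustrative only;
# the benchmark's real schema may differ.
@dataclass
class CasePacket:
    case_id: str
    crm_record: dict       # account tier, committed plan, contacts
    billing_config: dict   # configured plan, rates, invoices
    usage_telemetry: list  # metered usage events
    incident_notes: list   # internal incident/maintenance records
    communications: list   # customer emails and tickets
    policy_docs: list      # SLA and policy excerpts

packet = CasePacket(
    case_id="case-001",
    crm_record={"plan": "Committed-100"},
    billing_config={"plan": "Committed-150"},  # deliberately conflicting
    usage_telemetry=[],
    incident_notes=[],
    communications=[],
    policy_docs=[],
)
```

Note that the two plan fields already disagree here; that kind of cross-system conflict is exactly what the agent is expected to surface.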


What It Measures

Given a case packet, the agent must:

Classify the issue

Into a bounded RevOps taxonomy: pricing mismatch, metering discrepancy, incident review, customer-caused failure, policy applicability, or ambiguous case.
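Because the taxonomy is bounded, it can be encoded as a closed set of labels. This is an illustrative encoding; the benchmark's actual label strings may differ.

```python
from enum import Enum

# Illustrative encoding of the bounded RevOps taxonomy named above.
# The actual label strings used by the benchmark may differ.
class IssueType(Enum):
    PRICING_MISMATCH = "pricing_mismatch"
    METERING_DISCREPANCY = "metering_discrepancy"
    INCIDENT_REVIEW = "incident_review"
    CUSTOMER_CAUSED_FAILURE = "customer_caused_failure"
    POLICY_APPLICABILITY = "policy_applicability"
    AMBIGUOUS = "ambiguous"
```

A closed enum makes classification exactly matchable: the agent's answer either is one of these six values or it is wrong.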

Interpret messy evidence

Determine root cause, customer impact, and contractual applicability without inventing billing logic.

Detect cross-system discrepancies

For example, the CRM record says Committed-100 but the billing config says Committed-150. The agent must catch the mismatch.
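The kind of mismatch check involved can be sketched as a field-by-field comparison over records from two systems. This is a hypothetical helper, not the benchmark's implementation; the field names are illustrative.

```python
def find_mismatches(crm: dict, billing: dict) -> list[str]:
    """Compare fields present in both CRM and billing records.

    Hypothetical sketch: real case packets would need per-field
    normalization before comparing."""
    discrepancies = []
    for key in sorted(crm.keys() & billing.keys()):  # shared fields only
        if crm[key] != billing[key]:
            discrepancies.append(
                f"{key}: CRM says {crm[key]!r}, billing says {billing[key]!r}"
            )
    return discrepancies

issues = find_mismatches({"plan": "Committed-100"},
                         {"plan": "Committed-150"})
```

The point of the benchmark, of course, is that the agent must notice such conflicts from messy evidence, not from pre-aligned dictionaries.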

Route and recommend

Assign a bounded owner and next action across RevOps, Finance, and Engineering. Decide whether human review is needed.

Draft consistent communications

A customer-facing note and an internal ops note that don't contradict the structured output. Scored with consistency checks, not vibes.
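A deterministic consistency check of this kind can be as simple as requiring the drafted note to mention the structured fields it is supposed to reflect. The sketch below is a hypothetical keyword check, not the benchmark's actual check list; field names are illustrative.

```python
def note_consistent(note: str, resolution: dict) -> bool:
    """Hypothetical keyword check: the drafted note must mention the
    structured owner and next action. Mirrors the idea of deterministic
    consistency scoring, not the benchmark's exact checks."""
    text = note.lower()
    required = (resolution["owner"], resolution["next_action"])
    return all(str(v).lower().replace("_", " ") in text for v in required)

resolution = {"owner": "Finance", "next_action": "issue_credit"}
ok = note_consistent(
    "Routing to Finance to issue credit for the metering overage.",
    resolution,
)
```

A note that contradicts or omits the structured output ("No action needed.") fails the same check, which is how plausible-but-inconsistent drafts get caught.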


Where This Fits

The decision layer in a resolution workflow

Event Sources → Case Assembly → AI Agent → Execution → Eval Layer

At the event-source stage, tickets, telemetry, billing alerts, customer emails, and policy documents flow in as raw signals.

Scoring

Three-layer deterministic scoring, no LLM judge

70% weight

Exact Match

Field-by-field comparison across 9 structured resolution fields. All-or-nothing per field.

20% weight

Consistency

10+ keyword checks ensuring drafted notes don't contradict structured outputs. Catches plausible but internally contradictory results.

10% weight

Rubric

Required-content checks for issue mention, owner, action, next step, and account context. Intentionally exact-heavy so note formatting can't flatten real model differences.
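The three layers combine into one weighted score. A minimal sketch, assuming each layer produces a score in [0, 1] (e.g. exact match as the fraction of the 9 fields matched exactly); the layer names here are illustrative.

```python
# 70/20/10 weighting from the scoring section. Layer scores are
# hypothetical values in [0, 1]; layer names are illustrative.
WEIGHTS = {"exact_match": 0.70, "consistency": 0.20, "rubric": 0.10}

def total_score(layer_scores: dict) -> float:
    """Weighted sum of the three deterministic scoring layers."""
    return sum(WEIGHTS[name] * layer_scores[name] for name in WEIGHTS)

score = total_score({
    "exact_match": 8 / 9,  # 8 of 9 structured fields exact (all-or-nothing per field)
    "consistency": 1.0,    # notes agree with structured output
    "rubric": 0.5,         # half the required-content checks pass
})
```

Because exact match carries 70% of the weight, a single wrong structured field costs far more than an imperfect note, which is the stated design intent.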


Dataset

Gold cases + synthetic generation

15 gold cases

Hand-authored adversarial and edge scenarios: CRM/billing mismatches, clean SLA breaches, maintenance exclusions, metering discrepancies, customer-caused failures, and ambiguous mixed evidence.

140 synthetic cases

Across 7 generator families, each built from programmatic truth plus bounded noise injectors for wording variation. Ground truth is always deterministic, never LLM-generated.
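The generator pattern described above, programmatic truth plus bounded noise, can be sketched as follows. This is a hypothetical single-family generator, not the benchmark's code; every name and value in it is illustrative.

```python
import random

def generate_pricing_mismatch_case(seed: int) -> dict:
    """Hypothetical sketch of one generator family.

    The ground-truth label is fixed programmatically (never produced by
    an LLM); a seeded noise injector varies only the surface wording and
    the specific, always-mismatched plan values."""
    rng = random.Random(seed)
    committed = rng.choice([50, 100, 250])
    configured = committed + rng.choice([25, 50])  # guaranteed mismatch
    wording = rng.choice(["shows", "is set to", "reads"])  # bounded noise
    evidence = (f"CRM {wording} Committed-{committed}; "
                f"billing config {wording} Committed-{configured}.")
    return {
        "evidence": evidence,
        "truth": {"issue_type": "pricing_mismatch"},  # deterministic label
    }

case = generate_pricing_mismatch_case(seed=7)
```

Seeding the noise injector keeps generation reproducible: the same seed always yields the same case, while the truth label never depends on the noise at all.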