InfraResolution Bench

Taxonomy

The 6-layer classification framework

Every resolution packet is classified across six orthogonal layers. The agent must get each one right. Partial credit comes from consistency and rubric checks, but the primary signal is exact match against ground truth.

Layer 1

Issue Type

The primary classification of what went wrong: pricing mismatch, metering discrepancy, outage, maintenance exclusion, customer-caused failure, degraded performance, or ambiguous case.

- pricing_config_mismatch
- metering_discrepancy
- clean_covered_outage
- maintenance_exclusion
- customer_caused_failure
- degraded_performance_with_commercial_sensitivity
- ambiguous_case
Layer 2

Root Cause

What specifically caused the issue: a misconfigured pricing tier, a telemetry gap, a maintenance window overlap, a customer misconfiguration, etc.

Layer 3

Customer Impact

The financial, operational, or contractual impact on the customer: overbilling, underbilling, service degradation, missed SLA, or no material impact.

- overbilled
- underbilled
- service_degraded
- sla_breach
- no_material_impact
Layer 4

Contractual Applicability

Whether a credit, refund, SLA adjustment, or no action is contractually warranted based on the terms and evidence.

- credit_warranted
- refund_warranted
- sla_adjustment
- no_action_warranted
- needs_legal_review
Layer 5

Owner

Which team should own the resolution: finance, RevOps, engineering, shared ownership, or escalation to legal.

- finance
- revops
- engineering
- shared_revops_finance
- legal_escalation
Layer 6

Next Action

The concrete next step: issue credit, adjust configuration, escalate, schedule review, or close with no action.

- issue_credit
- adjust_pricing_config
- escalate_to_engineering
- schedule_review
- close_no_action
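The five enumerated layers above can be captured as a simple validation helper. This is a minimal sketch, not the benchmark's actual API: the `TAXONOMY` dict and `validate` function are illustrative names, and Layer 2 (root cause) is omitted because the document describes it as free-form rather than a fixed value set.

```python
# Illustrative schema for the five enumerated taxonomy layers.
# Field and value names follow the lists above; everything else is a sketch.

TAXONOMY = {
    "issue_type": {
        "pricing_config_mismatch", "metering_discrepancy", "clean_covered_outage",
        "maintenance_exclusion", "customer_caused_failure",
        "degraded_performance_with_commercial_sensitivity", "ambiguous_case",
    },
    "impact": {"overbilled", "underbilled", "service_degraded",
               "sla_breach", "no_material_impact"},
    "contractual": {"credit_warranted", "refund_warranted", "sla_adjustment",
                    "no_action_warranted", "needs_legal_review"},
    "owner": {"finance", "revops", "engineering",
              "shared_revops_finance", "legal_escalation"},
    "action": {"issue_credit", "adjust_pricing_config", "escalate_to_engineering",
               "schedule_review", "close_no_action"},
}

def validate(output: dict) -> list:
    """Return (field, problem) pairs; an empty list means structurally valid."""
    errors = []
    for field, allowed in TAXONOMY.items():
        value = output.get(field)
        if value is None:
            errors.append((field, "missing"))
        elif value not in allowed:
            errors.append((field, f"unknown value {value!r}"))
    return errors
```

Validation like this only checks that each field carries a known label; whether the labels are the *right* ones for a case is what the exact-match scoring below measures.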

Scoring

Deterministic composite scoring

70% weight

Exact Match

Each taxonomy field is compared against ground truth. All-or-nothing per field: the agent gets credit only for exact matches.

20% weight

Consistency

Cross-field logical checks: does the owner match the issue type? Does the action match the contractual finding? Catches plausible but internally contradictory outputs.
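Consistency checks of this kind can be expressed as predicates over the output dict. The specific rules and names below are assumptions for the sketch, not the benchmark's actual rule set:

```python
# Illustrative cross-field consistency rules (invented for this sketch).
CONSISTENCY_RULES = [
    ("credit_warranted implies a credit-issuing or review action",
     lambda o: o["contractual"] != "credit_warranted"
               or o["action"] in {"issue_credit", "schedule_review"}),
    ("no_action_warranted implies close_no_action",
     lambda o: o["contractual"] != "no_action_warranted"
               or o["action"] == "close_no_action"),
    ("needs_legal_review implies the legal_escalation owner",
     lambda o: o["contractual"] != "needs_legal_review"
               or o["owner"] == "legal_escalation"),
]

def consistency_score(output: dict) -> float:
    """Fraction of rules the output satisfies, in [0, 1]."""
    passed = sum(1 for _, rule in CONSISTENCY_RULES if rule(output))
    return passed / len(CONSISTENCY_RULES)
```

An output can be plausible field by field yet fail here, e.g. `no_action_warranted` paired with `issue_credit` violates the second rule.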

10% weight

Rubric

Structured rubric checks on the notes: are they bounded, factual, and non-hallucinated? Do they avoid overstepping into billing logic or legal interpretation?
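The 70/20/10 composite reduces to a weighted sum. A minimal sketch, assuming each component score is already normalized to [0, 1]:

```python
# Weights from the scoring description above; function name is illustrative.
WEIGHTS = {"exact_match": 0.70, "consistency": 0.20, "rubric": 0.10}

def composite(exact_match: float, consistency: float, rubric: float) -> float:
    scores = {"exact_match": exact_match, "consistency": consistency, "rubric": rubric}
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# e.g. 4/5 exact fields, all consistency checks pass, half the rubric checks:
# composite(0.8, 1.0, 0.5) -> 0.7*0.8 + 0.2*1.0 + 0.1*0.5 = 0.81
```

Because exact match dominates at 70%, a single wrong taxonomy field costs more than failing every rubric check.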


Example

How a case flows through the eval

Messy Input

A customer reports being overcharged. The CRM shows a recent tier upgrade, billing shows the old rate, usage telemetry shows consumption above the old tier limit, and the contract has a 30-day pricing lock clause.

What the Agent Should Notice

The pricing config was updated in the CRM but not propagated to billing. The 30-day lock clause means the customer should have been billed at the old rate for the remainder of the lock period. The overcharge is real but partial: only the delta above the locked rate.
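To make "partial overcharge" concrete, here is the delta arithmetic with invented numbers (the rates and usage below are hypothetical, and prices are kept in integer cents to avoid float rounding):

```python
# Hypothetical figures for illustration only.
old_rate_cents, new_rate_cents = 10, 15  # per-unit price: locked tier vs. upgraded tier
usage_units = 10_000                     # consumption during the 30-day lock window

billed_cents = usage_units * new_rate_cents   # what billing actually charged
owed_cents = usage_units * old_rate_cents     # what the lock clause allows
credit_due_cents = billed_cents - owed_cents  # the delta above the locked rate
# credit_due_cents == 50_000, i.e. $500 -- not the full invoice
```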

Correct Output

issue_type: pricing_config_mismatch
impact: overbilled
contractual: credit_warranted
owner: shared_revops_finance
action: issue_credit

Scoring

5/5 exact matches (70% weight) + all consistency checks pass (20% weight) + notes are bounded and factual (10% weight) = 100% composite.
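The composite for this case can be traced end to end. A minimal sketch, assuming the consistency and rubric components each score 1.0 as described above:

```python
# Ground truth for the example case, taken from the correct output above.
ground_truth = {
    "issue_type": "pricing_config_mismatch",
    "impact": "overbilled",
    "contractual": "credit_warranted",
    "owner": "shared_revops_finance",
    "action": "issue_credit",
}
agent_output = dict(ground_truth)  # a perfect run for this case

# All-or-nothing per field: fraction of fields that match exactly.
exact = sum(agent_output[f] == ground_truth[f] for f in ground_truth) / len(ground_truth)

# 70/20/10 weights; consistency and rubric assumed to pass in full here.
score = 0.70 * exact + 0.20 * 1.0 + 0.10 * 1.0
# exact == 1.0 and the weights sum to 1, giving the 100% composite
```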