InfraResolution Bench

Architecture

Full-stack benchmark design

InfraResolution Bench sits at the intersection of commercial operations and AI evaluation. It tests whether models can do the actual job, not just answer questions about it.

Pipeline

1. Event Sources

CRM updates, billing events, usage telemetry, incident reports, customer emails, and policy documents.

2. Case Assembly

Evidence from multiple sources is assembled into a single structured case packet containing all records and context.

3. Agent Interface

The model receives the case packet (or a tool-calling interface over it) and must produce a structured JSON resolution.

4. Deterministic Eval

Exact-match, consistency, and rubric checks score the output. There is no LLM judge, so every run is fully reproducible.

5. Scoring & Ranking

Composite scores (70% / 20% / 10% weighting) are computed and models are ranked on the hosted leaderboard.
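The scoring step can be sketched as follows. How the 70% / 20% / 10% weights map onto the three deterministic check families (exact match, consistency, rubric), and all names below, are illustrative assumptions rather than the benchmark's documented definitions:

```typescript
// Sketch of composite scoring and ranking. The assignment of the
// 70%/20%/10% weights to exact-match, consistency, and rubric checks
// is an assumption for illustration.
interface SampleScores {
  exactMatch: number;  // 0..1
  consistency: number; // 0..1
  rubric: number;      // 0..1
}

function compositeScore(s: SampleScores): number {
  return 0.7 * s.exactMatch + 0.2 * s.consistency + 0.1 * s.rubric;
}

// Average per-sample composites per model, then sort descending
// to get leaderboard order.
function rankModels(runs: Map<string, SampleScores[]>): [string, number][] {
  return Array.from(runs.entries())
    .map(([model, samples]): [string, number] => {
      const mean =
        samples.reduce((acc, s) => acc + compositeScore(s), 0) /
        samples.length;
      return [model, mean];
    })
    .sort((a, b) => b[1] - a[1]);
}
```

Because every component is a deterministic check, the same outputs always produce the same composite, which is what makes reruns reproducible.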


Agent Interface

What goes in, what comes out

Input: Case Packet

case_id: string

title: string

crm_record: CRMRecord

billing_record: BillingRecord

usage_record: UsageRecord

incident_record: IncidentRecord

customer_note: string | CustomerNote

policy_snippet: string | PolicySnippet
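The input fields above can be written as a TypeScript interface. In this sketch the nested record types are opaque placeholders standing in for the benchmark's full schemas:

```typescript
// Placeholder record types: the benchmark defines the real field sets.
interface CRMRecord { [key: string]: unknown }
interface BillingRecord { [key: string]: unknown }
interface UsageRecord { [key: string]: unknown }
interface IncidentRecord { [key: string]: unknown }
interface CustomerNote { [key: string]: unknown }
interface PolicySnippet { [key: string]: unknown }

// The case packet exactly as listed above.
interface CasePacket {
  case_id: string;
  title: string;
  crm_record: CRMRecord;
  billing_record: BillingRecord;
  usage_record: UsageRecord;
  incident_record: IncidentRecord;
  customer_note: string | CustomerNote;
  policy_snippet: string | PolicySnippet;
}
```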

Output: Resolution Packet

issue_type: IssueTypeEnum

root_cause: string

customer_impact: ImpactEnum

contractual_applicability: ContractEnum

owner: OwnerEnum

next_action: ActionEnum

confidence: number

human_review_flag: boolean

customer_facing_note: string

internal_ops_note: string
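The output schema can likewise be sketched as a TypeScript interface, here with a hypothetical parser for a model's raw JSON. The *Enum types are aliased to string because the actual enum members are defined by the benchmark, not by this sketch:

```typescript
// The allowed members of these enums are defined by the benchmark.
type IssueTypeEnum = string;
type ImpactEnum = string;
type ContractEnum = string;
type OwnerEnum = string;
type ActionEnum = string;

// The resolution packet exactly as listed above.
interface ResolutionPacket {
  issue_type: IssueTypeEnum;
  root_cause: string;
  customer_impact: ImpactEnum;
  contractual_applicability: ContractEnum;
  owner: OwnerEnum;
  next_action: ActionEnum;
  confidence: number;        // assumed to lie in [0, 1]
  human_review_flag: boolean;
  customer_facing_note: string;
  internal_ops_note: string;
}

// Hypothetical helper: parse a model's raw output and spot-check its shape.
function parseResolution(raw: string): ResolutionPacket | null {
  let obj: unknown;
  try { obj = JSON.parse(raw); } catch { return null; }
  const r = obj as ResolutionPacket;
  const ok =
    typeof r.issue_type === "string" &&
    typeof r.root_cause === "string" &&
    typeof r.confidence === "number" &&
    typeof r.human_review_flag === "boolean";
  return ok ? r : null;
}
```

Requiring strict JSON output is what lets the downstream eval stay deterministic: a packet that fails to parse can be scored as a failure without any judgment call.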


Prime Lab Integration

Hosted evaluation infrastructure

Verifiers Environment

The benchmark is published as a Prime Lab environment with deterministic verifiers. The eval runs server-side with full sandboxing, and models interact through the standard environment API.

Hosted Evaluations

Prime orchestrates model sweeps across multiple providers. Each eval run produces per-sample scores that get imported into the leaderboard.

Tools Mode

Beyond full-packet prompting, models can use tool calls to query individual evidence records. This tests whether tool use improves or hurts resolution accuracy.


Prompt Modes

Packet vs Tools

Packet Mode

The entire case packet is provided in the prompt as a single JSON document, and the model must parse all the evidence and produce a resolution in one shot. This mode is simpler, but it requires the model to handle long context well.
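A minimal sketch of packet-mode prompting follows; the instruction wording and the `buildPacketPrompt` helper are assumptions for illustration, not the benchmark's actual prompt:

```typescript
// Packet mode sketch: the whole case packet is serialized into one prompt.
// The surrounding instruction text is an illustrative assumption.
function buildPacketPrompt(packet: object): string {
  return [
    "You are resolving a commercial-operations case.",
    "Evidence (full case packet as JSON):",
    JSON.stringify(packet, null, 2),
    "Respond with a single JSON resolution packet.",
  ].join("\n\n");
}
```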

Tools Mode

The model receives only the case title and can call tools to fetch specific records (CRM, billing, usage, etc.). This mode tests whether agentic tool use helps the model focus on relevant evidence rather than being overwhelmed by the full packet.
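Tools mode can be sketched as a small dispatch over the evidence records. The tool names `get_record` and `list_records` are hypothetical, chosen only to illustrate the shape of the interface:

```typescript
// Tools mode sketch: the model sees only the case title and fetches
// evidence on demand. Tool names here are illustrative assumptions.
type Evidence = Record<string, object | string>;

function makeTools(evidence: Evidence) {
  return {
    // Fetch one evidence record by name, e.g. get_record("billing_record").
    get_record(name: string): object | string | undefined {
      return evidence[name];
    },
    // Enumerate which records exist for this case.
    list_records(): string[] {
      return Object.keys(evidence);
    },
  };
}
```

Exposing records one at a time keeps each tool response short, which is exactly the trade-off this mode probes: focused retrieval versus one long context.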