Architecture
Full-stack benchmark design
InfraResolution Bench sits at the intersection of commercial operations and AI evaluation. It tests whether models can do the actual job, not just answer questions about it.
Pipeline
Event Sources
CRM updates, billing events, usage telemetry, incident reports, customer emails, and policy documents.
Case Assembly
Evidence from multiple sources is assembled into a single structured case packet with all records and context.
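As a rough illustration of this step, the sketch below gathers per-source evidence into one packet. The field names come from the Case Packet listing in this document; the helper `assemble_case_packet`, the `evidence` dictionary shape, and the missing-evidence check are assumptions made for the example, not the benchmark's actual assembly code.

```python
from typing import Any

# Evidence fields taken from the Case Packet listing in this document.
EVIDENCE_FIELDS = (
    "crm_record",
    "billing_record",
    "usage_record",
    "incident_record",
    "customer_note",
    "policy_snippet",
)


def assemble_case_packet(
    case_id: str, title: str, evidence: dict[str, Any]
) -> dict[str, Any]:
    """Hypothetical helper: merge per-source evidence into one packet."""
    missing = [name for name in EVIDENCE_FIELDS if name not in evidence]
    if missing:
        # Assumption: a case is only usable when every source is present.
        raise ValueError(f"case {case_id} is missing evidence: {missing}")
    packet: dict[str, Any] = {"case_id": case_id, "title": title}
    for name in EVIDENCE_FIELDS:
        packet[name] = evidence[name]
    return packet
```

Failing loudly on missing evidence is one possible design choice; a real assembler might instead allow partial packets.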
Agent Interface
The model receives the full case packet, or a tool-calling interface for fetching records, and must produce a structured JSON resolution.
Deterministic Eval
Exact-match, consistency, and rubric checks score the output. No LLM judge is involved, so scoring is fully reproducible.
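A minimal sketch of what a deterministic exact-match check could look like. It assumes the enum-typed fields of the Resolution Packet are the ones compared and that each counts equally; the source does not state either detail, and `exact_match_score` is a hypothetical name.

```python
# Assumption: the enum-typed output fields are the ones scored by exact match.
ENUM_FIELDS = (
    "issue_type",
    "customer_impact",
    "contractual_applicability",
    "owner",
    "next_action",
)


def exact_match_score(predicted: dict, gold: dict) -> float:
    """Fraction of enum fields where the prediction equals the gold label.

    A field missing from the prediction counts as a miss. The same inputs
    always give the same score, since no model is involved in judging.
    """
    hits = sum(
        1
        for field in ENUM_FIELDS
        if field in predicted and predicted[field] == gold[field]
    )
    return hits / len(ENUM_FIELDS)
```

Because the check is a plain equality comparison, rerunning it on the same outputs reproduces the same score.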
Scoring & Ranking
Composite scores (70%/20%/10% weighting) are computed and models are ranked on the hosted leaderboard.
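The weighting can be sketched as a simple weighted sum. This example assumes the 70%/20%/10% weights apply to the exact-match, consistency, and rubric checks in that order, and that each component score lies in [0, 1]; the source gives the weights but not the mapping, so treat both as assumptions. The function names are hypothetical.

```python
# Assumption: weights map to the three checks in the order they are listed.
WEIGHTS = {"exact_match": 0.7, "consistency": 0.2, "rubric": 0.1}


def composite_score(components: dict[str, float]) -> float:
    """Weighted sum of the three component scores (each assumed in [0, 1])."""
    for name in WEIGHTS:
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score out of range: {value}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def rank_models(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Order models from highest to lowest composite score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under these assumptions, a model with a perfect exact-match score but zero on the other two checks would land at roughly 0.7.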
Agent Interface
What goes in, what comes out
Input: Case Packet
case_id: string
title: string
crm_record: CRMRecord
billing_record: BillingRecord
usage_record: UsageRecord
incident_record: IncidentRecord
customer_note: string | CustomerNote
policy_snippet: string | PolicySnippet
Output: Resolution Packet
issue_type: IssueTypeEnum
root_cause: string
customer_impact: ImpactEnum
contractual_applicability: ContractEnum
owner: OwnerEnum
next_action: ActionEnum
confidence: number
human_review_flag: boolean
customer_facing_note: string
internal_ops_note: string
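To make the output contract concrete, here is a small sketch that checks a model's raw reply against the field list above. The field names and primitive types follow the listing; enum fields are treated as plain strings because their allowed values are not given here, and `validate_resolution` is a hypothetical helper rather than part of the benchmark.

```python
import json

# Field names and types follow the Resolution Packet listing.
# Enum fields are checked only as strings, since their members are not listed.
EXPECTED_TYPES = {
    "issue_type": str,
    "root_cause": str,
    "customer_impact": str,
    "contractual_applicability": str,
    "owner": str,
    "next_action": str,
    "confidence": (int, float),
    "human_review_flag": bool,
    "customer_facing_note": str,
    "internal_ops_note": str,
}


def validate_resolution(raw: str) -> list[str]:
    """Return a list of problems found in a raw JSON reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["resolution must be a JSON object"]
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and expected is not bool:
            problems.append(f"wrong type for {field}")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {field}")
    return problems
```

An empty list means the reply has every listed field with a plausible type; it says nothing about whether the resolution is correct.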
Prime Lab Integration
Hosted evaluation infrastructure
Verifiers Environment
Published as a Prime Lab environment with deterministic verifiers. The eval runs server-side with full sandboxing. Models interact through the standard environment API.
Hosted Evaluations
Prime orchestrates model sweeps across multiple providers. Each eval run produces per-sample scores that are imported into the leaderboard.
Tools Mode
Beyond full-packet prompting, models can use tool calls to query individual evidence records. This tests whether tool use improves or hurts resolution accuracy.
Prompt Modes
Packet vs Tools
Packet Mode
The entire case packet is provided in the prompt as a single JSON document. The model must parse all evidence and produce a resolution in one shot. This mode is simpler, but it requires the model to handle long context well.
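A packet-mode prompt might be built along these lines. The instruction wording and the `build_packet_prompt` helper are illustrative assumptions; only the idea of embedding the whole packet as one JSON document comes from the description above.

```python
import json


def build_packet_prompt(packet: dict) -> str:
    """Hypothetical helper: embed the whole case packet in a single prompt."""
    # Assumed instruction text; the benchmark's real prompt is not shown here.
    instructions = (
        "Read the case packet below and reply with a single JSON "
        "resolution object."
    )
    # sort_keys keeps the serialized packet stable from run to run.
    body = json.dumps(packet, indent=2, sort_keys=True)
    return instructions + "\n\n" + body
```

Serializing with sorted keys is a small choice that keeps prompts byte-identical across runs, which fits the reproducibility goal.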
Tools Mode
The model receives the case title and can call tools to fetch specific records (CRM, billing, usage, etc.). This tests whether agentic tool use helps the model focus on relevant evidence rather than being overwhelmed by the full packet.
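One way to picture tools mode is a set of per-record fetchers over the same packet, plus a log of which records the model asked for. The tool names (`get_crm_record` and so on), the helpers, and the decision to expose all six evidence fields as tools are assumptions for this sketch; the actual tool interface is not specified here.

```python
from typing import Any, Callable

# Evidence fields from the Case Packet listing; exposing each one as a tool
# is an assumption for this example.
RECORD_FIELDS = (
    "crm_record",
    "billing_record",
    "usage_record",
    "incident_record",
    "customer_note",
    "policy_snippet",
)


def make_tools(packet: dict[str, Any]) -> dict[str, Callable[[], Any]]:
    """Build one zero-argument fetcher per evidence field (hypothetical names)."""
    # The default argument binds each field at definition time.
    return {
        f"get_{field}": (lambda field=field: packet[field])
        for field in RECORD_FIELDS
    }


def run_tool_call(
    tools: dict[str, Callable[[], Any]], name: str, call_log: list[str]
) -> Any:
    """Execute one tool call requested by the model and record it."""
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    call_log.append(name)
    return tools[name]()
```

Keeping a call log makes it possible to compare which evidence a model actually consulted against its final resolution.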