Huzzle Labs

The underwriting AI benchmark

The underwriting track of InsureBench measures how well language models do the core job of an underwriter: reading the submission, weighing the risk, and deciding whether to offer cover and on what terms. Every case is grounded in real application materials and scored pass@1 against the outcome recorded for that case.

The task
Risk & terms: accept, decline, or refer, with limits and exclusions.
Inputs
Submissions: applications, schedules, and supporting files.
Scoring
pass@1: checked against the recorded decision.

What the underwriting track tests

Underwriting is a judgement made under uncertainty, but it isn't a guess. An underwriter takes a submission, identifies the exposures that matter, checks them against appetite and guidelines, and produces a decision a file can be built on: accept, decline, or refer — and, if accepted, the limits, deductibles, exclusions, and pricing inputs that go with it. The underwriting track asks a model to do exactly that, from the same materials, and then checks the answer.
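As a rough illustration of the shape of such a decision, the sketch below models an outcome as one of three decisions, with terms attached only to an accept. The class and field names (`Decision`, `Terms`, `UnderwritingOutcome`) are hypothetical and are not InsureBench's actual schema; the rule that a decline or referral carries no terms is likewise an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Decision(Enum):
    ACCEPT = "accept"
    DECLINE = "decline"
    REFER = "refer"


@dataclass
class Terms:
    """Terms that accompany an accept (hypothetical field names)."""
    limit: float
    deductible: float
    exclusions: list[str] = field(default_factory=list)
    pricing_inputs: dict[str, float] = field(default_factory=dict)


@dataclass
class UnderwritingOutcome:
    decision: Decision
    terms: Optional[Terms] = None

    def __post_init__(self) -> None:
        # Assumption for this sketch: terms exist if and only if the
        # decision is an accept.
        if self.decision is Decision.ACCEPT and self.terms is None:
            raise ValueError("an accept must carry terms")
        if self.decision is not Decision.ACCEPT and self.terms is not None:
            raise ValueError("only an accept carries terms")


# Example: an accept with a limit, a deductible, and one exclusion.
accepted = UnderwritingOutcome(
    decision=Decision.ACCEPT,
    terms=Terms(limit=1_000_000.0, deductible=10_000.0, exclusions=["flood"]),
)
referred = UnderwritingOutcome(decision=Decision.REFER)
```

Keeping the decision and its terms in one object reflects the point that an accept is only complete once its terms are fixed.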

Because each case resolves to a recorded outcome, the benchmark rewards models that reach the right call for the right reasons, not models that produce a confident-sounding memo. This is what separates an underwriting AI benchmark from a general reasoning test: the answer is the decision, and the decision is checkable.

Why underwriting is hard for AI

  • The signal is buried. The fact that changes the decision is often one line in a long application or an attachment, not the headline figures.
  • Guidelines are specific. Appetite, referral triggers, and capacity limits are precise; a near-miss is still a miss.
  • Documents disagree. Submissions contain inconsistencies a model has to notice and resolve rather than average over.
  • Terms compound. A defensible accept can still be wrong if the limits, exclusions, or pricing inputs attached to it aren't.

Example case types

Underwriting cases span lines and decisions, for example:

  • Decide whether a commercial property risk falls within appetite given the construction, occupancy, and loss history in the submission.
  • Set the correct exclusions and conditions for a liability risk with a flagged prior claim.
  • Identify the referral trigger that takes a case out of an underwriter's authority.
  • Determine the pricing inputs that follow from the exposure data once the controlling guideline is applied.

How it's scored

Models run pass@1 — one attempt, no retries — and the response is compared to the recorded underwriting outcome. A decision is scored against the decision actually made; numeric terms are scored against the recorded values within a defined tolerance. The full rules are in the methodology. Underwriting is one of three families in the wider InsureBench insurance AI benchmark, alongside claims and actuarial work.
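The scoring rule described above can be sketched roughly as follows. This is a minimal illustration, not the InsureBench implementation: the `Outcome` type, the `score_case` and `pass_at_1` functions, and the 5% relative tolerance are all assumptions made for the example. The actual tolerance and matching rules are defined in the methodology.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Outcome:
    """A decision plus any numeric terms (hypothetical structure)."""
    decision: str  # "accept", "decline", or "refer"
    numeric_terms: dict[str, float] = field(default_factory=dict)


def score_case(response: Outcome, recorded: Outcome, rel_tol: float = 0.05) -> bool:
    """Return True if a single response matches the recorded outcome.

    rel_tol is an assumed placeholder; the real tolerance is set by
    the benchmark's methodology.
    """
    # The decision must match the decision actually made.
    if response.decision != recorded.decision:
        return False
    # Every recorded numeric term must be present and within tolerance.
    for name, expected in recorded.numeric_terms.items():
        actual = response.numeric_terms.get(name)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            return False
    return True


def pass_at_1(results: list[bool]) -> float:
    """Fraction of cases passed, with one attempt per case and no retries."""
    return sum(results) / len(results) if results else 0.0


recorded = Outcome("accept", {"limit": 1_000_000.0})
results = [
    score_case(Outcome("accept", {"limit": 1_040_000.0}), recorded),  # within 5%
    score_case(Outcome("accept", {"limit": 1_200_000.0}), recorded),  # limit too far off
    score_case(Outcome("decline"), recorded),                         # wrong decision
]
print(pass_at_1(results))
```

In this sketch a correct decision with an out-of-tolerance term still fails the case, which mirrors the point that a defensible accept can be wrong if the terms attached to it are wrong.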

Leaderboard opening 2026. Built by Huzzle Labs.