The underwriting AI benchmark
The underwriting track of InsureBench measures how well language models do the core job of an underwriter: read the submission, weigh the risk, and decide whether to offer cover and on what terms. Every case is grounded in real application materials and scored pass@1 against the outcome the case actually resolved to.
What the underwriting track tests
Underwriting is a judgement made under uncertainty, but it isn't a guess. An underwriter takes a submission, identifies the exposures that matter, checks them against appetite and guidelines, and produces a decision a file can be built on: accept, decline, or refer — and, if accepted, the limits, deductibles, exclusions, and pricing inputs that go with it. The underwriting track asks a model to do exactly that, from the same materials, and then checks the answer.
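To make the shape of that answer concrete, here is a minimal sketch of how such a decision could be represented. The class and field names (`Decision`, `UnderwritingOutcome`, `pricing_inputs`, and so on) are illustrative assumptions, not InsureBench's actual schema. So is the rule that terms may only accompany an accept.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet


class Decision(Enum):
    """The three calls named in the text."""
    ACCEPT = "accept"
    DECLINE = "decline"
    REFER = "refer"


@dataclass(frozen=True)
class UnderwritingOutcome:
    """Hypothetical container for a decision and the terms attached to it."""
    decision: Decision
    limits: Dict[str, float] = field(default_factory=dict)
    deductibles: Dict[str, float] = field(default_factory=dict)
    exclusions: FrozenSet[str] = frozenset()
    pricing_inputs: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption for this sketch: terms are only meaningful on an accept.
        has_terms = bool(
            self.limits or self.deductibles or self.exclusions or self.pricing_inputs
        )
        if self.decision is not Decision.ACCEPT and has_terms:
            raise ValueError("terms can only be attached to an accepted risk")


# Usage: an accepted property risk with one limit, one deductible, one exclusion.
outcome = UnderwritingOutcome(
    decision=Decision.ACCEPT,
    limits={"property": 5_000_000.0},
    deductibles={"property": 25_000.0},
    exclusions=frozenset({"flood"}),
)
```

Keeping the decision and its terms in one record reflects the point above: an accept is only complete once the terms that go with it are stated.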
Because each case resolves to a recorded outcome, the benchmark rewards models that reach the right call for the right reasons, not models that produce a confident-sounding memo. This is what separates an underwriting AI benchmark from a general reasoning test: the answer is the decision, and the decision is checkable.
Why underwriting is hard for AI
- The signal is buried. The fact that changes the decision is often one line in a long application or an attachment, not the headline figures.
- Guidelines are specific. Appetite, referral triggers, and capacity limits are precise; a near-miss is still a miss.
- Documents disagree. Submissions contain inconsistencies a model has to notice and resolve rather than average over.
- Terms compound. A defensible accept can still be wrong if the limits, exclusions, or pricing inputs attached to it aren't.
Example case types
Underwriting cases span lines of business and decision types, for example:
- Decide whether a commercial property risk falls within appetite given the construction, occupancy, and loss history in the submission.
- Set the correct exclusions and conditions for a liability risk with a flagged prior claim.
- Identify the referral trigger that takes a case out of an underwriter's authority.
- Determine the pricing inputs that follow from the exposure data once the controlling guideline is applied.
How it's scored
Models run pass@1 — one attempt, no retries — and the response is compared to the recorded underwriting outcome. A decision is scored against the decision actually made; numeric terms are scored against the recorded values within a defined tolerance. The full rules are in the methodology. Underwriting is one of three families in the wider InsureBench insurance AI benchmark, alongside claims and actuarial work.
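As a rough illustration of the scoring rule described above, the sketch below compares a single response to a recorded outcome. It is not the benchmark's implementation. The function name `score_case`, the dictionary layout, the 1% relative tolerance, and the choice to combine the decision and the numeric terms into one pass/fail result are all assumptions made for this example; the actual rules are in the methodology.

```python
from typing import Any, Dict


def score_case(
    predicted: Dict[str, Any],
    recorded: Dict[str, Any],
    rel_tol: float = 0.01,  # assumed tolerance; the real value is defined in the methodology
) -> bool:
    """Return True if a single attempt matches the recorded outcome.

    The decision must match exactly. Every numeric term in the recorded
    outcome must be present in the prediction and lie within
    rel_tol * |recorded value| of the recorded value.
    """
    if predicted.get("decision") != recorded.get("decision"):
        return False

    predicted_terms = predicted.get("terms", {})
    for name, expected in recorded.get("terms", {}).items():
        actual = predicted_terms.get(name)
        if actual is None:
            return False  # a missing term counts as a miss
        if abs(actual - expected) > rel_tol * abs(expected):
            return False  # a near-miss outside tolerance is still a miss
    return True


# Usage with made-up values: one attempt per case, no retries.
recorded = {"decision": "accept", "terms": {"limit": 1_000_000.0, "deductible": 10_000.0}}
within_tolerance = {"decision": "accept", "terms": {"limit": 1_005_000.0, "deductible": 10_000.0}}
wrong_limit = {"decision": "accept", "terms": {"limit": 1_050_000.0, "deductible": 10_000.0}}
wrong_decision = {"decision": "refer", "terms": {}}

results = [score_case(p, recorded) for p in (within_tolerance, wrong_limit, wrong_decision)]
pass_rate = sum(results) / len(results)
```

With a single attempt per case, the pass@1 figure is simply the fraction of cases for which the function returns True.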