Huzzle Labs

How InsureBench scores AI on insurance work

InsureBench is an insurance AI benchmark built on a simple principle: a model should be judged on the same work a practitioner is judged on. Every case is grounded in real documents and resolves to a single answer that can be checked against a recorded outcome — not a rubric, not a vibe.

  • Scoring: pass@1. One attempt per case, no retries.
  • Grounding: Documents. Policies, applications, and claim files.
  • Answer: Verifiable. A decision, determination, or number.
  • Families: Three. Underwriting, claims, actuarial.

Why insurance needs its own AI benchmark

The stakes are large and adoption is real. McKinsey estimates AI could add up to ~$1.1 trillion in annual value to global insurance, with generative AI alone potentially unlocking $50–70 billion in additional revenue [1]. A 2024 Deloitte survey found 76% of insurance organisations had already deployed generative AI in at least one function [2], yet BCG reports that only about 7% of insurers have scaled AI into production, with roughly two-thirds still piloting [3]. The gap between deploying and trusting is exactly what a benchmark should illuminate.

  • ~$1.1T: potential annual value of AI to global insurance (McKinsey)
  • 76%: of insurers have deployed generative AI in a function (Deloitte, 2024)
  • ~7%: have actually scaled AI to production (BCG)
  • 69–88%: LLM hallucination rate on specific legal queries (Stanford HAI)

Most AI benchmarks measure general reasoning, coding, or exam recall. Insurance work is different in kind. It turns on reading long, inconsistent policy documents, locating the clauses that control a question, reconciling supporting files that don't always agree, and arriving at a decision or number that an auditor could later check. A model can be fluent and articulate and still get the coverage call wrong, and in insurance the call is what matters. The risk isn't hypothetical: Stanford HAI found leading models hallucinate on 69–88% of specific legal queries [4], and a 2026 Snorkel AI study of agentic underwriting found frontier models hallucinate domain knowledge even when given tool access, with a ~20% performance drop under pass@k [5].

InsureBench exists to measure that specific competence. It's an AI benchmark for insurance that scores models on underwriting, claims, and actuarial tasks as they're actually performed, rather than on synthetic questions written to be easy to grade.

How a case is built

Each case starts from a real piece of insurance work and is rebuilt into a self-contained task. A case bundles together:

  • The source documents — the policy wording, application materials, schedules, endorsements, or claim file a practitioner would have in front of them.
  • A precise question — accept or decline, covered or not covered, or a number such as a reserve, premium input, or payable amount.
  • A recorded outcome — the answer the work actually resolved to, verified by a practitioner, which the model is scored against.

Cases are written and reviewed with practising underwriters, claims handlers, and actuaries so the documents are realistic and the recorded outcome is defensible. Sensitive material is removed or synthesised so a case can be released without exposing private data, while keeping the structure and difficulty of the original work intact.
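
As a rough illustration, the three parts of a case described above could be held in a small record like the one below. This is a minimal sketch, not InsureBench's actual schema: the names `Case`, `Family`, `case_id`, `documents`, `question`, and `recorded_outcome` are all hypothetical, and the example values are placeholders invented for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Family(Enum):
    """The three task families named in the text."""
    UNDERWRITING = "underwriting"
    CLAIMS = "claims"
    ACTUARIAL = "actuarial"


@dataclass(frozen=True)
class Case:
    """Hypothetical container for one self-contained benchmark case."""
    case_id: str
    family: Family
    documents: dict[str, str]             # assumed layout: document name -> document text
    question: str                         # the precise question put to the model
    recorded_outcome: Union[str, float]   # a decision/determination label, or a number


# Placeholder case; the contents are illustrative only, not real benchmark data.
example = Case(
    case_id="demo-001",
    family=Family.CLAIMS,
    documents={"policy_wording.txt": "...", "claim_file.txt": "..."},
    question="Is the loss covered? Answer 'covered' or 'not covered'.",
    recorded_outcome="not covered",
)
```

Keeping the recorded outcome as either a label or a number mirrors the two kinds of answer the text describes: a decision or determination, or a figure such as a reserve or payable amount.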

How models are scored

Every case resolves to a single verifiable answer. Models run pass@1: one attempt per case, no retries, no best-of-N. The response is compared to the recorded outcome — an accept/decline decision, a coverage determination, or a number within a defined tolerance. Scoring reflects the outcome, not the style, length, or confidence of the writing. A well-argued wrong answer scores the same as a terse one: zero.

Reporting the result this way keeps the leaderboard honest. Because each answer is checkable, scores aren't a matter of judgement, and a model can't earn credit for sounding authoritative. It either reached the recorded outcome or it didn't.
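
The scoring rule above can be sketched in a few lines. This is an assumption-laden illustration rather than the benchmark's own grader: the function names `score_case` and `pass_at_1` are hypothetical, the 1% relative tolerance is a made-up default (the text says only "a defined tolerance"), and label matching is assumed to ignore case and surrounding whitespace.

```python
import math
from typing import Union

Outcome = Union[str, float]


def score_case(response: Outcome, recorded: Outcome, rel_tol: float = 0.01) -> int:
    """Return 1 if the single attempt matches the recorded outcome, else 0.

    rel_tol is an assumed tolerance for numeric answers, not a documented value.
    """
    if isinstance(recorded, (int, float)) and not isinstance(recorded, bool):
        # Numeric outcome: the response must also be a number within tolerance.
        if isinstance(response, bool) or not isinstance(response, (int, float)):
            return 0
        return int(math.isclose(response, recorded, rel_tol=rel_tol))
    # Label outcome (accept/decline, covered/not covered): exact match after
    # normalising case and whitespace; nothing else about the text is scored.
    if not isinstance(response, str):
        return 0
    return int(response.strip().lower() == recorded.strip().lower())


def pass_at_1(scores: list[int]) -> float:
    """Fraction of cases solved on the first and only attempt."""
    return sum(scores) / len(scores) if scores else 0.0


scores = [
    score_case("Not covered ", "not covered"),  # label match
    score_case("covered", "not covered"),       # wrong decision
    score_case(10_050.0, 10_000.0),             # within the assumed 1% tolerance
    score_case(10_200.0, 10_000.0),             # outside the assumed 1% tolerance
]
print(pass_at_1(scores))  # 0.5
```

Because the grader sees only the final answer, the length or confidence of any accompanying explanation has no way to influence the score.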

What the benchmark deliberately rewards

  • Document grounding — finding and applying the controlling clause rather than answering from prior knowledge.
  • Faithfulness to the terms — following the policy as written, including exclusions and conditions, not a reasonable-sounding approximation.
  • Numerical discipline — carrying the right tables, assumptions, and arithmetic through to a defensible figure.
  • Restraint — declining to invent facts the documents don't support.

A GDPval-style benchmark for insurance

InsureBench follows the spirit of GDPval, OpenAI's 2025 benchmark that evaluates models on real, economically valuable occupational tasks across 44 occupations in the top GDP-contributing US sectors, judged by head-to-head comparison with expert deliverables [6]. On GDPval's gold subset, the strongest model reached only a ~47.6% win-or-tie rate against human experts, approaching but not matching professional work [7]. Where GDPval spans many occupations, InsureBench goes deep on one industry: a GDPval for insurance, built around the tasks that underwriters, claims handlers, and actuaries are paid to get right.

Related research

InsureBench builds on a small but growing body of work evaluating language models on insurance tasks. These are prior research efforts we draw on, not competing products:

  • UNDERWRITE (Snorkel AI, 2026) — the most directly related work: an expert-built, multi-turn agentic underwriting benchmark over 13 frontier models. It documents domain-knowledge hallucination despite tool access — the exact failure a verifiable benchmark needs to catch.
  • INS-MMBench (Fudan University, ICCV 2025) — the first hierarchical multimodal insurance benchmark, spanning auto, property, health, and agricultural insurance across 22 fundamental tasks.
  • InsQABench (2025) — a Chinese insurance question-answering benchmark across commonsense knowledge, structured databases, and unstructured documents.
  • INSEva (2025) — a comprehensive Chinese insurance LLM benchmark of 38,704 examples that scores both faithfulness and completeness.

InsureBench's distinction is its GDPval-style, occupational framing for Western insurance practice: document-grounded cases that resolve to a single verifiable outcome and are scored pass@1, across underwriting, claims, and actuarial work.

Sources

  1. McKinsey & Company — The future of AI in the insurance industry. mckinsey.com
  2. Deloitte — Scaling generative AI in insurance (2024). deloitte.com
  3. BCG — Insurance leads AI adoption; now is the time to scale (2025). bcg.com
  4. Stanford HAI — Hallucinating law: legal mistakes with LLMs are pervasive. hai.stanford.edu
  5. Dsouza et al., Snorkel AI — Benchmarking Agents in Insurance Underwriting Environments (arXiv 2602.00456, 2026). arxiv.org
  6. Patwardhan et al., OpenAI — GDPval (arXiv 2510.04374, 2025). arxiv.org
  7. OpenAI — Introducing GDPval. openai.com
  8. Jin et al., Fudan University — INS-MMBench (arXiv 2406.09105, ICCV 2025). arxiv.org
  9. Ding et al. — InsQABench (arXiv 2501.10943, 2025). arxiv.org
  10. INSEva — A comprehensive Chinese insurance LLM benchmark (arXiv 2509.04455, 2025). arxiv.org
Leaderboard opening 2026. Built by Huzzle Labs.