The claims & coverage AI benchmark
The claims track of InsureBench measures how well language models do the work a claims handler does: read the policy and the claim file, decide whether the loss is covered, identify the clauses that control the answer, and calculate what's payable. Every case is grounded in real documents and scored pass@1 against the recorded outcome.
What the claims track tests
A coverage decision is a chain of reading. The handler establishes what happened from the claim file, finds the insuring agreement that might respond, then works through the exclusions, conditions, endorsements, and limits that modify it — and only then lands on covered or not covered, and for how much. The claims track asks a model to follow that chain over the actual documents and produce the determination, not a summary of the issues.
This is what makes it a true claims AI benchmark rather than a comprehension quiz: the model has to apply the policy as written, including the clauses that cut against coverage, and the answer it gives is checked against what the file actually resolved to.
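The chain described above can be pictured as a small decision routine. The following is only a minimal sketch under simplifying assumptions: the `Clause` and `Determination` types, the clause identifiers, and the idea of matching a cause of loss against a set of perils are all hypothetical illustrations, not InsureBench's data model, and real cases also involve conditions, endorsements, and limits that this toy omits.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clause:
    clause_id: str          # hypothetical identifier, e.g. "EX-2"
    kind: str               # "insuring_agreement" or "exclusion"
    perils: frozenset       # causes of loss the clause speaks to


@dataclass(frozen=True)
class Determination:
    covered: bool
    controlling_clause: Optional[str]  # clause that decides the answer, if any


def determine(cause: str, clauses: list[Clause]) -> Determination:
    """Toy coverage chain: find a grant of cover, then check exclusions."""
    # Step 1: find an insuring agreement that responds to the cause of loss.
    grant = next(
        (c for c in clauses if c.kind == "insuring_agreement" and cause in c.perils),
        None,
    )
    if grant is None:
        return Determination(covered=False, controlling_clause=None)

    # Step 2: an applicable exclusion takes away what the grant gave.
    for c in clauses:
        if c.kind == "exclusion" and cause in c.perils:
            return Determination(covered=False, controlling_clause=c.clause_id)

    # No exclusion applies, so the insuring agreement controls.
    return Determination(covered=True, controlling_clause=grant.clause_id)


# Hypothetical policy: cover for fire and water damage, with water damage excluded.
policy = [
    Clause("IA-1", "insuring_agreement", frozenset({"fire", "water damage"})),
    Clause("EX-2", "exclusion", frozenset({"water damage"})),
]
print(determine("fire", policy))
print(determine("water damage", policy))
```

The point of the sketch is the ordering: the exclusion is checked only after a grant of cover is found, and it is the exclusion, not the grant, that ends up as the controlling clause when it applies.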
Why claims and coverage are hard for AI
- Exclusions decide cases. The insuring agreement often grants cover that a later exclusion or condition takes away; missing one flips the answer.
- Endorsements override the form. A schedule or endorsement can rewrite the base wording, and the model has to apply the version that controls.
- Facts come from a messy file. The relevant facts are spread across reports, correspondence, and forms that don't always agree with each other.
- The number has to follow. Even a correct coverage call is incomplete without the right deductible, limit, and payable amount applied on top.
Example case types
- Determine whether a property loss is covered given an exclusion that may or may not apply to the described cause.
- Apply a sub-limit and deductible to reach the amount payable on an otherwise covered claim.
- Decide coverage where an endorsement materially changes the base policy wording.
- Identify the single controlling clause that resolves a disputed liability claim.
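For the sub-limit and deductible case type, the arithmetic can be sketched as follows. This is an illustration only: the function name and the figures are made up, and the order of operations (deductible taken off the loss first, then the sub-limit applied as a cap) is an assumption. Actual policies specify their own order, and the wording in the file controls.

```python
from decimal import Decimal


def amount_payable(loss: Decimal, deductible: Decimal, sub_limit: Decimal) -> Decimal:
    """Payable amount on an otherwise covered claim (assumed order of operations)."""
    # Assumption: the deductible comes off the loss first and cannot go below zero.
    after_deductible = max(loss - deductible, Decimal("0"))
    # Assumption: the sub-limit then caps whatever remains.
    return min(after_deductible, sub_limit)


# Hypothetical figures: a 12,000 loss, a 1,000 deductible, a 10,000 sub-limit.
print(amount_payable(Decimal("12000"), Decimal("1000"), Decimal("10000")))  # 10000
# A smaller loss stays under the sub-limit, so only the deductible bites.
print(amount_payable(Decimal("5000"), Decimal("1000"), Decimal("10000")))   # 4000
```

`Decimal` is used rather than `float` so that currency amounts are not distorted by binary rounding.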
How it's scored
Models run pass@1: one attempt, no retries. A coverage determination is scored against the recorded covered/not-covered outcome; a payable amount is scored against the recorded figure within a defined tolerance. Wording and rationale don't earn points; the determination does. The full grading rules are in the methodology.
Claims is one of three families in the InsureBench insurance AI benchmark, alongside underwriting and actuarial work.
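As a rough illustration of the scoring rule described in this section, the sketch below grades one attempt per case and averages the results. The function names are hypothetical, and the 1% relative tolerance is a placeholder assumption: the actual tolerance and grading rules are whatever the methodology defines.

```python
def score_coverage(predicted: bool, recorded: bool) -> bool:
    """A coverage call passes only if it matches the recorded outcome."""
    return predicted == recorded


def score_amount(predicted: float, recorded: float, rel_tol: float = 0.01) -> bool:
    """A payable amount passes if it is within a tolerance of the recorded figure.

    rel_tol is a placeholder; the real tolerance is set by the methodology.
    """
    return abs(predicted - recorded) <= rel_tol * abs(recorded)


def pass_at_1(results: list[bool]) -> float:
    """Share of cases passed on the single attempt (no retries)."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical run over four cases: two coverage calls and two amounts.
results = [
    score_coverage(predicted=True, recorded=True),
    score_coverage(predicted=True, recorded=False),
    score_amount(predicted=4000.0, recorded=4020.0),
    score_amount(predicted=10000.0, recorded=10000.0),
]
print(pass_at_1(results))  # 0.75
```

Nothing in the sketch looks at the model's explanation, which mirrors the rule that wording and rationale don't earn points.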