The actuarial AI benchmark
The actuarial track of InsureBench measures how well language models perform quantitative insurance work: reserving, pricing, and exposure calculations that apply the right tables and assumptions and carry the arithmetic through to a defensible figure. Every case resolves to a number and is scored pass@1 against the verified result.
What the actuarial track tests
Actuarial work is where insurance becomes arithmetic under rules. A reserve, a rate, or an exposure figure is the product of a defined method applied to specific data with the right assumptions, so a single mis-selected factor or dropped step changes the answer. The actuarial track gives a model the data and the relevant tables and asks for the figure, then checks it against the verified result.
That makes it an unusually strict actuarial AI benchmark: there's no partial credit for a sensible approach that lands on the wrong number. The model has to choose the right method, apply the right assumptions, and compute accurately, end to end.
Why actuarial work is hard for AI
- Method selection matters. The right technique depends on the data in front of the model; a reasonable-looking but wrong choice fails the case.
- Assumptions are specific. Development factors, discount rates, and rating tables have to be the correct ones, read from the correct place.
- Arithmetic must hold. Long multi-step calculations leave many opportunities to drift, and only the final figure is scored.
- Data is structured but unforgiving. Triangles, schedules, and exposure tables have to be read exactly, with no transposed cells.
Example case types
- Estimate outstanding claims reserves from a loss development triangle using the indicated method.
- Derive a technical premium input from exposure data and a given rating structure.
- Apply development factors and a discount assumption to reach a present-value reserve figure.
- Compute an exposure or frequency-severity figure from a structured data set and stated assumptions.
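To make the reserving case types above concrete, here is a minimal sketch of a chain-ladder calculation with an optional flat discount rate. It is an illustration only, not an InsureBench case or its reference solution: the triangle values, the function names, the volume-weighted factor choice, and the end-of-period payment timing are all assumptions made for the example.

```python
from typing import List, Optional

# Cumulative claims triangle: rows are origin periods, columns are development
# periods, and None marks cells that have not been observed yet.
# All rows are assumed to have the same length.
Triangle = List[List[Optional[float]]]


def development_factors(triangle: Triangle) -> List[float]:
    """Volume-weighted age-to-age factors, one per development step."""
    n_dev = len(triangle[0])
    factors = []
    for j in range(n_dev - 1):
        # Use only origin periods observed at both age j and age j + 1.
        pairs = [
            (row[j], row[j + 1])
            for row in triangle
            if row[j] is not None and row[j + 1] is not None
        ]
        factors.append(sum(b for _, b in pairs) / sum(a for a, _ in pairs))
    return factors


def chain_ladder_reserve(triangle: Triangle, discount_rate: float = 0.0) -> float:
    """Outstanding reserve: projected future increments, optionally discounted.

    Assumption: each projected increment is paid at the end of its
    development period, so it is discounted by a whole number of periods.
    """
    factors = development_factors(triangle)
    reserve = 0.0
    for row in triangle:
        observed = [v for v in row if v is not None]
        cumulative = observed[-1]  # latest diagonal value for this origin period
        remaining = factors[len(observed) - 1:]
        for periods_ahead, factor in enumerate(remaining, start=1):
            projected = cumulative * factor
            increment = projected - cumulative
            reserve += increment / (1.0 + discount_rate) ** periods_ahead
            cumulative = projected
    return reserve


# Hypothetical 3x3 cumulative triangle, invented for illustration.
triangle = [
    [100.0, 150.0, 165.0],
    [110.0, 165.0, None],
    [120.0, None, None],
]

undiscounted = chain_ladder_reserve(triangle)                     # approximately 94.5
discounted = chain_ladder_reserve(triangle, discount_rate=0.05)   # lower than undiscounted
```

Here the factors are 1.5 and 1.1, so the two open origin periods contribute 16.5 and 78.0 to the undiscounted reserve. Swapping in a different factor or discount rate moves the final figure, which is the only thing a case scores.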
How it's scored
Models run pass@1 (one attempt, no retries), and the final number is compared to the verified result within a defined tolerance. Working shown in support of an answer earns no credit on its own; only the figure is scored. The full grading rules are in the methodology. Actuarial is one of three families in the InsureBench insurance AI benchmark, alongside underwriting and claims.
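The scoring rule described above can be sketched roughly as follows. This is not the InsureBench grader: the `Attempt` type, the function names, and the 0.5% relative tolerance are hypothetical placeholders, since the actual tolerance is defined in the methodology rather than here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Attempt:
    """One model answer for one case (hypothetical structure)."""
    predicted: float
    verified: float


def within_tolerance(predicted: float, verified: float, rel_tol: float = 0.005) -> bool:
    # rel_tol is an assumed placeholder, not the benchmark's actual tolerance.
    # The allowed error is measured relative to the verified figure.
    return abs(predicted - verified) <= rel_tol * abs(verified)


def pass_at_1(attempts: Sequence[Attempt], rel_tol: float = 0.005) -> float:
    """Share of cases whose single attempt lands within tolerance."""
    if not attempts:
        raise ValueError("at least one attempt is required")
    passed = sum(within_tolerance(a.predicted, a.verified, rel_tol) for a in attempts)
    return passed / len(attempts)


# Invented figures: the second answer is off by more than the assumed tolerance.
attempts = [
    Attempt(predicted=94.5, verified=94.5),
    Attempt(predicted=100.0, verified=94.5),
    Attempt(predicted=1000.4, verified=1000.0),
]
score = pass_at_1(attempts)  # two of the three cases pass
```

Each case contributes either a pass or a fail, so an answer that is close but outside the tolerance scores the same as one that is far off.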