I'm an actuary by trade, and I spent the last few weeks putting together a realistic insurance claims dataset. Insurance data is hard to come by; most public datasets are either too simplistic or completely proprietary. The usual suspects (retail sales, Titanic) just don't cut it for anyone wanting to work with something that mirrors real industry complexity, so I built my own.
The dataset is a SQLite database covering four years of claims across employer groups. It includes members, claims, claim lines, providers, plans, premium rates, and the whole nine yards. I deliberately kept it messy: out-of-network pricing spreads, denial reasons, pending claims, annual maximum exhaustion, and processing lag are all in there.
It comes with 54 exercises across five tiers:
- Foundational SQL (SELECT, WHERE, GROUP BY, JOINs)
- Intermediate analytics (window functions, utilisation metrics, provider analysis)
- Advanced (CTEs, self-joins, cost trends, member behaviour)
- Actuarial analyses (IBNR, experience rating, credibility, frequency/severity)
- Data quality (duplicate claims, billing anomalies, eligibility audits)
Plus four open-ended capstone projects suitable for a portfolio, such as building dashboards. Full solution guide included. Works in DBeaver, DB Browser for SQLite, or any SQLite client, so no server setup is needed.
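To give a flavour of the frequency/severity style of exercise, here is a minimal sketch using Python's built-in `sqlite3` module. The `members` and `claims` tables, their columns, and the sample rows are made up for illustration and are not the dataset's actual schema; the query itself should run in any SQLite client.

```python
import sqlite3

# Hypothetical toy schema and rows for illustration only;
# the real dataset's table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (member_id INTEGER PRIMARY KEY, plan TEXT);
CREATE TABLE claims (
    claim_id    INTEGER PRIMARY KEY,
    member_id   INTEGER REFERENCES members(member_id),
    paid_amount REAL,
    status      TEXT
);
INSERT INTO members VALUES (1, 'A'), (2, 'A'), (3, 'B'), (4, 'B');
INSERT INTO claims VALUES
    (1, 1, 100.0, 'PAID'),
    (2, 1, 300.0, 'PAID'),
    (3, 2, 200.0, 'DENIED'),
    (4, 3, 500.0, 'PAID'),
    (5, 4,  50.0, 'PENDING');
""")

# Frequency = paid claims per member; severity = average paid amount per paid claim.
# The LEFT JOIN keeps members with no paid claims in the denominator.
query = """
SELECT m.plan,
       COUNT(DISTINCT m.member_id)                              AS members,
       COUNT(c.claim_id)                                        AS paid_claims,
       1.0 * COUNT(c.claim_id) / COUNT(DISTINCT m.member_id)    AS frequency,
       AVG(c.paid_amount)                                       AS severity
FROM members AS m
LEFT JOIN claims AS c
       ON c.member_id = m.member_id
      AND c.status = 'PAID'
GROUP BY m.plan
ORDER BY m.plan
"""
rows = conn.execute(query).fetchall()
for plan, members, paid_claims, frequency, severity in rows:
    print(plan, members, paid_claims, frequency, severity)
```

Putting the status filter in the join condition rather than in a WHERE clause is deliberate: a WHERE filter would drop members whose only claims were denied or pending, which would overstate frequency.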
I've published the dataset and guides on Gumroad. If anyone's interested, drop a comment and I'll send the link.