Built a realistic insurance claims SQL dataset - 54 exercises

backlinker

I'm an actuary by trade, and I spent the last few weeks putting together a realistic insurance claims dataset. Insurance data is hard to come by: what's publicly available tends to be too simplistic, and anything realistic is proprietary. The usual suspects (retail sales, Titanic) just don't cut it for anyone wanting to work with something that mirrors real industry complexity, so I built my own.

The dataset is a SQLite database covering four years of claims across employer groups. It includes members, claims, claim lines, providers, plans, premium rates, and the whole nine yards. I deliberately kept it messy: out-of-network pricing spreads, denial reasons, pending claims, annual maximum exhaustion, and processing lag are all in there.

It comes with 54 exercises across five tiers:

Foundational SQL (SELECT, WHERE, GROUP BY, JOINs)
Intermediate analytics (window functions, utilisation metrics, provider analysis)
Advanced (CTEs, self-joins, cost trends, member behaviour)
Actuarial analyses (IBNR, experience rating, credibility, frequency/severity)
Data quality (duplicate claims, billing anomalies, eligibility audits)

Plus four open-ended capstone projects suitable for a portfolio, such as building a dashboard. Full solution guide included. Works in DBeaver, DB Browser for SQLite, or any SQLite client, with no server setup needed.
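To give a flavour of the data quality and frequency/severity style of exercise, here's a minimal sketch using Python's built-in sqlite3 module. The claims table, its columns, and the sample rows are made up for illustration and are not the dataset's actual schema; treating "same member, same service date, same procedure code" as a possible duplicate is also just an assumption for the example.

```python
import sqlite3

# Hypothetical toy table: names and rows are illustrative only,
# not the real schema or data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims (
    claim_id       INTEGER PRIMARY KEY,
    member_id      INTEGER,
    service_date   TEXT,
    procedure_code TEXT,
    paid_amount    REAL
);
INSERT INTO claims VALUES
    (1, 101, '2023-01-15', 'A1',  80.0),
    (2, 101, '2023-01-15', 'A1',  80.0),
    (3, 102, '2023-02-03', 'B2', 600.0),
    (4, 103, '2023-03-20', 'C3',  40.0);
""")

# Possible duplicates: more than one claim for the same member,
# service date, and procedure code.
duplicates = conn.execute("""
    SELECT member_id, service_date, procedure_code, COUNT(*) AS n_claims
    FROM claims
    GROUP BY member_id, service_date, procedure_code
    HAVING COUNT(*) > 1
""").fetchall()

# Simple severity measure: average paid amount per claim.
severity = conn.execute("SELECT AVG(paid_amount) FROM claims").fetchone()[0]

print(duplicates)
print(severity)
```

The same queries can be pasted into DBeaver or DB Browser for SQLite once the table and column names are swapped for the real ones.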

I've published the dataset and guides on Gumroad. If anyone's interested, drop a comment and I'll send the link.

pixelpusher

Sounds good, but does it actually reflect real-world data noise? Link?

marketingmule

Has anyone got a link for this? I've been looking for some solid SQL practice with insurance data - sounds perfect.