One month of NYC 311, cleaned and proven

A 331,976-row public dataset made analysis-ready, with the validation report to prove it.

Problem

"Looks clean" is not a data-quality standard. One month of NYC 311 service requests (331,976 rows) passed the eyeball test, yet nobody could say what would break an analysis built on it.

What I built

A rules-first pipeline: six explicit validation rules run before anything is touched, DuckDB applies the fixes, the same six rules run again, and every number lands in a committed receipts file. Re-running it produces identical results, byte for byte.
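
For readers who want the shape of that loop in code, here is a minimal sketch, not the pipeline itself. It uses Python's built-in sqlite3 as a stand-in for DuckDB so it runs with no installs. The table layout, the column names, the three rules shown (the real pipeline has six), and the choice of upper case as the canonical borough form are all assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical predicates: each one matches rows that VIOLATE a rule.
BAD_CLOSE = "closed_date < created_date"
BAD_ZIP = "zip IS NOT NULL AND (length(zip) <> 5 OR zip GLOB '*[^0-9]*')"
DUP_KEY = ("unique_key IN (SELECT unique_key FROM requests "
           "GROUP BY unique_key HAVING count(*) > 1)")

RULES = {
    "closed_not_before_created": BAD_CLOSE,
    "zip_is_five_digits": BAD_ZIP,
    "unique_key_is_unique": DUP_KEY,
}


def validate(con):
    """Count total rows and the rows violating each rule."""
    total = con.execute("SELECT count(*) FROM requests").fetchone()[0]
    failures = {
        name: con.execute(
            f"SELECT count(*) FROM requests WHERE {pred}"
        ).fetchone()[0]
        for name, pred in RULES.items()
    }
    return {"rows": total, "failures": failures}


def apply_fixes(con):
    """Null the impossible values and normalize borough case."""
    con.execute(f"UPDATE requests SET closed_date = NULL WHERE {BAD_CLOSE}")
    con.execute(f"UPDATE requests SET zip = NULL WHERE {BAD_ZIP}")
    # Assumption: upper case is the canonical borough form.
    con.execute("UPDATE requests SET borough = upper(borough)")


def run(rows):
    """Validate, fix, validate again; return the receipts as stable JSON."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE requests (unique_key INTEGER, created_date TEXT, "
        "closed_date TEXT, zip TEXT, borough TEXT)"
    )
    con.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?)", rows)
    before = validate(con)
    apply_fixes(con)
    after = validate(con)
    con.close()
    # sort_keys keeps the serialized receipts identical across runs.
    return json.dumps({"before": before, "after": after},
                      sort_keys=True, indent=2)


# Made-up sample rows: one clean, one closed before it was created,
# one with an "N/A" zip code.
SAMPLE = [
    (1, "2024-01-01T10:00:00", "2024-01-02T09:00:00", "10001", "BROOKLYN"),
    (2, "2024-01-05T10:00:00", "2024-01-04T10:00:00", "11201", "Brooklyn"),
    (3, "2024-01-07T08:00:00", None, "N/A", "QUEENS"),
]

if __name__ == "__main__":
    print(run(SAMPLE))
```

Sharing one predicate between the check and the fix means the after-check tests exactly what the fix claimed to repair. Writing the returned JSON to a committed file is left out of the sketch.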

Result

The before-check scored 99.99% — and found the 41 rows eyeballs miss: 40 requests closed before they were created and one zip code that read "N/A". The pipeline nulled the impossible values, case-folded 397 mixed-case boroughs to canonical form, and verified zero duplicate keys — a fact now proven, not assumed. After-check: 100% of 331,976 rows passing every rule, in a 1.4-second run. The before/after reports and receipts are public.
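
The 99.99% figure can be checked from the counts above. This small sketch assumes the score is the share of rows passing every rule and that the 40 bad close dates and the one "N/A" zip code sit on different rows; the function name is made up for illustration.

```python
TOTAL_ROWS = 331_976
# 40 requests closed before creation + 1 "N/A" zip code (assumed not to overlap).
FAILING_ROWS = 40 + 1


def pass_rate(total, failing):
    """Percentage of rows that pass every rule."""
    return 100 * (total - failing) / total


if __name__ == "__main__":
    print(f"{pass_rate(TOTAL_ROWS, FAILING_ROWS):.2f}%")  # prints 99.99%
```

At two decimal places, 41 failing rows out of 331,976 is easy to round away, which is why the report lists the failing counts alongside the percentage.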

What this costs you

This is the shape of the $500 starter: one dataset, cleaned and validated, receipts included. Bigger messes run $600–$1,500.

Receipts in your inbox.