Cleaning a month of NYC 311 data

Cleaning a real dataset starts with writing down what "clean" means — I ran one month of NYC 311 requests (331,976 rows) through six explicit rules, fixed exactly what failed, and published the receipts.

What was actually wrong with it?

Less than the dataset's reputation suggests — and that's the point. The before-check scored 99.99%: 40 requests closed before they were created, one zip code that read "N/A", and 397 boroughs typed in mixed case. Duplicates? Zero — but now that's a verified fact instead of a hope.

How do you clean it without lying to yourself?

Rules first, fixes second, re-check third. The same six rules run before and after the fix — the delta is the work: 99.99% → 100%. If a fix can't be re-checked, it didn't happen. A dataset that was almost clean still isn't a dataset you can certify — until you run the rules.

What made it repeatable?

DuckDB SQL, one deterministic pass, idempotent by design: run it twice, get identical numbers. The whole check-fix-recheck cycle takes 1.4 seconds, so it can run on every refresh, not once a quarter. The pipeline, rules, and both reports are public in the repo.

Key takeaways

Define "clean" as explicit rules before touching the data.
Measure before and after with the same rules — the delta is the receipt.
331,976 rows went 99.99% → 100% in 1.4 seconds — 41 real defects, found and fixed.
"Probably fine" and "verified: zero duplicates" are different products.

Keep reading: How much does it cost to clean up messy data? and the full case study.