Skip to content
Data EngineeringJul 4, 2026 · 5 min read

Cleaning a month of NYC 311 data

331,976 rows: 99.99% → 100% passing validation, in 1.4 s.

Cleaning a real dataset starts with writing down what "clean" means — I ran one month of NYC 311 requests (331,976 rows) through six explicit rules, fixed exactly what failed, and published the receipts.

What was actually wrong with it?

Less than the dataset's reputation suggests — and that's the point. The before-check scored 99.99%: 40 requests closed before they were created, one zip code that read "N/A", and 397 boroughs typed in mixed case. Duplicates? Zero — but now that's a verified fact instead of a hope.

How do you clean it without lying to yourself?

Rules first, fixes second, re-check third. The same six rules run before and after the fix — the delta is the work: 99.99% → 100%. If a fix can't be re-checked, it didn't happen. A dataset that was almost clean still isn't a dataset you can certify — until you run the rules.

What made it repeatable?

DuckDB SQL, one deterministic pass, idempotent by design: run it twice, get identical numbers. The whole check-fix-recheck cycle takes 1.4 seconds, so it can run on every refresh, not once a quarter. The pipeline, rules, and both reports are public in the repo.

Key takeaways

  • Define "clean" as explicit rules before touching the data.
  • Measure before and after with the same rules — the delta is the receipt.
  • 331,976 rows went 99.99% → 100% in 1.4 seconds — 41 real defects, found and fixed.
  • "Probably fine" and "verified: zero duplicates" are different products.

Keep reading: How much does it cost to clean up messy data? and the full case study.


Read the full writeup → the case study

Clip this

The newsletter

Receipts in your inbox.

Every build and post, as it ships. No fluff.

Work with freddyxai