One month of NYC 311, cleaned and proven
A 331,976-row public dataset made analysis-ready, with the validation report to prove it.
Jul 4, 2026
Problem
"Looks clean" is not a data-quality standard. One month of NYC 311 service requests (331,976 rows) passed the eyeball test, yet nobody could say what would break an analysis built on it.
What I built
A rules-first pipeline: six explicit validation rules run before anything is touched, DuckDB applies the fixes, the same six rules run again, and every number lands in a committed receipts file. Re-running it produces identical results, byte for byte.
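The check, fix, re-check loop can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's own code: it uses plain Python instead of DuckDB so it runs anywhere, it shows three rules rather than all six, and the column names (`key`, `created`, `closed`, `zip`, `borough`), the five-digit zip rule, and uppercase as the canonical borough form are all hypothetical. The sample rows are made up.

```python
import json
from datetime import datetime

# Hypothetical rule predicates: each returns True when a row FAILS the rule.
def closed_before_created(row):
    created, closed = row["created"], row["closed"]
    return created is not None and closed is not None and closed < created

def bad_zip(row):
    z = row["zip"]
    return z is not None and not (len(z) == 5 and z.isdigit())

def check(rows):
    """Count failures per rule without modifying anything."""
    keys = [r["key"] for r in rows]
    return {
        "closed_before_created": sum(closed_before_created(r) for r in rows),
        "bad_zip": sum(bad_zip(r) for r in rows),
        "duplicate_key": len(keys) - len(set(keys)),
    }

def fix(rows):
    """Null impossible values and normalize borough case; no rows are dropped."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the input stays untouched
        if closed_before_created(row):
            row["closed"] = None
        if bad_zip(row):
            row["zip"] = None
        if row["borough"] is not None:
            row["borough"] = row["borough"].upper()  # assumed canonical form
        cleaned.append(row)
    return cleaned

def run(rows):
    """Same rules before and after the fix; the counts become the receipts."""
    before = check(rows)
    cleaned = fix(rows)
    after = check(cleaned)
    return cleaned, {"rows": len(rows), "before": before, "after": after}

# Made-up sample rows, one per kind of problem.
sample = [
    {"key": 1, "created": datetime(2026, 6, 2), "closed": datetime(2026, 6, 1),
     "zip": "10001", "borough": "Brooklyn"},
    {"key": 2, "created": datetime(2026, 6, 3), "closed": datetime(2026, 6, 4),
     "zip": "N/A", "borough": "QUEENS"},
    {"key": 3, "created": datetime(2026, 6, 5), "closed": None,
     "zip": "11201", "borough": "BRONX"},
]

cleaned, receipts = run(sample)
# Sorted keys keep the serialized receipts identical from run to run.
receipts_text = json.dumps(receipts, sort_keys=True, indent=2)
print(receipts_text)
```

Running the same rule set on both sides of the fix is what makes the before and after numbers comparable. Serializing with sorted keys is one simple way to keep a receipts file stable across re-runs.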
Result
The before-check scored 99.99% and found the 41 rows that eyeballs miss: 40 requests closed before they were created and one zip code that read "N/A". The pipeline nulled those impossible values, case-folded 397 mixed-case boroughs to canonical form, and verified zero duplicate keys, a fact now proven rather than assumed. After-check: 100% of 331,976 rows pass every rule, in a 1.4-second run. The before/after reports and receipts are public.
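The 99.99% figure can be reproduced from the counts above. This is a quick arithmetic check, assuming the score is the share of rows passing every rule and that the 41 failing rows are distinct; the variable names are illustrative.

```python
total_rows = 331_976
failing_before = 40 + 1  # closed-before-created rows + the "N/A" zip row

pass_rate = (total_rows - failing_before) / total_rows
print(f"{pass_rate:.2%}")  # prints 99.99%
```

At two decimal places the 41 bad rows all but vanish from the score, which is why the per-rule counts matter more than the headline percentage.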
What this costs you
This is the shape of the $500 starter: one dataset, cleaned and validated, receipts included. Bigger messes run $600–$1,500.
Buy this build: starter from $500 · full cleanup $600–$1,500. Work with freddyxai →
Read the full writeup → Cleaning a month of NYC 311 data