
One month of NYC 311, cleaned and proven

A 331,976-row public dataset made analysis-ready, with the validation report to prove it.

Jul 4, 2026

100% of rows passing all rules (was 99.99%)
331,976 rows in
41 defects found + fixed
1.4 s pipeline runtime
Python · DuckDB · GitHub

Problem

"Looks clean" is not a data-quality standard. One month of NYC 311 service requests — 331,976 rows — passed the eyeball test; nobody could say what would break an analysis built on it.

What I built

A rules-first pipeline: six explicit validation rules run before anything is touched, DuckDB applies the fixes, the same six rules run again, and every number lands in a committed receipts file. Re-running it produces identical results, byte for byte.
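The validate, fix, validate, write-receipts loop can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: it uses plain Python in place of DuckDB, shows only two of the six rules, and the rule names, field names (`created`, `closed`, `zip`), and sample rows are all invented for the example.

```python
import hashlib
import json

# Hypothetical rules: each name maps to a predicate over one row (a dict).
# The real pipeline has six rules; two are enough to show the shape.
RULES = {
    "closed_not_before_created": lambda r: r["closed"] is None or r["closed"] >= r["created"],
    "zip_is_five_digits_or_null": lambda r: r["zip"] is None
    or (len(r["zip"]) == 5 and r["zip"].isdigit()),
}


def validate(rows):
    """Count failures per rule and the number of rows passing every rule."""
    failures = {name: 0 for name in RULES}
    passing = 0
    for row in rows:
        ok = True
        for name, rule in RULES.items():
            if not rule(row):
                failures[name] += 1
                ok = False
        passing += ok
    return {"rows": len(rows), "rows_passing_all": passing, "failures": failures}


def fix(rows):
    """Null out any value that breaks a rule. Rows are copied, never dropped."""
    fixed = []
    for row in rows:
        row = dict(row)
        if not RULES["closed_not_before_created"](row):
            row["closed"] = None
        if not RULES["zip_is_five_digits_or_null"](row):
            row["zip"] = None
        fixed.append(row)
    return fixed


def run(rows):
    """Validate, fix, validate again, and serialize both reports as receipts."""
    before = validate(rows)
    cleaned = fix(rows)
    after = validate(cleaned)
    # sort_keys plus a fixed indent keeps the serialized receipts stable across runs.
    receipts = json.dumps({"before": before, "after": after}, sort_keys=True, indent=2)
    return cleaned, receipts


# Invented sample rows; ISO-8601 strings in one format compare correctly as text.
SAMPLE = [
    {"created": "2026-06-01T09:00:00", "closed": "2026-06-02T10:00:00", "zip": "10001"},
    {"created": "2026-06-03T09:00:00", "closed": "2026-06-01T08:00:00", "zip": "11201"},
    {"created": "2026-06-04T09:00:00", "closed": None, "zip": "N/A"},
]

cleaned, receipts = run(SAMPLE)
digest = hashlib.sha256(receipts.encode("utf-8")).hexdigest()
print(receipts)
print("receipts sha256:", digest)
```

Running it twice on the same input should print the same digest, which is the property the receipts file relies on. The same rule predicates drive both the check and the fix, so the after-check cannot silently test something different from the before-check.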

Result

The before-check scored 99.99% — and found the 41 rows eyeballs miss: 40 requests closed before they were created and one zip code that read "N/A". The pipeline nulled the impossible values, case-folded 397 mixed-case boroughs to canonical form, and verified zero duplicate keys — a fact now proven, not assumed. After-check: 100% of 331,976 rows passing every rule, in a 1.4-second run. The before/after reports and receipts are public.
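The three fixes described above map to short SQL statements. The sketch below is an assumption-heavy illustration: it uses Python's standard-library `sqlite3` as a stand-in for DuckDB so it runs with no installs, the table name `requests` and the column names are guesses at the dataset's schema, the canonical borough form is assumed to be upper case, and the three rows are invented.

```python
import sqlite3

# sqlite3 stands in for DuckDB here; the statements are plain SQL.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE requests (
        unique_key   INTEGER,
        created_date TEXT,
        closed_date  TEXT,
        incident_zip TEXT,
        borough      TEXT
    )
    """
)
con.executemany(
    "INSERT INTO requests VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2026-06-01 09:00:00", "2026-06-02 10:00:00", "10001", "MANHATTAN"),
        (2, "2026-06-03 09:00:00", "2026-06-01 08:00:00", "11201", "Brooklyn"),
        (3, "2026-06-04 09:00:00", None, "N/A", "QUEENS"),
    ],
)

# Fix 1: a request cannot close before it was created, so null the closed date.
# Rows with a NULL closed_date are untouched because the comparison yields NULL.
con.execute("UPDATE requests SET closed_date = NULL WHERE closed_date < created_date")

# Fix 2: the placeholder string "N/A" is not a zip code, so null it.
con.execute("UPDATE requests SET incident_zip = NULL WHERE incident_zip = 'N/A'")

# Fix 3: fold mixed-case boroughs to one form (upper case is assumed here).
con.execute("UPDATE requests SET borough = UPPER(borough) WHERE borough <> UPPER(borough)")

# Check: total rows minus distinct keys is the number of surplus duplicate keys.
duplicate_keys = con.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT unique_key) FROM requests"
).fetchone()[0]

rows = con.execute(
    "SELECT unique_key, closed_date, incident_zip, borough "
    "FROM requests ORDER BY unique_key"
).fetchall()

for row in rows:
    print(row)
print("duplicate keys:", duplicate_keys)
```

Nulling rather than deleting keeps the row count at the input count, which is why the after-check can still report against all 331,976 rows.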

What this costs you

This is the shape of the $500 starter: one dataset, cleaned and validated, receipts included. Bigger messes run $600–$1,500.


Read the full writeup → Cleaning a month of NYC 311 data
