Deduping records, scored against the truth

I fuzzy-matched a messy restaurant dataset and proved the matches hold up on a held-out test split.

Problem

"Are these two records the same customer?" is the question behind every deduplication. The only honest way to answer it is to check against labeled ground truth and to report the score on data you didn't tune on.

What I built

A fuzzy-matching pipeline on the Fodors–Zagat benchmark: block candidate pairs, score each with a weighted name/address/city/phone similarity in DuckDB, then pick the match threshold using only the training labels and report precision, recall, and F1 on the held-out test split.
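The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline itself: the weights, the city-based block key, the threshold grid, and the function names are all hypothetical, and the standard library's difflib stands in for the similarity scoring that the real pipeline runs in DuckDB.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical weights; the real values are not stated in this write-up.
WEIGHTS = {"name": 0.5, "address": 0.25, "city": 0.1, "phone": 0.15}

def sim(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib is a stand-in for the DuckDB scoring."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(r1: dict, r2: dict) -> float:
    """Weighted sum of the per-field similarities."""
    return sum(w * sim(r1[f], r2[f]) for f, w in WEIGHTS.items())

def candidate_pairs(records: list[dict]):
    """Blocking: only compare records that share a block key (assumed here: city)."""
    blocks: dict[str, list[dict]] = {}
    for r in records:
        blocks.setdefault(r["city"].lower(), []).append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

def prf(scored, threshold):
    """Precision, recall, and F1 for (score, is_match) pairs at a threshold."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def pick_threshold(train_scored):
    """Choose the threshold with the best F1, looking at training pairs only."""
    grid = [t / 100 for t in range(50, 100)]
    return max(grid, key=lambda t: prf(train_scored, t)[2])
```

The key design choice is that pick_threshold only ever sees training pairs. The test pairs are scored once, at the threshold already chosen, so the reported numbers cannot be flattered by tuning.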

Result

Test F1 0.9333 (precision 0.9130, recall 0.9545) at a threshold of 0.88 tuned only on train: 2 false matches and 1 missed pair out of 189 test pairs. Fodors–Zagat is a well-separated benchmark, so a high score is expected; the point is the method and the honest evaluation, not the leaderboard. The full metrics and a true/false-positive audit trail are public.
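As a sanity check, the reported scores can be rebuilt from the error counts. One number here is an assumption: the true-positive count of 21 is not stated above. It is back-calculated as the only whole number consistent with 2 false matches, 1 missed pair, and the reported precision and recall.

```python
# Error counts from the result above; tp = 21 is inferred, not reported.
tp, fp, fn = 21, 2, 1

precision = tp / (tp + fp)  # 21 / 23
recall = tp / (tp + fn)     # 21 / 22
f1 = 2 * precision * recall / (precision + recall)  # same as 2*tp / (2*tp + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: precision=0.9130 recall=0.9545 f1=0.9333
```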

What this costs you

This is the $500 starter package: records matched, scored, receipts included. Full cleanups run $600–$1,500.

Receipts in your inbox.