Deduping records, scored against the truth
Fuzzy-matched a messy restaurant dataset — and proved it with a held-out test split.
Jul 4, 2026
Problem
"Are these two records the same customer?" is the question behind every deduplication. The only honest way to answer it is against labeled ground truth — and to report the score on data you didn't tune on.
What I built
A fuzzy-matching pipeline on the Fodors–Zagat benchmark: block candidate pairs, score each with a weighted name/address/city/phone similarity in DuckDB, then pick the match threshold using only the training labels and report precision, recall, and F1 on the held-out test split.
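The scoring and threshold steps can be sketched in a few lines of Python. This is a minimal illustration, not the build's actual code. The real scoring runs in DuckDB, and the weights, the `difflib` similarity, the candidate threshold grid, and every function name below are assumptions made for the example. Blocking is left out, so the sketch starts from candidate pairs that already exist.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the post does not state the ones it used.
WEIGHTS = {"name": 0.4, "address": 0.3, "city": 0.1, "phone": 0.2}


def field_sim(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib is a stand-in for any fuzzy string metric."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pair_score(left: dict, right: dict) -> float:
    """Weighted sum of per-field similarities for one candidate pair."""
    return sum(w * field_sim(left[f], right[f]) for f, w in WEIGHTS.items())


def prf(scores, labels, threshold):
    """Precision, recall, and F1 when pairs scoring >= threshold are called matches."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pick_threshold(train_scores, train_labels):
    """Choose the threshold with the best F1 on the training pairs only."""
    candidates = [i / 100 for i in range(50, 100)]  # assumed grid: 0.50 to 0.99
    return max(candidates, key=lambda t: prf(train_scores, train_labels, t)[2])
```

In use, `pick_threshold` sees only the training scores and labels. The threshold it returns is then frozen and passed once to `prf` with the test scores and labels, so the reported numbers come from pairs that played no part in tuning.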
Result
Test F1 0.9333 (precision 0.913, recall 0.9545) at a threshold of 0.88 tuned only on train — 2 false matches and 1 missed pair out of 189 test pairs. Fodors–Zagat is a well-separated benchmark, so a high score is expected; the point is the method and the honest evaluation, not the leaderboard. The full metrics and a true/false-positive audit trail are public.
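The reported figures can be checked by hand. The post gives 2 false matches and 1 missed pair. The count of 21 true matches is not stated. It is inferred here as the only count consistent with the reported precision and recall, so treat it as an assumption.

```python
# 2 false positives and 1 false negative come from the post.
# 21 true positives is inferred from the reported precision and recall.
tp, fp, fn = 21, 2, 1

precision = tp / (tp + fp)        # 21 / 23
recall = tp / (tp + fn)           # 21 / 22
f1 = 2 * tp / (2 * tp + fp + fn)  # 42 / 45

print(round(precision, 4), round(recall, 4), round(f1, 4))
# prints: 0.913 0.9545 0.9333
```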
What this costs you
This is the $500 starter shape — records matched, scored, receipts included. Full cleanups run $600–$1,500.