
Deduping records, scored against the truth

Fuzzy-matched a messy restaurant dataset — and proved it with a held-out test split.

Jul 4, 2026

Test F1: 0.9333 (on held-out test data)
Test precision / recall: 0.913 / 0.9545
Threshold (tuned on train): 0.88
Held-out test pairs: 189
DuckDB · Python

Problem

"Are these two records the same customer?" is the question behind every deduplication. The only honest way to answer it is against labeled ground truth — and to report the score on data you didn't tune on.

What I built

A fuzzy-matching pipeline on the Fodors–Zagat benchmark: block candidate pairs, score each with a weighted name/address/city/phone similarity in DuckDB, then pick the match threshold using only the training labels and report precision, recall, and F1 on the held-out test split.
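
As a rough illustration of the scoring and threshold-selection steps, here is a minimal standard-library sketch. It is not the pipeline's actual code: the field weights, the candidate threshold grid, and the function names are all hypothetical, and `difflib.SequenceMatcher` stands in for whichever DuckDB string-similarity functions the real build uses.

```python
from difflib import SequenceMatcher

# Hypothetical field weights (assumption): the real weights are not stated here.
WEIGHTS = {"name": 0.5, "address": 0.25, "city": 0.1, "phone": 0.15}


def field_sim(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib is a stand-in for the DuckDB similarity call."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pair_score(left: dict, right: dict) -> float:
    """Weighted sum of per-field similarities for one candidate pair."""
    return sum(w * field_sim(left[f], right[f]) for f, w in WEIGHTS.items())


def f1_at(threshold: float, scores: list, labels: list) -> float:
    """F1 when every pair scoring at or above the threshold is called a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(train_scores: list, train_labels: list) -> float:
    """Pick the best-F1 threshold using training pairs only (grid is an assumption)."""
    candidates = [i / 100 for i in range(50, 100)]
    return max(candidates, key=lambda t: f1_at(t, train_scores, train_labels))
```

The key design point is that `tune_threshold` only ever sees training scores and labels. The chosen threshold is then frozen and applied once to the test pairs, so the reported test metrics are not influenced by the tuning.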

Result

Test F1 0.9333 (precision 0.913, recall 0.9545) at a threshold of 0.88 tuned only on train — 2 false matches and 1 missed pair out of 189 test pairs. Fodors–Zagat is a well-separated benchmark, so a high score is expected; the point is the method and the honest evaluation, not the leaderboard. The full metrics and a true/false-positive audit trail are public.
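
The reported figures can be checked with a few lines of arithmetic. The sketch below assumes 21 true positives, a count that is not stated above but is the only one consistent with 2 false matches, 1 missed pair, and the reported precision and recall.

```python
# Counts on the held-out test split.
# fp and fn come from the result above; tp = 21 is inferred (assumption).
tp, fp, fn = 21, 2, 1

precision = tp / (tp + fp)  # 21 / 23
recall = tp / (tp + fn)     # 21 / 22
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
# prints: 0.913 0.9545 0.9333
```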

What this costs you

This is the $500 starter shape — records matched, scored, receipts included. Full cleanups run $600–$1,500.


Read the full writeup → Fuzzy matching, scored honestly
