Fuzzy matching, scored honestly

Anyone can report a dedupe accuracy number. The question is whether they tuned it on the data they're reporting on. I ran fuzzy matching on the Fodors–Zagat benchmark and scored it the honest way: threshold picked on train, precision/recall/F1 reported on a held-out test split.

How do you actually match messy records?

Block obvious non-matches so they are never compared, then score each remaining candidate pair by a weighted similarity across the fields that identify a record: name, address, city, phone. A pair scoring above the threshold is a match. Simple; the honesty is in the evaluation, not the model.
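To make that concrete, here is a minimal sketch of blocking plus weighted scoring. It is not the exact pipeline behind the numbers in this post: the field weights, the city-based blocking key, and the use of Python's standard-library difflib.SequenceMatcher as the similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the real weights are not published in this post.
WEIGHTS = {"name": 0.4, "address": 0.3, "city": 0.1, "phone": 0.2}


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(str(value).lower().split())


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values; 0 if either is empty."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def pair_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    weighted = sum(
        w * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, w in weights.items()
    )
    return weighted / total


def block_key(rec):
    """Hypothetical blocking key: the normalized city."""
    return normalize(rec.get("city", ""))


def candidate_pairs(left, right):
    """Yield (i, j) index pairs only for records that share a blocking key."""
    buckets = {}
    for j, rec in enumerate(right):
        buckets.setdefault(block_key(rec), []).append(j)
    for i, rec in enumerate(left):
        for j in buckets.get(block_key(rec), []):
            yield i, j
```

A stricter blocking key skips more comparisons but risks dropping true matches whose key differs between sources (an abbreviated city name, say), so the choice of key is worth checking against labels too.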

Why the train/test split is the whole story

A threshold tuned on the same data you report on is a magic trick, not a measurement. Here the threshold (0.88) was chosen using only the training labels, then frozen. On the untouched test split: F1 0.9333, precision 0.913, recall 0.9545. Those figures are consistent with 21 correct matches, 2 false matches, and 1 miss out of 189 pairs.
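The tune-on-train, report-on-test discipline can be sketched in a few lines. The function names and the threshold grid below are hypothetical, and picking the threshold by best training F1 is an assumption about the selection rule; the post only says the threshold came from the training labels.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions, with 0.0 returned when a denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


def confusion(scores, labels, threshold):
    """Count true positives, false positives, and false negatives.

    A pair is predicted a match when its score is above the threshold.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    return tp, fp, fn


def pick_threshold(train_scores, train_labels, grid=None):
    """Return the grid threshold with the best F1 on the TRAINING pairs only.

    Ties go to the lowest threshold, because max() keeps the first maximum.
    """
    grid = grid or [round(0.50 + 0.01 * k, 2) for k in range(50)]
    return max(
        grid,
        key=lambda t: precision_recall_f1(*confusion(train_scores, train_labels, t))[2],
    )


# Sanity check on the reported test figures. 21 true matches is inferred:
# it is the whole-number count that reproduces the reported precision and
# recall given 2 false matches and 1 miss.
p, r, f1 = precision_recall_f1(tp=21, fp=2, fn=1)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.913 0.9545 0.9333
```

The test labels never reach pick_threshold. Once it returns, the threshold is a constant, and confusion is called exactly once on the test scores.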

Is a high score bragging?

No — and saying so is the point. Fodors–Zagat is a well-separated benchmark; strong methods score near-perfect. A high number here reflects the dataset, not a claim that your CRM is this clean. The audit trail shows the actual matches, false positives, and misses.
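An audit trail does not need to be elaborate. As a rough sketch, assuming each scored test pair is a (left_id, right_id, score, is_match) tuple, the following labels every pair that matters and writes it out as CSV. The column names and outcome labels are hypothetical, not the format of the published trail.

```python
import csv
import io


def audit_rows(pairs, threshold):
    """Label each scored pair as a true match, false positive, or miss.

    pairs: iterable of (left_id, right_id, score, is_match) tuples.
    Correctly rejected non-matches are skipped to keep the trail readable.
    """
    rows = []
    for left_id, right_id, score, is_match in pairs:
        predicted = score > threshold
        if predicted and is_match:
            outcome = "true_match"
        elif predicted:
            outcome = "false_positive"
        elif is_match:
            outcome = "miss"
        else:
            continue
        rows.append(
            {
                "left_id": left_id,
                "right_id": right_id,
                "score": round(score, 4),
                "outcome": outcome,
            }
        )
    return rows


def write_audit(rows, handle):
    """Write the audit rows as CSV to any file-like object."""
    writer = csv.DictWriter(
        handle, fieldnames=["left_id", "right_id", "score", "outcome"]
    )
    writer.writeheader()
    writer.writerows(rows)


# Usage with made-up pairs and the frozen threshold from the post.
example = [
    ("f1", "z1", 0.95, True),
    ("f2", "z9", 0.91, False),
    ("f3", "z3", 0.70, True),
    ("f4", "z8", 0.20, False),
]
buffer = io.StringIO()
write_audit(audit_rows(example, threshold=0.88), buffer)
print(buffer.getvalue())
```

Reading the false positives and misses row by row usually says more about where a matcher fails than the aggregate score does.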

Key takeaways

Score against labels or you're guessing — dedupe accuracy needs ground truth.
Tune on train, report on test — a number from the data you tuned on is meaningless.
0.9333 test F1, threshold 0.88 frozen from train — the method generalized.
Publish the audit trail — real false positives beat a single accuracy figure.
