Reprocessing only what changed

An incremental Parquet lake that refreshes one partition instead of rebuilding everything.

Problem

Most pipelines reprocess everything on every run — full reload, every night, whether the data changed or not. On a growing dataset that's a bill that only goes up. The fix is boring and proven: partition the data, and only touch the partitions that changed.

What I built

An incremental Parquet lake over NYC TLC yellow-taxi trips, partitioned by pickup year/month. Each monthly load is idempotent: it overwrites that month's partition instead of appending to it, so re-running a load is safe. Adding a new month touches a single partition, while a full rebuild reprocesses every month. Both paths run the same transform, so the comparison is fair.
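The overwrite-not-append idea can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the names `partition_dir`, `refresh_partition`, and `count_rows` are hypothetical, the `year=/month=` directory layout is an assumption, and JSON-lines files stand in for Parquet so the sketch runs on the standard library alone.

```python
import json
import shutil
import tempfile
from pathlib import Path


def partition_dir(lake_root: Path, year: int, month: int) -> Path:
    """Directory for one pickup month (assumed layout: year=YYYY/month=MM)."""
    return lake_root / f"year={year}" / f"month={month:02d}"


def refresh_partition(lake_root: Path, year: int, month: int, rows: list[dict]) -> int:
    """Replace one month's partition with `rows`; never append to it."""
    target = partition_dir(lake_root, year, month)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so a failed write never leaves a half-written partition.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=target.parent))
    try:
        # Stand-in for a Parquet write: one JSON object per line.
        with open(staging / "part-0.jsonl", "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
        if target.exists():
            shutil.rmtree(target)  # drop the old copy of this month only
        staging.rename(target)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return len(rows)


def count_rows(lake_root: Path) -> int:
    """Total rows across every partition in the lake."""
    total = 0
    for data_file in lake_root.glob("year=*/month=*/*.jsonl"):
        with open(data_file, encoding="utf-8") as fh:
            total += sum(1 for _ in fh)
    return total
```

Because each load replaces the month's directory wholesale, loading the same month twice leaves `count_rows` unchanged, which is the property the idempotency check relies on. Other months are never opened, so the work per refresh depends on the size of one partition rather than the size of the lake.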

Result

Each refresh reprocesses only the partition that changed: 4,090,822 rows of the 27,485,631 in the lake (7 months of NYC taxi trips, 437.3 MB of partitioned Parquet), not the whole history. That's the point. A full reload's cost grows with everything you've ever loaded, while an incremental refresh tracks only what moved. Measured across a growing lake, the full rebuild climbed to 0.88 s at 7 months while the incremental refresh stayed flat at ~0.12 s. That is about 7.3× less processing time today, and the gap widens as history accumulates. Re-running a month added 0 net rows, so idempotency is measured, not asserted. (Runtimes are processing time on one machine, median of 5 runs, and small at this scale; the durable receipts are the row counts, the storage, and the idempotency, which don't vary.) The pipeline, the scaling series, and full metrics are public.
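The headline ratios can be re-derived from the figures quoted above. The helper `refresh_ratios` is a hypothetical name used only for this check, and the incremental runtime is the rounded ~0.12 s, so the result is approximate.

```python
def refresh_ratios(partition_rows: int, lake_rows: int,
                   full_rebuild_s: float, incremental_s: float) -> dict:
    """Row share, row ratio, and runtime ratio for a single-partition refresh."""
    return {
        "row_share": partition_rows / lake_rows,         # fraction of the lake touched
        "row_ratio": lake_rows / partition_rows,         # rows skipped, as a multiple
        "runtime_ratio": full_rebuild_s / incremental_s, # full rebuild vs. refresh
    }


# Figures taken from the Result section; 0.12 s is a rounded value.
ratios = refresh_ratios(4_090_822, 27_485_631, 0.88, 0.12)
print(f"rows touched:  {ratios['row_share']:.1%} of the lake")
print(f"row ratio:     {ratios['row_ratio']:.1f}x")
print(f"runtime ratio: {ratios['runtime_ratio']:.1f}x")
```

The refresh touches roughly 15% of the rows, a row ratio of about 6.7, which sits close to the measured runtime ratio of about 7.3. That agreement is what you would expect if processing time scales roughly with rows read.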

What this costs you

This is an automation build — a pipeline that only reprocesses what changed, receipts included. Automation builds run $1,000–$2,000.

