Reprocessing only what changed
An incremental Parquet lake that refreshes one partition instead of rebuilding everything.
Jul 5, 2026
Problem
Many pipelines reprocess everything on every run: a full reload every night, whether the data changed or not. On a growing dataset, that is a bill that only goes up. The fix is boring and proven: partition the data, and touch only the partitions that changed.
What I built
An incremental Parquet lake over NYC TLC yellow-taxi trips, partitioned by pickup year and month. Each monthly load is idempotent: it rewrites that one partition and never appends, so re-running is safe. Adding a new month touches a single partition, while a full rebuild reprocesses every month. Both paths run the same transform, so the comparison is fair.
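A minimal sketch of the rewrite-one-partition pattern, not the pipeline's actual code. To stay dependency-free it writes JSON Lines files as a stand-in for Parquet; the helper names (`partition_dir`, `load_month`, `count_rows`), the `year=/month=` directory layout, and the `trip_id` field are all assumptions made for illustration.

```python
import json
import shutil
import tempfile
from pathlib import Path


def partition_dir(lake: Path, year: int, month: int) -> Path:
    # Assumed Hive-style layout: <lake>/year=2024/month=03
    return lake / f"year={year}" / f"month={month:02d}"


def load_month(lake: Path, year: int, month: int, rows: list[dict]) -> None:
    """Replace exactly one partition with `rows`; never append to it."""
    target = partition_dir(lake, year, month)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stage the new files beside the target so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=target.parent))
    with open(staging / "part-0.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Drop the old copy of this partition only, then move the new one into place.
    if target.exists():
        shutil.rmtree(target)
    staging.rename(target)


def count_rows(lake: Path) -> int:
    """Count rows across every partition in the lake."""
    total = 0
    for part in lake.glob("year=*/month=*/*.jsonl"):
        with open(part, encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total


if __name__ == "__main__":
    lake = Path(tempfile.mkdtemp())
    january = [{"trip_id": i} for i in range(5)]
    load_month(lake, 2024, 1, january)
    before = count_rows(lake)
    load_month(lake, 2024, 1, january)  # re-run the same month
    print(count_rows(lake) - before)    # net rows added by the re-run
```

Because the load replaces the partition directory instead of adding files to it, running the same month twice leaves the row count unchanged, and no other month's files are read or written.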
Result
Each refresh reprocesses only the partition that changed: 4,090,822 rows of the 27,485,631 in the lake (7 months of NYC taxi trips, 437.3 MB of partitioned Parquet), not the whole history. That's the point: a full reload's cost grows with everything you've ever loaded, while an incremental refresh tracks only what moved. Measured across a growing lake, the full rebuild climbed to 0.88 s at 7 months while the incremental refresh stayed flat at ~0.12 s, about 7.3× less processing time today, and the gap widens as history accumulates. Re-running a month added 0 net rows, so idempotency is measured, not just asserted. (Runtimes are processing time on one machine, median of 5 runs, and small at this scale; the durable receipts are the row counts, the storage, and the idempotency, which don't vary.) The pipeline, the scaling series, and full metrics are public.
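The ratios behind those figures can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted above; the variable names are illustrative, and because the incremental runtime is reported as a rounded "~0.12 s", the runtime ratio computed here is approximate.

```python
# Figures quoted in the post.
LAKE_ROWS = 27_485_631       # rows across all 7 months
REFRESHED_ROWS = 4_090_822   # rows in the one partition that changed
FULL_REBUILD_S = 0.88        # median full-rebuild runtime at 7 months
INCREMENTAL_S = 0.12         # reported as "~0.12 s", so already rounded

row_share = REFRESHED_ROWS / LAKE_ROWS          # fraction of the lake touched
row_ratio = LAKE_ROWS / REFRESHED_ROWS          # full rows per refreshed row
runtime_ratio = FULL_REBUILD_S / INCREMENTAL_S  # full runtime per incremental runtime

print(f"{row_share:.1%}")       # 14.9%
print(f"{row_ratio:.1f}x")      # 6.7x
print(f"{runtime_ratio:.1f}x")  # 7.3x
```

The refresh touches roughly one seventh of the rows, which is what a one-month-out-of-seven partition should look like, and the runtime ratio lands in the same range as the row ratio.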
What this costs you
This is an automation build — a pipeline that only reprocesses what changed, receipts included. Automation builds run $1,000–$2,000.
Buy this build: $1,000–$2,000, 3–5 days. Work with freddyxai →
Read the full writeup → Stop rebuilding the whole pipeline