Stop rebuilding the whole pipeline
A full reload is the default because it's simple: throw everything away, rebuild from scratch, never worry about state. It's also the reason pipeline bills grow with the data instead of with the changes. Here's the incremental alternative, measured on NYC taxi data.
Why full reloads quietly get expensive
Reprocessing every row every run means your cost scales with total history, not with what actually changed. Last month's data didn't move — but you paid to process it again anyway.
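To make the scaling concrete, here is a minimal sketch that counts how many month-partitions each strategy processes over time. It assumes one new month arrives per run and that earlier months never change; the function name `months_processed` is hypothetical and only for illustration.

```python
def months_processed(total_months: int) -> tuple[int, int]:
    """Count month-partitions processed across `total_months` runs.

    Assumption (not from the measurements): one new month arrives per run
    and earlier months never change.
    Returns (full_reload_total, incremental_total).
    """
    # Run k of a full reload reprocesses all k months loaded so far.
    full_reload = sum(range(1, total_months + 1))
    # Run k of an incremental refresh processes only the one new month.
    incremental = total_months
    return full_reload, incremental


full, incremental = months_processed(7)
print(full, incremental)
```

Under those assumptions, seven monthly runs cost 28 partition loads with full reloads and 7 with incremental refreshes. The first total grows quadratically with history, the second linearly.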
Partition + idempotency is the whole trick
Partition the data by a natural key (here, pickup year/month). Make each partition load idempotent: rewrite the whole partition, never append, so a re-run is safe and can't double-count. Now "refresh" means "touch only the partitions that changed," and re-running a partition leaves the row counts exactly as they were.
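The pattern can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not the pipeline behind the measurements: the `year=/month=` directory layout, the `part.csv` file name, the two-column schema, the sample rows (including the year 2024), and the helper names are all hypothetical.

```python
import csv
import os
import tempfile
from pathlib import Path


def partition_path(lake: Path, year: int, month: int) -> Path:
    """Directory for one pickup year/month partition (hypothetical layout)."""
    return lake / f"year={year:04d}" / f"month={month:02d}"


def load_partition(lake: Path, year: int, month: int, rows: list[dict]) -> None:
    """Rewrite one partition in full. It never appends, so a re-run cannot double-count."""
    target_dir = partition_path(lake, year, month)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part.csv"
    # Write to a temp file in the same directory, then swap it into place,
    # so readers never see a half-written partition.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["pickup_datetime", "fare"])
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_name, target)


def count_rows(lake: Path) -> int:
    """Total data rows across every partition in the lake."""
    total = 0
    for part in lake.glob("year=*/month=*/part.csv"):
        with part.open(newline="") as f:
            total += sum(1 for _ in csv.DictReader(f))
    return total


def demo() -> tuple[int, int]:
    """Load one month twice; return the lake row count after each load."""
    with tempfile.TemporaryDirectory() as d:
        lake = Path(d)
        # Made-up sample rows, for illustration only.
        jan = [
            {"pickup_datetime": "2024-01-03T08:15:00", "fare": "12.50"},
            {"pickup_datetime": "2024-01-19T22:40:00", "fare": "31.00"},
        ]
        load_partition(lake, 2024, 1, jan)
        first = count_rows(lake)
        load_partition(lake, 2024, 1, jan)  # re-run the same month
        second = count_rows(lake)
        return first, second
```

Calling `demo()` returns the same count after both loads, which is the "re-run adds 0 rows" check in miniature. The design choice that matters is replacing the partition instead of appending to it; the temp-file swap is one way to keep that replacement from exposing partial data.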
What the numbers say
Each refresh reprocesses one partition (4,090,822 of the 27,485,631 rows in the lake, about 15%), not the whole history. Rebuilding all 7 months of NYC taxi data took 0.88 s of processing; an incremental refresh stayed at ~0.12 s no matter how many months were already loaded. That flat-vs-rising gap, 7.3× in processing time at 7 months and wider with every month added, is the whole point: a full reload's cost scales with total history, an incremental refresh's with change. Re-running a month added 0 rows: idempotency, proven. (Processing times are from one machine, median of 5 runs. They are small at this scale and the wrong thing to fixate on; the row counts and the 0-row re-run are the deterministic receipts.)
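The ratios follow directly from the reported figures. A short check, using only the numbers quoted above (the variable names are illustrative):

```python
rows_refreshed = 4_090_822    # rows in the one partition a refresh touches
rows_in_lake = 27_485_631     # rows across all 7 months
full_seconds = 0.88           # full rebuild, median of 5 runs
incremental_seconds = 0.12    # incremental refresh, median of 5 runs

row_ratio = rows_in_lake / rows_refreshed          # how many times fewer rows a refresh reads
time_ratio = full_seconds / incremental_seconds    # how many times less processing time
share = rows_refreshed / rows_in_lake              # fraction of the lake touched per refresh

print(f"{row_ratio:.1f}x fewer rows, {time_ratio:.1f}x less time, {share:.1%} of the lake")
```

The row ratio (about 6.7×) and the time ratio (about 7.3×) are close but not the same number: the first is deterministic, the second depends on the machine.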
Key takeaways
- Full-reload cost tracks history; incremental cost tracks change — the gap widens as you grow.
- Partition + idempotent writes — refresh only what moved; re-runs are safe and can't double-count.
- 4,090,822 of 27,485,631 rows reprocessed per refresh (about 6.7× fewer rows and 7.3× less processing time today), measured across a growing lake.
- Prove idempotency — a re-run that adds 0 rows is the receipt that matters.
Keep reading: the full case study.