Redshift ETL pipelines to Snowflake
Re-home incremental ETL—staging, dedupe, UPSERT patterns, and orchestration—from Redshift into Snowflake with an explicit run contract and validation gates that prevent KPI drift and credit spikes.
- Input
- Redshift ETL / pipeline migration logic
- Output
- Snowflake equivalent (validated)
Why this breaks
Redshift ETL systems often hide correctness in operational conventions: watermark tables, VACUUM/ANALYZE-driven assumptions, and UPSERT patterns implemented via staging + delete/insert. Snowflake can implement equivalent outcomes—but the run contract must be explicit: windows, keys, deterministic ordering, dedupe rules, and restart semantics.
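Making the watermark part of that contract explicit can be as small as one persisted control table plus a bounded read. A minimal sketch, assuming a hypothetical CONTROL.HIGH_WATER_MARKS table and an orders feed with an `updated_at` column:

```sql
-- Hypothetical per-job high-water-mark table (names are illustrative)
CREATE TABLE IF NOT EXISTS CONTROL.HIGH_WATER_MARKS (
    job_name   STRING        NOT NULL,
    hwm_value  TIMESTAMP_NTZ NOT NULL,
    updated_at TIMESTAMP_NTZ NOT NULL,
    PRIMARY KEY (job_name)
);

-- Read a bounded window anchored on the persisted mark,
-- never on job runtime or CURRENT_TIMESTAMP
SELECT s.*
FROM STAGING.ORDERS s
WHERE s.updated_at >= (SELECT hwm_value
                       FROM CONTROL.HIGH_WATER_MARKS
                       WHERE job_name = 'orders_load')
  AND s.updated_at <  DATEADD('hour', 1,
                       (SELECT hwm_value
                        FROM CONTROL.HIGH_WATER_MARKS
                        WHERE job_name = 'orders_load'));
```

Advancing `hwm_value` only after a successful, recorded apply is what makes reruns safe: a retry re-reads the same window instead of a new, implicit one.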
Common symptoms after cutover:
- Duplicates or missing updates because watermarks and tie-breakers were implicit
- DELETE+INSERT UPSERT logic drifts under retries and partial failures
- Late-arrival behavior changes because reprocessing windows weren’t defined
- Credit spikes because MERGE/apply touches too much history (full-target scans)
- Orchestration differences change retries/dependencies, turning failures into silent data issues
How conversion works
- Inventory & classify ETL jobs, schedules, dependencies, and operational controls (watermarks, control tables, retries).
- Extract the run contract: business keys, incremental boundaries, dedupe tie-breakers, late-arrival policy, and failure/restart semantics.
- Re-home transformations to Snowflake staging (landing → typed staging → dedupe → apply) with bounded MERGE scopes.
- Implement restartability: applied-window/batch tracking, idempotency markers, deterministic ordering, and safe retries.
- Re-home orchestration (Airflow/dbt/your runner) with explicit DAG dependencies, retries, alerts, and warehouse isolation for batch vs BI.
- Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late injections) and rollback-ready criteria.
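The staging → dedupe → apply steps above can be sketched as one bounded MERGE; table names, key columns, and the `:window_*` bind-variable placeholders are illustrative, not a prescribed schema:

```sql
MERGE INTO CORE.ORDERS t
USING (
    SELECT order_id, amount, status, updated_at
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (
                   PARTITION BY s.order_id
                   -- stable tie-breakers: latest timestamp, then source sequence
                   ORDER BY s.updated_at DESC, s.source_seq DESC
               ) AS rn
        FROM STAGING.ORDERS s
        WHERE s.updated_at >= :window_start   -- bounded apply scope:
          AND s.updated_at <  :window_end     -- never scan full target history
    )
    WHERE rn = 1
) src
ON t.order_id = src.order_id
WHEN MATCHED AND src.updated_at >= t.updated_at THEN
    UPDATE SET amount = src.amount, status = src.status, updated_at = src.updated_at
WHEN NOT MATCHED THEN
    INSERT (order_id, amount, status, updated_at)
    VALUES (src.order_id, src.amount, src.status, src.updated_at);
```

The `src.updated_at >= t.updated_at` guard on the matched branch is what keeps a retried window from regressing rows that a later window already updated.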
Supported constructs
Representative Redshift ETL constructs we commonly migrate to Snowflake (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Staging + DELETE/INSERT UPSERT patterns | Snowflake MERGE with bounded apply windows | Avoid full scans and ensure idempotency under retries. |
| Watermark/control tables | Explicit high-water marks + applied-window tracking | Restartable and auditable incremental behavior. |
| ROW_NUMBER-based dedupe patterns | Deterministic dedupe with explicit tie-breakers | Prevents nondeterministic drift under retries. |
| SCD Type-1 / Type-2 logic | MERGE + current-flag/end-date patterns | Backfills and late updates validated as first-class scenarios. |
| Copy/load ingestion chains | Landing tables + batch manifests | Ingestion made explicit; replayability supported. |
| Scheduler-driven chains | Airflow/dbt orchestration with warehouse isolation | Retries, concurrency, and alerts modeled and monitored. |
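For the SCD Type-2 row in the table above, a common close-out/insert pattern looks like the following sketch; column names are hypothetical and change detection is assumed to go through a precomputed attribute hash:

```sql
-- Close out current versions whose tracked attributes changed
UPDATE DIM.CUSTOMER d
SET is_current = FALSE,
    end_date   = s.effective_date
FROM STAGING.CUSTOMER s
WHERE d.customer_id = s.customer_id
  AND d.is_current
  AND d.attr_hash <> s.attr_hash;

-- Open new current versions for changed or brand-new keys
INSERT INTO DIM.CUSTOMER (customer_id, name, attr_hash, start_date, end_date, is_current)
SELECT s.customer_id, s.name, s.attr_hash, s.effective_date, NULL, TRUE
FROM STAGING.CUSTOMER s
LEFT JOIN DIM.CUSTOMER d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL          -- brand-new key
   OR d.attr_hash <> s.attr_hash;    -- changed attributes
```

Running both statements in one transaction, ordered close-out first, keeps a mid-run failure from leaving a key with zero or two current rows.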
How workload changes
| Topic | Redshift | Snowflake |
|---|---|---|
| Incremental correctness | Often encoded in staging + delete/insert conventions and job timing | Explicit high-water marks + deterministic staging + integrity gates |
| Upserts | DELETE+INSERT common for UPSERT semantics | MERGE with bounded apply windows |
| Cost predictability | Cluster utilization and WLM behavior | Warehouse credits + pruning effectiveness |
| Orchestration | Schedulers + script chains | Airflow/dbt with explicit DAG contracts |
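The warehouse-isolation posture in the table can be as simple as separate compute for batch and BI; sizes and names here are illustrative, not a recommendation:

```sql
-- Batch ETL gets its own credits and queue; retries can't starve dashboards
CREATE WAREHOUSE IF NOT EXISTS ETL_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 60        -- seconds idle before suspending (limits spend)
  AUTO_RESUME    = TRUE;

CREATE WAREHOUSE IF NOT EXISTS BI_WH
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Point batch sessions at the ETL warehouse explicitly
USE WAREHOUSE ETL_WH;
```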
Examples
Canonical Snowflake incremental apply pattern for Redshift-style pipelines: stage → dedupe deterministically → bounded MERGE + applied-batch tracking. Adjust keys, offsets, and casts to your model.
```sql
-- Applied-batch tracking (restartability)
CREATE TABLE IF NOT EXISTS CONTROL.APPLIED_BATCHES (
    job_name   STRING        NOT NULL,
    batch_id   STRING        NOT NULL,
    applied_at TIMESTAMP_NTZ NOT NULL,
    PRIMARY KEY (job_name, batch_id)
);
```
Common pitfalls
- DELETE+INSERT drift: partial failures and retries double-apply or delete wrong rows without idempotency guards.
- Implicit watermarks: using job runtime or CURRENT_TIMESTAMP instead of persisted high-water marks.
- Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers causes drift under retries.
- Full-target MERGE: missing apply boundaries causes large scans and credit spikes.
- Schema evolution surprises: upstream types widen/change; typed targets break without a drift policy.
- No warehouse isolation: BI refresh shares warehouse with ETL; tail latency and spend spikes follow.
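The idempotency guards these pitfalls call for can lean on the CONTROL.APPLIED_BATCHES table from the Examples section: check for the batch, then apply and record it in one transaction. A sketch, with the job name and `:batch_id` bind placeholder illustrative:

```sql
-- Guard: has this batch already been applied?
SELECT COUNT(*) AS already_applied
FROM CONTROL.APPLIED_BATCHES
WHERE job_name = 'orders_load' AND batch_id = :batch_id;

-- If zero, apply and record atomically; a retry after a mid-run
-- failure re-enters here and the guard sees no recorded batch
BEGIN;
-- ... bounded MERGE for this batch's window goes here ...
INSERT INTO CONTROL.APPLIED_BATCHES (job_name, batch_id, applied_at)
VALUES ('orders_load', :batch_id, CURRENT_TIMESTAMP());
COMMIT;
```

Because the MERGE and the tracking insert commit together, a batch is either fully applied and recorded or neither, which is exactly what makes blind retries safe.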
Validation approach
- Execution checks: pipelines run reliably under representative volumes and schedules.
- Structural parity: window-level row counts and column profiles (null/min/max/distinct) for key tables.
- KPI parity: aggregates by key dimensions for critical marts and dashboards.
- Incremental integrity (mandatory):
- Idempotency: rerun same window/batch → no net change
- Restart simulation: fail mid-run → resume → correct final state
- Backfill: historical windows replay without drift
- Late-arrival: inject late corrections → only expected rows change
- Dedupe stability: duplicates eliminated consistently under retries
- Cost/performance gates: bounded MERGE scope verified; credit/runtime thresholds set for top jobs.
- Operational readiness: retry/alerting tests, canary gates, and rollback criteria defined before cutover.
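The idempotency gate above can be expressed as a query over before/after snapshots; the snapshot names and column list are illustrative:

```sql
-- Snapshot the target before the rerun (zero-copy clone keeps this cheap)
CREATE OR REPLACE TABLE QA.ORDERS_BEFORE CLONE CORE.ORDERS;

-- ... rerun the same window/batch ...

-- Idempotency: the symmetric difference after the rerun must be empty
(SELECT order_id, amount, status, updated_at FROM QA.ORDERS_BEFORE
 MINUS
 SELECT order_id, amount, status, updated_at FROM CORE.ORDERS)
UNION ALL
(SELECT order_id, amount, status, updated_at FROM CORE.ORDERS
 MINUS
 SELECT order_id, amount, status, updated_at FROM QA.ORDERS_BEFORE);
```

Zero rows back means the rerun produced no net change; any row that does come back identifies the drifting key directly.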
Migration steps
1. Inventory ETL jobs, schedules, and dependencies
   Extract job chains, upstream/downstream dependencies, SLAs, retry policies, and control-table conventions. Identify business-critical marts and consumers.
2. Formalize the run contract
   Define load windows/high-water marks, business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival policy, and restart semantics. Make idempotency explicit.
3. Rebuild staging and apply on Snowflake
   Implement landing → typed staging → dedupe → bounded MERGE apply. Define a schema evolution policy (widen/quarantine/reject) and explicit DQ checks.
4. Re-home orchestration and warehouse posture
   Implement DAGs with retries/alerts and isolate batch warehouses from BI. Add applied-batch tracking and failure handling.
5. Run parity and integrity gates
   Golden outputs + KPI aggregates, idempotency reruns, restart simulations, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
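The late-data injection in the final step can be scripted as: snapshot, inject a backdated correction, rerun the affected window, then assert that only the injected key changed. All names and the sample values below are illustrative:

```sql
-- Snapshot before the test
CREATE OR REPLACE TABLE QA.ORDERS_BEFORE_LATE CLONE CORE.ORDERS;

-- Inject a correction that arrives three days late
INSERT INTO LANDING.ORDERS (order_id, amount, status, updated_at)
VALUES (1001, 99.50, 'corrected', DATEADD('day', -3, CURRENT_TIMESTAMP()));

-- ... rerun the apply for the window covering the backdated timestamp ...

-- Only the injected key should differ between snapshots
SELECT COALESCE(b.order_id, a.order_id) AS changed_key
FROM QA.ORDERS_BEFORE_LATE b
FULL OUTER JOIN CORE.ORDERS a
  ON  b.order_id   = a.order_id
  AND b.amount     = a.amount
  AND b.status     = a.status
  AND b.updated_at = a.updated_at
WHERE b.order_id IS NULL OR a.order_id IS NULL;
```

If any key other than the injected one appears, the reprocessing window is touching rows the late-arrival policy says it should not.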
We inventory your Redshift ETL estate, formalize the run contract (windows, restartability, SCD), migrate a representative pipeline end-to-end, and deliver parity evidence with cutover gates.
Get an actionable migration plan with restart simulations, incremental integrity tests, reconciliation evidence, and credit baselines—so ETL cutover is controlled and dispute-proof.