Snowflake ETL/ELT pipelines to BigQuery
Re-home incremental loads, SCD patterns, and semi-structured transforms from Snowflake into BigQuery without breaking scheduling, idempotency, or downstream KPIs.
- Input: Snowflake ETL / pipeline migration logic
- Output: Validated BigQuery equivalent
- Common pitfalls: implicit watermarks, non-deterministic dedupe, MERGE scan blowups
Why this breaks
Snowflake pipelines usually encode business correctness across more than SQL: incremental boundaries, late-arrival behavior, dedupe rules, and orchestration retries. A migration that only “translates queries” often preserves shape but not outcomes.
Common symptoms after cutover:
- Duplicates or missing updates because watermarks and tie-breakers were implicit
- SCD dimensions drift during backfills or late updates
- Reprocessing semantics change (append-only in BigQuery vs implicit re-run windows in Snowflake tasks/jobs)
- Costs spike because transformations stop benefiting from pruning-aware staging
- Operational fragility: schedules, dependencies, and retries no longer match the original run contract
How conversion works
- Inventory & classify pipelines, tasks, and dependencies (dbt models, Snowflake Tasks/Streams, stored procedures, external orchestrators).
- Extract the correctness contract: watermarks, keys, dedupe rule, late-arrival window, SCD strategy, and failure/retry semantics.
- Re-home execution to BigQuery-native patterns (staging, partitioning, MERGE/apply strategy) + orchestration (Composer/Airflow, dbt, or your runner).
- Implement data contracts: typed staging, schema evolution policy (widen/quarantine/reject), and audit columns.
- Gate cutover with evidence: golden dataset parity, incremental integrity tests (reruns + late injections), and rollback-ready criteria.
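The "extract the correctness contract" step hinges on persisting the watermark instead of inferring it. A minimal per-run sketch, assuming an illustrative one-row-per-pipeline table `proj.ops.watermarks` and landing table `proj.raw.orders_landing` (all names and the 6-hour late window are placeholders, not a prescribed schema):

```sql
-- Read the persisted boundary instead of trusting CURRENT_TIMESTAMP()
-- or the scheduler's logical time. Assumes `proj.ops.watermarks`
-- (pipeline STRING, high_water_mark TIMESTAMP, updated_at TIMESTAMP).
DECLARE hwm TIMESTAMP DEFAULT (
  SELECT high_water_mark
  FROM `proj.ops.watermarks`
  WHERE pipeline = 'orders_pipeline'
);

-- Select rows past the mark, minus an explicit late-arrival window:
-- a bounded, auditable overlap rather than a wall-clock assumption.
CREATE TEMP TABLE batch AS
SELECT *
FROM `proj.raw.orders_landing`
WHERE event_ts > TIMESTAMP_SUB(hwm, INTERVAL 6 HOUR);

-- ... dedupe + apply ...

-- Advance the mark only after a successful apply, and only from
-- data actually processed in this run.
UPDATE `proj.ops.watermarks`
SET high_water_mark = (SELECT MAX(event_ts) FROM batch),
    updated_at = CURRENT_TIMESTAMP()
WHERE pipeline = 'orders_pipeline';
```

Because the overlap window is re-read every run, the downstream apply must be idempotent (deduped and merged on stable keys) for this pattern to be safe.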
Supported constructs
Representative pipeline constructs we commonly migrate from Snowflake to BigQuery (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Snowflake Tasks + schedules | Composer/Airflow or dbt scheduling (explicit DAG dependencies) | Dependencies, retries, and concurrency modeled explicitly. |
| Streams + incremental capture patterns | CDC landing + typed staging + apply windows | High-water mark + late window policy made auditable. |
| MERGE-based upserts | BigQuery MERGE with pruning-aware staging | Avoid full-target scans via partition boundaries and staged apply. |
| SCD Type-1 / Type-2 logic | MERGE + end-date/current-flag patterns | Backfill + late updates tested as first-class scenarios. |
| VARIANT transforms in ELT | JSON extraction + explicit casting in BigQuery | Type intent and null handling enforced at extraction boundary. |
| Transient/staging schemas | Dataset staging + partitioned temp tables | Staging boundaries designed for pruning and cost predictability. |
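The SCD Type-2 row in the table above typically lands as a two-statement apply. A sketch under assumed names (`proj.mart.dim_customer` with `customer_id`, `status`, `valid_from`, `valid_to`, `is_current`, and a staged change set `stg_changes`):

```sql
-- Step 1: end-date the current row for keys whose tracked attributes changed.
MERGE `proj.mart.dim_customer` d
USING stg_changes s
ON d.customer_id = s.customer_id AND d.is_current
WHEN MATCHED AND d.status != s.status THEN
  UPDATE SET valid_to = s.event_ts, is_current = FALSE;

-- Step 2: insert a new current version for changed or brand-new keys
-- (after step 1, changed keys no longer have a current row).
INSERT INTO `proj.mart.dim_customer`
  (customer_id, status, valid_from, valid_to, is_current)
SELECT s.customer_id, s.status, s.event_ts, TIMESTAMP '9999-12-31', TRUE
FROM stg_changes s
LEFT JOIN `proj.mart.dim_customer` d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;
```

This is exactly the logic the "backfill + late updates tested as first-class scenarios" note refers to: replaying a historical window through these two statements must not produce overlapping validity intervals.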
How workload changes
| Topic | Snowflake | BigQuery |
|---|---|---|
| Incremental correctness | Often implicit in Tasks/Streams + timing assumptions | Explicit high-water mark + late-arrival windows + staged apply |
| Upserts / MERGE | MERGE performance relies on Snowflake clustering/micro-partitions | MERGE requires partition-aware staging and pruning strategies |
| Orchestration | Snowflake Tasks + external schedulers | Composer/dbt/native tooling with explicit DAG contracts |
| Cost model | Warehouse credits + micro-partition effects | Bytes scanned + slot time + reservation strategy |
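Because the BigQuery cost model is bytes scanned plus slot time, the cost column above can be monitored directly from job metadata. A sketch against `INFORMATION_SCHEMA.JOBS_BY_PROJECT` (the region qualifier and the `dim_customer` filter are assumptions to adapt):

```sql
-- Surface the most expensive recent jobs touching a given pipeline
-- so scan-bytes and slot-time thresholds can be set per pipeline.
SELECT
  job_id,
  total_bytes_processed / POW(2, 30) AS gib_scanned,
  total_slot_ms / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND query LIKE '%dim_customer%'
ORDER BY total_bytes_processed DESC
LIMIT 20;
```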
Examples
Canonical BigQuery apply pattern for incremental upserts: stage -> dedupe deterministically -> MERGE with scoped partitions. Adjust keys, partitions, and casts to your model.
```sql
-- BigQuery: stage and type incoming batch
-- Landing: `proj.raw.debezium_events` (batch_id STRING, payload JSON, ingested_at TIMESTAMP)
CREATE TEMP TABLE stg_typed AS
SELECT
  batch_id,
  op,
  row_img,
  event_ts,
  src_lsn,
  ingested_at,
  JSON_VALUE(row_img, '$.id') AS id,
  JSON_VALUE(row_img, '$.status') AS status,
  CAST(JSON_VALUE(row_img, '$.amount') AS NUMERIC) AS amount
FROM (
  SELECT
    batch_id,
    JSON_VALUE(payload, '$.op') AS op,
    -- choose row image based on op: deletes carry it in `before`,
    -- inserts/updates in `after`
    IF(JSON_VALUE(payload, '$.op') = 'd',
       JSON_QUERY(payload, '$.before'),
       JSON_QUERY(payload, '$.after')) AS row_img,
    TIMESTAMP_MILLIS(CAST(JSON_VALUE(payload, '$.source.ts_ms') AS INT64)) AS event_ts,
    SAFE_CAST(JSON_VALUE(payload, '$.source.lsn') AS INT64) AS src_lsn,
    ingested_at
  FROM `proj.raw.debezium_events`
  WHERE batch_id = @batch_id
);
```

Common pitfalls
- Implicit watermarks: relying on CURRENT_TIMESTAMP() or task schedule time instead of persisted high-water marks.
- Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers (event time + source offset + ingestion time).
- MERGE scan blowups: applying MERGE without pruning boundaries or partition-aware staging.
- SCD drift: end-dating/current-flag logic not tested under backfills and late-arrival updates.
- Schema evolution surprises: upstream CDC widens types; downstream typed tables fail silently or truncate.
- Orchestration mismatch: retries, concurrency, and dependency ordering differ; pipeline becomes “eventually wrong.”
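Continuing the staged batch above, a deterministic dedupe plus a partition-scoped MERGE addresses the first three pitfalls directly. Target table `proj.mart.orders` (partitioned on `DATE(event_ts)`) is an assumption; keys and columns follow the `stg_typed` example:

```sql
-- Declarations must lead a BigQuery script.
DECLARE min_d DATE;
DECLARE max_d DATE;

-- Deterministic dedupe: stable tie-breakers, newest change wins.
CREATE TEMP TABLE stg_dedup AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY event_ts DESC, src_lsn DESC, ingested_at DESC
    ) AS rn
  FROM stg_typed
)
WHERE rn = 1;

SET (min_d, max_d) = (
  SELECT AS STRUCT MIN(DATE(event_ts)), MAX(DATE(event_ts)) FROM stg_dedup
);

-- Scope the MERGE to affected partitions instead of scanning the whole target.
-- Note: BigQuery prunes most reliably on literal date predicates, so consider
-- templating these bounds in from the orchestrator.
MERGE `proj.mart.orders` t
USING stg_dedup s
ON t.id = s.id
  AND DATE(t.event_ts) BETWEEN min_d AND max_d  -- pruning boundary
WHEN MATCHED AND s.op = 'd' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET status = s.status, amount = s.amount, event_ts = s.event_ts
WHEN NOT MATCHED AND s.op != 'd' THEN
  INSERT (id, status, amount, event_ts)
  VALUES (s.id, s.status, s.amount, s.event_ts);
```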
Validation approach
- Compile + execute checks: BigQuery jobs run reliably under representative volumes and concurrency.
- Golden parity: key facts/dims match on KPI aggregates and agreed checksums for defined windows.
- Incremental integrity:
- Idempotency: re-run same micro-batch -> no net change
- Late-arrival: inject late events -> only expected rows change
- Backfill safety: run historical windows -> SCD + dedupe remains stable
- Cost/performance gates: partition pruning verified; scan bytes + slot time thresholds set for top pipelines.
- Operational readiness: retry simulations, alerting, and rollback criteria validated before cutover.
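The golden-parity gate can be expressed as one query that compares KPI aggregates and a row checksum for an agreed window. A sketch assuming an exported golden copy `proj.audit.sf_orders_golden` and window parameters `@win_start`/`@win_end`:

```sql
-- Both sides should agree on row count, KPI aggregate, and an
-- order-independent checksum before cutover.
SELECT
  'bq' AS side,
  COUNT(*) AS row_count,
  SUM(amount) AS total_amount,
  BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS row_checksum
FROM `proj.mart.orders` t
WHERE DATE(event_ts) BETWEEN @win_start AND @win_end
UNION ALL
SELECT
  'golden',
  COUNT(*),
  SUM(amount),
  BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(g)))
FROM `proj.audit.sf_orders_golden` g
WHERE DATE(event_ts) BETWEEN @win_start AND @win_end;
```

Re-running the same micro-batch (the idempotency test) should leave both rows of this result unchanged.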
Migration steps
- 01
Inventory pipelines, dependencies, and run contracts
Extract pipeline graph: tasks/jobs, schedules, upstream/downstream dependencies, SLAs, retries, and concurrency limits. Identify critical marts and consumers.
- 02
Formalize incremental semantics
Define high-water marks, keys, dedupe tie-breakers, late-arrival window policy, and backfill boundaries. Record as an auditable contract (not tribal knowledge).
- 03
Rebuild transformations on BigQuery-native staging
Implement landing -> typed staging -> dedupe -> apply. Choose partitioning/clustering for pruning and stable costs. Encode schema evolution policy (widen/quarantine/reject).
- 04
Re-home orchestration
Implement DAGs in Composer/Airflow or your orchestrator: dependencies, retries, alerting, and idempotency behavior. Confirm schedule alignment and failure handling.
- 05
Run parity and incremental integrity gates
Golden outputs + KPI aggregates, idempotency reruns, late-data injections, and backfill simulations. Cutover only when evidence meets thresholds and rollback criteria are defined.
We inventory your Snowflake pipelines, formalize incremental semantics, migrate a representative pipeline end-to-end, and produce parity evidence plus a cutover plan with rollback criteria.
Get an actionable plan with incremental integrity tests, reconciliation evidence, and cost/performance baselines, so your pipeline cutover is gated by proof, not optimism.