Snowflake ETL/ELT pipelines to BigQuery
Re-home incremental loads, SCD patterns, and semi-structured transforms from Snowflake into BigQuery without breaking scheduling, idempotency, or downstream KPIs.
- Input: Snowflake ETL / pipeline migration logic
- Output: Validated BigQuery equivalent
- Common pitfalls: implicit watermarks, non-deterministic dedupe, MERGE scan blowups
Why this breaks
Snowflake pipelines usually encode business correctness across more than SQL: incremental boundaries, late-arrival behavior, dedupe rules, and orchestration retries. A migration that only “translates queries” often preserves shape but not outcomes.
Common symptoms after cutover:
- Duplicates or missing updates because watermarks and tie-breakers were implicit
- SCD dimensions drift during backfills or late updates
- Reprocessing semantics change (append-only in BigQuery vs implicit re-run windows in Snowflake tasks/jobs)
- Costs spike because transformations stop benefiting from pruning-aware staging
- Operational fragility: schedules, dependencies, and retries no longer match the original run contract
How conversion works
- Inventory & classify pipelines, tasks, and dependencies (dbt models, Snowflake Tasks/Streams, stored procedures, external orchestrators).
- Extract the correctness contract: watermarks, keys, dedupe rule, late-arrival window, SCD strategy, and failure/retry semantics.
- Re-home execution to BigQuery-native patterns (staging, partitioning, MERGE/apply strategy) + orchestration (Composer/Airflow, dbt, or your runner).
- Implement data contracts: typed staging, schema evolution policy (widen/quarantine/reject), and audit columns.
- Gate cutover with evidence: golden dataset parity, incremental integrity tests (reruns + late injections), and rollback-ready criteria.
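The "extract the correctness contract" step hinges on persisting the watermark instead of inferring it. A minimal per-run sketch, assuming an illustrative one-row-per-pipeline table `proj.ops.watermarks` and landing table `proj.raw.orders_landing` (all names and the 6-hour late window are placeholders, not a prescribed schema):

```sql
-- Read the persisted boundary instead of trusting CURRENT_TIMESTAMP()
-- or the scheduler's logical time. Assumes `proj.ops.watermarks`
-- (pipeline STRING, high_water_mark TIMESTAMP, updated_at TIMESTAMP).
DECLARE hwm TIMESTAMP DEFAULT (
  SELECT high_water_mark
  FROM `proj.ops.watermarks`
  WHERE pipeline = 'orders_pipeline'
);

-- Select rows past the mark, minus an explicit late-arrival window:
-- a bounded, auditable overlap rather than a wall-clock assumption.
CREATE TEMP TABLE batch AS
SELECT *
FROM `proj.raw.orders_landing`
WHERE event_ts > TIMESTAMP_SUB(hwm, INTERVAL 6 HOUR);

-- ... dedupe + apply ...

-- Advance the mark only after a successful apply, and only from
-- data actually processed in this run.
UPDATE `proj.ops.watermarks`
SET high_water_mark = (SELECT MAX(event_ts) FROM batch),
    updated_at = CURRENT_TIMESTAMP()
WHERE pipeline = 'orders_pipeline';
```

Because the overlap window is re-read every run, the downstream apply must be idempotent (deduped and merged on stable keys) for this pattern to be safe.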
Supported constructs
Representative pipeline constructs we commonly migrate from Snowflake to BigQuery (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Snowflake Tasks + schedules | Composer/Airflow or dbt scheduling (explicit DAG dependencies) | Dependencies, retries, and concurrency modeled explicitly. |
| Streams + incremental capture patterns | CDC landing + typed staging + apply windows | High-water mark + late window policy made auditable. |
| MERGE-based upserts | BigQuery MERGE with pruning-aware staging | Avoid full-target scans via partition boundaries and staged apply. |
| SCD Type-1 / Type-2 logic | MERGE + end-date/current-flag patterns | Backfill + late updates tested as first-class scenarios. |
| VARIANT transforms in ELT | JSON extraction + explicit casting in BigQuery | Type intent and null handling enforced at extraction boundary. |
| Transient/staging schemas | Dataset staging + partitioned temp tables | Staging boundaries designed for pruning and cost predictability. |
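The SCD Type-2 row in the table above typically lands as a two-statement apply. A sketch under assumed names (`proj.mart.dim_customer` with `customer_id`, `status`, `valid_from`, `valid_to`, `is_current`, and a staged change set `stg_changes`):

```sql
-- Step 1: end-date the current row for keys whose tracked attributes changed.
MERGE `proj.mart.dim_customer` d
USING stg_changes s
ON d.customer_id = s.customer_id AND d.is_current
WHEN MATCHED AND d.status != s.status THEN
  UPDATE SET valid_to = s.event_ts, is_current = FALSE;

-- Step 2: insert a new current version for changed or brand-new keys
-- (after step 1, changed keys no longer have a current row).
INSERT INTO `proj.mart.dim_customer`
  (customer_id, status, valid_from, valid_to, is_current)
SELECT s.customer_id, s.status, s.event_ts, TIMESTAMP '9999-12-31', TRUE
FROM stg_changes s
LEFT JOIN `proj.mart.dim_customer` d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;
```

This is exactly the logic the "backfill + late updates tested as first-class scenarios" note refers to: replaying a historical window through these two statements must not produce overlapping validity intervals.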
How workload changes
| Topic | Snowflake | BigQuery |
|---|---|---|
| Incremental correctness | Often implicit in Tasks/Streams + timing assumptions | Explicit high-water mark + late-arrival windows + staged apply |
| Upserts / MERGE | MERGE performance relies on Snowflake clustering/micro-partitions | MERGE requires partition-aware staging and pruning strategies |
| Orchestration | Snowflake Tasks + external schedulers | Composer/dbt/native tooling with explicit DAG contracts |
| Cost model | Warehouse credits + micro-partition effects | Bytes scanned + slot time + reservation strategy |
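Because the BigQuery cost model is bytes scanned plus slot time, the cost column above can be monitored directly from job metadata. A sketch against `INFORMATION_SCHEMA.JOBS_BY_PROJECT` (the region qualifier and the `dim_customer` filter are assumptions to adapt):

```sql
-- Surface the most expensive recent jobs touching a given pipeline
-- so scan-bytes and slot-time thresholds can be set per pipeline.
SELECT
  job_id,
  total_bytes_processed / POW(2, 30) AS gib_scanned,
  total_slot_ms / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND query LIKE '%dim_customer%'
ORDER BY total_bytes_processed DESC
LIMIT 20;
```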
Examples
Canonical BigQuery apply pattern for incremental upserts: stage -> dedupe deterministically -> MERGE with scoped partitions. Adjust keys, partitions, and casts to your model.
```sql
-- BigQuery: stage and type incoming batch
-- Landing: `proj.raw.debezium_events` (batch_id STRING, payload JSON, ingested_at TIMESTAMP)
CREATE TEMP TABLE stg_typed AS
SELECT
  batch_id,
  op,
  row_img,
  event_ts,
  src_lsn,
  ingested_at,
  JSON_VALUE(row_img, '$.id') AS id,
  JSON_VALUE(row_img, '$.status') AS status,
  CAST(JSON_VALUE(row_img, '$.amount') AS NUMERIC) AS amount
FROM (
  SELECT
    batch_id,
    JSON_VALUE(payload, '$.op') AS op,
    -- choose row image based on op: deletes carry it in `before`,
    -- inserts/updates in `after`
    IF(JSON_VALUE(payload, '$.op') = 'd',
       JSON_QUERY(payload, '$.before'),
       JSON_QUERY(payload, '$.after')) AS row_img,
    TIMESTAMP_MILLIS(CAST(JSON_VALUE(payload, '$.source.ts_ms') AS INT64)) AS event_ts,
    SAFE_CAST(JSON_VALUE(payload, '$.source.lsn') AS INT64) AS src_lsn,
    ingested_at
  FROM `proj.raw.debezium_events`
  WHERE batch_id = @batch_id
);
```

Common pitfalls
- Implicit watermarks: relying on CURRENT_TIMESTAMP() or task schedule time instead of persisted high-water marks.
- Non-deterministic dedupe: ROW_NUMBER without stable tie-breakers (event time + source offset + ingestion time).
- MERGE scan blowups: applying MERGE without pruning boundaries or partition-aware staging.
- SCD drift: end-dating/current-flag logic not tested under backfills and late-arrival updates.
- Schema evolution surprises: upstream CDC widens types; downstream typed tables fail silently or truncate.
- Orchestration mismatch: retries, concurrency, and dependency ordering differ; pipeline becomes “eventually wrong.”
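Continuing the staged batch above, a deterministic dedupe plus a partition-scoped MERGE addresses the first three pitfalls directly. Target table `proj.mart.orders` (partitioned on `DATE(event_ts)`) is an assumption; keys and columns follow the `stg_typed` example:

```sql
-- Declarations must lead a BigQuery script.
DECLARE min_d DATE;
DECLARE max_d DATE;

-- Deterministic dedupe: stable tie-breakers, newest change wins.
CREATE TEMP TABLE stg_dedup AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY event_ts DESC, src_lsn DESC, ingested_at DESC
    ) AS rn
  FROM stg_typed
)
WHERE rn = 1;

SET (min_d, max_d) = (
  SELECT AS STRUCT MIN(DATE(event_ts)), MAX(DATE(event_ts)) FROM stg_dedup
);

-- Scope the MERGE to affected partitions instead of scanning the whole target.
-- Note: BigQuery prunes most reliably on literal date predicates, so consider
-- templating these bounds in from the orchestrator.
MERGE `proj.mart.orders` t
USING stg_dedup s
ON t.id = s.id
  AND DATE(t.event_ts) BETWEEN min_d AND max_d  -- pruning boundary
WHEN MATCHED AND s.op = 'd' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET status = s.status, amount = s.amount, event_ts = s.event_ts
WHEN NOT MATCHED AND s.op != 'd' THEN
  INSERT (id, status, amount, event_ts)
  VALUES (s.id, s.status, s.amount, s.event_ts);
```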
Validation approach
- Compile + execute checks: BigQuery jobs run reliably under representative volumes and concurrency.
- Golden parity: key facts/dims match on KPI aggregates and agreed checksums for defined windows.
- Incremental integrity:
- Idempotency: re-run same micro-batch -> no net change
- Late-arrival: inject late events -> only expected rows change
- Backfill safety: run historical windows -> SCD + dedupe remains stable
- Cost/performance gates: partition pruning verified; scan bytes + slot time thresholds set for top pipelines.
- Operational readiness: retry simulations, alerting, and rollback criteria validated before cutover.
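The golden-parity gate can be expressed as one query that compares KPI aggregates and a row checksum for an agreed window. A sketch assuming an exported golden copy `proj.audit.sf_orders_golden` and window parameters `@win_start`/`@win_end`:

```sql
-- Both sides should agree on row count, KPI aggregate, and an
-- order-independent checksum before cutover.
SELECT
  'bq' AS side,
  COUNT(*) AS row_count,
  SUM(amount) AS total_amount,
  BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS row_checksum
FROM `proj.mart.orders` t
WHERE DATE(event_ts) BETWEEN @win_start AND @win_end
UNION ALL
SELECT
  'golden',
  COUNT(*),
  SUM(amount),
  BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(g)))
FROM `proj.audit.sf_orders_golden` g
WHERE DATE(event_ts) BETWEEN @win_start AND @win_end;
```

Re-running the same micro-batch (the idempotency test) should leave both rows of this result unchanged.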
Migration steps
- 01
Inventory pipelines, dependencies, and run contracts
Extract pipeline graph: tasks/jobs, schedules, upstream/downstream dependencies, SLAs, retries, and concurrency limits. Identify critical marts and consumers.
- 02
Formalize incremental semantics
Define high-water marks, keys, dedupe tie-breakers, late-arrival window policy, and backfill boundaries. Record as an auditable contract (not tribal knowledge).
- 03
Rebuild transformations on BigQuery-native staging
Implement landing -> typed staging -> dedupe -> apply. Choose partitioning/clustering for pruning and stable costs. Encode schema evolution policy (widen/quarantine/reject).
- 04
Re-home orchestration
Implement DAGs in Composer/Airflow or your orchestrator: dependencies, retries, alerting, and idempotency behavior. Confirm schedule alignment and failure handling.
- 05
Run parity and incremental integrity gates
Golden outputs + KPI aggregates, idempotency reruns, late-data injections, and backfill simulations. Cutover only when evidence meets thresholds and rollback criteria are defined.
We inventory your Snowflake pipelines, formalize incremental semantics, migrate a representative pipeline end-to-end, and produce parity evidence plus a cutover plan with rollback criteria.
Get an actionable plan with incremental integrity tests, reconciliation evidence, and cost/performance baselines, so your pipeline cutover is gated by proof, not optimism.