Workload

Databricks ETL pipelines to BigQuery

Re-home Delta MERGE/upsert semantics—staging, dedupe tie-breakers, late data policy, and orchestration—into BigQuery with an explicit run contract and validation gates that prevent KPI drift.

At a glance
Input
Databricks ETL / pipeline migration logic
Output
BigQuery equivalent (validated)
Common pitfalls
  • Dedupe instability: ROW_NUMBER-based dedupe without a complete ORDER BY causes nondeterministic drift under retries.
  • Implicit late-arrival policy: Delta relied on partition overwrite; without an explicit policy, the BigQuery version degrades to append-only.
  • Unbounded applies: MERGE touches full targets because apply windows aren’t scoped to affected partitions.
Context

Why this breaks

Delta pipelines encode correctness in operational behavior: MERGE logic, partition overwrite/reprocessing windows, and implicit retry/idempotency assumptions. BigQuery can produce the same business outcomes, but only if the run contract is made explicit—keys, ordering, dedupe rules, and late-arrival policy.

Common symptoms after cutover:

  • Duplicate or missing updates because dedupe keys and tie-breakers were implicit
  • Late events double-counted or silently ignored because reprocessing windows weren’t recreated
  • SCD dimensions drift during backfills and late updates
  • Costs spike because MERGE/apply touches too much history (unbounded scans)
  • Orchestration changes retries/ordering, turning failures into silent data issues
Approach

How conversion works

  1. Inventory & classify pipelines: Delta tables, MERGE targets, orchestration jobs, schedules, and dependencies.
  2. Extract the run contract: business keys, deterministic ordering/tie-breakers, dedupe rule, late-arrival window policy, and restart semantics.
  3. Re-home transformations to BigQuery-native staging (landing → typed staging → dedupe → apply) with partitioning/clustering aligned to load windows.
  4. Implement late-data behavior explicitly: reprocessing windows and staged re-apply so outcomes match without full-table rewrites.
  5. Re-home orchestration (Composer/Airflow, dbt, or your runner) with explicit DAG dependencies, retries, alerts, and concurrency posture.
  6. Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late injections) and rollback-ready criteria.

Supported constructs

Representative Databricks/Delta ETL constructs we commonly migrate to BigQuery (exact coverage depends on your estate).

Source → target (notes):

  • Delta MERGE INTO (upsert) → BigQuery MERGE (staging + apply): match keys, casts, and update predicates are made explicit and testable.
  • Partition overwrite / reprocessing windows → explicit late-window policy + staged re-apply: backfills and late updates are validated as first-class scenarios.
  • ROW_NUMBER-based dedupe patterns → deterministic dedupe with explicit tie-breakers: prevents nondeterministic drift under retries.
  • SCD Type-1 / Type-2 apply logic → MERGE + current-flag/end-date patterns: validated during backfills and late corrections.
  • Schema evolution in Delta → typed staging + drift policy (widen/quarantine/reject): auditability for changing upstream payloads.
  • Spark job orchestration → Composer/dbt orchestration with explicit DAG contracts: retries, concurrency, and alerts modeled and monitored.

How workload changes

Topic (Databricks/Delta vs BigQuery):

  • Incremental correctness: often relies on partition overwrite and implicit reprocessing (Delta) vs an explicit late-window policy + staged apply with integrity gates (BigQuery).
  • Idempotency under retries: emerges from job structure and is not always tested (Delta) vs proven via rerun simulations and applied-window markers (BigQuery).
  • Cost predictability: driven by cluster runtime + shuffle patterns (Delta) vs bytes scanned + slot time + pruning effectiveness (BigQuery).
Incremental correctness: Correctness becomes auditable and repeatable under retries/backfills.
Idempotency under retries: Idempotency is enforced as a cutover gate.
Cost predictability: Apply windows and layout alignment keep spend stable.

Examples

Canonical BigQuery apply pattern for incremental ETL: stage → dedupe deterministically → MERGE with scoped partitions + applied-window tracking. Adjust keys, partitions, and casts to your model.

-- Applied-window tracking (restartability)
CREATE TABLE IF NOT EXISTS `proj.control.applied_windows` (
  job_name STRING NOT NULL,
  window_start TIMESTAMP NOT NULL,
  window_end TIMESTAMP NOT NULL,
  applied_at TIMESTAMP NOT NULL
);
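
A fuller sketch of the stage → dedupe → apply pattern the control table above supports. Table names, keys, and columns (`proj.stage.orders_stg`, `order_id`, `event_ts`, `_ingest_seq`, and so on) are illustrative assumptions; adapt them to your model.

```sql
-- Hypothetical schema: staging rows keyed by order_id, ordered by event_ts,
-- with _ingest_seq as a unique ingestion sequence for deterministic tie-breaks.
DECLARE w_start TIMESTAMP DEFAULT TIMESTAMP '2024-01-01 00:00:00';
DECLARE w_end   TIMESTAMP DEFAULT TIMESTAMP '2024-01-02 00:00:00';

MERGE `proj.mart.orders` T
USING (
  -- Deterministic dedupe: the ORDER BY breaks every possible tie,
  -- so a retry always elects the same winning row per key.
  SELECT * EXCEPT (rn)
  FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
             PARTITION BY order_id
             ORDER BY event_ts DESC, _ingest_seq DESC
           ) AS rn
    FROM `proj.stage.orders_stg` s
    WHERE event_ts >= w_start AND event_ts < w_end  -- bounded apply window
  )
  WHERE rn = 1
) S
ON T.order_id = S.order_id
   AND T.event_date >= DATE(w_start)  -- scope the target scan to affected partitions
WHEN MATCHED AND S.event_ts >= T.event_ts THEN
  UPDATE SET status = S.status, amount = S.amount, event_ts = S.event_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, event_date, status, amount, event_ts)
  VALUES (S.order_id, DATE(S.event_ts), S.status, S.amount, S.event_ts);

-- Mark the window as applied so a restart can skip or re-scope it.
INSERT INTO `proj.control.applied_windows`
VALUES ('orders_apply', w_start, w_end, CURRENT_TIMESTAMP());
```

The `WHEN MATCHED AND S.event_ts >= T.event_ts` guard keeps replays from regressing newer target rows; whether late rows update the target or are routed elsewhere is exactly the late-arrival policy to make explicit.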
Avoid

Common pitfalls

  • Dedupe instability: ROW_NUMBER-based dedupe without a complete ORDER BY causes nondeterministic drift under retries.
  • Implicit late-arrival policy: Delta relied on partition overwrite; without an explicit policy, the BigQuery version degrades to append-only.
  • Unbounded applies: MERGE touches full targets because apply windows aren’t scoped to affected partitions.
  • Type drift in staging: implicit casts in Spark become explicit failures or silent coercions in BigQuery.
  • Schema evolution surprises: upstream types widen; downstream typed tables break without a drift policy.
  • Orchestration mismatch: concurrency and dependency ordering changes create freshness and correctness drift.
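
The first pitfall is easy to see in a minimal query. Column names here (`source_file`, `source_offset`) are hypothetical stand-ins for whatever uniquely identifies an ingested row in your feeds.

```sql
-- Two events can share updated_at; without further tie-breakers, a retry may
-- number tied rows differently and elect a different "winner". Extending the
-- ORDER BY down to a unique ingestion identifier makes the choice stable.
SELECT * EXCEPT (rn)
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC, source_file, source_offset  -- complete tie-break
         ) AS rn
  FROM `proj.stage.order_events` e
)
WHERE rn = 1;
```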
Proof

Validation approach

  • Execution checks: pipelines run reliably under representative volumes and schedules.
  • Structural parity: partition-level row counts and column profiles (null/min/max/distinct) for key tables.
  • KPI parity: aggregates by key dimensions for critical marts and dashboards.
  • Incremental integrity (mandatory):
    • Idempotency: rerun same micro-batch → no net change
    • Late-arrival: inject late updates → only expected rows change
    • Backfill safety: replay historical windows → stable SCD and dedupe
    • Dedupe stability: duplicates eliminated consistently under retries
  • Cost/performance gates: pruning verified; scan bytes/runtime thresholds set for top jobs.
  • Operational readiness: retry/alerting tests and rollback criteria defined before cutover.
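
The idempotency gate can be expressed as a snapshot diff: snapshot the affected partitions, re-apply the same micro-batch, and assert zero net change. Dataset and table names are illustrative assumptions.

```sql
-- Before the rerun: snapshot the partitions the micro-batch touches.
CREATE OR REPLACE TABLE `proj.audit.orders_pre_rerun` AS
SELECT * FROM `proj.mart.orders`
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-01';

-- After re-applying the same micro-batch: count rows that differ in either direction.
SELECT
  (SELECT COUNT(*) FROM (
     SELECT * FROM `proj.mart.orders`
     WHERE event_date BETWEEN '2024-01-01' AND '2024-01-01'
     EXCEPT DISTINCT
     SELECT * FROM `proj.audit.orders_pre_rerun`))
  +
  (SELECT COUNT(*) FROM (
     SELECT * FROM `proj.audit.orders_pre_rerun`
     EXCEPT DISTINCT
     SELECT * FROM `proj.mart.orders`
     WHERE event_date BETWEEN '2024-01-01' AND '2024-01-01'))
  AS drifted_rows;  -- an idempotent apply yields drifted_rows = 0
```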
Execution

Migration steps

A sequence that keeps pipeline correctness measurable and cutover controlled.
  1. Inventory pipelines, schedules, and dependencies

    Extract pipeline graph: MERGE targets, upstream feeds, orchestration DAGs, schedules, and SLAs. Identify business-critical marts and consumers.

  2. Formalize the run contract

    Define high-water marks, business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival window policy, and backfill boundaries. Make restartability explicit.

  3. Rebuild transformations on BigQuery-native staging

    Implement landing → typed staging → dedupe → apply with partitioning/clustering aligned to load windows. Define schema evolution policy (widen/quarantine/reject).
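
One way to make the widen/quarantine/reject policy concrete in the typed-staging step; table and column names are hypothetical.

```sql
-- Typed staging: SAFE_CAST turns bad values into NULLs instead of failing the job.
CREATE OR REPLACE TABLE `proj.stage.orders_stg` AS
SELECT
  SAFE_CAST(raw.order_id AS STRING)    AS order_id,
  SAFE_CAST(raw.amount   AS NUMERIC)   AS amount,   -- widen: NUMERIC absorbs integer and decimal feeds
  SAFE_CAST(raw.event_ts AS TIMESTAMP) AS event_ts
FROM `proj.landing.orders_raw` AS raw
WHERE SAFE_CAST(raw.event_ts AS TIMESTAMP) IS NOT NULL;

-- Quarantine: rows that fail typing are kept for audit rather than silently dropped.
INSERT INTO `proj.quarantine.orders_bad`
SELECT raw.*, CURRENT_TIMESTAMP() AS quarantined_at
FROM `proj.landing.orders_raw` AS raw
WHERE SAFE_CAST(raw.event_ts AS TIMESTAMP) IS NULL;
```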

  4. Re-home orchestration and operations

    Implement DAGs in Composer/Airflow or your orchestrator: dependencies, retries, alerts, and concurrency. Add applied-window tracking and failure handling.

  5. Run parity and incremental integrity gates

    Golden outputs + KPI aggregates, idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.

Workload Assessment
Migrate Delta pipelines with the run contract intact

We inventory your Databricks pipelines, formalize MERGE/late-data semantics, migrate a representative pipeline end-to-end, and produce parity evidence with cutover gates—without scan-cost surprises.

Migration Acceleration
Cut over pipelines with proof-backed gates

Get an actionable migration plan with incremental integrity tests (reruns, late data, backfills), reconciliation evidence, and cost/performance baselines—so pipeline cutover is controlled and dispute-proof.

FAQ

Frequently asked questions

Is Databricks ETL migration just rewriting Spark jobs?
No. The critical work is preserving the run contract: keys, ordering/tie-breakers, dedupe rules, late-arrival policy, and restartability. Job translation is only one part.
How do you preserve Delta MERGE behavior in BigQuery?
We formalize match keys and ordering, dedupe deterministically, and implement staged apply with bounded windows. Then we prove idempotency and late-data behavior with simulations as cutover gates.
What if our pipeline relies on partition overwrite semantics?
We convert it into an explicit late-arrival and reprocessing policy and implement staged re-apply so outcomes match without full-table reprocessing.
How do you avoid BigQuery cost surprises for ETL?
We design pruning-aware staging boundaries and choose partitioning/clustering aligned to load windows. Validation includes scan bytes/runtime baselines and regression thresholds for your top jobs.