Workload

Hadoop legacy ETL pipelines to BigQuery

Re-home Hadoop-era pipelines—partition overwrites, staging zones, UDF utilities, and orchestrated SQL chains—into BigQuery with an explicit run contract and validation gates that prevent drift and scan-cost surprises.

At a glance
Input
Hadoop (legacy clusters) ETL / pipeline migration logic
Output
BigQuery equivalent (validated)
Common pitfalls
  • Assuming “SQL translation” equals pipeline migration: the hard part is operational semantics (windows, retries, late data).
  • Partition semantics lost: overwrite-partition becomes append-only; duplicates appear.
  • Dedupe instability: missing tie-breakers cause nondeterministic drift.
Context

Why this breaks

Hadoop ETL systems rarely fail because a query won’t translate. They fail because the run contract is implicit: partition overwrite conventions, file-based staging semantics, late-arrival reprocessing windows, and restart behavior encoded in Oozie coordinators and shell scripts. BigQuery can implement equivalent outcomes, but only if those hidden rules are extracted and enforced with deterministic patterns and validation gates.

Common symptoms after cutover:

  • Duplicate or missing data because overwrite/reprocessing semantics weren’t recreated
  • Late-arrival events are ignored (or double-counted) because window policy was implicit
  • Scan-cost spikes because partition predicates and staging boundaries stop pruning
  • Orchestration dependencies and retries change, turning failures into silent data issues
  • Schema drift from upstream feeds breaks typed targets without a defined policy
Approach

How conversion works

  1. Inventory & classify the estate: Hive/Impala/Spark SQL, staging zones, UDFs, and orchestrators (Oozie/Airflow/Cron).
  2. Extract the run contract: keys, partition boundaries, watermarks, dedupe tie-breakers, late-arrival window policy, and failure/retry semantics.
  3. Re-home ingestion and staging: landing tables + manifests, typed staging, and standardized audit columns.
  4. Rebuild transforms using BigQuery-native patterns (landing → typed staging → dedupe → apply) with partitioning/clustering aligned to access paths.
  5. Implement restartability: applied-window tracking, idempotency keys, deterministic ordering, and safe retries.
  6. Re-home orchestration: Composer/Airflow or your runner with explicit DAG contracts, retries, alerts, and concurrency posture.
  7. Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late injections) and rollback-ready criteria.

Supported constructs

Representative Hadoop-era ETL constructs we commonly migrate to BigQuery (exact coverage depends on your estate).

  • Partition overwrite pipelines (dt partitions) → partition-scoped apply (MERGE or overwrite-by-partition). Preserves overwrite semantics without full-table rewrites.
  • Oozie coordinators / shell script chains → Composer/Airflow DAGs with explicit contracts. Dependencies, retries, and SLAs become first-class artifacts.
  • Hive/Impala staging zones (files) → landing tables + typed staging. Replayable staging boundaries with audit columns and manifests.
  • Late-data reprocessing windows → explicit late-arrival policy + staged re-apply. Behavior verified via late-injection simulations.
  • SCD Type-1 / Type-2 logic → MERGE + current-flag/end-date patterns. Backfills and late updates tested as first-class scenarios.
  • Hive schema drift → typed staging + drift policy (widen/quarantine/reject). Auditability for changing upstream payloads.
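To make the first construct concrete, here is a minimal sketch of a partition-scoped apply that reproduces Hive's overwrite-by-partition behavior in BigQuery. The table names (`proj.mart.events`, `proj.staging.events_dedup`) and columns (`event_id`, `dt`, `amount`, `updated_at`) are illustrative placeholders, not a fixed schema.

```sql
-- Overwrite one dt partition without rewriting the whole table.
-- Rows present in staging are upserted; rows in the target partition that
-- are absent from staging are deleted, reproducing overwrite semantics.
DECLARE target_dt DATE DEFAULT DATE '2024-01-15';

MERGE `proj.mart.events` T
USING (
  SELECT event_id, dt, amount, updated_at
  FROM `proj.staging.events_dedup`
  WHERE dt = target_dt
) S
ON T.event_id = S.event_id AND T.dt = target_dt
WHEN MATCHED THEN
  UPDATE SET amount = S.amount, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (event_id, dt, amount, updated_at)
  VALUES (S.event_id, S.dt, S.amount, S.updated_at)
WHEN NOT MATCHED BY SOURCE AND T.dt = target_dt THEN
  DELETE;
```

For full-partition replacement without row-level comparison, writing to the partition decorator (for example `events$20240115`) with a truncate disposition is an alternative; MERGE is preferable when the run must also be safe to retry.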

How workload changes

  • Incremental correctness. Hadoop legacy cluster: often emerges from overwrite + coordinator conventions. BigQuery: explicit windows, idempotency, and staged apply. Correctness becomes auditable and repeatable under retries/backfills.
  • Performance model. Hadoop: avoid HDFS scans via partition predicates. BigQuery: bytes scanned is the cost driver, so pruning must be explicit. Staging boundaries and filters drive stable cost and runtime.
  • Orchestration. Hadoop: Oozie coordinators and script chains. BigQuery: Composer/Airflow/dbt DAGs with explicit contracts. Retries and alerts are modeled and monitored.
  • Schema evolution. Hadoop: Hive drift tolerated by downstream consumers. BigQuery: typed staging with explicit drift policy. Prevents silent coercion and downstream surprises.
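The performance-model difference is easiest to see in a pair of queries. Assuming a table partitioned on a DATE column `dt` (table and column names are placeholders), the only difference is whether the partition column is compared directly:

```sql
-- Defeats pruning: wrapping or casting the partition column forces BigQuery
-- to scan every partition before it can evaluate the predicate.
SELECT COUNT(*)
FROM `proj.mart.events`
WHERE FORMAT_DATE('%Y-%m-%d', dt) = '2024-01-15';

-- Prunes: a direct comparison on the partition column lets BigQuery read
-- only the one matching partition.
SELECT COUNT(*)
FROM `proj.mart.events`
WHERE dt = DATE '2024-01-15';
```

A dry run (`bq query --dry_run`, or the bytes-scanned preview in the console) makes the difference visible before any cost is incurred.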

Examples

Canonical BigQuery pattern for windowed loads: stage → dedupe deterministically → partition-scoped apply + applied-window tracking. Adjust keys, partitions, and casts to your model.

```sql
-- Applied-window tracking (restartability)
CREATE TABLE IF NOT EXISTS `proj.control.applied_windows` (
  job_name STRING NOT NULL,
  window_start DATE NOT NULL,
  window_end DATE NOT NULL,
  applied_at TIMESTAMP NOT NULL
);
```
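Continuing the sketch, the dedupe and bookkeeping stages might look like the following; the staging tables, business key (`event_id`), and tie-breaker columns (`updated_at`, `source_file`) are placeholders to adapt, and `@window_start`/`@window_end` are query parameters.

```sql
-- Deterministic dedupe: exactly one row per business key, with explicit
-- tie-breakers so reruns pick the same surviving row every time.
CREATE OR REPLACE TABLE `proj.staging.events_dedup` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY updated_at DESC, source_file DESC  -- tie-breakers prevent drift
    ) AS rn
  FROM `proj.staging.events_typed`
  WHERE dt BETWEEN @window_start AND @window_end
)
WHERE rn = 1;

-- After the partition-scoped apply succeeds, record the window so a
-- restarted run can detect and skip completed work.
INSERT INTO `proj.control.applied_windows`
  (job_name, window_start, window_end, applied_at)
VALUES ('events_daily', @window_start, @window_end, CURRENT_TIMESTAMP());
```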
Avoid

Common pitfalls

  • Assuming “SQL translation” equals pipeline migration: the hard part is operational semantics (windows, retries, late data).
  • Partition semantics lost: overwrite-partition becomes append-only; duplicates appear.
  • Dedupe instability: missing tie-breakers cause nondeterministic drift.
  • Pruning defeated: WHERE clauses wrap or cast partition columns, so pruning fails and scanned bytes explode.
  • Unbounded applies: MERGEs or refreshes touch too much history each run.
  • Schema drift surprises: upstream types widen/change; typed targets break without a drift policy.
  • Orchestrator mismatch: coordinator-based dependencies aren’t mapped; freshness and correctness drift.
Proof

Validation approach

  • Execution checks: pipelines run reliably under representative volumes and schedules.
  • Structural parity: partition/window-level row counts and column profiles (null/min/max/distinct).
  • KPI parity: aggregates by key dimensions for critical marts and dashboards.
  • Incremental integrity (mandatory):
    • Idempotency: rerun same window → no net change
    • Restart simulation: fail mid-run → resume → correct final state
    • Backfill safety: historical windows replay without drift
    • Late-arrival: inject late corrections → only expected rows change
    • Dedupe stability: duplicates eliminated consistently under retries
  • Cost/performance gates: pruning verified; scan bytes/runtime thresholds set for top jobs.
  • Operational readiness: retry/alerting tests, canary gates, and rollback criteria defined before cutover.
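Two of these gates reduce to simple set-difference queries. Assuming a legacy output snapshot loaded into BigQuery for comparison and a pre-rerun snapshot of the target (both table names are placeholders):

```sql
-- Structural parity: per-partition row counts must match the legacy snapshot.
-- Zero rows returned means every partition in the window agrees.
(SELECT dt, COUNT(*) AS row_count
 FROM `proj.mart.events`
 WHERE dt BETWEEN @window_start AND @window_end
 GROUP BY dt)
EXCEPT DISTINCT
(SELECT dt, COUNT(*) AS row_count
 FROM `proj.parity.events_legacy_snapshot`
 WHERE dt BETWEEN @window_start AND @window_end
 GROUP BY dt);

-- Idempotency: rerunning the same window must leave the data unchanged
-- relative to a snapshot taken before the rerun.
SELECT * FROM `proj.mart.events`
WHERE dt BETWEEN @window_start AND @window_end
EXCEPT DISTINCT
SELECT * FROM `proj.parity.events_before_rerun`
WHERE dt BETWEEN @window_start AND @window_end;
```

EXCEPT is directional, so the full gate runs each check a second time with the operands swapped to catch rows missing from the migrated side as well.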
Execution

Migration steps

A sequence that keeps pipeline correctness measurable and cutover controlled.
  1. Inventory pipelines, schedules, and dependencies

    Extract the pipeline graph: SQL jobs, staging zones, upstream feeds, and orchestrators (Oozie/Airflow/Cron). Identify business-critical marts and consumers.

  2. Formalize the run contract

    Define windows/high-water marks, business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival policy, and backfill boundaries. Make restartability explicit.

  3. Rebuild transformations on BigQuery-native staging

    Implement landing → typed staging → dedupe → apply with partitioning/clustering aligned to windows and access paths. Define the schema evolution policy (widen/quarantine/reject).

  4. Re-home orchestration and operations

    Implement DAGs in Composer/Airflow/dbt: dependencies, retries, alerts, and concurrency. Recreate overwrite/reprocessing behavior with explicit late windows and idempotent markers.

  5. Run parity and incremental integrity gates

    Golden outputs + KPI aggregates, idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
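The run-contract and restartability steps meet in the applied-window table shown under Examples: each run derives its next window from the recorded high-water mark, so a failed run resumes at the same boundary. A minimal sketch, with the job name and default start date as placeholders:

```sql
-- Derive the next window from the high-water mark. A restarted run computes
-- the same boundary, so completed windows are never re-applied and a failed
-- window is retried from its recorded edge.
DECLARE next_start DATE;
SET next_start = (
  SELECT COALESCE(DATE_ADD(MAX(window_end), INTERVAL 1 DAY), DATE '2020-01-01')
  FROM `proj.control.applied_windows`
  WHERE job_name = 'events_daily'
);
```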

Workload Assessment
Migrate legacy pipelines with the run contract intact

We inventory your Hadoop pipelines, formalize partition/late-data semantics, migrate a representative pipeline end-to-end, and produce parity evidence with cutover gates—without scan-cost surprises.

Migration Acceleration
Cut over pipelines with proof-backed gates

Get an actionable migration plan with incremental integrity tests (reruns, late data, backfills), reconciliation evidence, and cost/performance baselines—so pipeline cutover is controlled and dispute-proof.

FAQ

Frequently asked questions

Why is Hadoop ETL migration more than rewriting SQL?
Because correctness lives in operational conventions: partition overwrite, reprocessing windows, retries, and script-based orchestration. We extract these rules into an explicit run contract and validate them with integrity gates.
How do you preserve overwrite-partition behavior in BigQuery?
We implement partition-scoped apply (MERGE or overwrite-by-partition) with applied-window tracking and rerun simulations so outcomes remain stable under retries and backfills.
What about late-arriving data?
We convert it into an explicit late-arrival window policy and staged re-apply strategy, then validate with late-injection simulations to prove only expected rows change.
How do you prevent BigQuery cost surprises?
We design pruning-aware staging boundaries and choose partitioning/clustering aligned to windows. Validation includes scan bytes/runtime baselines and regression thresholds for your top jobs.