Hadoop legacy ETL pipelines to BigQuery
Re-home Hadoop-era pipelines—partition overwrites, staging zones, UDF utilities, and orchestrated SQL chains—into BigQuery with an explicit run contract and validation gates that prevent drift and scan-cost surprises.
- Input: Hadoop (legacy cluster) ETL / pipeline migration logic
- Output: validated BigQuery equivalent
Why this breaks
Hadoop ETL systems rarely fail because a query won’t translate. They fail because the run contract is implicit: partition overwrite conventions, file-based staging semantics, late-arrival reprocessing windows, and restart behavior encoded in Oozie coordinators and shell scripts. BigQuery can implement equivalent outcomes, but only if those hidden rules are extracted and enforced with deterministic patterns and validation gates.
Common symptoms after cutover:
- Duplicate or missing data because overwrite/reprocessing semantics weren’t recreated
- Late-arrival events are ignored (or double-counted) because window policy was implicit
- Scan-cost spikes because partition predicates and staging boundaries stop pruning
- Orchestration dependencies and retries change, turning failures into silent data issues
- Schema drift from upstream feeds breaks typed targets without a defined policy
How conversion works
- Inventory & classify the estate: Hive/Impala/Spark SQL, staging zones, UDFs, and orchestrators (Oozie/Airflow/Cron).
- Extract the run contract: keys, partition boundaries, watermarks, dedupe tie-breakers, late-arrival window policy, and failure/retry semantics.
- Re-home ingestion and staging: landing tables + manifests, typed staging, and standardized audit columns.
- Rebuild transforms using BigQuery-native patterns (landing → typed staging → dedupe → apply) with partitioning/clustering aligned to access paths.
- Implement restartability: applied-window tracking, idempotency keys, deterministic ordering, and safe retries.
- Re-home orchestration: Composer/Airflow or your runner with explicit DAG contracts, retries, alerts, and concurrency posture.
- Gate cutover with evidence: golden outputs + incremental integrity simulations (reruns, backfills, late injections) and rollback-ready criteria.
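The dedupe step in this flow can be sketched in BigQuery SQL. Table and column names (`proj.stage.orders_typed`, `order_id`, `updated_at`, `ingest_seq`, `dt`) are placeholders for illustration; the point is the explicit tie-breaker, which makes the surviving row deterministic under retries:

```sql
-- Deterministic dedupe: keep the latest version of each business key,
-- with an explicit tie-breaker so reruns select the same survivor.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id                       -- business key
      ORDER BY updated_at DESC, ingest_seq DESC   -- tie-breaker = determinism
    ) AS rn
  FROM `proj.stage.orders_typed`
  WHERE dt BETWEEN @window_start AND @window_end  -- stay inside the run window
)
WHERE rn = 1;
```

Without the second `ORDER BY` column, ties between equal `updated_at` values resolve arbitrarily, which is exactly the nondeterministic drift called out in the pitfalls below.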
Supported constructs
Representative Hadoop-era ETL constructs we commonly migrate to BigQuery (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Partition overwrite pipelines (dt partitions) | Partition-scoped apply (MERGE or overwrite-by-partition) | Preserve overwrite semantics without full-table rewrites. |
| Oozie coordinators / shell script chains | Composer/Airflow DAGs with explicit contracts | Dependencies, retries, and SLAs become first-class artifacts. |
| Hive/Impala staging zones (files) | Landing tables + typed staging | Replayable staging boundaries with audit columns and manifests. |
| Late-data reprocessing windows | Explicit late-arrival policy + staged re-apply | Behavior verified via late-injection simulations. |
| SCD Type-1 / Type-2 logic | MERGE + current-flag/end-date patterns | Backfills and late updates tested as first-class scenarios. |
| Hive schema drift | Typed staging + drift policy (widen/quarantine/reject) | Auditability for changing upstream payloads. |
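The first row of the table (partition overwrite) can be sketched as a delete-then-insert inside a BigQuery multi-statement transaction, mirroring Hive's `INSERT OVERWRITE ... PARTITION`. Names (`proj.mart.orders`, `proj.stage.orders_dedup`, `dt`) are placeholders:

```sql
-- Equivalent of Hive's INSERT OVERWRITE TABLE ... PARTITION (dt='2024-06-01'):
-- delete the target partition, then re-insert it, atomically.
DECLARE run_date DATE DEFAULT DATE '2024-06-01';  -- the partition being rebuilt

BEGIN TRANSACTION;

DELETE FROM `proj.mart.orders`
WHERE dt = run_date;  -- predicate on the partition column prunes to one partition

INSERT INTO `proj.mart.orders` (order_id, dt, amount, updated_at)
SELECT order_id, dt, amount, updated_at
FROM `proj.stage.orders_dedup`
WHERE dt = run_date;

COMMIT TRANSACTION;
```

The transaction keeps overwrite semantics (no window where the partition is empty for readers of a consistent snapshot) without rewriting the whole table.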
How workload changes
| Topic | Hadoop legacy cluster | BigQuery |
|---|---|---|
| Incremental correctness | Often emerges from overwrite + coordinator conventions | Explicit windows, idempotency, and staged apply |
| Performance model | Avoid HDFS scans by partition predicates | Bytes scanned is the cost driver; pruning must be explicit |
| Orchestration | Oozie coordinators and script chains | Composer/Airflow/dbt DAGs with explicit contracts |
| Schema evolution | Hive drift tolerated by downstream consumers | Typed staging with explicit drift policy |
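Because bytes scanned drive BigQuery cost, the pruning difference in the table above is worth making concrete. A sketch, with `proj.mart.orders` partitioned on a `dt` DATE column (placeholder names):

```sql
-- Defeats pruning: the partition column is wrapped in a function,
-- so every partition is scanned and billed.
SELECT COUNT(*)
FROM `proj.mart.orders`
WHERE FORMAT_DATE('%Y-%m-%d', dt) = '2024-06-01';

-- Preserves pruning: compare the partition column directly,
-- so only the 2024-06-01 partition is scanned.
SELECT COUNT(*)
FROM `proj.mart.orders`
WHERE dt = DATE '2024-06-01';
```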
Examples
Canonical BigQuery pattern for windowed loads: stage → dedupe deterministically → partition-scoped apply + applied-window tracking. Adjust keys, partitions, and casts to your model.
```sql
-- Applied-window tracking (restartability)
CREATE TABLE IF NOT EXISTS `proj.control.applied_windows` (
  job_name     STRING NOT NULL,
  window_start DATE NOT NULL,
  window_end   DATE NOT NULL,
  applied_at   TIMESTAMP NOT NULL
);
```

Common pitfalls
- Assuming “SQL translation” equals pipeline migration: the hard part is operational semantics (windows, retries, late data).
- Partition semantics lost: overwrite-partition becomes append-only; duplicates appear.
- Dedupe instability: missing tie-breakers cause nondeterministic drift.
- Pruning defeated: WHERE clauses wrap or cast partition columns, so partitions stop pruning and scanned bytes explode.
- Unbounded applies: MERGEs or refreshes touch too much history each run.
- Schema drift surprises: upstream types widen/change; typed targets break without a drift policy.
- Orchestrator mismatch: coordinator-based dependencies aren’t mapped; freshness and correctness drift.
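The "unbounded applies" pitfall has a direct fix: bound the MERGE to the run window plus the late-arrival allowance. A sketch with placeholder names (`proj.mart.orders`, `proj.stage.orders_dedup`) and an assumed 3-day late window:

```sql
-- Bounded apply: predicates on both source and target dt keep the MERGE
-- from touching all history on every run.
MERGE `proj.mart.orders` T
USING (
  SELECT *
  FROM `proj.stage.orders_dedup`
  WHERE dt BETWEEN DATE_SUB(@run_date, INTERVAL 3 DAY) AND @run_date
) S
ON  T.order_id = S.order_id
AND T.dt = S.dt
AND T.dt BETWEEN DATE_SUB(@run_date, INTERVAL 3 DAY) AND @run_date
WHEN MATCHED THEN
  UPDATE SET amount = S.amount, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, dt, amount, updated_at)
  VALUES (S.order_id, S.dt, S.amount, S.updated_at);
```

Target rows outside the window never match and are never rewritten, so each run's blast radius (and scan cost) stays fixed.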
Validation approach
- Execution checks: pipelines run reliably under representative volumes and schedules.
- Structural parity: partition/window-level row counts and column profiles (null/min/max/distinct).
- KPI parity: aggregates by key dimensions for critical marts and dashboards.
- Incremental integrity (mandatory):
- Idempotency: rerun same window → no net change
- Restart simulation: fail mid-run → resume → correct final state
- Backfill safety: historical windows replay without drift
- Late-arrival: inject late corrections → only expected rows change
- Dedupe stability: duplicates eliminated consistently under retries
- Cost/performance gates: pruning verified; scan bytes/runtime thresholds set for top jobs.
- Operational readiness: retry/alerting tests, canary gates, and rollback criteria defined before cutover.
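A structural-parity check can be sketched as a partition-level reconciliation query; `proj.parity.legacy_orders` (an export of the legacy output) and the metric columns are placeholders. The query returns only partitions that disagree, so an empty result is the pass condition:

```sql
-- Partition-level parity: compare row counts and an aggregate
-- between the legacy export and the BigQuery target.
SELECT
  dt,
  l.row_count  AS legacy_rows,
  n.row_count  AS bq_rows,
  l.amount_sum AS legacy_amount,
  n.amount_sum AS bq_amount
FROM (
  SELECT dt, COUNT(*) AS row_count, SUM(amount) AS amount_sum
  FROM `proj.parity.legacy_orders`
  GROUP BY dt
) l
FULL OUTER JOIN (
  SELECT dt, COUNT(*) AS row_count, SUM(amount) AS amount_sum
  FROM `proj.mart.orders`
  GROUP BY dt
) n USING (dt)
WHERE l.row_count  IS DISTINCT FROM n.row_count
   OR l.amount_sum IS DISTINCT FROM n.amount_sum;
```

The same query rerun after an idempotency or late-injection simulation doubles as the drift gate: only the expected partitions may appear.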
Migration steps
1. Inventory pipelines, schedules, and dependencies. Extract the pipeline graph: SQL jobs, staging zones, upstream feeds, and orchestrators (Oozie/Airflow/Cron). Identify business-critical marts and consumers.
2. Formalize the run contract. Define windows/high-water marks, business keys, deterministic ordering/tie-breakers, dedupe rules, late-arrival policy, and backfill boundaries. Make restartability explicit.
3. Rebuild transformations on BigQuery-native staging. Implement landing → typed staging → dedupe → apply with partitioning/clustering aligned to windows and access paths. Define a schema evolution policy (widen/quarantine/reject).
4. Re-home orchestration and operations. Implement DAGs in Composer/Airflow/dbt: dependencies, retries, alerts, and concurrency. Recreate overwrite/reprocessing behavior with explicit late windows and idempotent markers.
5. Run parity and incremental integrity gates. Compare golden outputs and KPI aggregates; run idempotency reruns, late-data injections, and backfill windows. Cut over only when thresholds pass and rollback criteria are defined.
We inventory your Hadoop pipelines, formalize partition/late-data semantics, migrate a representative pipeline end-to-end, and produce parity evidence with cutover gates—without scan-cost surprises.
Get an actionable migration plan with incremental integrity tests (reruns, late data, backfills), reconciliation evidence, and cost/performance baselines—so pipeline cutover is controlled and dispute-proof.