Validation & reconciliation for Hadoop legacy cluster → BigQuery
Turn “it runs” into a measurable parity contract. We prove correctness *and* pruning posture with golden queries, KPI diffs, and replayable integrity simulations—then gate cutover with rollback-ready criteria.
- Input
- Hadoop (legacy clusters) Validation & reconciliation logic
- Output
- BigQuery equivalent (validated)
- Common pitfalls
- Spot-check validation: a few samples miss drift in ties and edge cohorts.
- No pruning gate: scan-cost regressions slip through because bytes scanned isn’t validated.
- Filters that defeat pruning: wrapping partition columns in functions/casts in WHERE.
Why this breaks
Hadoop migrations fail late because correctness and performance were enforced by convention: teams always filtered partitions, tolerated implicit casts, and relied on orchestrator scripts for reruns and backfills. BigQuery will compile many translated jobs—but drift and cost spikes appear when partition/pruning behavior, typing/NULL semantics, and time conversions aren't made explicit and validated under stress.
Common drift drivers in Hadoop legacy cluster → BigQuery:
- Pruning contract lost: partition filters don’t translate, or filters defeat pruning → scan bytes explode
- Implicit casts & NULL semantics: CASE/COALESCE branches and join keys behave differently
- Window/top-N ambiguity: missing tie-breakers changes winners under parallelism
- Epoch/time conversions: timezone intent missing → boundary-day drift
- Operational behavior: reruns/backfills differ when overwrite/reprocessing conventions weren’t recreated
Validation must treat this as an operational system and include pruning/cost posture as a first-class cutover gate.
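The epoch/time driver above is the easiest to reproduce. A minimal sketch, assuming a hypothetical staging table `proj.staging.raw_events` with an epoch-microseconds column `event_ts_epoch_us`:

```sql
-- Hypothetical illustration: make timezone intent explicit when deriving
-- partition dates from epoch values, so boundary days match the legacy cluster.
SELECT
  event_id,
  -- Legacy Hive jobs often assumed the cluster's local zone implicitly;
  -- in BigQuery, omitting the zone means UTC.
  DATE(TIMESTAMP_MICROS(event_ts_epoch_us), 'America/Chicago') AS event_date_local,
  DATE(TIMESTAMP_MICROS(event_ts_epoch_us))                    AS event_date_utc
FROM `proj.staging.raw_events`;
-- Rows near midnight land in different dates in the two columns:
-- that difference is the boundary-day drift.
```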
How conversion works
- Define the parity contract: what must match (facts/dims, KPIs, dashboards) and what tolerances apply.
- Define the pruning/cost contract: which workloads must prune and what scan-byte/slot thresholds are acceptable.
- Build validation datasets: golden inputs, edge cohorts (ties, null-heavy segments), and representative windows (including boundary dates).
- Run readiness + execution gates: schemas/types align, dependencies deployed, and jobs run reliably.
- Run layered parity gates: counts/profiles → KPI diffs → targeted row-level diffs where needed.
- Validate operational integrity where applicable: idempotency reruns, restart simulations, and backfill/late-data injections.
- Gate cutover: pass/fail thresholds, canary strategy, rollback triggers, and post-cutover monitors.
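The KPI-diff layer of these gates can be sketched as a tolerance-aware comparison. The table names (`proj.validation.legacy_kpis`, `proj.validation.bq_kpis`), the `region`/`revenue` columns, and the 0.1% threshold are illustrative assumptions:

```sql
-- Hypothetical KPI parity diff: compare legacy-exported aggregates against the
-- BigQuery mart by dimension, flagging rows outside an agreed relative tolerance.
SELECT
  COALESCE(l.region, t.region) AS region,
  l.revenue AS legacy_revenue,
  t.revenue AS bq_revenue,
  SAFE_DIVIDE(ABS(t.revenue - l.revenue), NULLIF(l.revenue, 0)) AS rel_diff
FROM `proj.validation.legacy_kpis` AS l
FULL OUTER JOIN `proj.validation.bq_kpis` AS t
  ON l.region = t.region
WHERE l.revenue IS NULL   -- dimension missing on one side
   OR t.revenue IS NULL
   OR SAFE_DIVIDE(ABS(t.revenue - l.revenue), NULLIF(l.revenue, 0)) > 0.001;  -- 0.1% tolerance
```

An empty result means the gate passes; any returned row is a dispute with its magnitude attached.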
Supported constructs
Representative validation and reconciliation mechanisms we apply in Hadoop legacy cluster → BigQuery migrations.
| Source | Target | Notes |
|---|---|---|
| Golden dashboards/queries | Golden query harness + repeatable parameter sets | Codifies business sign-off into runnable tests. |
| Partition/pruning expectations | Pruning verification + scan-byte thresholds | Treat pruning as part of correctness in BigQuery. |
| Counts and profiles | Partition-level counts + null/min/max/distinct profiles | Cheap early drift detection before deep diffs. |
| KPI validation | Aggregate diffs by key dimensions + tolerance thresholds | Aligns validation with business meaning. |
| Row-level diffs | Targeted sampling diffs + edge cohort tests | Use deep diffs only where aggregates signal drift. |
| Reruns/backfills in orchestrators | Operational integrity simulations | Proves behavior under operational stress. |
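The pruning-verification row above can be checked from job metadata. A sketch, assuming golden queries are tagged with a hypothetical `golden_query` job label and a 2 GiB scan budget:

```sql
-- Hypothetical pruning check: after running a labeled golden query, confirm its
-- bytes processed stayed under the agreed threshold for that workload.
SELECT
  job_id,
  total_bytes_processed,
  total_bytes_processed <= 2 * POW(1024, 3) AS within_budget  -- example: 2 GiB
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND EXISTS (
    SELECT 1 FROM UNNEST(labels) AS lbl
    WHERE lbl.key = 'golden_query' AND lbl.value = 'daily_events_rollup'
  );
```

A dry run (`dryRun` on the job, or `--dry_run` in the CLI) gives the same bytes estimate before spending slots.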
How workload changes
| Topic | Hadoop legacy cluster | BigQuery |
|---|---|---|
| Performance contract | Partition predicates are mandatory to avoid HDFS scans | Bytes scanned is the cost driver; pruning must be explicit |
| Drift drivers | Implicit casts and time conversions often tolerated | Explicit casts and timezone intent required |
| Operational sign-off | Often based on “looks right” report checks | Evidence-based gates + rollback triggers |
Examples
Illustrative parity and pruning checks in BigQuery. Replace datasets, keys, and KPI definitions to match your migration.
```sql
-- Row counts by partition/window
SELECT
  event_date AS d,
  COUNT(*) AS row_count  -- "rows" is a reserved keyword in GoogleSQL; avoid it as an alias
FROM `proj.mart.events`
WHERE event_date BETWEEN @start_date AND @end_date
GROUP BY 1
ORDER BY 1;
```
Common pitfalls
- Spot-check validation: a few samples miss drift in ties and edge cohorts.
- No pruning gate: scan-cost regressions slip through because bytes scanned isn’t validated.
- Filters that defeat pruning: wrapping partition columns in functions/casts in WHERE.
- No tolerance model: teams argue about diffs because thresholds weren’t defined upfront.
- Wrong comparison level: comparing raw rows when business cares about rollups (or vice versa).
- Ignoring reruns/backfills: parity looks fine once but fails under retries and historical replays.
- Cost-blind diffs: exhaustive row-level diffs can be expensive; use layered gates (cheap→deep).
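The pruning-defeating pitfall is worth seeing side by side. A sketch against the `proj.mart.events` table from the example above, assuming it is partitioned on `event_date`:

```sql
-- Anti-pattern: wrapping the partition column in a function defeats pruning,
-- so BigQuery scans every partition.
SELECT COUNT(*) FROM `proj.mart.events`
WHERE FORMAT_DATE('%Y-%m', event_date) = '2024-01';

-- Pruning-friendly rewrite: compare the bare partition column to constants,
-- so only January's partitions are scanned.
SELECT COUNT(*) FROM `proj.mart.events`
WHERE event_date >= '2024-01-01' AND event_date < '2024-02-01';
```

Both return the same count; only the second keeps the scan-byte contract.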
Validation approach
Gate set (layered)
Gate 0 — Readiness
- Datasets, permissions, and target schemas exist
- Dependent assets deployed (UDFs/routines, reference data, control tables)
Gate 1 — Execution
- Converted jobs compile and run reliably
- Deterministic ordering + explicit casts enforced
Gate 2 — Structural parity
- Row counts by partition/window
- Null/min/max/distinct profiles for key columns
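The Gate 2 profiles can be one cheap query run on both sides and diffed. A sketch, assuming hypothetical `user_id` and `amount` columns on `proj.mart.events`:

```sql
-- Hypothetical column profile for key columns; run against legacy export and
-- BigQuery mart, then diff the two result sets.
SELECT
  event_date,
  COUNT(*)                        AS row_count,
  COUNTIF(user_id IS NULL)        AS null_user_ids,
  MIN(amount)                     AS min_amount,
  MAX(amount)                     AS max_amount,
  APPROX_COUNT_DISTINCT(user_id)  AS approx_distinct_users
FROM `proj.mart.events`
WHERE event_date BETWEEN @start_date AND @end_date
GROUP BY event_date;
```

Use exact `COUNT(DISTINCT ...)` instead of the approximate form if the distinct count itself is gated.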
Gate 3 — KPI parity
- KPI aggregates by key dimensions
- Rankings and top-N validated on tie/edge cohorts
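Tie/edge validation only works if the target query is deterministic in the first place. A sketch of an explicit tie-breaker, assuming a hypothetical `proj.mart.user_spend` rollup:

```sql
-- Hypothetical top-N with an explicit tie-breaker: without the user_id term,
-- ties are broken arbitrarily and "winners" can differ between engines and runs.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    region,
    user_id,
    total_spend,
    ROW_NUMBER() OVER (
      PARTITION BY region
      ORDER BY total_spend DESC, user_id  -- deterministic tie-breaker
    ) AS rn
  FROM `proj.mart.user_spend`
)
WHERE rn <= 10;
```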
Gate 4 — Pruning & cost posture (mandatory)
- Partition filters prune as expected on representative parameters
- Bytes scanned and slot time remain within agreed thresholds
- Regression alerts defined for scan blowups
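A scan-blowup regression alert can ride on the same job metadata. A sketch, assuming a 1.5x-of-trailing-average threshold (tune per workload):

```sql
-- Hypothetical regression alert: flag days whose total scan bytes exceed
-- 1.5x the trailing 7-day average.
WITH daily AS (
  SELECT
    DATE(creation_time)        AS run_date,
    SUM(total_bytes_processed) AS bytes_scanned
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY)
  GROUP BY run_date
)
SELECT
  run_date,
  bytes_scanned,
  AVG(bytes_scanned) OVER (
    ORDER BY run_date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
  ) AS trailing_avg
FROM daily
WHERE TRUE  -- QUALIFY requires a WHERE, GROUP BY, or HAVING clause
QUALIFY bytes_scanned > 1.5 * trailing_avg;
```

In production, scope the aggregation to labeled golden workloads rather than all jobs in the project.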
Gate 5 — Operational integrity (when applicable)
- Idempotency: rerun same window → no net change
- Restart simulation: fail mid-run → resume → correct final state
- Backfill: historical windows replay without drift
- Late-arrival: inject late corrections → only expected rows change
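The idempotency check reduces to a symmetric diff between a pre-rerun snapshot and the rerun output. A sketch, assuming a hypothetical snapshot table `proj.mart.events_snapshot` taken before the rerun:

```sql
-- Hypothetical idempotency check: snapshot a window, rerun the job for the same
-- window, then assert the net change is empty in both directions.
(SELECT * FROM `proj.mart.events_snapshot` WHERE event_date = @run_date
 EXCEPT DISTINCT
 SELECT * FROM `proj.mart.events` WHERE event_date = @run_date)
UNION ALL
(SELECT * FROM `proj.mart.events` WHERE event_date = @run_date
 EXCEPT DISTINCT
 SELECT * FROM `proj.mart.events_snapshot` WHERE event_date = @run_date);
-- Zero rows returned means the rerun produced no net change.
```

The same shape, parameterized over a window, doubles as the backfill and late-arrival assertion ("only expected rows change").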
Gate 6 — Cutover & monitoring
- Canary criteria + rollback triggers
- Post-cutover monitors: latency, scan bytes/slot time, failures, KPI sentinels
Migration steps
- 01
Define parity and cost/pruning contracts
Decide what must match (tables, dashboards, KPIs) and define tolerances. Identify workloads where pruning is mandatory and set scan-byte/slot thresholds.
- 02
Create validation datasets and edge cohorts
Select representative windows and cohorts that trigger edge behavior (ties, null-heavy segments, boundary dates, epoch conversions).
- 03
Implement layered gates
Start with cheap checks (counts/profiles), then KPI diffs, then deep diffs only where needed. Add pruning verification and baseline capture for top workloads.
- 04
Validate operational integrity
Run idempotency reruns, restart simulations, backfill windows, and late-arrival injections where applicable. These scenarios typically break if not tested.
- 05
Gate cutover and monitor
Establish canary/rollback criteria and post-cutover monitors for KPIs and scan-cost sentinels (bytes/slot), plus latency and failures.
We define parity + pruning contracts, build golden queries, and implement layered reconciliation gates—so Hadoop→BigQuery cutover is gated by evidence and scan-cost safety.
Get a validation plan, runnable gates, and sign-off artifacts (diff reports, thresholds, pruning baselines, monitors) so Hadoop→BigQuery cutover is controlled and dispute-proof.