Hadoop legacy SQL queries to BigQuery
Translate Hive/Impala/Spark SQL—partition-driven filters, analytic functions, and Hadoop-era function idioms—into BigQuery Standard SQL with validation gates that prevent drift and scan-cost surprises.
- Input: Hadoop (legacy clusters) SQL / query migration logic
- Output: BigQuery equivalent (validated)
- Common pitfalls: partition predicate loss, pruning-defeating filter shapes, and implicit cast drift (detailed below)
Why this breaks
Hadoop-era SQL estates are shaped by Hive metastore conventions, partition columns as a performance requirement, and dialect-specific functions. BigQuery will compile many translated queries—but drift and cost spikes happen when partition pruning, implicit casts, and time semantics aren’t made explicit.
Common symptoms after cutover:
- Scan costs spike because partition predicates aren’t translated into BigQuery pruning-friendly filters
- KPI drift from NULL/type coercion differences in CASE/COALESCE and join keys
- Window logic behaves differently when ordering is incomplete or ties exist
- Regex and string function behavior changes due to dialect differences
- Timestamp/date intent shifts (DATE vs TIMESTAMP, timezone boundaries)
SQL migration must preserve both meaning and pruning posture so BigQuery stays predictable.
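The timestamp/date symptom is the easiest to see concretely. A minimal sketch (column and literal values are illustrative, not from a real estate): the same instant lands on different reporting days depending on whether DATE truncation uses UTC or a local timezone.

```sql
-- Same instant, different reporting day depending on timezone intent.
SELECT
  ts,
  DATE(ts)                        AS utc_day,   -- 2025-01-02
  DATE(ts, 'America/Los_Angeles') AS local_day  -- 2025-01-01
FROM (SELECT TIMESTAMP '2025-01-02 03:00:00+00' AS ts);
```

If the legacy Hive job implicitly used a cluster-local timezone, boundary-day KPIs shift unless that intent is made explicit in the rewrite.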
How conversion works
- Inventory & prioritize the SQL corpus (BI extracts, views, scheduled reports, ETL SQL) across Hive/Impala/Spark. Rank by business impact and risk patterns (partition filters, time, windows, casts).
- Normalize dialect noise: identifiers/quoting, CTE shapes, and common Hadoop-era function idioms.
- Rewrite with rule-anchored mappings: dialect-specific functions → BigQuery equivalents, explicit cast strategy, and deterministic ordering for windowed filters.
- Partition/pruning rewrites: translate year/month/day/dt predicates into direct BigQuery partition filters and eliminate patterns that defeat pruning.
- Validate with gates: compile/run checks, catalog/type alignment, golden-query parity, and edge-cohort diffs.
- Performance-safe refactors: recommend partitioning/clustering alignment and pruning-friendly query shapes for top workloads.
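As a sketch of the rule-anchored mappings step, here is one common idiom rewrite (the `events` table and `event_epoch` column are hypothetical): Hive's `from_unixtime` depends on the session timezone, so the BigQuery rewrite names both the epoch unit and the reporting timezone explicitly.

```sql
-- Hive/Impala idiom (timezone implicit):
--   SELECT from_unixtime(event_epoch) AS event_time FROM events;

-- BigQuery rewrite: epoch unit and timezone made explicit.
SELECT
  TIMESTAMP_SECONDS(event_epoch)                           AS event_ts,   -- UTC TIMESTAMP
  DATE(TIMESTAMP_SECONDS(event_epoch), 'America/New_York') AS event_date  -- local reporting date
FROM events;
```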
Supported constructs
Representative Hadoop-era SQL constructs we commonly convert to BigQuery Standard SQL (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Partition predicates (dt/year/month columns) | BigQuery partition filters (DATE/TIMESTAMP partitioning) | Rewrite to preserve pruning and predictable scan costs. |
| Window functions (ROW_NUMBER/RANK/OVER) | BigQuery window functions | Deterministic ordering and tie-breakers enforced where needed. |
| Epoch/time conversion helpers | TIMESTAMP_SECONDS/MILLIS and explicit timezone handling | DATE vs TIMESTAMP intent normalized explicitly. |
| NULL/type coercion idioms | Explicit casts + null-safe comparisons | Prevents join drift and filter selectivity changes. |
| String/regex functions | BigQuery string/regex equivalents | Regex differences validated with golden cohorts. |
| LIMIT/TOP-N patterns | LIMIT with explicit ORDER BY | Ordering made explicit for deterministic top-N outputs. |
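The window-function and top-N rows above share one fix: a unique tie-breaker in ORDER BY. A sketch of a deterministic dedupe (table and column names `events`, `user_id`, `event_ts`, `event_id` are assumptions):

```sql
-- Keep the latest row per user_id; event_id breaks timestamp ties
-- so reruns produce identical output.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    e.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_ts DESC, event_id DESC  -- unique tie-breaker
    ) AS rn
  FROM events AS e
)
WHERE rn = 1;
```

BigQuery's `QUALIFY` clause can express the same filter without the subquery, where teams prefer that shape.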
How workload changes
| Topic | Hadoop SQL (Hive/Impala/Spark) | BigQuery |
|---|---|---|
| Performance contract | Partition predicates are mandatory to avoid HDFS scans | Bytes scanned is the cost driver; pruning must be explicit |
| Type behavior | Implicit casts often tolerated | Explicit casts recommended for stable outputs |
| Time semantics | Epoch/timezone assumptions often implicit | Explicit DATE vs TIMESTAMP + timezone conversions |
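The type-behavior row is where silent drift most often hides. A sketch of making a Hive-tolerated join explicit (the `orders`/`customers` tables and their columns are hypothetical):

```sql
-- Hive coerces STRING = BIGINT join keys implicitly; in BigQuery the cast
-- is explicit, and SAFE_CAST yields NULL (dropping the row from an inner
-- join) instead of failing the query on malformed keys.
SELECT o.order_id, c.segment
FROM orders AS o
JOIN customers AS c
  ON SAFE_CAST(o.customer_id AS INT64) = c.customer_id
WHERE o.order_date >= DATE '2025-01-01';
```

Whether bad keys should drop rows or fail the job is a semantic contract to agree on before conversion, not a translation detail.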
Examples
Representative Hadoop SQL → BigQuery rewrites for partition filters, epoch conversion, and windowed dedupe. Adjust identifiers and types to your schema.

```sql
-- Hadoop-era pattern: year/month/day partitions
SELECT COUNT(*)
FROM events
WHERE year = 2025 AND month = 1 AND day BETWEEN 1 AND 7;
```

Common pitfalls
- Partition predicate loss: queries that relied on Hadoop partition columns now scan entire BigQuery tables.
- Defeating pruning: wrapping partition columns in functions or casting in filters prevents partition elimination.
- Implicit cast drift: Hive/Impala coercion differs; BigQuery needs explicit casts for stable outputs.
- NULL semantics in joins: equality behavior differs; joins can drop or duplicate rows.
- Window ordering ambiguity: ROW_NUMBER/RANK without stable tie-breakers causes nondeterministic drift.
- Regex dialect differences: escaping and matching behavior changes edge outputs.
- Timezone assumptions: epoch conversions and boundary-day reporting drift if not standardized.
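The year/month/day example above illustrates the first two pitfalls. Assuming the target table is partitioned on a DATE column (called `event_date` here for illustration), the pruning-friendly rewrite filters that column directly:

```sql
-- Filter the partition column bare (no functions, no casts) so the
-- planner can prune partitions and cap bytes scanned.
SELECT COUNT(*)
FROM my_dataset.events
WHERE event_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-07';

-- Anti-pattern: wrapping the partition column defeats pruning.
-- WHERE CAST(event_date AS STRING) LIKE '2025-01%'
```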
Validation approach
- Compilation gates: converted queries compile under BigQuery Standard SQL.
- Catalog/type checks: referenced objects exist; implicit casts surfaced and made explicit.
- Golden-query parity: critical dashboards and reports match outputs or agreed tolerances.
- KPI aggregates: compare aggregates by key dimensions and partitions (date, region, product, cohorts).
- Edge-cohort diffs: validate ties, null-heavy segments, boundary dates, and timezone transitions.
- Pruning/performance baseline: capture bytes scanned and runtime for top queries; set regression thresholds.
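Golden-query parity can be checked with a symmetric set difference once legacy and converted outputs are landed in BigQuery (the `golden.legacy_result` / `golden.converted_result` tables are placeholder names):

```sql
-- Zero rows returned means exact row-level parity for this query pair;
-- any rows returned are the diff to triage.
(SELECT * FROM golden.legacy_result
 EXCEPT DISTINCT
 SELECT * FROM golden.converted_result)
UNION ALL
(SELECT * FROM golden.converted_result
 EXCEPT DISTINCT
 SELECT * FROM golden.legacy_result);
```

Note that `EXCEPT DISTINCT` collapses duplicates, so queries where duplicate-row counts matter need an additional count-per-key comparison.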
Migration steps
- 01
Collect and prioritize the query estate
Export BI SQL, view definitions, scheduled report queries, and ETL SQL across Hive/Impala/Spark. Rank by business impact, frequency, and risk patterns.
- 02
Define pruning and semantic contracts
Agree on partition/filter contracts, casting rules, null-safe join expectations, and timezone intent. Identify golden queries for sign-off.
- 03
Convert with rule-anchored mappings
Apply deterministic rewrites for common Hadoop-era idioms and flag ambiguous intent with review markers (implicit casts, ordering ambiguity, regex semantics).
- 04
Validate with golden queries and edge cohorts
Compile and run in BigQuery, compare KPI aggregates, and test edge cohorts (ties, null-heavy segments, boundary dates).
- 05
Tune top queries for BigQuery
Ensure partition filters are pruning-friendly, align partitioning/clustering to access paths, and recommend pre-aggregations/materializations for the most expensive BI queries.
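Step 05's partitioning/clustering alignment can be sketched as DDL (table and column names are illustrative):

```sql
-- Partition on the reporting date and cluster on the most selective
-- BI filter columns so top queries prune and then block-skip.
CREATE TABLE my_dataset.events_tuned
PARTITION BY event_date
CLUSTER BY region, product_id
AS
SELECT * FROM my_dataset.events;
```

Clustering order should follow the dominant filter order of the top workloads identified in step 01.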
We inventory your SQL estate, convert a representative slice, and deliver parity evidence on golden queries—plus a pruning/cost risk register so BigQuery spend stays predictable.
Get a conversion plan, review markers, and validation artifacts so query cutover is gated by evidence and rollback-ready criteria—without scan-cost surprises.