Databricks SQL queries to BigQuery
Translate Spark SQL idioms—MERGE/upserts, window logic, array/map patterns, and timezone handling—into BigQuery Standard SQL with validation gates that prevent semantic drift and scan-cost surprises.
- Input
  - Databricks SQL / query migration logic
- Output
  - BigQuery equivalent (validated)
- Common pitfalls
  - Type coercion drift: Spark implicit casts inside CASE/COALESCE differ; BigQuery may require explicit casts to preserve intent.
  - NULL semantics in joins: join keys and filters change match behavior if null-safe rules aren’t explicit.
  - Window ordering ambiguity: ROW_NUMBER/RANK without stable tie-breakers causes nondeterministic drift.
Why this breaks
Databricks SQL estates are typically Spark SQL plus Delta-centric patterns. Many queries translate syntactically, but drift happens when implicit semantics differ: type coercion inside CASE/COALESCE, NULL behavior in joins, timestamp/timezone assumptions, and array/map handling. In BigQuery, scan bytes and pruning posture also become first-class concerns.
Common symptoms after cutover:
- KPI drift from window logic and top-N selections with incomplete ordering
- MERGE behavior changes because match keys, casts, or update predicates were implicit
- Arrays/maps and semi-structured fields don’t translate 1:1
- Date/time logic shifts due to timezone assumptions and TIMESTAMP vs DATE intent
- Costs spike because Spark-era shapes don’t maximize BigQuery pruning
SQL migration must preserve both meaning and a BigQuery-native execution posture.
How conversion works
- Inventory & prioritize the SQL corpus (dbt models, notebooks, BI extracts, jobs, views). Rank by business impact and risk patterns (MERGE, windows, arrays/maps, time).
- Normalize Spark SQL: identifier/quoting rules, function idioms, and query shapes to reduce dialect noise.
- Rewrite with rule-anchored mappings: Spark SQL → BigQuery function equivalents, explicit cast strategy, and deterministic ordering for windowed filters.
- Apply pattern libraries for high-risk constructs: MERGE/upserts, late-arrival dedupe, arrays/maps, timezone normalization, and null-safe comparisons.
- Validate with gates: compilation, catalog/type alignment, and golden-query parity (plus edge cohorts for ties/null-heavy segments).
- Performance-safe refactors: pruning-aware filters, join strategy adjustments, and pre-aggregation/materialization guidance for expensive BI queries.
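As a small illustration of a rule-anchored mapping, here is how a Spark null-safe equality join can be made explicit in BigQuery (table and column names are hypothetical):

```sql
-- Databricks / Spark SQL: null-safe join on a nullable key
SELECT a.*
FROM orders a
JOIN customers b
  ON a.customer_id <=> b.customer_id;

-- BigQuery: the <=> operator does not exist; spell the semantics out
SELECT a.*
FROM orders a
JOIN customers b
  ON (a.customer_id = b.customer_id
      OR (a.customer_id IS NULL AND b.customer_id IS NULL));
```

BigQuery also offers `IS NOT DISTINCT FROM` for the same null-safe semantics, which is the more concise rewrite where available.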
Supported constructs
Representative Databricks/Spark SQL constructs we commonly convert to BigQuery Standard SQL (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Spark SQL window filters (ROW_NUMBER/RANK) | BigQuery window functions + QUALIFY | Deterministic ordering and tie-breakers enforced. |
| Delta MERGE and upsert queries | BigQuery MERGE with staged apply | Match keys, casts, and update predicates made explicit and testable. |
| explode/arrays/maps/structs | UNNEST + ARRAY/STRUCT handling (or normalized staging) | Empty-array semantics preserved intentionally. |
| DATE/TIMESTAMP arithmetic | BigQuery date/time functions | Timezone intent normalized explicitly. |
| NULL-safe comparisons | Explicit null-safe equality patterns | Prevents join drift and duplicate inserts. |
| String/regex functions | BigQuery string/regex equivalents | Edge-case behavior validated via golden cohorts. |
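To make the MERGE row concrete, here is a sketch of a Delta-style upsert and its BigQuery counterpart with the match key, casts, and update predicate made explicit (all project, table, and column names are illustrative):

```sql
-- Databricks: Delta MERGE with implicit columns and casts
MERGE INTO dim_customer t
USING staged_updates s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- BigQuery: explicit columns, explicit casts, explicit update predicate
MERGE `proj.ds.dim_customer` t
USING `proj.ds.staged_updates` s
  ON t.customer_id = CAST(s.customer_id AS INT64)
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, updated_at)
  VALUES (CAST(s.customer_id AS INT64), s.name, s.updated_at);
```

The guarded `WHEN MATCHED AND …` predicate also makes late-arrival dedupe testable instead of relying on upstream ordering.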
How workload changes
| Topic | Databricks / Spark SQL | BigQuery |
|---|---|---|
| Execution assumptions | Shuffle-heavy plans and partition overwrite patterns common | Pruning-first filters + partitioning/clustering alignment |
| Semi-structured handling | Struct/array/map types and explode patterns | ARRAY/STRUCT + UNNEST or JSON-based modeling |
| Time semantics | Timezone assumptions often implicit | Explicit timezone conversions and DATE vs TIMESTAMP intent |
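One sketch of making timezone intent explicit, assuming daily reporting in a named zone (zone, project, and column names are illustrative):

```sql
-- Databricks: the session timezone silently shapes the reporting date
SELECT to_date(event_ts) AS event_date, COUNT(*) AS events
FROM events
GROUP BY to_date(event_ts);

-- BigQuery: name the zone so boundary days are stable across environments
SELECT DATE(event_ts, 'America/New_York') AS event_date, COUNT(*) AS events
FROM `proj.ds.events`
GROUP BY event_date;
```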
Examples
Representative Databricks → BigQuery rewrites for windowed dedupe, explode/unnest, and MERGE patterns. Adjust keys, paths, and casts to your model.
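A sketch of the explode → UNNEST rewrite, including the empty-array case (names are illustrative):

```sql
-- Databricks: explode drops rows whose array is empty or NULL
SELECT order_id, item
FROM orders
LATERAL VIEW explode(items) x AS item;

-- BigQuery: a comma join with UNNEST matches that drop behavior;
-- LEFT JOIN UNNEST keeps rows with empty or NULL arrays instead
SELECT order_id, item
FROM `proj.ds.orders`
LEFT JOIN UNNEST(items) AS item;
```

Whether empty-array rows should survive is a semantic decision; the rewrite should state it, not inherit it.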
```sql
-- Databricks: latest row per key
SELECT *
FROM events
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY business_key
  -- add a stable tie-breaker (e.g. a unique id) after event_ts
  -- so ties resolve deterministically
  ORDER BY event_ts DESC
) = 1;
```
Common pitfalls
- Type coercion drift: Spark implicit casts inside CASE/COALESCE differ; BigQuery may require explicit casts to preserve intent.
- NULL semantics in joins: join keys and filters change match behavior if null-safe rules aren’t explicit.
- Window ordering ambiguity: ROW_NUMBER/RANK without stable tie-breakers causes nondeterministic drift.
- Array/map handling: Spark explode, map types, and nested structs need explicit UNNEST/JSON rewrites.
- Timezone assumptions: session timezone differences change boundary-day reporting unless standardized.
- Pruning defeated: WHERE filters that wrap partition columns in functions or casts prevent pruning, causing scan bytes to explode.
- Over-trusting “it runs”: compilation success is not parity; validate with golden outputs and edge cohorts.
Validation approach
- Compilation gates: converted queries compile and execute in BigQuery reliably under representative parameters.
- Catalog/type checks: referenced objects exist; implicit casts are surfaced and made explicit.
- Golden-query parity: business-critical queries/dashboards match outputs or agreed tolerances.
- KPI aggregates: compare aggregates by key dimensions (date, region, product, cohorts).
- Edge-cohort diffs: validate ties, null-heavy segments, boundary dates, and timezone transitions.
- Performance baseline: capture runtime/bytes scanned/slot time for top queries; set regression thresholds.
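A minimal golden-query parity diff can run in BigQuery itself, assuming the Databricks result set has been exported to a comparison table (all names are hypothetical):

```sql
-- Rows in the golden (Databricks-exported) result missing from the
-- converted query, and vice versa; both halves should return zero rows
(SELECT * FROM `proj.qa.golden_revenue_by_day`
 EXCEPT DISTINCT
 SELECT * FROM `proj.marts.revenue_by_day`)
UNION ALL
(SELECT * FROM `proj.marts.revenue_by_day`
 EXCEPT DISTINCT
 SELECT * FROM `proj.qa.golden_revenue_by_day`);
```

For metrics with agreed tolerances, compare rounded aggregates by key dimensions instead of exact row sets.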
Migration steps
- 01
Collect and prioritize the query estate
Export dbt models, notebooks, job SQL, view definitions, and BI SQL. Rank by business impact, frequency, and risk patterns (MERGE, windows, arrays/maps, time).
- 02
Define semantic and pruning contracts
Make tie-breakers, NULL handling, casting strategy, timezone intent, and array/map handling explicit. Define pruning expectations and scan-byte thresholds for top queries.
- 03
Convert with rule-anchored mappings
Apply deterministic rewrites for common constructs and flag ambiguous intent with review markers (implicit casts, ordering ambiguity, timezone assumptions).
- 04
Validate with golden queries and edge cohorts
Compile and run in BigQuery, compare KPI aggregates, and run targeted diffs on edge cohorts (ties, null-heavy segments, boundary dates).
- 05
Tune top queries for BigQuery
Align partitioning/clustering to access paths, enforce pruning-first filters, and recommend materializations for the heaviest BI workloads.
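Step 05 often comes down to rewrites like the following, assuming a table partitioned on a DATE column (illustrative names):

```sql
-- Defeats pruning: the partition column is wrapped in a function
SELECT SUM(amount)
FROM `proj.ds.sales`
WHERE FORMAT_DATE('%Y-%m', sale_date) = '2024-06';

-- Pruning-first: filter the bare partition column with a range
SELECT SUM(amount)
FROM `proj.ds.sales`
WHERE sale_date >= '2024-06-01' AND sale_date < '2024-07-01';
```

Both queries return the same total, but only the second lets BigQuery skip partitions outside June, which is what keeps scan bytes (and cost) bounded.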
We inventory your SQL estate, convert a representative slice, and deliver parity evidence on golden queries—plus pruning baselines so BigQuery spend stays predictable.
Get a conversion plan, review markers, and validation artifacts so query cutover is gated by evidence and rollback-ready criteria—without scan-cost surprises.