Stored procedures, macros & UDFs for Databricks → BigQuery
Re-home reusable logic—from Spark SQL UDFs and notebook “macro” utilities to BigQuery UDFs and stored procedures—so behavior stays stable under retries, backfills, and changing upstream schemas.
- Input: Databricks stored procedure / UDF migration logic
- Output: BigQuery equivalent (validated)
Why this breaks
In Databricks estates, “procedural” logic is often distributed: Spark SQL UDFs, notebook helper functions, dbt macros, and job utilities embedded across pipelines. During migration, teams focus on SQL translation and MERGE behavior, but these reusable assets quietly change outputs because typing, NULL semantics, and time handling differ in BigQuery.
Common symptoms after migration:
- UDF outputs drift due to type coercion and NULL handling differences
- Notebook macros that relied on Spark functions or implicit session settings stop matching
- Error handling changes (fail-fast vs continue), breaking orchestration expectations
- Dynamic SQL and identifier resolution behave differently in BigQuery
- Side effects (audit/control tables) aren’t modeled, so retries/backfills double-apply or skip
A successful migration consolidates scattered utilities into explicit BigQuery routines with a behavior contract and a replayable harness that proves parity.
How conversion works
- Inventory & classify reusable logic: Spark SQL UDFs, notebook utilities/macros, dbt macros, and call sites across pipelines and BI queries.
- Extract the behavior contract: inputs/outputs, typing, null rules, side effects, error semantics, and performance constraints.
- Choose BigQuery target form per asset:
- SQL UDF for pure expressions
- JavaScript UDF for complex logic (string/regex/object handling)
- Stored procedure (SQL scripting) for control flow and multi-statement logic
- Set-based refactor where loop/cursor patterns exist
- Translate and normalize: casts, null-safe comparisons, timezone intent, and deterministic ordering where logic depends on ranking/dedup.
- Validate with a harness: golden inputs/outputs, branch coverage, failure-mode tests, and side-effect assertions—then integrate into representative pipelines.
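As a minimal sketch of the translate-and-normalize step: Spark's null-safe equality operator `<=>` maps to GoogleSQL's `IS NOT DISTINCT FROM`, and implicit coercions become explicit casts. The dataset, table, and column names below are placeholders.

```sql
-- Spark SQL source (for reference):
--   SELECT * FROM src WHERE a <=> b AND c = '42'
-- BigQuery translation, with the same intent made explicit:
SELECT *
FROM `proj.dataset.src`
WHERE a IS NOT DISTINCT FROM b     -- null-safe equality: NULL vs NULL is TRUE
  AND c = CAST('42' AS INT64);     -- explicit cast where Spark coerced implicitly
```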
Supported constructs
Representative Databricks procedural/macro constructs we commonly migrate to BigQuery routines (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Spark SQL UDFs | BigQuery SQL UDFs | Pure expressions mapped with explicit casts and NULL behavior. |
| Notebook macro utilities | BigQuery procedures + views/UDFs | Hidden dependencies made explicit; interfaces stabilized. |
| dbt macros (logic reuse) | dbt macros, BigQuery UDFs, or views (depending on use) | Refactor into reusable, testable assets with stable contracts. |
| Dynamic SQL generation | EXECUTE IMMEDIATE with safe parameterization | Normalize identifier resolution; reduce drift and injection risk. |
| Control-table driven rerun behavior | Idempotency markers + applied-window tracking | Retries/backfills become safe and auditable. |
| Timestamp/timezone logic | Explicit timezone conversions and intent | Prevents boundary-day drift in reporting and SCD logic. |
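To illustrate the dynamic SQL row above, a sketch of EXECUTE IMMEDIATE with safe parameterization: identifiers are resolved through FORMAT, while values travel through bound parameters rather than string concatenation. The variable and table names are illustrative.

```sql
DECLARE target_table STRING DEFAULT 'proj.dataset.events';  -- illustrative identifier
DECLARE row_count INT64;

EXECUTE IMMEDIATE FORMAT(
  'SELECT COUNT(*) FROM `%s` WHERE event_date = @d',        -- identifier via FORMAT
  target_table)
INTO row_count
USING DATE '2024-01-01' AS d;   -- value via binding, not concatenated text
```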
How workload changes
| Topic | Databricks | BigQuery |
|---|---|---|
| Where logic lives | Notebooks, jobs, Spark UDFs, scattered utility code | Centralized routines (UDFs/procedures) with explicit contracts |
| Typing and coercion | Spark implicit casts often tolerated | Explicit casts recommended for stable outputs |
| Operational behavior | Reruns/retries often encoded in job structure | Idempotency and side effects must be explicit |
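Because idempotency must be explicit in BigQuery, the applied-window pattern can be sketched as a procedure that checks a control table before applying work. The dataset, table, and parameter names here are assumptions for illustration only.

```sql
-- Assumes a control table proj.ops.applied_windows
-- (pipeline STRING, window_start DATE, applied_at TIMESTAMP).
CREATE OR REPLACE PROCEDURE `proj.ops.run_window`(p_pipeline STRING, p_window_start DATE)
BEGIN
  IF NOT EXISTS (
    SELECT 1 FROM `proj.ops.applied_windows`
    WHERE pipeline = p_pipeline AND window_start = p_window_start
  ) THEN
    -- ... apply the window's transformation here ...
    INSERT INTO `proj.ops.applied_windows` (pipeline, window_start, applied_at)
    VALUES (p_pipeline, p_window_start, CURRENT_TIMESTAMP());
  END IF;
END;
```

With this pattern, a retry of the same window becomes a no-op instead of a double-apply.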
Examples
Illustrative patterns for moving Databricks UDFs and macro utilities into BigQuery routines. Adjust datasets, types, and identifiers to match your environment.
```sql
-- BigQuery SQL UDF example
CREATE OR REPLACE FUNCTION `proj.util.safe_div`(n NUMERIC, d NUMERIC) AS (
  IF(d IS NULL OR d = 0, NULL, n / d)
);
```
Common pitfalls
- Hidden dependencies: notebook macros rely on implicit session configs, temp views, or catalogs.
- Mixed-type branches: CASE/IF returns different types; BigQuery needs explicit casts.
- NULL semantics drift: null-safe equality and string functions behave differently unless explicit.
- Dynamic SQL risks: string concatenation without bindings/escaping causes drift and security risk.
- Side effects ignored: control/audit writes not recreated; reruns become unsafe.
- Row-by-row patterns: loops in notebooks are costly in BigQuery; refactor to set-based SQL.
- No harness: without replayable tests, parity becomes a debate at cutover.
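The mixed-type branch pitfall is concrete: Spark commonly coerces CASE/IF branches of different types to a common type, while BigQuery rejects the expression unless the branches share a supertype. A minimal fix, with placeholder column and table names:

```sql
-- Fails in BigQuery: CASE WHEN qty > 0 THEN qty ELSE 'n/a' END  (INT64 vs STRING)
SELECT CASE WHEN qty > 0 THEN CAST(qty AS STRING) ELSE 'n/a' END AS qty_label
FROM `proj.dataset.orders`;  -- explicit cast gives both branches the same type
```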
Validation approach
- Compile + interface checks: each routine deploys; argument/return types match the contract.
- Golden tests: curated input sets validate outputs, including NULL-heavy and boundary values.
- Branch + failure-mode coverage: exercise key branches and expected failures.
- Side-effect verification: assert expected writes to control/log/audit tables and idempotency under retries.
- Integration replay: run routines inside representative pipelines and compare downstream KPIs/aggregates.
- Performance gate: confirm no hidden row-by-row scans; refactors validated with bytes scanned/runtime baselines.
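Golden tests can lean on BigQuery's ASSERT statement, which fails the script when a condition is false. A sketch against the `safe_div` UDF from the example above:

```sql
ASSERT (SELECT `proj.util.safe_div`(1, 0) IS NULL)
  AS 'safe_div: zero divisor must yield NULL';
ASSERT (SELECT `proj.util.safe_div`(10, 4) = 2.5)
  AS 'safe_div: 10/4 must equal 2.5';
```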
Migration steps
1. Inventory reusable logic and call sites. Collect Spark SQL UDFs, notebook utilities/macros, and dbt macros, plus call sites across pipelines and BI queries. Identify dependencies and side effects (control/audit tables).
2. Define the behavior contract. Specify inputs/outputs, typing, NULL rules, error semantics, side effects, and performance expectations. Decide the target form (SQL UDF, JS UDF, procedure, or set-based refactor).
3. Convert with safety patterns. Make casts explicit, normalize timezone intent, implement null-safe comparisons, and migrate dynamic SQL to EXECUTE IMMEDIATE with bindings and explicit identifier rules.
4. Build a replayable harness. Create golden input sets, boundary cases, and expected failures. Validate outputs and side effects deterministically so parity isn't debated at cutover.
5. Integrate and cut over behind gates. Run routines in representative pipelines, compare downstream KPIs, validate retries/backfills, and cut over with rollback-ready criteria.
We inventory your UDFs and macros, migrate a representative subset into BigQuery routines, and deliver a harness that proves parity—including side effects and failure behavior.
Get a conversion plan, review markers for ambiguous intent, and validation artifacts so procedural logic cutover is gated by evidence and rollback criteria.