UDFs & procedural utilities for Spark SQL → BigQuery
Re-home reusable logic—from Spark SQL UDFs and notebook “macro” utilities to BigQuery UDFs and stored procedures—so behavior stays stable under reruns, backfills, and changing upstream schemas.
- Input
- Spark SQL stored procedure / UDF logic to migrate
- Output
- BigQuery equivalent (validated)
Why this breaks
Spark estates frequently hide business rules in “reusable logic”: SQL UDFs, notebook helper functions, and script-driven utilities that generate SQL and apply control-table semantics. During migration, teams translate SQL statements but miss these utilities—then outputs drift because typing, NULL behavior, regex semantics, and timezone handling differ in BigQuery.
Common symptoms after migration:
- UDF outputs drift due to type coercion and NULL handling differences
- Notebook macros that relied on Spark session settings or temporary views stop matching
- Error handling changes (fail-fast vs continue), breaking orchestration expectations
- Dynamic SQL and identifier resolution behave differently in BigQuery
- Side effects (audit/control tables) aren’t modeled, so retries/backfills double-apply or skip
A successful migration consolidates scattered utilities into explicit BigQuery routines with a behavior contract and a replayable harness that proves parity.
How conversion works
- Inventory & classify reusable logic: Spark SQL UDFs, notebook utilities/macros, and call sites across pipelines and BI queries.
- Extract the behavior contract: inputs/outputs, typing, NULL rules, side effects, error semantics, and performance constraints.
- Choose BigQuery target form per asset:
- SQL UDF for pure expressions
- JavaScript UDF for complex logic (string/regex/object handling)
- Stored procedure (SQL scripting) for multi-statement control flow and dynamic SQL
- Set-based refactor where loop/cursor patterns exist in notebooks/scripts
- Translate and normalize: explicit casts, null-safe comparisons, timezone intent, and deterministic ordering where logic depends on ranking/dedup.
- Validate with a harness: golden inputs/outputs, branch coverage, failure-mode tests, and side-effect assertions—then integrate into representative pipelines.
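As a small illustration of the "translate and normalize" step, a mixed-type branch that Spark coerced implicitly gets an explicit return type in BigQuery (table and column names here are hypothetical):

```sql
-- Spark often tolerated mixed INT/STRING branches via implicit coercion.
-- In BigQuery, make every branch the same explicit type so outputs stay stable.
SELECT
  CASE
    WHEN status_code IS NULL THEN CAST(NULL AS STRING)
    WHEN status_code = 0 THEN 'ok'
    ELSE CAST(status_code AS STRING)  -- explicit cast, not implicit coercion
  END AS status_label
FROM `proj.dataset.events`;  -- hypothetical table
```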
Supported constructs
Representative Spark procedural/macro constructs we commonly migrate to BigQuery routines (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Spark SQL UDFs | BigQuery SQL UDFs | Pure expressions mapped with explicit casts and NULL behavior. |
| Notebook macro utilities | BigQuery procedures + views/UDFs | Hidden dependencies made explicit; interfaces stabilized. |
| Regex-heavy transforms | BigQuery REGEXP_* functions or JS UDFs | Regex edge cases validated with golden cohorts. |
| Dynamic SQL generation | EXECUTE IMMEDIATE with safe parameterization | Normalize identifier resolution; reduce drift and injection risk. |
| Control-table driven reruns | Idempotency markers + applied-window tracking | Retries/backfills become safe and auditable. |
| Timestamp/timezone helpers | Explicit timezone conversions and intent | Prevents boundary-day drift in reporting. |
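The timestamp/timezone row above usually reduces to stating the reporting zone explicitly instead of relying on session defaults; a sketch against a hypothetical events table:

```sql
-- Make timezone intent explicit to avoid boundary-day drift:
-- convert the stored UTC timestamp to the reporting zone before truncating to a date.
SELECT
  DATE(event_ts, 'America/New_York') AS report_date,  -- explicit reporting zone
  COUNT(*) AS events
FROM `proj.dataset.events`  -- hypothetical table
GROUP BY report_date;
```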
How workload changes
| Topic | Spark SQL | BigQuery |
|---|---|---|
| Where logic lives | Notebooks, jobs, Spark UDFs, scattered utility code | Centralized routines (UDFs/procedures) with explicit contracts |
| Typing and coercion | Spark implicit casts often tolerated | Explicit casts recommended for stable outputs |
| Operational behavior | Reruns/retries often encoded in job structure | Idempotency and side effects must be explicit |
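A minimal sketch of making idempotency explicit, assuming a hypothetical control table that tracks applied backfill windows:

```sql
-- Record the applied window in a control table with MERGE so a retried
-- backfill neither double-applies nor silently skips.
MERGE `proj.ctrl.applied_windows` AS t  -- hypothetical control table
USING (
  SELECT DATE '2024-01-01' AS window_start, DATE '2024-01-07' AS window_end
) AS s
ON t.window_start = s.window_start AND t.window_end = s.window_end
WHEN NOT MATCHED THEN
  INSERT (window_start, window_end, applied_at)
  VALUES (s.window_start, s.window_end, CURRENT_TIMESTAMP());
```

The actual transform then runs only for windows not yet marked applied, making reruns safe and auditable.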
Examples
Illustrative patterns for moving Spark-era UDF and macro utilities into BigQuery routines. Adjust datasets, types, and identifiers to match your environment.
```sql
-- BigQuery SQL UDF example: division that returns NULL instead of erroring.
CREATE OR REPLACE FUNCTION `proj.util.safe_div`(n NUMERIC, d NUMERIC) AS (
  -- A NULL or zero denominator yields NULL rather than a division-by-zero error.
  IF(d IS NULL OR d = 0, NULL, n / d)
);
```

Common pitfalls
- Hidden dependencies: notebook macros rely on implicit session configs, temp views, or catalogs.
- Mixed-type branches: CASE/IF returns different types; BigQuery needs explicit casts.
- NULL semantics drift: null-safe equality and string functions behave differently unless explicit.
- Dynamic SQL risks: string concatenation without bindings/escaping causes drift and security risk.
- Side effects ignored: control/audit writes not recreated; reruns become unsafe.
- Row-by-row patterns: loops in notebooks are costly in BigQuery; refactor to set-based SQL.
- No harness: without replayable tests, parity becomes a debate at cutover.
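Spark's null-safe equality operator (`a <=> b`) is a common source of the NULL-semantics drift above; in BigQuery the same semantics can be written explicitly:

```sql
-- Spark's null-safe equality (a <=> b) expressed explicitly in BigQuery.
SELECT
  a,
  b,
  COALESCE(a = b, FALSE) OR (a IS NULL AND b IS NULL) AS null_safe_eq
FROM UNNEST([
  STRUCT(1 AS a, 1 AS b),                                      -- TRUE
  STRUCT(CAST(NULL AS INT64) AS a, CAST(NULL AS INT64) AS b),  -- TRUE
  STRUCT(1 AS a, CAST(NULL AS INT64) AS b)                     -- FALSE
]);
```

Without the `COALESCE`, the mixed NULL/non-NULL case would return NULL instead of FALSE, which is exactly the kind of silent drift a harness should catch.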
Validation approach
- Compile + interface checks: each routine deploys; argument/return types match the contract.
- Golden tests: curated input sets validate outputs, including NULL-heavy and boundary values.
- Branch + failure-mode coverage: exercise key branches and expected failures.
- Side-effect verification: assert expected writes to control/log/audit tables and idempotency under retries/backfills.
- Integration replay: run routines inside representative pipelines and compare downstream KPIs/aggregates.
- Performance gate: confirm no hidden row-by-row scans; refactors validated with bytes scanned/runtime baselines.
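Golden tests can be expressed directly in BigQuery scripting with `ASSERT`, here exercising the `safe_div` UDF from the example above, including NULL-heavy and boundary inputs:

```sql
-- Golden-test sketch: each ASSERT fails the script with its message on mismatch.
ASSERT (SELECT `proj.util.safe_div`(10, 2)) = 5
  AS 'safe_div(10, 2) should be 5';
ASSERT (SELECT `proj.util.safe_div`(10, 0)) IS NULL
  AS 'division by zero should yield NULL, not an error';
ASSERT (SELECT `proj.util.safe_div`(NULL, 2)) IS NULL
  AS 'NULL numerator should propagate NULL';
```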
Migration steps
- 01
Inventory reusable logic and call sites
Collect Spark SQL UDFs, notebook utilities/macros, and call sites across pipelines and BI queries. Identify dependencies and side effects (control/audit tables).
- 02
Define the behavior contract
Specify inputs/outputs, typing, NULL rules, regex expectations, error semantics, side effects, and performance expectations. Decide target form (SQL UDF, JS UDF, procedure, or refactor).
- 03
Convert with safety patterns
Make casts explicit, normalize timezone intent, implement null-safe comparisons, and migrate dynamic SQL using EXECUTE IMMEDIATE with bindings and explicit identifier rules.
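The dynamic SQL guidance in this step can be sketched as follows (table name and date are placeholders): values flow through `USING` bindings, while identifiers should come from an allow-list rather than caller input.

```sql
-- Dynamic SQL with parameter bindings instead of string concatenation.
DECLARE row_count INT64;

EXECUTE IMMEDIATE
  'SELECT COUNT(*) FROM `proj.dataset.events` WHERE event_date = @d'  -- hypothetical table
  INTO row_count
  USING DATE '2024-01-01' AS d;  -- bound value, never concatenated into the SQL string

SELECT row_count;
```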
- 04
Build a replayable harness
Create golden input sets, boundary cases, and expected failures. Validate outputs and side effects deterministically so parity isn’t debated at cutover.
- 05
Integrate and cut over behind gates
Run routines in representative pipelines, compare downstream KPIs, validate retries/backfills, and cut over with rollback-ready criteria.
We inventory your UDFs and macros, migrate a representative subset into BigQuery routines, and deliver a harness that proves parity—including side effects and failure behavior.
You get a conversion plan, review markers for ambiguous intent, and validation artifacts, so the procedural-logic cutover is gated by evidence and rollback criteria.