UDFs & procedural utilities for Spark SQL → BigQuery
Re-home reusable logic—from Spark SQL UDFs and notebook “macro” utilities to BigQuery UDFs and stored procedures—so behavior stays stable under reruns, backfills, and changing upstream schemas.
- Input
- Spark SQL stored procedure / UDF logic to migrate
- Output
- BigQuery equivalent (validated)
Why this breaks
Spark estates frequently hide business rules in “reusable logic”: SQL UDFs, notebook helper functions, and script-driven utilities that generate SQL and apply control-table semantics. During migration, teams translate SQL statements but miss these utilities—then outputs drift because typing, NULL behavior, regex semantics, and timezone handling differ in BigQuery.
Common symptoms after migration:
- UDF outputs drift due to type coercion and NULL handling differences
- Notebook macros that relied on Spark session settings or temporary views stop matching
- Error handling changes (fail-fast vs continue), breaking orchestration expectations
- Dynamic SQL and identifier resolution behave differently in BigQuery
- Side effects (audit/control tables) aren’t modeled, so retries/backfills double-apply or skip
A successful migration consolidates scattered utilities into explicit BigQuery routines with a behavior contract and a replayable harness that proves parity.
How conversion works
- Inventory & classify reusable logic: Spark SQL UDFs, notebook utilities/macros, and call sites across pipelines and BI queries.
- Extract the behavior contract: inputs/outputs, typing, NULL rules, side effects, error semantics, and performance constraints.
- Choose BigQuery target form per asset:
- SQL UDF for pure expressions
- JavaScript UDF for complex logic (string/regex/object handling)
- Stored procedure (SQL scripting) for multi-statement control flow and dynamic SQL
- Set-based refactor where loop/cursor patterns exist in notebooks/scripts
- Translate and normalize: explicit casts, null-safe comparisons, timezone intent, and deterministic ordering where logic depends on ranking/dedup.
- Validate with a harness: golden inputs/outputs, branch coverage, failure-mode tests, and side-effect assertions—then integrate into representative pipelines.
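As a small illustration of the "translate and normalize" step, a mixed-type branch that Spark coerced implicitly gets an explicit return type in BigQuery (table and column names here are hypothetical):

```sql
-- Spark often tolerated mixed INT/STRING branches via implicit coercion.
-- In BigQuery, make every branch the same explicit type so outputs stay stable.
SELECT
  CASE
    WHEN status_code IS NULL THEN CAST(NULL AS STRING)
    WHEN status_code = 0 THEN 'ok'
    ELSE CAST(status_code AS STRING)  -- explicit cast, not implicit coercion
  END AS status_label
FROM `proj.dataset.events`;  -- hypothetical table
```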
Supported constructs
Representative Spark procedural/macro constructs we commonly migrate to BigQuery routines (exact coverage depends on your estate).
| Source | Target | Notes |
|---|---|---|
| Spark SQL UDFs | BigQuery SQL UDFs | Pure expressions mapped with explicit casts and NULL behavior. |
| Notebook macro utilities | BigQuery procedures + views/UDFs | Hidden dependencies made explicit; interfaces stabilized. |
| Regex-heavy transforms | BigQuery REGEXP_* functions or JS UDFs | Regex edge cases validated with golden cohorts. |
| Dynamic SQL generation | EXECUTE IMMEDIATE with safe parameterization | Normalize identifier resolution; reduce drift and injection risk. |
| Control-table driven reruns | Idempotency markers + applied-window tracking | Retries/backfills become safe and auditable. |
| Timestamp/timezone helpers | Explicit timezone conversions and intent | Prevents boundary-day drift in reporting. |
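The timestamp/timezone row above usually reduces to stating the reporting zone explicitly instead of relying on session defaults; a sketch against a hypothetical events table:

```sql
-- Make timezone intent explicit to avoid boundary-day drift:
-- convert the stored UTC timestamp to the reporting zone before truncating to a date.
SELECT
  DATE(event_ts, 'America/New_York') AS report_date,  -- explicit reporting zone
  COUNT(*) AS events
FROM `proj.dataset.events`  -- hypothetical table
GROUP BY report_date;
```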
How workload changes
| Topic | Spark SQL | BigQuery |
|---|---|---|
| Where logic lives | Notebooks, jobs, Spark UDFs, scattered utility code | Centralized routines (UDFs/procedures) with explicit contracts |
| Typing and coercion | Spark implicit casts often tolerated | Explicit casts recommended for stable outputs |
| Operational behavior | Reruns/retries often encoded in job structure | Idempotency and side effects must be explicit |
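A minimal sketch of making idempotency explicit, assuming a hypothetical control table that tracks applied backfill windows:

```sql
-- Record the applied window in a control table with MERGE so a retried
-- backfill neither double-applies nor silently skips.
MERGE `proj.ctrl.applied_windows` AS t  -- hypothetical control table
USING (
  SELECT DATE '2024-01-01' AS window_start, DATE '2024-01-07' AS window_end
) AS s
ON t.window_start = s.window_start AND t.window_end = s.window_end
WHEN NOT MATCHED THEN
  INSERT (window_start, window_end, applied_at)
  VALUES (s.window_start, s.window_end, CURRENT_TIMESTAMP());
```

The actual transform then runs only for windows not yet marked applied, making reruns safe and auditable.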
Examples
Illustrative patterns for moving Spark-era UDF and macro utilities into BigQuery routines. Adjust datasets, types, and identifiers to match your environment.
```sql
-- BigQuery SQL UDF example: division that returns NULL instead of erroring.
CREATE OR REPLACE FUNCTION `proj.util.safe_div`(n NUMERIC, d NUMERIC) AS (
  -- A NULL or zero denominator yields NULL rather than a division-by-zero error.
  IF(d IS NULL OR d = 0, NULL, n / d)
);
```

Common pitfalls
- Hidden dependencies: notebook macros rely on implicit session configs, temp views, or catalogs.
- Mixed-type branches: CASE/IF returns different types; BigQuery needs explicit casts.
- NULL semantics drift: null-safe equality and string functions behave differently unless explicit.
- Dynamic SQL risks: string concatenation without bindings/escaping causes drift and security risk.
- Side effects ignored: control/audit writes not recreated; reruns become unsafe.
- Row-by-row patterns: loops in notebooks are costly in BigQuery; refactor to set-based SQL.
- No harness: without replayable tests, parity becomes a debate at cutover.
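Spark's null-safe equality operator (`a <=> b`) is a common source of the NULL-semantics drift above; in BigQuery the same semantics can be written explicitly:

```sql
-- Spark's null-safe equality (a <=> b) expressed explicitly in BigQuery.
SELECT
  a,
  b,
  COALESCE(a = b, FALSE) OR (a IS NULL AND b IS NULL) AS null_safe_eq
FROM UNNEST([
  STRUCT(1 AS a, 1 AS b),                                      -- TRUE
  STRUCT(CAST(NULL AS INT64) AS a, CAST(NULL AS INT64) AS b),  -- TRUE
  STRUCT(1 AS a, CAST(NULL AS INT64) AS b)                     -- FALSE
]);
```

Without the `COALESCE`, the mixed NULL/non-NULL case would return NULL instead of FALSE, which is exactly the kind of silent drift a harness should catch.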
Validation approach
- Compile + interface checks: each routine deploys; argument/return types match the contract.
- Golden tests: curated input sets validate outputs, including NULL-heavy and boundary values.
- Branch + failure-mode coverage: exercise key branches and expected failures.
- Side-effect verification: assert expected writes to control/log/audit tables and idempotency under retries/backfills.
- Integration replay: run routines inside representative pipelines and compare downstream KPIs/aggregates.
- Performance gate: confirm no hidden row-by-row scans; refactors validated with bytes scanned/runtime baselines.
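Golden tests can be expressed directly in BigQuery scripting with `ASSERT`, here exercising the `safe_div` UDF from the example above, including NULL-heavy and boundary inputs:

```sql
-- Golden-test sketch: each ASSERT fails the script with its message on mismatch.
ASSERT (SELECT `proj.util.safe_div`(10, 2)) = 5
  AS 'safe_div(10, 2) should be 5';
ASSERT (SELECT `proj.util.safe_div`(10, 0)) IS NULL
  AS 'division by zero should yield NULL, not an error';
ASSERT (SELECT `proj.util.safe_div`(NULL, 2)) IS NULL
  AS 'NULL numerator should propagate NULL';
```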
Migration steps
- 01
Inventory reusable logic and call sites
Collect Spark SQL UDFs, notebook utilities/macros, and call sites across pipelines and BI queries. Identify dependencies and side effects (control/audit tables).
- 02
Define the behavior contract
Specify inputs/outputs, typing, NULL rules, regex expectations, error semantics, side effects, and performance expectations. Decide target form (SQL UDF, JS UDF, procedure, or refactor).
- 03
Convert with safety patterns
Make casts explicit, normalize timezone intent, implement null-safe comparisons, and migrate dynamic SQL using EXECUTE IMMEDIATE with bindings and explicit identifier rules.
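The dynamic SQL guidance in this step can be sketched as follows (table name and date are placeholders): values flow through `USING` bindings, while identifiers should come from an allow-list rather than caller input.

```sql
-- Dynamic SQL with parameter bindings instead of string concatenation.
DECLARE row_count INT64;

EXECUTE IMMEDIATE
  'SELECT COUNT(*) FROM `proj.dataset.events` WHERE event_date = @d'  -- hypothetical table
  INTO row_count
  USING DATE '2024-01-01' AS d;  -- bound value, never concatenated into the SQL string

SELECT row_count;
```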
- 04
Build a replayable harness
Create golden input sets, boundary cases, and expected failures. Validate outputs and side effects deterministically so parity isn’t debated at cutover.
- 05
Integrate and cut over behind gates
Run routines in representative pipelines, compare downstream KPIs, validate retries/backfills, and cut over with rollback-ready criteria.
We inventory your UDFs and macros, migrate a representative subset into BigQuery routines, and deliver a harness that proves parity—including side effects and failure behavior.
You get a conversion plan, review markers for ambiguous intent, and validation artifacts, so the procedural-logic cutover is gated by evidence and rollback criteria.