Workload

Stored procedures, macros & UDFs for Databricks → Snowflake

Re-home reusable logic—from Spark SQL UDFs and notebook “macro” utilities to Snowflake UDFs and stored procedures—so behavior stays stable under retries, backfills, and real concurrency.

At a glance
Input
Databricks stored procedure, macro, and UDF logic
Output
Snowflake equivalent (validated)
Common pitfalls
  • Hidden dependencies: notebook macros rely on implicit session configs, catalogs, or temp views.
  • Mixed-type branches: CASE/IF branches return different types; Snowflake needs explicit casts to preserve intent.
  • NULL semantics drift: null-safe equality, joins, and string functions behave differently if not explicit.
Context

Why this breaks

In Databricks estates, “procedural” logic is often distributed: Spark SQL UDFs, notebook helper functions, job parameters, and macro-like snippets embedded across pipelines. During migration, teams focus on SQL translation and MERGE patterns, but these reusable utilities quietly change behavior—causing downstream drift.

Common symptoms after migration:

  • UDF outputs drift due to type coercion and NULL behavior differences
  • Notebook macros that relied on Spark functions or implicit session settings stop matching
  • Error handling changes (fail-fast vs continue), breaking orchestration expectations
  • Dynamic SQL and identifier resolution behave differently in Snowflake
  • Side effects (audit/control tables) aren’t modeled, so retries/backfills double-apply or skip

A successful migration turns this scattered logic into explicit Snowflake routines with a test harness that proves parity.

Approach

How conversion works

  1. Inventory & classify reusable logic: Spark SQL UDFs, notebook utilities/macros, job libraries, and call sites across pipelines and BI layers.
  2. Extract the behavior contract: inputs/outputs, typing, null rules, side effects, error semantics, and performance constraints.
  3. Choose Snowflake target form per asset:
    • SQL UDF for pure expressions
    • JavaScript UDF for complex logic where JS is the right fit
    • Stored procedure for control flow and multi-statement logic
    • Set-based refactor where loop/cursor patterns appear
  4. Translate and normalize: casts, null-safe comparisons, timestamp intent (NTZ/LTZ/TZ), and deterministic ordering where logic depends on ranking/dedup.
  5. Validate with a harness: golden inputs/outputs, branch coverage, failure-mode tests, and side-effect assertions—then integrate into a representative pipeline slice.
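
Step 4's normalization can be sketched as a before/after. This is an illustrative snippet (the `staging`/`target` tables and `customer_key`/`amount_raw` columns are assumed names): Snowflake's EQUAL_NULL makes the intent of Spark's null-safe `<=>` operator explicit, and explicit casts pin mixed-type branches.

```sql
-- Spark's null-safe equality (a <=> b) made explicit:
SELECT s.customer_key, t.customer_key
FROM staging s
JOIN target t
  ON EQUAL_NULL(s.customer_key, t.customer_key);  -- NULL = NULL matches

-- Mixed-type CASE branches pinned with explicit casts:
SELECT CASE
         WHEN amount_raw IS NULL THEN CAST(0 AS NUMBER(38,2))
         ELSE TRY_CAST(amount_raw AS NUMBER(38,2))  -- NULL on bad input, not an error
       END AS amount
FROM staging;
```

`a IS NOT DISTINCT FROM b` is an equivalent spelling where ANSI-style syntax is preferred.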

Supported constructs

Representative Databricks procedural/macro constructs we commonly migrate to Snowflake routines (exact coverage depends on your estate).

  • Spark SQL UDFs → Snowflake SQL UDFs. Pure expressions mapped with explicit casts and NULL behavior.
  • Notebook macro utilities → Snowflake procedures + views/UDFs. Hidden dependencies made explicit; interfaces stabilized.
  • Dynamic SQL generation → EXECUTE IMMEDIATE patterns with safe parameterization. Identifier resolution and escaping normalized.
  • Control-table driven rerun behavior → idempotency markers + applied-window tracking. Retries/backfills become safe and auditable.
  • Timestamp/timezone logic → explicit NTZ/LTZ/TZ handling. Prevents boundary-day drift in reporting and SCD logic.
  • Row-by-row notebook loops → set-based SQL refactors. Avoids performance and cost cliffs.
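
The dynamic-SQL pattern above typically lands as a Snowflake Scripting procedure. A minimal sketch (`util.purge_partition` and its table layout are assumptions, not a prescribed interface): values travel as bind parameters, while identifiers, which cannot be bound, must be validated before interpolation.

```sql
CREATE OR REPLACE PROCEDURE util.purge_partition(tbl STRING, load_dt DATE)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  stmt STRING;
BEGIN
  -- Values travel as bind parameters (?), never string concatenation;
  -- the identifier is interpolated, so validate tbl (allowlist or
  -- INFORMATION_SCHEMA lookup) before this point.
  stmt := 'DELETE FROM ' || tbl || ' WHERE load_date = ?';
  EXECUTE IMMEDIATE :stmt USING (load_dt);
  RETURN 'purged ' || tbl;
END;
$$;
```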

How workload changes

  • Where logic lives. Databricks: notebooks, jobs, Spark UDFs, and scattered utility code. Snowflake: centralized routines (UDFs/procedures) with explicit contracts. Migration consolidates and stabilizes reusable logic.
  • Typing and coercion. Databricks: implicit casts often tolerated. Snowflake: explicit casts required for stable outputs. Validation focuses on mixed-type branches and join keys.
  • Operational behavior. Databricks: reruns/retries often encoded in job structure. Snowflake: idempotency and side effects must be explicit. Harness proves behavior under retries/backfills.

Examples

Illustrative patterns for moving Databricks UDFs and macro-style utilities into Snowflake routines. Adjust databases/schemas and types to match your environment.

-- Snowflake SQL UDF example
CREATE OR REPLACE FUNCTION util.safe_div(n NUMBER(38,6), d NUMBER(38,6))
RETURNS NUMBER(38,6)
AS
$$
  IFF(d IS NULL OR d = 0, NULL, n / d)
$$;
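
A control-flow counterpart: a Snowflake Scripting procedure sketch (table and column names are illustrative) showing explicit error semantics, since fail-fast vs. continue behavior should be a deliberate choice rather than an accident of translation.

```sql
-- Snowflake Scripting procedure with explicit error semantics
CREATE OR REPLACE PROCEDURE util.apply_batch(run_id STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
  MERGE INTO tgt.dim_customer t
  USING stg.dim_customer s
    ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET name = s.name
  WHEN NOT MATCHED THEN INSERT (customer_id, name)
    VALUES (s.customer_id, s.name);
  RETURN 'ok:' || run_id;
EXCEPTION
  WHEN OTHER THEN
    -- Fail fast and surface the error to the orchestrator;
    -- swallow-and-continue must be explicit, never a default.
    RAISE;
END;
$$;
```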
Avoid

Common pitfalls

  • Hidden dependencies: notebook macros rely on implicit session configs, catalogs, or temp views.
  • Mixed-type branches: CASE/IF branches return different types; Snowflake needs explicit casts to preserve intent.
  • NULL semantics drift: null-safe equality, joins, and string functions behave differently if not explicit.
  • Dynamic SQL risks: string concatenation without bindings/escaping causes drift and security risk.
  • Side effects ignored: audit/control table updates not recreated, breaking idempotency and restartability.
  • Row-by-row patterns: Spark-era loops in notebooks become expensive or fragile if moved verbatim; prefer set-based SQL.
  • No harness: without replayable tests, parity becomes a debate at cutover.
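The row-by-row pitfall in practice: a loop that applies one day per iteration collapses into a single set-based statement (names and dates are illustrative).

```sql
-- Instead of looping over days in a procedure...
--   FOR d IN (SELECT load_date FROM ...) DO UPDATE ... END FOR;
-- ...apply the whole window in one set-based statement:
UPDATE tgt.daily_balance t
SET balance = s.balance
FROM stg.daily_balance s
WHERE t.account_id = s.account_id
  AND t.load_date  = s.load_date
  AND s.load_date BETWEEN '2024-01-01' AND '2024-01-31';
```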
Proof

Validation approach

  • Compile + interface checks: every UDF/procedure deploys successfully; argument/return types match the contract.
  • Golden tests: curated input sets validate outputs for each routine (including NULL-heavy and boundary values).
  • Branch + failure-mode coverage: exercise key branches and expected failures.
  • Side-effect verification: assert expected writes to control/log/audit tables and idempotency behavior under retries.
  • Integration replay: run routines inside representative pipelines and compare downstream KPIs/aggregates.
  • Performance gate: confirm no hidden row-by-row scans; refactors validated with runtime/credit baselines.
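A golden test can be as simple as an anti-join against a fixture table. Using the `util.safe_div` UDF above (the `qa.safe_div_golden` fixture table is a hypothetical name): zero rows back means parity.

```sql
-- Zero rows returned = parity; any row is a diff to investigate.
SELECT g.case_id,
       g.expected,
       util.safe_div(g.n, g.d) AS actual
FROM qa.safe_div_golden g
WHERE NOT EQUAL_NULL(g.expected, util.safe_div(g.n, g.d));
```

EQUAL_NULL keeps NULL-expected cases comparable, which ordinary `=` would silently drop.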
Execution

Migration steps

A sequence that keeps behavior explicit, testable, and safe to cut over.
  1. Inventory reusable logic and call sites: collect Spark SQL UDFs, notebook utilities/macros, and shared libraries; map call sites across pipelines, dbt models, and BI queries.
  2. Define the behavior contract: specify inputs/outputs, typing, NULL rules, error semantics, side effects, and performance expectations; decide the target form (SQL UDF, JS UDF, procedure, or set-based refactor).
  3. Convert with safety patterns: make casts explicit, normalize timezone intent, implement null-safe comparisons, and migrate dynamic SQL using safe parameterization and explicit identifier rules.
  4. Build a replayable harness: create golden input sets, boundary cases, and expected failures; validate outputs and side effects deterministically so parity isn’t debated at cutover.
  5. Integrate and cut over behind gates: run routines in representative pipelines, compare downstream KPIs, validate retries/backfills, and cut over with rollback-ready criteria.
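
The retry/backfill validation in the final step leans on applied-window tracking. A minimal sketch (`ctl.applied_windows` and its columns are assumptions): each window is recorded exactly once, so reruns become no-ops.

```sql
-- Record each applied window once; reruns become no-ops.
MERGE INTO ctl.applied_windows w
USING (SELECT 'fact_sales' AS job, '2024-01-15'::DATE AS load_date) r
  ON w.job = r.job AND w.load_date = r.load_date
WHEN NOT MATCHED THEN
  INSERT (job, load_date, applied_at)
  VALUES (r.job, r.load_date, CURRENT_TIMESTAMP());
-- Upstream steps check this table before re-applying a window.
```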

Workload Assessment
Migrate reusable logic with a test harness

We inventory your UDFs and notebook macros, migrate a representative subset into Snowflake routines, and deliver a harness that proves parity—including side effects and failure behavior.

Migration Acceleration
Cut over routines with proof-backed sign-off

Get a conversion plan, review markers for ambiguous intent, and validation artifacts so procedural logic cutover is gated by evidence and rollback criteria.

FAQ

Frequently asked questions

What are “macros” in a Databricks context?
They’re often notebook utilities, shared SQL snippets, or dbt macro logic that behaves like reusable functions. We migrate them into Snowflake-native assets (UDFs, views, procedures) with explicit contracts.
Do we have to move everything into stored procedures?
No. Many utilities belong as SQL UDFs or views, and some procedural logic is best refactored into set-based SQL. We choose the target form per asset to minimize risk and cost.
How do you prove parity for UDFs and procedures?
We build a replayable harness with golden inputs/outputs, branch and failure-mode coverage, and side-effect assertions. Integration replay validates downstream KPIs before cutover.
What’s the biggest risk area?
Implicit typing and NULL behavior differences, plus hidden side effects used for restartability. We make these explicit and validate under retries/backfills so behavior remains stable.