SQL Functions Documentation

_

_( [ddl type], [nullable] ) provides placeholders for lambda functions to allow partial application. Use them in place of actual values or expressions either to change arity or to allow use in _lambda_.

The default type is Long / Bigint; you will have to provide the type directly when using anything else. By default the placeholders are assumed to be nullable (i.e. true); use false to state that the field should not be null.

Tags:

  • lambda

_lambda_

_lambda_( user function ) extracts the Spark LambdaFunction from a resolved user function; this must have the correct types expected by the Spark HigherOrderFunction it is a parameter for.

This allows user defined functions and lambdas to be used with built-in Spark HigherOrderFunctions.
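
As a hedged sketch, assuming Quality's functions are registered on the session and that my_add is an illustrative user function taking two Longs (registered beforehand via the rule DSL), placeholders set the arity so the wrapped function matches the lambda expected by Spark's aggregate:

    // sketch only: my_add is a hypothetical user function of two Longs
    val totals = sparkSession.sql(
      "select aggregate(array(1L, 2L, 3L), 0L, _lambda_(my_add(_(), _()))) as total")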

Tags:

  • lambda

agg_Expr

agg_Expr( [ddl sum type], filter, sum, result ) aggregates over rows which match the filter expression, using the sum expression to aggregate, then processes the results using the result expression.

You can run multiple agg_Exprs in a single-pass select; use the first parameter to thread DDL type information through to the sum and result functions.
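
A minimal sketch, assuming Quality's functions are registered and df has a numeric amount column (both illustrative): count the rows matching the filter using inc() for the sum and a results lambda that simply returns it:

    // sketch only: number of rows with a positive amount, computed in one pass
    val counted = df.selectExpr(
      "agg_expr('BIGINT', amount > 0, inc(), results_with((sum, count) -> sum)) as positive_count")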

Tags:

  • aggregate

as_uuid

as_uuid(lower_long, higher_long) converts two longs into a uuid. Note: this is not functionally equivalent to rng_uuid(longPair(lower, higher)) despite having the same types.
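
For example, a minimal sketch assuming Quality's functions are registered on the session:

    val uuids = sparkSession.sql("select as_uuid(1234567890L, 987654321L) as uuid")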

Tags:

  • longs

big_Bloom

big_Bloom(buildFrom, expectedSize, expectedFPP) creates an aggregated bloom filter using the buildFrom expression.

The blooms are stored on a shared filesystem using a generated uuid; they can scale to high numbers of items whilst keeping the FPP (e.g. millions of items at 0.01 would imply 99% probability; you may have to cast to double in Spark 3.2).

buildFrom can be driven by digestToLongs or hashWith functions when using multiple fields.

Alternatives:

big_Bloom(buildFrom, expectedSize, expectedFPP, 'bloom_loc') - per above but uses a fixed string bloom_loc instead of a uuid
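
A hedged sketch, assuming Quality's functions are registered and df has id and region columns (all names illustrative): build an aggregated bloom over both fields, using hash_with to combine them and the fixed location form:

    // sketch only: 1 million expected items at 1% FPP, stored under the fixed name 'order_ids'
    val bloom = df.selectExpr(
      "big_bloom(hash_with('MURMUR3_128', id, region), 1000000, cast(0.01 as double), 'order_ids') as bloom")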

Tags:

  • bloom

callFun

callFun( user function lambda variable, param1, param2, … paramN ), when used within a lambda function, allows calling a lambda variable that contains a user function.

Used from top-level SQL it performs a similar role, expecting either a full user function or a partially applied function, typically returned from another lambda user function.

Tags:

  • lambda

coalesce_If_Attributes_Missing

coalesce_If_Attributes_Missing(expr, replaceWith) substitutes expr with the replaceWith expression when expr has missing attributes in the source dataframe. Your code must call the Scala processIfAttributeMissing function before use in validate or ruleEngineRunner/ruleRunner:

val missingAttributesAreReplacedRS = processIfAttributeMissing(rs, struct)

val (errors, _) = validate(struct, missingAttributesAreReplacedRS)

// use missingAttributesAreReplacedRS in your dataframe
NOTE: When using Spark 4 / DBR 17.3 or higher you can use the Connect-friendly process_if_attribute_missing SQL function instead.

Tags:

  • rule

coalesce_If_Attributes_Missing_Disable

coalesce_If_Attributes_Missing_Disable(expr) substitutes expr with the DisabledRule Integer result (-2) when expr has missing attributes in the source dataframe. Your code must call the Scala processIfAttributeMissing function before use in validate or ruleEngineRunner/ruleRunner:

val missingAttributesAreReplacedRS = processIfAttributeMissing(rs, struct)

val (errors, _) = validate(struct, missingAttributesAreReplacedRS)

// use missingAttributesAreReplacedRS in your dataframe
NOTE: When using Spark 4 / DBR 17.3 or higher you can use the Connect-friendly process_if_attribute_missing SQL function instead.

Tags:

  • rule

comparable_Maps

comparable_Maps(struct | array | map) converts any maps in the input param into sorted arrays of a key, value struct.

This allows developers to perform sorts, distincts, group bys and union set operations with Maps, which are not supported by Spark SQL as of 3.4.

The sorting behaviour uses Spark's existing ordering logic but allows for extension during the call to registerQualityFunctions via the mapCompare parameter and the defaultMapCompare function.
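
A minimal sketch, assuming Quality's functions are registered and payload is an illustrative map column: convert the maps so distinct works, then convert back:

    // sketch only: deduplicate rows on a map column
    val distinctMaps = df
      .selectExpr("comparable_maps(payload) as c")
      .distinct()
      .selectExpr("reverse_comparable_maps(c) as payload")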

Tags:

  • map

digest_To_Longs

digest_To_Longs('digestImpl', fields*) creates an array of longs using the given MessageDigest impl. A 128-bit impl will generate two longs from its digest.
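
For example, a minimal sketch assuming Quality's functions are registered and the customer_id and order_id columns exist (names illustrative); 'SHA-256' is a standard MessageDigest name:

    val digests = df.selectExpr("digest_to_longs('SHA-256', customer_id, order_id) as longs")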

Tags:

  • Hash

digest_To_Longs_Struct

digest_To_Longs_Struct('digestImpl', fields*) creates a structure of longs with fields named i0 to iN, using the given MessageDigest impl.

Tags:

  • Hash

disabled_Rule

disabledRule() returns the DisabledRule Integer result (-2) for use in filtering and to disable rules (which may not signify a version bump)

Tags:

  • rule

dq_rule_runner

dq_rule_runner( ruleSuiteVariable ) processes the rule suite stored at ruleSuiteVariable using the DQ runner as if calling ruleRunner directly.

All the alternatives correlate to their pre-Spark 4 non-SQL versions.

ruleSuiteVariables must be registered via the register_rule_suite Scala functions.

Alternatives:

dq_rule_runner( ruleSuiteVariable, compile ) - enables separate compilation of evals, by default this is false and compilation is performed at wholestage codegen
dq_rule_runner( ruleSuiteVariable, compile, variablesPerFunc, variableFuncGroup ) - additionally specifies the number of expressions to use per function and the number of functions to call in one group, defaulting to 40 and 20 respectively
dq_rule_runner( ruleSuiteVariable, compile, variablesPerFunc, variableFuncGroup, forceRunnerEval ) - additionally allows the expressions to be interpreted, instead of the default - false - which compiles as part of wholestage codegen

Tags:

  • variable
  • runner
  • Spark4

drop_field

drop_field(structure_expr, 'field.subfield'*) removes fields from a structure, but will not remove parent nodes.

This is a wrapped version of 3.4.1's dropField implementation.
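
For example, a hedged sketch assuming Quality's functions are registered and address is a struct column with a nested geo struct (names illustrative):

    // sketch only: remove two sub-fields while keeping the parent geo node
    val slimmed = df.selectExpr("drop_field(address, 'geo.lat', 'geo.lon') as address")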

Tags:

  • struct

expression_runner

expression_runner( ruleSuiteVariable ) processes the rule suite stored at ruleSuiteVariable using the typed expression runner as if calling expressionRunner directly.

The function always serializes to yaml output, so expressions do not need to share the same actual type.

All the alternatives correlate to their pre-Spark 4 non-SQL versions.

ruleSuiteVariables must be registered via the register_rule_suite Scala functions.

Alternatives:

expression_runner( ruleSuiteVariable, name ) - allows naming the column directly
expression_runner( ruleSuiteVariable, name, options ) - additionally allows the SnakeYaml output to be configured by the options map
expression_runner( ruleSuiteVariable, name, options, forceRunnerEval ) - additionally allows the expressions to be interpreted, instead of the default - false - which compiles as part of wholestage codegen

Tags:

  • variable
  • runner
  • Spark4

failed

failed() returns the Failed Integer result (0) for use in filtering

Tags:

  • rule

field_Based_ID

field_Based_ID('prefix', 'digestImpl', fields*) creates a variable bit-length id by using the given MessageDigest impl over the fields; prefix is used with the _base, _i0 and _iN fields in the resulting structure.

Tags:

  • ID
  • Hash

flatten_Results

flatten_Results(dataQualityExpr) expands data quality results into a flat array

flatten_Rule_Results

flatten_Rule_Results(dataQualityExpr) expands data quality results into a structure of flattenedResults, salientRule (the one used to create the output) and the rule result.

salientRule will be null if there was no matching rule

from_yaml

from_yaml(string, 'ddlType') uses snakeyaml to convert yaml into Spark datatypes

Tags:

  • yaml

hash_Field_Based_ID

hash_Field_Based_ID('prefix', 'digestImpl', fields*) creates a variable bit-length id by using the given Guava Hasher impl over the fields; prefix is used with the _base, _i0 and _iN fields in the resulting structure.

Tags:

  • ID
  • Hash

hash_With

hash_With('HASH', fields*) generates a hash value (array of longs) suitable for use in blooms, based on the given Guava hash implementation.

Note: based on testing, the digestToLongs function is faster for SHA256 and MD5.

Valid hashes: MURMUR3_32, MURMUR3_128, MD5, SHA-1, SHA-256, SHA-512, ADLER32, CRC32, SIPHASH24. When an invalid HASH name is provided MURMUR3_128 will be chosen.
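
For example, a minimal sketch assuming Quality's functions are registered (column names illustrative):

    val hashed = df.selectExpr("hash_with('MURMUR3_128', customer_id, order_id) as hash")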

Open source Spark 3.1.2/3 issues

On open source Spark 3.1.2/3 this may produce resolver errors due to a downgrade of the Guava version: 15.0 is used on Databricks, open source 3.0.3 uses 16.0.1, and 3.1.2 drops this to 11, which misses crc32, sipHash24 and adler32.

Tags:

  • Hash

hash_With_Struct

per hash_With('HASH', fields*) but generates a struct with i0 to ix named longs. This structure is not suitable for blooms

Tags:

  • Hash

id_base64

id_base64(base, i0, i1, ix) Generates a base64 encoded representation of the id, either the single struct field or the individual parts

Alternatives:

id_base64(id_struct) Uses an id field to generate

Tags:

  • ID

id_Equal

id_Equal(leftPrefix, rightPrefix) takes two prefixes which will be used to match leftPrefix_base = rightPrefix_base, i0 and i1 fields. It does not currently support more than two i's

Tags:

  • ID

id_from_base64

id_from_base64(base64) Parses the base64 string with an expected default long size of two, i.e. a 160-bit ID; any string which is not of the correct size will return null.

Alternatives:

id_from_base64(base64, size) Uses a size, which must be literal, to specify the type
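
A hedged sketch, assuming Quality's functions are registered and my_id is an illustrative id struct column: encode to base64 and parse it back with the default 160-bit size:

    val roundTrip = df
      .selectExpr("id_base64(my_id) as encoded")
      .selectExpr("encoded", "id_from_base64(encoded) as decoded")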

Tags:

  • ID

id_raw_type

id_raw_type(idstruct) Given a prefixed id returns the fields without their prefix

Tags:

  • ID

id_size

id_size(base64) Given a base64 from id_base64 returns the number of _i long fields

Tags:

  • ID

inc

inc() increments the current sum by 1

Alternatives:

inc( x ) Uses an expression of type Long to increment

Tags:

  • aggregate

long_Pair

long_Pair(lower, higher) creates a structure with these lower and higher longs

Tags:

  • longs

long_Pair_Equal

long_Pair_Equal(leftPrefix, rightPrefix) takes two prefixes which will be used to match leftPrefix_lower = rightPrefix_lower and leftPrefix_higher = rightPrefix_higher

Tags:

  • longs

long_Pair_From_UUID

long_Pair_From_UUID(expr) converts a UUID to a structure with lower and higher longs

Tags:

  • longs

map_Contains

map_Contains('mapid', expr) returns true if the expr item is present in the map specified by mapid.

On Spark 4 / DBR 17.3 and later this function uses Spark Variables and takes the form:

map_contains('mapid', expr, mapLookupsVar) Where mapLookupsVar is the result of loadMaps

Tags:

  • map
  • variable
  • Spark4

map_Lookup

map_Lookup('mapid', expr) returns either the lookup in map specified by mapid or null.

On Spark 4 / DBR 17.3 and later this function uses Spark Variables and takes the form:

map_lookup('mapid', expr, mapLookupsVar) Where mapLookupsVar is the result of loadMaps

Tags:

  • map
  • variable
  • Spark4

meanF

meanF() a simple mean on the results, expecting sum and count of type Long

Tags:

  • aggregate

murmur3_ID

murmur3ID('prefix', fields*) Generates a 160-bit id using murmur3 hashing over the input fields; prefix is used with the _base, _i0 and _i1 fields in the resulting structure.

Tags:

  • ID
  • Hash

pack_Ints

pack_Ints(lower, higher) creates a packed long from the two ints, used within result compression.

Tags:

  • ruleid

passed

passed() returns the Passed Integer result (10000) for use in filtering

Tags:

  • rule

prefixed_To_Long_Pair

prefixed_To_Long_Pair(field, 'prefix') converts a 128-bit longpair field with the given prefix into a higher and lower long pair without the prefix.

This is suitable for converting provided ids into uuids, for example via a further call to rngUUID.

Tags:

  • ID
  • longs

print_Code

print_Code( [msg], expr ) prints the code generated by an expression, the value variable and the isNull variable, and forwards eval calls / type etc. to the expression.

The code is printed once per partition on the executors' standard output; you will have to check each executor to find the output of the nodes used. To use this with unit testing on a single host you may overwrite the writer function in registerQualityFunctions, but you should use a top-level object and var (or stream) to write into: printCode will not be able to write to stdout properly (Spark redirects / captures stdout) nor to non top-level objects (due to classloader / function instance issues). Testing on other hosts should avoid stdout and write to a shared file location or similar.

Information: printCode is not compatible with every expression. Aggregate expressions like aggExpr or sum etc. won't generate code, so they aren't compatible with printCode.

_lambda_ is also incompatible with printCode, both when wrapping a user function and when wrapping the _lambda_ function itself. Similarly the _() placeholder function cannot be wrapped.

Any function expecting a specific signature like aggExpr or other HigherOrderFunctions like aggregate or filter are unlikely to support wrapped arguments.

Tags:

  • util

print_Expr

print_Expr( [msg], expr ) prints the expression tree via toString with an optional msg.

The message is printed to the driver node's standard output, often shown in notebooks as well. To use this with unit testing you may overwrite the writer function in registerQualityFunctions, but you should use a top-level object and var (or stream) to write into.

Tags:

  • util

probability

probability(expr) will translate probability rule results into a double, e.g. 1000 returns 0.01. This is useful for interpreting and filtering on probability-based results, which range from 0 to 10000 non-inclusive.
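
For example, a minimal sketch assuming Quality's functions are registered and ruleResult is an illustrative column holding a probability rule result:

    // sketch only: keep rows whose rule result implies at least a 90% probability
    val likely = resultsDf.filter("probability(ruleResult) >= 0.9")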

Tags:

  • rule

probability_In

probability_In(expr, 'bloomid') returns the probability of the expr being in the bloomfilter specified by bloomid.

This function either returns 0.0, where it is definitely not present, or the original FPP where it may be present.

You may use digestToLongs or hashWith as appropriate to use multiple columns safely.

Tags:

  • bloom

process_if_attribute_missing

process_if_attribute_missing( ruleSuiteVariable, ddl, name ) processes the rule suite at ruleSuiteVariable for coalesce_If_Attributes_Missing_Disable / coalesce_If_Attributes_Missing and sets the resulting transformed ruleSuite to a variable with the name parameter (this can be the same name as ruleSuiteVariable), e.g.:

process_if_attribute_missing(myRuleSuite, 'struct<..>', 'myRuleSuite')

Calling process_if_attribute_missing inside a ruleSuite is not permitted and will throw exceptions.

ruleSuiteVariables must be registered via the register_rule_suite Scala functions.

Tags:

  • variable
  • util
  • Spark4

provided_ID

provided_ID('prefix', existingLongs) creates an id for an existing array of longs; prefix is used with the _base, _i0 and _iN fields in the resulting structure.

Tags:

  • ID

results_With

results_With( x ) processes the results lambda x (e.g. (sum, count) -> sum ), which takes sum from the aggregate and count from the number of rows counted. Defaults both the sum type and count type to LongType.

Alternatives:

results_With( [sum ddl type], x) Use the given ddl type for the sum type e.g. 'MAP<STRING, DOUBLE>'
results_With( [sum ddl type], [result ddl type], x) Use the given ddl type for the sum and result types

Tags:

  • aggregate

return_Sum

return_Sum( sum type ddl ) just returns the sum and ignores the count param; it expands to resultsWith( [sum ddl_type], (sum, count) -> sum).

Tags:

  • aggregate

reverse_Comparable_Maps

reverses a call to comparableMaps

Tags:

  • map

rng

rng() Generates a 128bit random id using XO_RO_SHI_RO_128_PP, encoded as a lower and higher long pair

Alternatives:

rng('algorithm') Uses Commons RNG RandomSource to implement the RNG
rng('algorithm', seedL) Uses Commons RNG RandomSource to implement the RNG with a long seed

Tags:

  • longs
  • RNG

rng_Bytes

rng_Bytes() Generates a 128bit random id using XO_RO_SHI_RO_128_PP, encoded as a byte array

Alternatives:

rng_Bytes('algorithm') Uses Commons RNG RandomSource to implement the RNG
rng_Bytes('algorithm', seedL) Uses Commons RNG RandomSource to implement the RNG with a long seed
rng_Bytes('algorithm', seedL, byteCount) Uses Commons RNG RandomSource to implement the RNG with a long seed, with a specific byte length integer (e.g. 16 is two longs, 8 is integer)

Tags:

  • RNG

rng_ID

rng_ID('prefix') Generates a 160-bit random id using XO_RO_SHI_RO_128_PP; prefix is used with the _base, _i0 and _i1 fields in the resulting structure.

Alternatives:

rng_Id('prefix', 'algorithm') Uses Commons RNG RandomSource to implement the RNG; using other algorithms may generate more long _iN fields
rng_Id('prefix', 'algorithm', seedL) Uses Commons RNG RandomSource to implement the RNG with a long seed; using other algorithms may generate more long _iN fields

Tags:

  • ID
  • RNG

rng_UUID

rng_UUID(expr) takes either a structure with lower and higher longs or a 128bit binary type and converts to a string uuid - use with, for example, the rng() function.

If a simple conversion from two longs (lower, higher) to a uuid is desired then use as_uuid; rng_uuid applies the same transformations as the Spark uuid to the input higher and lower longs.
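
For example, a minimal sketch assuming Quality's functions are registered on the session:

    val ids = sparkSession.sql("select rng_uuid(rng()) as uuid")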

Tags:

  • longs

rule_engine_runner

rule_engine_runner( ruleSuiteVariable, ddl ) processes the rule suite stored at ruleSuiteVariable using the DQ runner as if calling ruleEngineRunner directly.

The specified ddl type is used for all output expressions

All the alternatives correlate to their pre-Spark 4 non-SQL versions.

ruleSuiteVariables must be registered via the register_rule_suite Scala functions.

Alternatives:

rule_engine_runner( ruleSuiteVariable ) - Spark derives the output expression type, this often does not match nullability expectations, as such specifying the DDL is preferred and more reliable
rule_engine_runner( ruleSuiteVariable, ddl, debug ) - additionally allows entering debug mode and returns each matching rule's output expression results
rule_engine_runner( ruleSuiteVariable, ddl, compile, debug, variablesPerFunc, variableFuncGroup, forceRunnerEval, forceTriggerEval) This version provides the full configuration options available from the ruleEngineRunner.

Tags:

  • variable
  • runner
  • Spark4

rule_folder_runner

rule_folder_runner( ruleSuiteVariable, starter, ddl ) processes the rule suite stored at ruleSuiteVariable using the DQ runner as if calling ruleFolderRunner directly.

The starter expression parameter is used to create the initial structure to be folded over; the ddl type is used to control the nullability and the type exactly.

All the alternatives correlate to their pre-Spark 4 non-SQL versions.

ruleSuiteVariables must be registered via the register_rule_suite Scala functions.

Alternatives:

rule_folder_runner( ruleSuiteVariable, starter ) - Spark derives the output expression type from starter, this often does not match nullability expectations, as such specifying the DDL is preferred and more reliable
rule_folder_runner( ruleSuiteVariable, starter, debug, ddl ) - additionally allows entering debug mode and returns each matching rule's output expression results so you can see changes between folds
rule_folder_runner( ruleSuiteVariable, starter, compile, debug, variablesPerFunc, variableFuncGroup, forceRunnerEval, ddl, forceTriggerEval) This version provides the full configuration options available from the ruleFolderRunner.

Tags:

  • variable
  • runner
  • Spark4

rule_result

rule_result(ruleSuiteResultColumn, packedRuleSuiteId, packedRuleSetId, packedRuleId) uses the packed long ids to retrieve the integer ruleResult (see below for ExpressionRunner) or null if it can't be found.

You can use pack_ints(id, version) to specify each id if you don't already have the packed long version. This is suitable for retrieving individual rule results, for example to aggregate counts of a specific rule result, without having to resort to using filter and map values.

rule_result works with ruleRunner (DQ) results (including details) and ExpressionRunner results. ExpressionRunner results return a tuple of ruleResult and resultDDL, both strings, or, if strip_result_ddl is called, a single string.
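
A hedged sketch, assuming Quality's functions are registered, DataQuality is an illustrative ruleRunner result column and the id/version numbers are placeholders:

    // sketch only: pull a single rule's integer result using packed (id, version) pairs
    val oneRule = dqDf.selectExpr(
      "rule_result(DataQuality, pack_ints(1, 1), pack_ints(10, 2), pack_ints(100, 3)) as res")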

Tags:

  • rule

rule_Suite_Result_Details

rule_Suite_Result_Details(dq) strips the overallResult from the data quality results, suitable for keeping the overall result as a top-level field with associated performance improvements

small_Bloom

small_Bloom(buildFrom, expectedSize, expectedFPP) creates a simple bytearray bloom filter using the expected size and FPP (0.01 is 99%; you may have to cast to double in Spark 3.2). buildFrom can be driven by digestToLongs or hashWith functions when using multiple fields.

Tags:

  • bloom

soft_Fail

soft_Fail(ruleexpr) will treat any rule failure (e.g. failed() ) as returning softFailed()

Tags:

  • rule

soft_Failed

soft_Failed() returns the SoftFailed Integer result (-1) for use in filtering

Tags:

  • rule

strip_result_ddl

strip_result_ddl(expressionsResult) removes the resultDDL field from expressionsRunner results, leaving only the string result itself for more compact storage

Tags:

  • rule

sum_With

sum_With( x ) adds expression x for each row processed in an aggExpr with a default of LongType

Alternatives:

sum_With( [ddl type], x) Use the given ddl type e.g. 'MAP<STRING, DOUBLE>'

Tags:

  • aggregate

to_yaml

to_yaml(expression, [options map]) uses snakeyaml to convert Spark datatypes into yaml.

Passing null into the function returns a null yaml (newline is appended):

null
  

All null values will be treated in this fashion. The string "null" will be represented as (again a new line is present):

'null'

The optional "options map" parameter currently supports the following output options:

  • useFullScalarType, defaults to false. Instead of using the default yaml tags it uses the full classnames for scalars, reducing the risk of precision loss if the yaml is to be used outside of the from_yaml function.

sample usage:

val df = sparkSession.sql("select array(1,2,3,4,5) og")
    .selectExpr("*", "to_yaml(og, map('useFullScalarType', 'true')) y")
    .selectExpr("*", "from_yaml(y, 'array<int>') f")
    .filter("f == og")

snakeyaml is provided scope

Databricks runtimes provide sparkyaml, so whilst Quality builds against the correct versions for Databricks it can only use provided scope.

snakeyaml is 1.24 on DBRs below 13.1, but not present on OSS, so you may need to add the dependency yourself; tested compatible versions are 1.24 and 1.33.

Tags:

  • yaml

typed_expression_runner

typed_expression_runner( ruleSuiteVariable, ddl ) processes the rule suite stored at ruleSuiteVariable using the typed expression runner as if calling typedExpressionRunner directly.

The ddl parameter specifies the output expressions' type; all expressions must share this type.

All the alternatives correlate to their pre-Spark 4 non-SQL versions.

ruleSuiteVariables must be registered via the register_rule_suite Scala functions.

Alternatives:

typed_expression_runner( ruleSuiteVariable, ddl, name ) - allows naming the column directly
typed_expression_runner( ruleSuiteVariable, ddl, name, forceRunnerEval ) - additionally allows the expressions to be interpreted, instead of the default - false - which compiles as part of wholestage codegen

Tags:

  • variable
  • runner
  • Spark4

unique_ID

uniqueID('prefix') Generates a 160-bit guaranteed unique id (requires MAC address uniqueness) with contiguous higher values within a partition and overflow with timestamp ms; prefix is used with the _base, _i0 and _i1 fields in the resulting structure.

Tags:

  • ID

unpack

unpack(expr) takes a packed rule long and unpacks it to a .id and .version structure
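
For example, a minimal sketch assuming Quality's functions are registered and packed_rule is an illustrative packed rule long column:

    val parts = df.selectExpr("unpack(packed_rule) as r").selectExpr("r.id", "r.version")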

Tags:

  • ruleid

unpack_Id_Triple

unpack_Id_Triple(expr) takes a packed rule triple of longs (ruleSuiteId, ruleSetId and ruleId) and unpacks it to (ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion)

Tags:

  • ruleid

update_field

update_field(structure_expr, 'field.subfield', replaceWith, 'fieldN', replaceWithN) processes structures, allowing you to replace sub-items (think lens in functional programming) using the structure field's path name.

This is a wrapped version of 3.4.1's withField implementation.
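
For example, a hedged sketch assuming Quality's functions are registered and address is a struct column with a nested geo struct (names illustrative):

    // sketch only: replace two nested double fields in one call
    val updated = df.selectExpr(
      "update_field(address, 'geo.lat', 0.0D, 'geo.lon', 0.0D) as address")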

Tags:

  • struct

za_Field_Based_ID

za_Field_Based_ID('prefix', 'digestImpl', fields*) creates a 64-bit id (96-bit including header) by using the given Zero Allocation impl over the fields; prefix is used with the _base and _i0 fields in the resulting structure.

Prefer using zaLongsFieldBasedID for fewer collisions.

Tags:

  • ID
  • Hash

za_Hash_Longs_With

za_Hash_Longs_With('HASH', fields*) generates a multi-length long array but with a zero allocation implementation. This structure is suitable for blooms; the default XXH3 algorithm is the 128-bit version of that used by the internal bigBloom implementation.

Available HASH functions are MURMUR3_128, XXH3

Tags:

  • Hash

za_Hash_Longs_With_Struct

similar to za_Hash_Longs_With('HASH', fields*) but generates an ID-relevant multi-length long struct, which is not suitable for blooms

Tags:

  • Hash

za_Hash_With

za_Hash_With('HASH', fields*) generates a single-length long array, always with 64 bits, but with a zero allocation implementation. This structure is suitable for blooms; the default XX algorithm is used by the internal bigBloom implementation.

Available HASH functions are MURMUR3_64, CITY_1_1, FARMNA, FARMOU, METRO, WY_V3, XX

Tags:

  • Hash

za_Hash_With_Struct

similar to za_Hash_With('HASH', fields*) but generates an ID-relevant multi-length long struct (of one long), which is not suitable for blooms.

Prefer zaHashLongsWithStruct, with either the MURMUR3_128 or XXH3 versions of the hashes, for reduced collisions.

Tags:

  • Hash

za_Longs_Field_Based_ID

za_Longs_Field_Based_ID('prefix', 'digestImpl', fields*) creates a variable length id by using the given Zero Allocation impl over the fields; prefix is used with the _base, _i0 and _iN fields in the resulting structure. Murmur3_128 here is faster than the Guava implementation.

Tags:

  • ID
  • Hash
