SQL Functions Documentation
_¶
_( [ddl type], [nullable] ) provides PlaceHolders for lambda functions to allow partial application, use them in place of actual values or expressions to either change arity or allow use in _lambda_.
The default type is Long / Bigint, you will have to provide the types directly when using something else. By default the placeholders are assumed to be nullable (i.e. true), you can use false to state the field should not be null.
Tags:
- lambda
_lambda_¶
_lambda_( user function ) extracts the Spark LambdaFunction from a resolved user function, this must have the correct types expected by the Spark HigherOrderFunction they are parameters for.
This allows using user defined functions and lambdas with in-built Spark HigherOrderFunctions
Tags:
- lambda
agg_Expr¶
agg_Expr( [ddl sum type], filter, sum, result) aggregates on rows which match the filter expression using the sum expression to aggregate then processes the results using the result expression.
You can run multiple agg_Expr's in a single pass select, use the first parameter to thread DDL type information through to the sum and result functions.
Tags:
- aggregate
as_uuid¶
as_uuid(lower_long, higher_long) converts two longs into a uuid. Note: this is not functionally equivalent to rng_uuid(longPair(lower, higher)) despite having the same types.
Tags:
- longs
big_Bloom¶
big_Bloom(buildFrom, expectedSize, expectedFPP) creates an aggregated bloom filter using the buildFrom expression.
The blooms are stored on a shared filesystem using a generated uuid, they can scale to high numbers of items whilst keeping the FPP (e.g. millions at 0.01 would imply 99% probability, you may have to cast to double in Spark 3.2).
buildFrom can be driven by digestToLongs or hashWith functions when using multiple fields.
Alternatives:
Tags:
- bloom
callFun¶
callFun( user function lambda variable, param1, param2, … paramN ) used within a lambda function it allows calling a lambda variable that contains a user function.
Used from the top level sql it performs a similar function expecting either a full user function or a partially applied function, typically returned from another lambda user function.
Tags:
- lambda
coalesce_If_Attributes_Missing¶
coalesce_If_Attributes_Missing(expr, replaceWith) substitutes expr with the replaceWith expression when expr has missing attributes in the source dataframe. Your code must call the scala processIfAttributeMissing function before using in validate or ruleEngineRunner/ruleRunner:
val missingAttributesAreReplacedRS = processIfAttributeMissing(rs, struct)
val (errors, _) = validate(struct, missingAttributesAreReplacedRS)
// use it missingAttributesAreReplacedRS in your dataframe..
Tags:
- rule
coalesce_If_Attributes_Missing_Disable¶
coalesce_If_Attributes_Missing_Disable(expr) substitutes expr with the DisabledRule Integer result (-2) when expr has missing attributes in the source dataframe. Your code must call the scala processIfAttributeMissing function before using in validate or ruleEngineRunner/ruleRunner:
val missingAttributesAreReplacedRS = processIfAttributeMissing(rs, struct)
val (errors, _) = validate(struct, missingAttributesAreReplacedRS)
// use it missingAttributesAreReplacedRS in your dataframe..
Tags:
- rule
comparable_Maps¶
comparable_Maps(struct | array | map) converts any maps in the input param into sorted arrays of a key, value struct.
This allows developers to perform sorts, distincts, group bys and union set operations with Maps, currently not supported by Spark sql as of 3.4.
The sorting behaviour uses Sparks existing odering logic but allows for extension during the calls to the registerQualityFunctions via the mapCompare parameter and the defaultMapCompare function.
Tags:
- map
digest_To_Longs¶
digest_To_Longs('digestImpl', fields*) creates an array of longs based on creating the given MessageDigest impl. A 128-bit impl will generate two longs from it's digest
Tags:
- Hash
digest_To_Longs_Struct¶
digest_To_Longs_Struct('digestImpl', fields*) creates structure of longs with i0 to iN named fields based on creating the given MessageDigest impl.
Tags:
- Hash
disabled_Rule¶
disabledRule() returns the DisabledRule Integer result (-2) for use in filtering and to disable rules (which may not signify a version bump)
Tags:
- rule
dq_rule_runner¶
dq_rule_runner( ruleSuiteVariable ) processes the rule suite stored at ruleSuiteVariable using the DQ runner as if calling ruleRunner directly.
All the alternatives correlate to their pre Spark 4 non-sql versions.
ruleSuiteVariables must be registered via the register_rule_suite scala functions.
Alternatives:
Tags:
- variable
- runner
- Spark4
drop_field¶
drop_field(structure_expr, 'field.subfield'*) removes fields from a structure, but will not remove parent nodes.
This is a wrapped version of 3.4.1's dropField implementation.
Tags:
- struct
expression_runner¶
expression_runner( ruleSuiteVariable ) processes the rule suite stored at ruleSuiteVariable using the typed expression runner as if calling expressionRunner directly.
The function always serializes to yaml output and expressions do not need to share the actual real type.
All the alternatives correlate to their pre Spark 4 non-sql versions.
ruleSuiteVariables must be registered via the register_rule_suite scala functions.
Alternatives:
Tags:
- variable
- runner
- Spark4
failed¶
failed() returns the Failed Integer result (0) for use in filtering
Tags:
- rule
field_Based_ID¶
field_Based_ID('prefix', 'digestImpl', fields*) creates a variable bit length id by using a given MessageDigest impl over the fields, prefix is used with the _base, _i0 and _iN fields in the resulting structure
Tags:
- ID
- Hash
flatten_Results¶
flatten_Results(dataQualityExpr) expands data quality results into a flat array
flatten_Rule_Results¶
flatten_Rule_Results(dataQualityExpr) expands data quality results into a structure of flattenedResults, salientRule (the one used to create the output) and the rule result.
salientRule will be null if there was no matching rule
from_yaml¶
from_yaml(string, 'ddlType') uses snakeyaml to convert yaml into Spark datatypes
Tags:
- yaml
hash_Field_Based_ID¶
hash_Field_Based_ID('prefix', 'digestImpl', fields*) creates a variable bit length id by using a given Guava Hasher impl over the fields, prefix is used with the _base, _i0 and _iN fields in the resulting structure
Tags:
- ID
- Hash
hash_With¶
hash_With('HASH', fields*) Generates a hash value (array of longs) suitable for using in blooms based on the given Guava hash implementation.
Note based on testing the digestToLongs function for SHA256 and MD5 are faster.
Valid hashes: MURMUR3_32, MURMUR3_128, MD5, SHA-1, SHA-256, SHA-512, ADLER32, CRC32, SIPHASH24. When an invalid HASH name is provided MURMUR3_128 will be chosen.
Open source Spark 3.1.2/3 issues
On Spark 3.1.2/3 open source this may get resolver errors due to a downgrade on guava version - 15.0 is used on Databricks, open source 3.0.3 uses 16.0.1, 3.1.2 drops this to 11 and misses crc32, sipHash24 and adler32.
Tags:
- Hash
hash_With_Struct¶
per hash_With('HASH', fields*) but generates a struct with i0 to ix named longs. This structure is not suitable for blooms
Tags:
- Hash
id_base64¶
id_base64(base, i0, i1, ix) Generates a base64 encoded representation of the id, either the single struct field or the individual parts
Alternatives:
Tags:
- ID
id_Equal¶
id_Equal(leftPrefix, rightPrefix) takes two prefixes which will be used to match leftPrefix_base = rightPrefix_base, i0 and i1 fields. It does not currently support more than two i's
Tags:
- ID
id_from_base64¶
id_from_base64(base64) Parses the base64 string with an expected default long size of two i.e. an 160bit ID, any string which is not of the correct size will return null
Alternatives:
Tags:
- ID
id_raw_type¶
id_raw_type(idstruct) Given a prefixed id returns the fields without their prefix
Tags:
- ID
id_size¶
id_size(base64) Given a base64 from id_base64 returns the number of _i long fields
Tags:
- ID
inc¶
inc() increments the current sum by 1
Alternatives:
Tags:
- aggregate
long_Pair¶
long_Pair(lower, higher) creates a structure with these lower and higher longs
Tags:
- longs
long_Pair_Equal¶
long_Pair_Equal(leftPrefix, rightPrefix) takes two prefixes which will be used to match leftPrefix_lower = rightPrefix_lower and leftPrefix_higher = rightPrefix_higher
Tags:
- longs
long_Pair_From_UUID¶
long_Pair_From_UUID(expr) converts a UUID to a structure with lower and higher longs
Tags:
- longs
map_Contains¶
map_Contains('mapid', expr) returns true if there is an item in the map.
On Spark 4 / DBR 17.3 and later this function uses Spark Variables and takes the form:
map_contains('mapid', expr, mapLookupsVar) Where mapLookupsVar is the result of loadMaps
Tags:
- map
- variable
- Spark4
map_Lookup¶
map_Lookup('mapid', expr) returns either the lookup in map specified by mapid or null.
On Spark 4 / DBR 17.3 and later this function uses Spark Variables and takes the form:
map_lookup('mapid', expr, mapLookupsVar) Where mapLookupsVar is the result of loadMaps
Tags:
- map
- variable
- Spark4
meanF¶
meanF() simple mean on the results, expecting sum and count type Long
Tags:
- aggregate
murmur3_ID¶
murmur3ID('prefix', fields*) Generates a 160bit id using murmer3 hashing over input fields, prefix is used with the _base, _i0 and _i1 fields in the resulting structure
Tags:
- ID
- Hash
pack_Ints¶
pack_Ints(lower, higher) a packaged long from two ints, used within result compression
Tags:
- ruleid
passed¶
passed() returns the Passed Integer for use in filtering: 10000
Tags:
- rule
prefixed_To_Long_Pair¶
prefixed_To_Long_Pair(field, 'prefix') converts a 128bit longpair field with the given prefix into a higher and lower long pair without prefix.
This is suitable for converting provided id's into uuids for example via a further call to rngUUID.
Tags:
- ID, longs
print_Code¶
print_Code( [msg], expr ) prints the code generated by an expression, the value variable and the isNull variable and forwards eval calls / type etc. to the expression.
The code is printed once per partition on the executors std. output. You will have to check each executor to find the used nodes output. To use with unit testing on a single host you may overwrite the writer function in registerQualityFunctions, you should however use a top level object and var to write into (or stream), printCode will not be able to write to std out properly (spark redirects / captures stdout) or non top level objects (due to classloader / function instance issues). Testing on other hosts without using stdout should do so to a shared file location or similar.
!!! "information" It is not compatible with every expression Aggregate expressions like aggExpr or sum etc. won't generate code so they aren't compatible with printCode.
\_lambda\_ is also incompatible with printCode both wrapping a user function and the \_lambda\_ function. Similarly the \_() placeholder function cannot be wrapped.
Any function expecting a specific signature like aggExpr or other HigherOrderFunctions like aggregate or filter are unlikely to support wrapped arguments.
Tags:
- util
print_Expr¶
print_Expr( [msg], expr ) prints the expression tree via toString with an optional msg
The message is printed to the driver nodes std. output, often shown in notebooks as well. To use with unit testing you may overwrite the writer function in registerQualityFunctions, you should however use a top level object and var to write into (or stream).
Tags:
- util
probability¶
probability(expr) will translate probability rule results into a double, e.g. 1000 returns 0.01. This is useful for interpreting and filtering on probability based results: 0 -> 10000 non-inclusive
Tags:
- rule
probability_In¶
probability_In(expr, 'bloomid') returns the probability of the expr being in the bloomfilter specified by bloomid.
This function either returns 0.0, where it is definitely not present, or the original FPP where it may be present.
You may use digestToLongs or hashWith as appropriate to use multiple columns safely.
Tags:
- bloom
process_if_attribute_missing¶
process_if_attribute_missing( ruleSuiteVariable, ddl, name ) processes the rule suite at ruleSuiteVariable for coalesce_If_Attributes_Missing_Disable / coalesce_If_Attributes_Missing and sets the resulting transformed ruleSuite to a variable with the name parameter (this can be the same name as ruleSuiteVariable e.g.:
process_if_attribute_missing(myRuleSuite, 'struct<..>', 'myRuleSuite')
Calling process_if_attribute_missing inside a ruleSuite is not permitted and will throw exceptions.
ruleSuiteVariables must be registered via the register_rule_suite scala functions.
Tags:
- variable
- util
- Spark4
provided_ID¶
provided_ID('prefix', existingLongs) creates an id for an existing array of longs, prefix is used with the _base, _i0 and _iN fields in the resulting structure
Tags:
- ID
results_With¶
results_With( x ) process results lambda x (e.g. (sum, count) -> sum ) that takes sum from the aggregate, count from the number of rows counted. Defaults both the sumtype and counttype as LongType
Alternatives:
Tags:
- aggregate
return_Sum¶
return_Sum( sum type ddl ) just returns the sum and ignores the count param, expands to resultsWith( [sum ddl_type], (sum, count) -> sum)
Tags:
- aggregate
reverse_Comparable_Maps¶
reverses a call to comparableMaps
Tags:
- map
rng¶
rng() Generates a 128bit random id using XO_RO_SHI_RO_128_PP, encoded as a lower and higher long pair
Alternatives:
Tags:
- longs
- RNG
rng_Bytes¶
rng_Bytes() Generates a 128bit random id using XO_RO_SHI_RO_128_PP, encoded as a byte array
Alternatives:
Tags:
- RNG
rng_ID¶
rng_ID('prefix') Generates a 160bit random id using XO_RO_SHI_RO_128_PP, prefix is used with the _base, _i0 and _i1 fields in the resulting structure
Alternatives:
Tags:
- ID
- RNG
rng_UUID¶
rng_UUID(expr) takes either a structure with lower and higher longs or a 128bit binary type and converts to a string uuid - use with, for example, the rng() function.
If a simple conversion from two longs (lower, higher) to a uuid is desired then use as_uuid, rng_uuid applies the same transformations as the Spark uuid to the input higher and lower longs.
Tags:
- longs
rule_engine_runner¶
rule_engine_runner( ruleSuiteVariable, ddl ) processes the rule suite stored at ruleSuiteVariable using the DQ runner as if calling ruleEngineRunner directly.
The specified ddl type is used for all output expressions
All the alternatives correlate to their pre Spark 4 non-sql versions.
ruleSuiteVariables must be registered via the register_rule_suite scala functions.
Alternatives:
Tags:
- variable
- runner
- Spark4
rule_folder_runner¶
rule_folder_runner( ruleSuiteVariable, starter, ddl ) processes the rule suite stored at ruleSuiteVariable using the DQ runner as if calling ruleFolderRunner directly.
The starter expression parameter is used to create the initial structure to be folded over, the ddl type is used to control nullability and the type exactly.
All the alternatives correlate to their pre Spark 4 non-sql versions.
ruleSuiteVariables must be registered via the register_rule_suite scala functions.
Alternatives:
Tags:
- variable
- runner
- Spark4
rule_result¶
rule_result(ruleSuiteResultColumn, packedRuleSuiteId, packedRuleSetId, packedRuleId) uses the packed long id's to retrieve the integer ruleResult (see below for ExpressionRunner) or null if it can't be found.
You can use pack_ints(id, version) to specify each id if you don't already have the packed long version. This is suitable for retrieving individual rule results, for example to aggregate counts of a specific rule result, without having to resort to using filter and map values.
rule_result works with ruleRunner (DQ) results (including details) and ExpressionRunner results. ExpressionRunner results return a tuple of ruleResult and resultDDL, both strings, or if strip_result_ddl is called a string.
Tags:
- rule
rule_Suite_Result_Details¶
rule_Suite_Result_Details(dq) strips the overallResult from the dataquality results, suitable for keeping overall result as a top-level field with associated performance improvements
small_Bloom¶
small_Bloom(buildFrom, expectedSize, expectedFPP) creates a simply bytearray bloom filter using the expected size and fpp - 0.01 is 99%, you may have to cast to double in Spark 3.2. buildFrom can be driven by digestToLongs or hashWith functions when using multiple fields.
Tags:
- bloom
soft_Fail¶
soft_Fail(ruleexpr) will treat any rule failure (e.g. failed() ) as returning softFailed()
Tags:
- rule
soft_Failed¶
soft_Failed() returns the SoftFailed Integer result (-1) for use in filtering
Tags:
- rule
strip_result_ddl¶
strip_result_ddl(expressionsResult) removes the resultDDL field from expressionsRunner results, leaving only the string result itself for more compact storage
Tags:
- rule
sum_With¶
sum_With( x ) adds expression x for each row processed in an aggExpr with a default of LongType
Alternatives:
Tags:
- aggregate
to_yaml¶
to_yaml(expression, [options map]) uses snakeyaml to convert Spark datatypes into yaml.
Passing null into the function returns a null yaml (newline is appended):
null
All null values will be treated in this fashion. The string "null" will be represented as (again new line is present):
'null'
The optional "options map" parameter currently supports the following output options:
- useFullScalarType, defaults to false. Instead of using the default yaml tags uses the full classnames for scalars, reducing risk of precision loss if the yaml is to be used outside of the from_yaml function.
sample usage:
val df = sparkSession.sql("select array(1,2,3,4,5) og")
.selectExpr("*", "to_yaml(og, map('useFullScalarType', 'true')) y")
.selectExpr("*", "from_yaml(y, 'array<int>') f")
.filter("f == og")
snakeyaml is provided scope
Databricks runtimes provide sparkyaml, so whilst Quality builds against the correct versions for Databricks it can onyl use provided scope.
snakeyaml is 1.24 on DBRs below 13.1, but not present on OSS, so you may need to add the dependency yourself, tested compatible versions are 1.24 and 1.33.
Tags:
- yaml
typed_expression_runner¶
typed_expression_runner( ruleSuiteVariable, ddl ) processes the rule suite stored at ruleSuiteVariable using the typed expression runner as if calling typedExpressionRunner directly.
The ddl parameter specifies the output expressions type, all expressions must share this type.
All the alternatives correlate to their pre Spark 4 non-sql versions.
ruleSuiteVariables must be registered via the register_rule_suite scala functions.
Alternatives:
Tags:
- variable
- runner
- Spark4
unique_ID¶
uniqueID('prefix') Generates a 160bit guaranteed unique id (requires MAC address uniqueness) with contiguous higher values within a partition and overflow with timestamp ms., prefix is used with the _base, _i0 and _i1 fields in the resulting structure
Tags:
- ID
unpack¶
unpack(expr) takes a packed rule long and unpacks it to a .id and .version structure
Tags:
- ruleid
unpack_Id_Triple¶
unpack_Id_Triple(expr) takes a packed rule triple of longs (ruleSuiteId, ruleSetId and ruleId) and unpacks it to (ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion)
Tags:
- ruleid
update_field¶
update_field(structure_expr, 'field.subfield', replaceWith, 'fieldN', replaceWithN) processes structures allowing you to replace sub items (think lens in functional programming) using the structure fields path name.
This is a wrapped version of 3.4.1's withField implementation.
Tags:
- struct
za_Field_Based_ID¶
za_Field_Based_ID('prefix', 'digestImpl', fields*) creates a 64bit id (96bit including header) by using a given Zero Allocation impl over the fields, prefix is used with the _base and _i0 fields in the resulting structure.
Prefer using the zaLongsFieldBasedID for less collisions
Tags:
- ID
- Hash
za_Hash_Longs_With¶
za_Hash_Longs_With('HASH', fields*) generates a multi length long array but with a zero allocation implementation. This structure is suitable for blooms, the default XXH3 algorithm is the 128bit version of that used by the internal bigBloom implementation.
Available HASH functions are MURMUR3_128, XXH3
Tags:
- Hash
za_Hash_Longs_With_Struct¶
similar to za_Hash_Longs_With('HASH', fields*) but generates an ID relevant multi length long struct, which is not suitable for blooms
Tags:
- Hash
za_Hash_With¶
za_Hash_With('HASH', fields*) generates a single length long array always with 64 bits but with a zero allocation implementation. This structure is suitable for blooms, the default XX algorithm is used by the internal bigBloom implementation.
Available HASH functions are MURMUR3_64, CITY_1_1, FARMNA, FARMOU, METRO, WY_V3, XX
Tags:
- Hash
za_Hash_With_Struct¶
similar to za_Hash_With('HASH', fields*) but generates an ID relevant multi length long struct (of one long), which is not suitable for blooms.
Prefer zaHashLongsWithStruct for reduced collisions with either the MURMUR3_128 or XXH3 versions of hashes
Tags:
- Hash
za_Longs_Field_Based_ID¶
za_Longs_Field_Based_ID('prefix', 'digestImpl', fields*) creates a variable length id by using a given Zero Allocation impl over the fields, prefix is used with the _base, _i0 and _iN fields in the resulting structure. Murmur3_128 is faster than on the Guava implementation.
Tags:
- ID
- Hash
Created: December 16, 2025 07:52:51