SQL Functions Documentation
_¶
_( [ddl type], [nullable] ) provides placeholders for lambda functions to allow partial application. Use them in place of actual values or expressions to either change arity or allow use in _lambda_.
The default type is Long / Bigint; you will have to provide the types directly when using something else. By default the placeholders are assumed to be nullable (i.e. true); use false to state the field should not be null.
_lambda_¶
_lambda_( user function ) extracts the Spark LambdaFunction from a resolved user function. The function must have the types expected by the Spark HigherOrderFunction it is passed to.
This allows using user defined functions and lambdas with built-in Spark HigherOrderFunctions.
agg_Expr¶
agg_Expr( [ddl sum type], filter, sum, result) aggregates on rows which match the filter expression, using the sum expression to aggregate, then processes the results using the result expression.
You can run multiple agg_Expr calls in a single pass select. Use the first parameter to thread DDL type information through to the sum and result functions.
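The filter, sum and result steps described above can be pictured with ordinary Scala collections. The sketch below is only an illustration of that flow: `AggExprSketch` and its `aggExpr` helper are hypothetical names, not the Quality SQL function, and it assumes the count passed to the result step is the number of rows matching the filter.

```scala
// Plain-Scala sketch of the filter -> sum -> result flow; no Spark required.
object AggExprSketch {
  // Hypothetical stand-in for the SQL function, for illustration only.
  def aggExpr[R, S, O](rows: Seq[R], zero: S)(
      filter: R => Boolean,   // which rows take part in the aggregation
      sum: (S, R) => S,       // folds one matching row into the running sum
      result: (S, Long) => O  // final step over (sum, count of matching rows)
  ): O = {
    val (total, count) = rows.filter(filter).foldLeft((zero, 0L)) {
      case ((acc, n), row) => (sum(acc, row), n + 1L)
    }
    result(total, count)
  }

  // Mean of the even values: similar in spirit to summing a column and finishing with meanF.
  val meanOfEvens: Double =
    aggExpr(Seq(1L, 2L, 3L, 4L, 5L, 6L), 0L)(
      _ % 2 == 0,
      (acc, row) => acc + row,
      (sum, count) => sum.toDouble / count
    )
}
```

Here `AggExprSketch.meanOfEvens` folds 2, 4 and 6 into a sum of 12 over 3 rows before the result step divides them.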
as_uuid¶
as_uuid(lower_long, higher_long) converts two longs into a uuid. Note: this is not functionally equivalent to rng_uuid(longPair(lower, higher)) despite having the same types.
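As a rough picture of a direct conversion from two longs, the sketch below builds a `java.util.UUID` without altering any bits. It assumes that `higher` supplies the most significant 64 bits and `lower` the least significant; the `AsUuidSketch` object and its method names are hypothetical and the library's internal representation is not confirmed here.

```scala
import java.util.UUID

object AsUuidSketch {
  // Assumption: `higher` is the most significant half, `lower` the least significant.
  def asUuid(lower: Long, higher: Long): UUID = new UUID(higher, lower)

  // The reverse direction: recover (lower, higher) from a UUID.
  def toLongPair(uuid: UUID): (Long, Long) =
    (uuid.getLeastSignificantBits, uuid.getMostSignificantBits)
}
```

Because no bits are changed, the conversion round-trips: `toLongPair(asUuid(lower, higher))` gives back the same pair.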
big_Bloom¶
big_Bloom(buildFrom, expectedSize, expectedFPP) creates an aggregated bloom filter using the buildFrom expression.
The blooms are stored on a shared filesystem using a generated uuid. They can scale to high numbers of items whilst keeping the FPP (e.g. millions of items at an FPP of 0.01, which implies a 99% probability; you may have to cast the FPP to double in Spark 3.2).
buildFrom can be driven by digestToLongs or hashWith functions when using multiple fields.
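To get a feel for how expectedSize and expectedFPP interact, the sketch below applies the textbook bloom filter sizing formulas. It is an assumption that these formulas are a useful approximation here: the sizing used internally by big_Bloom is not stated in this document, and `BloomSizingSketch` is a hypothetical name.

```scala
object BloomSizingSketch {
  // Textbook sizing: bits m = -n * ln(p) / (ln 2)^2
  def optimalBits(expectedSize: Long, fpp: Double): Long =
    math.ceil(-expectedSize * math.log(fpp) / (math.log(2) * math.log(2))).toLong

  // Textbook hash count: k = (m / n) * ln 2, at least one hash function
  def optimalHashes(expectedSize: Long, bits: Long): Int =
    math.max(1, math.round(bits.toDouble / expectedSize * math.log(2)).toInt)
}
```

For one million items at an FPP of 0.01 this gives a little under 9.6 million bits (roughly 1.2 MB) and 7 hash functions, which is why a lower FPP or a larger expected size grows the filter quickly.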
callFun¶
callFun( user function lambda variable, param1, param2, … paramN ) used within a lambda function it allows calling a lambda variable that contains a user function.
Used from the top level sql it performs a similar function expecting either a full user function or a partially applied function, typically returned from another lambda user function.
coalesce_If_Attributes_Missing¶
coalesce_If_Attributes_Missing(expr, replaceWith) substitutes expr with the replaceWith expression when expr has missing attributes in the source dataframe. Your code must call the scala processIfAttributeMissing function before using in validate or ruleEngineRunner/ruleRunner:
val missingAttributesAreReplacedRS = processIfAttributeMissing(rs, struct)
val (errors, _) = validate(struct, missingAttributesAreReplacedRS)
// use missingAttributesAreReplacedRS with your dataframe
coalesce_If_Attributes_Missing_Disable¶
coalesce_If_Attributes_Missing_Disable(expr) substitutes expr with the DisabledRule Integer result (-2) when expr has missing attributes in the source dataframe. Your code must call the scala processIfAttributeMissing function before using in validate or ruleEngineRunner/ruleRunner:
val missingAttributesAreReplacedRS = processIfAttributeMissing(rs, struct)
val (errors, _) = validate(struct, missingAttributesAreReplacedRS)
// use missingAttributesAreReplacedRS with your dataframe
comparable_Maps¶
comparable_Maps(struct | array | map) converts any maps in the input param into sorted arrays of a key, value struct.
This allows developers to perform sorts, distincts, group bys and union set operations with Maps, currently not supported by Spark sql as of 3.4.
The sorting behaviour uses Spark's existing ordering logic but allows for extension during the calls to registerQualityFunctions via the mapCompare parameter and the defaultMapCompare function.
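The idea of turning a map into a sorted array of key, value entries can be sketched with plain Scala collections. This is only an illustration of the concept: `ComparableMapsSketch` is a hypothetical name, and the real function works on Spark structs, arrays and maps rather than Scala values.

```scala
object ComparableMapsSketch {
  // Turn a map into entries sorted by key, so two maps can be ordered or grouped entry by entry.
  def comparable[K: Ordering, V](m: Map[K, V]): Vector[(K, V)] =
    m.toVector.sortBy(_._1)
}
```

Two maps holding the same entries produce the same sorted vector whatever their insertion order, which is what makes sorting, distinct and grouping well defined on the converted form.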
digest_To_Longs¶
digest_To_Longs('digestImpl', fields*) creates an array of longs based on creating the given MessageDigest impl. A 128-bit impl will generate two longs from its digest.
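The relationship between digest width and the number of longs can be sketched with the JDK's `java.security.MessageDigest`. The sketch assumes string fields encoded as UTF-8 and a big-endian split of the digest bytes; how the library actually encodes fields and orders the longs is not stated here, and `DigestToLongsSketch` is a hypothetical name.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object DigestToLongsSketch {
  // Assumption: fields are strings fed to the digest in order as UTF-8 bytes.
  def digestToLongs(digestImpl: String, fields: String*): Array[Long] = {
    val md = MessageDigest.getInstance(digestImpl)
    fields.foreach(f => md.update(f.getBytes(StandardCharsets.UTF_8)))
    val buffer = ByteBuffer.wrap(md.digest())
    // Read whole 8-byte chunks; any trailing bytes (e.g. 4 for SHA-1) are ignored in this sketch.
    Array.fill(buffer.remaining() / java.lang.Long.BYTES)(buffer.getLong())
  }
}
```

With this approach MD5 (128 bits) yields two longs and SHA-256 yields four, and the same inputs always produce the same array.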
digest_To_Longs_Struct¶
digest_To_Longs_Struct('digestImpl', fields*) creates a structure of longs with i0 to iN named fields based on creating the given MessageDigest impl.
disabled_Rule¶
disabled_Rule() returns the DisabledRule Integer result (-2) for use in filtering and to disable rules (which may not signify a version bump)
drop_field¶
drop_field(structure_expr, 'field.subfield'*) removes fields from a structure, but will not remove parent nodes.
This is a wrapped version of Spark 3.4.1's dropField implementation.
failed¶
failed() returns the Failed Integer result (0) for use in filtering
field_Based_ID¶
field_Based_ID('prefix', 'digestImpl', fields*) creates a variable bit length id by using a given MessageDigest impl over the fields, prefix is used with the _base, _i0 and _iN fields in the resulting structure
flatten_Results¶
flatten_Results(dataQualityExpr) expands data quality results into a flat array
flatten_Rule_Results¶
flatten_Rule_Results(dataQualityExpr) expands data quality results into a structure of flattenedResults, salientRule (the one used to create the output) and the rule result.
salientRule will be null if there was no matching rule
from_yaml¶
from_yaml(string, 'ddlType') uses snakeyaml to convert yaml into Spark datatypes
hash_Field_Based_ID¶
hash_Field_Based_ID('prefix', 'digestImpl', fields*) creates a variable bit length id by using a given Guava Hasher impl over the fields, prefix is used with the _base, _i0 and _iN fields in the resulting structure
hash_With¶
hash_With('HASH', fields*) Generates a hash value (array of longs) suitable for using in blooms based on the given Guava hash implementation.
Note: based on testing, the digestToLongs function is faster for SHA256 and MD5.
Valid hashes: MURMUR3_32, MURMUR3_128, MD5, SHA-1, SHA-256, SHA-512, ADLER32, CRC32, SIPHASH24. When an invalid HASH name is provided MURMUR3_128 will be chosen.
Open source Spark 3.1.2/3 issues
On open source Spark 3.1.2/3 this may cause resolver errors due to a downgrade of the Guava version: 15.0 is used on Databricks, open source 3.0.3 uses 16.0.1, and 3.1.2 drops this to 11, which lacks crc32, sipHash24 and adler32.
hash_With_Struct¶
As per hash_With('HASH', fields*) but generates a struct with i0 to ix named longs. This structure is not suitable for blooms.
id_base64¶
id_base64(base, i0, i1, ix) Generates a base64 encoded representation of the id, either the single struct field or the individual parts
id_Equal¶
id_Equal(leftPrefix, rightPrefix) takes two prefixes which will be used to match leftPrefix_base = rightPrefix_base, i0 and i1 fields. It does not currently support more than two i's
id_from_base64¶
id_from_base64(base64) Parses the base64 string with an expected default long size of two, i.e. a 160bit ID. Any string which is not of the correct size will return null.
id_raw_type¶
id_raw_type(idstruct) Given a prefixed id returns the fields without their prefix
id_size¶
id_size(base64) Given a base64 from id_base64 returns the number of _i long fields
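The size arithmetic behind id_base64, id_from_base64 and id_size can be sketched with `java.util.Base64`. The layout used below, a 32-bit base followed by 64-bit longs in big-endian order, is an assumption inferred from the 160bit (two longs) and 96bit (one long) sizes quoted in this document; the library's real byte layout may differ and `IdBase64Sketch` is a hypothetical name.

```scala
import java.nio.ByteBuffer
import java.util.Base64

object IdBase64Sketch {
  private val BaseBytes = java.lang.Integer.BYTES // assumed 32-bit base
  private val LongBytes = java.lang.Long.BYTES

  def idBase64(base: Int, longs: Long*): String = {
    val buffer = ByteBuffer.allocate(BaseBytes + longs.size * LongBytes)
    buffer.putInt(base)
    longs.foreach(l => buffer.putLong(l))
    Base64.getEncoder.encodeToString(buffer.array())
  }

  // Number of long fields encoded after the base.
  def idSize(base64: String): Int =
    (Base64.getDecoder.decode(base64).length - BaseBytes) / LongBytes

  // None plays the role of the null result when the size does not match.
  def idFromBase64(base64: String, expectedLongs: Int = 2): Option[(Int, Seq[Long])] = {
    val bytes = Base64.getDecoder.decode(base64)
    if (bytes.length != BaseBytes + expectedLongs * LongBytes) None
    else {
      val buffer = ByteBuffer.wrap(bytes)
      val base = buffer.getInt()
      Some((base, Vector.fill(expectedLongs)(buffer.getLong())))
    }
  }
}
```

Under these assumptions a 160bit id is 20 bytes and therefore 28 base64 characters, and an id with a single long is rejected by the default two-long parse.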
inc¶
inc() increments the current sum by 1
long_Pair¶
long_Pair(lower, higher) creates a structure with these lower and higher longs
long_Pair_Equal¶
long_Pair_Equal(leftPrefix, rightPrefix) takes two prefixes which will be used to match leftPrefix_lower = rightPrefix_lower and leftPrefix_higher = rightPrefix_higher
long_Pair_From_UUID¶
long_Pair_From_UUID(expr) converts a UUID to a structure with lower and higher longs
map_Contains¶
map_Contains('mapid', expr) returns true if there is an item in the map
map_Lookup¶
map_Lookup('mapid', expr) returns either the lookup in map specified by mapid or null
meanF¶
meanF() simple mean on the results, expecting sum and count type Long
murmur3_ID¶
murmur3_ID('prefix', fields*) Generates a 160bit id using murmur3 hashing over input fields, prefix is used with the _base, _i0 and _i1 fields in the resulting structure
pack_Ints¶
pack_Ints(lower, higher) creates a packed long from two ints, used within result compression
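Packing two ints into one long is a matter of shifting and masking. The sketch below assumes `higher` occupies the top 32 bits and `lower` the bottom 32; the bit layout actually used by the library is not stated in this document, and `PackIntsSketch` is a hypothetical name.

```scala
object PackIntsSketch {
  // Assumption: `higher` goes into the top 32 bits, `lower` into the bottom 32.
  def packInts(lower: Int, higher: Int): Long =
    (higher.toLong << 32) | (lower.toLong & 0xFFFFFFFFL)

  // Reverse the packing, returning (lower, higher).
  def unpack(packed: Long): (Int, Int) =
    ((packed & 0xFFFFFFFFL).toInt, (packed >>> 32).toInt)
}
```

The mask on `lower` matters: without it a negative `lower` would sign-extend and overwrite the `higher` half.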
passed¶
passed() returns the Passed Integer for use in filtering: 10000
prefixed_To_Long_Pair¶
prefixed_To_Long_Pair(field, 'prefix') converts a 128bit longpair field with the given prefix into a higher and lower long pair without prefix.
This is suitable for converting provided ids into uuids, for example via a further call to rngUUID.
print_Code¶
print_Code( [msg], expr ) prints the code generated by an expression, the value variable and the isNull variable and forwards eval calls / type etc. to the expression.
The code is printed once per partition on the executor's standard output, so you will have to check each executor to find the output of the nodes used. To use it with unit testing on a single host you may overwrite the writer function in registerQualityFunctions. You should, however, use a top level object and var to write into (or stream): printCode will not be able to write to stdout properly (Spark redirects / captures stdout) or to non top level objects (due to classloader / function instance issues). Tests on other hosts that do not use stdout should write to a shared file location or similar.
It is not compatible with every expression
Aggregate expressions like aggExpr or sum won't generate code, so they aren't compatible with printCode.
_lambda_ is also incompatible with printCode, both when wrapping a user function and when wrapping the _lambda_ function itself. Similarly the _() placeholder function cannot be wrapped.
Any function expecting a specific signature, like aggExpr or other HigherOrderFunctions such as aggregate or filter, is unlikely to support wrapped arguments.
print_Expr¶
print_Expr( [msg], expr ) prints the expression tree via toString with an optional msg
The message is printed to the driver node's standard output, often shown in notebooks as well. To use it with unit testing you may overwrite the writer function in registerQualityFunctions. You should, however, use a top level object and var to write into (or stream).
probability¶
probability(expr) will translate probability rule results into a double, e.g. 1000 returns 0.01. This is useful for interpreting and filtering on probability based results: 0 -> 10000 non-inclusive
probability_In¶
probability_In(expr, 'bloomid') returns the probability of the expr being in the bloomfilter specified by bloomid.
This function either returns 0.0, where it is definitely not present, or the original FPP where it may be present.
You may use digestToLongs or hashWith as appropriate to use multiple columns safely.
provided_ID¶
provided_ID('prefix', existingLongs) creates an id for an existing array of longs, prefix is used with the _base, _i0 and _iN fields in the resulting structure
results_With¶
results_With( x ) processes results using the lambda x (e.g. (sum, count) -> sum ), which takes sum from the aggregate and count from the number of rows counted. Both the sum type and count type default to LongType.
return_Sum¶
return_Sum( sum type ddl ) just returns the sum and ignores the count param, expands to resultsWith( [sum ddl_type], (sum, count) -> sum)
reverse_Comparable_Maps¶
Reverses a call to comparableMaps.
rng¶
rng() Generates a 128bit random id using XO_RO_SHI_RO_128_PP, encoded as a lower and higher long pair
rng_Bytes¶
rng_Bytes() Generates a 128bit random id using XO_RO_SHI_RO_128_PP, encoded as a byte array
rng_ID¶
rng_ID('prefix') Generates a 160bit random id using XO_RO_SHI_RO_128_PP, prefix is used with the _base, _i0 and _i1 fields in the resulting structure
rng_UUID¶
rng_UUID(expr) takes either a structure with lower and higher longs or a 128bit binary type and converts to a string uuid - use with, for example, the rng() function.
If a simple conversion from two longs (lower, higher) to a uuid is desired then use as_uuid. rng_uuid applies the same transformations as the Spark uuid to the input higher and lower longs.
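The difference from a plain two-long conversion can be sketched as follows. The sketch assumes that "the same transformations as the Spark uuid" means the standard RFC 4122 version 4 and variant bit masking applied to random uuids, and that `higher` feeds the most significant half; both are assumptions, and `RngUuidSketch` is a hypothetical name.

```scala
import java.util.UUID

object RngUuidSketch {
  // Force the version nibble to 4 and the variant bits to binary 10, as for a random (version 4) uuid.
  def asRandomUuid(lower: Long, higher: Long): UUID = {
    val most = (higher & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L
    val least = (lower | 0x8000000000000000L) & 0xBFFFFFFFFFFFFFFFL
    new UUID(most, least)
  }

  // The untouched conversion, for comparison.
  def plainUuid(lower: Long, higher: Long): UUID = new UUID(higher, lower)
}
```

Since the masking overwrites six bits of the input, the original longs cannot be fully recovered from the resulting uuid, which is one reason the two conversions are not interchangeable.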
rule_result¶
rule_result(ruleSuiteResultColumn, packedRuleSuiteId, packedRuleSetId, packedRuleId) uses the packed long ids to retrieve the integer ruleResult (see below for ExpressionRunner) or null if it can't be found.
You can use pack_ints(id, version) to specify each id if you don't already have the packed long version. This is suitable for retrieving individual rule results, for example to aggregate counts of a specific rule result, without having to resort to using filter and map values.
rule_result works with ruleRunner (DQ) results (including details) and ExpressionRunner results. ExpressionRunner results return a tuple of ruleResult and resultDDL, both strings, or a single string if strip_result_ddl is called.
rule_Suite_Result_Details¶
rule_Suite_Result_Details(dq) strips the overallResult from the dataquality results, suitable for keeping overall result as a top-level field with associated performance improvements
small_Bloom¶
small_Bloom(buildFrom, expectedSize, expectedFPP) creates a simple byte array bloom filter using the expected size and FPP (0.01 is 99%; you may have to cast to double in Spark 3.2). buildFrom can be driven by digestToLongs or hashWith functions when using multiple fields.
soft_Fail¶
soft_Fail(ruleexpr) will treat any rule failure (e.g. failed() ) as returning softFailed()
soft_Failed¶
soft_Failed() returns the SoftFailed Integer result (-1) for use in filtering
strip_result_ddl¶
strip_result_ddl(expressionsResult) removes the resultDDL field from expressionsRunner results, leaving only the string result itself for more compact storage
sum_With¶
sum_With( x ) adds expression x for each row processed in an aggExpr with a default of LongType
to_yaml¶
to_yaml(expression, [options map]) uses snakeyaml to convert Spark datatypes into yaml.
Passing null into the function returns a null yaml (newline is appended):
null
All null values will be treated in this fashion. The string "null" will be represented as (again new line is present):
'null'
The optional "options map" parameter currently supports the following output options:
- useFullScalarType, defaults to false. Instead of using the default yaml tags uses the full classnames for scalars, reducing risk of precision loss if the yaml is to be used outside of the from_yaml function.
sample usage:
val df = sparkSession.sql("select array(1,2,3,4,5) og")
.selectExpr("*", "to_yaml(og, map('useFullScalarType', 'true')) y")
.selectExpr("*", "from_yaml(y, 'array<int>') f")
.filter("f == og")
snakeyaml is provided scope
Databricks runtimes provide snakeyaml, so whilst Quality builds against the correct versions for Databricks it can only use provided scope.
snakeyaml is 1.24 on DBRs below 13.1, but not present on OSS, so you may need to add the dependency yourself. Tested compatible versions are 1.24 and 1.33.
unique_ID¶
unique_ID('prefix') Generates a 160bit guaranteed unique id (requires MAC address uniqueness) with contiguous higher values within a partition and overflow with timestamp ms. The prefix is used with the _base, _i0 and _i1 fields in the resulting structure.
unpack¶
unpack(expr) takes a packed rule long and unpacks it to a .id and .version structure
unpack_Id_Triple¶
unpack_Id_Triple(expr) takes a packed rule triple of longs (ruleSuiteId, ruleSetId and ruleId) and unpacks it to (ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion)
update_field¶
update_field(structure_expr, 'field.subfield', replaceWith, 'fieldN', replaceWithN) processes structures allowing you to replace sub items (think lens in functional programming) using the structure fields path name.
This is a wrapped version of Spark 3.4.1's withField implementation.
za_Field_Based_ID¶
za_Field_Based_ID('prefix', 'digestImpl', fields*) creates a 64bit id (96bit including header) by using a given Zero Allocation impl over the fields, prefix is used with the _base and _i0 fields in the resulting structure.
Prefer using zaLongsFieldBasedID for fewer collisions.
za_Hash_Longs_With¶
za_Hash_Longs_With('HASH', fields*) generates a multi length long array but with a zero allocation implementation. This structure is suitable for blooms, the default XXH3 algorithm is the 128bit version of that used by the internal bigBloom implementation.
Available HASH functions are MURMUR3_128, XXH3
za_Hash_Longs_With_Struct¶
Similar to za_Hash_Longs_With('HASH', fields*) but generates an ID relevant multi length long struct, which is not suitable for blooms.
za_Hash_With¶
za_Hash_With('HASH', fields*) generates a single length long array always with 64 bits but with a zero allocation implementation. This structure is suitable for blooms, the default XX algorithm is used by the internal bigBloom implementation.
Available HASH functions are MURMUR3_64, CITY_1_1, FARMNA, FARMOU, METRO, WY_V3, XX
za_Hash_With_Struct¶
Similar to za_Hash_With('HASH', fields*) but generates an ID relevant multi length long struct (of one long), which is not suitable for blooms.
Prefer zaHashLongsWithStruct for reduced collisions with either the MURMUR3_128 or XXH3 versions of hashes.
za_Longs_Field_Based_ID¶
za_Longs_Field_Based_ID('prefix', 'digestImpl', fields*) creates a variable length id by using a given Zero Allocation impl over the fields, prefix is used with the _base, _i0 and _iN fields in the resulting structure. MURMUR3_128 is faster here than with the Guava implementation.
Created: December 3, 2024 17:06:32