package classicFunctions
Collection of the Quality Spark expressions for use in select(Column*)
Inheritance
- classicFunctions
- ProcessDisableIfMissingImports
- ValidationImports
- ClassicRuleEngineRunnerImports
- LambdaFunctionsImports
- SerializingImports
- BlockSplitBloomFilterImports
- BloomFilterLookupImports
- LookupIdFunctionsImports
- MapLookupImportsShared
- Serializable
- BloomFilterRegistration
- ClassicRuleRunnerFunctionsImport
- BucketedCreatorFunctionImports
- BloomFilterTypes
- MapLookupFunctionImports
- ClassicRuleFolderRunnerImports
- BloomExpressionFunctions
- ClassicRuleRunnerImports
- BloomFilterLookupFunctionImport
- AnyRef
- Any
Type Members
- type BloomFilterMap = Map[String, (BloomLookup, Double)]
Used as a param to load the bloomfilter expr
- Definition Classes
- BloomFilterTypes
- final class EmptyToGenerateScalaDocs extends AnyRef
Forces scaladoc to generate the package - https://github.com/scala/bug/issues/8124
- Attributes
- protected[quality]
- type IdTriple = (Id, Id, Id)
- Definition Classes
- LambdaFunctionsImports
- type MapCreator = () => (DataFrame, Column, Column)
- Definition Classes
- MapLookupImportsShared
- type MapLookups = Map[String, (Broadcast[Map[AnyRef, AnyRef]], DataType)]
Used as a param to load the map lookups - note the type of the broadcast is always Map[AnyRef, AnyRef]
- Definition Classes
- MapLookupImportsShared
Concrete Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- Any
- final def ##: Int
- Definition Classes
- Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def big_bloom(bloomOver: Column, expectedNumberOfRows: Long, expectedFPP: Double, id: String, bucketedFilesRoot: BucketedFilesRoot): Column
Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.
This bloom is an array of byte arrays, with the underlying arrays stored on file, as such it is only limited by file system size.
- id
- the id of this bloom used to separate from other big_blooms - defaults to a uuid
- bucketedFilesRoot
- provide this to override the default file root for the underlying arrays - defaults to a temporary file location or /dbfs/ on Databricks. The config property sparkutils.quality.bloom.root can also be used to set for all blooms in a cluster.
- Definition Classes
- BloomExpressionFunctions
- Annotations
- @ClassicOnly()
- def big_bloom(bloomOver: Column, expectedNumberOfRows: Long, expectedFPP: Double, id: String): Column
Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.
This bloom is an array of byte arrays, with the underlying arrays stored on file, as such it is only limited by file system size.
- id
- the id of this bloom used to separate from other big_blooms - defaults to a uuid
- Definition Classes
- BloomExpressionFunctions
- Annotations
- @ClassicOnly()
- def big_bloom(bloomOver: Column, expectedNumberOfRows: Long, expectedFPP: Double): Column
Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.
This bloom is an array of byte arrays, with the underlying arrays stored on file, as such it is only limited by file system size.
- Definition Classes
- BloomExpressionFunctions
- Annotations
- @ClassicOnly()
- def big_bloom(bloomOver: Column, expectedNumberOfRows: Column, expectedFPP: Column, id: Column = lit(java.util.UUID.randomUUID().toString), bucketedFilesRoot: BucketedFilesRoot = BucketedFilesRoot(FileRoot(com.sparkutils.quality.classicFunctions.bloomFileLocation))): Column
Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.
This bloom is an array of byte arrays, with the underlying arrays stored on file, as such it is only limited by file system size.
- id
- the id of this bloom used to separate from other big_blooms - defaults to a uuid, this cannot refer to a row column or expression
- bucketedFilesRoot
- provide this to override the default file root for the underlying arrays - defaults to a temporary file location or /dbfs/ on Databricks. The config property sparkutils.quality.bloom.root can also be used to set for all blooms in a cluster.
- Definition Classes
- BloomExpressionFunctions
- Annotations
- @ClassicOnly()
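A minimal sketch of aggregating a big bloom in a select; the DataFrame df and its id column are illustrative, and the functions are assumed to be imported from this package:

```scala
import com.sparkutils.quality.classicFunctions._
import org.apache.spark.sql.functions.col

// Aggregate a file-backed bloom over the id column; the named id ("customer_ids")
// separates this bloom's files from those of other big_blooms.
val bloomDF = df.select(
  big_bloom(col("id"), 1000000L, 0.01, "customer_ids")
)
```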
- val bloomFileLocation: String
Where the bloom filter files will be stored
- Definition Classes
- BucketedCreatorFunctionImports
- def bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double, bloomId: String, partitions: Int): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
- def bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double, bloomId: String): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
- def bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
- def bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
- def bloomFrom(dataFrame: DataFrame, bloomOn: Column, expectedSize: Int, fpp: Double = 0.01, bloomId: String = java.util.UUID.randomUUID().toString, partitions: Int = 0): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
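A hedged sketch of building a BloomModel and turning it into a lookup; df and its id column are illustrative:

```scala
import com.sparkutils.quality.classicFunctions._
import org.apache.spark.sql.functions.col

// Build a bloom over the id column for ~1M expected distinct values,
// using the default fpp of 0.01, then create a lookup from the model.
val model: BloomModel = bloomFrom(df, col("id"), 1000000)
val lookup: BloomLookup = bloomLookup(model)
```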
- def bloomLookup(bucketedFiles: BloomModel): BloomLookup
Creates a very large bloom filter from multiple buckets of 2GB arrays backed by the default Parquet implementation
- def bloomLookup(bytes: Array[Byte]): BloomLookup
Creates a bloom filter from an array of bytes using the default Parquet bloom filter implementation
- def enableFunNRewrites(): Unit
Enables the FunNRewrite optimisation. Where a user-configured LambdaFunction does not have nested HigherOrderFunctions, or declares the /* USED_AS_LAMBDA */ comment, the lambda function will be expanded, replacing all LambdaVariables with the input expressions.
- Definition Classes
- ClassicRuleRunnerFunctionsImport
- def enableOptimizations(extraOptimizations: Seq[org.apache.spark.sql.catalyst.rules.Rule[LogicalPlan]]): Unit
Enables optimisations via experimental.extraOptimizations, first checking if they are already present and ensuring all are added in order. Can be used in place of enableFunNRewrites or in addition to it; for example, adding ConstantFolding may improve performance for given rule types.
- extraOptimizations
rules to be added in order
- Definition Classes
- ClassicRuleRunnerFunctionsImport
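For example, ConstantFolding from Spark's catalyst optimizer can be added; a minimal sketch:

```scala
import com.sparkutils.quality.classicFunctions._
import org.apache.spark.sql.catalyst.optimizer.ConstantFolding

// Registers the rule via experimental.extraOptimizations if not already present.
enableOptimizations(Seq(ConstantFolding))
```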
- def equals(arg0: Any): Boolean
- Definition Classes
- Any
- def getBlooms(ruleSuite: RuleSuite): Seq[String]
Identifies bloom ids before (or after) resolving for a given ruleSuite; use this to know which bloom filters need to be loaded.
- ruleSuite
a ruleSuite full of expressions to check
- returns
The bloom id's used, for unresolved expression trees this may contain blooms which are not present in the bloom map
- Definition Classes
- BloomFilterLookupImports
- Annotations
- @ClassicOnly()
- def hashCode(): Int
- Definition Classes
- Any
- def identifyLookups(ruleSuite: RuleSuite): LookupResults
Use this function to identify which maps / blooms etc. are used by a given ruleSuite. It collects all rules that use lookup functions without constant expressions, together with the list of lookups that are constants.
- Definition Classes
- LookupIdFunctionsImports
- def integrateLambdas(ruleSuiteMap: RuleSuiteMap, lambdas: Map[Id, Seq[LambdaFunction]], globalLibrary: Option[Id] = None): RuleSuiteMap
Add any of the Lambdas loaded for the rule suites
- globalLibrary
- all lambdas with this RuleSuite Id will be added to all RuleSuites
- Definition Classes
- SerializingImports
- def integrateMetaRuleSets(dataFrame: DataFrame, ruleSuiteMap: RuleSuiteMap, metaRuleSetMap: Map[Id, Seq[MetaRuleSetRow]], stablePosition: (String) => Int, transform: (DataFrame) => DataFrame = identity): RuleSuiteMap
Integrates meta rulesets into the rulesuites. Note this only works for a specific dataset; if rulesuites should be filtered for a given dataset then this must take place before calling.
- dataFrame
the dataframe to identify columns for metarules
- ruleSuiteMap
the ruleSuites relevant for this dataframe
- metaRuleSetMap
the meta rulesets
- stablePosition
this function must maintain the law that each column name within a RuleSet always generates the same position
- Definition Classes
- SerializingImports
- def integrateOutputExpressions(ruleSuiteMap: RuleSuiteMap, outputs: Map[Id, Seq[OutputExpressionRow]], globalLibrary: Option[Id] = None): (RuleSuiteMap, Map[Id, Set[Rule]])
Returns an integrated ruleSuiteMap with a set of RuleSuite Id -> Rule mappings where the OutputExpression didn't exist. If defaultProcessor is expected to be used then call integrateRuleSuites *before*.
Users should check if their RuleSuite is in the "error" map. The isAMissingRuleSuiteRule function can be used to identify if a Rule refers to a missing RuleSuite defaultProcessor.
- Definition Classes
- SerializingImports
- def integrateRuleSuites(ruleSuiteMap: RuleSuiteMap, ruleSuites: Map[Id, RuleSuiteRow]): RuleSuiteMap
Returns an integrated ruleSuiteMap with probablePass and defaultOutputExpression applied
Call this before calling integrateOutputExpressions.
- Definition Classes
- SerializingImports
- def isAMissingRuleSuiteRule(rule: Rule): Boolean
Identify if the missing OutputExpression from a RuleSuite is from a defaultProcessor
- Definition Classes
- SerializingImports
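The integration order described above (integrateRuleSuites before integrateOutputExpressions, then integrateLambdas) can be sketched as follows; the input values are illustrative and assumed to have been loaded via the read* functions in this package:

```scala
import com.sparkutils.quality.classicFunctions._

// ruleSuiteMap, ruleSuiteRows, outputRows and lambdas are illustrative inputs.
val withSuites = integrateRuleSuites(ruleSuiteMap, ruleSuiteRows)
val (integrated, missingOutputs) = integrateOutputExpressions(withSuites, outputRows)

// Rules referring to a missing RuleSuite defaultProcessor can be identified:
missingOutputs.foreach { case (suiteId, rules) =>
  rules.filter(isAMissingRuleSuiteRule).foreach(r => println(s"$suiteId missing: $r"))
}

val complete = integrateLambdas(integrated, lambdas)
```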
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def loadBloomConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column, bigBloom: Column, value: Column, numberOfElements: Column, expectedFPP: Column): (Seq[BloomConfig], Set[String])
Loads bloom configurations from a given DataFrame. Wherever token is present, loader will be called and the filter optionally applied.
- returns
A tuple of BloomConfigs and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- BloomFilterLookupImports
- def loadBloomConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column, bigBloom: Column, value: Column, numberOfElements: Column, expectedFPP: Column): (Seq[BloomConfig], Set[String])
Loads bloom configurations from a given DataFrame for ruleSuiteId. Wherever token is present, loader will be called and the filter optionally applied.
- returns
A tuple of BloomConfigs and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- BloomFilterLookupImports
- def loadBlooms(configs: Seq[BloomConfig]): impl.bloom.BloomExpressionLookup.BloomFilterMap
Loads bloom maps ready to register from configuration
- Definition Classes
- BloomFilterLookupImports
- def loadMapConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column, key: Column, value: Column): (Seq[MapConfig], Set[String])
Loads map configurations from a given DataFrame. Wherever token is present loader will be called and the filter optionally applied.
- returns
A tuple of MapConfig's and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- MapLookupImportsShared
- def loadMapConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column, key: Column, value: Column): (Seq[MapConfig], Set[String])
Loads map configurations from a given DataFrame for ruleSuiteId. Wherever token is present loader will be called and the filter optionally applied.
- returns
A tuple of MapConfig's and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- MapLookupImportsShared
- def loadMaps(configs: Seq[MapConfig]): MapLookups
- Definition Classes
- MapLookupImportsShared
- def mapLookupsFromDFs(creators: Map[String, MapCreator]): MapLookups
Loads maps to broadcast, each individual dataframe may have different associated expressions
- creators
a map of string id to MapCreator
- returns
a map of id to broadcast variables needed for exact lookup and mapping checks
- Definition Classes
- MapLookupImportsShared
- def map_contains(mapLookupName: String, lookupKey: Column, mapLookups: quality.MapLookups): Column
Tests if there is a stored value from a map via the name mapLookupName and 'key' lookupKey. Implementation is map_lookup.isNotNull
- Definition Classes
- MapLookupFunctionImports
- def map_lookup(mapLookupName: String, lookupKey: Column, mapLookups: quality.MapLookups): Column
Retrieves the stored value from a map via the name mapLookupName and 'key' lookupKey
- Definition Classes
- MapLookupFunctionImports
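A combined sketch of creating the broadcast lookups and using them in expressions; the DataFrames and column names are illustrative:

```scala
import com.sparkutils.quality.classicFunctions._
import org.apache.spark.sql.functions.col

// Each MapCreator yields (DataFrame, keyColumn, valueColumn).
val lookups = mapLookupsFromDFs(Map(
  "countries" -> (() => (countryDF, col("isoCode"), col("name")))
))

val out = df.select(
  map_lookup("countries", col("country_code"), lookups).as("country_name"),
  // map_contains is implemented as map_lookup.isNotNull
  map_contains("countries", col("country_code"), lookups).as("known_country")
)
```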
- def namesFromSchema(schema: StructType): Set[String]
Retrieves the names (with parent fields following the . notation for any nested fields) from the schema
- Definition Classes
- LookupIdFunctionsImports
- def optimalNumOfBits(n: Long, p: Double): Int
Calculate optimal size according to the number of distinct values and false positive probability.
- n
The number of distinct values.
- p
The false positive probability.
- returns
the optimal number of bits for the given n and p.
- Definition Classes
- BlockSplitBloomFilterImports
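For reference, the textbook sizing this corresponds to is m = -n·ln(p) / (ln 2)²; a hedged pure-Scala sketch (the actual block-split implementation may round to block boundaries):

```scala
// Approximate optimal bit count for n distinct values at false positive
// probability p; block-split implementations typically round this up.
def approxOptimalBits(n: Long, p: Double): Long =
  math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong
```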
- def optimalNumberOfBuckets(n: Long, p: Double): Long
- Definition Classes
- BlockSplitBloomFilterImports
- def probability_in(lookupValue: Column, filterName: String, bloomMap: Broadcast[impl.bloom.BloomExpressionLookup.BloomFilterMap]): Column
Lookup the value against a bloom filter from bloomMap with name filterName.
- Definition Classes
- BloomFilterLookupFunctionImport
- Annotations
- @ClassicOnly()
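A hedged end-to-end sketch: build a model, wrap it in a BloomFilterMap keyed by name, broadcast it, and test membership; spark, df and otherDF are illustrative:

```scala
import com.sparkutils.quality.classicFunctions._
import org.apache.spark.sql.functions.col

val model = bloomFrom(df, col("id"), 1000000)
// BloomFilterMap = Map[String, (BloomLookup, Double)] - the Double is the fpp.
val bloomMap: BloomFilterMap = Map("ids" -> (bloomLookup(model), 0.01))
val broadcastBlooms = spark.sparkContext.broadcast(bloomMap)

val checked = otherDF.select(probability_in(col("id"), "ids", broadcastBlooms))
```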
- def processCoalesceIfAttributeMissing(expression: Expression, names: Set[String]): Expression
- Definition Classes
- ProcessDisableIfMissingImports
- def processIfAttributeMissing(ruleSuite: RuleSuite, schema: StructType = StructType(Seq())): RuleSuite
Processes a given RuleSuite to replace any coalesceIfMissingAttributes. This may be called before validate / docs but *must* be called *before* adding the expression to a dataframe.
- schema
The names to validate against, if empty no attempt to process coalesceIfAttributeMissing will be made
- Definition Classes
- ProcessDisableIfMissingImports
- def readLambdaRowsFromDF(lambdaFunctionDF: DataFrame, lambdaFunctionName: Column, lambdaFunctionExpression: Column, lambdaFunctionId: Column, lambdaFunctionVersion: Column, lambdaFunctionRuleSuiteId: Column, lambdaFunctionRuleSuiteVersion: Column): Dataset[LambdaFunctionRow]
Loads lambda functions
- Definition Classes
- SerializingImports
- def readLambdasFromDF(lambdaFunctionDF: DataFrame, lambdaFunctionName: Column, lambdaFunctionExpression: Column, lambdaFunctionId: Column, lambdaFunctionVersion: Column, lambdaFunctionRuleSuiteId: Column, lambdaFunctionRuleSuiteVersion: Column): Map[Id, Seq[LambdaFunction]]
Loads lambda functions
- Definition Classes
- SerializingImports
- def readMetaRuleSetsFromDF(metaRuleSetDF: DataFrame, columnSelectionExpression: Column, lambdaFunctionExpression: Column, metaRuleSetId: Column, metaRuleSetVersion: Column, metaRuleSuiteId: Column, metaRuleSuiteVersion: Column): Map[Id, Seq[MetaRuleSetRow]]
- Definition Classes
- SerializingImports
- def readOutputExpressionRowsFromDF(outputExpressionDF: DataFrame, outputExpression: Column, outputExpressionId: Column, outputExpressionVersion: Column, outputExpressionRuleSuiteId: Column, outputExpressionRuleSuiteVersion: Column): Dataset[OutputExpressionRow]
Loads output expressions
- Definition Classes
- SerializingImports
- def readOutputExpressionsFromDF(outputExpressionDF: DataFrame, outputExpression: Column, outputExpressionId: Column, outputExpressionVersion: Column, outputExpressionRuleSuiteId: Column, outputExpressionRuleSuiteVersion: Column): Map[Id, Seq[OutputExpressionRow]]
Loads output expressions
- Definition Classes
- SerializingImports
- def readRuleRowsFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, outputExpressionSalience: Column, outputExpressionId: Column, outputExpressionVersion: Column): Dataset[RuleRow]
Loads RuleRows from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
- def readRuleRowsFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, ruleEngine: Option[(Column, Column, Column)] = None): Dataset[RuleRow]
Loads RuleRows from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
- def readRuleSuitesFromDF(ruleSuites: Dataset[RuleSuiteRow]): Map[Id, RuleSuiteRow]
Loads RuleSuite specific attributes to use with integrateRuleSuites
- Definition Classes
- SerializingImports
- def readRulesFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, ruleEngineSalience: Column, ruleEngineId: Column, ruleEngineVersion: Column): RuleSuiteMap
Loads a RuleSuite from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
- def readRulesFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column): RuleSuiteMap
Loads a RuleSuite from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
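A minimal loading sketch; the column names on rulesDF are illustrative:

```scala
import com.sparkutils.quality.classicFunctions._
import org.apache.spark.sql.functions.col

val ruleSuiteMap = readRulesFromDF(rulesDF,
  col("ruleSuiteId"), col("ruleSuiteVersion"),
  col("ruleSetId"), col("ruleSetVersion"),
  col("ruleId"), col("ruleVersion"),
  col("ruleExpr"))
```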
- def registerBloomMapAndFunction(bloomFilterMap: Broadcast[impl.bloom.BloomExpressionLookup.BloomFilterMap]): Unit
Registers this bloom map and associates the probabilityIn sql expression against it
- Definition Classes
- BloomFilterRegistration
- Annotations
- @ClassicOnly()
- def registerLambdaFunctions(functions: Seq[LambdaFunction]): Unit
This function is for use with directly loading lambda functions to use within normal spark apps.
Please be aware loading rules via ruleRunner, addOverallResultsAndDetails or addDataQuality will also load any registered rules applicable for that ruleSuite and may override other registered rules. For that reason you should not use this function directly together with RuleSuites (otherwise your results will not be reproducible).
- Definition Classes
- LambdaFunctionsImports
- def registerQualityFunctions(parseTypes: (String) => Option[DataType] = defaultParseTypes, zero: (DataType) => Option[Any] = defaultZero, add: (DataType) => Option[(Expression, Expression) => Expression] = (dataType: DataType) => defaultAdd(dataType), mapCompare: (DataType) => Option[(Any, Any) => Int] = (dataType: DataType) => utils.defaultMapCompare(dataType), writer: (String) => Unit = println, registerFunction: (String, (Seq[Expression]) => Expression) => Unit = (n, f) => ShimUtils.registerFunction(SparkSession.active)(n,f), fromExtension: Boolean = false): Unit
Must be called before using any functions like Passed, Failed or Probability(X)
- parseTypes
override type parsing (e.g. DDL, defaults to defaultParseTypes / DataType.fromDDL)
- zero
override zero creation for aggExpr (defaults to defaultZero)
- add
override the "add" function for aggExpr types (defaults to defaultAdd(dataType))
- writer
override the printCode and printExpr print writing function (defaults to println)
- registerFunction
function to register the sql extensions
- fromExtension
only the SparkExtension should set this to true
- Definition Classes
- ClassicRuleRunnerFunctionsImport
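In most cases the defaults suffice, so registration is a single call made before any Quality expressions are used; a minimal sketch:

```scala
import com.sparkutils.quality.classicFunctions._

// Registers Passed, Failed, Probability(X) and friends against the active SparkSession.
registerQualityFunctions()
```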
- def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: DataType): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite
- ruleSuite
The ruleSuite with runOnPassProcessors
- resultDataType
The type of the results from runOnPassProcessors - must be the same for all result types, by default most fields will be nullable and encoding must follow the fields when not specified.
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- ClassicRuleEngineRunnerImports
- def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: DataType, debugMode: Boolean): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite
- ruleSuite
The ruleSuite with runOnPassProcessors
- resultDataType
The type of the results from runOnPassProcessors - must be the same for all result types, by default most fields will be nullable and encoding must follow the fields when not specified.
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- ClassicRuleEngineRunnerImports
- def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: Option[DataType] = None, compileEvals: Boolean = false, debugMode: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, forceTriggerEval: Boolean = false): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite.
NOTE: when used in a Connect session, resolveWith, compileEvals, forceRunnerEval and forceTriggerEval are ignored and their defaults are used
- ruleSuite
The ruleSuite with runOnPassProcessors
- resultDataType
The type of the results from runOnPassProcessors - must be the same for all result types, by default most fields will be nullable and encoding must follow the fields when not specified.
- compileEvals
Should the rules be compiled out to interim objects - by default false
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- resolveWith
This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this.
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen
- variableFuncGroup
Defaulting to 20
- forceRunnerEval
Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)
- forceTriggerEval
Defaulting to false
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- ClassicRuleEngineRunnerImports
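A minimal sketch adding a rule engine column; ruleSuite and df are assumed to have been prepared via the loading and integration functions above:

```scala
import com.sparkutils.quality.classicFunctions._

// All parameters beyond the RuleSuite are defaulted.
val withRules = df.withColumn("quality", ruleEngineRunner(ruleSuite))
```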
- def ruleFolderRunner(ruleSuite: RuleSuite, startingStruct: Column, compileEvals: Boolean = false, debugMode: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, useType: Option[StructType] = None, forceTriggerEval: Boolean = false): Column
Creates a column that runs the folding RuleSuite. This also forces registering the lambda functions used by that RuleSuite.
FolderRunner runs all output expressions for matching rules in order of salience; the startingStruct is passed to the first matching rule, the result to the second, etc. In contrast to ruleEngineRunner, OutputExpressions should be lambdas with one parameter, that of the structure.
NOTE: when used in a Connect session, resolveWith, compileEvals, forceRunnerEval and forceTriggerEval are ignored and their defaults are used
- ruleSuite
The ruleSuite with runOnPassProcessors
- startingStruct
This struct is passed to the first matching rule, ideally you would use the spark dsl struct function to refer to existing columns
- compileEvals
Should the rules be compiled out to interim objects - by default false, allowing optimisations
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- resolveWith
This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this.
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen
- variableFuncGroup
Defaulting to 20
- forceRunnerEval
Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)
- useType
In the case you must use select and can't use withColumn you may provide a type directly to stop the NPE
- forceTriggerEval
Defaulting to false, passing true forces each trigger expression to be compiled (compileEvals) and used in place, false instead expands the trigger in-line giving possible performance boosts based on JIT
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- ClassicRuleFolderRunnerImports
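A minimal sketch of folding over a starting struct built from existing columns; df and its columns are illustrative:

```scala
import com.sparkutils.quality.classicFunctions._
import org.apache.spark.sql.functions.{col, struct}

// Each matching rule's lambda receives the struct and returns the updated struct.
val folded = df.withColumn("folded",
  ruleFolderRunner(ruleSuite, struct(col("a"), col("b"))))
```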
- def ruleFolderRunnerClassic(ruleSuite: RuleSuite, startingStruct: Column, compileEvals: Boolean = false, debugMode: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, useType: Option[StructType] = None, forceTriggerEval: Boolean = false): Column
Creates a column that runs the folding RuleSuite. This also forces registering the lambda functions used by that RuleSuite.
FolderRunner runs all output expressions for matching rules in order of salience; the startingStruct is passed to the first matching rule, the result passed to the second, etc. In contrast to ruleEngineRunner, OutputExpressions should be lambdas with one parameter, that of the structure.
- ruleSuite
The ruleSuite with runOnPassProcessors
- startingStruct
This struct is passed to the first matching rule, ideally you would use the spark dsl struct function to refer to existing columns
- compileEvals
Should the rules be compiled out to interim objects - by default false, allowing optimisations
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- resolveWith
This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this.
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen
- variableFuncGroup
Defaulting to 20
- forceRunnerEval
Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)
- useType
In case you must use select and can't use withColumn, you may provide a type directly to stop the NPE
- forceTriggerEval
Defaulting to false, passing true forces each trigger expression to be compiled (compileEvals) and used in place, false instead expands the trigger in-line giving possible performance boosts based on JIT
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- ClassicRuleFolderRunnerImports
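A minimal sketch of calling ruleFolderRunnerClassic against the signature above, assuming a RuleSuite built elsewhere and hypothetical column names (price, quantity):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.struct

// assumption: df and ruleSuite are provided by the caller
def addFolded(df: DataFrame, ruleSuite: RuleSuite): DataFrame =
  df.withColumn(
    "folded",
    ruleFolderRunnerClassic(
      ruleSuite,
      struct(df("price"), df("quantity")), // startingStruct from existing columns (hypothetical names)
      debugMode = true                     // wrap results in (salience, result) pairs for debugging
    )
  )
```

The struct passed as startingStruct is folded through each matching rule's output lambda in salience order.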
- def ruleRunner(ruleSuite: RuleSuite, compileEvals: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite
NOTE: if using this in a Connect session, resolveWith, compileEvals and forceRunnerEval are ignored and defaults are used
- ruleSuite
The Quality RuleSuite to evaluate
- compileEvals
Should the rules be compiled out to interim objects - by default false
- resolveWith
This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this. RuleRunner does not currently do wholestagecodegen when resolveWith is used.
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup it allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen. You _shouldn't_ need it but it's there just in case.
- variableFuncGroup
Defaulting to 20
- forceRunnerEval
Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)
- returns
A Column representing the Quality DQ expression built from this ruleSuite
- Definition Classes
- ClassicRuleRunnerImports
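A sketch of attaching the DQ column with ruleRunner, under the same assumption of a pre-built RuleSuite:

```scala
import org.apache.spark.sql.DataFrame

// assumption: df and ruleSuite are provided by the caller
def addQuality(df: DataFrame, ruleSuite: RuleSuite, interpreted: Boolean = false): DataFrame =
  if (interpreted)
    // force the simplified partially interpreted evaluation
    df.withColumn("DataQuality", ruleRunner(ruleSuite, forceRunnerEval = true))
  else
    // default: compiled evaluation, subject to WholeStageCodeGen
    df.withColumn("DataQuality", ruleRunner(ruleSuite))
```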
- def small_bloom(bloomOver: Column, expectedNumberOfRows: Long, expectedFPP: Double): Column
Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.
This bloom fits in a single byte array; as such it is limited in the number of elements it can hold while still maintaining the fpp. Use big_bloom where the item counts are high and the fpp must hold.
- Definition Classes
- BloomExpressionFunctions
- Annotations
- @ClassicOnly()
- def small_bloom(bloomOver: Column, expectedNumberOfRows: Column, expectedFPP: Column): Column
Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.
This bloom fits in a single byte array; as such it is limited in the number of elements it can hold while still maintaining the fpp. Use big_bloom where the item counts are high and the fpp must hold.
- Definition Classes
- BloomExpressionFunctions
- Annotations
- @ClassicOnly()
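A sketch of both small_bloom overloads, assuming a hypothetical id column and illustrative sizing numbers:

```scala
import org.apache.spark.sql.functions.{col, lit}

// literal overload: ~100k expected rows at 1% fpp (illustrative numbers)
val bloom = df.select(small_bloom(col("id"), 100000L, 0.01).as("bloom"))

// Column overload: the same sizing expressed as Column expressions
val bloomFromCols = df.select(small_bloom(col("id"), lit(100000L), lit(0.01)).as("bloom"))
```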
- def toDS(ruleSuite: RuleSuite): Dataset[RuleRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1
- returns
a Dataset[RuleRow] with the rules flattened out
- Definition Classes
- SerializingImports
- def toLambdaDS(ruleSuite: RuleSuite): Dataset[LambdaFunctionRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1
- returns
a Dataset[LambdaFunctionRow] with the lambda functions flattened out
- Definition Classes
- SerializingImports
- def toOutputExpressionDS(ruleSuite: RuleSuite): Dataset[OutputExpressionRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1
- returns
a Dataset[OutputExpressionRow] with the output expressions flattened out
- Definition Classes
- SerializingImports
- def toRuleSuiteDF(ruleSuite: RuleSuite): DataFrame
Utility function to ease dealing with simple DQ rules where the rule engine functionality is ignored. This can be paired with the default ruleEngine parameter in readRulesFromDF
- Definition Classes
- SerializingImports
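A sketch of persisting a RuleSuite as a DataFrame for later reload; the path is illustrative and readRulesFromDF is the reload counterpart mentioned above:

```scala
// flatten the suite into a DataFrame of simple DQ rules
val rulesDf = toRuleSuiteDF(ruleSuite)

// persist for later reload via readRulesFromDF with its default ruleEngine parameter
rulesDf.write.mode("overwrite").parquet("/tmp/dq_rules")
```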
- def toRuleSuiteRow(ruleSuite: RuleSuite): (RuleSuiteRow, Option[OutputExpressionRow])
Creates a RuleSuiteRow from a RuleSuite for RuleSuite specific attributes and an optional defaultProcessor
- Definition Classes
- SerializingImports
- def toString(): String
- Definition Classes
- Any
- def validate(schemaOrFrame: Either[StructType, DataFrame], ruleSuite: RuleSuite, showParams: ShowParams = ShowParams(), runnerFunction: Option[(DataFrame) => Column] = None, qualityName: String = "Quality", recursiveLambdasSOEIsOk: Boolean = false, transformBeforeShow: (DataFrame) => DataFrame = identity, viewLookup: (String) => Boolean = ViewLoader.defaultViewLookup): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- schemaOrFrame
when it's a Left( StructType ) the struct will be used to test against and an empty DataFrame of this type created to resolve at the Spark level. Using Right(DataFrame) causes that DataFrame to be used, which is great for test cases with a runnerFunction
- showParams
- configure how the output text is formatted using the same options and formatting as dataFrame.show
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- qualityName
- the column name to store the runnerFunction results in
- recursiveLambdasSOEIsOk
- this signals that finding a recursive lambda SOE should not stop the evaluations - if true it will still try to run any runnerFunction but may not give the correct results
- transformBeforeShow
- an optional transformation function to help shape what results are pushed to show
- viewLookup
- for any subquery used looks up the view name for being present (quoted and with schema), defaults to the current spark catalogue
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
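A sketch of the full validate overload against the signature above, checking a schema before running (assumes a pre-built RuleSuite):

```scala
// resolve against an empty frame of df's schema; use Right(df) instead for test cases
val (errors, warnings, shown, docs, expressionLookups) =
  validate(
    Left(df.schema),
    ruleSuite,
    runnerFunction = Some(_ => ruleRunner(ruleSuite))
  )

if (errors.nonEmpty) sys.error(s"rule validation failed: $errors")
```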
- def validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) => Column, transformBeforeShow: (DataFrame) => DataFrame): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
the DataFrame to resolve and test against, great for test cases with a runnerFunction
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- transformBeforeShow
- an optional transformation function to help shape what results are pushed to show
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
- def validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) => Column, transformBeforeShow: (DataFrame) => DataFrame, viewLookup: (String) => Boolean): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
the DataFrame to resolve and test against, great for test cases with a runnerFunction
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- transformBeforeShow
- an optional transformation function to help shape what results are pushed to show
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
- def validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) => Column): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
the DataFrame to resolve and test against, great for test cases with a runnerFunction
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
- def validate(frame: DataFrame, ruleSuite: RuleSuite): (Set[RuleError], Set[RuleWarning])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
the DataFrame to resolve and test against, great for test cases with a runnerFunction
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
- def validate(schema: StructType, ruleSuite: RuleSuite): (Set[RuleError], Set[RuleWarning])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- schema
the fields the DataFrame should have
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
- def validate_Lookup(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) => Column, viewLookup: (String) => Boolean): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
the DataFrame to resolve and test against, great for test cases with a runnerFunction
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
- def validate_Lookup(frame: DataFrame, ruleSuite: RuleSuite, viewLookup: (String) => Boolean = ViewLoader.defaultViewLookup): (Set[RuleError], Set[RuleWarning])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
the DataFrame to resolve and test against, great for test cases with a runnerFunction
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
- def validate_Lookup(schema: StructType, ruleSuite: RuleSuite, viewLookup: (String) => Boolean): (Set[RuleError], Set[RuleWarning])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- schema
the fields the DataFrame should have
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
Deprecated Value Members
- def bloomFilterLookup(lookupValue: Column, bloomFilterName: Column, bloomMap: Broadcast[impl.bloom.BloomExpressionLookup.BloomFilterMap]): Column
Lookup the value against a bloom filter from bloomMap with name bloomFilterName. In line with the sql functions please migrate to probability_in
- Definition Classes
- BloomFilterLookupFunctionImport
- Annotations
- @deprecated @ClassicOnly()
- Deprecated
(Since version 0.1.0) Please migrate to bloom_lookup, bloomFilterLookup will be removed in 0.2.0