package quality
Provides an easy import point for the library.
Linear Supertypes: ExpressionRunnerImports, ViewLoading, RuleFolderRunnerImports, ProcessDisableIfMissingImports, ValidationImports, RuleEngineRunnerImports, LambdaFunctionsImports, AddDataFunctionsImports, SerializingImports, BlockSplitBloomFilterImports, BloomFilterLookupImports, LookupIdFunctionsImports, MapLookupImports, Serializable, Serializable, RuleRunnerImports, BloomFilterRegistration, RuleRunnerFunctionsImport, BucketedCreatorFunctionImports, BloomFilterTypes, AnyRef, Any
Type Members
-
type
BloomFilterMap = Map[String, (BloomLookup, Double)]
Used as a param to load the bloom filter expression
- Definition Classes
- BloomFilterTypes
-
trait
BloomLookup extends AnyRef
Simple "does it contain" function to test membership against a bloom filter
-
case class
BloomModel(rootDir: String, fpp: Double, numBuckets: Int) extends Serializable with Product with Serializable
Represents the shared file location of a bucketed bloom filter. There should be files with names 0..numBuckets containing the same number of bytes representing each bucket.
- rootDir
The directory which contains each bucket
- fpp
The fpp for this bloom - note it is informational only and will not be used in further processing
- numBuckets
The number of buckets within this bloom
-
trait
DataFrameLoader extends AnyRef
Simple interface to load DataFrames used by map/bloom and view loading
-
case class
ExpressionRule(rule: String) extends ExprLogic with HasRuleText with Product with Serializable
The result of serializing or loading rules
-
case class
GeneralExpressionResult(ruleResult: String, resultDDL: String) extends Product with Serializable
Represents the expression results of ExpressionRunner
- ruleResult
the result cast to string
- resultDDL
the result type in ddl
-
case class
GeneralExpressionsResult[R](id: VersionedId, ruleSetResults: Map[VersionedId, Map[VersionedId, R]]) extends Serializable with Product
Represents the results of the ExpressionRunner
-
case class
GeneralExpressionsResultNoDDL(id: VersionedId, ruleSetResults: Map[VersionedId, Map[VersionedId, String]]) extends Serializable with Product
Represents the results of the ExpressionRunner after calling strip_result_ddl
-
case class
Id(id: Int, version: Int) extends VersionedId with Product with Serializable
A versioned rule ID - note the name is never persisted in results, the id and version are sufficient to retrieve the name
- id
a unique ID to identify this rule
- version
the version of the rule - again tied to ID
-
type
IdTriple = (Id, Id, Id)
- Definition Classes
- LambdaFunctionsImports
-
type
MapCreator = () ⇒ (DataFrame, Column, Column)
- Definition Classes
- MapLookupImports
-
type
MapLookups = Map[String, (Broadcast[MapData], DataType)]
Used as a param to load the map lookups - note the type of the broadcast is always Map[AnyRef, AnyRef]
- Definition Classes
- MapLookupImports
-
case class
OutputExpression(rule: String) extends OutputExprLogic with HasRuleText with Logging with Product with Serializable
Used as a result of serializing
-
case class
OverallResult(probablePass: Double = 0.8, currentResult: RuleResult = Passed) extends Product with Serializable
Probability results are treated as a pass when above probablePass percent, defaulting to 80% (0.8). The currentResult is Passed until any failure occurs.
-
case class
Probability(percentage: Double) extends RuleResult with Product with Serializable
0-1, with 1 meaning almost certainly a pass
-
case class
QualityException(msg: String, cause: Exception = null) extends RuntimeException with Product with Serializable
Simple marker instead of sys.error
-
case class
Rule(id: Id, expression: RuleLogic, runOnPassProcessor: RunOnPassProcessor = NoOpRunOnPassProcessor.noOp) extends Serializable with Product
A rule to run over a row
-
case class
RuleEngineResult[T](ruleSuiteResults: RuleSuiteResult, salientRule: Option[SalientRule], result: Option[T]) extends Serializable with Product
Results for all rules run against a DataFrame. Note in debug mode you should provide Array[T] instead
- ruleSuiteResults
Overall results from applying the engine
- salientRule
If None, either no rule matched for this row or the engine is in debug mode, which returns all results.
- result
The result for this row. None if no rule matched, or if a rule matched but the output expression returned null.
-
case class
RuleFolderResult[T](ruleSuiteResults: RuleSuiteResult, result: Option[T]) extends Serializable with Product
Results for all rules run against a DataFrame. Note in debug mode you should provide Array[T] instead
- ruleSuiteResults
Overall results from applying the engine
- result
The result for this row. None if no rule matched, or if a rule matched but the output expression returned null.
- sealed trait RuleResult extends Serializable
-
case class
RuleResultWithProcessor(ruleResult: RuleResult, runOnPassProcessor: RunOnPassProcessor) extends RuleResult with Product with Serializable
Packs a rule result with a RunOnPassProcessor processor
- case class RuleSet(id: Id, rules: Seq[Rule]) extends Serializable with Product
-
case class
RuleSetResult(overallResult: RuleResult, ruleResults: Map[VersionedId, RuleResult]) extends Serializable with Product
Result collection for a number of rules
- ruleResults
rule id -> RuleResult
-
case class
RuleSuite(id: Id, ruleSets: Seq[RuleSet], lambdaFunctions: Seq[LambdaFunction] = Seq.empty, probablePass: Double = 0.8) extends Serializable with Product
Represents a versioned collection of RuleSets
- probablePass
override to specify a different percentage for treating probability results as passes - defaults to 80% (0.8)
-
case class
RuleSuiteResult(id: VersionedId, overallResult: RuleResult, ruleSetResults: Map[VersionedId, RuleSetResult]) extends Serializable with Product
Results for all rules run against a dataframe
- id
- the Id of the suite, all other content is mapped
-
case class
RuleSuiteResultDetails(id: VersionedId, ruleSetResults: Map[VersionedId, RuleSetResult]) extends Serializable with Product
Results for all rules run against a dataframe without the overallResult. Performance differences for filtering on top level fields are significant over nested structures even under Spark 3, in the region of 30-50% depending on op.
-
case class
SalientRule(ruleSuiteId: VersionedId, ruleSetId: VersionedId, ruleId: VersionedId) extends Product with Serializable
Represents the rule that matched a given RuleEngine result
- case class SparkBloomFilter(bloom: BloomFilter) extends BloomLookup with Product with Serializable
Abstract Value Members
-
abstract
def
getClass(): Class[_]
- Definition Classes
- Any
Concrete Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- Any
-
final
def
##(): Int
- Definition Classes
- Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- Any
-
val
DisabledRuleInt: Int
The integer value for disabled dq rules
- Definition Classes
- RuleRunnerImports
-
val
FailedInt: Int
The integer value for failed dq rules
- Definition Classes
- RuleRunnerImports
-
val
PassedInt: Int
The integer value for passed dq rules
- Definition Classes
- RuleRunnerImports
-
val
SoftFailedInt: Int
The integer value for soft failed dq rules
- Definition Classes
- RuleRunnerImports
-
def
addDataQuality(dataFrame: DataFrame, rules: RuleSuite, name: String = "DataQuality"): DataFrame
Adds a DataQuality field using the RuleSuite and RuleSuiteResult structure
- Definition Classes
- AddDataFunctionsImports
-
def
addDataQualityF(rules: RuleSuite, name: String = "DataQuality"): (DataFrame) ⇒ DataFrame
Adds a DataQuality field using the RuleSuite and RuleSuiteResult structure for use with dataset.transform functions
- Definition Classes
- AddDataFunctionsImports
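Example (a minimal sketch; assumes a DataFrame df and a constructed RuleSuite ruleSuite are in scope and the package object is imported):
  // adds a "DataQuality" column containing the RuleSuiteResult structure
  val withDq = addDataQuality(df, ruleSuite)
  // the transform-friendly variant, here with a custom column name
  val withDqNamed = df.transform(addDataQualityF(ruleSuite, name = "DQ"))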
-
def
addOverallResultsAndDetails(dataFrame: DataFrame, rules: RuleSuite, overallResult: String = "DQ_overallResult", resultDetails: String = "DQ_Details"): DataFrame
Adds two columns, one for overallResult and the other the details, allowing 30-50% performance gains for simple filters
- Definition Classes
- AddDataFunctionsImports
-
def
addOverallResultsAndDetailsF(rules: RuleSuite, overallResult: String = "DQ_overallResult", resultDetails: String = "DQ_Details"): (DataFrame) ⇒ DataFrame
Adds two columns, one for overallResult and the other the details, allowing 30-50% performance gains for simple filters, for use in dataset.transform functions
- Definition Classes
- AddDataFunctionsImports
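Example (a minimal sketch; df and ruleSuite are assumed to exist):
  // DQ_overallResult and DQ_Details are added as top-level columns
  val dq = addOverallResultsAndDetails(df, ruleSuite)
  // filtering on the flat DQ_overallResult column avoids scanning the nested DQ_Details structure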
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
val
bloomFileLocation: String
Where the bloom filter files will be stored
- Definition Classes
- BucketedCreatorFunctionImports
-
def
bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double, bloomId: String, partitions: Int): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
-
def
bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double, bloomId: String): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
-
def
bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
-
def
bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
-
def
bloomFrom(dataFrame: DataFrame, bloomOn: Column, expectedSize: Int, fpp: Double = 0.01, bloomId: String = ..., partitions: Int = 0): BloomModel
Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.
- bloomOn
a compatible expression to generate the bloom hashes against
- Definition Classes
- BucketedCreatorFunctionImports
-
def
bloomLookup(bucketedFiles: BloomModel): BloomLookup
Creates a very large bloom filter from multiple buckets of 2gb arrays backed by the default Parquet implementation
-
def
bloomLookup(bytes: Array[Byte]): BloomLookup
Creates a bloom filter from an array of bytes using the default Parquet bloom filter implementation
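Example (a minimal sketch; df and the "id" column are illustrative):
  import org.apache.spark.sql.functions.col
  // build a bucketed bloom filter over the id column, then wrap it for lookups
  val model: BloomModel = bloomFrom(df, bloomOn = col("id"), expectedSize = 1000000)
  val lookup: BloomLookup = bloomLookup(model)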
-
def
equals(arg0: Any): Boolean
- Definition Classes
- Any
-
def
expressionRunner(ruleSuite: RuleSuite, name: String = "expressionResults", renderOptions: Map[String, String] = Map.empty): Column
- Definition Classes
- ExpressionRunnerImports
-
def
foldAndReplaceFields(rules: RuleSuite, fields: Seq[String], foldFieldName: String = "foldedFields", debugMode: Boolean = false, tempFoldDebugName: String = "tempFOLDDEBUG", maintainOrder: Boolean = true): (DataFrame) ⇒ DataFrame
Leverages the foldRunner to replace fields; the input fields are used to create a structure that the rules fold over. The fields are then dropped from the original DataFrame and added back from the resulting structure.
NOTE: The field order and types of the original DF will be maintained only when maintainOrder is true. As it requires access to the schema it may incur extra work.
- debugMode
when true the last results are taken for the replaced fields
- maintainOrder
when true the schema is used to replace fields in the correct location, when false they are simply appended
- Definition Classes
- AddDataFunctionsImports
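Example (a minimal sketch; the column names "a" and "b" are illustrative):
  // folds the rules over a struct of columns a and b and writes the results back in place
  val folded = df.transform(foldAndReplaceFields(ruleSuite, Seq("a", "b")))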
-
def
foldAndReplaceFieldsWithStruct(rules: RuleSuite, struct: StructType, foldFieldName: String = "foldedFields", debugMode: Boolean = false, tempFoldDebugName: String = "tempFOLDDEBUG", maintainOrder: Boolean = true): (DataFrame) ⇒ DataFrame
Leverages the foldRunner to replace fields; the input fields are used to create a structure that the rules fold over. The fields are then dropped from the original DataFrame and added back from the resulting structure.
This version should only be used when you require select(*, foldRunner) to be used; it requires you to fully specify types.
NOTE: The field order and types of the original DF will be maintained only when maintainOrder is true. As it requires access to the schema it may incur extra work.
- struct
The fields, and types, are used to call the foldRunner. These types must match those of the input fields
- debugMode
when true the last results are taken for the replaced fields
- maintainOrder
when true the schema is used to replace fields in the correct location, when false they are simply appended
- Definition Classes
- AddDataFunctionsImports
-
def
getBlooms(ruleSuite: RuleSuite): Seq[String]
Identifies bloom ids before (or after) resolving for a given ruleSuite; use it to know which bloom filters need to be loaded
- ruleSuite
a ruleSuite full of expressions to check
- returns
The bloom ids used; for unresolved expression trees this may contain blooms which are not present in the bloom map
- Definition Classes
- BloomFilterLookupImports
-
def
getConfig(name: String, default: String = ""): String
First attempts to read the system environment, then the Java system property, then the SQLConf
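Example (a sketch; the key and default are hypothetical):
  // resolution order: environment, then Java system property, then SQLConf, falling back to the default
  val location = getConfig("quality.bloom.dir", default = "/tmp/blooms")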
-
def
hashCode(): Int
- Definition Classes
- Any
-
def
identifyLookups(ruleSuite: RuleSuite): LookupResults
Use this function to identify which maps / blooms etc. are used by a given ruleSuite. It collects all rules that use lookup functions without constant expressions, and the list of lookups that are constants.
- Definition Classes
- LookupIdFunctionsImports
-
def
integrateLambdas(ruleSuiteMap: RuleSuiteMap, lambdas: Map[Id, Seq[LambdaFunction]], globalLibrary: Option[Id] = None): RuleSuiteMap
Add any of the Lambdas loaded for the rule suites
- globalLibrary
- all lambdas with this RuleSuite Id will be added to all RuleSuites
- Definition Classes
- SerializingImports
-
def
integrateMetaRuleSets(dataFrame: DataFrame, ruleSuiteMap: RuleSuiteMap, metaRuleSetMap: Map[Id, Seq[MetaRuleSetRow]], stablePosition: (String) ⇒ Int, transform: (DataFrame) ⇒ DataFrame = identity): RuleSuiteMap
Integrates meta rulesets into the rulesuites. Note this only works for a specific dataset; if rulesuites should be filtered for a given dataset then this must take place before calling.
- dataFrame
the dataframe to identify columns for metarules
- ruleSuiteMap
the ruleSuites relevant for this dataframe
- metaRuleSetMap
the meta rulesets
- stablePosition
this function must maintain the law that each column name within a RuleSet always generates the same position
- Definition Classes
- SerializingImports
-
def
integrateOutputExpressions(ruleSuiteMap: RuleSuiteMap, outputs: Map[Id, Seq[OutputExpressionRow]], globalLibrary: Option[Id] = None): (RuleSuiteMap, Map[Id, Set[Rule]])
Returns an integrated ruleSuiteMap with a set of RuleSuite Id -> Rule mappings where the OutputExpression didn't exist.
Users should check if their RuleSuite is in the "error" map.
- Definition Classes
- SerializingImports
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
loadBloomConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column, bigBloom: Column, value: Column, numberOfElements: Column, expectedFPP: Column): (Seq[BloomConfig], Set[String])
Loads bloom configurations from a given DataFrame. Wherever token is present loader will be called and the filter optionally applied.
- returns
A tuple of BloomConfigs and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- BloomFilterLookupImports
-
def
loadBloomConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column, bigBloom: Column, value: Column, numberOfElements: Column, expectedFPP: Column): (Seq[BloomConfig], Set[String])
Loads bloom configurations from a given DataFrame for ruleSuiteId. Wherever token is present loader will be called and the filter optionally applied.
- returns
A tuple of BloomConfigs and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- BloomFilterLookupImports
-
def
loadBlooms(configs: Seq[BloomConfig]): quality.impl.bloom.BloomExpressionLookup.BloomFilterMap
Loads bloom maps ready to register from configuration
- Definition Classes
- BloomFilterLookupImports
-
def
loadMapConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column, key: Column, value: Column): (Seq[MapConfig], Set[String])
Loads map configurations from a given DataFrame. Wherever token is present loader will be called and the filter optionally applied.
- returns
A tuple of MapConfig's and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- MapLookupImports
-
def
loadMapConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column, key: Column, value: Column): (Seq[MapConfig], Set[String])
Loads map configurations from a given DataFrame for ruleSuiteId. Wherever token is present loader will be called and the filter optionally applied.
- returns
A tuple of MapConfig's and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- MapLookupImports
-
def
loadMaps(configs: Seq[MapConfig]): MapLookups
- Definition Classes
- MapLookupImports
-
def
loadViewConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column): (Seq[ViewConfig], Set[String])
Loads view configurations from a given DataFrame. Wherever token is present loader will be called and the filter optionally applied.
- returns
A tuple of ViewConfig's and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- ViewLoading
-
def
loadViewConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column): (Seq[ViewConfig], Set[String])
Loads view configurations from a given DataFrame for ruleSuiteId. Wherever token is present loader will be called and the filter optionally applied.
- returns
A tuple of ViewConfig's and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- ViewLoading
-
def
loadViews(viewConfigs: Seq[ViewConfig]): ViewLoadResults
Attempts to load all the views present in the config. If a view is already registered under that name it will be replaced.
- returns
the names of views which have been replaced
- Definition Classes
- ViewLoading
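Example (a sketch; loader is an assumed DataFrameLoader implementation and the column names on viewDF are illustrative):
  import org.apache.spark.sql.functions.col
  // viewDF is assumed to hold one row per view with name/token/filter/sql columns
  val (viewConfigs, unexpected) = loadViewConfigs(loader, viewDF,
    name = col("name"), token = col("token"), filter = col("filter"), sql = col("sql"))
  val viewResults = loadViews(viewConfigs)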
-
def
mapLookupsFromDFs(creators: Map[String, MapCreator]): MapLookups
Loads maps to broadcast, each individual dataframe may have different associated expressions
- creators
a map of string id to MapCreator
- returns
a map of id to broadcast variables needed for exact lookup and mapping checks
- Definition Classes
- MapLookupImports
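Example (a sketch; countryDf and its columns are illustrative):
  import org.apache.spark.sql.functions.col
  // each MapCreator returns the lookup DataFrame plus its key and value columns
  val lookups: MapLookups = mapLookupsFromDFs(Map(
    "countryName" -> (() => (countryDf, col("iso2"), col("name")))
  ))
  registerMapLookupsAndFunction(lookups)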
-
def
namesFromSchema(schema: StructType): Set[String]
Retrieves the names from the schema, with parent fields joined using '.' notation for any nested fields
- Definition Classes
- LookupIdFunctionsImports
-
def
optimalNumOfBits(n: Long, p: Double): Int
Calculate optimal size according to the number of distinct values and false positive probability.
- n
The number of distinct values.
- p
The false positive probability.
- returns
optimal number of bits of given n and p.
- Definition Classes
- BlockSplitBloomFilterImports
-
def
optimalNumberOfBuckets(n: Long, p: Double): Long
- Definition Classes
- BlockSplitBloomFilterImports
-
def
processCoalesceIfAttributeMissing(expression: Expression, names: Set[String]): Expression
- Definition Classes
- ProcessDisableIfMissingImports
-
def
processIfAttributeMissing(ruleSuite: RuleSuite, schema: StructType = StructType(Seq())): RuleSuite
Processes a given RuleSuite to replace any coalesceIfMissingAttributes. This may be called before validate / docs but *must* be called *before* adding the expression to a dataframe.
- schema
The names to validate against, if empty no attempt to process coalesceIfAttributeMissing will be made
- Definition Classes
- ProcessDisableIfMissingImports
-
def
readLambdasFromDF(lambdaFunctionDF: DataFrame, lambdaFunctionName: Column, lambdaFunctionExpression: Column, lambdaFunctionId: Column, lambdaFunctionVersion: Column, lambdaFunctionRuleSuiteId: Column, lambdaFunctionRuleSuiteVersion: Column): Map[Id, Seq[LambdaFunction]]
Loads lambda functions
- Definition Classes
- SerializingImports
-
def
readMetaRuleSetsFromDF(metaRuleSetDF: DataFrame, columnSelectionExpression: Column, lambdaFunctionExpression: Column, metaRuleSetId: Column, metaRuleSetVersion: Column, metaRuleSuiteId: Column, metaRuleSuiteVersion: Column): Map[Id, Seq[MetaRuleSetRow]]
- Definition Classes
- SerializingImports
-
def
readOutputExpressionsFromDF(outputExpressionDF: DataFrame, outputExpression: Column, outputExpressionId: Column, outputExpressionVersion: Column, outputExpressionRuleSuiteId: Column, outputExpressionRuleSuiteVersion: Column): Map[Id, Seq[OutputExpressionRow]]
Loads output expressions
- Definition Classes
- SerializingImports
-
def
readRulesFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, ruleEngineSalience: Column, ruleEngineId: Column, ruleEngineVersion: Column): RuleSuiteMap
Loads a RuleSuite from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
-
def
readRulesFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column): RuleSuiteMap
Loads a RuleSuite from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
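Example (a minimal sketch; rulesDf and its column names are illustrative):
  import org.apache.spark.sql.functions.col
  // rulesDf is assumed to carry the integer id/version columns and the rule expression text
  val ruleSuiteMap = readRulesFromDF(rulesDf,
    ruleSuiteId = col("ruleSuiteId"), ruleSuiteVersion = col("ruleSuiteVersion"),
    ruleSetId = col("ruleSetId"), ruleSetVersion = col("ruleSetVersion"),
    ruleId = col("ruleId"), ruleVersion = col("ruleVersion"),
    ruleExpr = col("ruleExpr"))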
-
def
registerBloomMapAndFunction(bloomFilterMap: Broadcast[quality.impl.bloom.BloomExpressionLookup.BloomFilterMap]): Unit
Registers this bloom map and associates the probabilityIn sql expression against it
- Definition Classes
- BloomFilterRegistration
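Example (a sketch; model is a previously created BloomModel and the exported BloomFilterMap alias is assumed to match the expected broadcast type):
  // map of bloom id -> (lookup, fpp), broadcast and registered for use by the probabilityIn expression
  val blooms: BloomFilterMap = Map("ids" -> (bloomLookup(model) -> 0.01))
  registerBloomMapAndFunction(spark.sparkContext.broadcast(blooms))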
-
def
registerLambdaFunctions(functions: Seq[LambdaFunction]): Unit
This function is for use with directly loading lambda functions to use within normal spark apps.
Please be aware loading rules via ruleRunner, addOverallResultsAndDetails or addDataQuality will also load any registered rules applicable for that ruleSuite and may override other registered rules. For that reason you should not use this directly together with RuleSuites (otherwise your results will not be reproducible).
- Definition Classes
- LambdaFunctionsImports
-
def
registerMapLookupsAndFunction(mapLookups: MapLookups): Unit
- Definition Classes
- MapLookupImports
-
def
registerQualityFunctions(parseTypes: (String) ⇒ Option[DataType] = defaultParseTypes, zero: (DataType) ⇒ Option[Any] = defaultZero, add: (DataType) ⇒ Option[(Expression, Expression) ⇒ Expression] = ..., mapCompare: (DataType) ⇒ Option[(Any, Any) ⇒ Int] = ..., writer: (String) ⇒ Unit = println(_), registerFunction: (String, (Seq[Expression]) ⇒ Expression) ⇒ Unit = ...): Unit
Must be called before using any functions like Passed, Failed or Probability(X)
- parseTypes
override type parsing (e.g. DDL, defaults to defaultParseTypes / DataType.fromDDL)
- zero
override zero creation for aggExpr (defaults to defaultZero)
- add
override the "add" function for aggExpr types (defaults to defaultAdd(dataType))
- writer
override the printCode and printExpr print writing function (defaults to println)
- registerFunction
function to register the sql extensions
- Definition Classes
- RuleRunnerFunctionsImport
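Example (a minimal sketch using the defaults):
  // register the Quality SQL functions before evaluating rules that use Passed, Failed or Probability(X)
  registerQualityFunctions()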
-
def
ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: DataType, compileEvals: Boolean = true, debugMode: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, forceTriggerEval: Boolean = true): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite
- ruleSuite
The ruleSuite with runOnPassProcessors
- resultDataType
The type of the results from runOnPassProcessors - must be the same for all result types
- compileEvals
Should the rules be compiled out to interim objects - by default true for eval usage, wholeStageCodeGen will evaluate in place unless forceTriggerEval set to false
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- resolveWith
This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this.
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen
- variableFuncGroup
Defaulting to 20
- forceRunnerEval
Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)
- forceTriggerEval
Defaulting to true, passing true forces each trigger expression to be compiled (compileEvals) and used in place, false instead expands the trigger in-line giving possible performance boosts based on JIT. Most testing has however shown this not to be the case hence the default, ymmv.
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- RuleEngineRunnerImports
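Example (a minimal sketch; the STRING output type assumes the output expressions return strings):
  import org.apache.spark.sql.types.DataType
  // produce engine results of the declared DDL type in a "ruleEngine" column
  val engine = ruleEngineRunner(ruleSuite, DataType.fromDDL("STRING"))
  val out = df.withColumn("ruleEngine", engine)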
-
def
ruleEngineWithStruct(dataFrame: DataFrame, rules: RuleSuite, outputType: DataType, ruleEngineFieldName: String = "ruleEngine", alias: String = "main", debugMode: Boolean = false): DataFrame
Leverages the ruleRunner to produce a new output structure; the outputType defines the output structure generated.
This version should only be used when you require select(*, ruleRunner) to be used; it requires you to fully specify types.
- dataFrame
the input dataframe
- outputType
The fields, and types, are used to call the foldRunner. These types must match those of the input fields
- ruleEngineFieldName
The field name the results will be stored in, by default ruleEngine
- alias
sets the alias to use for dataFrame when using subqueries to resolve ambiguities, setting to an empty string (or null) will not assign an alias
- Definition Classes
- AddDataFunctionsImports
-
def
ruleEngineWithStructF(rules: RuleSuite, outputType: DataType, ruleEngineFieldName: String = "ruleEngine", alias: String = "main", debugMode: Boolean = false): (DataFrame) ⇒ DataFrame
Leverages the ruleRunner to produce a new output structure; the outputType defines the output structure generated.
This version should only be used when you require select(*, ruleRunner) to be used; it requires you to fully specify types.
- outputType
The fields, and types, are used to call the foldRunner. These types must match those of the input fields
- Definition Classes
- AddDataFunctionsImports
-
def
ruleFolderRunner(ruleSuite: RuleSuite, startingStruct: Column, compileEvals: Boolean = true, debugMode: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, useType: Option[StructType] = None, forceTriggerEval: Boolean = true): Column
Creates a column that runs the folding RuleSuite. This also forces registering the lambda functions used by that RuleSuite.
FolderRunner runs all output expressions for matching rules in order of salience; the startingStruct is passed to the first match, its result to the second, and so on. In contrast to ruleEngineRunner, OutputExpressions should be lambdas with one parameter: the structure.
- ruleSuite
The ruleSuite with runOnPassProcessors
- startingStruct
This struct is passed to the first matching rule, ideally you would use the spark dsl struct function to refer to existing columns
- compileEvals
Should the rules be compiled out to interim objects - by default true
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- resolveWith
This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this.
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen
- variableFuncGroup
Defaulting to 20
- forceRunnerEval
Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)
- useType
In the case you must use select and can't use withColumn you may provide a type directly to stop the NPE
- forceTriggerEval
Defaulting to true, passing true forces each trigger expression to be compiled (compileEvals) and used in place, false instead expands the trigger in-line giving possible performance boosts based on JIT. Most testing has however shown this not to be the case hence the default, ymmv.
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- RuleFolderRunnerImports
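Example (a minimal sketch; the column names "a" and "b" are illustrative):
  import org.apache.spark.sql.functions.struct
  // fold the matching output lambdas over a struct of the columns being rewritten
  val folder = ruleFolderRunner(ruleSuite, struct(df("a"), df("b")))
  val out = df.withColumn("folded", folder)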
-
def
ruleRunner(ruleSuite: RuleSuite, compileEvals: Boolean = true, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite
- ruleSuite
The Quality RuleSuite to evaluate
- compileEvals
Should the rules be compiled out to interim objects - by default true for eval usage, wholeStageCodeGen will evaluate in place
- resolveWith
This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this. RuleRunner does not currently do wholestagecodegen when resolveWith is used.
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup it allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen. You _shouldn't_ need it but it's there just in case.
- variableFuncGroup
Defaulting to 20
- forceRunnerEval
Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)
- returns
A Column representing the Quality DQ expression built from this ruleSuite
- Definition Classes
- RuleRunnerImports
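Example (a minimal sketch; the Column equivalent of addDataQuality):
  val out = df.withColumn("DataQuality", ruleRunner(ruleSuite))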
-
def
toDS(ruleSuite: RuleSuite): Dataset[RuleRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1
- returns
a Dataset[RuleRow] with the rules flattened out
- Definition Classes
- SerializingImports
-
def
toLambdaDS(ruleSuite: RuleSuite): Dataset[LambdaFunctionRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1
- returns
a Dataset[LambdaFunctionRow] with the lambda functions flattened out
- Definition Classes
- SerializingImports
-
def
toOutputExpressionDS(ruleSuite: RuleSuite): Dataset[OutputExpressionRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1
- returns
a Dataset[OutputExpressionRow] with the output expressions flattened out
- Definition Classes
- SerializingImports
-
def
toRuleSuiteDF(ruleSuite: RuleSuite): DataFrame
Utility function to ease dealing with simple DQ rules where the rule engine functionality is ignored. This can be paired with the default ruleEngine parameter in readRulesFromDF
- Definition Classes
- SerializingImports
-
def
toString(): String
- Definition Classes
- Any
-
def
typedExpressionRunner(ruleSuite: RuleSuite, ddlType: String, name: String = "expressionResults"): Column
Runs the ruleSuite expressions saving results as a tuple of (ruleResult: String, resultDDL: String)
- Definition Classes
- ExpressionRunnerImports
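Example (a minimal sketch; the DOUBLE type assumes the expressions yield numeric results):
  // each expression's result is cast to the given DDL type and stored with its DDL string
  val out = df.withColumn("expressionResults", typedExpressionRunner(ruleSuite, ddlType = "DOUBLE"))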
-
def
validate(schemaOrFrame: Either[StructType, DataFrame], ruleSuite: RuleSuite, showParams: ShowParams = ShowParams(), runnerFunction: Option[(DataFrame) ⇒ Column] = None, qualityName: String = "Quality", recursiveLambdasSOEIsOk: Boolean = false, transformBeforeShow: (DataFrame) ⇒ DataFrame = identity, viewLookup: (String) ⇒ Boolean = Validation.defaultViewLookup): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- schemaOrFrame
when it's a Left( StructType ) the struct will be used to test against and an emptyDataframe of this type created to resolve on the spark level. Using Right(DataFrame) will cause that dataframe to be used which is great for test cases with a runnerFunction
- showParams
- configure how the output text is formatted using the same options and formatting as dataFrame.show
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- qualityName
- the column name to store the runnerFunction results in
- recursiveLambdasSOEIsOk
- this signals that finding a recursive lambda SOE should not stop the evaluations - if true it will still try to run any runnerFunction but may not give the correct results
- transformBeforeShow
- an optional transformation function to help shape what results are pushed to show
- viewLookup
- for any subquery used looks up the view name for being present (quoted and with schema), defaults to the current spark catalogue
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
-
def
validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) ⇒ Column, transformBeforeShow: (DataFrame) ⇒ DataFrame): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
when it's a Left( StructType ) the struct will be used to test against and an emptyDataframe of this type created to resolve on the spark level. Using Right(DataFrame) will cause that dataframe to be used which is great for test cases with a runnerFunction
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- transformBeforeShow
- an optional transformation function to help shape what results are pushed to show
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
-
def
validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) ⇒ Column, transformBeforeShow: (DataFrame) ⇒ DataFrame, viewLookup: (String) ⇒ Boolean): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
when it's a Left( StructType ) the struct will be used to test against and an emptyDataframe of this type created to resolve on the spark level. Using Right(DataFrame) will cause that dataframe to be used which is great for test cases with a runnerFunction
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- transformBeforeShow
- an optional transformation function to help shape what results are pushed to show
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
-
def
validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) ⇒ Column): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
when it's a Left( StructType ) the struct will be used to test against and an emptyDataframe of this type created to resolve on the spark level. Using Right(DataFrame) will cause that dataframe to be used which is great for test cases with a runnerFunction
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
-
def
validate(frame: DataFrame, ruleSuite: RuleSuite): (Set[RuleError], Set[RuleWarning])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
when it's a Left( StructType ) the struct will be used to test against and an emptyDataframe of this type created to resolve on the spark level. Using Right(DataFrame) will cause that dataframe to be used which is great for test cases with a runnerFunction
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
-
def
validate(schema: StructType, ruleSuite: RuleSuite): (Set[RuleError], Set[RuleWarning])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- schema
which fields should the dataframe have
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
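Example (a minimal sketch; validates the rule expressions against a schema without running them):
  val (errors, warnings) = validate(df.schema, ruleSuite)
  if (errors.nonEmpty) sys.error(s"rule validation failed: $errors")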
-
def
validate_Lookup(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) ⇒ Column, viewLookup: (String) ⇒ Boolean): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
when it's a Left( StructType ) the struct will be used to test against and an emptyDataframe of this type created to resolve on the spark level. Using Right(DataFrame) will cause that dataframe to be used which is great for test cases with a runnerFunction
- runnerFunction
- allows you to create a ruleRunner or ruleEngineRunner with different configurations
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
-
def
validate_Lookup(frame: DataFrame, ruleSuite: RuleSuite, viewLookup: (String) ⇒ Boolean = Validation.defaultViewLookup): (Set[RuleError], Set[RuleWarning])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- frame
when it's a Left( StructType ) the struct will be used to test against and an emptyDataframe of this type created to resolve on the spark level. Using Right(DataFrame) will cause that dataframe to be used which is great for test cases with a runnerFunction
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
-
def
validate_Lookup(schema: StructType, ruleSuite: RuleSuite, viewLookup: (String) ⇒ Boolean): (Set[RuleError], Set[RuleWarning])
For a given dataFrame provide a full set of any validation errors for a given ruleSuite.
- schema
which fields should the dataframe have
- returns
A set of errors and the output from the dataframe when a runnerFunction is specified
- Definition Classes
- ValidationImports
- object BloomModel extends Serializable
-
object
DisabledRule extends RuleResult with Product with Serializable
This shouldn't evaluate to a fail; it allows signalling that a rule has been disabled
- object Failed extends RuleResult with Product with Serializable
- object LambdaFunction
- object Passed extends RuleResult with Product with Serializable
- object QualityException extends Serializable
- object RunOnPassProcessor
-
object
SoftFailed extends RuleResult with Product with Serializable
This shouldn't evaluate to a fail; think of it as Amber / Warn