package quality
Provides an easy import point for the library.
- By Inheritance
- quality
- CollectRunnerImports
- VariableProcessIfMissing
- VersionSpecificSerializingImports
- ExpressionRunnerImports
- ViewLoading
- RuleFolderRunnerImports
- RuleEngineRunnerImports
- LambdaFunctionsImports
- AddDataFunctionsImports
- SerializingImports
- MapLookupImportsShared
- Serializable
- RuleRunnerImports
- AnyRef
- Any
Package Members
- package classicFunctions
Collection of the Quality Spark Expressions for use in select( Column * )
- package functions
Collection of the Quality Spark Expressions for use in select( Column * )
- package impl
- package implicits
Imports the common implicits
- package simpleVersioning
A simple versioning scheme that allows management of versions
- package sparkless
Type Members
- trait BloomLookup extends AnyRef
Simple "does it contain" function to test a bloom
- case class BloomModel(rootDir: String, fpp: Double, numBuckets: Int) extends Serializable with Product
Represents the shared file location of a bucketed bloom filter. There should be files with names 0..numBuckets containing the same number of bytes representing each bucket.
- rootDir
The directory which contains each bucket
- fpp
The fpp for this bloom - note it is informational only and will not be used in further processing
- numBuckets
The number of buckets within this bloom
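Taken together, the constructor parameters describe an on-disk layout; a minimal sketch of creating a model (the import path and directory are assumptions, not taken from this listing):

```scala
import com.sparkutils.quality.BloomModel  // import path assumed

// Hypothetical shared directory expected to contain bucket files named 0..3
val bloom = BloomModel(rootDir = "/shared/blooms/accounts", fpp = 0.01, numBuckets = 4)
```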
- trait ClassicOnly extends Annotation
- Annotations
- @Retention()
- case class CombinedRuleRow(ruleRow: RuleRow, outputExpressionRow: Option[OutputExpressionRow]) extends Product with Serializable
Raw model for Variable usage
- case class CombinedRuleSuiteRows(ruleSuiteId: Int, ruleSuiteVersion: Int, ruleRows: Seq[CombinedRuleRow], lambdaFunctions: Option[Seq[LambdaFunctionRow]], probablePass: Option[Double], defaultProcessor: Option[OutputExpressionRow]) extends Product with Serializable
Raw model for Variable usage
- trait DataFrameLoader extends AnyRef
Simple interface to load DataFrames used by map/bloom and view loading
- trait DefaultProcessor extends HasRuleText with HasOutputExpression
Configuration of what should be evaluated when no trigger passes for Folder and Collector
- trait ExpressionRule extends Serializable
A trigger rule
- case class GeneralExpressionResult(result: String, resultDDL: String) extends Product with Serializable
Represents the expression results of ExpressionRunner
- result
the result casted to string
- resultDDL
the result type in ddl
- Annotations
- @SerialVersionUID()
- case class GeneralExpressionsResult[R](id: VersionedId, ruleSetResults: Map[VersionedId, Map[VersionedId, R]]) extends Serializable with Product
Represents the results of the ExpressionRunner
- Annotations
- @SerialVersionUID()
- case class GeneralExpressionsResultNoDDL(id: VersionedId, ruleSetResults: Map[VersionedId, Map[VersionedId, String]]) extends Serializable with Product
Represents the results of the ExpressionRunner after calling strip_result_ddl
- Annotations
- @SerialVersionUID()
- trait HasOutputExpression extends Serializable
- trait HasRuleText extends Serializable
- case class Id(id: Int, version: Int) extends VersionedId with Product with Serializable
A versioned rule ID - note the name is never persisted in results, the id and version are sufficient to retrieve the name
- id
a unique ID to identify this rule
- version
the version of the rule - again tied to ID
- Annotations
- @SerialVersionUID()
- type IdTriple = (Id, Id, Id)
- Definition Classes
- LambdaFunctionsImports
- trait LambdaFunction extends HasRuleText
A user defined SQL function
- case class LambdaFunctionRow(name: String, ruleExpr: String, functionId: Int, functionVersion: Int, ruleSuiteId: Int, ruleSuiteVersion: Int) extends Product with Serializable
- case class LazyRuleEngineResult[T](lazyRuleSuiteResults: LazyRuleSuiteResult, salientRule: Option[SalientRule], result: Option[T]) extends Serializable with Product
Results for all rules run against a DataFrame, the RuleSuiteResult is lazily evaluated. Note in debug mode the type of T must be Seq[(Int, ActualType)]
- lazyRuleSuiteResults
Overall results from applying the engine, evaluated lazily
- salientRule
None when no rule matched for this row, or when in Debug mode (which returns all results)
- result
The result type for this row; None if no rule matched, and also None if a rule matched but the OutputExpression returned null
- Annotations
- @SerialVersionUID()
- case class LazyRuleFolderResult[T](lazyRuleSuiteResults: LazyRuleSuiteResult, result: Option[T]) extends Serializable with Product
Results for all rules run against a DataFrame, the RuleSuiteResult is lazily evaluated. Note in debug mode the type of T must be Seq[(Int, ActualType)]
- lazyRuleSuiteResults
Overall results from applying the engine, evaluated lazily
- result
The result type for this row; None if no rule matched, and also None if a rule matched but the OutputExpression returned null
- Annotations
- @SerialVersionUID()
- trait LazyRuleSuiteResult extends Serializable
A lazy proxy for RuleSuiteResult
- trait LazyRuleSuiteResultDetails extends Serializable
A lazy proxy for RuleSuiteResultDetails
- Annotations
- @SerialVersionUID()
- type MapCreator = () => (DataFrame, Column, Column)
- Definition Classes
- MapLookupImportsShared
- type MapLookups = String
Used as a param to load the map lookups - note the type of the broadcast is always Map[AnyRef, AnyRef]
- Definition Classes
- MapLookupImportsShared
- case class MetaRuleSetRow(ruleSuiteId: Int, ruleSuiteVersion: Int, ruleSetId: Int, ruleSetVersion: Int, columnFilter: String, ruleExpr: String) extends Product with Serializable
Only one arg is supported (without brackets etc.).
Law: each Rule generated must have a stable Id for the same column; the version used is the same as the RuleSet's.
The caller of withNameAndOrd must enforce this law to have stable and correctly evolving rules.
- trait OutputExpression extends AnyRef
Used as a result of serializing
- case class OutputExpressionRow(ruleExpr: String, functionId: Int, functionVersion: Int, ruleSuiteId: Int, ruleSuiteVersion: Int) extends Product with Serializable
- case class Probability(percentage: Double) extends RuleResult with Product with Serializable
0-1 with 1 being absolutely likely a pass
- Annotations
- @SerialVersionUID()
- case class QualityException(msg: String, cause: Throwable = null) extends RuntimeException with Product with Serializable
Simple marker instead of sys.error
- Annotations
- @SerialVersionUID()
- trait ResultStatistics[T <: ResultStatistics[_]] extends Serializable
- sealed trait ResultStatisticsProvider[T, R] extends Serializable
- sealed trait ResultStatisticsProviderImpl[T <: ResultStatistics[_]] extends Serializable
- case class Rule(id: Id, expression: ExpressionRule, runOnPassProcessor: RunOnPassProcessor = NoOpRunOnPassProcessor.noOp) extends Serializable with Product
A rule to run over a row
- Annotations
- @SerialVersionUID()
- case class RuleEngineResult[T](ruleSuiteResults: RuleSuiteResult, salientRule: Option[SalientRule], result: Option[T]) extends Serializable with Product
Results for all rules run against a DataFrame. Note in debug mode the type of T must be Seq[(Int, ActualType)]
- ruleSuiteResults
Overall results from applying the engine
- salientRule
None when no rule matched for this row, or when in Debug mode (which returns all results)
- result
The result type for this row; None if no rule matched, and also None if a rule matched but the OutputExpression returned null
- Annotations
- @SerialVersionUID()
- case class RuleFolderResult[T](ruleSuiteResults: RuleSuiteResult, result: Option[T]) extends Serializable with Product
Results for all rules run against a DataFrame. Note in debug mode the type of T must be Seq[(Int, ActualType)]
- ruleSuiteResults
Overall results from applying the engine
- result
The result type for this row; None if no rule matched, and also None if a rule matched but the OutputExpression returned null
- Annotations
- @SerialVersionUID()
- sealed trait RuleResult extends Serializable
- Annotations
- @SerialVersionUID()
- case class RuleResultRow(ruleSuiteId: Int, ruleSuiteVersion: Int, ruleSuiteResult: Int, ruleSetResult: Int, ruleSetId: Int, ruleSetVersion: Int, ruleId: Int, ruleVersion: Int, ruleResult: Int) extends Product with Serializable
Flattened results for aggregation / display / use by the explodeResults expression
- case class RuleResultWithProcessor(ruleResult: RuleResult, runOnPassProcessor: RunOnPassProcessor) extends RuleResult with Product with Serializable
Packs a rule result with a RunOnPassProcessor processor
- Annotations
- @SerialVersionUID()
- case class RuleRow(ruleSuiteId: Int, ruleSuiteVersion: Int, ruleSetId: Int, ruleSetVersion: Int, ruleId: Int, ruleVersion: Int, ruleExpr: String, ruleEngineSalience: Int = Int.MaxValue, ruleEngineId: Int = 0, ruleEngineVersion: Int = 0) extends Product with Serializable
- case class RuleSet(id: Id, rules: Seq[Rule]) extends Serializable with Product
- Annotations
- @SerialVersionUID()
- case class RuleSetResult(overallResult: RuleResult, ruleResults: Map[VersionedId, RuleResult]) extends Serializable with Product
Result collection for a number of rules
- ruleResults
rule id -> ruleresult
- Annotations
- @SerialVersionUID()
- case class RuleSetStatistics(ruleSet: VersionedId, failed: Long = 0, passed: Long = 0, softFailed: Long = 0, disabled: Long = 0, ignored: Long = 0, defaulted: Long = 0, probabilityPassed: Long = 0, probabilityFailed: Long = 0, rules: Map[VersionedId, RuleStatistics] = Map.empty) extends ResultStatistics[RuleSetStatistics] with Product with Serializable
Aggregated RuleStatistics for a RuleSet
- Annotations
- @SerialVersionUID()
- case class RuleStatistics(rule: VersionedId, failed: Long = 0, passed: Long = 0, softFailed: Long = 0, disabled: Long = 0, ignored: Long = 0, defaulted: Long = 0, probabilityPassed: Long = 0, probabilityFailed: Long = 0) extends ResultStatistics[RuleStatistics] with Product with Serializable
Rule level results aggregated across a set of rows
- Annotations
- @SerialVersionUID()
- case class RuleSuite(id: Id, ruleSets: Seq[RuleSet], lambdaFunctions: Seq[LambdaFunction] = Seq.empty, probablePass: Double = defaultProbablePass, defaultProcessor: DefaultProcessor = NoOpDefaultProcessor.noOp) extends Serializable with Product
Represents a versioned collection of RuleSets
- probablePass
override to specify a different percentage for treating probability results as passes - defaults to 80% (0.8)
- defaultProcessor
when using folder or collector this output expression will be used when no rules are triggered
- Annotations
- @SerialVersionUID()
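The Id/Rule/RuleSet/RuleSuite hierarchy above can be sketched as follows; this is an illustrative assumption only, in particular an ExpressionRule factory taking SQL text is assumed since only the trait appears in this listing:

```scala
import com.sparkutils.quality._  // import root assumed

// A suite with one RuleSet containing one trigger rule, each level versioned via Id
val suite = RuleSuite(
  id = Id(1, 1),
  ruleSets = Seq(
    RuleSet(Id(10, 1), Seq(
      Rule(Id(100, 1), ExpressionRule("amount > 0"))  // ExpressionRule(String) is assumed here
    ))
  )
)
```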
- case class RuleSuiteGroup(ruleSuites: Map[VersionedId, RuleSuite] = Map.empty) extends Product with Serializable
Represents a grouping of RuleSuites
- Annotations
- @SerialVersionUID()
- case class RuleSuiteGroupResults(ruleSuiteResults: Map[VersionedId, RuleSuiteResult] = Map.empty) extends Product with Serializable
Represents a grouping of RuleSuiteResults, from a call to group_results (Spark 4 only). Given the results may be combined for audit trail reasons from both engines and dq results no overallResult equivalent is provided. Please use overall_dq and overall_engine functions to process the ruleSuiteResults, filtering the RuleSuite Ids as needed.
- Annotations
- @SerialVersionUID()
- case class RuleSuiteGroupStatistics(ruleSuites: Map[VersionedId, RuleSuiteStatistics] = Map.empty, rowCount: Long = 0) extends Serializable with Product
Convenience as a dataset may contain more than one rule suite
- Annotations
- @SerialVersionUID()
- case class RuleSuiteResult(id: VersionedId, overallResult: RuleResult, ruleSetResults: Map[VersionedId, RuleSetResult]) extends Serializable with Product
Results for all rules run against a dataframe
- id
the Id of the suite; all other content is mapped
- Annotations
- @SerialVersionUID()
- case class RuleSuiteResultDetails(id: VersionedId, ruleSetResults: Map[VersionedId, RuleSetResult]) extends Serializable with Product
Results for all rules run against a dataframe without the overallResult. Performance differences for filtering on top level fields are significant over nested structures even under Spark 3, in the region of 30-50% depending on op.
- Annotations
- @SerialVersionUID()
- case class RuleSuiteRow(ruleSuiteId: Int, ruleSuiteVersion: Int, probablePass: Double, ruleEngineId: Int, ruleEngineVersion: Int) extends Product with Serializable
Only needed for probability and defaultProcessor support with the optional integrateRuleSuites
- case class RuleSuiteStatistics(ruleSuite: VersionedId, failed: Long = 0, passed: Long = 0, softFailed: Long = 0, disabled: Long = 0, ignored: Long = 0, defaulted: Long = 0, probabilityPassed: Long = 0, probabilityFailed: Long = 0, rowCount: Long = 0, ruleSets: Map[VersionedId, RuleSetStatistics] = Map.empty) extends ResultStatistics[RuleSuiteStatistics] with Product with Serializable
Aggregated RuleSetStatistics for a complete RuleSuite
- Annotations
- @SerialVersionUID()
- trait RunOnPassProcessor extends HasRuleText with HasOutputExpression
Configuration of what should be evaluated when a trigger passes and salience matches
- case class SalientRule(ruleSuiteId: VersionedId, ruleSetId: VersionedId, ruleId: VersionedId) extends Product with Serializable
Represents the rule that matched a given RuleEngine result
- Annotations
- @SerialVersionUID()
- case class SimpleField(name: String, dataType: String, nullable: Boolean) extends Product with Serializable
Used to filter columns for meta RuleSets
- case class SparkBloomFilter(bloom: BloomFilter) extends BloomLookup with Product with Serializable
- sealed trait VersionedId extends Serializable
Base for storage of rule or ruleset ids; must be a trait to force frameless to use lookup and stop any accidental auto product treatment
Concrete Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- Any
- final def ##: Int
- Definition Classes
- Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- Any
- val DefaultRuleInt: Int
The integer value for RuleSuiteResult.overallResult when no trigger rules have run and the default rule was used
- Definition Classes
- RuleRunnerImports
- val DefaultRuleSalience: Int
When Folder is configured with a default rule and debug mode is enabled, this salience is returned
- Definition Classes
- RuleRunnerImports
- val DisabledRuleInt: Int
The integer value for disabled dq rules
- Definition Classes
- RuleRunnerImports
- val FailedInt: Int
The integer value for failed dq rules
- Definition Classes
- RuleRunnerImports
- val IgnoredRuleInt: Int
The integer value for ignored dq rules
- Definition Classes
- RuleRunnerImports
- val PassedInt: Int
The integer value for passed dq rules
- Definition Classes
- RuleRunnerImports
- val SoftFailedInt: Int
The integer value for soft failed dq rules
- Definition Classes
- RuleRunnerImports
- def addDataQuality(dataFrame: DataFrame, rules: RuleSuite, name: String = "DataQuality"): DataFrame
Adds a DataQuality field using the RuleSuite and RuleSuiteResult structure
- Definition Classes
- AddDataFunctionsImports
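A usage sketch, assuming a DataFrame df and a RuleSuite ruleSuite are already in scope:

```scala
// Adds a "DataQuality" column holding the RuleSuiteResult structure
val withDQ = addDataQuality(df, ruleSuite)
withDQ.select("DataQuality.overallResult").show()  // nested-field selection shown for illustration
```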
- def addDataQualityF(rules: RuleSuite, name: String = "DataQuality"): (DataFrame) => DataFrame
Adds a DataQuality field using the RuleSuite and RuleSuiteResult structure for use with dataset.transform functions
- Definition Classes
- AddDataFunctionsImports
- def addExpressionRunner(dataFrame: DataFrame, ruleSuite: RuleSuite, name: String = "expressionResults", renderOptions: Map[String, String] = Map.empty, ddlType: String = "", stripDDL: Boolean = false, postProcess: (DataFrame) => DataFrame = identity): DataFrame
Runs the ruleSuite expressions, saving results as a tuple of (ruleResult: yaml, resultType: String). Supplying a ddlType makes the output type of the expression that DDL type, rather than using yaml conversion.
- name
the default column name "expressionResults"
- renderOptions
provides rendering options to the underlying snake yaml implementation
- ddlType
optional DDL string, when present yaml output is disabled and the output expressions must all have the same type
- stripDDL
the default of false leaves the DDL present, true strips ddl from the output using strip_result_ddl
- Definition Classes
- AddDataFunctionsImports
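A hedged sketch of the ddlType option described above (df and ruleSuite are assumed in scope, and DOUBLE is an illustrative type):

```scala
// With ddlType set, yaml conversion is disabled and every output expression must share this type
val withExprs = addExpressionRunner(df, ruleSuite, ddlType = "DOUBLE")
```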
- def addExpressionRunnerF[P[R] >: Dataset[R]](ruleSuite: RuleSuite, name: String = "expressionResults", renderOptions: Map[String, String] = Map.empty, ddlType: String = "", stripDDL: Boolean = false, postProcess: (DataFrame) => DataFrame = identity): (P[Row]) => P[Row]
Runs the ruleSuite expressions, saving results as a tuple of (ruleResult: yaml, resultType: String). Supplying a ddlType makes the output type of the expression that DDL type, rather than using yaml conversion.
- name
the default column name "expressionResults"
- renderOptions
provides rendering options to the underlying snake yaml implementation
- ddlType
optional DDL string, when present yaml output is disabled and the output expressions must all have the same type
- stripDDL
the default of false leaves the DDL present, true strips ddl from the output using strip_result_ddl
- Definition Classes
- AddDataFunctionsImports
- def addOverallResultsAndDetails(dataFrame: DataFrame, rules: RuleSuite, overallResult: String = "DQ_overallResult", resultDetails: String = "DQ_Details"): DataFrame
Adds two columns, one for overallResult and the other the details, allowing 30-50% performance gains for simple filters
- Definition Classes
- AddDataFunctionsImports
- def addOverallResultsAndDetailsF(rules: RuleSuite, overallResult: String = "DQ_overallResult", resultDetails: String = "DQ_Details"): (DataFrame) => DataFrame
Adds two columns, one for overallResult and the other the details, allowing 30-50% performance gains for simple filters, for use in dataset.transform functions
- Definition Classes
- AddDataFunctionsImports
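Being curried, the F variants compose with Dataset.transform; a sketch assuming df and rules are in scope:

```scala
import org.apache.spark.sql.DataFrame

// Filtering on the flat DQ_overallResult column avoids descending into the nested structure,
// which is where the stated 30-50% gains come from
val withDQ: DataFrame = df.transform(addOverallResultsAndDetailsF(rules))
```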
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def collectRunner(ruleSuite: RuleSuite, resultDataType: Option[DataType] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, flatten: Boolean = true, includeNulls: Boolean = false, useInPlaceArray: Boolean = true, unrollInPlaceArray: Boolean = false, unrollOutputArraySize: Int = 1): Column
Creates a column that runs the folding RuleSuite. This also forces registering the lambda functions used by that RuleSuite.
FolderRunner runs all output expressions for matching rules in order of salience: the startingStruct is passed to the first match, its result to the second, and so on. In contrast to ruleEngineRunner, OutputExpressions should be lambdas with one parameter: the structure.
- ruleSuite
The ruleSuite with runOnPassProcessors
- resultDataType
the collected result type, specify this if the derived types have nullability or ordering issues. By default, it takes the type of the last output expression
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup it allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen
- variableFuncGroup
Defaulting to 20
- flatten
when resultType is an ArrayType should the result be flattened
- includeNulls
should nulls returned by the output expressions be included; note that when flattening, nulls in the returned arrays are not filtered
- useInPlaceArray
defaulting to true, this replaces the array handling approach for 'array' Output Expressions
- unrollInPlaceArray
defaulting to false, when true, unrolls array copying for InPlaceArray usage, it _could_ be faster in some circumstances but in most cases trust the JVM's JIT
- unrollOutputArraySize
defaulting to 1 and only used when unrollInPlaceArray is true. When higher, decides how many operations should be unrolled for array copies per loop, setting to a higher number than all OutputExpression array sizes would force all array copying to be inlined. It _could_ be faster in some circumstances but in most cases trust the JVM's JIT to unroll the loops.
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- CollectRunnerImports
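Per the functions packages above, runner columns are used inside select; a sketch assuming df and ruleSuite are in scope:

```scala
import org.apache.spark.sql.functions.col

// Collects matching output expression results into a single column, flattened by default
val collected = df.select(col("*"), collectRunner(ruleSuite).as("collected"))
```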
- def combine(ruleRows: Dataset[RuleRow], lambdaFunctionRows: Dataset[LambdaFunctionRow], outputExpressionRows: Dataset[OutputExpressionRow], globalLambdaSuites: Dataset[Id], ruleSuites: Dataset[RuleSuiteRow])(implicit encoder: Encoder[CombinedRuleSuiteRows]): Dataset[CombinedRuleSuiteRows]
Combines ruleRows, lambdaFunctionRows and outputExpressionRows into CombinedRuleSuiteRows
- Definition Classes
- VersionSpecificSerializingImports
- def combine(ruleRows: Dataset[RuleRow], lambdaFunctionRows: Dataset[LambdaFunctionRow], outputExpressionRows: Dataset[OutputExpressionRow], globalLambdaSuites: Dataset[Id], globalOutputExpressionSuites: Dataset[Id], ruleSuites: Dataset[RuleSuiteRow])(implicit encoder: Encoder[CombinedRuleSuiteRows]): Dataset[CombinedRuleSuiteRows]
Combines ruleRows, lambdaFunctionRows and outputExpressionRows into CombinedRuleSuiteRows
- Definition Classes
- VersionSpecificSerializingImports
- def combine(ruleRows: Dataset[RuleRow], lambdaFunctionRows: Dataset[LambdaFunctionRow], outputExpressionRows: Dataset[OutputExpressionRow], ruleSuites: Dataset[RuleSuiteRow])(implicit encoder: Encoder[CombinedRuleSuiteRows]): Dataset[CombinedRuleSuiteRows]
Combines ruleRows, lambdaFunctionRows and outputExpressionRows into CombinedRuleSuiteRows, using probablePass
- Definition Classes
- VersionSpecificSerializingImports
- def combine(ruleRows: Dataset[RuleRow])(implicit encoder: Encoder[CombinedRuleSuiteRows]): Dataset[CombinedRuleSuiteRows]
Uses ruleRows to make CombinedRuleSuiteRows
- Definition Classes
- VersionSpecificSerializingImports
- def combine(ruleRows: Dataset[RuleRow], lambdaFunctionRows: Dataset[LambdaFunctionRow])(implicit encoder: Encoder[CombinedRuleSuiteRows]): Dataset[CombinedRuleSuiteRows]
Combines ruleRows and lambdaFunctionRows into CombinedRuleSuiteRows
- Definition Classes
- VersionSpecificSerializingImports
- def combine(ruleRows: Dataset[RuleRow], lambdaFunctionRows: Dataset[LambdaFunctionRow], outputExpressionRows: Dataset[OutputExpressionRow])(implicit encoder: Encoder[CombinedRuleSuiteRows]): Dataset[CombinedRuleSuiteRows]
Combines ruleRows, lambdaFunctionRows and outputExpressionRows into CombinedRuleSuiteRows
- Definition Classes
- VersionSpecificSerializingImports
- def combine(ruleRows: Dataset[RuleRow], lambdaFunctionRows: Option[Dataset[LambdaFunctionRow]] = None, outputExpressionRows: Option[Dataset[OutputExpressionRow]] = None, globalLambdaSuites: Option[Dataset[Id]] = None, globalOutputExpressionSuites: Option[Dataset[Id]] = None, ruleSuites: Option[Dataset[RuleSuiteRow]] = None)(implicit encoder: Encoder[CombinedRuleSuiteRows]): Dataset[CombinedRuleSuiteRows]
Combines ruleRows, lambdaFunctionRows and outputExpressionRows into CombinedRuleSuiteRows
- Definition Classes
- VersionSpecificSerializingImports
- def combined_rows(ruleSuite: RuleSuite): Dataset[CombinedRuleSuiteRows]
- Definition Classes
- VersionSpecificSerializingImports
- def dqRuleRunner(ruleSuite: RuleSuite, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20): Column
Creates a column that runs the RuleSuite suitable for DQ / Validation. This also forces registering the lambda functions used by that RuleSuite.
- ruleSuite
The Quality RuleSuite to evaluate
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup it allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen. You _shouldn't_ need it but it's there just in case.
- variableFuncGroup
Defaulting to 20
- returns
A Column representing the Quality DQ expression built from this ruleSuite
- Definition Classes
- RuleRunnerImports
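A usage sketch, assuming a DataFrame df and a RuleSuite ruleSuite are in scope:

```scala
// Evaluates the DQ RuleSuite per row; the result column holds a RuleSuiteResult structure
val withDQ = df.withColumn("DataQuality", dqRuleRunner(ruleSuite))
```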
- def equals(arg0: Any): Boolean
- Definition Classes
- Any
- def expressionRunner(ruleSuite: RuleSuite, name: String = "expressionResults", renderOptions: Map[String, String] = Map.empty): Column
- Definition Classes
- ExpressionRunnerImports
- def foldAndReplaceFieldPairs[P[R] >: Dataset[R]](rules: RuleSuite, fields: Seq[(String, Column)], foldFieldName: String = "foldedFields", debugMode: Boolean = false, tempFoldDebugName: String = "tempFOLDDEBUG", maintainOrder: Boolean = true, alias: String = "main"): (P[Row]) => P[Row]
Leverages the foldRunner to replace fields; the input field pairs are used to create a structure that the rules fold over. Any field references in the provided pairs are then dropped from the original Dataframe and added back from the resulting structure.
NOTE: The field order and types of the original DF will be maintained only when maintainOrder is true. As it requires access to the schema it may incur extra work.
- debugMode
when true the last results are taken for the replaced fields
- maintainOrder
when true the schema is used to replace fields in the correct location, when false they are simply appended
- Definition Classes
- AddDataFunctionsImports
- def foldAndReplaceFieldPairsWithStruct[P[R] >: Dataset[R]](rules: RuleSuite, fields: Seq[(String, Column)], struct: StructType, foldFieldName: String = "foldedFields", debugMode: Boolean = false, tempFoldDebugName: String = "tempFOLDDEBUG", maintainOrder: Boolean = true, alias: String = "main"): (P[Row]) => P[Row]
Leverages the foldRunner to replace fields; the input field pairs are used to create a structure that the rules fold over, the type must match the provided struct. Any field references in the provided pairs are then dropped from the original Dataframe and added back from the resulting structure.
This version should only be used when you require select(*, foldRunner) to be used, it requires you fully specify types.
NOTE: The field order and types of the original DF will be maintained only when maintainOrder is true. As it requires access to the schema it may incur extra work.
- struct
The fields, and types, are used to call the foldRunner. These types must match in the input fields
- debugMode
when true the last results are taken for the replaced fields
- maintainOrder
when true the schema is used to replace fields in the correct location, when false they are simply appended
- Definition Classes
- AddDataFunctionsImports
- def foldAndReplaceFields[P[R] >: Dataset[R]](rules: RuleSuite, fields: Seq[String], foldFieldName: String = "foldedFields", debugMode: Boolean = false, tempFoldDebugName: String = "tempFOLDDEBUG", maintainOrder: Boolean = true, alias: String = "main"): (P[Row]) => P[Row]
Leverages the foldRunner to replace fields, the input fields are used to create a structure that the rules fold over. The fields are then dropped from the original Dataframe and added back from the resulting structure.
NOTE: The field order and types of the original DF will be maintained only when maintainOrder is true. As it requires access to the schema it may incur extra work.
- debugMode
when true the last results are taken for the replaced fields
- maintainOrder
when true the schema is used to replace fields in the correct location, when false they are simply appended
- Definition Classes
- AddDataFunctionsImports
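A usage sketch; df, rules and the column names a and b are assumptions:

```scala
// Folds the rules over columns a and b, replacing them in place (maintainOrder defaults to true)
val replaced = df.transform(foldAndReplaceFields(rules, Seq("a", "b")))
```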
- def foldAndReplaceFieldsWithStruct[P[R] >: Dataset[R]](rules: RuleSuite, struct: StructType, foldFieldName: String = "foldedFields", debugMode: Boolean = false, tempFoldDebugName: String = "tempFOLDDEBUG", maintainOrder: Boolean = true, alias: String = "main"): (P[Row]) => P[Row]
Leverages the foldRunner to replace fields, the input fields are used to create a structure that the rules fold over. The fields are then dropped from the original Dataframe and added back from the resulting structure.
This version should only be used when you require select(*, foldRunner) to be used, it requires you fully specify types.
NOTE: The field order and types of the original DF will be maintained only when maintainOrder is true. As it requires access to the schema it may incur extra work.
- struct
The fields, and types, are used to call the foldRunner. These types must match those of the input fields
- debugMode
when true the last results are taken for the replaced fields
- maintainOrder
when true the schema is used to replace fields in the correct location, when false they are simply appended
- Definition Classes
- AddDataFunctionsImports
- def getConfig(name: String, default: String = ""): String
First attempts the system environment variable, then the Java system property, then the SQLConf.
- def hashCode(): Int
- Definition Classes
- Any
- def integrateLambdas(ruleSuiteMap: RuleSuiteMap, lambdas: Map[Id, Seq[LambdaFunction]], globalLibrary: Option[Id] = None): RuleSuiteMap
Add any of the Lambdas loaded for the rule suites
- globalLibrary
- all lambdas with this RuleSuite Id will be added to all RuleSuites
- Definition Classes
- SerializingImports
- def integrateMetaRuleSets(dataFrame: DataFrame, ruleSuiteMap: RuleSuiteMap, metaRuleSetMap: Map[Id, Seq[MetaRuleSetRow]], stablePosition: (String) => Int, transform: (DataFrame) => DataFrame = identity): RuleSuiteMap
Integrates meta rulesets into the rulesuites. Note this only works for a specific dataset; if rulesuites should be filtered for a given dataset then this must take place before calling.
- dataFrame
the dataframe to identify columns for metarules
- ruleSuiteMap
the ruleSuites relevant for this dataframe
- metaRuleSetMap
the meta rulesets
- stablePosition
this function must maintain the law that each column name within a RuleSet always generates the same position
- Definition Classes
- SerializingImports
- def integrateOutputExpressions(ruleSuiteMap: RuleSuiteMap, outputs: Map[Id, Seq[OutputExpressionRow]], globalLibrary: Option[Id] = None): (RuleSuiteMap, Map[Id, Set[Rule]])
Returns an integrated ruleSuiteMap with a set of RuleSuite Id -> Rule mappings where the OutputExpression didn't exist. If defaultProcessor is expected to be used then call integrateRuleSuites *before*.
Users should check if their RuleSuite is in the "error" map. The isAMissingRuleSuiteRule can be used to identify if a Rule is referring to a missing RuleSuite.defaultProcessor
- Definition Classes
- SerializingImports
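A sketch of the error-map check described above (`ruleSuiteMap` and `outputs` are hypothetical values loaded elsewhere, e.g. via readRuleSuitesFromDF and readOutputExpressionsFromDF):

```scala
import com.sparkutils.quality._

// integrateRuleSuites must have been applied first if defaultProcessor is used
val (integrated, errors) = integrateOutputExpressions(ruleSuiteMap, outputs)

// Distinguish genuinely missing OutputExpressions from those that merely
// refer to a missing RuleSuite.defaultProcessor
errors.foreach { case (suiteId, rules) =>
  val genuineMisses = rules.filterNot(isAMissingRuleSuiteRule)
  if (genuineMisses.nonEmpty)
    sys.error(s"RuleSuite $suiteId has rules with missing OutputExpressions")
}
```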
- def integrateRuleSuites(ruleSuiteMap: RuleSuiteMap, ruleSuites: Map[Id, RuleSuiteRow]): RuleSuiteMap
Returns an integrated ruleSuiteMap with probablePass and defaultOutputExpression applied
Call this before calling integrateOutputExpressions.
- Definition Classes
- SerializingImports
- def isAMissingRuleSuiteRule(rule: Rule): Boolean
Identify if the missing OutputExpression from a RuleSuite is from a defaultProcessor
- Definition Classes
- SerializingImports
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def loadMapConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column, key: Column, value: Column): (Seq[MapConfig], Set[String])
Loads map configurations from a given DataFrame. Wherever token is present, loader will be called and the filter optionally applied.
- returns
A tuple of the MapConfigs and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- MapLookupImportsShared
- def loadMapConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column, key: Column, value: Column): (Seq[MapConfig], Set[String])
Loads map configurations from a given DataFrame for ruleSuiteId. Wherever token is present, loader will be called and the filter optionally applied.
- returns
A tuple of the MapConfigs and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- MapLookupImportsShared
- def loadMaps(configs: Seq[MapConfig]): MapLookups
- Definition Classes
- MapLookupImportsShared
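A sketch of the map-loading pipeline under the signatures above (`loader` and `viewDF` are hypothetical placeholders, as are the column names):

```scala
import com.sparkutils.quality._
import org.apache.spark.sql.functions.col

val (mapConfigs, badRows) = loadMapConfigs(loader, viewDF,
  col("name"), col("token"), col("filter"), col("sql"),
  col("key"), col("value"))

// rows with neither token nor sql end up in badRows
assert(badRows.isEmpty, s"rows with unexpected content: $badRows")

// builds the broadcast lookups used by the map lookup functions
val lookups: MapLookups = loadMaps(mapConfigs)
```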
- def loadViewConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column): (Seq[ViewConfig], Set[String])
Loads view configurations from a given DataFrame. Wherever token is present, loader will be called and the filter optionally applied.
- returns
A tuple of the ViewConfigs and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- ViewLoading
- def loadViewConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column): (Seq[ViewConfig], Set[String])
Loads view configurations from a given DataFrame for ruleSuiteId. Wherever token is present, loader will be called and the filter optionally applied.
- returns
A tuple of the ViewConfigs and the names of rows which had unexpected content (either token or sql must be present)
- Definition Classes
- ViewLoading
- def loadViews(viewConfigs: Seq[ViewConfig]): ViewLoadResults
Attempts to load all the views present in the config. If a view is already registered with that name it will be replaced.
- returns
the names of views which have been replaced
- Definition Classes
- ViewLoading
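A sketch of loading and registering views with the two functions above (`loader`, `viewDF` and the column names are hypothetical placeholders):

```scala
import com.sparkutils.quality._
import org.apache.spark.sql.functions.col

val (viewConfigs, badRows) = loadViewConfigs(loader, viewDF,
  col("name"), col("token"), col("filter"), col("sql"))

// registers each view, replacing any existing registration of the same name
val results = loadViews(viewConfigs)
```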
- def mapLookupsFromDFs(creators: Map[String, MapCreator]): MapLookups
Loads maps to broadcast; each individual dataframe may have different associated expressions
- creators
a map of string id to MapCreator
- returns
a map of id to broadcast variables needed for exact lookup and mapping checks
- Definition Classes
- MapLookupImportsShared
- def process_if_attribute_missing(ruleSuite: Column): String
Processes a given RuleSuite to replace any coalesceIfMissingAttributes. This may be called before validate / docs but *must* be called *before* adding the ruleSuite to a dataframe.
- ruleSuite
a Column representing a rule suite from register_rule_suite_variable
- Definition Classes
- VariableProcessIfMissing
- def process_if_attribute_missing(ruleSuite: Column, schema: StructType): String
Processes a given RuleSuite to replace any coalesceIfMissingAttributes. This may be called before validate / docs but *must* be called *before* adding the ruleSuite to a dataframe.
- ruleSuite
a Column representing a rule suite from register_rule_suite_variable
- schema
The names to validate against, if empty no attempt to process coalesceIfAttributeMissing will be made
- Definition Classes
- VariableProcessIfMissing
- def process_if_attribute_missing(ruleSuite: Column, schema: StructType, stableName: String): String
Processes a given RuleSuite to replace any coalesceIfMissingAttributes. This may be called before validate / docs but *must* be called *before* adding the ruleSuite to a dataframe.
- ruleSuite
a Column representing a rule suite from register_rule_suite_variable
- schema
The names to validate against, if empty no attempt to process coalesceIfAttributeMissing will be made
- Definition Classes
- VariableProcessIfMissing
- def process_if_attribute_missing_col(ruleSuite: Column, schema: StructType, stableName: String): Column
Processes a given RuleSuite to replace any coalesceIfMissingAttributes. This may be called before validate / docs but *must* be called *before* adding the ruleSuite to a dataframe.
- ruleSuite
a Column representing a rule suite from register_rule_suite_variable
- schema
The names to validate against, if empty no attempt to process coalesceIfAttributeMissing will be made
- Definition Classes
- VariableProcessIfMissing
- def readLambdaRowsFromDF(lambdaFunctionDF: DataFrame, lambdaFunctionName: Column, lambdaFunctionExpression: Column, lambdaFunctionId: Column, lambdaFunctionVersion: Column, lambdaFunctionRuleSuiteId: Column, lambdaFunctionRuleSuiteVersion: Column): Dataset[LambdaFunctionRow]
Loads lambda functions
- Definition Classes
- SerializingImports
- def readLambdasFromDF(lambdaFunctionDF: DataFrame, lambdaFunctionName: Column, lambdaFunctionExpression: Column, lambdaFunctionId: Column, lambdaFunctionVersion: Column, lambdaFunctionRuleSuiteId: Column, lambdaFunctionRuleSuiteVersion: Column): Map[Id, Seq[LambdaFunction]]
Loads lambda functions
- Definition Classes
- SerializingImports
- def readMetaRuleSetsFromDF(metaRuleSetDF: DataFrame, columnSelectionExpression: Column, lambdaFunctionExpression: Column, metaRuleSetId: Column, metaRuleSetVersion: Column, metaRuleSuiteId: Column, metaRuleSuiteVersion: Column): Map[Id, Seq[MetaRuleSetRow]]
- Definition Classes
- SerializingImports
- def readOutputExpressionRowsFromDF(outputExpressionDF: DataFrame, outputExpression: Column, outputExpressionId: Column, outputExpressionVersion: Column, outputExpressionRuleSuiteId: Column, outputExpressionRuleSuiteVersion: Column): Dataset[OutputExpressionRow]
Loads output expressions
- Definition Classes
- SerializingImports
- def readOutputExpressionsFromDF(outputExpressionDF: DataFrame, outputExpression: Column, outputExpressionId: Column, outputExpressionVersion: Column, outputExpressionRuleSuiteId: Column, outputExpressionRuleSuiteVersion: Column): Map[Id, Seq[OutputExpressionRow]]
Loads output expressions
- Definition Classes
- SerializingImports
- def readRuleRowsFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, outputExpressionSalience: Column, outputExpressionId: Column, outputExpressionVersion: Column): Dataset[RuleRow]
Loads RuleRows from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
- def readRuleRowsFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, ruleEngine: Option[(Column, Column, Column)] = None): Dataset[RuleRow]
Loads RuleRows from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
- def readRuleSuitesFromDF(ruleSuites: Dataset[RuleSuiteRow]): Map[Id, RuleSuiteRow]
Loads RuleSuite specific attributes to use with integrateRuleSuites
- Definition Classes
- SerializingImports
- def readRulesFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, ruleEngineSalience: Column, ruleEngineId: Column, ruleEngineVersion: Column): RuleSuiteMap
Loads a RuleSuite from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
- def readRulesFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column): RuleSuiteMap
Loads a RuleSuite from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr
- Definition Classes
- SerializingImports
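A sketch of loading a RuleSuiteMap from a DataFrame with the documented columns (`rulesDF` and the column names are hypothetical placeholders matching the parameters above):

```scala
import com.sparkutils.quality._
import org.apache.spark.sql.functions.col

val ruleSuiteMap: RuleSuiteMap = readRulesFromDF(rulesDF,
  col("ruleSuiteId"), col("ruleSuiteVersion"),
  col("ruleSetId"), col("ruleSetVersion"),
  col("ruleId"), col("ruleVersion"),
  col("ruleExpr"))
```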
- def registerLambdaFunctions(functions: Seq[LambdaFunction]): Unit
This function is for use with directly loading lambda functions to use within normal Spark apps.
Please be aware that loading rules via ruleRunner, addOverallResultsAndDetails or addDataQuality will also load any registered rules applicable for that ruleSuite and may override other registered rules. For that reason you should not use this directly together with RuleSuites (otherwise your results will not be reproducible).
- Definition Classes
- LambdaFunctionsImports
- def registerQualityFunctions(): Unit
Simplified registerQualityFunctions; use the classicFunctions import when the other features are needed.
Must be called before using any functions like Passed, Failed or Probability(X) when using classic; it is a no-op when using connect with the SparkSessionExtension.
- def registerTempViewNameFromDS[T](lambdaFunctionRows: Option[Dataset[T]]): String
- Attributes
- protected
- Definition Classes
- VersionSpecificSerializingImports
- def register_rule_suite(ruleSuite: RuleSuite): String
Registers a ruleSuite directly as a Spark Variable (with object stream encoding)
- Definition Classes
- VersionSpecificSerializingImports
- def register_rule_suite(ruleSuite: RuleSuite, stableName: String): String
Registers a ruleSuite directly as a Spark Variable (with object stream encoding). Where possible using the CombinedRuleSuiteRows should be preferred, managing the ruleSuites on the server.
- Definition Classes
- VersionSpecificSerializingImports
- def register_rule_suite_group(group: RuleSuiteGroup, stableName: String): String
Registers the RuleSuiteGroup with a Spark Variable with the provided stable id
- returns
stableName
- Definition Classes
- VersionSpecificSerializingImports
- def register_rule_suite_group(group: RuleSuiteGroup): String
Registers the RuleSuiteGroup with a Spark Variable with the provided stable id
- returns
stableName
- Definition Classes
- VersionSpecificSerializingImports
- def register_rule_suite_group_variable(ds: Dataset[CombinedRuleSuiteRows], stableName: String): String
Registers the RuleSuiteGroup represented by the CombinedRuleSuiteRows with a Spark Variable with the provided stable id.
- returns
stableName
- Definition Classes
- VersionSpecificSerializingImports
- def register_rule_suite_group_variable(ds: Dataset[CombinedRuleSuiteRows]): String
Registers the RuleSuiteGroup represented by the CombinedRuleSuiteRows with a Spark Variable with the provided stable id
- returns
stableName
- Definition Classes
- VersionSpecificSerializingImports
- def register_rule_suite_variable(ds: Dataset[CombinedRuleSuiteRows], id: VersionedId, stableName: String): String
Registers the RuleSuite with a Spark Variable with the provided stable id
- returns
stableName
- Definition Classes
- VersionSpecificSerializingImports
- def register_rule_suite_variable(ds: Dataset[CombinedRuleSuiteRows], id: VersionedId): String
Registers the RuleSuite with a Spark Variable with a unique_id
- returns
the variable name
- Definition Classes
- VersionSpecificSerializingImports
- def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: DataType): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite
- ruleSuite
The ruleSuite with runOnPassProcessors
- resultDataType
The type of the results from runOnPassProcessors - must be the same for all result types; by default most fields will be nullable and encoding must follow the fields when not specified.
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- RuleEngineRunnerImports
- def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: DataType, debugMode: Boolean): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite
- ruleSuite
The ruleSuite with runOnPassProcessors
- resultDataType
The type of the results from runOnPassProcessors - must be the same for all result types; by default most fields will be nullable and encoding must follow the fields when not specified.
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- RuleEngineRunnerImports
- def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: Option[DataType] = None, debugMode: Boolean = false, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, forceTriggerEval: Boolean = false): Column
Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite
- ruleSuite
The ruleSuite with runOnPassProcessors
- resultDataType
The type of the results from runOnPassProcessors - must be the same for all result types, by default most fields will be nullable and encoding must follow the fields when not specified.
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen
- variableFuncGroup
Defaulting to 20
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- RuleEngineRunnerImports
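A sketch of applying ruleEngineRunner against the (ruleSuite, resultDataType, debugMode) overload above (`suite` and `df` are hypothetical, as is the DDL for the output type the runOnPassProcessors are assumed to produce):

```scala
import com.sparkutils.quality._
import org.apache.spark.sql.types.StructType

// all runOnPassProcessors in the hypothetical suite are assumed
// to produce a struct matching this DDL
val outType = StructType.fromDDL("discounted DOUBLE")

val out = df.withColumn("ruleEngine",
  ruleEngineRunner(suite, outType, debugMode = false))
```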
- def ruleEngineWithStruct(dataFrame: DataFrame, rules: RuleSuite, outputType: DataType, ruleEngineFieldName: String = "ruleEngine", alias: String = "main", debugMode: Boolean = false): DataFrame
Leverages the ruleRunner to produce a new output structure; the outputType defines the output structure generated.
This version should only be used when you require select(*, ruleRunner) to be used, it requires you fully specify types.
- dataFrame
the input dataframe
- outputType
The fields, and types, are used to call the runner. These types must match those of the input fields
- ruleEngineFieldName
The field name the results will be stored in, by default ruleEngine
- alias
sets the alias to use for dataFrame when using subqueries to resolve ambiguities, setting to an empty string (or null) will not assign an alias
- Definition Classes
- AddDataFunctionsImports
- def ruleEngineWithStructF[P[R] >: Dataset[R]](rules: RuleSuite, outputType: DataType, ruleEngineFieldName: String = "ruleEngine", alias: String = "main", debugMode: Boolean = false): (P[Row]) => P[Row]
Leverages the ruleRunner to produce a new output structure; the outputType defines the output structure generated.
This version should only be used when you require select(*, ruleRunner) to be used, it requires you fully specify types.
- outputType
The fields, and types, are used to call the runner. These types must match those of the input fields
- Definition Classes
- AddDataFunctionsImports
- def ruleEngineWithStructFOT[P[R] >: Dataset[R]](rules: RuleSuite, outputType: Option[DataType] = None, ruleEngineFieldName: String = "ruleEngine", alias: String = "main", debugMode: Boolean = false): (P[Row]) => P[Row]
Optional outputType; when None, Spark will attempt to derive the output type based on the first output. This is often correct, but nullability and ordering of fields are known problems; use the output type to correct them.
Leverages the ruleRunner to produce a new output structure; the outputType defines the output structure generated.
This version should only be used when you require select(*, ruleRunner) to be used, it requires you fully specify types.
- outputType
The fields, and types, are used to call the runner. These types must match those of the input fields
- Definition Classes
- AddDataFunctionsImports
- def ruleEngineWithStructOT(dataFrame: DataFrame, rules: RuleSuite, outputType: Option[DataType] = None, ruleEngineFieldName: String = "ruleEngine", alias: String = "main", debugMode: Boolean = false): DataFrame
Optional outputType; when None, Spark will attempt to derive the output type based on the first output. This is often correct, but nullability and ordering of fields are known problems; use the output type to correct them.
Leverages the ruleRunner to produce a new output structure; the outputType defines the output structure generated.
This version should only be used when you require select(*, ruleRunner) to be used, it requires you fully specify types.
- dataFrame
the input dataframe
- outputType
The fields, and types, are used to call the runner. These types must match those of the input fields
- ruleEngineFieldName
The field name the results will be stored in, by default ruleEngine
- alias
sets the alias to use for dataFrame when using subqueries to resolve ambiguities, setting to an empty string (or null) will not assign an alias
- Definition Classes
- AddDataFunctionsImports
- def ruleFolderRunner(ruleSuite: RuleSuite, startingStruct: Column, debugMode: Boolean = false, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, useType: Option[StructType] = None): Column
Creates a column that runs the folding RuleSuite. This also forces registering the lambda functions used by that RuleSuite.
FolderRunner runs all output expressions for matching rules in order of salience; the startingStruct is passed to the first matching rule, its result passed to the second, etc. In contrast to ruleEngineRunner, OutputExpressions should be lambdas with one parameter: that of the structure.
- ruleSuite
The ruleSuite with runOnPassProcessors
- startingStruct
This struct is passed to the first matching rule; ideally use the Spark DSL struct function to refer to existing columns
- debugMode
When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen
- variableFuncGroup
Defaulting to 20
- useType
In case you must use select and can't use withColumn, you may provide a type directly to avoid the NPE
- returns
A Column representing the QualityRules expression built from this ruleSuite
- Definition Classes
- RuleFolderRunnerImports
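A sketch of calling ruleFolderRunner with a startingStruct built from existing columns, as the parameter documentation suggests (`df`, `folderSuite` and the column names are hypothetical):

```scala
import com.sparkutils.quality._
import org.apache.spark.sql.functions.{col, struct}

// folderSuite's OutputExpressions are assumed to be one-parameter
// lambdas over the struct of price and quantity
val folded = df.withColumn("foldedFields",
  ruleFolderRunner(folderSuite, struct(col("price"), col("quantity"))))
```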
- def ruleRunner(ruleSuite: RuleSuite, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20): Column
Creates a column that runs the RuleSuite suitable for DQ / Validation. This also forces registering the lambda functions used by that RuleSuite. This forwards to the original ruleRunner via dqRuleRunner.
- ruleSuite
The Quality RuleSuite to evaluate
- variablesPerFunc
Defaulting to 40; in combination with variableFuncGroup it allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen. You _shouldn't_ need it but it's there just in case.
- variableFuncGroup
Defaulting to 20
- returns
A Column representing the Quality DQ expression built from this ruleSuite
- Definition Classes
- RuleRunnerImports
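A minimal DQ sketch against the signature above (`df` is a hypothetical DataFrame with a `price` column; the Id values and expression are illustrative):

```scala
import com.sparkutils.quality._

// a hypothetical minimal suite with a single expression rule
val suite = RuleSuite(Id(1, 1), Seq(
  RuleSet(Id(50, 1), Seq(
    Rule(Id(100, 1), ExpressionRule("price > 0"))
  ))
))

val withDQ = df.withColumn("DataQuality", ruleRunner(suite))
```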
- def rule_suite(ds: Dataset[CombinedRuleSuiteRows], id: VersionedId): Option[RuleSuite]
Returns a specific RuleSuite from available rule suites
- Definition Classes
- VersionSpecificSerializingImports
- def rule_suite(row: CombinedRuleSuiteRows): RuleSuite
Returns a specific RuleSuite from available rule suites
- Definition Classes
- VersionSpecificSerializingImports
- def rule_suite_from(ruleSuiteGroupName: String, ruleSuiteId: Int, ruleSuiteVersion: Int): Column
Retrieves the RuleSuite from the RuleSuiteGroup backed Spark Session variable name against the specific ruleSuiteId and ruleSuiteVersion. Null is returned if no matching entry is found.
- Definition Classes
- VersionSpecificSerializingImports
- def rule_suite_from(ruleSuiteGroupName: String, ruleSuiteId: Int): Column
Retrieves the RuleSuite from the RuleSuiteGroup backed Spark Session variable name against the highest version with the ruleSuiteId. Null is returned if no matching entry is found.
- Definition Classes
- VersionSpecificSerializingImports
- def rule_suite_group(ds: Dataset[CombinedRuleSuiteRows]): RuleSuiteGroup
Converts CombinedRuleSuiteRows into a RuleSuiteGroup
- returns
the RuleSuiteGroup
- Definition Classes
- VersionSpecificSerializingImports
- def rule_suite_group(combined: Seq[CombinedRuleSuiteRows]): RuleSuiteGroup
Converts CombinedRuleSuiteRows into a RuleSuiteGroup
- returns
the RuleSuiteGroup
- Definition Classes
- VersionSpecificSerializingImports
- def toDS(ruleSuite: RuleSuite): Dataset[RuleRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1.
- returns
a Dataset[RuleRow] with the rules flattened out
- Definition Classes
- SerializingImports
- def toLambdaDS(ruleSuite: RuleSuite): Dataset[LambdaFunctionRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1.
- returns
a Dataset[LambdaFunctionRow] with the lambdas flattened out
- Definition Classes
- SerializingImports
- def toOutputExpressionDS(ruleSuite: RuleSuite): Dataset[OutputExpressionRow]
Must have an active SparkSession before calling; only works with ExpressionRules, all other rules are converted to 1=1.
- returns
a Dataset[OutputExpressionRow] with the output expressions flattened out
- Definition Classes
- SerializingImports
- def toRuleSuiteDF(ruleSuite: RuleSuite): DataFrame
Utility function to ease dealing with simple DQ rules where the rule engine functionality is ignored. This can be paired with the default ruleEngine parameter in readRulesFromDF.
- Definition Classes
- SerializingImports
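A round-trip sketch pairing toRuleSuiteDF with readRulesFromDF as described above (`suite` is hypothetical, and the column names are assumed to follow the readRulesFromDF parameter names):

```scala
import com.sparkutils.quality._
import org.apache.spark.sql.functions.col

// serialise a suite to a DataFrame, then read it back
val df = toRuleSuiteDF(suite)
val roundTripped: RuleSuiteMap = readRulesFromDF(df,
  col("ruleSuiteId"), col("ruleSuiteVersion"),
  col("ruleSetId"), col("ruleSetVersion"),
  col("ruleId"), col("ruleVersion"),
  col("ruleExpr"))
```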
- def toRuleSuiteRow(ruleSuite: RuleSuite): (RuleSuiteRow, Option[OutputExpressionRow])
Creates a RuleSuiteRow from a RuleSuite for RuleSuite specific attributes and an optional defaultProcessor
- Definition Classes
- SerializingImports
- def toString(): String
- Definition Classes
- Any
- def typedExpressionRunner(ruleSuite: RuleSuite, ddlType: String, name: String = "expressionResults"): Column
Runs the ruleSuite expressions saving results as a tuple of (ruleResult: String, resultDDL: String)
- Definition Classes
- ExpressionRunnerImports
- object BloomModel extends Serializable
- object DefaultProcessor extends Serializable
- case object DefaultRule extends RuleResult with Product with Serializable
Returned for RuleSuiteResult.overallResult when no other trigger rule has passed and a defaultProcessor has been configured (otherwise it's Failed)
- Annotations
- @SerialVersionUID()
- case object DisabledRule extends RuleResult with Product with Serializable
This shouldn't evaluate to a fail; it signals that a rule has been disabled.
- Annotations
- @SerialVersionUID()
- object ExpressionRule extends Serializable
- case object Failed extends RuleResult with Product with Serializable
- Annotations
- @SerialVersionUID()
- case object IgnoredRule extends RuleResult with Product with Serializable
This shouldn't evaluate to a fail; it signals that a rule has been ignored.
- Annotations
- @SerialVersionUID()
- object LambdaFunction extends Serializable
- object NoOpDefaultProcessor
- object NoOpRunOnPassProcessor
- object OutputExpression
- case object Passed extends RuleResult with Product with Serializable
- Annotations
- @SerialVersionUID()
- object QualityException extends Serializable
- object RegistrationFunction
- object ResultStatisticsProvider extends Serializable
- object RuleModel
- object RuleSuite extends Serializable
- object RuleSuiteGroup extends Serializable
- object RuleSuiteGroupResults extends Serializable
- object RuleSuiteResultDetails extends Serializable
- object RunOnPassProcessor extends Serializable
- case object SoftFailed extends RuleResult with Product with Serializable
This shouldn't evaluate to a fail; think of it as Amber / Warn.
- Annotations
- @SerialVersionUID()