Packages

  • package root
    Definition Classes
    root
  • package com
    Definition Classes
    root
  • package sparkutils
    Definition Classes
    com
  • package quality

    Provides an easy import point for the library.

    Definition Classes
    sparkutils
  • package classicFunctions

    Collection of the Quality Spark Expressions for use in select( Column * )

    Definition Classes
    quality
  • package functions

    Collection of the Quality Spark Expressions for use in select( Column * )

    Definition Classes
    quality
  • package impl
    Definition Classes
    quality
  • package implicits

    Imports the common implicits

    Definition Classes
    quality
  • package simpleVersioning

    A simple versioning scheme that allows management of versions

    Definition Classes
    quality
  • package sparkless
    Definition Classes
    quality

com.sparkutils.quality

classicFunctions

package classicFunctions

Inherited
  1. classicFunctions
  2. ProcessDisableIfMissingImports
  3. ValidationImports
  4. ClassicRuleEngineRunnerImports
  5. LambdaFunctionsImports
  6. SerializingImports
  7. BlockSplitBloomFilterImports
  8. BloomFilterLookupImports
  9. LookupIdFunctionsImports
  10. MapLookupImportsShared
  11. Serializable
  12. BloomFilterRegistration
  13. ClassicRuleRunnerFunctionsImport
  14. BucketedCreatorFunctionImports
  15. BloomFilterTypes
  16. MapLookupFunctionImports
  17. ClassicRuleFolderRunnerImports
  18. BloomExpressionFunctions
  19. ClassicRuleRunnerImports
  20. BloomFilterLookupFunctionImport
  21. AnyRef
  22. Any

Type Members

  1. type BloomFilterMap = Map[String, (BloomLookup, Double)]

    Used as a param to load the bloom filter expr

    Definition Classes
    BloomFilterTypes
  2. final class EmptyToGenerateScalaDocs extends AnyRef

    Forces scaladoc to generate the package - https://github.com/scala/bug/issues/8124

    Attributes
    protected[quality]
  3. type IdTriple = (Id, Id, Id)
    Definition Classes
    LambdaFunctionsImports
  4. type MapCreator = () => (DataFrame, Column, Column)
    Definition Classes
    MapLookupImportsShared
  5. type MapLookups = Map[String, (Broadcast[MapData], DataType)]

    Used as a param to load the map lookups - note the type of the broadcast is always Map[AnyRef, AnyRef]

    Definition Classes
    MapLookupImportsShared

Abstract Value Members

  1. abstract def getClass(): Class[_ <: AnyRef]
    Definition Classes
    Any

Concrete Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    Any
  2. final def ##: Int
    Definition Classes
    Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def big_bloom(bloomOver: Column, expectedNumberOfRows: Long, expectedFPP: Double, id: String, bucketedFilesRoot: BucketedFilesRoot): Column

    Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.

    This bloom is an array of byte arrays, with the underlying arrays stored on file, as such it is only limited by file system size.

    id

    - the id of this bloom used to separate from other big_blooms - defaults to a uuid

    bucketedFilesRoot

    - provide this to override the default file root for the underlying arrays - defaults to a temporary file location or /dbfs/ on Databricks. The config property sparkutils.quality.bloom.root can also be used to set for all blooms in a cluster.

    Definition Classes
    BloomExpressionFunctions
    Annotations
    @ClassicOnly()
  6. def big_bloom(bloomOver: Column, expectedNumberOfRows: Long, expectedFPP: Double, id: String): Column

    Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.

    This bloom is an array of byte arrays, with the underlying arrays stored on file, as such it is only limited by file system size.

    id

    - the id of this bloom used to separate from other big_blooms - defaults to a uuid

    Definition Classes
    BloomExpressionFunctions
    Annotations
    @ClassicOnly()
  7. def big_bloom(bloomOver: Column, expectedNumberOfRows: Long, expectedFPP: Double): Column

    Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.

    This bloom is an array of byte arrays, with the underlying arrays stored on file, as such it is only limited by file system size.

    Definition Classes
    BloomExpressionFunctions
    Annotations
    @ClassicOnly()
  8. def big_bloom(bloomOver: Column, expectedNumberOfRows: Column, expectedFPP: Column, id: Column = lit(java.util.UUID.randomUUID().toString), bucketedFilesRoot: BucketedFilesRoot = BucketedFilesRoot(FileRoot(com.sparkutils.quality.classicFunctions.bloomFileLocation))): Column

    Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.

    This bloom is an array of byte arrays, with the underlying arrays stored on file, as such it is only limited by file system size.

    id

    - the id of this bloom used to separate from other big_blooms - defaults to a uuid, this cannot refer to a row column or expression

    bucketedFilesRoot

    - provide this to override the default file root for the underlying arrays - defaults to a temporary file location or /dbfs/ on Databricks. The config property sparkutils.quality.bloom.root can also be used to set for all blooms in a cluster.

    Definition Classes
    BloomExpressionFunctions
    Annotations
    @ClassicOnly()
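
A hedged sketch of the Column-based overload above; `df` and its `id` column are hypothetical, and this assumes the Quality library is on the classpath with an active SparkSession:

```scala
// Sketch only: `df` and its `id` column are hypothetical.
import org.apache.spark.sql.functions.col
import com.sparkutils.quality.classicFunctions.big_bloom

// Build a file-backed bloom over the id column, sized for ~1M rows
// with a 1% false positive probability, named for later lookups.
val bloomDF = df.select(big_bloom(col("id"), 1000000L, 0.01, "customer_ids"))
```
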
  9. val bloomFileLocation: String

    Where the bloom filter files will be stored

    Definition Classes
    BucketedCreatorFunctionImports
  10. def bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double, bloomId: String, partitions: Int): BloomModel

    Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.

    bloomOn

    a compatible expression to generate the bloom hashes against

    Definition Classes
    BucketedCreatorFunctionImports
  11. def bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double, bloomId: String): BloomModel

    Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.

    bloomOn

    a compatible expression to generate the bloom hashes against

    Definition Classes
    BucketedCreatorFunctionImports
  12. def bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int, fpp: Double): BloomModel

    Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.

    bloomOn

    a compatible expression to generate the bloom hashes against

    Definition Classes
    BucketedCreatorFunctionImports
  13. def bloomFrom(dataFrame: DataFrame, bloomOn: String, expectedSize: Int): BloomModel

    Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.

    bloomOn

    a compatible expression to generate the bloom hashes against

    Definition Classes
    BucketedCreatorFunctionImports
  14. def bloomFrom(dataFrame: DataFrame, bloomOn: Column, expectedSize: Int, fpp: Double = 0.01, bloomId: String = java.util.UUID.randomUUID().toString, partitions: Int = 0): BloomModel

    Generates a bloom filter using the expression passed via bloomOn with optional repartitioning and a default fpp of 0.01. A unique run will be created so other results can co-exist but this can be overridden by specifying bloomId. The other interim result files will be removed and the resulting BucketedFiles can be used in lookups or stored elsewhere.

    bloomOn

    a compatible expression to generate the bloom hashes against

    Definition Classes
    BucketedCreatorFunctionImports
  15. def bloomLookup(bucketedFiles: BloomModel): BloomLookup

    Creates a very large bloom filter from multiple buckets of 2gb arrays backed by the default Parquet implementation

  16. def bloomLookup(bytes: Array[Byte]): BloomLookup

    Creates a bloom filter from an array of bytes using the default Parquet bloom filter implementation
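
A hedged sketch combining bloomFrom and bloomLookup; the DataFrame `df` is hypothetical, and the calls mirror the signatures documented above:

```scala
// Sketch only: `df` is a hypothetical DataFrame with an `id` column.
import org.apache.spark.sql.functions.col
import com.sparkutils.quality.classicFunctions.{bloomFrom, bloomLookup}

// Build a BloomModel over the id column, expecting ~1M distinct values
// (fpp defaults to 0.01, bloomId to a uuid, no repartitioning).
val model = bloomFrom(df, col("id"), expectedSize = 1000000)

// Wrap the model for point lookups, backed by the Parquet implementation.
val lookup = bloomLookup(model)
```
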

  17. def enableFunNRewrites(): Unit

    Enables the FunNRewrite optimisation. Where a user configured LambdaFunction does not have nested HigherOrderFunctions, or declares the /* USED_AS_LAMBDA */ comment, the lambda function will be expanded, replacing all LambdaVariables with the input expressions.

    Definition Classes
    ClassicRuleRunnerFunctionsImport
  18. def enableOptimizations(extraOptimizations: Seq[org.apache.spark.sql.catalyst.rules.Rule[LogicalPlan]]): Unit

    Enables optimisations via experimental.extraOptimizations, first checking whether they are already present and ensuring all are added in order. This can be used in place of enableFunNRewrites, or in addition to it; for example, adding ConstantFolding may improve performance for given rule types

    extraOptimizations

    rules to be added in order

    Definition Classes
    ClassicRuleRunnerFunctionsImport
  19. def equals(arg0: Any): Boolean
    Definition Classes
    Any
  20. def getBlooms(ruleSuite: RuleSuite): Seq[String]

    Identifies bloom ids before (or after) resolving for a given ruleSuite; use this to know which bloom filters need to be loaded

    ruleSuite

    a ruleSuite full of expressions to check

    returns

    The bloom ids used; for unresolved expression trees this may contain blooms which are not present in the bloom map

    Definition Classes
    BloomFilterLookupImports
    Annotations
    @ClassicOnly()
  21. def hashCode(): Int
    Definition Classes
    Any
  22. def identifyLookups(ruleSuite: RuleSuite): LookupResults

    Use this function to identify which maps / blooms etc. are used by a given ruleSuite. Collects all rules that use lookup functions without constant expressions, together with the list of lookups that are constants.

    Definition Classes
    LookupIdFunctionsImports
  23. def integrateLambdas(ruleSuiteMap: RuleSuiteMap, lambdas: Map[Id, Seq[LambdaFunction]], globalLibrary: Option[Id] = None): RuleSuiteMap

    Add any of the Lambdas loaded for the rule suites

    globalLibrary

    - all lambdas with this RuleSuite Id will be added to all RuleSuites

    Definition Classes
    SerializingImports
  24. def integrateMetaRuleSets(dataFrame: DataFrame, ruleSuiteMap: RuleSuiteMap, metaRuleSetMap: Map[Id, Seq[MetaRuleSetRow]], stablePosition: (String) => Int, transform: (DataFrame) => DataFrame = identity): RuleSuiteMap

    Integrates meta rulesets into the rulesuites. Note this only works for a specific dataset; if rulesuites should be filtered for a given dataset then this must take place before calling.

    dataFrame

    the dataframe to identify columns for metarules

    ruleSuiteMap

    the ruleSuites relevant for this dataframe

    metaRuleSetMap

    the meta rulesets

    stablePosition

    this function must maintain the law that each column name within a RuleSet always generates the same position

    Definition Classes
    SerializingImports
  25. def integrateOutputExpressions(ruleSuiteMap: RuleSuiteMap, outputs: Map[Id, Seq[OutputExpressionRow]], globalLibrary: Option[Id] = None): (RuleSuiteMap, Map[Id, Set[Rule]])

    Returns an integrated ruleSuiteMap with a set of RuleSuite Id -> Rule mappings where the OutputExpression didn't exist. If defaultProcessor is expected to be used then call integrateRuleSuites *before*.

    Users should check if their RuleSuite is in the "error" map. The isAMissingRuleSuiteRule can be used to identify if a Rule is referring to a missing RuleSuite.defaultProcessor

    Definition Classes
    SerializingImports
  26. def integrateRuleSuites(ruleSuiteMap: RuleSuiteMap, ruleSuites: Map[Id, RuleSuiteRow]): RuleSuiteMap

    Returns an integrated ruleSuiteMap with probablePass and defaultOutputExpression applied

    Call this before calling integrateOutputExpressions.

    Definition Classes
    SerializingImports
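
The required call order (integrateRuleSuites strictly before integrateOutputExpressions) can be sketched as follows; `ruleSuiteMap`, `ruleSuites` and `outputs` are hypothetical inputs, as loaded via the read* functions in this package:

```scala
// Sketch only: inputs are hypothetical, loaded e.g. via readRulesFromDF,
// readRuleSuitesFromDF and readOutputExpressionsFromDF.
import com.sparkutils.quality.classicFunctions._

// 1. Apply probablePass / defaultOutputExpression first ...
val withSuites = integrateRuleSuites(ruleSuiteMap, ruleSuites)

// 2. ... then integrate output expressions and check the error map.
val (integrated, missing) = integrateOutputExpressions(withSuites, outputs)
missing.foreach { case (id, rules) =>
  // rules whose OutputExpression did not exist for RuleSuite `id`
  println(s"RuleSuite $id is missing output expressions for ${rules.size} rules")
}
```
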
  27. def isAMissingRuleSuiteRule(rule: Rule): Boolean

    Identify if the missing OutputExpression from a RuleSuite is from a defaultProcessor

    Definition Classes
    SerializingImports
  28. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  29. def loadBloomConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column, bigBloom: Column, value: Column, numberOfElements: Column, expectedFPP: Column): (Seq[BloomConfig], Set[String])

    Loads bloom configurations from a given DataFrame. Wherever token is present loader will be called and the filter optionally applied.

    returns

    A tuple of BloomConfigs and the names of rows which had unexpected content (either token or sql must be present)

    Definition Classes
    BloomFilterLookupImports
  30. def loadBloomConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column, bigBloom: Column, value: Column, numberOfElements: Column, expectedFPP: Column): (Seq[BloomConfig], Set[String])

    Loads bloom configurations from a given DataFrame for ruleSuiteId. Wherever token is present loader will be called and the filter optionally applied.

    returns

    A tuple of BloomConfigs and the names of rows which had unexpected content (either token or sql must be present)

    Definition Classes
    BloomFilterLookupImports
  31. def loadBlooms(configs: Seq[BloomConfig]): impl.bloom.BloomExpressionLookup.BloomFilterMap

    Loads bloom maps ready to register from configuration

    Definition Classes
    BloomFilterLookupImports
  32. def loadMapConfigs(loader: DataFrameLoader, viewDF: DataFrame, name: Column, token: Column, filter: Column, sql: Column, key: Column, value: Column): (Seq[MapConfig], Set[String])

    Loads map configurations from a given DataFrame. Wherever token is present loader will be called and the filter optionally applied.

    returns

    A tuple of MapConfigs and the names of rows which had unexpected content (either token or sql must be present)

    Definition Classes
    MapLookupImportsShared
  33. def loadMapConfigs(loader: DataFrameLoader, viewDF: DataFrame, ruleSuiteIdColumn: Column, ruleSuiteVersionColumn: Column, ruleSuiteId: Id, name: Column, token: Column, filter: Column, sql: Column, key: Column, value: Column): (Seq[MapConfig], Set[String])

    Loads map configurations from a given DataFrame for ruleSuiteId. Wherever token is present loader will be called and the filter optionally applied.

    returns

    A tuple of MapConfigs and the names of rows which had unexpected content (either token or sql must be present)

    Definition Classes
    MapLookupImportsShared
  34. def loadMaps(configs: Seq[MapConfig]): MapLookups
    Definition Classes
    MapLookupImportsShared
  35. def mapLookupsFromDFs(creators: Map[String, MapCreator]): MapLookups

    Loads maps to broadcast, each individual dataframe may have different associated expressions

    creators

    a map of string id to MapCreator

    returns

    a map of id to broadcast variables needed for exact lookup and mapping checks

    Definition Classes
    MapLookupImportsShared
  36. def map_contains(mapLookupName: String, lookupKey: Column, mapLookups: quality.MapLookups): Column

    Tests if there is a stored value from a map via the name mapLookupName and 'key' lookupKey. Implementation is map_lookup.isNotNull

    Definition Classes
    MapLookupFunctionImports
  37. def map_lookup(mapLookupName: String, lookupKey: Column, mapLookups: quality.MapLookups): Column

    Retrieves the stored value from a map via the name mapLookupName and 'key' lookupKey

    Definition Classes
    MapLookupFunctionImports
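
A hedged sketch tying mapLookupsFromDFs, map_lookup and map_contains together; `countryDF`, `df` and their columns are hypothetical:

```scala
// Sketch only: `countryDF`, `df` and their columns are hypothetical.
import org.apache.spark.sql.functions.col
import com.sparkutils.quality.classicFunctions._

// Each MapCreator yields (DataFrame, key column, value column).
val lookups: MapLookups = mapLookupsFromDFs(Map(
  "countries" -> (() => (countryDF, col("iso2"), col("name")))
))

// Use the broadcast lookups in a select.
val enriched = df.select(
  map_lookup("countries", col("country_code"), lookups).as("country_name"),
  map_contains("countries", col("country_code"), lookups).as("known_country")
)
```
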
  38. def namesFromSchema(schema: StructType): Set[String]

    Retrieves the names (with parent fields following the . notation for any nested fields) from the schema

    Definition Classes
    LookupIdFunctionsImports
  39. def optimalNumOfBits(n: Long, p: Double): Int

    Calculate optimal size according to the number of distinct values and false positive probability.

    n

    The number of distinct values.

    p

    The false positive probability.

    returns

    the optimal number of bits for the given n and p.

    Definition Classes
    BlockSplitBloomFilterImports
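
The textbook sizing formula behind a calculation like optimalNumOfBits is m = -n ln p / (ln 2)^2. The following is a self-contained sketch of that formula only, not the library's exact implementation (which may additionally round to block boundaries):

```scala
// Textbook bloom filter sizing: m = -n * ln(p) / (ln 2)^2.
// Not the library's exact code; block-split implementations typically
// round up to cache-line / block boundaries on top of this.
def optimalBits(n: Long, p: Double): Long =
  math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong

// Roughly 9.6 million bits (~1.2 MB) for 1M entries at 1% FPP.
val bits = optimalBits(1000000L, 0.01)
```
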
  40. def optimalNumberOfBuckets(n: Long, p: Double): Long
  41. def probability_in(lookupValue: Column, filterName: String, bloomMap: Broadcast[impl.bloom.BloomExpressionLookup.BloomFilterMap]): Column

    Lookup the value against a bloom filter from bloomMap with name filterName.

    Definition Classes
    BloomFilterLookupFunctionImport
    Annotations
    @ClassicOnly()
  42. def processCoalesceIfAttributeMissing(expression: Expression, names: Set[String]): Expression
  43. def processIfAttributeMissing(ruleSuite: RuleSuite, schema: StructType = StructType(Seq())): RuleSuite

    Processes a given RuleSuite to replace any coalesceIfMissingAttributes. This may be called before validate / docs but *must* be called *before* adding the expression to a dataframe.

    schema

    The names to validate against, if empty no attempt to process coalesceIfAttributeMissing will be made

    Definition Classes
    ProcessDisableIfMissingImports
  44. def readLambdaRowsFromDF(lambdaFunctionDF: DataFrame, lambdaFunctionName: Column, lambdaFunctionExpression: Column, lambdaFunctionId: Column, lambdaFunctionVersion: Column, lambdaFunctionRuleSuiteId: Column, lambdaFunctionRuleSuiteVersion: Column): Dataset[LambdaFunctionRow]

    Loads lambda functions

    Definition Classes
    SerializingImports
  45. def readLambdasFromDF(lambdaFunctionDF: DataFrame, lambdaFunctionName: Column, lambdaFunctionExpression: Column, lambdaFunctionId: Column, lambdaFunctionVersion: Column, lambdaFunctionRuleSuiteId: Column, lambdaFunctionRuleSuiteVersion: Column): Map[Id, Seq[LambdaFunction]]

    Loads lambda functions

    Definition Classes
    SerializingImports
  46. def readMetaRuleSetsFromDF(metaRuleSetDF: DataFrame, columnSelectionExpression: Column, lambdaFunctionExpression: Column, metaRuleSetId: Column, metaRuleSetVersion: Column, metaRuleSuiteId: Column, metaRuleSuiteVersion: Column): Map[Id, Seq[MetaRuleSetRow]]
    Definition Classes
    SerializingImports
  47. def readOutputExpressionRowsFromDF(outputExpressionDF: DataFrame, outputExpression: Column, outputExpressionId: Column, outputExpressionVersion: Column, outputExpressionRuleSuiteId: Column, outputExpressionRuleSuiteVersion: Column): Dataset[OutputExpressionRow]

    Loads output expressions

    Definition Classes
    SerializingImports
  48. def readOutputExpressionsFromDF(outputExpressionDF: DataFrame, outputExpression: Column, outputExpressionId: Column, outputExpressionVersion: Column, outputExpressionRuleSuiteId: Column, outputExpressionRuleSuiteVersion: Column): Map[Id, Seq[OutputExpressionRow]]

    Loads output expressions

    Definition Classes
    SerializingImports
  49. def readRuleRowsFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, outputExpressionSalience: Column, outputExpressionId: Column, outputExpressionVersion: Column): Dataset[RuleRow]

    Loads RuleRows from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr

    Definition Classes
    SerializingImports
  50. def readRuleRowsFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, ruleEngine: Option[(Column, Column, Column)] = None): Dataset[RuleRow]

    Loads RuleRows from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr

    Definition Classes
    SerializingImports
  51. def readRuleSuitesFromDF(ruleSuites: Dataset[RuleSuiteRow]): Map[Id, RuleSuiteRow]

    Loads RuleSuite specific attributes to use with integrateRuleSuites

    Definition Classes
    SerializingImports
  52. def readRulesFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column, ruleEngineSalience: Column, ruleEngineId: Column, ruleEngineVersion: Column): RuleSuiteMap

    Loads a RuleSuite from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr

    Definition Classes
    SerializingImports
  53. def readRulesFromDF(df: DataFrame, ruleSuiteId: Column, ruleSuiteVersion: Column, ruleSetId: Column, ruleSetVersion: Column, ruleId: Column, ruleVersion: Column, ruleExpr: Column): RuleSuiteMap

    Loads a RuleSuite from a dataframe with integers ruleSuiteId, ruleSuiteVersion, ruleSetId, ruleSetVersion, ruleId, ruleVersion and an expression string ruleExpr

    Definition Classes
    SerializingImports
  54. def registerBloomMapAndFunction(bloomFilterMap: Broadcast[impl.bloom.BloomExpressionLookup.BloomFilterMap]): Unit

    Registers this bloom map and associates the probabilityIn sql expression against it

    Definition Classes
    BloomFilterRegistration
    Annotations
    @ClassicOnly()
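
A hedged sketch of registering a bloom map and using probability_in; `sparkSession`, `model` (a loaded BloomModel) and `df` are hypothetical:

```scala
// Sketch only: `sparkSession`, `model` and `df` are hypothetical.
import org.apache.spark.sql.functions.col
import com.sparkutils.quality.classicFunctions._

// Broadcast a named bloom map: name -> (lookup, fpp).
val bloomMap = Map("customer_ids" -> (bloomLookup(model), 0.01))
val broadcasted = sparkSession.sparkContext.broadcast(bloomMap)

// Register it for the probabilityIn SQL expression ...
registerBloomMapAndFunction(broadcasted)

// ... and/or use it directly via the DSL function.
val checked = df.select(probability_in(col("id"), "customer_ids", broadcasted))
```
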
  55. def registerLambdaFunctions(functions: Seq[LambdaFunction]): Unit

    This function is for use with directly loading lambda functions to use within normal Spark apps.

    Please be aware loading rules via ruleRunner, addOverallResultsAndDetails or addDataQuality will also load any registered rules applicable for that ruleSuite and may override other registered rules. You should not use this directly together with RuleSuites for that reason (otherwise your results will not be reproducible).

    Definition Classes
    LambdaFunctionsImports
  56. def registerQualityFunctions(parseTypes: (String) => Option[DataType] = defaultParseTypes, zero: (DataType) => Option[Any] = defaultZero, add: (DataType) => Option[(Expression, Expression) => Expression] = (dataType: DataType) => defaultAdd(dataType), mapCompare: (DataType) => Option[(Any, Any) => Int] = (dataType: DataType) => utils.defaultMapCompare(dataType), writer: (String) => Unit = println, registerFunction: (String, (Seq[Expression]) => Expression) => Unit = (n, f) => ShimUtils.registerFunction(SparkSession.active)(n,f), fromExtension: Boolean = false): Unit

    Must be called before using any functions like Passed, Failed or Probability(X)

    parseTypes

    override type parsing (e.g. DDL, defaults to defaultParseTypes / DataType.fromDDL)

    zero

    override zero creation for aggExpr (defaults to defaultZero)

    add

    override the "add" function for aggExpr types (defaults to defaultAdd(dataType))

    writer

    override the printCode and printExpr print writing function (defaults to println)

    registerFunction

    function to register the sql extensions

    fromExtension

    only the SparkExtension should set this to true

    Definition Classes
    ClassicRuleRunnerFunctionsImport
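
Since every parameter defaults, a minimal registration sketch looks as follows; this assumes an active SparkSession:

```scala
// Sketch only: call once per SparkSession, before using functions
// such as Passed, Failed or Probability(X) in rule expressions.
import com.sparkutils.quality.classicFunctions.registerQualityFunctions

registerQualityFunctions() // all parameters have defaults

// The Quality functions are then usable inside SQL expression strings,
// e.g. within rule expressions.
```
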
  57. def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: DataType): Column

    Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite

    ruleSuite

    The ruleSuite with runOnPassProcessors

    resultDataType

    The type of the results from runOnPassProcessors - must be the same for all result types, by default most fields will be nullable and encoding must follow the fields when not specified.

    returns

    A Column representing the QualityRules expression built from this ruleSuite

    Definition Classes
    ClassicRuleEngineRunnerImports
  58. def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: DataType, debugMode: Boolean): Column

    Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite

    ruleSuite

    The ruleSuite with runOnPassProcessors

    resultDataType

    The type of the results from runOnPassProcessors - must be the same for all result types, by default most fields will be nullable and encoding must follow the fields when not specified.

    debugMode

    When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging

    returns

    A Column representing the QualityRules expression built from this ruleSuite

    Definition Classes
    ClassicRuleEngineRunnerImports
  59. def ruleEngineRunner(ruleSuite: RuleSuite, resultDataType: Option[DataType] = None, compileEvals: Boolean = false, debugMode: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, forceTriggerEval: Boolean = false): Column

    Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite.

    NOTE if using this in a Connect session resolveWith, compileEvals, forceRunnerEval and forceTriggerEval are ignored and use defaults

    ruleSuite

    The ruleSuite with runOnPassProcessors

    resultDataType

    The type of the results from runOnPassProcessors - must be the same for all result types, by default most fields will be nullable and encoding must follow the fields when not specified.

    compileEvals

    Should the rules be compiled out to interim objects - by default false

    debugMode

    When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging

    resolveWith

    This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this.

    variablesPerFunc

    Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen

    variableFuncGroup

    Defaulting to 20

    forceRunnerEval

    Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)

    forceTriggerEval

    Defaulting to false

    returns

    A Column representing the QualityRules expression built from this ruleSuite

    Definition Classes
    ClassicRuleEngineRunnerImports
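    A minimal usage sketch of ruleEngineRunner. This is not runnable as-is: it assumes an active SparkSession `spark` and an already-built `RuleSuite` named `suite` whose rules carry runOnPassProcessors; the column names are illustrative only.

    ```scala
    import org.apache.spark.sql.types._
    import com.sparkutils.quality.classicFunctions._

    // Assumed in scope: spark: SparkSession, suite: RuleSuite with runOnPassProcessors
    val df = spark.range(10).selectExpr("id", "id * 2 as amount")

    // resultDataType fixes the output type shared by all processors; debugMode
    // wraps it in an Array of (salience, result) pairs for inspection
    val engineCol = ruleEngineRunner(
      suite,
      resultDataType = Some(StructType(Seq(StructField("amount", LongType)))),
      debugMode = true
    )

    val out = df.withColumn("quality", engineCol)
    ```
    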
  60. def ruleFolderRunner(ruleSuite: RuleSuite, startingStruct: Column, compileEvals: Boolean = false, debugMode: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, useType: Option[StructType] = None, forceTriggerEval: Boolean = false): Column

    Creates a column that runs the folding RuleSuite.

    Creates a column that runs the folding RuleSuite. This also forces registering the lambda functions used by that RuleSuite.

    FolderRunner runs all output expressions for matching rules in order of salience: the startingStruct is passed to the first matching rule, its result to the second, and so on. In contrast to ruleEngineRunner, OutputExpressions should be lambdas with one parameter, that of the structure

    NOTE: if using this in a Connect session, resolveWith, compileEvals, forceRunnerEval and forceTriggerEval are ignored and the defaults are used

    ruleSuite

    The ruleSuite with runOnPassProcessors

    startingStruct

    This struct is passed to the first matching rule, ideally you would use the spark dsl struct function to refer to existing columns

    compileEvals

    Should the rules be compiled out to interim objects - by default false, allowing optimisations

    debugMode

    When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging

    resolveWith

    This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this.

    variablesPerFunc

    Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen

    variableFuncGroup

    Defaulting to 20

    forceRunnerEval

    Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)

    useType

    If you must use select and can't use withColumn, you may provide a type directly to prevent the NPE

    forceTriggerEval

    Defaulting to false, passing true forces each trigger expression to be compiled (compileEvals) and used in place, false instead expands the trigger in-line giving possible performance boosts based on JIT

    returns

    A Column representing the QualityRules expression built from this ruleSuite

    Definition Classes
    ClassicRuleFolderRunnerImports
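    The salience-ordered folding described above can be sketched in plain Scala (no Spark). This is a conceptual illustration only; the `Rule` case class and map-based structure here are hypothetical stand-ins for the library's actual rule and row types.

    ```scala
    // Hypothetical rule: a salience plus an output lambda over the structure
    case class Rule(salience: Int, output: Map[String, Int] => Map[String, Int])

    val rules = Seq(
      Rule(20, s => s.updated("b", s("a") * 2)), // applied second
      Rule(10, s => s.updated("a", s("a") + 1))  // applied first (lower salience)
    )

    val starting = Map("a" -> 1, "b" -> 0)

    // Fold in salience order: each rule sees the previous rule's result
    val folded = rules.sortBy(_.salience).foldLeft(starting)((acc, r) => r.output(acc))

    println(folded) // Map(a -> 2, b -> 4)
    ```
    
    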
  61. def ruleFolderRunnerClassic(ruleSuite: RuleSuite, startingStruct: Column, compileEvals: Boolean = false, debugMode: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false, useType: Option[StructType] = None, forceTriggerEval: Boolean = false): Column

    Creates a column that runs the folding RuleSuite.

    Creates a column that runs the folding RuleSuite. This also forces registering the lambda functions used by that RuleSuite.

    FolderRunner runs all output expressions for matching rules in order of salience: the startingStruct is passed to the first matching rule, its result to the second, and so on. In contrast to ruleEngineRunner, OutputExpressions should be lambdas with one parameter, that of the structure

    ruleSuite

    The ruleSuite with runOnPassProcessors

    startingStruct

    This struct is passed to the first matching rule, ideally you would use the spark dsl struct function to refer to existing columns

    compileEvals

    Should the rules be compiled out to interim objects - by default false, allowing optimisations

    debugMode

    When debugMode is enabled the resultDataType is wrapped in Array of (salience, result) pairs to ease debugging

    resolveWith

    This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this.

    variablesPerFunc

    Defaulting to 40; in combination with variableFuncGroup this allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen

    variableFuncGroup

    Defaulting to 20

    forceRunnerEval

    Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)

    useType

    If you must use select and can't use withColumn, you may provide a type directly to prevent the NPE

    forceTriggerEval

    Defaulting to false, passing true forces each trigger expression to be compiled (compileEvals) and used in place, false instead expands the trigger in-line giving possible performance boosts based on JIT

    returns

    A Column representing the QualityRules expression built from this ruleSuite

    Definition Classes
    ClassicRuleFolderRunnerImports
  62. def ruleRunner(ruleSuite: RuleSuite, compileEvals: Boolean = false, resolveWith: Option[DataFrame] = None, variablesPerFunc: Int = 40, variableFuncGroup: Int = 20, forceRunnerEval: Boolean = false): Column

    Creates a column that runs the RuleSuite.

    Creates a column that runs the RuleSuite. This also forces registering the lambda functions used by that RuleSuite

    NOTE: if using this in a Connect session, resolveWith, compileEvals and forceRunnerEval are ignored and the defaults are used

    ruleSuite

    The Quality RuleSuite to evaluate

    compileEvals

    Should the rules be compiled out to interim objects - by default false

    resolveWith

    This experimental parameter can take the DataFrame these rules will be added to and pre-resolve and optimise the sql expressions, see the documentation for details on when to and not to use this. RuleRunner does not currently do wholestagecodegen when resolveWith is used.

    variablesPerFunc

    Defaulting to 40; in combination with variableFuncGroup it allows customisation of handling the 64k JVM method size limitation when performing WholeStageCodeGen. You _shouldn't_ need it, but it's there just in case.

    variableFuncGroup

    Defaulting to 20

    forceRunnerEval

    Defaulting to false, passing true forces a simplified partially interpreted evaluation (compileEvals must be false to get fully interpreted)

    returns

    A Column representing the Quality DQ expression built from this ruleSuite

    Definition Classes
    ClassicRuleRunnerImports
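    A hedged usage sketch of ruleRunner; it assumes an active SparkSession `spark` and a built `RuleSuite` named `suite`, so it is not runnable without the library and a session.

    ```scala
    import com.sparkutils.quality.classicFunctions._

    // Assumed in scope: spark: SparkSession, suite: RuleSuite
    val df = spark.range(10).selectExpr("id", "id * 2 as amount")

    // Adds a single DQ result column evaluating every rule in the suite per row
    val withDQ = df.withColumn("DataQuality", ruleRunner(suite))

    // forceRunnerEval = true falls back to a simplified interpreted evaluation,
    // useful when diagnosing codegen issues (keep compileEvals = false for
    // fully interpreted evaluation)
    val interpreted = df.withColumn("DataQuality",
      ruleRunner(suite, compileEvals = false, forceRunnerEval = true))
    ```
    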
  63. def small_bloom(bloomOver: Column, expectedNumberOfRows: Long, expectedFPP: Double): Column

    Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.

    Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.

    This bloom fits in a single byte array; as such it is limited in the number of elements it can hold while still maintaining the fpp. Use big_bloom where item counts are high and the fpp must hold.

    Definition Classes
    BloomExpressionFunctions
    Annotations
    @ClassicOnly()
  64. def small_bloom(bloomOver: Column, expectedNumberOfRows: Column, expectedFPP: Column): Column

    Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.

    Creates a bloom over the bloomOver expression, with an expected number of rows and fpp to build.

    This bloom fits in a single byte array; as such it is limited in the number of elements it can hold while still maintaining the fpp. Use big_bloom where item counts are high and the fpp must hold.

    Definition Classes
    BloomExpressionFunctions
    Annotations
    @ClassicOnly()
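    The single-byte-array limit above can be made concrete with standard Bloom-filter sizing math (this is the textbook formula, not necessarily the library's exact implementation): the bits required are m = -n · ln(p) / (ln 2)², and a JVM byte array caps out near 2³¹ bytes, which bounds the element count a small_bloom can hold at a given fpp.

    ```scala
    // Standard Bloom-filter sizing: bits m = -n * ln(p) / (ln 2)^2
    def bloomBits(n: Long, fpp: Double): Long =
      math.ceil(-n * math.log(fpp) / (math.log(2) * math.log(2))).toLong

    val bits  = bloomBits(1000000L, 0.01) // ~9.6 million bits for 1M rows at 1% fpp
    val bytes = bits / 8                  // ~1.2 MB - comfortably one byte array
    println(s"$bits bits, ~$bytes bytes")
    ```
    
    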
  65. def toDS(ruleSuite: RuleSuite): Dataset[RuleRow]

    Must have an active SparkSession before calling and only works with ExpressionRules; all other rules are converted to 1=1

    Must have an active SparkSession before calling and only works with ExpressionRules; all other rules are converted to 1=1

    returns

    a Dataset[RuleRow] with the rules flattened out

    Definition Classes
    SerializingImports
  66. def toLambdaDS(ruleSuite: RuleSuite): Dataset[LambdaFunctionRow]

    Must have an active SparkSession before calling and only works with ExpressionRules; all other rules are converted to 1=1

    Must have an active SparkSession before calling and only works with ExpressionRules; all other rules are converted to 1=1

    returns

    a Dataset[LambdaFunctionRow] with the lambda functions flattened out

    Definition Classes
    SerializingImports
  67. def toOutputExpressionDS(ruleSuite: RuleSuite): Dataset[OutputExpressionRow]

    Must have an active SparkSession before calling and only works with ExpressionRules; all other rules are converted to 1=1

    Must have an active SparkSession before calling and only works with ExpressionRules; all other rules are converted to 1=1

    returns

    a Dataset[OutputExpressionRow] with the output expressions flattened out

    Definition Classes
    SerializingImports
  68. def toRuleSuiteDF(ruleSuite: RuleSuite): DataFrame

    Utility function to ease dealing with simple DQ rules where the rule engine functionality is ignored.

    Utility function to ease dealing with simple DQ rules where the rule engine functionality is ignored. This can be paired with the default ruleEngine parameter in readRulesFromDF

    Definition Classes
    SerializingImports
  69. def toRuleSuiteRow(ruleSuite: RuleSuite): (RuleSuiteRow, Option[OutputExpressionRow])

    Creates a RuleSuiteRow from a RuleSuite for RuleSuite specific attributes and an optional defaultProcessor

    Creates a RuleSuiteRow from a RuleSuite for RuleSuite specific attributes and an optional defaultProcessor

    Definition Classes
    SerializingImports
  70. def toString(): String
    Definition Classes
    Any
  71. def validate(schemaOrFrame: Either[StructType, DataFrame], ruleSuite: RuleSuite, showParams: ShowParams = ShowParams(), runnerFunction: Option[(DataFrame) => Column] = None, qualityName: String = "Quality", recursiveLambdasSOEIsOk: Boolean = false, transformBeforeShow: (DataFrame) => DataFrame = identity, viewLookup: (String) => Boolean = ViewLoader.defaultViewLookup): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    schemaOrFrame

    when it's a Left(StructType) the struct will be used to test against and an empty DataFrame of this type created to resolve at the Spark level. Using Right(DataFrame) will cause that DataFrame to be used, which is great for test cases with a runnerFunction

    showParams

    - configure how the output text is formatted using the same options and formatting as dataFrame.show

    runnerFunction

    - allows you to create a ruleRunner or ruleEngineRunner with different configurations

    qualityName

    - the column name to store the runnerFunction results in

    recursiveLambdasSOEIsOk

    - signals that a StackOverflowError (SOE) from a recursive lambda should not stop the evaluations; if true it will still try to run any runnerFunction but may not give the correct results

    transformBeforeShow

    - an optional transformation function to help shape what results are pushed to show

    viewLookup

    - for any subquery used, looks up whether the view name is present (quoted and with schema); defaults to the current Spark catalogue

    returns

    A set of errors and the output from the dataframe when a runnerFunction is specified

    Definition Classes
    ValidationImports
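    A hedged sketch of validating a suite before running it, using the Left(StructType) form described above; it assumes a built `RuleSuite` named `suite`, and the schema fields are illustrative only.

    ```scala
    import org.apache.spark.sql.types._
    import com.sparkutils.quality.classicFunctions._

    // Assumed in scope: suite: RuleSuite
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("amount", LongType)
    ))

    // Left(schema): an empty DataFrame of this shape is created to resolve against
    val (errors, warnings, shown, docs, lookups) = validate(Left(schema), suite)

    if (errors.nonEmpty) errors.foreach(println)
    ```
    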
  72. def validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) => Column, transformBeforeShow: (DataFrame) => DataFrame): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    frame

    the DataFrame to test against and resolve the rules on at the Spark level, which is great for test cases with a runnerFunction

    runnerFunction

    - allows you to create a ruleRunner or ruleEngineRunner with different configurations

    transformBeforeShow

    - an optional transformation function to help shape what results are pushed to show

    returns

    A set of errors and the output from the dataframe when a runnerFunction is specified

    Definition Classes
    ValidationImports
  73. def validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) => Column, transformBeforeShow: (DataFrame) => DataFrame, viewLookup: (String) => Boolean): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    frame

    the DataFrame to test against and resolve the rules on at the Spark level, which is great for test cases with a runnerFunction

    runnerFunction

    - allows you to create a ruleRunner or ruleEngineRunner with different configurations

    transformBeforeShow

    - an optional transformation function to help shape what results are pushed to show

    returns

    A set of errors and the output from the dataframe when a runnerFunction is specified

    Definition Classes
    ValidationImports
  74. def validate(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) => Column): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    frame

    the DataFrame to test against and resolve the rules on at the Spark level, which is great for test cases with a runnerFunction

    runnerFunction

    - allows you to create a ruleRunner or ruleEngineRunner with different configurations

    returns

    A set of errors and the output from the dataframe when a runnerFunction is specified

    Definition Classes
    ValidationImports
  75. def validate(frame: DataFrame, ruleSuite: RuleSuite): (Set[RuleError], Set[RuleWarning])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    frame

    the DataFrame to test against and resolve the rules on at the Spark level

    returns

    A set of errors and warnings found for the ruleSuite

    Definition Classes
    ValidationImports
  76. def validate(schema: StructType, ruleSuite: RuleSuite): (Set[RuleError], Set[RuleWarning])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    schema

    the fields the DataFrame should have

    returns

    A set of errors and warnings found for the ruleSuite

    Definition Classes
    ValidationImports
  77. def validate_Lookup(frame: DataFrame, ruleSuite: RuleSuite, runnerFunction: (DataFrame) => Column, viewLookup: (String) => Boolean): (Set[RuleError], Set[RuleWarning], String, RuleSuiteDocs, Map[IdTrEither, ExpressionLookup])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    frame

    the DataFrame to test against and resolve the rules on at the Spark level, which is great for test cases with a runnerFunction

    runnerFunction

    - allows you to create a ruleRunner or ruleEngineRunner with different configurations

    returns

    A set of errors and the output from the dataframe when a runnerFunction is specified

    Definition Classes
    ValidationImports
  78. def validate_Lookup(frame: DataFrame, ruleSuite: RuleSuite, viewLookup: (String) => Boolean = ViewLoader.defaultViewLookup): (Set[RuleError], Set[RuleWarning])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    frame

    the DataFrame to test against and resolve the rules on at the Spark level

    returns

    A set of errors and warnings found for the ruleSuite

    Definition Classes
    ValidationImports
  79. def validate_Lookup(schema: StructType, ruleSuite: RuleSuite, viewLookup: (String) => Boolean): (Set[RuleError], Set[RuleWarning])

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    For a given dataFrame provide a full set of any validation errors for a given ruleSuite.

    schema

    the fields the DataFrame should have

    returns

    A set of errors and warnings found for the ruleSuite

    Definition Classes
    ValidationImports

Deprecated Value Members

  1. def bloomFilterLookup(lookupValue: Column, bloomFilterName: Column, bloomMap: Broadcast[impl.bloom.BloomExpressionLookup.BloomFilterMap]): Column

    Lookup the value against a bloom filter from bloomMap with name bloomFilterName.

    Lookup the value against a bloom filter from bloomMap with name bloomFilterName. In line with the SQL functions, please migrate to probability_in

    Definition Classes
    BloomFilterLookupFunctionImport
    Annotations
    @deprecated @ClassicOnly()
    Deprecated

    (Since version 0.1.0) Please migrate to bloom_lookup, bloomFilterLookup will be removed in 0.2.0

Inherited from ValidationImports

Inherited from LambdaFunctionsImports

Inherited from SerializingImports

Inherited from MapLookupImportsShared

Inherited from Serializable

Inherited from BloomFilterTypes

Inherited from AnyRef

Inherited from Any

Ungrouped