Quality - 0.1.1

Coverage: Statement 91.11, Branch 93.08

Run complex data quality rules using simple SQL in a batch or streaming Spark application at scale.

Write rules using simple SQL or create re-usable functions via SQL Lambdas.

Your rules are just versioned data: store them wherever convenient and use them by simply defining a column.

  • 🆕 to_yaml and from_yaml - convert Spark fields to and from YAML (unlike to_json, it allows non-string map keys)
  • 🆕 Improved update_field and added drop_field - handles nested transformations directly using the 3.4.1 Spark implementation
  • 🆕 rule_result - directly access rule results from DQ and expressionRunner, simplifying row statistic collection
  • 🆕 expression runner - new runner type saving expression results directly to YAML, suitable for aggregate statistics
  • 🆕 view loading - load views for data lookup and transformation rules from a configuration DataFrame
  • 🆕 map loading - load maps using views or DataFrames from a configuration DataFrame
  • 🆕 bloom loading - load blooms using views or DataFrames from a configuration DataFrame
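The note about non-string map keys reflects a general property of JSON, where object keys must be strings. The sketch below is not Spark or Quality code; it uses only Python's standard `json` module (an assumption made purely for illustration) to show how an integer-keyed map loses its key type on a JSON round trip.

```python
import json

# A map keyed by integers, comparable to a Spark map column with non-string keys.
scores = {1: "low", 2: "high"}

# JSON objects only permit string keys, so the encoder coerces them.
encoded = json.dumps(scores)

# Decoding cannot recover the original key type: the keys come back as strings.
decoded = json.loads(encoded)

print(encoded)
print(decoded == scores)
```

A format that can represent typed keys avoids this loss, which is the motivation the list above gives for to_yaml and from_yaml.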

Rules are evaluated lazily during Spark actions, such as writing a row, with results saved in a single predictable column.

Enhanced Spark Functionality

Lookup functions are distributed across the Spark cluster and held in memory, so no shuffling is required. This helps where the shuffling introduced by joins would be too expensive:

  • Support for massive Bloom filters while retaining the false positive probability (FPP); for example, several billion items at an FPP of 0.001 would not fit into a normal 2 GB byte array
  • Map lookup expressions for exact lookups and contains tests; they use broadcast variables under the hood, which makes them a great fit for small reference data sets
  • View loading - manage the use of session views in your application through configuration and a pluggable DataFrameLoader

  • Lambda Functions - user-provided, re-usable SQL functions over late-bound columns

  • Fast PRNGs exposing RandomSource, allowing pluggable and stable generation across the cluster

  • Aggregate functions over Maps expandable with simple SQL Lambdas

  • Row ID expressions including guaranteed unique row IDs (based on MAC address guarantees)
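The idea behind MAC-based row IDs can be sketched as follows. This is not Quality's implementation, and the bit layout, the `make_row_id` name, and the counter are all hypothetical: the sketch simply combines a 48-bit node identifier with a timestamp and a per-process sequence number. Note that Python's `uuid.getnode()` may fall back to a random 48-bit value when no hardware address can be read, and that a real implementation would also need to distinguish multiple processes on the same host.

```python
import itertools
import time
import uuid

# Per-process sequence number (hypothetical; not part of Quality).
_counter = itertools.count()


def make_row_id(node=None):
    """Return a 128-bit integer laid out as
    48-bit node id | 48-bit millisecond timestamp | 32-bit sequence."""
    node = uuid.getnode() if node is None else node
    millis = int(time.time() * 1000) & ((1 << 48) - 1)
    seq = next(_counter) & 0xFFFFFFFF
    return (node << 80) | (millis << 32) | seq


ids = [make_row_id() for _ in range(10_000)]
print(len(set(ids)) == len(ids))
```

Within one process the sequence number keeps IDs distinct even when many are generated in the same millisecond, while the node component separates IDs produced on different machines.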

Plus a collection of handy functions to integrate it all.
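The Bloom filter sizing claim in the list above can be checked with the standard formula for an optimally sized Bloom filter, m = -n ln(p) / (ln 2)^2 bits for n items at false positive probability p. The sketch below is a back-of-the-envelope calculation only: the item count of three billion is a hypothetical stand-in for "several billion", and the 2^31 - 1 limit is the maximum length of a JVM array.

```python
import math

# Maximum length of a JVM array, hence of a single byte[].
MAX_JVM_ARRAY_BYTES = 2**31 - 1


def optimal_bloom_bits(n_items, fpp):
    """Bits needed by an optimally sized Bloom filter: -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fpp) / (math.log(2) ** 2))


def optimal_hash_count(n_items, n_bits):
    """Optimal number of hash functions: (m / n) * ln 2, rounded."""
    return max(1, round(n_bits / n_items * math.log(2)))


n = 3_000_000_000  # hypothetical: "several billion" items
p = 0.001          # false positive probability from the text

bits = optimal_bloom_bits(n, p)
size_bytes = math.ceil(bits / 8)

# Largest item count a single maximum-size byte array could hold at this FPP.
max_items = int(MAX_JVM_ARRAY_BYTES * 8 / (bits / n))

print(f"bits per item: {bits / n:.2f}")
print(f"filter size:   {size_bytes / 2**30:.2f} GiB")
print(f"hash count:    {optimal_hash_count(n, bits)}")
print(f"fits in one byte array: {size_bytes <= MAX_JVM_ARRAY_BYTES}")
print(f"max items in one array: {max_items:,}")
```

Under these assumptions the filter needs roughly 14.4 bits per item, so a single byte array tops out at a little over one billion items, which is why larger filters have to be split across multiple arrays or stored another way.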


Last update: July 9, 2023 20:20:54
Created: July 9, 2023 20:20:54