Quality - 0.0.2

Coverage: Statement 87.50%, Branch 80.60%

Run complex data quality rules using simple SQL in a batch or streaming Spark application at scale.

Write rules using simple SQL, or create re-usable functions via SQL Lambdas.

Your rules are just versioned data: store them wherever is convenient and use them by simply defining a column.
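To make the "rules are just versioned data" idea concrete, here is a minimal sketch that does not use Quality or Spark at all. It uses Python's built-in `sqlite3` as a stand-in SQL engine, and the rule layout (`rule_id`, `version`, SQL expression), the `trades` table and the `evaluate` helper are all hypothetical names chosen for illustration. It only shows the general pattern of storing SQL expressions as rows and turning them into per-row results.

```python
import sqlite3

# Hypothetical rule store: each rule is plain data (rule_id, version, SQL boolean expression).
RULES = [
    (1, 1, "amount > 0"),
    (2, 1, "currency IN ('GBP', 'EUR', 'USD')"),
]

def evaluate(rows, rules):
    """Return {row id: {(rule_id, version): passed}} for every input row."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE trades (id INTEGER, amount REAL, currency TEXT)")
    con.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)
    # Each stored expression becomes one computed column in a single query.
    cols = ", ".join(f"({sql}) AS r{rid}_v{ver}" for rid, ver, sql in rules)
    out = {}
    for rec in con.execute(f"SELECT id, {cols} FROM trades ORDER BY id"):
        # Pack the per-rule flags into one result structure per row.
        out[rec[0]] = {
            (rid, ver): bool(flag) for (rid, ver, _), flag in zip(rules, rec[1:])
        }
    con.close()
    return out

results = evaluate([(1, 10.0, "GBP"), (2, -5.0, "XXX")], RULES)
```

Because the rules live in ordinary data, changing a rule means adding a row with a new version rather than redeploying code. In this sketch that would be one more tuple in `RULES`.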

- comparableMaps - allow unions or sorting with map columns without JSON serialising and parsing overhead
- set syntax - simplified syntax for updating and defaulting
- Spark Extension - registers common Quality SQL functions automatically for Thrift/Hive servers and query optimisations
- Databricks 12.2 support
- New id related functions: id_size, id_base64, id_from_base64, id_raw_type and as_uuid

Rules are evaluated lazily during Spark actions, such as writing a row, with results saved in a single predictable and extensible column.

Enhanced Spark Functionality

- Lookup functions are distributed across the Spark cluster and held in memory, so no shuffling is required. This helps where the shuffling introduced by joins would be too expensive:
  - Support for massive Bloom filters while retaining FPP (i.e. several billion items at 0.001 would not fit into a normal 2 GB byte array)
  - Map lookup expressions for exact lookups and contains tests; they use broadcast variables under the hood, which makes them a great fit for small reference data sets
- Lambda functions - user-provided re-usable SQL functions over late-bound columns
- Fast PRNGs exposing RandomSource, allowing pluggable and stable generation across the cluster
- Aggregate functions over Maps, expandable with simple SQL Lambdas
- Row ID expressions, including guaranteed unique row IDs (based on MAC address guarantees)

Plus a collection of handy functions to integrate it all.
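
As a rough check of the Bloom filter sizing claim above, the sketch below applies the standard optimal-size formula m = -n ln(p) / (ln 2)^2. The item count of three billion is an assumed stand-in for "several billion", and the function and constant names are illustrative rather than part of Quality.

```python
import math

# A single JVM array is indexed by a signed 32-bit int, so a byte[] tops out near 2 GB.
MAX_JVM_ARRAY_BYTES = 2**31 - 1

def bloom_bits(n_items: int, fpp: float) -> int:
    """Optimal bit count for a standard Bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fpp) / (math.log(2) ** 2))

def bloom_hashes(n_items: int, bits: int) -> int:
    """Optimal number of hash functions: k = (m / n) * ln 2, rounded to the nearest integer."""
    return max(1, round(bits / n_items * math.log(2)))

n = 3_000_000_000          # assumed example value for "several billion items"
fpp = 0.001                # false positive probability from the text
bits = bloom_bits(n, fpp)
size_bytes = math.ceil(bits / 8)
hashes = bloom_hashes(n, bits)
fits_in_one_array = size_bytes <= MAX_JVM_ARRAY_BYTES
```

Under these assumptions the filter needs about 14.4 bits per item, or roughly 5.4 GB in total, so it cannot be held in one byte array at that FPP. Keeping the FPP at that scale therefore means spreading the bits across more than one array.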

Last update: June 2, 2023 18:50:52
Created: June 2, 2023 18:50:52