# Quality - 0.2.0-preview1.1
Coverage

| Statement | Branch |
| --- | --- |
| 88.87% | 82.32% |
Run complex data quality and transformation rules using simple SQL in a batch or streaming Spark application, at scale.

Write rules in simple SQL, or create re-usable functions via SQL lambdas. Your rules are just versioned data: store them wherever is convenient and use them by simply defining a column.
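The "rules are just versioned data" idea can be illustrated with a minimal, library-agnostic sketch: each rule is a plain record (id, version, predicate) that could live in any store, and running a rule set over a row yields one predictable result column keyed by rule id. The record shape and names below are illustrative assumptions, not the Quality library's API.

```python
# Minimal sketch: rules as versioned data, results in a single map-like column.
# This is NOT the Quality library's API, just an illustration of the idea.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    rule_id: str
    version: int
    predicate: Callable[[dict], bool]  # stand-in for a SQL expression

def run_rules(row: dict, rules: list) -> dict:
    # All results land in one predictable column keyed by rule id.
    return {r.rule_id: r.predicate(row) for r in rules}

rules = [
    Rule("non_negative_amount", 1, lambda row: row["amount"] >= 0),
    Rule("known_currency", 2, lambda row: row["currency"] in {"EUR", "USD"}),
]

result = run_rules({"amount": 42.0, "currency": "EUR"}, rules)
# result == {"non_negative_amount": True, "known_currency": True}
```

Because the rules are data rather than code, bumping a rule's version or swapping its predicate needs no application redeploy, which is the property the paragraph above describes.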
- Spark 4 Connect support
- Folder can use a DefaultProcessor; both Folder and Engine now use the improved collectRunner result-processing logic
- RuleSuiteGroups: manage a single group of rules by name, and use it to access ruleSuites in nested runners and to group the results
Rules are evaluated lazily during Spark actions, such as writing a row, with results saved in a single predictable column.
## Enhanced Spark Functionality
- Lambda Functions - user-provided, re-usable SQL functions over late-bound columns
- Map lookup expressions for exact lookups and contains tests; using broadcast variables on Classic and Variables on Connect under the hood, they are a great fit for small reference data sets
- View loading - manage the use of session views in your application through configuration and a pluggable DataFrameLoader
- Aggregate functions over Maps, expandable with simple SQL Lambdas
- Row ID expressions, including guaranteed-unique row IDs (based on MAC address guarantees)
- Fast PRNGs exposing RandomSource, allowing pluggable and stable generation across the cluster
- Support for massive Bloom Filters that retain their FPP (e.g. several billion items at 0.001 would not fit into a normal 2 GB byte array) on Spark Classic
Plus a collection of handy functions to integrate it all.
Created: March 8, 2026 14:46:46