Storage Model¶

Nested columns, with nested columns, this lets you use Spark SQL to do filters and have predicate pushdown. Sample filter:

df.select(expr("filter(map_values(DataQuality.ruleSetResults), 
  ruleSet -> size(filter(map_values(ruleSet.ruleResults), 
  result ->  probability(result) > 0.3 )) > 0)").as("filtered"))

actual type:

struct<id: LongType, overallResult: IntegerType, 
  ruleSetResults: map<LongType, 
    struct<overallResult: IntegerType, 
	  ruleResults: map<LongType, IntegerType>>>>

Alternatively when creating with addOverallResultsAndDetails you have the

overallResult: IntegerType

moved to the top level, leaving

details: struct<id: LongType, 
  ruleSetResults: map<LongType, 
    struct<overallResult: IntegerType, 
	  ruleResults: map<LongType, IntegerType>>>>

Where have all the VersionIds and RuleResults gone?¶

In order to optimise storage and marshalling the VersionId parts are packed into a single LongType. RuleResults are similarly encoded into an IntegerType:

Failed => FailedInt // 0
SoftFailed => SoftFailedInt // -1
Disabled => DisabledInt // -2
Passed => PassedInt // 100000
Probability(percentage) => (percentage * PassedInt).toInt

When the developer wishes to retrieve the objects they may use the encoders directly:

// frameless is used to encode
import frameless._
// imports the encoders for RuleSuiteResult
import com.sparkutils.quality.implicits._
// derive an encoder for the pair with a user type and the RuleSuiteResult for a given row
implicit val enc = TypedExpressionEncoder[(TestIdLeft, RuleSuiteResult)]
// select the fields needed for the user type and the DataQuality result (or details with RuleResult, RuleSuiteResultDetails for seperate overall results and details)
val ds = df.selectExpr("named_struct('left_lower', `1`, 'left_higher', `2`)","DataQuality").as[(TestIdLeft, RuleSuiteResult)]

the developer can then interegate the data quality results alongside their relevant data.

Last update: December 3, 2024 17:06:32
Created: December 3, 2024 17:06:32