
Running Quality on Databricks

The aim is to have explicit support for LTS releases; other interim versions may be supported as needed.

Running 3.1 builds on Databricks Runtime 9.1 LTS

Use the 9.1.dbr build / profile; the artefact name will also end with _9.1.dbr. OSS 3.1 users do not need to worry about this and should not use this profile.

Databricks has back-ported TreePattern, including the final nodePatterns in HigherOrderFunction, and 3.2's Conf class. As a result, very old versions of non-open-source Quality (<=0.5.0) will fail with AbstractMethodErrors when lambdas are used on 9.1, as the OSS binary version of HigherOrderFunction does not have nodePatterns. Similarly, the quality_testshade jar must use the 9.1.dbr version due to the Conf changes.

The 9.1.dbr build class files are compiled against the fake TreePattern and HigherOrderFunction present in the 9.1.dbr-scala source directory; these fake classes are, however, removed from the jar.

ResolveTableValuedFunctions and ResolveCreateNamedStruct are removed from resolveWith as they are binary incompatible with OSS. This does not seem to affect building named structs using resolveWith.

Running 3.2.1 builds on Databricks Runtime 10.4

Use the 10.4.dbr build / profile, the artefact name will also end with _10.4.dbr.

DBR 10.4 backports canonicalisation changes which allow Quality and any other code using explode and arrays to functionally run. Performance is still known to be affected. These fixes are not present in the 3.2.1 OSS release, although performance improvements may be back-ported.

ResolveTables, ResolveAlterTableCommands and ResolveHigherOrderFunctions are removed from resolveWith as they are binary incompatible with OSS.

Only 10.4 LTS is supported.

Support for 10.2 was removed in 0.0.1.

Running 3.3.0 builds on Databricks Runtime 11.3 LTS

Use the 11.3.dbr build / profile; the artefact name will also end with _11.3.dbr. Due to a backport of SPARK-39316, only 11.3 LTS is supported (although 11.2 will likely also run). The backport changed the result type of Add, causing incorrect aggregation precision via aggExpr (Sum and Average stopped using Add for this reason).
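The runtime-to-profile pairings from the sections above can be summarised as a small lookup. This is only an illustrative sketch for a notebook cell or the Scala REPL: DbrProfiles and its methods are hypothetical names, not part of Quality, and the table contains nothing beyond the pairings stated on this page.

```scala
// Hypothetical helper, NOT part of Quality: maps a Databricks Runtime
// version to the build / profile named in the sections above.
object DbrProfiles {
  // runtime -> (Spark version the Quality build targets, build / profile name)
  private val supported: Map[String, (String, String)] = Map(
    "9.1"  -> ("3.1",   "9.1.dbr"),
    "10.4" -> ("3.2.1", "10.4.dbr"),
    "11.3" -> ("3.3.0", "11.3.dbr")
  )

  /** Build / profile for a runtime such as "11.3", or None if unsupported. */
  def profileFor(runtime: String): Option[String] =
    supported.get(runtime.trim).map(_._2)

  /** Suffix the artefact name ends with, e.g. "_11.3.dbr". */
  def artefactSuffixFor(runtime: String): Option[String] =
    profileFor(runtime).map("_" + _)

  /** Spark version the matching Quality build targets. */
  def sparkBuildFor(runtime: String): Option[String] =
    supported.get(runtime.trim).map(_._1)
}

println(DbrProfiles.profileFor("10.4"))        // Some(10.4.dbr)
println(DbrProfiles.artefactSuffixFor("11.3")) // Some(_11.3.dbr)
println(DbrProfiles.profileFor("10.2"))        // None
```

Returning an Option rather than throwing keeps unsupported runtimes, such as 10.2, explicit at the call site.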

Testing out Quality via Notebooks

You can upload the quality_testshade artefact jar for your runtime (e.g. DBR 11.3) from Maven into your workspace / notebook environment, or add it as a Maven library. When using Databricks, make sure to use the appropriate _Version.dbr builds.

Then using:

import com.sparkutils.quality.tests.TestSuite
import com.sparkutils.qualityTests.SparkTestUtils

SparkTestUtils.setPath("path_where_test_files_should_be_generated")
TestSuite.runTests

in your cell will run through all of the test suite used when building Quality.

In Databricks notebooks you can set the path up via:

val fileLoc = "/dbfs/databricks/quality_test"
SparkTestUtils.setPath(fileLoc)
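Putting the two snippets together, a single notebook cell might look like the sketch below. It assumes the quality_testshade jar for your runtime is already attached to the cluster. Creating the directory up front with java.nio is an extra precaution added here, not something this page says setPath requires.

```scala
import java.nio.file.{Files, Paths}

import com.sparkutils.quality.tests.TestSuite
import com.sparkutils.qualityTests.SparkTestUtils

// Location for generated test files, as used in the example above
val fileLoc = "/dbfs/databricks/quality_test"

// Assumption: pre-creating the directory is harmless; it is a no-op if
// the directory already exists.
Files.createDirectories(Paths.get(fileLoc))

// Point the test utilities at the directory, then run the full suite
SparkTestUtils.setPath(fileLoc)
TestSuite.runTests
```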

Ideally, after 10 minutes or so and some stdout, the run will end with a summary. For example, a run on DBR 11.3 provides:

Time: 682.626

OK (210 tests)

Finished. Result: Failures: 0. Ignored: 0. Tests run: 210. Time: 682626ms.
import com.sparkutils.quality.tests.TestSuite
import com.sparkutils.qualityTests.SparkTestUtils
fileLoc: String = /dbfs/databricks/quality_test
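If you want to check the outcome programmatically, the summary line can be parsed with a regular expression. The following is a minimal sketch that assumes the line has exactly the format shown in the output above. RunSummary is a hypothetical helper, not part of Quality, and capturing the line from stdout is left to the reader. Define the class and its companion in the same cell (or via :paste in the REPL).

```scala
// Hypothetical helper, NOT part of Quality: parses the final summary line
// printed by the test run, assuming the format shown above.
final case class RunSummary(failures: Int, ignored: Int, testsRun: Int, timeMs: Long) {
  // A run counts as passed when nothing failed and at least one test ran
  def passed: Boolean = failures == 0 && testsRun > 0
}

object RunSummary {
  private val Pattern =
    """Finished\. Result: Failures: (\d+)\. Ignored: (\d+)\. Tests run: (\d+)\. Time: (\d+)ms\.""".r

  /** Returns None when the line does not match the expected format. */
  def parse(line: String): Option[RunSummary] = line.trim match {
    case Pattern(f, i, t, ms) => Some(RunSummary(f.toInt, i.toInt, t.toInt, ms.toLong))
    case _                    => None
  }
}

val summary = RunSummary.parse(
  "Finished. Result: Failures: 0. Ignored: 0. Tests run: 210. Time: 682626ms.")
println(summary.map(_.passed)) // Some(true)
```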

Last update: March 27, 2023 09:08:01
Created: March 27, 2023 09:08:01