qualityFunctions

package qualityFunctions

Type Members

trait Digest extends AnyRef
Basic digest implementation for Array[Long] based hashes
trait DigestFactory extends Serializable
Factory to get a new or reset digest for each row
class DoCodegenFallbackHandler extends LambdaCompilationHandler
Defaults to calling codeGen; depending on the implementation, this is either the original compilation approach or CodegenFallback. It evaluates the entire tree of expr and returns all NamedLambdaVariables, as they will not be using the same compilation approach.
This is the default for known OSS implementations and should also be used if compilation will not be within the same class.
trait FunDoGenCode extends Expression with CodegenFallback
Generates code for any FunX, including nested ones; the normal doGenCode defaults to CodegenFallback.
case class FunForward(children: Seq[Expression]) extends Expression with CodegenFallback with Product with Serializable
Forwards calls to the function arguments via setters. This is only evaluated in aggExpr; all other usages are removed during lambda creation.
This removal may also be forced in aggExpr at a later stage.
case class FunN(arguments: Seq[Expression], function: Expression, name: Option[String] = None, processed: Boolean = false, attemptCodeGen: Boolean = false) extends Expression with HigherOrderFunction with CodegenFallback with SeqArgs with FunDoGenCode with Product with Serializable
Lambda function with multiple args, typically created with placeholder AtomicRefExpression args.
arguments
Evaluated to provide input to the function lambda
function
the actual lambda function
name
the lambda name when available
abstract class HashLongsExpression extends Expression with CodegenFallback
A function that calculates a hash value for a group of expressions. Note that the seed argument is not exposed to users and should only be set inside Spark SQL.
The hash value for an expression depends on its type and seed:
- null: seed
- boolean: turn boolean into int, 1 for true, 0 for false, and then use murmur3 to hash this int with seed.
- byte, short, int: use murmur3 to hash the input as int with seed.
- long: use murmur3 to hash the long input with seed.
- float: turn it into int: java.lang.Float.floatToIntBits(input), and hash it.
- double: turn it into long: java.lang.Double.doubleToLongBits(input), and hash it.
- decimal: if it's a small decimal, i.e. precision <= 18, turn it into long and hash it. Else, turn it into bytes and hash it.
- calendar interval: hash microseconds first, and use the result as seed to hash months.
- interval day to second: it stores a long value of microseconds, use murmur3 to hash the long input with seed.
- interval year to month: it stores an int value of months, use murmur3 to hash the int input with seed.
- binary: use murmur3 to hash the bytes with seed.
- string: get the bytes of string and hash it.
- array: The result starts with seed, then use result as seed, recursively calculate hash value for each element, and assign the element hash value to result.
- struct: The result starts with seed, then use result as seed, recursively calculate hash value for each field, and assign the field hash value to result.
Finally, we aggregate the hash values for each expression in the same way as for struct.
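The seed chaining described above can be sketched in plain Scala. This is only an illustration under stated assumptions: HashFoldSketch, hashInt, hashLong and hashValue are hypothetical names, and scala.util.hashing.MurmurHash3 is used as a stand-in, so the resulting values will not match those of Spark's own murmur3 implementation or of HashLongsExpression.

```scala
import java.nio.charset.StandardCharsets
import scala.util.hashing.MurmurHash3

// Hypothetical sketch: shows how the seed is threaded through values,
// not the hashing used by HashLongsExpression itself.
object HashFoldSketch {
  // Stand-in for "hash this int with seed".
  def hashInt(value: Int, seed: Int): Int =
    MurmurHash3.finalizeHash(MurmurHash3.mix(seed, value), 4)

  // Stand-in for "hash this long with seed": mixes the low then the high 32 bits.
  def hashLong(value: Long, seed: Int): Int = {
    val low = value.toInt
    val high = (value >>> 32).toInt
    MurmurHash3.finalizeHash(MurmurHash3.mix(MurmurHash3.mix(seed, low), high), 8)
  }

  def hashValue(value: Any, seed: Int): Int = value match {
    case null       => seed                                   // null: seed
    case b: Boolean => hashInt(if (b) 1 else 0, seed)         // boolean as int
    case b: Byte    => hashInt(b.toInt, seed)
    case s: Short   => hashInt(s.toInt, seed)
    case i: Int     => hashInt(i, seed)
    case l: Long    => hashLong(l, seed)
    case f: Float   => hashInt(java.lang.Float.floatToIntBits(f), seed)
    case d: Double  => hashLong(java.lang.Double.doubleToLongBits(d), seed)
    case s: String  => MurmurHash3.bytesHash(s.getBytes(StandardCharsets.UTF_8), seed)
    // array / struct: start with seed, then use each element's hash as the next seed
    case xs: Seq[_] => xs.foldLeft(seed)((result, element) => hashValue(element, result))
    case other =>
      throw new IllegalArgumentException(s"Unsupported type in sketch: ${other.getClass}")
  }
}
```

Because each element's hash becomes the seed for the next, hashing Seq(1, 2) with seed 42 is the same as hashing 2 with the result of hashing 1 with 42, and a sequence of nulls simply returns the seed.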
abstract class InterpretedHashLongsFunction extends AnyRef
Base class for interpreted hash functions.
case class MapMerge(children: Seq[Expression], addF: (DataType) ⇒ Option[(Expression, Expression) ⇒ Expression]) extends Expression with CodegenFallback with Product with Serializable
Merges a sequence of maps, combining values with a monoidal add
children
seq of maps of type x to y, they must all have the same types
addF
function to derive the add expr for monoidal add on values
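As a rough illustration of a monoidal add over map values, the following plain Scala sketch merges ordinary collections. It assumes that values for keys present in more than one map are combined with the add function; MapMergeSketch and mergeMaps are hypothetical names and this is not the MapMerge expression itself, which works on Catalyst expressions and derives the add expression from the value DataType.

```scala
// Hypothetical sketch of merging maps with a monoidal add on the values.
object MapMergeSketch {
  def mergeMaps[K, V](maps: Seq[Map[K, V]])(add: (V, V) => V): Map[K, V] =
    maps.foldLeft(Map.empty[K, V]) { (acc, next) =>
      next.foldLeft(acc) { case (merged, (key, value)) =>
        // Combine with the existing value when the key is already present.
        val combined = merged.get(key).map(existing => add(existing, value)).getOrElse(value)
        merged.updated(key, combined)
      }
    }
}
```

For example, merging Map("a" -> 1L, "b" -> 2L) with Map("b" -> 3L, "c" -> 4L) using `_ + _` keeps "a" and "c" as they are and sums the two "b" values.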
case class MapTransform(argument: Expression, key: Expression, function: Expression, zeroF: (DataType) ⇒ Option[Any]) extends Expression with HigherOrderFunction with CodegenFallback with SeqArgs with Product with Serializable
Transforms a map
argument
map of type x to y
key
expr for key
function
value to value transformation for that key entry
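The idea of transforming a single key entry can be sketched with ordinary Scala collections. The handling of a missing key is an assumption: the zeroF parameter is not described here, so this sketch guesses that an optional zero value is used as the starting point when the key is absent. MapTransformSketch and transformAt are hypothetical names, not part of the library.

```scala
// Hypothetical sketch: apply a value-to-value function to one key of a map.
object MapTransformSketch {
  def transformAt[K, V](map: Map[K, V], key: K, zero: Option[V])(f: V => V): Map[K, V] =
    map.get(key).orElse(zero) match {
      case Some(current) => map.updated(key, f(current)) // existing value, or the assumed zero
      case None          => map                          // no value and no zero: unchanged
    }
}
```

With a zero of Some(0L) and the function `_ + 1`, an existing entry is incremented and a missing entry is created with the value 1.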
case class NamedLambdaVariableCodeGen(name: String, dataType: DataType, nullable: Boolean, exprId: ExprId, valueRef: String) extends LeafExpression with NamedExpression with Product with Serializable
Replaces NamedLambdaVariables for simple inlined codegen.
case class PlaceHolderExpression(dataType: DataType, nullable: Boolean = true) extends LeafExpression with Unevaluable with Product with Serializable
Only used with Lambda placeholders, defaults to allowing nullable values
trait RefCodeGen extends AnyRef
case class RefExpression(dataType: DataType, nullable: Boolean = true, index: Int = -1) extends LeafExpression with RefCodeGen with Product with Serializable
Getter, trimmed version of NamedLambdaVariable as it should never be resolved
case class RefExpressionLazyType(dataTypeF: () ⇒ DataType, nullable: Boolean) extends LeafExpression with RefCodeGen with Product with Serializable
Getter, trimmed version of NamedLambdaVariable as it should never be resolved
case class RefSetterExpression(children: Seq[Expression]) extends Expression with CodegenFallback with Product with Serializable
Wraps other expressions and stores the result in a RefExpression
case class RunAllReturnLast(children: Seq[Expression]) extends Expression with CodegenFallback with Product with Serializable
Runs all of the children and returns the eval result of the last one, which allows stitching together lambdas with aggregates
trait SeqArgs extends AnyRef

Value Members

object FunCall
object LambdaCompilationUtils
Functionality related to LambdaCompilation. Seemingly all HigherOrderFunctions use a lazy val match to extract the NamedLambdaVariables from the Spark LambdaFunction after bind has been called. When doGenCode is called, eval _could_ already have been called and the lazy val evaluated, so simply rewriting the tree may not fully work. Additionally, the type for NamedLambdaVariable is bound in the lazy vals, which means _ANY_ HigherOrderFunction may not tolerate swapping out NamedLambdaVariables for another NamedExpression.
To add to the fun, open source Spark HoFs all use CodegenFallback, as does NamedLambdaVariable, so it's possible to swap out some of these implementations if an array_transform is nested in a Fun1 or Fun2. Similarly, a Fun1 can call a Fun2, so the assumptions for each Fun1/FunN doCodeGen are:
1. Use the processLambda function to evaluate the function
2. compilationHandlers uses the quality.lambdaHandlers environment variable to load a comma separated list of fqn=handler pairs
3. each fully qualified class name pair (e.g. org.apache.spark.sql.catalyst.expressions.ZipWith=handler.fqn) handler is loaded
4. processLambda then evaluates the expression tree, for each matching HoF classname it will call the handler
5. handlers are used to perform the custom doGenCode for that expression rather than the default OSS CodegenFallback
6. handlers return the ExprCode AND a list of NamedLambdaVariables which must have .value.set called upon them (e.g. we can't optimise them)
NB The fqn will also be used to check for named lambdas used through registerLambdaFunctions.
https://github.com/apache/spark/pull/21954 introduced the lambda variable with an AtomicReference, which carries an inherent performance hit, and, due to the difficulty of threading the holder through the expression chain, it did not have a compilation approach. After it is threaded and bind has been called, the variable id is stable, as is the AtomicReference, so it can be swapped out for a simple variable in the same object.
quality.lambdaHandlers will override the default for a given platform on an fqn basis, so you only need to "add" or "replace" the HoFs that cause issues, for example TransformValues, not the entire list of OSS HigherOrderFunctions. Note that some versions of Databricks provide compilation of their HoFs that may not be compatible in approach.
Disable this approach by using quality.lambdaHandlers to register the default DoCodegenFallbackHandler for FunN: quality.lambdaHandlers=org.apache.spark.sql.qualityFunctions.FunN=org.apache.spark.sql.qualityFunctions.DoCodegenFallbackHandler
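The comma separated fqn=handler format can be illustrated with a small parser. This is only a sketch of the format described above: LambdaHandlersSketch and parseHandlers are hypothetical names, and how the library actually reads quality.lambdaHandlers and instantiates the handlers is not shown here.

```scala
// Hypothetical sketch: parse a quality.lambdaHandlers style value into
// a map of expression class name -> handler class name.
object LambdaHandlersSketch {
  def parseHandlers(setting: String): Map[String, String] =
    setting.split(",").iterator
      .map(_.trim)
      .filter(_.nonEmpty) // tolerate empty input and trailing commas
      .map { pair =>
        // Split on the first '=' only: left is the HoF fqn, right is the handler fqn.
        pair.split("=", 2) match {
          case Array(fqn, handler) if fqn.trim.nonEmpty && handler.trim.nonEmpty =>
            fqn.trim -> handler.trim
          case _ =>
            throw new IllegalArgumentException(s"Expected fqn=handler but got '$pair'")
        }
      }
      .toMap
}
```

Parsing the disabling value shown above would yield a single entry mapping org.apache.spark.sql.qualityFunctions.FunN to org.apache.spark.sql.qualityFunctions.DoCodegenFallbackHandler.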
object LambdaFunctions
object MapTransform extends Serializable
object SafeUTF8
object SeqArgs
