package org.apache.spark.sql.qualityFunctions

Type Members

  1. trait Digest extends AnyRef

    Basic digest implementation for Array[Long] based hashes

  2. trait DigestFactory extends Serializable

    Factory to get a new or reset digest for each row

  3. class DoCodegenFallbackHandler extends LambdaCompilationHandler

    Defaults to calling codeGen, which, depending on the implementation, is either the original compilation approach or CodegenFallback. It evaluates the entire expr tree and returns all NamedLambdaVariables, as they will not be using the same compilation approach.

    This is the default for known OSS implementations and should also be used if compilation will not take place within the same class.

  4. trait FunDoGenCode extends Expression with CodegenFallback

    Generates code for any FunX, including nested ones; the normal doGenCode defaults to CodegenFallback.

  5. case class FunForward(children: Seq[Expression]) extends Expression with CodegenFallback with Product with Serializable

    Forwards calls to the function arguments via setters. This is only evaluated in aggExpr; all other usages are removed during lambda creation.

    This removal may also be forced in aggExpr at a later stage.

  6. case class FunN(arguments: Seq[Expression], function: Expression, name: Option[String] = None, processed: Boolean = false, attemptCodeGen: Boolean = false) extends Expression with HigherOrderFunction with CodegenFallback with SeqArgs with FunDoGenCode with Product with Serializable

    Lambda function with multiple args, typically created with placeholder AtomicRefExpression args.

    arguments

    Evaluated to provide input to the function lambda

    function

    the actual lambda function

    name

    the lambda name when available

  7. abstract class HashLongsExpression extends Expression with CodegenFallback

    A function that calculates a hash value for a group of expressions. Note that the seed argument is not exposed to users and should only be set inside Spark SQL.

    The hash value for an expression depends on its type and seed:

    • null: seed
    • boolean: turn boolean into int, 1 for true, 0 for false, and then use murmur3 to hash this int with seed.
    • byte, short, int: use murmur3 to hash the input as int with seed.
    • long: use murmur3 to hash the long input with seed.
    • float: turn it into int: java.lang.Float.floatToIntBits(input), and hash it.
    • double: turn it into long: java.lang.Double.doubleToLongBits(input), and hash it.
    • decimal: if it's a small decimal, i.e. precision <= 18, turn it into long and hash it. Else, turn it into bytes and hash it.
    • calendar interval: hash microseconds first, and use the result as seed to hash months.
    • interval day to second: it stores a long value of microseconds; use murmur3 to hash the long input with seed.
    • interval year to month: it stores an int value of months; use murmur3 to hash the int input with seed.
    • binary: use murmur3 to hash the bytes with seed.
    • string: get the bytes of string and hash it.
    • array: the result starts as seed; each element is then hashed recursively using the current result as the seed, and the element's hash value becomes the new result.
    • struct: the result starts as seed; each field is then hashed recursively using the current result as the seed, and the field's hash value becomes the new result.

    Finally, the hash values for each expression are aggregated in the same way as the fields of a struct.
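The seed chaining described in the rules above can be sketched in plain Scala. This is a minimal illustration only: it uses scala.util.hashing.MurmurHash3 from the standard library as a stand-in for Spark's murmur3 implementation, covers only some of the listed types, and produces 32-bit values. The object and method names (SeedChainedHashSketch, hashInt, hashLong, hashValue) are hypothetical and are not part of qualityFunctions, so the resulting hash values will not match those of HashLongsExpression.

```scala
import java.nio.charset.StandardCharsets
import scala.util.hashing.MurmurHash3

object SeedChainedHashSketch {
  // Stand-in for "use murmur3 to hash this int with seed" (NOT Spark's implementation).
  def hashInt(i: Int, seed: Int): Int =
    MurmurHash3.finalizeHash(MurmurHash3.mix(seed, i), 4)

  // Stand-in for hashing a long: mix the low then the high 32 bits.
  def hashLong(l: Long, seed: Int): Int = {
    val low = MurmurHash3.mix(seed, l.toInt)
    MurmurHash3.finalizeHash(MurmurHash3.mix(low, (l >>> 32).toInt), 8)
  }

  // Dispatch on the runtime type, following the per-type rules listed above.
  def hashValue(value: Any, seed: Int): Int = value match {
    case null               => seed
    case b: Boolean         => hashInt(if (b) 1 else 0, seed)
    case b: Byte            => hashInt(b.toInt, seed)
    case s: Short           => hashInt(s.toInt, seed)
    case i: Int             => hashInt(i, seed)
    case l: Long            => hashLong(l, seed)
    case f: Float           => hashInt(java.lang.Float.floatToIntBits(f), seed)
    case d: Double          => hashLong(java.lang.Double.doubleToLongBits(d), seed)
    case bytes: Array[Byte] => MurmurHash3.bytesHash(bytes, seed)
    case s: String          => MurmurHash3.bytesHash(s.getBytes(StandardCharsets.UTF_8), seed)
    // array / struct: start with seed, then feed each result in as the next seed
    case xs: Seq[_]         => xs.foldLeft(seed)((result, x) => hashValue(x, result))
    case other =>
      throw new IllegalArgumentException(s"Unsupported type in sketch: ${other.getClass}")
  }
}
```

Hashing a whole row is then the same fold as for a struct: SeedChainedHashSketch.hashValue(Seq(1, "a", null), 42) hashes 1 with seed 42, hashes "a" with that result as seed, and leaves the result unchanged for the null.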

  8. abstract class InterpretedHashLongsFunction extends AnyRef

    Base class for interpreted hash functions.

  9. case class MapMerge(children: Seq[Expression], addF: (DataType) ⇒ Option[(Expression, Expression) ⇒ Expression]) extends Expression with CodegenFallback with Product with Serializable

    Merges a sequence of maps, using a monoidal add on their values

    children

    seq of maps of type x to y; they must all have the same types

    addF

    function to derive the add expr for monoidal add on values
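As an illustration of a monoidal merge over maps, the sketch below works on plain Scala collections rather than Spark expressions. The helper mergeMaps is hypothetical, and the behaviour shown (values of keys present in several maps are added, keys present in only one map are kept as-is) is an assumption about what such a merge means, not a statement about the MapMerge implementation.

```scala
object MapMergeSketch {
  // Hypothetical helper, not part of qualityFunctions: merges plain Scala maps,
  // combining the values of keys that appear in more than one map with `add`.
  def mergeMaps[K, V](maps: Seq[Map[K, V]])(add: (V, V) => V): Map[K, V] =
    maps.foldLeft(Map.empty[K, V]) { (acc, m) =>
      m.foldLeft(acc) { case (merged, (k, v)) =>
        // add to the existing value if the key was already seen, else keep v
        merged.updated(k, merged.get(k).map(add(_, v)).getOrElse(v))
      }
    }
}
```

For example, MapMergeSketch.mergeMaps(Seq(Map("a" -> 1L), Map("a" -> 10L, "c" -> 3L)))(_ + _) combines the two "a" entries and carries "c" over unchanged.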

  10. case class MapTransform(argument: Expression, key: Expression, function: Expression, zeroF: (DataType) ⇒ Option[Any]) extends Expression with HigherOrderFunction with CodegenFallback with SeqArgs with Product with Serializable

    Transforms a map

    argument

    map of type x to y

    key

    expr for key

    function

    value to value transformation for that key entry
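The parameters above suggest a transformation of a single entry, which can be sketched on a plain Scala Map. The helper transformAt is hypothetical. In particular, the role of zeroF is not documented here, so the zero parameter below (a starting value used when the key is absent) is purely an assumption made for illustration.

```scala
object MapTransformSketch {
  // Hypothetical helper, not part of qualityFunctions: applies `f` to the value
  // stored under `key`. ASSUMPTION: when the key is missing, `zero` (if defined)
  // is used as the starting value; otherwise the map is returned unchanged.
  def transformAt[K, V](map: Map[K, V], key: K, zero: Option[V])(f: V => V): Map[K, V] =
    map.get(key).orElse(zero) match {
      case Some(v) => map.updated(key, f(v))
      case None    => map
    }
}
```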

  11. case class NamedLambdaVariableCodeGen(name: String, dataType: DataType, nullable: Boolean, exprId: ExprId, valueRef: String) extends LeafExpression with NamedExpression with Product with Serializable

    Replaces NamedLambdaVariables for simple inlined codegen.

  12. case class PlaceHolderExpression(dataType: DataType, nullable: Boolean = true) extends LeafExpression with Unevaluable with Product with Serializable

    Only used with Lambda placeholders, defaults to allowing nullable values

  13. trait RefCodeGen extends AnyRef
  14. case class RefExpression(dataType: DataType, nullable: Boolean = true, index: Int = -1) extends LeafExpression with RefCodeGen with Product with Serializable

    Getter, trimmed version of NamedLambdaVariable as it should never be resolved

  15. case class RefExpressionLazyType(dataTypeF: () ⇒ DataType, nullable: Boolean) extends LeafExpression with RefCodeGen with Product with Serializable

    Getter, trimmed version of NamedLambdaVariable as it should never be resolved

  16. case class RefSetterExpression(children: Seq[Expression]) extends Expression with CodegenFallback with Product with Serializable

    Wraps other expressions and stores the result in a RefExpression

  17. case class RunAllReturnLast(children: Seq[Expression]) extends Expression with CodegenFallback with Product with Serializable

    Runs all of the children and returns the eval result of the last one, which allows stitching together lambdas with aggregates
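How a getter, a setter and a run-all expression fit together can be shown with a toy model. The types below (Expr, Literal, Ref, RefSetter, Plus, RunAll) are hypothetical stand-ins written for this sketch; they are not Spark's Expression API, and how the real RefSetterExpression locates its target RefExpression is not documented here.

```scala
object RefSketch {
  // Toy stand-in for an expression; Spark's Expression is far richer.
  trait Expr { def eval(): Any }

  final case class Literal(value: Any) extends Expr {
    def eval(): Any = value
  }

  // Getter: returns whatever was last stored (compare RefExpression).
  final class Ref extends Expr {
    var value: Any = null
    def eval(): Any = value
  }

  // Setter: evaluates the wrapped expression and stores the result in the ref
  // (compare RefSetterExpression).
  final case class RefSetter(child: Expr, ref: Ref) extends Expr {
    def eval(): Any = { val v = child.eval(); ref.value = v; v }
  }

  final case class Plus(left: Expr, right: Expr) extends Expr {
    def eval(): Any = left.eval().asInstanceOf[Int] + right.eval().asInstanceOf[Int]
  }

  // Evaluates every child in order and returns the last result
  // (compare RunAllReturnLast).
  final case class RunAll(children: Seq[Expr]) extends Expr {
    def eval(): Any = children.foldLeft(null: Any)((_, child) => child.eval())
  }
}
```

With these pieces, RunAll(Seq(RefSetter(Literal(20), ref), Plus(ref, Literal(22)))) first stores 20 in ref and then evaluates the addition that reads it back.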

  18. trait SeqArgs extends AnyRef

Value Members

  1. object FunCall
  2. object LambdaCompilationUtils

    Functionality related to LambdaCompilation. Seemingly all HigherOrderFunctions use a lazy val match to extract the NamedLambdaVariables from the Spark LambdaFunction after bind has been called. By the time doGenCode is called, eval _could_ already have been called and the lazy val evaluated, so simply rewriting the tree may not fully work. Additionally, the type for NamedLambdaVariable is bound in the lazy vals, which means _ANY_ HigherOrderFunction may not tolerate swapping out NamedLambdaVariables for another NamedExpression.

    To add to the fun, open source Spark HoFs all use CodegenFallback, as does NamedLambdaVariable, so it is possible to swap out some of these implementations if an array_transform is nested in a Fun1 or Fun2. Similarly, Fun1s can call Fun2, so the assumptions for each Fun1/FunN doGenCode are:

    1. Use the processLambda function to evaluate the function.
    2. compilationHandlers uses the quality.lambdaHandlers environment variable to load a comma-separated list of fqn=handler pairs.
    3. The handler for each fully qualified class name pair (e.g. org.apache.spark.sql.catalyst.expressions.ZipWith=handler.fqn) is loaded.
    4. processLambda then evaluates the expression tree and calls the handler for each matching HoF class name.
    5. Handlers perform the custom doGenCode for that expression rather than the default OSS CodegenFallback.
    6. Handlers return the ExprCode AND a list of NamedLambdaVariables which must have .value.set called upon them (i.e. we can't optimise them).
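The comma-separated fqn=handler format from the steps above can be parsed along the following lines. This is a sketch of the format only: parseHandlers is a hypothetical helper, it does not load or instantiate any handler classes, and it is not claimed to be how compilationHandlers is implemented.

```scala
object LambdaHandlersSketch {
  // Hypothetical helper: turns "fqn1=handler1,fqn2=handler2" into a Map from
  // the HoF class name to the handler class name. No classes are loaded.
  def parseHandlers(spec: String): Map[String, String] =
    spec.split(',').iterator
      .map(_.trim)
      .filter(_.nonEmpty)
      .map { pair =>
        // split on the first '=' only, so each pair has exactly two parts
        pair.split("=", 2) match {
          case Array(fqn, handler) if fqn.trim.nonEmpty && handler.trim.nonEmpty =>
            fqn.trim -> handler.trim
          case _ =>
            throw new IllegalArgumentException(s"Expected fqn=handler but got '$pair'")
        }
      }
      .toMap
}
```

Looking up a HoF's class name in the resulting map then yields the handler class name to use for it, if one was configured.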

    NB: The fqn will also be used to check for named lambdas used through registerLambdaFunctions.

    https://github.com/apache/spark/pull/21954 introduced the lambda variable with AtomicReference, along with its inherent performance hit, and, due to the difficulty of threading the holder through the expression chain, did not have a compilation approach. Once the holder is threaded and bind has been called, the variable id is stable, as is the AtomicReference, so it can be swapped out for a simple variable in the same object.

    quality.lambdaHandlers will override the default for a given platform on an fqn basis, so you only need to "add" or "replace" the HoFs that cause issues (for example TransformValues), not the entire list of OSS HigherOrderFunctions. Note that some versions of Databricks provide compilation of their HoFs that may not be compatible in approach.

    Disable this approach by using quality.lambdaHandlers to map FunN to the default DoCodegenFallbackHandler:

    quality.lambdaHandlers=org.apache.spark.sql.qualityFunctions.FunN=org.apache.spark.sql.qualityFunctions.DoCodegenFallbackHandler

  3. object LambdaFunctions
  4. object MapTransform extends Serializable
  5. object SafeUTF8
  6. object SeqArgs
