Apache Calcite 文档翻译 - 代数

代数

关系代数是Calcite的核心。每个查询都被表示为一棵关系运算符的树。你可以将一条SQL语句翻译为关系代数，也可以直接建立树状结构。

规则器规则使用保留语义的数学特性来转换表达树。例如，如果一个过滤操作没有引入其他输入的列，那么可以将一个过滤器下推至连接之前。

Calcite通过对关系表达式进行反复应用规划器规则来优化查询。一个成本模型指导了优化的整个过程，规划器生成一个替代的表达式，语义与之前表达式相同，但具有更低的成本。

规划过程是可扩展的。你可以添加自己的关系运算符、规划器规则、成本模型和统计数据。

代数构建

建立关系表达式的最简单方法时使用代数构建器RelBuilder，下面是一个示例：

表扫描

final FrameworkConfig config;
final RelBuilder builder = RelBuilder.create(config);
final RelNode node = builder
  .scan("EMP")
  .build();
System.out.println(RelOptUtil.toString(node));

（你可以从这个文件RelBUilderExample.java中获取更多的例子）

以上代码输出下列内容：

1	LogicalTableScan(table=[[scott, EMP]])

它相当于对EMP表进行了全表扫描，相当于这句SQL：

1 2	SELECT * FROM scott.EMP;

增加项目

现在，让我们添加一个项目，相当于这句SQL：

1 2	SELECT ename, deptno FROM scott.EMP;

我们仅仅只需要在调用build方法之前调用project方法即可：

final RelNode node = builder
  .scan("EMP")
  .project(builder.field("DEPTNO"), builder.field("ENAME"))
  .build();
System.out.println(RelOptUtil.toString(node));

代码输出以下内容：

1 2	LogicalProject(DEPTNO=[$7], ENAME=[$1]) LogicalTableScan(table=[[scott, EMP]])

两次调用builder.field方法创建了简单的表达式，从输入的通过调用scan方法产生的关系表达式（TableScan）中返回字段。

Calcite将字段按照序号进行转换，$7和$1

增加过滤和聚合

一个查询包含了过滤和聚合：

final RelNode node = builder
  .scan("EMP")
  .aggregate(builder.groupKey("DEPTNO"),
      builder.count(false, "C"),
      builder.sum(false, "S", builder.field("SAL")))
  .filter(
      builder.call(SqlStdOperatorTable.GREATER_THAN,
          builder.field("C"),
          builder.literal(10)))
  .build();
System.out.println(RelOptUtil.toString(node));

相等于下列SQL语句：

SELECT deptno, count(*) AS c, sum(sal) AS s
FROM emp
GROUP BY deptno
HAVING count(*) > 10

同时代码输出以下内容：

1
2
3

LogicalFilter(condition=[>($1, 10)])
  LogicalAggregate(group=[{7}], C=[COUNT()], S=[SUM($5)])
    LogicalTableScan(table=[[scott, EMP]])

入栈和出栈

构建器使用栈来存储一个步骤所产生的关系表达式，并将其作为输入传递给下一个步骤，这使得产生关系表达式的方法可以产生一个构造器。

大多数时候，你将使用的唯一栈方法是build()，以获得最后的关系表达式，即树的根部。

有时栈会嵌套的很深，让人感到困惑。为了让事情变得更加简单一些，你可以从栈中移除表达式。例如，在这里我们建立了一个灌木状的连接：

.
               join
             /      \
        join          join
      /      \      /      \
CUSTOMERS ORDERS LINE_ITEMS PRODUCTS

我们分三个步骤来构建它。将中间结果存储在left，right两个变量中，当创建最终的Join时，使用push()方法将它们放回栈中：

final RelNode left = builder
  .scan("CUSTOMERS")
  .scan("ORDERS")
  .join(JoinRelType.INNER, "ORDER_ID")
  .build();

final RelNode right = builder
  .scan("LINE_ITEMS")
  .scan("PRODUCTS")
  .join(JoinRelType.INNER, "PRODUCT_ID")
  .build();

final RelNode result = builder
  .push(left)
  .push(right)
  .join(JoinRelType.INNER, "ORDER_ID")
  .build();

切换公约

默认的Relbuilder创建的是没有公约的逻辑RelNode，但你可以通过adoptConvention()切换到使用一个不同的约定：

final RelNode result = builder
  .push(input)
  .adoptConvention(EnumerableConvention.INSTANCE)
  .sort(toCollation)
  .build();

在这种情况下，我们在输入的RelNode之上创建了一个EnumerableSort。

字段名和序号

你可以通过名称或者序号来引用一个字段。

序号是从0开始。每个运算符都保证其输出字段的出现顺序。例如，Project返回由每个标量表达式产生的字段。

操作符的字段名保证是唯一的，但有时这意味着名字并不完全是你期望的那样。例如，当你将EMP连接到DEPT时，其中一个输出字段被称为DEPTNO，而另一个字段被称为DEPTNO_1。

一些关系表达式方法让你对字段名称有更多的控制：

project让你用alias(expr, fieldName)来包装表达式。它删除了包装器但是保留了建议的名称（只要它是唯一的）
values(String[] fieldNames, Object... values) 接受一个字段名的数组。如果数组中的任何元素为空，构建器将生成一个唯一的名称。

如果一个表达式使用了一个输入字段，或者转换了一个输入字段，那么它将使用该输入字段的名称。

一旦唯一的字段名被分配，这些名字是不可更改的。如果你有一个特定的RelNode实例，你可以相信字段名不会改变。事实上，整个关系表达式不可变的。

但是如果一个关系表达式已经通过了几个重写规则（详见RelOptRule），那么产生的表达式的字段名可能与原来的不太一样。在这一点上，最好使用序号来引用字段。

当你建立一个接收多个输入的关系表达式时，你需要建立考虑到这一点的字段引用。这在构建连接条件的时候常常发生。

假设你正在建立一个关于EMP的连接，EMP有8个字段[EMNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM, DEPTNO]，而DEPT有3个字段[DEPTNO, DNAME, LOC]。在内部，Calcite将这些字段表示为一个有11个字段的组合输入行：左边输入的第一个字段是字段#0（基于0，记住），右边输入的第一个字段是字段#8

但是通过 builder API，你可以指定哪个字段的输入。要引用 “SAL”，内部字段#5，可以写 builder.field(2, 0, “SAL”)，builder.field(2, “EMP”, “SAL”)，或者 builder.field(2, 0, 5)。这意味着 “两个输入中的输入#0的字段#5”。(为什么它需要知道有两个输入？因为它们被存储在堆栈中；输入 #1 在堆栈的顶部，而输入 #0 在它的下面。如果我们不告诉构建器有两个输入，它就不知道要到多深的地方去找0号输入）。

同样，要引用 “DNAME”，内部字段 #9 (8 + 1)，请写 builder.field(2, 1, “DNAME”), builder.field(2, “DEPT”, “DNAME”), 或者 builder.field(2, 1, 1)。

递归查询

警告。目前的API是实验性的，如有变化，恕不另行通知。一个SQL递归查询，例如这个产生序列1, 2, 3, …10的查询：

WITH RECURSIVE aux(i) AS (
  VALUES (1)
  UNION ALL
  SELECT i+1 FROM aux WHERE i < 10
)
SELECT * FROM aux

可以通过对TransientTable和RepeatUnion的扫描来生成：

final RelNode node = builder
  .values(new String[] { "i" }, 1)
  .transientScan("aux")
  .filter(
      builder.call(
          SqlStdOperatorTable.LESS_THAN,
          builder.field(0),
          builder.literal(10)))
  .project(
      builder.call(
          SqlStdOperatorTable.PLUS,
          builder.field(0),
          builder.literal(1)))
  .repeatUnion("aux", true)
  .build();
System.out.println(RelOptUtil.toString(node));

以上代码输出以下内容：

LogicalRepeatUnion(all=[true])
  LogicalTableSpool(readType=[LAZY], writeType=[LAZY], tableName=[aux])
    LogicalValues(tuples=[[{ 1 }]])
  LogicalTableSpool(readType=[LAZY], writeType=[LAZY], tableName=[aux])
    LogicalProject($f0=[+($0, 1)])
      LogicalFilter(condition=[<($0, 10)])
        LogicalTableScan(table=[[aux]])

API汇总

关系操作符

以下方法创建一个关系表达式（RelNode），将其推入栈，并返回RelBuilder。

METHOD	DESCRIPTION
`scan(tableName)`	Creates a TableScan.
`functionScan(operator, n, expr...)` `functionScan(operator, n, exprList)`	Creates a TableFunctionScan of the `n` most recent relational expressions.
`transientScan(tableName [, rowType])`	Creates a TableScan on a TransientTable with the given type (if not specified, the most recent relational expression’s type will be used).
`values(fieldNames, value...)` `values(rowType, tupleList)`	Creates a Values.
`filter([variablesSet, ] exprList)` `filter([variablesSet, ] expr...)`	Creates a Filter over the AND of the given predicates; if `variablesSet` is specified, the predicates may reference those variables.
`project(expr...)` `project(exprList [, fieldNames])`	Creates a Project. To override the default name, wrap expressions using `alias`, or specify the `fieldNames` argument.
`projectPlus(expr...)` `projectPlus(exprList)`	Variant of `project` that keeps original fields and appends the given expressions.
`projectExcept(expr...)` `projectExcept(exprList)`	Variant of `project` that keeps original fields and removes the given expressions.
`permute(mapping)`	Creates a Project that permutes the fields using `mapping`.
`convert(rowType [, rename])`	Creates a Project that converts the fields to the given types, optionally also renaming them.
`aggregate(groupKey, aggCall...)` `aggregate(groupKey, aggCallList)`	Creates an Aggregate.
`distinct()`	Creates an Aggregate that eliminates duplicate records.
`pivot(groupKey, aggCalls, axes, values)`	Adds a pivot operation, implemented by generating an Aggregate with a column for each combination of measures and values
`unpivot(includeNulls, measureNames, axisNames, axisMap)`	Adds an unpivot operation, implemented by generating a Join to a Values that converts each row to several rows
`sort(fieldOrdinal...)` `sort(expr...)` `sort(exprList)`	Creates a Sort. In the first form, field ordinals are 0-based, and a negative ordinal indicates descending; for example, -2 means field 1 descending. In the other forms, you can wrap expressions in `as`, `nullsFirst` or `nullsLast`.
`sortLimit(offset, fetch, expr...)` `sortLimit(offset, fetch, exprList)`	Creates a Sort with offset and limit.
`limit(offset, fetch)`	Creates a Sort that does not sort, only applies with offset and limit.
`exchange(distribution)`	Creates an Exchange.
`sortExchange(distribution, collation)`	Creates a SortExchange.
`correlate(joinType, correlationId, requiredField...)` `correlate(joinType, correlationId, requiredFieldList)`	Creates a Correlate of the two most recent relational expressions, with a variable name and required field expressions for the left relation.
`join(joinType, expr...)` `join(joinType, exprList)` `join(joinType, fieldName...)`	Creates a Join of the two most recent relational expressions. The first form joins on a boolean expression (multiple conditions are combined using AND). The last form joins on named fields; each side must have a field of each name.
`semiJoin(expr)`	Creates a Join with SEMI join type of the two most recent relational expressions.
`antiJoin(expr)`	Creates a Join with ANTI join type of the two most recent relational expressions.
`union(all [, n])`	Creates a Union of the `n` (default two) most recent relational expressions.
`intersect(all [, n])`	Creates an Intersect of the `n` (default two) most recent relational expressions.
`minus(all)`	Creates a Minus of the two most recent relational expressions.
`repeatUnion(tableName, all [, n])`	Creates a RepeatUnion associated to a TransientTable of the two most recent relational expressions, with `n` maximum number of iterations (default -1, i.e. no limit).
`snapshot(period)`	Creates a Snapshot of the given snapshot period.
`match(pattern, strictStart,` `strictEnd, patterns, measures,` `after, subsets, allRows,` `partitionKeys, orderKeys,` `interval)`	Creates a Match.

参数类型详解：

expr、interval，RexNode
expr..., requiredField... Array of RexNode
exprList, measureList, partitionKeys, orderKeys, requiredFieldList Iterable of RexNode
fieldOrdinal Ordinal of a field within its row (starting from 0)
fieldName Name of a field, unique within its row
fieldName... Array of String
fieldNames Iterable of String
rowType RelDataType
groupKey RelBuilder.GroupKey
aggCall... Array of RelBuilder.AggCall
aggCallList Iterable of RelBuilder.AggCall
value... Array of Object
value Object
tupleList Iterable of List of RexLiteral
all, distinct, strictStart, strictEnd, allRows boolean
alias String
correlationId CorrelationId
variablesSet Iterable of CorrelationId
varHolder Holder of RexCorrelVariable
patterns Map whose key is String, value is RexNode
subsets Map whose key is String, value is a sorted set of String
distribution RelDistribution
collation RelCollation
operator SqlOperator
joinType JoinRelType

构造器的方法包含各种优化，包括：

如果被要求按顺序映射所有列，project则返回输入值
filter会扁平化简化筛选操作（所以一个and和or操作可能有多个操作数），例如（将x = 1 AND TRUE转换为x = 1）
如果你先sort再limit，效果和你使用sortLimit一样

有一些注释方法可以为栈上的顶级关系表达式添加注释信息：

METHOD	DESCRIPTION
`as(alias)`	Assigns a table alias to the top relational expression on the stack
`variable(varHolder)`	Creates a correlation variable referencing the top relational expression

栈方法

METHOD	DESCRIPTION
`build()`	Pops the most recently created relational expression off the stack
`push(rel)`	Pushes a relational expression onto the stack. Relational methods such as `scan`, above, call this method, but user code generally does not
`pushAll(collection)`	Pushes a collection of relational expressions onto the stack
`peek()`	Returns the relational expression most recently put onto the stack, but does not remove it

标量表达方法

以下方法返回一个标量表达式(RexNode)。

它们中的许多都使用栈中的内容。例如，field(“DEPTNO”)返回一个对刚刚添加到堆栈中的关系表达式的"DEPTNO"字段的引用。

METHOD	DESCRIPTION
`literal(value)`	Constant
`field(fieldName)`	Reference, by name, to a field of the top-most relational expression
`field(fieldOrdinal)`	Reference, by ordinal, to a field of the top-most relational expression
`field(inputCount, inputOrdinal, fieldName)`	Reference, by name, to a field of the (`inputCount` - `inputOrdinal`)th relational expression
`field(inputCount, inputOrdinal, fieldOrdinal)`	Reference, by ordinal, to a field of the (`inputCount` - `inputOrdinal`)th relational expression
`field(inputCount, alias, fieldName)`	Reference, by table alias and field name, to a field at most `inputCount - 1` elements from the top of the stack
`field(alias, fieldName)`	Reference, by table alias and field name, to a field of the top-most relational expressions
`field(expr, fieldName)`	Reference, by name, to a field of a record-valued expression
`field(expr, fieldOrdinal)`	Reference, by ordinal, to a field of a record-valued expression
`fields(fieldOrdinalList)`	List of expressions referencing input fields by ordinal
`fields(mapping)`	List of expressions referencing input fields by a given mapping
`fields(collation)`	List of expressions, `exprList`, such that `sort(exprList)` would replicate collation
`call(op, expr...)` `call(op, exprList)`	Call to a function or operator
`and(expr...)` `and(exprList)`	Logical AND. Flattens nested ANDs, and optimizes cases involving TRUE and FALSE.
`or(expr...)` `or(exprList)`	Logical OR. Flattens nested ORs, and optimizes cases involving TRUE and FALSE.
`not(expr)`	Logical NOT
`equals(expr, expr)`	Equals
`isNull(expr)`	Checks whether an expression is null
`isNotNull(expr)`	Checks whether an expression is not null
`alias(expr, fieldName)`	Renames an expression (only valid as an argument to `project`)
`cast(expr, typeName)` `cast(expr, typeName, precision)` `cast(expr, typeName, precision, scale)`	Converts an expression to a given type
`desc(expr)`	Changes sort direction to descending (only valid as an argument to `sort` or `sortLimit`)
`nullsFirst(expr)`	Changes sort order to nulls first (only valid as an argument to `sort` or `sortLimit`)
`nullsLast(expr)`	Changes sort order to nulls last (only valid as an argument to `sort` or `sortLimit`)
`cursor(n, input)`	Reference to `input`th (0-based) relational input of a `TableFunctionScan` with `n` inputs (see `functionScan`)

子查询方法

以下方法将子查询转换为标量值（在in、exists、some、all、unique的情况下是BOOLEAN；对于scalarQuery是任何标量类型。对于arrayQuery是ARRAY，对于mapQuery是MAP，对于multisetQuery是MULTISET）

就如下所示，relFn是一个函数，它接收一个RelBuilder参数并返回一个RelNode。你通常把它作为一个lambda表达式来实现；该方法用一个具有正确上下文的RelBuilder来调用你的代码，而你的代码则返回将成为子查询的RelNode。

METHOD	DESCRIPTION
`all(expr, op, relFn)`	Returns whether expr has a particular relation to all of the values of the sub-query
`arrayQuery(relFn)`	Returns the rows of a sub-query as an `ARRAY`
`exists(relFn)`	Tests whether sub-query is non-empty
`in(expr, relFn)` `in(exprList, relFn)`	Tests whether a value occurs in a sub-query
`mapQuery(relFn)`	Returns the rows of a sub-query as a `MAP`
`multisetQuery(relFn)`	Returns the rows of a sub-query as a `MULTISET`
`scalarQuery(relFn)`	Returns the value of the sole column of the sole row of a sub-query
`some(expr, op, relFn)`	Returns whether expr has a particular relation to one or more of the values of the sub-query
`unique(relFn)`	Returns whether the rows of a sub-query are unique

模式匹配方法

以下方法返回用于匹配的模式。

METHOD	DESCRIPTION
`patternConcat(pattern...)`	Concatenates patterns
`patternAlter(pattern...)`	Alternates patterns
`patternQuantify(pattern, min, max)`	Quantifies a pattern
`patternPermute(pattern...)`	Permutes a pattern
`patternExclude(pattern)`	Excludes a pattern

分组方法

以下方法返回RelBuilder.GroupKey。

METHOD	DESCRIPTION
`groupKey(fieldName...)` `groupKey(fieldOrdinal...)` `groupKey(expr...)` `groupKey(exprList)`	Creates a group key of the given expressions
`groupKey(exprList, exprListList)`	Creates a group key of the given expressions with grouping sets
`groupKey(bitSet [, bitSets])`	Creates a group key of the given input columns, with multiple grouping sets if `bitSets` is specified

聚合方法

以下方法返回RelBuilder.AggCall。

METHOD	DESCRIPTION
`aggregateCall(op, expr...)` `aggregateCall(op, exprList)`	Creates a call to a given aggregate function
`count([ distinct, alias, ] expr...)` `count([ distinct, alias, ] exprList)`	Creates a call to the `COUNT` aggregate function
`countStar(alias)`	Creates a call to the `COUNT(*)` aggregate function
`sum([ distinct, alias, ] expr)`	Creates a call to the `SUM` aggregate function
`min([ alias, ] expr)`	Creates a call to the `MIN` aggregate function
`max([ alias, ] expr)`	Creates a call to the `MAX` aggregate function

要进一步修改AggCall，请调用以下方法：

METHOD	DESCRIPTION
`approximate(approximate)`	Allows approximate value for the aggregate of `approximate`
`as(alias)`	Assigns a column alias to this expression (see SQL `AS`)
`distinct()`	Eliminates duplicate values before aggregating (see SQL `DISTINCT`)
`distinct(distinct)`	Eliminates duplicate values before aggregating if `distinct`
`filter(expr)`	Filters rows before aggregating (see SQL `FILTER (WHERE ...)`)
`sort(expr...)` `sort(exprList)`	Sorts rows before aggregating (see SQL `WITHIN GROUP`)
`unique(expr...)` `unique(exprList)`	Makes rows unique before aggregating (see SQL `WITHIN DISTINCT`)
`over()`	Converts this `AggCall` into a windowed aggregate (see `OverCall` below)

窗口聚合方法

创建一个 RelBuilder.OverCall，它代表了对窗口聚合函数的调用，创建一个聚合调用，然后调用其over()方法，例如count().over()

要进一步修改OverCall，调用以下方法：

METHOD	DESCRIPTION
`rangeUnbounded()`	Creates an unbounded range-based window, `RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`
`rangeFrom(lower)`	Creates a range-based window bounded below, `RANGE BETWEEN lower AND CURRENT ROW`
`rangeTo(upper)`	Creates a range-based window bounded above, `RANGE BETWEEN CURRENT ROW AND upper`
`rangeBetween(lower, upper)`	Creates a range-based window, `RANGE BETWEEN lower AND upper`
`rowsUnbounded()`	Creates an unbounded row-based window, `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`
`rowsFrom(lower)`	Creates a row-based window bounded below, `ROWS BETWEEN lower AND CURRENT ROW`
`rowsTo(upper)`	Creates a row-based window bounded above, `ROWS BETWEEN CURRENT ROW AND upper`
`rowsBetween(lower, upper)`	Creates a rows-based window, `ROWS BETWEEN lower AND upper`
`partitionBy(expr...)` `partitionBy(exprList)`	Partitions the window on the given expressions (see SQL `PARTITION BY`)
`orderBy(expr...)` `sort(exprList)`	Sorts the rows in the window (see SQL `ORDER BY`)
`allowPartial(b)`	Sets whether to allow partial width windows; default true
`nullWhenCountZero(b)`	Sets whether whether the aggregate function should evaluate to null if no rows are in the window; default false
`as(alias)`	Assigns a column alias (see SQL `AS`) and converts this `OverCall` to a `RexNode`
`toRex()`	Converts this `OverCall` to a `RexNode`