Apache Calcite 文档翻译 - 代数
代数
关系代数是Calcite的核心。每个查询都被表示为一棵关系运算符的树。你可以将一条SQL语句翻译为关系代数,也可以直接建立树状结构。
规则器规则使用保留语义的数学特性来转换表达树。例如,如果一个过滤操作没有引入其他输入的列,那么可以将一个过滤器下推至连接之前。
Calcite通过对关系表达式进行反复应用规划器规则来优化查询。一个成本模型指导了优化的整个过程,规划器生成一个替代的表达式,语义与之前表达式相同,但具有更低的成本。
规划过程是可扩展的。你可以添加自己的关系运算符、规划器规则、成本模型和统计数据。
代数构建
建立关系表达式的最简单方法时使用代数构建器RelBuilder,下面是一个示例:
表扫描
1 | final FrameworkConfig config; |
(你可以从这个文件RelBUilderExample.java中获取更多的例子)
以上代码输出下列内容:
1 | LogicalTableScan(table=[[scott, EMP]]) |
它相当于对EMP表进行了全表扫描,相当于这句SQL:
1 | SELECT * |
增加项目
现在,让我们添加一个项目,相当于这句SQL:
1 | SELECT ename, deptno |
我们仅仅只需要在调用build方法之前调用project方法即可:
1 | final RelNode node = builder |
代码输出以下内容:
1 | LogicalProject(DEPTNO=[$7], ENAME=[$1]) |
两次调用builder.field方法创建了简单的表达式,从输入的通过调用scan方法产生的关系表达式(TableScan)中返回字段。
Calcite将字段按照序号进行转换,$7和$1
增加过滤和聚合
一个查询包含了过滤和聚合:
1 | final RelNode node = builder |
相等于下列SQL语句:
1 | SELECT deptno, count(*) AS c, sum(sal) AS s |
同时代码输出以下内容:
1 | LogicalFilter(condition=[>($1, 10)]) |
入栈和出栈
构建器使用栈来存储一个步骤所产生的关系表达式,并将其作为输入传递给下一个步骤,这使得产生关系表达式的方法可以产生一个构造器。
大多数时候,你将使用的唯一栈方法是build(),以获得最后的关系表达式,即树的根部。
有时栈会嵌套的很深,让人感到困惑。为了让事情变得更加简单一些,你可以从栈中移除表达式。例如,在这里我们建立了一个灌木状的连接:
1 | . |
我们分三个步骤来构建它。将中间结果存储在left,right两个变量中,当创建最终的Join时,使用push()方法将它们放回栈中:
1 | final RelNode left = builder |
切换公约
默认的Relbuilder创建的是没有公约的逻辑RelNode,但你可以通过adoptConvention()切换到使用一个不同的约定:
1 | final RelNode result = builder |
在这种情况下,我们在输入的RelNode之上创建了一个EnumerableSort。
字段名和序号
你可以通过名称或者序号来引用一个字段。
序号是从0开始。每个运算符都保证其输出字段的出现顺序。例如,Project返回由每个标量表达式产生的字段。
操作符的字段名保证是唯一的,但有时这意味着名字并不完全是你期望的那样。例如,当你将EMP连接到DEPT时,其中一个输出字段被称为DEPTNO,而另一个字段被称为DEPTNO_1。
一些关系表达式方法让你对字段名称有更多的控制:
project让你用alias(expr, fieldName)来包装表达式。它删除了包装器但是保留了建议的名称(只要它是唯一的)values(String[] fieldNames, Object... values)接受一个字段名的数组。如果数组中的任何元素为空,构建器将生成一个唯一的名称。
如果一个表达式使用了一个输入字段,或者转换了一个输入字段,那么它将使用该输入字段的名称。
一旦唯一的字段名被分配,这些名字是不可更改的。如果你有一个特定的RelNode实例,你可以相信字段名不会改变。事实上,整个关系表达式不可变的。
但是如果一个关系表达式已经通过了几个重写规则(详见RelOptRule),那么产生的表达式的字段名可能与原来的不太一样。在这一点上,最好使用序号来引用字段。
当你建立一个接收多个输入的关系表达式时,你需要建立考虑到这一点的字段引用。这在构建连接条件的时候常常发生。
假设你正在建立一个关于EMP的连接,EMP有8个字段[EMNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM, DEPTNO],而DEPT有3个字段[DEPTNO, DNAME, LOC]。在内部,Calcite将这些字段表示为一个有11个字段的组合输入行:左边输入的第一个字段是字段#0(基于0,记住),右边输入的第一个字段是字段#8
但是通过 builder API,你可以指定哪个字段的输入。要引用 “SAL”,内部字段#5,可以写 builder.field(2, 0, “SAL”),builder.field(2, “EMP”, “SAL”),或者 builder.field(2, 0, 5)。这意味着 “两个输入中的输入#0的字段#5”。(为什么它需要知道有两个输入?因为它们被存储在堆栈中;输入 #1 在堆栈的顶部,而输入 #0 在它的下面。如果我们不告诉构建器有两个输入,它就不知道要到多深的地方去找0号输入)。
同样,要引用 “DNAME”,内部字段 #9 (8 + 1),请写 builder.field(2, 1, “DNAME”), builder.field(2, “DEPT”, “DNAME”), 或者 builder.field(2, 1, 1)。
递归查询
警告。目前的API是实验性的,如有变化,恕不另行通知。一个SQL递归查询,例如这个产生序列1, 2, 3, …10的查询:
1 | WITH RECURSIVE aux(i) AS ( |
可以通过对TransientTable和RepeatUnion的扫描来生成:
1 | final RelNode node = builder |
以上代码输出以下内容:
1 | LogicalRepeatUnion(all=[true]) |
API汇总
关系操作符
以下方法创建一个关系表达式(RelNode),将其推入栈,并返回RelBuilder。
| METHOD | DESCRIPTION |
|---|---|
scan(tableName) | Creates a TableScan. |
functionScan(operator, n, expr...) functionScan(operator, n, exprList) | Creates a TableFunctionScan of the n most recent relational expressions. |
transientScan(tableName [, rowType]) | Creates a TableScan on a TransientTable with the given type (if not specified, the most recent relational expression’s type will be used). |
values(fieldNames, value...) values(rowType, tupleList) | Creates a Values. |
filter([variablesSet, ] exprList) filter([variablesSet, ] expr...) | Creates a Filter over the AND of the given predicates; if variablesSet is specified, the predicates may reference those variables. |
project(expr...) project(exprList [, fieldNames]) | Creates a Project. To override the default name, wrap expressions using alias, or specify the fieldNames argument. |
projectPlus(expr...) projectPlus(exprList) | Variant of project that keeps original fields and appends the given expressions. |
projectExcept(expr...) projectExcept(exprList) | Variant of project that keeps original fields and removes the given expressions. |
permute(mapping) | Creates a Project that permutes the fields using mapping. |
convert(rowType [, rename]) | Creates a Project that converts the fields to the given types, optionally also renaming them. |
aggregate(groupKey, aggCall...) aggregate(groupKey, aggCallList) | Creates an Aggregate. |
distinct() | Creates an Aggregate that eliminates duplicate records. |
pivot(groupKey, aggCalls, axes, values) | Adds a pivot operation, implemented by generating an Aggregate with a column for each combination of measures and values |
unpivot(includeNulls, measureNames, axisNames, axisMap) | Adds an unpivot operation, implemented by generating a Join to a Values that converts each row to several rows |
sort(fieldOrdinal...) sort(expr...) sort(exprList) | Creates a Sort. In the first form, field ordinals are 0-based, and a negative ordinal indicates descending; for example, -2 means field 1 descending. In the other forms, you can wrap expressions in as, nullsFirst or nullsLast. |
sortLimit(offset, fetch, expr...) sortLimit(offset, fetch, exprList) | Creates a Sort with offset and limit. |
limit(offset, fetch) | Creates a Sort that does not sort, only applies with offset and limit. |
exchange(distribution) | Creates an Exchange. |
sortExchange(distribution, collation) | Creates a SortExchange. |
correlate(joinType, correlationId, requiredField...) correlate(joinType, correlationId, requiredFieldList) | Creates a Correlate of the two most recent relational expressions, with a variable name and required field expressions for the left relation. |
join(joinType, expr...) join(joinType, exprList) join(joinType, fieldName...) | Creates a Join of the two most recent relational expressions. The first form joins on a boolean expression (multiple conditions are combined using AND). The last form joins on named fields; each side must have a field of each name. |
semiJoin(expr) | Creates a Join with SEMI join type of the two most recent relational expressions. |
antiJoin(expr) | Creates a Join with ANTI join type of the two most recent relational expressions. |
union(all [, n]) | Creates a Union of the n (default two) most recent relational expressions. |
intersect(all [, n]) | Creates an Intersect of the n (default two) most recent relational expressions. |
minus(all) | Creates a Minus of the two most recent relational expressions. |
repeatUnion(tableName, all [, n]) | Creates a RepeatUnion associated to a TransientTable of the two most recent relational expressions, with n maximum number of iterations (default -1, i.e. no limit). |
snapshot(period) | Creates a Snapshot of the given snapshot period. |
match(pattern, strictStart, strictEnd, patterns, measures, after, subsets, allRows, partitionKeys, orderKeys, interval) | Creates a Match. |
参数类型详解:
expr、interval,RexNodeexpr...,requiredField...Array of RexNodeexprList,measureList,partitionKeys,orderKeys,requiredFieldListIterable of RexNodefieldOrdinalOrdinal of a field within its row (starting from 0)fieldNameName of a field, unique within its rowfieldName...Array of StringfieldNamesIterable of StringrowTypeRelDataTypegroupKeyRelBuilder.GroupKeyaggCall...Array of RelBuilder.AggCallaggCallListIterable of RelBuilder.AggCallvalue...Array of ObjectvalueObjecttupleListIterable of List of RexLiteralall,distinct,strictStart,strictEnd,allRowsbooleanaliasStringcorrelationIdCorrelationIdvariablesSetIterable of CorrelationIdvarHolderHolder of RexCorrelVariablepatternsMap whose key is String, value is RexNodesubsetsMap whose key is String, value is a sorted set of StringdistributionRelDistributioncollationRelCollationoperatorSqlOperatorjoinTypeJoinRelType
构造器的方法包含各种优化,包括:
- 如果被要求按顺序映射所有列,project则返回输入值
- filter会扁平化简化筛选操作(所以一个
and和or操作可能有多个操作数),例如(将x = 1 AND TRUE转换为x = 1) - 如果你先
sort再limit,效果和你使用sortLimit一样
有一些注释方法可以为栈上的顶级关系表达式添加注释信息:
| METHOD | DESCRIPTION |
|---|---|
as(alias) | Assigns a table alias to the top relational expression on the stack |
variable(varHolder) | Creates a correlation variable referencing the top relational expression |
栈方法
| METHOD | DESCRIPTION |
|---|---|
build() | Pops the most recently created relational expression off the stack |
push(rel) | Pushes a relational expression onto the stack. Relational methods such as scan, above, call this method, but user code generally does not |
pushAll(collection) | Pushes a collection of relational expressions onto the stack |
peek() | Returns the relational expression most recently put onto the stack, but does not remove it |
标量表达方法
以下方法返回一个标量表达式(RexNode)。
它们中的许多都使用栈中的内容。例如,field(“DEPTNO”)返回一个对刚刚添加到堆栈中的关系表达式的"DEPTNO"字段的引用。
| METHOD | DESCRIPTION |
|---|---|
literal(value) | Constant |
field(fieldName) | Reference, by name, to a field of the top-most relational expression |
field(fieldOrdinal) | Reference, by ordinal, to a field of the top-most relational expression |
field(inputCount, inputOrdinal, fieldName) | Reference, by name, to a field of the (inputCount - inputOrdinal)th relational expression |
field(inputCount, inputOrdinal, fieldOrdinal) | Reference, by ordinal, to a field of the (inputCount - inputOrdinal)th relational expression |
field(inputCount, alias, fieldName) | Reference, by table alias and field name, to a field at most inputCount - 1 elements from the top of the stack |
field(alias, fieldName) | Reference, by table alias and field name, to a field of the top-most relational expressions |
field(expr, fieldName) | Reference, by name, to a field of a record-valued expression |
field(expr, fieldOrdinal) | Reference, by ordinal, to a field of a record-valued expression |
fields(fieldOrdinalList) | List of expressions referencing input fields by ordinal |
fields(mapping) | List of expressions referencing input fields by a given mapping |
fields(collation) | List of expressions, exprList, such that sort(exprList) would replicate collation |
call(op, expr...) call(op, exprList) | Call to a function or operator |
and(expr...) and(exprList) | Logical AND. Flattens nested ANDs, and optimizes cases involving TRUE and FALSE. |
or(expr...) or(exprList) | Logical OR. Flattens nested ORs, and optimizes cases involving TRUE and FALSE. |
not(expr) | Logical NOT |
equals(expr, expr) | Equals |
isNull(expr) | Checks whether an expression is null |
isNotNull(expr) | Checks whether an expression is not null |
alias(expr, fieldName) | Renames an expression (only valid as an argument to project) |
cast(expr, typeName) cast(expr, typeName, precision) cast(expr, typeName, precision, scale) | Converts an expression to a given type |
desc(expr) | Changes sort direction to descending (only valid as an argument to sort or sortLimit) |
nullsFirst(expr) | Changes sort order to nulls first (only valid as an argument to sort or sortLimit) |
nullsLast(expr) | Changes sort order to nulls last (only valid as an argument to sort or sortLimit) |
cursor(n, input) | Reference to inputth (0-based) relational input of a TableFunctionScan with n inputs (see functionScan) |
子查询方法
以下方法将子查询转换为标量值(在in、exists、some、all、unique的情况下是BOOLEAN;对于scalarQuery是任何标量类型。对于arrayQuery是ARRAY,对于mapQuery是MAP,对于multisetQuery是MULTISET)
就如下所示,relFn是一个函数,它接收一个RelBuilder参数并返回一个RelNode。你通常把它作为一个lambda表达式来实现;该方法用一个具有正确上下文的RelBuilder来调用你的代码,而你的代码则返回将成为子查询的RelNode。
| METHOD | DESCRIPTION |
|---|---|
all(expr, op, relFn) | Returns whether expr has a particular relation to all of the values of the sub-query |
arrayQuery(relFn) | Returns the rows of a sub-query as an ARRAY |
exists(relFn) | Tests whether sub-query is non-empty |
in(expr, relFn) in(exprList, relFn) | Tests whether a value occurs in a sub-query |
mapQuery(relFn) | Returns the rows of a sub-query as a MAP |
multisetQuery(relFn) | Returns the rows of a sub-query as a MULTISET |
scalarQuery(relFn) | Returns the value of the sole column of the sole row of a sub-query |
some(expr, op, relFn) | Returns whether expr has a particular relation to one or more of the values of the sub-query |
unique(relFn) | Returns whether the rows of a sub-query are unique |
模式匹配方法
以下方法返回用于匹配的模式。
| METHOD | DESCRIPTION |
|---|---|
patternConcat(pattern...) | Concatenates patterns |
patternAlter(pattern...) | Alternates patterns |
patternQuantify(pattern, min, max) | Quantifies a pattern |
patternPermute(pattern...) | Permutes a pattern |
patternExclude(pattern) | Excludes a pattern |
分组方法
以下方法返回RelBuilder.GroupKey。
| METHOD | DESCRIPTION |
|---|---|
groupKey(fieldName...) groupKey(fieldOrdinal...) groupKey(expr...) groupKey(exprList) | Creates a group key of the given expressions |
groupKey(exprList, exprListList) | Creates a group key of the given expressions with grouping sets |
groupKey(bitSet [, bitSets]) | Creates a group key of the given input columns, with multiple grouping sets if bitSets is specified |
聚合方法
以下方法返回RelBuilder.AggCall。
| METHOD | DESCRIPTION |
|---|---|
aggregateCall(op, expr...) aggregateCall(op, exprList) | Creates a call to a given aggregate function |
count([ distinct, alias, ] expr...) count([ distinct, alias, ] exprList) | Creates a call to the COUNT aggregate function |
countStar(alias) | Creates a call to the COUNT(*) aggregate function |
sum([ distinct, alias, ] expr) | Creates a call to the SUM aggregate function |
min([ alias, ] expr) | Creates a call to the MIN aggregate function |
max([ alias, ] expr) | Creates a call to the MAX aggregate function |
要进一步修改AggCall,请调用以下方法:
| METHOD | DESCRIPTION |
|---|---|
approximate(approximate) | Allows approximate value for the aggregate of approximate |
as(alias) | Assigns a column alias to this expression (see SQL AS) |
distinct() | Eliminates duplicate values before aggregating (see SQL DISTINCT) |
distinct(distinct) | Eliminates duplicate values before aggregating if distinct |
filter(expr) | Filters rows before aggregating (see SQL FILTER (WHERE ...)) |
sort(expr...) sort(exprList) | Sorts rows before aggregating (see SQL WITHIN GROUP) |
unique(expr...) unique(exprList) | Makes rows unique before aggregating (see SQL WITHIN DISTINCT) |
over() | Converts this AggCall into a windowed aggregate (see OverCall below) |
窗口聚合方法
创建一个 RelBuilder.OverCall,它代表了对窗口聚合函数的调用,创建一个聚合调用,然后调用其over()方法,例如count().over()
要进一步修改OverCall,调用以下方法:
| METHOD | DESCRIPTION |
|---|---|
rangeUnbounded() | Creates an unbounded range-based window, RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
rangeFrom(lower) | Creates a range-based window bounded below, RANGE BETWEEN lower AND CURRENT ROW |
rangeTo(upper) | Creates a range-based window bounded above, RANGE BETWEEN CURRENT ROW AND upper |
rangeBetween(lower, upper) | Creates a range-based window, RANGE BETWEEN lower AND upper |
rowsUnbounded() | Creates an unbounded row-based window, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
rowsFrom(lower) | Creates a row-based window bounded below, ROWS BETWEEN lower AND CURRENT ROW |
rowsTo(upper) | Creates a row-based window bounded above, ROWS BETWEEN CURRENT ROW AND upper |
rowsBetween(lower, upper) | Creates a rows-based window, ROWS BETWEEN lower AND upper |
partitionBy(expr...) partitionBy(exprList) | Partitions the window on the given expressions (see SQL PARTITION BY) |
orderBy(expr...) sort(exprList) | Sorts the rows in the window (see SQL ORDER BY) |
allowPartial(b) | Sets whether to allow partial width windows; default true |
nullWhenCountZero(b) | Sets whether whether the aggregate function should evaluate to null if no rows are in the window; default false |
as(alias) | Assigns a column alias (see SQL AS) and converts this OverCall to a RexNode |
toRex() | Converts this OverCall to a RexNode |










