Apache Calcite 文档翻译 - 代数
代数
关系代数是Calcite的核心。每个查询都被表示为一棵关系运算符的树。你可以将一条SQL语句翻译为关系代数,也可以直接建立树状结构。
规则器规则使用保留语义的数学特性来转换表达树。例如,如果一个过滤操作没有引入其他输入的列,那么可以将一个过滤器下推至连接之前。
Calcite通过对关系表达式进行反复应用规划器规则来优化查询。一个成本模型指导了优化的整个过程,规划器生成一个替代的表达式,语义与之前表达式相同,但具有更低的成本。
规划过程是可扩展的。你可以添加自己的关系运算符、规划器规则、成本模型和统计数据。
代数构建
建立关系表达式的最简单方法时使用代数构建器RelBuilder
,下面是一个示例:
表扫描
1 | final FrameworkConfig config; |
(你可以从这个文件RelBUilderExample.java中获取更多的例子)
以上代码输出下列内容:
1 | LogicalTableScan(table=[[scott, EMP]]) |
它相当于对EMP
表进行了全表扫描,相当于这句SQL:
1 | SELECT * |
增加项目
现在,让我们添加一个项目,相当于这句SQL:
1 | SELECT ename, deptno |
我们仅仅只需要在调用build
方法之前调用project
方法即可:
1 | final RelNode node = builder |
代码输出以下内容:
1 | LogicalProject(DEPTNO=[$7], ENAME=[$1]) |
两次调用builder.field
方法创建了简单的表达式,从输入的通过调用scan
方法产生的关系表达式(TableScan)中返回字段。
Calcite将字段按照序号进行转换,$7
和$1
增加过滤和聚合
一个查询包含了过滤和聚合:
1 | final RelNode node = builder |
相等于下列SQL语句:
1 | SELECT deptno, count(*) AS c, sum(sal) AS s |
同时代码输出以下内容:
1 | LogicalFilter(condition=[>($1, 10)]) |
入栈和出栈
构建器使用栈来存储一个步骤所产生的关系表达式,并将其作为输入传递给下一个步骤,这使得产生关系表达式的方法可以产生一个构造器。
大多数时候,你将使用的唯一栈方法是build()
,以获得最后的关系表达式,即树的根部。
有时栈会嵌套的很深,让人感到困惑。为了让事情变得更加简单一些,你可以从栈中移除表达式。例如,在这里我们建立了一个灌木状的连接:
1 | . |
我们分三个步骤来构建它。将中间结果存储在left
,right
两个变量中,当创建最终的Join
时,使用push()
方法将它们放回栈中:
1 | final RelNode left = builder |
切换公约
默认的Relbuilder
创建的是没有公约的逻辑RelNode
,但你可以通过adoptConvention()
切换到使用一个不同的约定:
1 | final RelNode result = builder |
在这种情况下,我们在输入的RelNode
之上创建了一个EnumerableSort
。
字段名和序号
你可以通过名称或者序号来引用一个字段。
序号是从0开始。每个运算符都保证其输出字段的出现顺序。例如,Project返回由每个标量表达式产生的字段。
操作符的字段名保证是唯一的,但有时这意味着名字并不完全是你期望的那样。例如,当你将EMP连接到DEPT时,其中一个输出字段被称为DEPTNO
,而另一个字段被称为DEPTNO_1
。
一些关系表达式方法让你对字段名称有更多的控制:
project
让你用alias(expr, fieldName)
来包装表达式。它删除了包装器但是保留了建议的名称(只要它是唯一的)values(String[] fieldNames, Object... values)
接受一个字段名的数组。如果数组中的任何元素为空,构建器将生成一个唯一的名称。
如果一个表达式使用了一个输入字段,或者转换了一个输入字段,那么它将使用该输入字段的名称。
一旦唯一的字段名被分配,这些名字是不可更改的。如果你有一个特定的RelNode
实例,你可以相信字段名不会改变。事实上,整个关系表达式不可变的。
但是如果一个关系表达式已经通过了几个重写规则(详见RelOptRule),那么产生的表达式的字段名可能与原来的不太一样。在这一点上,最好使用序号来引用字段。
当你建立一个接收多个输入的关系表达式时,你需要建立考虑到这一点的字段引用。这在构建连接条件的时候常常发生。
假设你正在建立一个关于EMP的连接,EMP有8个字段[EMNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM, DEPTNO],而DEPT有3个字段[DEPTNO, DNAME, LOC]。在内部,Calcite将这些字段表示为一个有11个字段的组合输入行:左边输入的第一个字段是字段#0(基于0,记住),右边输入的第一个字段是字段#8
但是通过 builder API,你可以指定哪个字段的输入。要引用 “SAL”,内部字段#5,可以写 builder.field(2, 0, “SAL”),builder.field(2, “EMP”, “SAL”),或者 builder.field(2, 0, 5)。这意味着 “两个输入中的输入#0的字段#5”。(为什么它需要知道有两个输入?因为它们被存储在堆栈中;输入 #1 在堆栈的顶部,而输入 #0 在它的下面。如果我们不告诉构建器有两个输入,它就不知道要到多深的地方去找0号输入)。
同样,要引用 “DNAME”,内部字段 #9 (8 + 1),请写 builder.field(2, 1, “DNAME”), builder.field(2, “DEPT”, “DNAME”), 或者 builder.field(2, 1, 1)。
递归查询
警告。目前的API是实验性的,如有变化,恕不另行通知。一个SQL递归查询,例如这个产生序列1, 2, 3, …10的查询:
1 | WITH RECURSIVE aux(i) AS ( |
可以通过对TransientTable
和RepeatUnion
的扫描来生成:
1 | final RelNode node = builder |
以上代码输出以下内容:
1 | LogicalRepeatUnion(all=[true]) |
API汇总
关系操作符
以下方法创建一个关系表达式(RelNode),将其推入栈,并返回RelBuilder。
METHOD | DESCRIPTION |
---|---|
scan(tableName) | Creates a TableScan. |
functionScan(operator, n, expr...) functionScan(operator, n, exprList) | Creates a TableFunctionScan of the n most recent relational expressions. |
transientScan(tableName [, rowType]) | Creates a TableScan on a TransientTable with the given type (if not specified, the most recent relational expression’s type will be used). |
values(fieldNames, value...) values(rowType, tupleList) | Creates a Values. |
filter([variablesSet, ] exprList) filter([variablesSet, ] expr...) | Creates a Filter over the AND of the given predicates; if variablesSet is specified, the predicates may reference those variables. |
project(expr...) project(exprList [, fieldNames]) | Creates a Project. To override the default name, wrap expressions using alias , or specify the fieldNames argument. |
projectPlus(expr...) projectPlus(exprList) | Variant of project that keeps original fields and appends the given expressions. |
projectExcept(expr...) projectExcept(exprList) | Variant of project that keeps original fields and removes the given expressions. |
permute(mapping) | Creates a Project that permutes the fields using mapping . |
convert(rowType [, rename]) | Creates a Project that converts the fields to the given types, optionally also renaming them. |
aggregate(groupKey, aggCall...) aggregate(groupKey, aggCallList) | Creates an Aggregate. |
distinct() | Creates an Aggregate that eliminates duplicate records. |
pivot(groupKey, aggCalls, axes, values) | Adds a pivot operation, implemented by generating an Aggregate with a column for each combination of measures and values |
unpivot(includeNulls, measureNames, axisNames, axisMap) | Adds an unpivot operation, implemented by generating a Join to a Values that converts each row to several rows |
sort(fieldOrdinal...) sort(expr...) sort(exprList) | Creates a Sort. In the first form, field ordinals are 0-based, and a negative ordinal indicates descending; for example, -2 means field 1 descending. In the other forms, you can wrap expressions in as , nullsFirst or nullsLast . |
sortLimit(offset, fetch, expr...) sortLimit(offset, fetch, exprList) | Creates a Sort with offset and limit. |
limit(offset, fetch) | Creates a Sort that does not sort, only applies with offset and limit. |
exchange(distribution) | Creates an Exchange. |
sortExchange(distribution, collation) | Creates a SortExchange. |
correlate(joinType, correlationId, requiredField...) correlate(joinType, correlationId, requiredFieldList) | Creates a Correlate of the two most recent relational expressions, with a variable name and required field expressions for the left relation. |
join(joinType, expr...) join(joinType, exprList) join(joinType, fieldName...) | Creates a Join of the two most recent relational expressions. The first form joins on a boolean expression (multiple conditions are combined using AND). The last form joins on named fields; each side must have a field of each name. |
semiJoin(expr) | Creates a Join with SEMI join type of the two most recent relational expressions. |
antiJoin(expr) | Creates a Join with ANTI join type of the two most recent relational expressions. |
union(all [, n]) | Creates a Union of the n (default two) most recent relational expressions. |
intersect(all [, n]) | Creates an Intersect of the n (default two) most recent relational expressions. |
minus(all) | Creates a Minus of the two most recent relational expressions. |
repeatUnion(tableName, all [, n]) | Creates a RepeatUnion associated to a TransientTable of the two most recent relational expressions, with n maximum number of iterations (default -1, i.e. no limit). |
snapshot(period) | Creates a Snapshot of the given snapshot period. |
match(pattern, strictStart, strictEnd, patterns, measures, after, subsets, allRows, partitionKeys, orderKeys, interval) | Creates a Match. |
参数类型详解:
expr
、interval
,RexNodeexpr...
,requiredField...
Array of RexNodeexprList
,measureList
,partitionKeys
,orderKeys
,requiredFieldList
Iterable of RexNodefieldOrdinal
Ordinal of a field within its row (starting from 0)fieldName
Name of a field, unique within its rowfieldName...
Array of StringfieldNames
Iterable of StringrowType
RelDataTypegroupKey
RelBuilder.GroupKeyaggCall...
Array of RelBuilder.AggCallaggCallList
Iterable of RelBuilder.AggCallvalue...
Array of Objectvalue
ObjecttupleList
Iterable of List of RexLiteralall
,distinct
,strictStart
,strictEnd
,allRows
booleanalias
StringcorrelationId
CorrelationIdvariablesSet
Iterable of CorrelationIdvarHolder
Holder of RexCorrelVariablepatterns
Map whose key is String, value is RexNodesubsets
Map whose key is String, value is a sorted set of Stringdistribution
RelDistributioncollation
RelCollationoperator
SqlOperatorjoinType
JoinRelType
构造器的方法包含各种优化,包括:
- 如果被要求按顺序映射所有列,project则返回输入值
- filter会扁平化简化筛选操作(所以一个
and
和or
操作可能有多个操作数),例如(将x = 1 AND TRUE转换为x = 1) - 如果你先
sort
再limit
,效果和你使用sortLimit
一样
有一些注释方法可以为栈上的顶级关系表达式添加注释信息:
METHOD | DESCRIPTION |
---|---|
as(alias) | Assigns a table alias to the top relational expression on the stack |
variable(varHolder) | Creates a correlation variable referencing the top relational expression |
栈方法
METHOD | DESCRIPTION |
---|---|
build() | Pops the most recently created relational expression off the stack |
push(rel) | Pushes a relational expression onto the stack. Relational methods such as scan , above, call this method, but user code generally does not |
pushAll(collection) | Pushes a collection of relational expressions onto the stack |
peek() | Returns the relational expression most recently put onto the stack, but does not remove it |
标量表达方法
以下方法返回一个标量表达式(RexNode)。
它们中的许多都使用栈中的内容。例如,field(“DEPTNO”)返回一个对刚刚添加到堆栈中的关系表达式的"DEPTNO"字段的引用。
METHOD | DESCRIPTION |
---|---|
literal(value) | Constant |
field(fieldName) | Reference, by name, to a field of the top-most relational expression |
field(fieldOrdinal) | Reference, by ordinal, to a field of the top-most relational expression |
field(inputCount, inputOrdinal, fieldName) | Reference, by name, to a field of the (inputCount - inputOrdinal )th relational expression |
field(inputCount, inputOrdinal, fieldOrdinal) | Reference, by ordinal, to a field of the (inputCount - inputOrdinal )th relational expression |
field(inputCount, alias, fieldName) | Reference, by table alias and field name, to a field at most inputCount - 1 elements from the top of the stack |
field(alias, fieldName) | Reference, by table alias and field name, to a field of the top-most relational expressions |
field(expr, fieldName) | Reference, by name, to a field of a record-valued expression |
field(expr, fieldOrdinal) | Reference, by ordinal, to a field of a record-valued expression |
fields(fieldOrdinalList) | List of expressions referencing input fields by ordinal |
fields(mapping) | List of expressions referencing input fields by a given mapping |
fields(collation) | List of expressions, exprList , such that sort(exprList) would replicate collation |
call(op, expr...) call(op, exprList) | Call to a function or operator |
and(expr...) and(exprList) | Logical AND. Flattens nested ANDs, and optimizes cases involving TRUE and FALSE. |
or(expr...) or(exprList) | Logical OR. Flattens nested ORs, and optimizes cases involving TRUE and FALSE. |
not(expr) | Logical NOT |
equals(expr, expr) | Equals |
isNull(expr) | Checks whether an expression is null |
isNotNull(expr) | Checks whether an expression is not null |
alias(expr, fieldName) | Renames an expression (only valid as an argument to project ) |
cast(expr, typeName) cast(expr, typeName, precision) cast(expr, typeName, precision, scale) | Converts an expression to a given type |
desc(expr) | Changes sort direction to descending (only valid as an argument to sort or sortLimit ) |
nullsFirst(expr) | Changes sort order to nulls first (only valid as an argument to sort or sortLimit ) |
nullsLast(expr) | Changes sort order to nulls last (only valid as an argument to sort or sortLimit ) |
cursor(n, input) | Reference to input th (0-based) relational input of a TableFunctionScan with n inputs (see functionScan ) |
子查询方法
以下方法将子查询转换为标量值(在in、exists、some、all、unique的情况下是BOOLEAN;对于scalarQuery是任何标量类型。对于arrayQuery是ARRAY,对于mapQuery是MAP,对于multisetQuery是MULTISET)
就如下所示,relFn
是一个函数,它接收一个RelBuilder
参数并返回一个RelNode
。你通常把它作为一个lambda表达式来实现;该方法用一个具有正确上下文的RelBuilder
来调用你的代码,而你的代码则返回将成为子查询的RelNode
。
METHOD | DESCRIPTION |
---|---|
all(expr, op, relFn) | Returns whether expr has a particular relation to all of the values of the sub-query |
arrayQuery(relFn) | Returns the rows of a sub-query as an ARRAY |
exists(relFn) | Tests whether sub-query is non-empty |
in(expr, relFn) in(exprList, relFn) | Tests whether a value occurs in a sub-query |
mapQuery(relFn) | Returns the rows of a sub-query as a MAP |
multisetQuery(relFn) | Returns the rows of a sub-query as a MULTISET |
scalarQuery(relFn) | Returns the value of the sole column of the sole row of a sub-query |
some(expr, op, relFn) | Returns whether expr has a particular relation to one or more of the values of the sub-query |
unique(relFn) | Returns whether the rows of a sub-query are unique |
模式匹配方法
以下方法返回用于匹配的模式。
METHOD | DESCRIPTION |
---|---|
patternConcat(pattern...) | Concatenates patterns |
patternAlter(pattern...) | Alternates patterns |
patternQuantify(pattern, min, max) | Quantifies a pattern |
patternPermute(pattern...) | Permutes a pattern |
patternExclude(pattern) | Excludes a pattern |
分组方法
以下方法返回RelBuilder.GroupKey。
METHOD | DESCRIPTION |
---|---|
groupKey(fieldName...) groupKey(fieldOrdinal...) groupKey(expr...) groupKey(exprList) | Creates a group key of the given expressions |
groupKey(exprList, exprListList) | Creates a group key of the given expressions with grouping sets |
groupKey(bitSet [, bitSets]) | Creates a group key of the given input columns, with multiple grouping sets if bitSets is specified |
聚合方法
以下方法返回RelBuilder.AggCall。
METHOD | DESCRIPTION |
---|---|
aggregateCall(op, expr...) aggregateCall(op, exprList) | Creates a call to a given aggregate function |
count([ distinct, alias, ] expr...) count([ distinct, alias, ] exprList) | Creates a call to the COUNT aggregate function |
countStar(alias) | Creates a call to the COUNT(*) aggregate function |
sum([ distinct, alias, ] expr) | Creates a call to the SUM aggregate function |
min([ alias, ] expr) | Creates a call to the MIN aggregate function |
max([ alias, ] expr) | Creates a call to the MAX aggregate function |
要进一步修改AggCall
,请调用以下方法:
METHOD | DESCRIPTION |
---|---|
approximate(approximate) | Allows approximate value for the aggregate of approximate |
as(alias) | Assigns a column alias to this expression (see SQL AS ) |
distinct() | Eliminates duplicate values before aggregating (see SQL DISTINCT ) |
distinct(distinct) | Eliminates duplicate values before aggregating if distinct |
filter(expr) | Filters rows before aggregating (see SQL FILTER (WHERE ...) ) |
sort(expr...) sort(exprList) | Sorts rows before aggregating (see SQL WITHIN GROUP ) |
unique(expr...) unique(exprList) | Makes rows unique before aggregating (see SQL WITHIN DISTINCT ) |
over() | Converts this AggCall into a windowed aggregate (see OverCall below) |
窗口聚合方法
创建一个 RelBuilder.OverCall,它代表了对窗口聚合函数的调用,创建一个聚合调用,然后调用其over()
方法,例如count().over()
要进一步修改OverCall
,调用以下方法:
METHOD | DESCRIPTION |
---|---|
rangeUnbounded() | Creates an unbounded range-based window, RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
rangeFrom(lower) | Creates a range-based window bounded below, RANGE BETWEEN lower AND CURRENT ROW |
rangeTo(upper) | Creates a range-based window bounded above, RANGE BETWEEN CURRENT ROW AND upper |
rangeBetween(lower, upper) | Creates a range-based window, RANGE BETWEEN lower AND upper |
rowsUnbounded() | Creates an unbounded row-based window, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
rowsFrom(lower) | Creates a row-based window bounded below, ROWS BETWEEN lower AND CURRENT ROW |
rowsTo(upper) | Creates a row-based window bounded above, ROWS BETWEEN CURRENT ROW AND upper |
rowsBetween(lower, upper) | Creates a rows-based window, ROWS BETWEEN lower AND upper |
partitionBy(expr...) partitionBy(exprList) | Partitions the window on the given expressions (see SQL PARTITION BY ) |
orderBy(expr...) sort(exprList) | Sorts the rows in the window (see SQL ORDER BY ) |
allowPartial(b) | Sets whether to allow partial width windows; default true |
nullWhenCountZero(b) | Sets whether whether the aggregate function should evaluate to null if no rows are in the window; default false |
as(alias) | Assigns a column alias (see SQL AS ) and converts this OverCall to a RexNode |
toRex() | Converts this OverCall to a RexNode |