代数

关系代数是Calcite的核心。每个查询都被表示为一棵关系运算符的树。你可以将一条SQL语句翻译为关系代数,也可以直接建立树状结构。

规则器规则使用保留语义的数学特性来转换表达树。例如,如果一个过滤操作没有引入其他输入的列,那么可以将一个过滤器下推至连接之前。

Calcite通过对关系表达式进行反复应用规划器规则来优化查询。一个成本模型指导了优化的整个过程,规划器生成一个替代的表达式,语义与之前表达式相同,但具有更低的成本。

规划过程是可扩展的。你可以添加自己的关系运算符、规划器规则、成本模型和统计数据。

代数构建

建立关系表达式的最简单方法时使用代数构建器RelBuilder,下面是一个示例:

表扫描

1
2
3
4
5
6
final FrameworkConfig config;
final RelBuilder builder = RelBuilder.create(config);
final RelNode node = builder
.scan("EMP")
.build();
System.out.println(RelOptUtil.toString(node));

(你可以从这个文件RelBUilderExample.java中获取更多的例子)

以上代码输出下列内容:

1
LogicalTableScan(table=[[scott, EMP]])

它相当于对EMP表进行了全表扫描,相当于这句SQL:

1
2
SELECT *
FROM scott.EMP;

增加项目

现在,让我们添加一个项目,相当于这句SQL:

1
2
SELECT ename, deptno
FROM scott.EMP;

我们仅仅只需要在调用build方法之前调用project方法即可:

1
2
3
4
5
final RelNode node = builder
.scan("EMP")
.project(builder.field("DEPTNO"), builder.field("ENAME"))
.build();
System.out.println(RelOptUtil.toString(node));

代码输出以下内容:

1
2
LogicalProject(DEPTNO=[$7], ENAME=[$1])
LogicalTableScan(table=[[scott, EMP]])

两次调用builder.field方法创建了简单的表达式,从输入的通过调用scan方法产生的关系表达式(TableScan)中返回字段。

Calcite将字段按照序号进行转换,$7$1

增加过滤和聚合

一个查询包含了过滤和聚合:

1
2
3
4
5
6
7
8
9
10
11
final RelNode node = builder
.scan("EMP")
.aggregate(builder.groupKey("DEPTNO"),
builder.count(false, "C"),
builder.sum(false, "S", builder.field("SAL")))
.filter(
builder.call(SqlStdOperatorTable.GREATER_THAN,
builder.field("C"),
builder.literal(10)))
.build();
System.out.println(RelOptUtil.toString(node));

相等于下列SQL语句:

1
2
3
4
SELECT deptno, count(*) AS c, sum(sal) AS s
FROM emp
GROUP BY deptno
HAVING count(*) > 10

同时代码输出以下内容:

1
2
3
LogicalFilter(condition=[>($1, 10)])
LogicalAggregate(group=[{7}], C=[COUNT()], S=[SUM($5)])
LogicalTableScan(table=[[scott, EMP]])

入栈和出栈

构建器使用栈来存储一个步骤所产生的关系表达式,并将其作为输入传递给下一个步骤,这使得产生关系表达式的方法可以产生一个构造器。

大多数时候,你将使用的唯一栈方法是build(),以获得最后的关系表达式,即树的根部。

有时栈会嵌套的很深,让人感到困惑。为了让事情变得更加简单一些,你可以从栈中移除表达式。例如,在这里我们建立了一个灌木状的连接:

1
2
3
4
5
6
.
join
/ \
join join
/ \ / \
CUSTOMERS ORDERS LINE_ITEMS PRODUCTS

我们分三个步骤来构建它。将中间结果存储在leftright两个变量中,当创建最终的Join时,使用push()方法将它们放回栈中:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
final RelNode left = builder
.scan("CUSTOMERS")
.scan("ORDERS")
.join(JoinRelType.INNER, "ORDER_ID")
.build();

final RelNode right = builder
.scan("LINE_ITEMS")
.scan("PRODUCTS")
.join(JoinRelType.INNER, "PRODUCT_ID")
.build();

final RelNode result = builder
.push(left)
.push(right)
.join(JoinRelType.INNER, "ORDER_ID")
.build();

切换公约

默认的Relbuilder创建的是没有公约的逻辑RelNode,但你可以通过adoptConvention()切换到使用一个不同的约定:

1
2
3
4
5
final RelNode result = builder
.push(input)
.adoptConvention(EnumerableConvention.INSTANCE)
.sort(toCollation)
.build();

在这种情况下,我们在输入的RelNode之上创建了一个EnumerableSort

字段名和序号

你可以通过名称或者序号来引用一个字段。

序号是从0开始。每个运算符都保证其输出字段的出现顺序。例如,Project返回由每个标量表达式产生的字段。

操作符的字段名保证是唯一的,但有时这意味着名字并不完全是你期望的那样。例如,当你将EMP连接到DEPT时,其中一个输出字段被称为DEPTNO,而另一个字段被称为DEPTNO_1

一些关系表达式方法让你对字段名称有更多的控制:

  • project让你用alias(expr, fieldName)来包装表达式。它删除了包装器但是保留了建议的名称(只要它是唯一的)
  • values(String[] fieldNames, Object... values) 接受一个字段名的数组。如果数组中的任何元素为空,构建器将生成一个唯一的名称。

如果一个表达式使用了一个输入字段,或者转换了一个输入字段,那么它将使用该输入字段的名称。

一旦唯一的字段名被分配,这些名字是不可更改的。如果你有一个特定的RelNode实例,你可以相信字段名不会改变。事实上,整个关系表达式不可变的。

但是如果一个关系表达式已经通过了几个重写规则(详见RelOptRule),那么产生的表达式的字段名可能与原来的不太一样。在这一点上,最好使用序号来引用字段。

当你建立一个接收多个输入的关系表达式时,你需要建立考虑到这一点的字段引用。这在构建连接条件的时候常常发生。

假设你正在建立一个关于EMP的连接,EMP有8个字段[EMNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM, DEPTNO],而DEPT有3个字段[DEPTNO, DNAME, LOC]。在内部,Calcite将这些字段表示为一个有11个字段的组合输入行:左边输入的第一个字段是字段#0(基于0,记住),右边输入的第一个字段是字段#8

但是通过 builder API,你可以指定哪个字段的输入。要引用 “SAL”,内部字段#5,可以写 builder.field(2, 0, “SAL”),builder.field(2, “EMP”, “SAL”),或者 builder.field(2, 0, 5)。这意味着 “两个输入中的输入#0的字段#5”。(为什么它需要知道有两个输入?因为它们被存储在堆栈中;输入 #1 在堆栈的顶部,而输入 #0 在它的下面。如果我们不告诉构建器有两个输入,它就不知道要到多深的地方去找0号输入)。

同样,要引用 “DNAME”,内部字段 #9 (8 + 1),请写 builder.field(2, 1, “DNAME”), builder.field(2, “DEPT”, “DNAME”), 或者 builder.field(2, 1, 1)。

递归查询

警告。目前的API是实验性的,如有变化,恕不另行通知。一个SQL递归查询,例如这个产生序列1, 2, 3, …10的查询:

1
2
3
4
5
6
WITH RECURSIVE aux(i) AS (
VALUES (1)
UNION ALL
SELECT i+1 FROM aux WHERE i < 10
)
SELECT * FROM aux

可以通过对TransientTableRepeatUnion的扫描来生成:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
final RelNode node = builder
.values(new String[] { "i" }, 1)
.transientScan("aux")
.filter(
builder.call(
SqlStdOperatorTable.LESS_THAN,
builder.field(0),
builder.literal(10)))
.project(
builder.call(
SqlStdOperatorTable.PLUS,
builder.field(0),
builder.literal(1)))
.repeatUnion("aux", true)
.build();
System.out.println(RelOptUtil.toString(node));

以上代码输出以下内容:

1
2
3
4
5
6
7
LogicalRepeatUnion(all=[true])
LogicalTableSpool(readType=[LAZY], writeType=[LAZY], tableName=[aux])
LogicalValues(tuples=[[{ 1 }]])
LogicalTableSpool(readType=[LAZY], writeType=[LAZY], tableName=[aux])
LogicalProject($f0=[+($0, 1)])
LogicalFilter(condition=[<($0, 10)])
LogicalTableScan(table=[[aux]])

API汇总

关系操作符

以下方法创建一个关系表达式(RelNode),将其推入栈,并返回RelBuilder。

METHOD DESCRIPTION
scan(tableName) Creates a TableScan.
functionScan(operator, n, expr...) functionScan(operator, n, exprList) Creates a TableFunctionScan of the n most recent relational expressions.
transientScan(tableName [, rowType]) Creates a TableScan on a TransientTable with the given type (if not specified, the most recent relational expression’s type will be used).
values(fieldNames, value...) values(rowType, tupleList) Creates a Values.
filter([variablesSet, ] exprList) filter([variablesSet, ] expr...) Creates a Filter over the AND of the given predicates; if variablesSet is specified, the predicates may reference those variables.
project(expr...) project(exprList [, fieldNames]) Creates a Project. To override the default name, wrap expressions using alias, or specify the fieldNames argument.
projectPlus(expr...) projectPlus(exprList) Variant of project that keeps original fields and appends the given expressions.
projectExcept(expr...) projectExcept(exprList) Variant of project that keeps original fields and removes the given expressions.
permute(mapping) Creates a Project that permutes the fields using mapping.
convert(rowType [, rename]) Creates a Project that converts the fields to the given types, optionally also renaming them.
aggregate(groupKey, aggCall...) aggregate(groupKey, aggCallList) Creates an Aggregate.
distinct() Creates an Aggregate that eliminates duplicate records.
pivot(groupKey, aggCalls, axes, values) Adds a pivot operation, implemented by generating an Aggregate with a column for each combination of measures and values
unpivot(includeNulls, measureNames, axisNames, axisMap) Adds an unpivot operation, implemented by generating a Join to a Values that converts each row to several rows
sort(fieldOrdinal...) sort(expr...) sort(exprList) Creates a Sort. In the first form, field ordinals are 0-based, and a negative ordinal indicates descending; for example, -2 means field 1 descending. In the other forms, you can wrap expressions in as, nullsFirst or nullsLast.
sortLimit(offset, fetch, expr...) sortLimit(offset, fetch, exprList) Creates a Sort with offset and limit.
limit(offset, fetch) Creates a Sort that does not sort, only applies with offset and limit.
exchange(distribution) Creates an Exchange.
sortExchange(distribution, collation) Creates a SortExchange.
correlate(joinType, correlationId, requiredField...) correlate(joinType, correlationId, requiredFieldList) Creates a Correlate of the two most recent relational expressions, with a variable name and required field expressions for the left relation.
join(joinType, expr...) join(joinType, exprList) join(joinType, fieldName...) Creates a Join of the two most recent relational expressions. The first form joins on a boolean expression (multiple conditions are combined using AND). The last form joins on named fields; each side must have a field of each name.
semiJoin(expr) Creates a Join with SEMI join type of the two most recent relational expressions.
antiJoin(expr) Creates a Join with ANTI join type of the two most recent relational expressions.
union(all [, n]) Creates a Union of the n (default two) most recent relational expressions.
intersect(all [, n]) Creates an Intersect of the n (default two) most recent relational expressions.
minus(all) Creates a Minus of the two most recent relational expressions.
repeatUnion(tableName, all [, n]) Creates a RepeatUnion associated to a TransientTable of the two most recent relational expressions, with n maximum number of iterations (default -1, i.e. no limit).
snapshot(period) Creates a Snapshot of the given snapshot period.
match(pattern, strictStart, strictEnd, patterns, measures, after, subsets, allRows, partitionKeys, orderKeys, interval) Creates a Match.

参数类型详解:

构造器的方法包含各种优化,包括:

  • 如果被要求按顺序映射所有列,project则返回输入值
  • filter会扁平化简化筛选操作(所以一个andor操作可能有多个操作数),例如(将x = 1 AND TRUE转换为x = 1)
  • 如果你先sortlimit,效果和你使用sortLimit一样

有一些注释方法可以为栈上的顶级关系表达式添加注释信息:

METHOD DESCRIPTION
as(alias) Assigns a table alias to the top relational expression on the stack
variable(varHolder) Creates a correlation variable referencing the top relational expression
栈方法
METHOD DESCRIPTION
build() Pops the most recently created relational expression off the stack
push(rel) Pushes a relational expression onto the stack. Relational methods such as scan, above, call this method, but user code generally does not
pushAll(collection) Pushes a collection of relational expressions onto the stack
peek() Returns the relational expression most recently put onto the stack, but does not remove it
标量表达方法

以下方法返回一个标量表达式(RexNode)。

它们中的许多都使用栈中的内容。例如,field(“DEPTNO”)返回一个对刚刚添加到堆栈中的关系表达式的"DEPTNO"字段的引用。

METHOD DESCRIPTION
literal(value) Constant
field(fieldName) Reference, by name, to a field of the top-most relational expression
field(fieldOrdinal) Reference, by ordinal, to a field of the top-most relational expression
field(inputCount, inputOrdinal, fieldName) Reference, by name, to a field of the (inputCount - inputOrdinal)th relational expression
field(inputCount, inputOrdinal, fieldOrdinal) Reference, by ordinal, to a field of the (inputCount - inputOrdinal)th relational expression
field(inputCount, alias, fieldName) Reference, by table alias and field name, to a field at most inputCount - 1 elements from the top of the stack
field(alias, fieldName) Reference, by table alias and field name, to a field of the top-most relational expressions
field(expr, fieldName) Reference, by name, to a field of a record-valued expression
field(expr, fieldOrdinal) Reference, by ordinal, to a field of a record-valued expression
fields(fieldOrdinalList) List of expressions referencing input fields by ordinal
fields(mapping) List of expressions referencing input fields by a given mapping
fields(collation) List of expressions, exprList, such that sort(exprList) would replicate collation
call(op, expr...) call(op, exprList) Call to a function or operator
and(expr...) and(exprList) Logical AND. Flattens nested ANDs, and optimizes cases involving TRUE and FALSE.
or(expr...) or(exprList) Logical OR. Flattens nested ORs, and optimizes cases involving TRUE and FALSE.
not(expr) Logical NOT
equals(expr, expr) Equals
isNull(expr) Checks whether an expression is null
isNotNull(expr) Checks whether an expression is not null
alias(expr, fieldName) Renames an expression (only valid as an argument to project)
cast(expr, typeName) cast(expr, typeName, precision) cast(expr, typeName, precision, scale) Converts an expression to a given type
desc(expr) Changes sort direction to descending (only valid as an argument to sort or sortLimit)
nullsFirst(expr) Changes sort order to nulls first (only valid as an argument to sort or sortLimit)
nullsLast(expr) Changes sort order to nulls last (only valid as an argument to sort or sortLimit)
cursor(n, input) Reference to inputth (0-based) relational input of a TableFunctionScan with n inputs (see functionScan)
子查询方法

以下方法将子查询转换为标量值(在in、exists、some、all、unique的情况下是BOOLEAN;对于scalarQuery是任何标量类型。对于arrayQuery是ARRAY,对于mapQuery是MAP,对于multisetQuery是MULTISET)

就如下所示,relFn是一个函数,它接收一个RelBuilder参数并返回一个RelNode。你通常把它作为一个lambda表达式来实现;该方法用一个具有正确上下文的RelBuilder来调用你的代码,而你的代码则返回将成为子查询的RelNode

METHOD DESCRIPTION
all(expr, op, relFn) Returns whether expr has a particular relation to all of the values of the sub-query
arrayQuery(relFn) Returns the rows of a sub-query as an ARRAY
exists(relFn) Tests whether sub-query is non-empty
in(expr, relFn) in(exprList, relFn) Tests whether a value occurs in a sub-query
mapQuery(relFn) Returns the rows of a sub-query as a MAP
multisetQuery(relFn) Returns the rows of a sub-query as a MULTISET
scalarQuery(relFn) Returns the value of the sole column of the sole row of a sub-query
some(expr, op, relFn) Returns whether expr has a particular relation to one or more of the values of the sub-query
unique(relFn) Returns whether the rows of a sub-query are unique
模式匹配方法

以下方法返回用于匹配的模式。

METHOD DESCRIPTION
patternConcat(pattern...) Concatenates patterns
patternAlter(pattern...) Alternates patterns
patternQuantify(pattern, min, max) Quantifies a pattern
patternPermute(pattern...) Permutes a pattern
patternExclude(pattern) Excludes a pattern
分组方法

以下方法返回RelBuilder.GroupKey

METHOD DESCRIPTION
groupKey(fieldName...) groupKey(fieldOrdinal...) groupKey(expr...) groupKey(exprList) Creates a group key of the given expressions
groupKey(exprList, exprListList) Creates a group key of the given expressions with grouping sets
groupKey(bitSet [, bitSets]) Creates a group key of the given input columns, with multiple grouping sets if bitSets is specified
聚合方法

以下方法返回RelBuilder.AggCall

METHOD DESCRIPTION
aggregateCall(op, expr...) aggregateCall(op, exprList) Creates a call to a given aggregate function
count([ distinct, alias, ] expr...) count([ distinct, alias, ] exprList) Creates a call to the COUNT aggregate function
countStar(alias) Creates a call to the COUNT(*) aggregate function
sum([ distinct, alias, ] expr) Creates a call to the SUM aggregate function
min([ alias, ] expr) Creates a call to the MIN aggregate function
max([ alias, ] expr) Creates a call to the MAX aggregate function

要进一步修改AggCall,请调用以下方法:

METHOD DESCRIPTION
approximate(approximate) Allows approximate value for the aggregate of approximate
as(alias) Assigns a column alias to this expression (see SQL AS)
distinct() Eliminates duplicate values before aggregating (see SQL DISTINCT)
distinct(distinct) Eliminates duplicate values before aggregating if distinct
filter(expr) Filters rows before aggregating (see SQL FILTER (WHERE ...))
sort(expr...) sort(exprList) Sorts rows before aggregating (see SQL WITHIN GROUP)
unique(expr...) unique(exprList) Makes rows unique before aggregating (see SQL WITHIN DISTINCT)
over() Converts this AggCall into a windowed aggregate (see OverCall below)
窗口聚合方法

创建一个 RelBuilder.OverCall,它代表了对窗口聚合函数的调用,创建一个聚合调用,然后调用其over()方法,例如count().over()

要进一步修改OverCall,调用以下方法:

METHOD DESCRIPTION
rangeUnbounded() Creates an unbounded range-based window, RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
rangeFrom(lower) Creates a range-based window bounded below, RANGE BETWEEN lower AND CURRENT ROW
rangeTo(upper) Creates a range-based window bounded above, RANGE BETWEEN CURRENT ROW AND upper
rangeBetween(lower, upper) Creates a range-based window, RANGE BETWEEN lower AND upper
rowsUnbounded() Creates an unbounded row-based window, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
rowsFrom(lower) Creates a row-based window bounded below, ROWS BETWEEN lower AND CURRENT ROW
rowsTo(upper) Creates a row-based window bounded above, ROWS BETWEEN CURRENT ROW AND upper
rowsBetween(lower, upper) Creates a rows-based window, ROWS BETWEEN lower AND upper
partitionBy(expr...) partitionBy(exprList) Partitions the window on the given expressions (see SQL PARTITION BY)
orderBy(expr...) sort(exprList) Sorts the rows in the window (see SQL ORDER BY)
allowPartial(b) Sets whether to allow partial width windows; default true
nullWhenCountZero(b) Sets whether whether the aggregate function should evaluate to null if no rows are in the window; default false
as(alias) Assigns a column alias (see SQL AS) and converts this OverCall to a RexNode
toRex() Converts this OverCall to a RexNode