[Solved] What does STREAMING_AGGREGATION_OPERATOR mean in Doris?

I am testing TPCH Q1, and in the QueryProfile I see an operator named STREAMING_AGGREGATION_OPERATOR:

STREAMING_AGGREGATION_OPERATOR  (id=1):
    -  BlocksProduced:  sum  48,  avg  1,  max  1,  min  1
    -  CloseTime:  avg  1.206us,  max  8.20us,  min  568ns
    -  ExecTime:  avg  10s203ms,  max  10s838ms,  min  9s518ms
    -  MemoryUsage:  sum  ,  avg  ,  max  ,  min  
        -  HashTable:  sum  7.88  KB,  avg  168.00  B,  max  168.00  B,  min  168.00  B
        -  PeakMemoryUsage:  sum  72.76  MB,  avg  1.52  MB,  max  1.52  MB,  min  1.52  MB
        -  SerializeKeyArena:  sum  72.75  MB,  avg  1.52  MB,  max  1.52  MB,  min  1.52  MB
    -  OpenTime:  avg  42.458us,  max  199.827us,  min  28.44us
    -  ProjectionTime:  avg  0ns,  max  0ns,  min  0ns
    -  RowsProduced:  sum  192,  avg  4,  max  4,  min  4

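In Velox, Streaming Aggregation is defined as an aggregation with no aggregate functions, that is, an aggregation with only grouping columns, which is effectively a deduplication (see [1]). TPCH Q1 should not fit that definition: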

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '120' day
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

However, I could not find any Doris documentation that clearly describes Streaming Aggregation, so I am asking here.

[1]. https://facebookincubator.github.io/velox/develop/aggregations.html

1 Answer

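Let me explain. Streaming pre-aggregation works like this: the operator builds a small hash table and aggregates into it first. If the aggregation is effective, meaning many input rows collapse into few groups, it keeps aggregating into that hash table.
If the reduction ratio is very low, for example 100 million rows come in, no two of them share a key, and the hash table fills up, then the rows that arrive afterwards are sent straight to the second-stage aggregation, so no CPU is wasted on pointless computation.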
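
The behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not Doris's actual implementation: the function names `streaming_preagg` and `final_agg`, the `max_groups` capacity, and the `min_reduction` threshold are all hypothetical, and the real operator works on blocks in C++ with its own heuristics.

```python
def streaming_preagg(batches, max_groups=4, min_reduction=2.0):
    """First-phase aggregation: yield (key, partial_sum, partial_count).

    Hypothetical sketch. max_groups caps the small hash table;
    min_reduction is an assumed threshold on rows_seen / groups below
    which pre-aggregation is considered not worth the CPU.
    """
    table = {}          # key -> [sum, count]
    rows_seen = 0
    passthrough = False
    for batch in batches:
        for key, value in batch:
            if not passthrough:
                rows_seen += 1
                state = table.get(key)
                if state is not None:
                    # Key already in the table: aggregate in place.
                    state[0] += value
                    state[1] += 1
                    continue
                if len(table) < max_groups:
                    # Room left: start a new group.
                    table[key] = [value, 1]
                    continue
                # Table is full and the key is new. If few rows have
                # collapsed so far, stop pre-aggregating from now on.
                if rows_seen / len(table) < min_reduction:
                    passthrough = True
            # Forward the row unaggregated to the second phase.
            yield (key, value, 1)
    # Flush whatever was pre-aggregated.
    for key, (total, count) in table.items():
        yield (key, total, count)


def final_agg(partials):
    """Second-phase aggregation: merge partial states per key."""
    result = {}
    for key, total, count in partials:
        acc = result.setdefault(key, [0, 0])
        acc[0] += total
        acc[1] += count
    return {key: (total, count) for key, (total, count) in result.items()}


# Few distinct keys (as in TPCH Q1): the first phase shrinks the data.
good = [[("A", 1), ("B", 2), ("A", 3)], [("B", 4), ("A", 5)]]
# All keys distinct: the table fills up and later rows pass through.
bad = [[(i, i) for i in range(10)]]
print(len(list(streaming_preagg(good))), len(list(streaming_preagg(bad))))
```

With few distinct keys the first phase emits one partial state per group, while with all-distinct keys it emits as many rows as it received. Either way the second phase produces the same final result, which is why skipping the pre-aggregation is safe.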