左表单分区且left join on条件是左表分桶字段,但是没有使用bucket_shuffle,必须带上[shuffle]才会使用

Viewed 74

版本:2.0.2

左表按照文档上设置成了单分区,而且连表条件得date(date)、hour(int)、pchannel(varchar50)是左表得分桶列,但是看查询计划使用得是shuffle,而不是bucket_shuffle

explain SELECT date AS date, round(IFNULL(SUM( cost),0),2) AS cost FROM ad_hour_indicator_new left JOIN adserving_bi_conversion_indicator_new_hi ON date = dt AND ad_hour_indicator_new.hour = adserving_bi_conversion_indicator_new_hi.hour AND ad_hour_indicator_new.pchannel = adserving_bi_conversion_indicator_new_hi.pchannel WHERE date >= '2024-09-10' AND date <= '2024-10-1' GROUP BY date ORDER BY date;

image.png

而加上HINT '[shuffle]'才能使用bucket_shuffle,比较奇怪还有什么因素可以影响到这个,左表总数大概100w+,右表几十万

explain SELECT date AS date, round(IFNULL(SUM( cost),0),2) AS cost FROM ad_hour_indicator_new left JOIN[shuffle] adserving_bi_conversion_indicator_new_hi ON date = dt AND ad_hour_indicator_new.hour = adserving_bi_conversion_indicator_new_hi.hour AND ad_hour_indicator_new.pchannel = adserving_bi_conversion_indicator_new_hi.pchannel WHERE date >= '2024-09-10' AND date <= '2024-10-1' GROUP BY date ORDER BY date;

image.png

具体执行计划链接

2 Answers

如果能提供 explain memo plan select...的结果,我们可以进一步分析
可能的原因:

  1. 这两次执行时统计信息有没有变化,比如两张表的行数有没有变化(统计信息里有没有变化,不是指实际数据。如果新建的表,有可能某些tablet的行数,be还没来得及上报给fe)
  2. 这里有过滤条件,这个条件是左表还是右表的?建议使用新版本,在新版本中,可能这个outer join已经被消除了,转化为inner join了

ad_hour_indicator_new 1.6M 行,无统计信息推导的过滤后行数 0.4M 行
adserving_bi_conversion_indicator_new_hi 0.7M 行, 无过滤条件

cbo 估算 的代价:
ad_hour_indicator_new left outer join adserving_bi_conversion_indicator_new_hi
高于
adserving_bi_conversion_indicator_new_hi right outer join ad_hour_indicator_new