[Solved] Doris 2.1.0: running a query causes the BE to restart


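Background: today we upgraded the cluster from 2.0 to 2.1, and after the upgrade the BE nodes kept restarting on their own. Investigation indicates that certain complex queries over views nested inside views trigger a SIGSEGV in the BE, which shuts the node down.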

Test statement:

SELECT
        DISTINCT 'ec' AS `ec`,
        `a`.`sku` AS `seller_sku`,
        'US' AS `marketplace_id`,
        `a`.`product_title` AS `fnsku英文标题`
FROM
        `default_cluster:hyy`.`view_walmart_all_order` a
    INNER JOIN (
        SELECT
            `sku` AS `sku`,
            max(`workdate`) AS `workdate`
        FROM
            `default_cluster:hyy`.`view_walmart_all_order`
        GROUP BY
            `sku`) a1 ON
        `a`.`sku` = `a1`.`sku`
        AND `a`.`workdate` = `a1`.`workdate`

BE.out from the killed node:

*** Query id: 1345f2f0c91e472a-a2cc9c491f993037 ***
*** tablet id: 0 ***
*** Aborted at 1711352493 (unix time) try "date -d @1711352493" if you are using GNU date ***
*** Current BE git commitID: 91efb6a43d ***
*** SIGSEGV unknown detail explain (@0x0) received by PID 891071 (TID 891325 OR 0x7f79d19f7640) from PID 0; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:417
 1# 0x00007F7B0800042F in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
 3# 0x00007F7B07FF90FC in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
 4# 0x00007F7B0C5DB520 in /lib/x86_64-linux-gnu/libc.so.6
 5# doris::vectorized::VExprContext::execute(doris::vectorized::Block*, int*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exprs/vexpr_context.cpp:50
 6# doris::pipeline::JoinProbeLocalState<doris::pipeline::HashJoinSharedState, doris::pipeline::HashJoinProbeLocalState>::_build_output_block(doris::vectorized::Block*, doris::vectorized::Block*, bool) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/exec/join_probe_operator.cpp:127
 7# doris::pipeline::HashJoinProbeLocalState::filter_data_and_build_output(doris::RuntimeState*, doris::vectorized::Block*, bool*, doris::vectorized::Block*, bool) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/exec/hashjoin_probe_operator.cpp:433
 8# doris::pipeline::HashJoinProbeOperatorX::pull(doris::RuntimeState*, doris::vectorized::Block*, bool*) const at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/exec/hashjoin_probe_operator.cpp:364
 9# doris::pipeline::StatefulOperatorX<doris::pipeline::HashJoinProbeLocalState>::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/pipeline_x/operator.cpp:459
10# doris::pipeline::OperatorXBase::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/pipeline_x/operator.cpp:210
11# doris::pipeline::StatefulOperatorX<doris::pipeline::DistinctStreamingAggLocalState>::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/pipeline_x/operator.cpp:444
12# doris::pipeline::OperatorXBase::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/pipeline_x/operator.cpp:210
13# doris::pipeline::PipelineXTask::execute(bool*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/pipeline_x/pipeline_x_task.cpp:274
14# doris::pipeline::TaskScheduler::_do_work(unsigned long) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_scheduler.cpp:334
15# doris::ThreadPool::dispatch_thread() in /opt/apache-doris/be/lib/doris_be
16# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
17# 0x00007F7B0C62DAC3 in /lib/x86_64-linux-gnu/libc.so.6
18# 0x00007F7B0C6BF850 in /lib/x86_64-linux-gnu/libc.so.6

This problem did not occur on 2.0. How should I troubleshoot and fix it? Thanks!

2 Answers

There is a quick workaround: add useSSL=false to the MySQL connection string. Disabling the SSL connection avoids the failure.

jdbc:mysql://127.0.0.1:3306/test?useSSL=false

You can try running set global fragment_transmission_compression_codec="none";
and then refresh catalog, which avoids the problem for now.
The community will fix this in a later release.
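
For reference, the steps in this answer could be run roughly as below from a MySQL client connected to the FE. This is only a sketch: `my_catalog` is a made-up placeholder (the answer does not say which catalog to refresh), and the final SHOW VARIABLES check is an extra step that is not part of the original answer.

```sql
-- Turn off compression for fragment data transmission, for all sessions.
SET GLOBAL fragment_transmission_compression_codec = "none";

-- Refresh the catalog; `my_catalog` is a hypothetical name,
-- replace it with the catalog your views actually read from.
REFRESH CATALOG my_catalog;

-- Optional: confirm that the new value is in effect.
SHOW VARIABLES LIKE 'fragment_transmission_compression_codec';
```

Since SET GLOBAL changes the default for every session, it may be worth writing down the previous value first so it can be restored after upgrading to a release that contains the fix.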