求助BE Crash原因分析

Viewed 35

Doris版本:2.1.0

be.out日志

image.png

start time: Fri Feb 14 16:39:19 CST 2025
INFO: java_cmd /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/bin/java
INFO: jdk_version 8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/java_extensions/preload-extensions/preload-extensions-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/java_extensions/java-udf/java-udf-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/hadoop_hdfs/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
[WARNING!] /sys/kernel/mm/transparent_hugepage/enabled: [always] madvise never, Doris not recommend turning on THP, which may cause the BE process to use more memory and cannot be freed in time. Turn off THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
*** Query id: 675d9f4538e746f1-84fe6c62ad1655c8 ***
*** tablet id: 0 ***
*** Aborted at 1744402629 (unix time) try "date -d @1744402629" if you are using GNU date ***
*** Current BE git commitID: 91efb6a43d ***
*** SIGSEGV address not mapped to object (@0x8) received by PID 63890 (TID 66218 OR 0x7ed8399e5700) from PID 8; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:417
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/lib/amd64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/lib/amd64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/lib/amd64/server/libjvm.so
 4# 0x00007F11B0B40400 in /lib64/libc.so.6
 5# doris::FunctionContext::~FunctionContext() at /home/zcp/repo_center/doris_release/doris/be/src/udf/udf.h:145
 6# doris::vectorized::VExprContext::~VExprContext() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exprs/vexpr_context.cpp:46
 7# doris::vectorized::VScanner::~VScanner() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscanner.h:64
 8# doris::vectorized::ScannerDelegate::~ScannerDelegate() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscan_node.h:101
 9# doris::vectorized::VScanNode::release_resource(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscan_node.cpp:337
10# doris::pipeline::StreamingOperator<doris::vectorized::VScanNode>::close(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/exec/operator.h:340
11# doris::pipeline::PipelineTask::close(doris::Status) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/pipeline_task.cpp:335
12# doris::pipeline::_close_task(doris::pipeline::PipelineTask*, doris::pipeline::PipelineTaskState, doris::Status) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_scheduler.cpp:238
13# doris::pipeline::TaskScheduler::_do_work(unsigned long) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_scheduler.cpp:279
14# doris::ThreadPool::dispatch_thread() in /app/apache-doris-2.1.0-bin-x64/be/lib/doris_be
15# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
16# start_thread in /lib64/libpthread.so.0
17# clone in /lib64/libc.so.6

start time: Sat Apr 12 04:21:03 CST 2025
INFO: java_cmd /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/bin/java
INFO: jdk_version 8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/java_extensions/preload-extensions/preload-extensions-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/java_extensions/java-udf/java-udf-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/hadoop_hdfs/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]

根据以上be.out中的query id,在fe.audit日志获取到业务sql为一条普通的insert into .. select语句,显示detailMessage错误原因为There exists unhealthy backend,因此推测并不是这条SQL导致的be crash:

46184:2025-04-12 04:15:11,635 [query] |Client=10.205.112.180:52804|User=cdp|Ctl=internal|Db=cdp|State=ERR|ErrorCode=1105|ErrorMessage=errCode = 2, detailMessage = There exists unhealthy backend. backend 11555 is down|Time(ms)=64732|ScanBytes=0|ScanRows=0|ReturnRows=0|StmtId=55548015|QueryId=675d9f4538e746f1-84fe6c62ad1655c8|IsQuery=false|isNereids=false|feIp=10.205.112.136|Stmt=XXXXXXXXXXXXXXXXX业务SQLXXXXXXXXXXX|CpuTimeMS=0|ShuffleSendBytes=-1|ShuffleSendRows=-1|SqlHash=363798a3c7d4e0828788d13ad258c3bf|peakMemoryBytes=0|SqlDigest=|TraceId=|WorkloadGroup=normal|FuzzyVariables=

be.info日志如下

image.png

截取fe.audit错误日志04:15:11前后的be.info日志:
https://derekday-1252212125.cos.ap-shanghai.myqcloud.com/be.INFO.log.20250412-030154-short

截取be重启前后的be.info日志如下:
https://derekday-1252212125.cos.ap-shanghai.myqcloud.com/extracted_log_2908129-2909029.log

日志中没有看到命令oom的信息。

使用gdb打开core dump文件

gdb /app/apache-doris-2.1.0-bin-x64/be/lib/doris_be /core.63890

得到信息

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /app/apache-doris-2.1.0-bin-x64/be/lib/doris_be...Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /app/apache-doris-2.1.0-bin-x64/be/lib/doris_be]
(no debugging symbols found)...done.
BFD: Warning: /core.63890 is truncated: expected core file size >= 660496875520, found: 510600187904.
[New LWP 63890]
[New LWP 63939]
[New LWP 63990]
.....

[New LWP 64711]
[New LWP 64552]
[New LWP 64756]
[New LWP 64754]
[New LWP 64752]
Cannot access memory at address 0x7f11b151d128
Cannot access memory at address 0x7f11b151d120
Failed to read a valid object file image from memory.
Core was generated by `/app/apache-doris-2.1.0-bin-x64/be/lib/doris_be'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f11b0bcf85d in ?? ()
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7ffc45c540e0: 
(gdb) bt full
#0  0x00007f11b0bcf85d in ?? ()
No symbol table info available.
Cannot access memory at address 0x7ffc45c540e0

求问:
这种情况可能是什么原因导致的be crash?

1 Answers

2.1.0 的问题,先直接升级到2.1.9 吧,这期间fix 了很多bug