Doris版本:2.1.0
be.out日志
start time: Fri Feb 14 16:39:19 CST 2025
INFO: java_cmd /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/bin/java
INFO: jdk_version 8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/java_extensions/preload-extensions/preload-extensions-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/java_extensions/java-udf/java-udf-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/hadoop_hdfs/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
[WARNING!] /sys/kernel/mm/transparent_hugepage/enabled: [always] madvise never, Doris not recommend turning on THP, which may cause the BE process to use more memory and cannot be freed in time. Turn off THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
*** Query id: 675d9f4538e746f1-84fe6c62ad1655c8 ***
*** tablet id: 0 ***
*** Aborted at 1744402629 (unix time) try "date -d @1744402629" if you are using GNU date ***
*** Current BE git commitID: 91efb6a43d ***
*** SIGSEGV address not mapped to object (@0x8) received by PID 63890 (TID 66218 OR 0x7ed8399e5700) from PID 8; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:417
1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/lib/amd64/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/lib/amd64/server/libjvm.so
3# signalHandler(int, siginfo_t*, void*) in /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/lib/amd64/server/libjvm.so
4# 0x00007F11B0B40400 in /lib64/libc.so.6
5# doris::FunctionContext::~FunctionContext() at /home/zcp/repo_center/doris_release/doris/be/src/udf/udf.h:145
6# doris::vectorized::VExprContext::~VExprContext() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exprs/vexpr_context.cpp:46
7# doris::vectorized::VScanner::~VScanner() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscanner.h:64
8# doris::vectorized::ScannerDelegate::~ScannerDelegate() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscan_node.h:101
9# doris::vectorized::VScanNode::release_resource(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscan_node.cpp:337
10# doris::pipeline::StreamingOperator<doris::vectorized::VScanNode>::close(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/exec/operator.h:340
11# doris::pipeline::PipelineTask::close(doris::Status) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/pipeline_task.cpp:335
12# doris::pipeline::_close_task(doris::pipeline::PipelineTask*, doris::pipeline::PipelineTaskState, doris::Status) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_scheduler.cpp:238
13# doris::pipeline::TaskScheduler::_do_work(unsigned long) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_scheduler.cpp:279
14# doris::ThreadPool::dispatch_thread() in /app/apache-doris-2.1.0-bin-x64/be/lib/doris_be
15# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
16# start_thread in /lib64/libpthread.so.0
17# clone in /lib64/libc.so.6
start time: Sat Apr 12 04:21:03 CST 2025
INFO: java_cmd /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el7_9.x86_64/jre/bin/java
INFO: jdk_version 8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/java_extensions/preload-extensions/preload-extensions-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/java_extensions/java-udf/java-udf-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/apache-doris-2.1.0-bin-x64/be/lib/hadoop_hdfs/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
根据以上be.out中的query id,在fe.audit日志获取到业务sql为一条普通的insert into .. select语句,显示detailMessage错误原因为There exists unhealthy backend,因此推测并不是这条SQL导致的be crash:
46184:2025-04-12 04:15:11,635 [query] |Client=10.205.112.180:52804|User=cdp|Ctl=internal|Db=cdp|State=ERR|ErrorCode=1105|ErrorMessage=errCode = 2, detailMessage = There exists unhealthy backend. backend 11555 is down|Time(ms)=64732|ScanBytes=0|ScanRows=0|ReturnRows=0|StmtId=55548015|QueryId=675d9f4538e746f1-84fe6c62ad1655c8|IsQuery=false|isNereids=false|feIp=10.205.112.136|Stmt=XXXXXXXXXXXXXXXXX业务SQLXXXXXXXXXXX|CpuTimeMS=0|ShuffleSendBytes=-1|ShuffleSendRows=-1|SqlHash=363798a3c7d4e0828788d13ad258c3bf|peakMemoryBytes=0|SqlDigest=|TraceId=|WorkloadGroup=normal|FuzzyVariables=
be.info日志如下
截取fe.audit错误日志04:15:11前后的be.info日志:
https://derekday-1252212125.cos.ap-shanghai.myqcloud.com/be.INFO.log.20250412-030154-short
截取be重启前后的be.info日志如下:
https://derekday-1252212125.cos.ap-shanghai.myqcloud.com/extracted_log_2908129-2909029.log
日志中没有看到命令oom的信息。
使用gdb打开core dump文件
gdb /app/apache-doris-2.1.0-bin-x64/be/lib/doris_be /core.63890
得到信息
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /app/apache-doris-2.1.0-bin-x64/be/lib/doris_be...Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /app/apache-doris-2.1.0-bin-x64/be/lib/doris_be]
(no debugging symbols found)...done.
BFD: Warning: /core.63890 is truncated: expected core file size >= 660496875520, found: 510600187904.
[New LWP 63890]
[New LWP 63939]
[New LWP 63990]
.....
[New LWP 64711]
[New LWP 64552]
[New LWP 64756]
[New LWP 64754]
[New LWP 64752]
Cannot access memory at address 0x7f11b151d128
Cannot access memory at address 0x7f11b151d120
Failed to read a valid object file image from memory.
Core was generated by `/app/apache-doris-2.1.0-bin-x64/be/lib/doris_be'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f11b0bcf85d in ?? ()
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7ffc45c540e0:
(gdb) bt full
#0 0x00007f11b0bcf85d in ?? ()
No symbol table info available.
Cannot access memory at address 0x7ffc45c540e0
求问:
这种情况可能是什么原因导致的be crash?