All 3 Doris BE nodes crashed


Version: 2.1.6

Log:
*** Query id: d74c474ba05d45f7-8413475ed9d74898 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1729389667 (unix time) try "date -d @1729389667" if you are using GNU date ***
*** Current BE git commitID: 653e315ba5 ***
*** SIGSEGV address not mapped to object (@0x2c8) received by PID 19310 (TID 20673 OR 0x7f94c46b6700) from PID 712; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421
1# os::Linux::chained_handler(int, siginfo_t*, void*) in /opt/apache/doris/java8/jre/lib/amd64/server/libjvm.so
2# JVM_handle_linux_signal in /opt/apache/doris/java8/jre/lib/amd64/server/libjvm.so
3# signalHandler(int, siginfo_t*, void*) in /opt/apache/doris/java8/jre/lib/amd64/server/libjvm.so
4# 0x00007F9770A34400 in /lib64/libc.so.6
5# doris::pipeline::PriorityTaskQueue::push(doris::pipeline::PipelineTask*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_queue.cpp:115
6# doris::pipeline::MultiCoreTaskQueue::push_back(doris::pipeline::PipelineTask*, int) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_queue.cpp:217
7# doris::pipeline::MultiCoreTaskQueue::push_back(doris::pipeline::PipelineTask*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_queue.cpp:209
8# doris::pipeline::TaskScheduler::schedule_task(doris::pipeline::PipelineTask*) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_scheduler.cpp:224
9# doris::pipeline::PipelineXFragmentContext::submit() at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/pipeline_x/pipeline_x_fragment_context.cpp:1418
10# doris::FragmentMgr::exec_plan_fragment(doris::TPipelineFragmentParams const&, doris::QuerySource, std::function<void (doris::RuntimeState*, doris::Status*)> const&) in /opt/apache/doris/be/lib/doris_be
11# doris::FragmentMgr::exec_plan_fragment(doris::TPipelineFragmentParams const&, doris::QuerySource) at /home/zcp/repo_center/doris_release/doris/be/src/runtime/fragment_mgr.cpp:685
12# doris::PInternalServiceImpl::_exec_plan_fragment_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, doris::PFragmentRequestVersion, bool, std::function<void (doris::RuntimeState*, doris::Status*)> const&) in /opt/apache/doris/be/lib/doris_be
13# doris::PInternalServiceImpl::_exec_plan_fragment_in_pthread(google::protobuf::RpcController*, doris::PExecPlanFragmentRequest const*, doris::PExecPlanFragmentResult*, google::protobuf::Closure*) at /home/zcp/repo_center/doris_release/doris/be/src/service/internal_service.cpp:328
14# doris::WorkThreadPool::work_thread(int) at /home/zcp/repo_center/doris_release/doris/be/src/util/work_thread_pool.hpp:159
15# execute_native_thread_routine at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
16# start_thread in /lib64/libpthread.so.0
17# clone in /lib64/libc.so.6

StdoutLogger 2024-10-20 10:01:27,532 Start time: Sun Oct 20 10:01:27 CST 2024
INFO: java_cmd /opt/apache/doris/java8/bin/java
INFO: jdk_version 8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache/doris/be/lib/java_extensions/preload-extensions/preload-extensions-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apache/doris/be/lib/java_extensions/java-udf/java-udf-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apache/doris/be/lib/hadoop_hdfs/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
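
To line the crash up with the other logs, the `Aborted at` epoch value in the trace can be converted to wall-clock time. A minimal Python sketch, assuming the hosts run on China Standard Time (UTC+8), which the `CST 2024` suffix in the `StdoutLogger` line suggests but does not prove; `abort_time` is a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the hosts use China Standard Time (UTC+8), as the
# "CST 2024" suffix in the StdoutLogger line suggests.
CST = timezone(timedelta(hours=8), name="CST")


def abort_time(epoch_seconds: int) -> datetime:
    """Convert the 'Aborted at <epoch>' value from a BE crash trace to CST."""
    return datetime.fromtimestamp(epoch_seconds, tz=CST)


crash = abort_time(1729389667)
# "Start time" reported by StdoutLogger after the BE came back up.
restart = datetime(2024, 10, 20, 10, 1, 27, tzinfo=CST)

print(crash.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2024-10-20 10:01:07 CST
print((restart - crash).total_seconds())       # 20.0
```

Under that assumption the abort falls at 10:01:07, about 20 seconds before the BE start time logged at 10:01:27.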

2 Answers

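Could you confirm a few things:

  1. Have you searched the logs on every FE for this query_id?
  2. Does your workload include any queries against external tables?

If it's convenient, please add me on WeChat (listed on my profile) and we can look into this together.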

grep "d74c474ba05d45f7-8413475ed9d74898" ./fe.audit.log.20241020-1
2024-10-20 10:01:13,095 [query] Query d74c474ba05d45f7-8413475ed9d74898 1 times with new query id: db57e4fb90214ee0-8c5382fbb0fe34ca

grep "db57e4fb90214ee0-8c5382fbb0fe34ca" ./fe.audit.log.20241020-1
2024-10-20 10:01:13,095 [query] Query d74c474ba05d45f7-8413475ed9d74898 1 times with new query id: db57e4fb90214ee0-8c5382fbb0fe34ca
2024-10-20 10:01:19,977 [query] |Client=10.82.196.102:26934|User=root|Ctl=internal|Db=task_diagnosis|State=ERR|ErrorCode=1105|ErrorMessage=errCode = 2, detailMessage = There is no scanNode Backend available.[10005: in black list(send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason), 10006: in black list(send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason)]|Time(ms)=5705|ScanBytes=0|ScanRows=0|ReturnRows=0|StmtId=22275356|QueryId=db57e4fb90214ee0-8c5382fbb0fe34ca|IsQuery=true|isNereids=true|feIp=10.82.195.127|Stmt=with logTable as ( select * from azkaban_execution_logs where 1 = 1 and app_id in ( 'application_1721006386533_7823729' ) ) SELECT jbi.id AS jobInstanceId, exeJob.attempt AS attemptId, ael.app_id AS appId, ael.query_id AS queryId, ael.exec_id AS execId, bfi.schedule_time AS scheduleTime, exejob.status as state, exejob.start_time AS startTime, exejob.end_time AS endTime, mbl.id AS businessLineId, mbl.code AS businessLineCode, mbl.name AS businessLineName, bp.id AS projectId, bp.code AS projectCode, bp.name AS projectName, bfi.flow_id AS flowId, exeflow.flow_id AS flowName, bfi.id AS flowInstanceId, job.id AS jobId, job.name AS jobName, jbi.start_time AS jobInstanceStartTime, jbi.end_time AS jobInstanceEndTime, ael.exec_id AS azkExecutionId, job.owner_id AS ownerId, job.owner AS owner, job.creation_date AS creationDate, job.created_by AS createdBy, job.updated_by AS updatedBy, job.updation_date AS updationDate, jbi.azk_cluster_id AS azkClusterId, jbi.job_version AS jobVersion, job.node_type AS nodetype FROM logTable ael LEFT JOIN azkaban_execution_jobs exejob ON exejob.exec_id = ael.exec_id AND exejob.job_id = if(instr(ael.name,':')>0, SPLIT_PART(ael.name, ':', -1), ael.name) AND exeJob.attempt = ael.attempt AND exejob.azk_cluster_id = ael.azk_cluster_id LEFT JOIN azkaban_execution_flows exeflow ON exejob.exec_id = exeflow.exec_id AND exeflow.flow_id = 
if(instr(exejob.flow_id,',')>0, SPLIT_PART(exejob.flow_id, ',', 1), exejob.flow_id) AND exejob.azk_cluster_id = exeflow.azk_cluster_id LEFT JOIN bdp_flow_instance bfi ON bfi.execution_id = exeflow.exec_id LEFT JOIN bdp_job_instance jbi ON jbi.flow_instance_id = bfi.id AND jbi.job_name = if(instr(ael.name,':')>0, SPLIT_PART(ael.name, ':', -1), ael.name) AND jbi.azk_cluster_id = exejob.azk_cluster_id LEFT JOIN bdp_job job ON jbi.job_id = job.id AND job.name = exejob.job_id LEFT JOIN bdp_project bp ON bp.id = job.project_id LEFT JOIN meta_business_line mbl ON mbl.id = bp.business_line_id order by ael.upload_time desc|CpuTimeMS=0|ShuffleSendBytes=-1|ShuffleSendRows=-1|SqlHash=6afc0b2e05b48e345cea2a3237d1193b|peakMemoryBytes=0|SqlDigest=|TraceId=|WorkloadGroup=normal|FuzzyVariables=
2024-10-20 10:01:19,977 [slow_query] |Client=10.82.196.102:26934|User=root|Ctl=internal|Db=task_diagnosis|State=ERR|ErrorCode=1105|ErrorMessage=errCode = 2, detailMessage = There is no scanNode Backend available.[10005: in black list(send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason), 10006: in black list(send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason)]|Time(ms)=5705|ScanBytes=0|ScanRows=0|ReturnRows=0|StmtId=22275356|QueryId=db57e4fb90214ee0-8c5382fbb0fe34ca|IsQuery=true|isNereids=true|feIp=10.82.195.127|Stmt=with logTable as ( select * from azkaban_execution_logs where 1 = 1 and app_id in ( 'application_1721006386533_7823729' ) ) SELECT jbi.id AS jobInstanceId, exeJob.attempt AS attemptId, ael.app_id AS appId, ael.query_id AS queryId, ael.exec_id AS execId, bfi.schedule_time AS scheduleTime, exejob.status as state, exejob.start_time AS startTime, exejob.end_time AS endTime, mbl.id AS businessLineId, mbl.code AS businessLineCode, mbl.name AS businessLineName, bp.id AS projectId, bp.code AS projectCode, bp.name AS projectName, bfi.flow_id AS flowId, exeflow.flow_id AS flowName, bfi.id AS flowInstanceId, job.id AS jobId, job.name AS jobName, jbi.start_time AS jobInstanceStartTime, jbi.end_time AS jobInstanceEndTime, ael.exec_id AS azkExecutionId, job.owner_id AS ownerId, job.owner AS owner, job.creation_date AS creationDate, job.created_by AS createdBy, job.updated_by AS updatedBy, job.updation_date AS updationDate, jbi.azk_cluster_id AS azkClusterId, jbi.job_version AS jobVersion, job.node_type AS nodetype FROM logTable ael LEFT JOIN azkaban_execution_jobs exejob ON exejob.exec_id = ael.exec_id AND exejob.job_id = if(instr(ael.name,':')>0, SPLIT_PART(ael.name, ':', -1), ael.name) AND exeJob.attempt = ael.attempt AND exejob.azk_cluster_id = ael.azk_cluster_id LEFT JOIN azkaban_execution_flows exeflow ON exejob.exec_id = exeflow.exec_id AND exeflow.flow_id = 
if(instr(exejob.flow_id,',')>0, SPLIT_PART(exejob.flow_id, ',', 1), exejob.flow_id) AND exejob.azk_cluster_id = exeflow.azk_cluster_id LEFT JOIN bdp_flow_instance bfi ON bfi.execution_id = exeflow.exec_id LEFT JOIN bdp_job_instance jbi ON jbi.flow_instance_id = bfi.id AND jbi.job_name = if(instr(ael.name,':')>0, SPLIT_PART(ael.name, ':', -1), ael.name) AND jbi.azk_cluster_id = exejob.azk_cluster_id LEFT JOIN bdp_job job ON jbi.job_id = job.id AND job.name = exejob.job_id LEFT JOIN bdp_project bp ON bp.id = job.project_id LEFT JOIN meta_business_line mbl ON mbl.id = bp.business_line_id order by ael.upload_time desc|CpuTimeMS=0|ShuffleSendBytes=-1|ShuffleSendRows=-1|SqlHash=6afc0b2e05b48e345cea2a3237d1193b|peakMemoryBytes=0|SqlDigest=|TraceId=|WorkloadGroup=normal|FuzzyVariables=
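
The retry line in the audit log maps the original query id to a new one, so the failing statement has to be looked up under the new id. A rough Python sketch of that lookup, assuming the retry line format matches the lines shown above. The regular expression and the helper names `resolve_query_id` and `audit_records` are illustrative and not part of Doris, and the second sample line is shortened from the real record:

```python
import re

# Pattern inferred from the retry line in the audit log above; this is an
# assumption, since other Doris versions may word the line differently.
RETRY_RE = re.compile(r"Query (\S+) \d+ times with new query id: (\S+)")


def resolve_query_id(lines, query_id):
    """Follow 'with new query id' retry lines to the final query id."""
    retries = {}
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            retries[match.group(1)] = match.group(2)
    seen = set()  # guards against a cyclic mapping
    while query_id in retries and query_id not in seen:
        seen.add(query_id)
        query_id = retries[query_id]
    return query_id


def audit_records(lines, query_id):
    """Return the audit lines whose QueryId field equals query_id."""
    needle = f"|QueryId={query_id}|"
    return [line for line in lines if needle in line]


# Sample input: the retry line from the log, plus a shortened audit record.
sample = [
    "2024-10-20 10:01:13,095 [query] Query d74c474ba05d45f7-8413475ed9d74898 "
    "1 times with new query id: db57e4fb90214ee0-8c5382fbb0fe34ca",
    "2024-10-20 10:01:19,977 [query] |Client=10.82.196.102:26934|User=root|"
    "State=ERR|QueryId=db57e4fb90214ee0-8c5382fbb0fe34ca|IsQuery=true|",
]

final_id = resolve_query_id(sample, "d74c474ba05d45f7-8413475ed9d74898")
print(final_id)
print(len(audit_records(sample, final_id)))
```

In practice `lines` would be read from the audit log files of every FE, since the record is written only on the FE that received the statement.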