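Doris version: doris-2.1.4-rc03-e93678fd1e
Deployment: 3 FE + 4 BE
Configuration: 6 servers deployed, each with 16C 64GB 100GB+1TB
Operating system: Rocky Linux release 8.5 (Green Obsidian)
Symptom: when Superset queries a Hive database through a Doris catalog, queries fail frequently. Superset reports: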
there is no scanNode Backend available.[193000: in list(send fragments failed. io.grpc.StatusRuntimeException:UNAVAILABLE:io exception)],
or:
the upstream server is timing out
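While this is happening no data can be queried; after waiting a while, queries work again.
Checking the status of the FE and BE nodes shows that they are all alive; none of them went offline.
The BE logs contain errors and timeouts. Testing with curl shows that the addresses that timed out are reachable. The log is as follows: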
W20241027 10:35:20.961900 1516180 status.h:412] meet error status: [INTERNAL_ERROR]Could not find local receiver for node 1 with instance 56eb3fc8ba154337-9eab3dfc619b2c3e
0# doris::vectorized::VDataStreamMgr::find_recvr(doris::TUniqueId const&, int, std::shared_ptr<doris::vectorized::VDataStreamRecvr>*, bool) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
1# doris::vectorized::VDataStreamMgr::transmit_block(doris::PTransmitDataParams const*, google::protobuf::Closure**) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:360
2# doris::PInternalServiceImpl::_transmit_block(google::protobuf::RpcController*, doris::PTransmitDataParams const*, doris::PTransmitDataResult*, google::protobuf::Closure*, doris::Status const&) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:377
3# doris::PInternalServiceImpl::transmit_block(google::protobuf::RpcController*, doris::PTransmitDataParams const*, doris::PTransmitDataResult*, google::protobuf::Closure*) at /home/zcp/repo_center/doris_release/doris/be/src/service/internal_service.cpp:1523
4# brpc::policy::ProcessRpcRequest(brpc::InputMessageBase*)
5# brpc::ProcessInputMessage(void*)
6# brpc::InputMessenger::InputMessageClosure::~InputMessageClosure()
7# brpc::InputMessenger::OnNewMessages(brpc::Socket*)
8# brpc::Socket::ProcessEvent(void*)
9# bthread::TaskGroup::task_runner(long)
10# bthread_make_fcontext
W20241027 10:35:20.970499 1516174 status.h:412] meet error status: [INTERNAL_ERROR]Could not find local receiver for node 1 with instance 56eb3fc8ba154337-9eab3dfc619b2c3e
0# doris::vectorized::VDataStreamMgr::find_recvr(doris::TUniqueId const&, int, std::shared_ptr<doris::vectorized::VDataStreamRecvr>*, bool) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
1# doris::vectorized::VDataStreamMgr::transmit_block(doris::PTransmitDataParams const*, google::protobuf::Closure**) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:360
2# doris::PInternalServiceImpl::_transmit_block(google::protobuf::RpcController*, doris::PTransmitDataParams const*, doris::PTransmitDataResult*, google::protobuf::Closure*, doris::Status const&) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:377
3# doris::PInternalServiceImpl::transmit_block(google::protobuf::RpcController*, doris::PTransmitDataParams const*, doris::PTransmitDataResult*, google::protobuf::Closure*) at /home/zcp/repo_center/doris_release/doris/be/src/service/internal_service.cpp:1523
4# brpc::policy::ProcessRpcRequest(brpc::InputMessageBase*)
5# brpc::ProcessInputMessage(void*)
6# brpc::InputMessenger::InputMessageClosure::~InputMessageClosure()
7# brpc::InputMessenger::OnNewMessages(brpc::Socket*)
8# brpc::Socket::ProcessEvent(void*)
9# bthread::TaskGroup::task_runner(long)
10# bthread_make_fcontext
W20241027 10:35:36.963789 1515983 fragment_mgr.cpp:1124] Could not find the query id:cbacbca63cb542bc-8cc2055aaf95c5bd fragment id:0 to cancel
W20241027 10:35:36.963892 1516044 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e61a0089abae43b1-9de53c40c6e18412
W20241027 10:35:36.963927 1516044 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e61a0089abae43b1-9de53c40c6e1840f
W20241027 10:35:36.963905 1516031 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e61a0089abae43b1-9de53c40c6e18412
W20241027 10:35:36.963972 1514914 fragment_mgr.cpp:432] report error status: to coordinator: TNetworkAddress(hostname=10.154.85.5, port=9020), query id: e61a0089abae43b1-9de53c40c6e1840d, instance id: 0-0
W20241027 10:35:36.964021 1514932 fragment_mgr.cpp:432] report error status: to coordinator: TNetworkAddress(hostname=10.154.85.5, port=9020), query id: e61a0089abae43b1-9de53c40c6e1840d, instance id: 0-0
W20241027 10:35:36.964596 1516002 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: ce667ebfedef45df-a23b0a71b30207a0
W20241027 10:35:36.964699 1514932 fragment_mgr.cpp:432] report error status: to coordinator: TNetworkAddress(hostname=10.154.85.5, port=9020), query id: ce667ebfedef45df-a23b0a71b302079e, instance id: 0-0
W20241027 10:35:37.063400 1516019 fragment_mgr.cpp:1124] Could not find the query id:40f9e077bf4d4a44-995472e3d449cec9 fragment id:0 to cancel
W20241027 10:35:37.079308 1515984 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 1411599fb24b4c9a-8d4fbcf079983bc3
W20241027 10:35:37.079419 1514932 fragment_mgr.cpp:432] report error status: to coordinator: TNetworkAddress(hostname=10.154.85.5, port=9020), query id: 1411599fb24b4c9a-8d4fbcf079983bc1, instance id: 0-0
W20241027 10:35:37.235734 1516059 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: a2029d15732b4fe5-a2afdaf5a58b948b
W20241027 10:35:37.235814 1514933 fragment_mgr.cpp:432] report error status: to coordinator: TNetworkAddress(hostname=10.154.85.5, port=9020), query id: a2029d15732b4fe5-a2afdaf5a58b9489, instance id: 0-0
W20241027 10:47:40.422154 1516072 fragment_mgr.cpp:1124] Could not find the query id:58d922634ce047b4-9572ea4bbbdd3780 fragment id:1 to cancel
W20241027 10:47:40.422281 1516029 fragment_mgr.cpp:1124] Could not find the query id:e77e536bb612432f-b0d24eddf7417f2b fragment id:0 to cancel
W20241027 10:47:40.422331 1516027 fragment_mgr.cpp:1124] Could not find the query id:58d922634ce047b4-9572ea4bbbdd3780 fragment id:0 to cancel
W20241027 10:48:35.805197 1516180 input_messenger.cpp:362] Fail to read from Socket{id=7459 fd=889 addr=10.154.213.10:36624:8060} (0x7f272e119100): Connection reset by peer [104]
W20241027 10:48:35.814724 1516174 input_messenger.cpp:362] Fail to read from Socket{id=7235 fd=892 addr=10.154.213.10:36640:8060} (0x7f2739e07980): Connection reset by peer [104]
W20241027 11:35:09.455665 1516111 input_messenger.cpp:362] Fail to read from Socket{id=8023 fd=913 addr=10.154.213.10:36922:8060} (0x7f27384d3680): Connection reset by peer [104]
W20241027 11:41:19.047667 1514826 fragment_mgr.cpp:1156] Query e5bf4208586d43ca-a426606bcd3daf4c is timeout
W20241027 11:41:19.891362 1516110 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e5bf4208586d43ca-a426606bcd3daf55
W20241027 11:41:19.891403 1516110 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e5bf4208586d43ca-a426606bcd3daf56
W20241027 11:41:19.891418 1516110 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e5bf4208586d43ca-a426606bcd3daf57
W20241027 11:41:19.891428 1516110 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e5bf4208586d43ca-a426606bcd3daf58
W20241027 11:41:19.891443 1516110 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e5bf4208586d43ca-a426606bcd3daf59
W20241027 11:41:19.891451 1516110 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e5bf4208586d43ca-a426606bcd3daf5a
W20241027 11:41:19.891461 1516110 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e5bf4208586d43ca-a426606bcd3daf5b
W20241027 11:41:19.891470 1516110 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: e5bf4208586d43ca-a426606bcd3daf5c
W20241027 11:41:19.891496 1516110 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
W20241027 11:41:19.907758 1516338 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
W20241027 11:41:19.919452 1516219 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
W20241027 11:41:20.049091 1516355 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
W20241027 11:41:20.284808 1516219 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
W20241027 11:41:20.291518 1516207 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
W20241027 11:41:20.346159 1516338 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
W20241027 11:41:20.483352 1516338 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
W20241027 11:41:20.488008 1514929 fragment_mgr.cpp:432] report error status: failed to send brpc when exchange, error=RPC call is timed out, error_text=[E1008]Reached timeout=900000ms @10.154.213.7:8060, client: 10.154.213.6, latency = 900000114 to coordinator: TNetworkAddress(hostname=10.154.85.10, port=9020), query id: e5bf4208586d43ca-a426606bcd3daf4c, instance id: 0-0
W20241027 11:42:10.920855 1516347 input_messenger.cpp:362] Fail to read from Socket{id=1469 fd=3106 addr=10.154.213.7:8060:52454} (0x7f27f6a07600): Connection timed out [110]
W20241027 11:42:10.920944 1516247 socket.cpp:1707] Fail to keep-write into Socket{id=1469 fd=3106 addr=10.154.213.7:8060:52454} (0x7f27f6a07600): Broken pipe [32]
W20241027 11:42:10.921293 1516247 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 9fa25f471ab5493e-83b7e381f6527c29
W20241027 11:42:10.921320 1516247 ref_count_closure.h:115] RPC meet failed: [E110]Fail to read from Socket{id=1469 fd=3106 addr=10.154.213.7:8060:52454} (0x0x7f27f6a07600): Connection timed out [R1][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R2][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R3][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R4][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R5][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R6][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R7][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R8][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R9][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R10][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469
W20241027 11:42:10.922276 1514893 fragment_mgr.cpp:432] report error status: failed to send brpc when exchange, error=Host is down, error_text=[E110]Fail to read from Socket{id=1469 fd=3106 addr=10.154.213.7:8060:52454} (0x0x7f27f6a07600): Connection timed out [R1][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R2][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R3][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R4][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R5][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R6][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R7][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R8][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R9][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469 [R10][E112]Not connected to 10.154.213.7:8060 yet, server_id=1469, client: 10.154.213.6, latency = 405749787 to coordinator: TNetworkAddress(hostname=10.154.85.5, port=9020), query id: 9fa25f471ab5493e-83b7e381f6527c27, instance id: 0-0
W20241027 11:42:57.820153 1516338 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: d021206562414556-9d99fdec2dab6d96
W20241027 11:42:57.820219 1516338 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.9:8060
W20241027 11:42:57.821683 1514930 fragment_mgr.cpp:432] report error status: failed to send brpc when exchange, error=RPC call is timed out, error_text=[E1008]Reached timeout=900000ms @10.154.213.9:8060, client: 10.154.213.6, latency = 900000201 to coordinator: TNetworkAddress(hostname=10.154.85.10, port=9020), query id: d021206562414556-9d99fdec2dab6d94, instance id: 0-0
W20241027 11:43:22.636842 1516207 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 109ba02afb7a458b-8e56b470d85283ff
W20241027 11:43:22.636914 1516207 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.9:8060
W20241027 11:43:22.637897 1514914 fragment_mgr.cpp:432] report error status: failed to send brpc when exchange, error=RPC call is timed out, error_text=[E1008]Reached timeout=900000ms @10.154.213.9:8060, client: 10.154.213.6, latency = 900000152 to coordinator: TNetworkAddress(hostname=10.154.85.5, port=9020), query id: 109ba02afb7a458b-8e56b470d85283fd, instance id: 0-0
W20241027 11:43:49.224789 1516111 input_messenger.cpp:362] Fail to read from Socket{id=1017 fd=3082 addr=10.154.213.9:8060:51908} (0x7f292f41a600): Connection timed out [110]
W20241027 11:50:06.630889 1516338 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 9991cc84b2954229-ad2f4af62c29d60c
W20241027 11:50:06.630955 1516338 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.8:8060
W20241027 11:50:06.632335 1514929 fragment_mgr.cpp:432] report error status: failed to send brpc when exchange, error=RPC call is timed out, error_text=[E1008]Reached timeout=900000ms @10.154.213.8:8060, client: 10.154.213.6, latency = 900000154 to coordinator: TNetworkAddress(hostname=10.154.85.5, port=9020), query id: 9991cc84b2954229-ad2f4af62c29d60a, instance id: 0-0
W20241027 11:50:10.594475 1516219 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 5a3fafef57cf4a37-8488916128d4ff3f
W20241027 11:50:10.594518 1516219 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 5a3fafef57cf4a37-8488916128d4ff40
W20241027 11:50:10.594532 1516219 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 5a3fafef57cf4a37-8488916128d4ff41
W20241027 11:50:10.594542 1516219 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 5a3fafef57cf4a37-8488916128d4ff42
W20241027 11:50:10.594552 1516219 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 5a3fafef57cf4a37-8488916128d4ff43
W20241027 11:50:10.594574 1516219 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 5a3fafef57cf4a37-8488916128d4ff44
W20241027 11:50:10.594589 1516219 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 5a3fafef57cf4a37-8488916128d4ff45
W20241027 11:50:10.594599 1516219 pipeline_x_fragment_context.cpp:154] PipelineXFragmentContext cancel instance: 5a3fafef57cf4a37-8488916128d4ff46
W20241027 11:50:10.594626 1516219 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.8:8060
W20241027 11:50:10.749457 1516355 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.8:8060
W20241027 11:50:10.761426 1516338 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.8:8060
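
To see which BE peers the RPC timeouts point at, the log above can be grouped by target address. The following is a minimal sketch, not a Doris tool: the function name `summarize_rpc_timeouts` and the regular expression are assumptions based only on the `RPC meet failed: [E1008]Reached timeout=...ms @host:port` lines shown in this log, and the sample lines are copied from it.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060
# The pattern is an assumption derived from the log above, not a documented format.
RPC_TIMEOUT = re.compile(
    r"RPC meet failed: \[E1008\]Reached timeout=(\d+)ms @([\d.]+):(\d+)"
)


def summarize_rpc_timeouts(lines):
    """Count [E1008] RPC timeouts per (peer address, timeout in ms)."""
    counts = Counter()
    for line in lines:
        match = RPC_TIMEOUT.search(line)
        if match:
            timeout_ms, host, port = match.groups()
            counts[(f"{host}:{port}", int(timeout_ms))] += 1
    return counts


# A few lines copied from the BE log above.
sample = [
    "W20241027 11:41:19.047667 1514826 fragment_mgr.cpp:1156] Query e5bf4208586d43ca-a426606bcd3daf4c is timeout",
    "W20241027 11:41:19.891496 1516110 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060",
    "W20241027 11:41:19.907758 1516338 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.7:8060",
    "W20241027 11:42:57.820219 1516338 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=900000ms @10.154.213.9:8060",
]

counts = summarize_rpc_timeouts(sample)
for (peer, timeout_ms), n in counts.most_common():
    print(f"{peer}: {n} RPC timeouts at {timeout_ms} ms")
```

To run it on a real log, pass an open file object instead of `sample`. Each query failure is counted once per `RPC meet failed` line; the matching `report error status` lines are skipped on purpose because they repeat the same error text.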