Environment: 1 FE + 6 BE, version 2.1.5
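Background: Over a JDBC connection, Hive data is imported into Doris via Stream Load. 66 tasks run in parallel, and none of them carries much data. The 66 tasks are computed in three batches, each taking about 20 minutes. Running them in parallel left three BEs in an abnormal state. After restarting the BE service on those three nodes, the cluster still could not work; only restarting everything resolved it. Details are as follows: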
W20241105 09:28:40.961076 122476 internal_service.cpp:1936] failed to response result of slave replica to master replica, error=RPC call is timed out, error_text=[E1008]Reached timeout=60000ms @10.105.159.111:8060, master host: 10.105.159.111, tablet_id=2749138, txn_id=321354
W20241105 09:29:12.824303 122217 status.h:412] meet error status: [E-242]move file to trash failed. file=/data/apache-doris-2.1.5-bin-x64/be/storage/data/183/2496468/433733619, target=/data/apache-doris-2.1.5-bin-x64/be/storage/trash/20241105092912.29895/2496468/433733619, err=unknown errno
W20241105 09:29:47.748860 122231 status.h:412] meet error status: [HTTP_ERROR]Operation timed out after 15000 milliseconds with 0 bytes received
W20241105 15:01:53.177908 249123 http_client.cpp:201] fail to execute HTTP client, errmsg=Operation timed out after 15000 milliseconds with 0 bytes received, trace=
0# doris::HttpClient::execute(std::function<bool (void const*, unsigned long)> const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
W20241105 10:16:07.592752 123899 ref_count_closure.h:115] RPC meet failed: [E1014]Got EOF of Socket{id=795 fd=532 addr=10.105.159.113:8060:57614} (0x0x7f616a61d580) [R1][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R2][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R3][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R4][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R5][E112]Not connected to 10.103.139.223:8060 yet, server_id=795 [R6][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R7][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R8][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R9][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R10][E112]Not connected to 10.105.159.113:8060 yet, server_id=795
W20241105 10:16:07.590818 123836 vtablet_writer.h:176] failed to send brpc batch, error=主机关闭, error_text=[E1014]Got EOF of Socket{id=795 fd=532 addr=10.105.159.113:8060:57614} (0x0x7f616a61d580) [R1][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R2][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R3][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R4][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R5][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R6][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R7][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R8][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R9][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R10][E112]Not connected to 10.105.159.113:8060 yet, server_id=795
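The warnings above are easier to compare across the six BEs when tallied by source file and error code. The following is only a minimal sketch: it assumes the lines follow the glog layout seen above (level letter, date, time, thread id, `file:line]`, message) and takes the first bracketed code such as `[E1008]`, `[E-242]` or `[HTTP_ERROR]` as the error code. The function names `parse_line` and `summarize` are hypothetical helpers, not part of Doris.

```python
import re
from collections import Counter

# glog-style prefix as seen in the BE warnings above, e.g.
# "W20241105 09:28:40.961076 122476 internal_service.cpp:1936] message"
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

# First bracketed code in the message: [E1008], [E-242], [HTTP_ERROR].
# Retry markers such as [R1] do not match this pattern.
ERROR_CODE = re.compile(r"\[(E-?\d+|[A-Z_]+)\]")


def parse_line(line):
    """Return a dict for a glog-style line, or None (e.g. for stack frames)."""
    match = GLOG_LINE.match(line.strip())
    if match is None:
        return None
    code = ERROR_CODE.search(match.group("msg"))
    return {
        "level": match.group("level"),
        "time": match.group("time"),
        "file": match.group("file"),
        "code": code.group(1) if code else None,
    }


def summarize(lines):
    """Count parsed lines per (source file, error code) pair."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["file"], record["code"])] += 1
    return counts
```

Passing an open log file to `summarize` (for example the BE warning log, whose exact path depends on the deployment) and printing `most_common()` would show which error dominates on each node around the time of the failure.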
Machine resources at the time of the failure:
Node status
I/O per node
QPS:
Memory
CPU
FE JVM