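Concurrent data loading brings down the cluster; restarting individual nodes has no effect, and only a full cluster restart helps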

Environment: 1 FE + 6 BE, version 2.1.5

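Background: Hive data is imported into Doris over JDBC connections using Stream Load, with 66 tasks running in parallel. None of the tasks carries much data. The 66 tasks are computed in three batches, each task takes about 20 minutes, and they all run concurrently. This left three BEs in an abnormal state. After restarting the BE service on those three nodes the cluster still could not work; only restarting everything resolved it. Details follow: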
W20241105 09:28:40.961076 122476 internal_service.cpp:1936] failed to response result of slave replica to master replica, error=RPC call is timed out, error_text=[E1008]Reached timeout=60000ms @10.105.159.111:8060, master host: 10.105.159.111, tablet_id=2749138, txn_id=321354

W20241105 09:29:12.824303 122217 status.h:412] meet error status: [E-242]move file to trash failed. file=/data/apache-doris-2.1.5-bin-x64/be/storage/data/183/2496468/433733619, target=/data/apache-doris-2.1.5-bin-x64/be/storage/trash/20241105092912.29895/2496468/433733619, err=unknown errno

W20241105 09:29:47.748860 122231 status.h:412] meet error status: [HTTP_ERROR]Operation timed out after 15000 milliseconds with 0 bytes received

W20241105 15:01:53.177908 249123 http_client.cpp:201] fail to execute HTTP client, errmsg=Operation timed out after 15000 milliseconds with 0 bytes received, trace=
0# doris::HttpClient::execute(std::function<bool (void const*, unsigned long)> const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187

W20241105 10:16:07.592752 123899 ref_count_closure.h:115] RPC meet failed: [E1014]Got EOF of Socket{id=795 fd=532 addr=10.105.159.113:8060:57614} (0x0x7f616a61d580) [R1][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R2][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R3][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R4][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R5][E112]Not connected to 10.103.139.223:8060 yet, server_id=795 [R6][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R7][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R8][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R9][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R10][E112]Not connected to 10.105.159.113:8060 yet, server_id=795

W20241105 10:16:07.590818 123836 vtablet_writer.h:176] failed to send brpc batch, error=主机关闭, error_text=[E1014]Got EOF of Socket{id=795 fd=532 addr=10.105.159.113:8060:57614} (0x0x7f616a61d580) [R1][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R2][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R3][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R4][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R5][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R6][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R7][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R8][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R9][E112]Not connected to 10.105.159.113:8060 yet, server_id=795 [R10][E112]Not connected to 10.105.159.113:8060 yet, server_id=795

Machine resources at the time of the failure were attached as screenshots: node status, I/O on each node, QPS, memory, CPU, and FE JVM.

1 Answer

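Judging from the log errors, this is a network bottleneck: the RPCs are timing out. I strongly recommend enabling batching mode for the import and loading the data batch by batch. Beyond that, keep the write rate of a single concurrent writer in Doris within 100,000 (10W) rows per second, and then tune that figure further according to your own network and disk I/O.
Doris write performance test reference: https://zhuanlan.zhihu.com/p/678885098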
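
To make the suggestion concrete, here is a rough client-side sketch of batching plus throttling. It is not Doris code and not necessarily what the batching mode does internally. The names `batched`, `RowRateLimiter`, `load_task`, `run_tasks`, and the `send_batch` callback (whatever actually issues one Stream Load request) are hypothetical. The batch size of 50,000 and the cap of 8 parallel tasks are assumed placeholders to tune; only the 100,000 rows/s ceiling comes from the advice above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def batched(rows: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


class RowRateLimiter:
    """Keeps one writer's average row rate at or below max_rows_per_sec.

    Not thread-safe: create one limiter per load task.
    """

    def __init__(self, max_rows_per_sec: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_rows_per_sec = max_rows_per_sec
        self._clock = clock
        self._sleep = sleep
        self._next_free = clock()  # earliest time the next batch may start

    def acquire(self, n_rows: int) -> float:
        """Block until n_rows may be sent; return the seconds waited."""
        now = self._clock()
        start = max(now, self._next_free)
        # Reserve the time slice this batch "costs" at the target rate.
        self._next_free = start + n_rows / self.max_rows_per_sec
        wait = start - now
        if wait > 0:
            self._sleep(wait)
        return wait


def load_task(rows: Iterable[str],
              send_batch: Callable[[List[str]], None],
              batch_size: int = 50_000,          # assumed value, tune it
              max_rows_per_sec: float = 100_000,  # ceiling from the answer
              limiter: Optional[RowRateLimiter] = None) -> int:
    """Send rows batch by batch under a rate limit; return rows sent."""
    if limiter is None:
        limiter = RowRateLimiter(max_rows_per_sec)
    sent = 0
    for batch in batched(rows, batch_size):
        limiter.acquire(len(batch))
        send_batch(batch)  # hypothetical: one load request per batch
        sent += len(batch)
    return sent


def run_tasks(tasks: List[Iterable[str]],
              send_batch: Callable[[List[str]], None],
              max_parallel: int = 8) -> List[int]:  # assumed cap, tune it
    """Run load tasks with bounded parallelism instead of all at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(load_task, rows, send_batch) for rows in tasks]
        return [f.result() for f in futures]
```

With something like this, the 66 tasks would be handed to `run_tasks` rather than started simultaneously, and `max_parallel`, `batch_size`, and `max_rows_per_sec` are the knobs to adjust while watching network and disk I/O on the BEs.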