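Doris version: selectdb-doris-2.1.3-rc09-zyxfjr01-734c50b0ef
Symptoms: starting at 11:50 on August 12
1. Metadata statements such as SHOW BACKENDS; and SHOW FRONTENDS; still run normally.
2. The FE and BE nodes all appear healthy.
3. Memory usage shows no obvious spike.
4. Data queries block indefinitely, with no error returned.
5. 99th-percentile latency spiked to over 10 minutes, and every query after that was blocked.
Logs: fe.out contains the error below. Going through the logs, it first appears on August 9; it did not occur before that.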
2024-08-12 13:34:56,007 WARN (mysql-nio-pool-56|495) [Coordinator.cancel():1473] Query 34b12e4742d94db9-97e2df2ac81eb933 already in abnormal status Status [errorCode=INTERNAL_ERROR, errorMsg=send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: io exception], but received cancel again,so that send cancel to BE again
java.lang.Exception: null
    at org.apache.doris.qe.Coordinator.cancel(Coordinator.java:1475) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.sendResult(StmtExecutor.java:1802) ~[doris-fe.jar:1.2-SNAPSHOT]
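For anyone who wants to double-check when this error first showed up, here is a rough Python sketch that scans fe.out lines for a substring and returns the earliest timestamp. It assumes each log entry starts with a timestamp shaped like the one above (`YYYY-MM-DD HH:MM:SS,mmm`); the function name `first_occurrence` and the search string are my own choices, not anything from Doris.

```python
import re
from datetime import datetime

# Assumed fe.out timestamp prefix, e.g. "2024-08-12 13:34:56,007 WARN ..."
FE_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")


def first_occurrence(lines, needle):
    """Return the earliest datetime of a line containing `needle`, or None.

    `lines` is any iterable of strings (for example an open file).
    Lines without a leading timestamp, such as stack frames, are skipped.
    """
    earliest = None
    for line in lines:
        if needle not in line:
            continue
        match = FE_TS.match(line)
        if match is None:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if earliest is None or ts < earliest:
            earliest = ts
    return earliest
```

Something like `first_occurrence(open("fe.out", errors="replace"), "send fragments failed")` would then give the first time the message was logged in that file. Rotated log files would need to be scanned separately.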
The most frequent messages in be.WARNING:
W20240812 11:50:55.486326 215792 fragment_mgr.cpp:1075] Could not find the query id:c93bf76d74d94430-8fbff59b31581e92 fragment id:95 to cancel
W20240812 11:52:44.722160 137099 fragment_mgr.cpp:409] report error status: to coordinator: TNetworkAddress(hostname=xx.xx.xx.xx, port=9020), query id: b2c0bac0ede143b2-9e1c04ca05702783, instance id: 0-0
W20240812 11:50:55.485359 215766 pipeline_x_fragment_context.cpp:153] PipelineXFragmentContext cancel instance: ee4951b599a64c78-b3936b7c445bd17d
W20240812 12:06:55.203903 215821 runtime_state.cpp:547] registe global ins:Fragment 6d384cb0b41843c1-a28354a3e7dd338e ,mgr: 0x7f4e05fb8a00 ,filter id:1
W20240812 12:23:51.962389 113517 vtablet_writer.cpp:583] cancel node channel VNodeChannel[6231826-3213511], load_id=5a4d6b9e3e8acc17-f20f3298b35d41af, txn_id=577866193, node=XX.X.XX.XX:8060, error message: [CANCELLED]cur path: . cancelled: sender is gone
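In case it helps others compare against their own clusters, "most frequent" can be measured by counting warnings per source location. The sketch below is only an illustration: it assumes be.WARNING lines follow the glog-style layout seen above (level letter, date, time, thread id, `file.cpp:line]`, message), and `count_by_source` is a made-up helper name.

```python
import re
from collections import Counter

# Assumed layout: "W20240812 11:50:55.486326 215792 fragment_mgr.cpp:1075] message"
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) (?P<source>[^\s:]+:\d+)\] (?P<message>.*)$"
)


def count_by_source(lines):
    """Count log entries per "file.cpp:line" source location.

    Lines that do not match the assumed layout are ignored.
    """
    counts = Counter()
    for line in lines:
        match = GLOG_LINE.match(line.strip())
        if match:
            counts[match.group("source")] += 1
    return counts


# Three of the lines quoted in this post, used as sample input.
SAMPLE = [
    "W20240812 11:50:55.486326 215792 fragment_mgr.cpp:1075] Could not find the query id:c93bf76d74d94430-8fbff59b31581e92 fragment id:95 to cancel",
    "W20240812 11:50:55.485359 215766 pipeline_x_fragment_context.cpp:153] PipelineXFragmentContext cancel instance: ee4951b599a64c78-b3936b7c445bd17d",
    "W20240812 12:06:55.203903 215821 runtime_state.cpp:547] registe global ins:Fragment 6d384cb0b41843c1-a28354a3e7dd338e ,mgr: 0x7f4e05fb8a00 ,filter id:1",
]

for source, n in count_by_source(SAMPLE).most_common():
    print(source, n)
```

On a real file, passing the open file object instead of `SAMPLE` and looking at `most_common(10)` should show which warnings dominate around the time the queries started blocking.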
Resolution: restarting all of the BEs brought things back to normal. I'm not sure whether the problem will come back. Has anyone else run into this?