BE node port 8060 abnormal; routine load stops consuming Kafka messages

At 2024-04-04 12:18 we received a Kafka consumer lag alert. The routine load job status was normal, but it was not consuming Kafka messages.

[Screenshot: machine monitoring of the abnormal BE node]

fe.log contains many "failed to get latest offsets" warnings:

2024-04-04 12:18:01,655 WARN (Routine load task scheduler|48) [KafkaUtil.getLatestOffsets():212] failed to get latest offsets.

be.out

start time: Thu Apr  4 21:24:01 CST 2024
INFO: java_cmd /usr/lib/jvm/java/bin/java
INFO: jdk_version 8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/doris/be/lib/java_extensions/preload-extensions/preload-extensions-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/doris/be/lib/java_extensions/java-udf/java-udf-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/doris/be/lib/hadoop_hdfs/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
*** Query id: 0-0 ***
*** tablet id: 0 ***
*** Aborted at 1712241970 (unix time) try "date -d @1712241970" if you are using GNU date ***
*** Current BE git commitID: 91efb6a43d ***
*** SIGSEGV unknown detail explain (@0x0) received by PID 1450 (TID 3459 OR 0x7fd463adc700) from PID 0; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:417
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
 4# 0x00007FE0C001B400 in /lib64/libc.so.6
 5# __GI___pthread_mutex_lock in /lib64/libpthread.so.0
 6# std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<doris::PBackendService_Stub> >* phmap::priv::parallel_hash_set<8ul, phmap::priv::raw_hash_set, std::mutex, phmap::priv::FlatHashMapPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<doris::PBackendService_Stub> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<doris::PBackendService_Stub> > > >::find_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, phmap::LockableBaseImpl<std::mutex>::WriteLock>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, phmap::LockableBaseImpl<std::mutex>::WriteLock&) at /home/zcp/repo_center/doris_release/doris/thirdparty/installed/include/parallel_hashmap/phmap.h:3736
 7# bool phmap::priv::parallel_hash_set<8ul, phmap::priv::raw_hash_set, std::mutex, phmap::priv::FlatHashMapPolicy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<doris::PBackendService_Stub> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<doris::PBackendService_Stub> > > >::modify_if_impl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, doris::BrpcClientCache<doris::PBackendService_Stub>::get_client(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(auto:1 const&)#1}&, phmap::LockableBaseImpl<std::mutex>::WriteLock>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, doris::BrpcClientCache<doris::PBackendService_Stub>::get_client(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(auto:1 const&)#1}&) in /usr/local/doris/be/lib/doris_be
 8# doris::BrpcClientCache<doris::PBackendService_Stub>::get_client(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at /home/zcp/repo_center/doris_release/doris/be/src/util/brpc_client_cache.h:95
 9# doris::BrpcClientCache<doris::PBackendService_Stub>::get_client(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) at /home/zcp/repo_center/doris_release/doris/be/src/util/brpc_client_cache.h:90
10# doris::CheckRPCChannelAction::handle(doris::HttpRequest*) at /home/zcp/repo_center/doris_release/doris/be/src/http/action/check_rpc_channel_action.cpp:85
11# 0x000055F83C432E37 in /usr/local/doris/be/lib/doris_be
12# bufferevent_run_readcb_ in /usr/local/doris/be/lib/doris_be
13# 0x000055F83C435053 in /usr/local/doris/be/lib/doris_be
14# 0x000055F83C41BFB9 in /usr/local/doris/be/lib/doris_be
15# 0x000055F83C41C637 in /usr/local/doris/be/lib/doris_be
16# 0x000055F83C41EC68 in /usr/local/doris/be/lib/doris_be
17# std::_Function_handler<void (), doris::EvHttpServer::start()::$_0>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
18# doris::ThreadPool::dispatch_thread() in /usr/local/doris/be/lib/doris_be
19# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
20# start_thread in /lib64/libpthread.so.0
21# clone in /lib64/libc.so.6
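
To line the crash up with the other timestamps in this post, the epoch value in the "Aborted at" line can be converted to local time. A minimal sketch, assuming "CST" in be.out means China Standard Time (UTC+8); the helper name is made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in be.out means China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8), name="CST")


def epoch_to_cst(epoch_seconds: int) -> str:
    """Format a unix timestamp as a wall-clock time in UTC+8."""
    return datetime.fromtimestamp(epoch_seconds, tz=CST).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Value taken from the "*** Aborted at 1712241970 (unix time) ***" line.
    print(epoch_to_cst(1712241970))
```

This is the same conversion as the `date -d @1712241970` hint in the log, only pinned to UTC+8 regardless of the machine's time zone. For this trace it gives 2024-04-04 22:46:10, roughly 13 minutes before the 22:59:02 "does not exist or not alive" error quoted further down.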

At 2024-04-04 21:23:16 the application side started getting RPC errors when accessing the Doris cluster:

Application error 2024-04-04 21:23:16 [-][-][-][error][application] ..........query exception: PDOStatement::execute(): SQLSTATE[HY000]: General error: 1105 RpcException, msg: send fragments failed. io.grpc.StatusRuntimeException: UNAVAILABLE: io exception, host: .....

Application error 2024-04-04 22:59:02 [-][-][-][error][application] .......exception: PDOStatement::execute(): SQLSTATE[HY000]: General error: 1105 errCode = 2, detailMessage = tablet 1100807 has no queryable replicas. err: replica 1100809's backend 10146 does not exist or not alive, replica 1100808's backend 10092 does not exist or not alive

Application error 2024-04-04 23:09:44 [-][-][-][error][application] .....query exception: PDOStatement::execute(): SQLSTATE[HY000]: General error: 1105 errCode = 2, detailMessage = (.....)[CANCELLED]failed to send brpc when exchange, error=Host is down, error_text=[E112]Not connected to ....:8060 yet, server_id=908 [R1][E112]Not connected to ....:8060 yet, server_id=908 [R2][E112]Not connected to ....:8060 yet, server_id=908 [R3][E112]Not connected to ....:8060 yet, server_id=908 [R4][E112]Not connected to ....:8060 yet, server_id=908 [R5][E1
2 Answers

Check the logs for the time windows where the monitoring looks abnormal: 12:00-12:30 and 14:00-14:30.
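
A rough way to pull those windows out of fe.log is to count the warning per minute. The sketch below assumes every log entry starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, as in the line quoted in the question; the function name and the file path in the note afterwards are placeholders.

```python
from collections import Counter
from datetime import datetime

NEEDLE = "failed to get latest offsets"


def count_per_minute(lines, start, end, needle=NEEDLE):
    """Count lines containing `needle` per minute, for start <= timestamp < end."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        try:
            # The first 19 characters are "YYYY-MM-DD HH:MM:SS".
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # e.g. a stack-trace continuation line without a timestamp
        if start <= ts < end:
            counts[ts.strftime("%H:%M")] += 1
    return counts


if __name__ == "__main__":
    # The single fe.log line quoted in the question, used as sample input.
    sample = [
        "2024-04-04 12:18:01,655 WARN (Routine load task scheduler|48) "
        "[KafkaUtil.getLatestOffsets():212] failed to get latest offsets.",
    ]
    window = (datetime(2024, 4, 4, 12, 0), datetime(2024, 4, 4, 12, 30))
    print(count_per_minute(sample, *window))
```

To run it on a real file, pass something like `open("fe.log", encoding="utf-8", errors="replace")` as `lines`, then repeat with the 14:00-14:30 window and compare the busy minutes against the monitoring graphs.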

  1. fe.log contains many "failed to get latest offsets" exceptions. The routine load job status is normal, but it does not consume Kafka messages, so messages pile up. Restarting the FE restores consumption.

  2. This looks like the cause: writes to BDB are taking a long time.

  3. After setting the BE config sync_tablet_meta = false, fe.log no longer showed "failed to get latest offsets" exceptions, but there were still some SocketException errors.

  4. After observing for a while, fe.warn.log still showed "failed to get latest offsets" exceptions, and the not-consuming problem can still be reproduced.

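  5. On the master FE node, netstat -anpt shows congestion in the send queue (Send-Q).
    Adjusted the TCP parameters on the Alibaba Cloud ECS servers as follows and will keep observing: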
    net.core.somaxconn = 2048
    net.core.rmem_default = 262144
    net.core.rmem_max = 16777216
    net.core.wmem_default = 262144
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 262144 16777216
    net.ipv4.tcp_wmem = 4096 262144 16777216
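
To confirm the new values actually took effect on each node, the settings can be compared with what the kernel reports under /proc/sys. A small sketch, assuming a Linux host; the function names are illustrative, and the expected values are simply the ones listed above.

```python
from pathlib import Path

# The values listed above, copied as-is.
EXPECTED = """
net.core.somaxconn = 2048
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
"""


def parse_sysctl(text):
    """Parse 'key = value' lines into a dict, normalising whitespace in values."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key, root="/proc/sys"):
    """Map a sysctl key such as net.core.somaxconn to its file under /proc/sys."""
    return Path(root, *key.split("."))


def find_mismatches(expected, root="/proc/sys"):
    """Return {key: (expected, actual)} for settings that differ or cannot be read."""
    mismatches = {}
    for key, want in expected.items():
        try:
            # The kernel separates multi-value entries with tabs; normalise them.
            have = " ".join(proc_path(key, root).read_text().split())
        except OSError:
            have = None
        if have != want:
            mismatches[key] = (want, have)
    return mismatches


if __name__ == "__main__":
    for key, (want, have) in find_mismatches(parse_sysctl(EXPECTED)).items():
        print(f"{key}: expected {want!r}, found {have!r}")
```

No output means every listed key already matches on that machine. This only checks the running values; whether they persist across a reboot depends on where they were written, which the post does not say.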