Doris stable release 2.1.7: query timeouts and unexplained crashes


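Bug 1: timeout errors are reported frequently, but I cannot find the specific cause.
Bug 2: the BE nodes restart frequently. Every restart prints GC logs, and be.out keeps reporting the errors shown further down.
Additional information for bug 2: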

Process 79150 (doris_be) of user 1000 killed by SIGABRT - dumping core
Executable '/data/doris/be/lib/doris_be' doesn't belong to any package and ProcessUnpackaged is set to 'no'

What is actually causing this? Please help me troubleshoot it. Thanks.

**Bug 1 error log:**

W20241231 22:07:57.708006 25363 status.h:413] meet error status: [TIMEOUT]Query tiemout

        0#  doris::ResultBufferMgr::cancel_thread() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/result_buffer_mgr.cpp:210
        1#  doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
        2#  start_thread
        3#  __clone
W20241231 22:07:57.708715 25363 status.h:413] meet error status: [TIMEOUT]Query tiemout

        0#  doris::ResultBufferMgr::cancel_thread() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/result_buffer_mgr.cpp:210
        1#  doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
        2#  start_thread
        3#  __clone
W20241231 22:07:57.708735 25363 status.h:413] meet error status: [TIMEOUT]Query tiemout

        0#  doris::ResultBufferMgr::cancel_thread() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/result_buffer_mgr.cpp:210
        1#  doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
        2#  start_thread
        3#  __clone
W20241231 22:07:57.708747 25363 status.h:413] meet error status: [TIMEOUT]Query tiemout

        0#  doris::ResultBufferMgr::cancel_thread() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/result_buffer_mgr.cpp:210
        1#  doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
        2#  start_thread
        3#  __clone
W20241231 22:07:57.708760 25363 status.h:413] meet error status: [TIMEOUT]Query tiemout

        0#  doris::ResultBufferMgr::cancel_thread() at /home/zcp/repo_center/doris_release/doris/be/src/runtime/result_buffer_mgr.cpp:210
        1#  doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
        2#  start_thread
        3#  __clone


**Bug 2 errors:**

be.out reports:
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** Query id: 0-0 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1735748165 (unix time) try "date -d @1735748165" if you are using GNU date ***
*** Current BE git commitID: 443e87e203 ***
*** SIGABRT unknown detail explain (@0x3e80000b553) received by PID 46419 (TID 47956 OR 0x7f54bc285700) from PID 46419; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421
1# 0x00007F58398E7400 in /lib64/libc.so.6
2# __GI_raise in /lib64/libc.so.6
3# abort in /lib64/libc.so.6
4# __gnu_cxx::__verbose_terminate_handler() [clone .cold] at ../../../../libstdc++-v3/libsupc++/vterminate.cc:75
5# __cxxabiv1::__terminate(void (*)()) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
6# 0x000055FC5BD799C1 in /data/doris/be/lib/doris_be
7# 0x000055FC5BD79B14 in /data/doris/be/lib/doris_be
8# std::__throw_system_error(int) at ../../../../../libstdc++-v3/src/c++11/system_error.cc:338
9# 0x000055FC5BE4509D in /data/doris/be/lib/doris_be
10# std::thread::thread<void (*)(std::shared_ptr), std::shared_ptr&, void>(void (*&&)(std::shared_ptr), std::shared_ptr&) in /data/doris/be/lib/doris_be
11# apache::thrift::concurrency::Thread::start() in /data/doris/be/lib/doris_be
12# apache::thrift::server::TThreadedServer::onClientConnected(std::shared_ptr const&) in /data/doris/be/lib/doris_be
13# apache::thrift::server::TServerFramework::newlyConnectedClient(std::shared_ptr const&) in /data/doris/be/lib/doris_be
14# apache::thrift::server::TServerFramework::serve() in /data/doris/be/lib/doris_be
15# apache::thrift::server::TThreadedServer::serve() in /data/doris/be/lib/doris_be

be.gc shows:

Java HotSpot(TM) 64-Bit Server VM (25.261-b12) for linux-amd64 JRE (1.8.0_261-b12), built on Jun 17 2020 23:41:40 by "java_re" with gcc 7.3.0
Memory: 4k page, physical 527752888k(195506092k free), swap 0k(0k free)
CommandLine flags: -XX:-CriticalJNINatives -XX:InitialHeapSize=1073741824 -XX:MaxHeapSize=1073741824 -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
1.909: [GC (Metadata GC Threshold) 157304K->17327K(1005056K), 0.0401801 secs]
1.949: [Full GC (Metadata GC Threshold) 17327K->15242K(721408K), 0.0453656 secs]
2.685: [GC (Allocation Failure) 277386K->18877K(721408K), 0.0070888 secs]
2.930: [GC (Allocation Failure) 281021K->16338K(721408K), 0.0036951 secs]
3.146: [GC (Allocation Failure) 278482K->15858K(721408K), 0.0037470 secs]

be.out reports:

doris_be: rdkafka_broker.c:5756: rd_kafka_broker_add_logical: Assertion `rkb && "failed to create broker thread"' failed.
*** Query id: 0-0 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1735748235 (unix time) try "date -d @1735748235" if you are using GNU date ***
*** Current BE git commitID: 443e87e203 ***
*** SIGABRT unknown detail explain (@0x3e80001352e) received by PID 79150 (TID 80703 OR 0x7fca1f3d3700) from PID 79150; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421
1# 0x00007FCDB027B400 in /lib64/libc.so.6
2# __GI_raise in /lib64/libc.so.6
3# abort in /lib64/libc.so.6
4# __assert_fail_base in /lib64/libc.so.6
5# 0x00007FCDB0274252 in /lib64/libc.so.6
6# 0x0000558137791FDE in /data/doris/be/lib/doris_be
7# rd_kafka_cgrp_new in /data/doris/be/lib/doris_be
8# rd_kafka_new in /data/doris/be/lib/doris_be
9# RdKafka::KafkaConsumer::create(RdKafka::Conf const*, std::__cxx11::basic_string<char, std::char_traits, std::allocator >&) in /data/doris/be/lib/doris_be
10# doris::KafkaDataConsumer::init(std::shared_ptr) at /home/zcp/repo_center/doris_release/doris/be/src/runtime/routine_load/data_consumer.cpp:143
11# doris::DataConsumerPool::get_consumer(std::shared_ptr, std::shared_ptr) at /home/zcp/repo_center/doris_release/doris/be/src/runtime/routine_load/data_consumer_pool.cpp:71
12# doris::RoutineLoadTaskExecutor::get_kafka_latest_offsets_for_partitions(doris::PKafkaMetaProxyRequest const&, std::vector<doris::PIntegerPair, std::allocator >*, int) at /home/zcp/repo_center/doris_release/doris/be/src/runtime/routine_load/routine_load_task_executor.cpp:169
13# std::_Function_handler<void (), doris::PInternalServiceImpl::get_info(google::protobuf::RpcController*, doris::PProxyRequest const*, doris::PProxyResult*, google::protobuf::Closure*)::$_0>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291

be.gc shows:

Java HotSpot(TM) 64-Bit Server VM (25.261-b12) for linux-amd64 JRE (1.8.0_261-b12), built on Jun 17 2020 23:41:40 by "java_re" with gcc 7.3.0
Memory: 4k page, physical 527752888k(263590796k free), swap 0k(0k free)
CommandLine flags: -XX:-CriticalJNINatives -XX:InitialHeapSize=1073741824 -XX:MaxHeapSize=1073741824 -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
1.252: [GC (Metadata GC Threshold) 157307K->16956K(1005056K), 0.0464714 secs]
1.299: [Full GC (Metadata GC Threshold) 16956K->15241K(701952K), 0.0394015 secs]
2.004: [GC (Allocation Failure) 277385K->18609K(701952K), 0.0069175 secs]
2.253: [GC (Allocation Failure) 280753K->16377K(701952K), 0.0114080 secs]
2.474: [GC (Allocation Failure) 278521K->16185K(701952K), 0.0039811 secs]
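
Both be.out traces carry an "Aborted at ... (unix time)" line. As a small aid for lining the crashes up against other logs, the sketch below converts those two epoch values to UTC and prints the gap between them. The function name `abort_time` is made up for illustration; the timestamps are copied from the traces above.

```python
from datetime import datetime, timezone


def abort_time(epoch_seconds):
    """Convert the 'Aborted at <unix time>' value from be.out to a UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Values copied from the two "Aborted at ..." lines above.
first = abort_time(1735748165)   # std::system_error abort, PID 46419
second = abort_time(1735748235)  # rdkafka assertion abort, PID 79150
gap_seconds = (second - first).total_seconds()

print(first.isoformat())
print(second.isoformat())
print(f"gap between the two aborts: {gap_seconds:.0f} s")
```

The two aborts are about 70 seconds apart, and both fail while trying to start a new thread (`std::thread` in one trace, "failed to create broker thread" in the other).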




3 Answers

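Cause of error 1: the BE log prints "Query tiemout" without a query ID. The message is printed when the BufferControlBlock is cancelled 5 minutes after a query finishes, so it appears after every query. A PR that landed in 2.1.8 fixed this indirectly: timeout errors no longer print a stack trace. This is not actually an error; it is ResultBufferMgr doing garbage cleanup.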
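
To check whether the timeout warnings in a BE log are all of this harmless cleanup kind, one option is to look at frame 0 of each stack trace. The following is a minimal sketch, assuming the log has the same layout as the excerpt in the question (a glog-style header line followed by numbered frames). The function name `split_timeouts` and the two regular expressions are illustrative and not part of Doris.

```python
import re

# Header of a glog-style line, e.g.
# "W20241231 22:07:57.708006 25363 status.h:413] meet error status: [TIMEOUT]Query tiemout"
HEADER = re.compile(r"^[IWEF]\d{8} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ \S+\] (?P<msg>.*)$")
# First stack frame, e.g. "        0#  doris::ResultBufferMgr::cancel_thread() at ..."
FRAME0 = re.compile(r"^\s*0#\s+(?P<symbol>\S+)")


def split_timeouts(lines):
    """Return (cleanup, other) counts of [TIMEOUT] warnings.

    'cleanup' counts warnings whose frame 0 is ResultBufferMgr::cancel_thread;
    'other' counts every remaining [TIMEOUT] warning, including ones that
    carry no stack trace at all.
    """
    cleanup, other = 0, 0
    pending = False  # True after a [TIMEOUT] header until its frame 0 is seen
    for line in lines:
        header = HEADER.match(line)
        if header:
            if pending:
                other += 1  # previous timeout had no stack trace
            pending = "[TIMEOUT]" in header.group("msg")
            continue
        frame = FRAME0.match(line)
        if pending and frame:
            if frame.group("symbol").startswith("doris::ResultBufferMgr::cancel_thread"):
                cleanup += 1
            else:
                other += 1
            pending = False
    if pending:
        other += 1
    return cleanup, other


# Usage (the path is an assumption, adjust it to your deployment):
# with open("/data/doris/be/log/be.WARNING") as f:
#     print(split_timeouts(f))
```

If every warning lands in the first count, the log matches the cleanup behaviour described above. Anything in the second count comes from a different code path and is worth a separate look.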

Error 2: run cat /proc/{be_pid}/limits to see what the actual limits are.

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        4294967296           4294967296           bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             4096                 2060133              processes 
Max open files            65536                65536                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       2060133              2060133              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us   
W20250106 08:07:34.030717 32936 threadpool.cpp:457] Thread pool TabletCalcDeleteBitmapThreadPool failed to create thread: [RUNTIME_ERROR]Could not create thread. (error 11) Resource temporarily unavailable

        0#  doris::Thread::start_thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()> const&, unsigned long, scoped_refptr<doris::Thread>*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:445
        1#  doris::ThreadPool::create_thread() at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:244
        2#  doris::ThreadPool::do_submit(std::shared_ptr<doris::Runnable>, doris::ThreadPoolToken*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:491
        3#  doris::ThreadPoolToken::submit_func(std::function<void ()>) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        4#  doris::CalcDeleteBitmapToken::submit(std::shared_ptr<doris::Tablet>, std::shared_ptr<doris::Rowset>, std::shared_ptr<doris::segment_v2::Segment> const&, std::vector<std::shared_ptr<doris::Rowset>, std::allocator<std::shared_ptr<doris::Rowset> > > const&, long, std::shared_ptr<doris::DeleteBitmap>, doris::RowsetWriter*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:244
        5#  doris::Tablet::calc_delete_bitmap(std::shared_ptr<doris::Rowset>, std::vector<std::shared_ptr<doris::segment_v2::Segment>, std::allocator<std::shared_ptr<doris::segment_v2::Segment> > > const&, std::vector<std::shared_ptr<doris::Rowset>, std::allocator<std::shared_ptr<doris::Rowset> > > const&, std::shared_ptr<doris::DeleteBitmap>, long, doris::CalcDeleteBitmapToken*, doris::RowsetWriter*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/tablet.cpp:3200
        6#  doris::Tablet::commit_phase_update_delete_bitmap(std::shared_ptr<doris::Rowset> const&, std::unordered_set<doris::RowsetId, std::hash<doris::RowsetId>, std::equal_to<doris::RowsetId>, std::allocator<doris::RowsetId> >&, std::shared_ptr<doris::DeleteBitmap>, std::vector<std::shared_ptr<doris::segment_v2::Segment>, std::allocator<std::shared_ptr<doris::segment_v2::Segment> > > const&, long, doris::CalcDeleteBitmapToken*, doris::RowsetWriter*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/tablet.cpp:3533
        7#  doris::RowsetBuilder::submit_calc_delete_bitmap_task() at /home/zcp/repo_center/doris_release/doris/be/src/olap/rowset_builder.cpp:28
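
On Linux, error 11 is EAGAIN, which thread creation returns when a thread limit has been reached. One such limit is the "Max processes" soft limit (RLIMIT_NPROC), which counts threads as well as processes for the owning user; kernel-wide settings such as kernel.threads-max and kernel.pid_max can produce the same error. The following is a minimal sketch for comparing the BE's current thread count with that soft limit. The helper names `soft_limit`, `thread_count` and `report` are made up for illustration, and whether the 4096 in the table above is the value in effect on the affected nodes is an assumption to verify.

```python
import re
from pathlib import Path


def soft_limit(limits_text, name="Max processes"):
    """Return the soft limit for `name` from /proc/<pid>/limits text.

    Returns None when the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith(name):
            # Columns are separated by runs of two or more spaces.
            soft = re.split(r"\s{2,}", line.strip())[1]
            return None if soft == "unlimited" else int(soft)
    raise KeyError(name)


def thread_count(status_text):
    """Return the 'Threads:' value from /proc/<pid>/status text."""
    match = re.search(r"^Threads:\s+(\d+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Threads: line found")
    return int(match.group(1))


def report(pid):
    """Summarise thread usage of one process against its Max processes soft limit."""
    limit = soft_limit(Path(f"/proc/{pid}/limits").read_text())
    threads = thread_count(Path(f"/proc/{pid}/status").read_text())
    if limit is None:
        return f"pid {pid}: {threads} threads, Max processes is unlimited"
    return f"pid {pid}: {threads} threads, Max processes soft limit {limit}"


# Usage: print(report(be_pid)) where be_pid is the PID of doris_be.
```

Because RLIMIT_NPROC is accounted per user rather than per process, threads of every other process owned by the same user count toward the limit too, so the per-process number is a lower bound. If the thread count sits close to the soft limit, raising the limit for the user that runs the BE (for example with ulimit -u before starting it) and restarting is the usual next step to try.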