[Resolved] Version 2.1.3: memory stays high after load-testing internal data queries, and subsequent queries keep failing

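Version 2.1.3. I imported several tables with millions of rows each (3 BE nodes, each with 16 cores and 64 GB of memory) and wrote a few load-test statements: multi-table joins returning every column at detail level. After running the load test with JMeter, BE memory usage stays high. Once memory usage reached 90%, all subsequent queries failed. BE warning log: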
W20240604 16:12:59.013475 31847 status.h:412] meet error status: [INTERNAL_ERROR]Failed to get query fragments context. Query may be timeout or be cancelled. host: 10.22.22.27

0#  doris::Status doris::FragmentMgr::_get_query_ctx<doris::TPipelineFragmentParams>(doris::TPipelineFragmentParams const&, doris::TUniqueId, bool, std::shared_ptr<doris::QueryContext>&) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:0
1#  doris::FragmentMgr::exec_plan_fragment(doris::TPipelineFragmentParams const&, std::function<void (doris::RuntimeState*, doris::Status*)> const&) at /home/zcp/repo_center/doris_release/doris/be/src/runtime/fragment_mgr.cpp:0
2#  doris::FragmentMgr::exec_plan_fragment(doris::TPipelineFragmentParams const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:244
3#  doris::PInternalServiceImpl::_exec_plan_fragment_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, doris::PFragmentRequestVersion, bool, std::function<void (doris::RuntimeState*, doris::Status*)> const&) at /home/zcp/repo_center/doris_release/doris/be/src/service/internal_service.cpp:0
4#  doris::PInternalServiceImpl::_exec_plan_fragment_in_pthread(google::protobuf::RpcController*, doris::PExecPlanFragmentRequest const*, doris::PExecPlanFragmentResult*, google::protobuf::Closure*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:377
5#  doris::WorkThreadPool<false>::work_thread(int) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:646
6#  execute_native_thread_routine at /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:85
7#  start_thread
8#  clone

W20240604 16:12:59.013717 31766 fragment_mgr.cpp:1075] Could not find the query id:a85bea5442224224-b06f7a2dc6117492 fragment id:3 to cancel

3 Answers

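[Issue status] Resolved
[Resolution] Some SQL was probably still running at the moment memory filled up, so with the cluster under heavy load the cancel requests were not handled promptly. Under load, unlike in a normal scenario, execution may block, so memory is not reclaimed immediately. We recommend running the load test in combination with Workload Group.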
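
For readers who want to try the Workload Group suggestion, here is a minimal sketch in Python that only builds the SQL statements; it does not connect to Doris. The group name `stress_test_wg` and the 30% cap are hypothetical values chosen for illustration. The property names (`cpu_share`, `memory_limit`, `enable_memory_overcommit`) and the `workload_group` session variable are assumptions based on the Doris 2.1 Workload Group documentation and should be checked against the version in use.

```python
def workload_group_statements(name, memory_limit_pct, cpu_share=1024):
    """Build Doris SQL that creates a workload group with a hard memory cap
    and binds the current session to it.

    Assumption: property names follow the Doris 2.1 Workload Group docs.
    """
    # Reject names that are not plain identifiers, since the name is
    # interpolated directly into the SQL text.
    if not name.isidentifier():
        raise ValueError(f"unsafe workload group name: {name!r}")
    if not 0 < memory_limit_pct <= 100:
        raise ValueError("memory_limit_pct must be in (0, 100]")

    create = (
        f"CREATE WORKLOAD GROUP IF NOT EXISTS {name} "
        "PROPERTIES ("
        f'"cpu_share"="{cpu_share}", '
        f'"memory_limit"="{memory_limit_pct}%", '
        # Overcommit disabled so the limit acts as a hard cap (assumed semantics).
        '"enable_memory_overcommit"="false"'
        ");"
    )
    # Route queries from the current session to the new group.
    bind = f"SET workload_group = '{name}';"
    return [create, bind]


if __name__ == "__main__":
    # Hypothetical group name and limit, for illustration only.
    for stmt in workload_group_statements("stress_test_wg", 30):
        print(stmt)
```

The printed statements can be run through any MySQL-compatible client connected to the FE before starting the JMeter run, so that the load-test session executes under the capped group.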

I happen to be testing version 2.1.0 and ran into this problem as well.
The situation is the same as the original poster described.

Version 2.1.0 uses Workload Group by default, so the answer above does not seem to fully address the problem.

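After the query load has dropped, even running a single statement on its own still hits OOM. The BE memory tracker shows that Query is holding a large amount of memory that never gets released. Why is that?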