Stream load fails with "succ replica num 1 < load required replica num 2" and "miss previous version"


This error sometimes goes away after the load is retried three times.

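Questions

  1. The number of successful replicas is below the required value, yet both BEs are healthy. Under what circumstances does a replica go missing?
  2. Why does "miss previous version" occur?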

The error message is as follows:

2024-06-01 10:24:00 WARN {"TxnId":2083495,"Label":"9cff8b39-d4a4-4a1e-a455-78722d34a61c","Comment":"","TwoPhaseCommit":"false","Status":"Fail","Message":"[ANALYSIS_ERROR]TStatus: errCode = 2, detailMessage = Failed to commit txn 2083495, cause tablet 557616 succ replica num 1 < load required replica num 2. table 487075, partition: [ id=557594, commit version 48355, visible version 48318 ], this tablet detail: 1 replicas final succ: { [replicaId=557618, backendId=10005, backendAlive=true, version=48355, state=NORMAL] }; 1 replicas write data succ but miss previous version: { [replicaId=557617, backendId=10006, backendAlive=true, version=48306, lastFailedVersion=48317, lastSuccessVersion=48306, lastFailedTimestamp=1717208640183, state=NORMAL] }.\n\n\t0# doris::Status doris::Status::create(doris::TStatus const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187\n\t1# doris::StreamLoadExecutor::commit_txn(doris::StreamLoadContext*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449\n\t2# doris::StreamLoadAction::handle(std::shared_ptr) at /home/zcp/repo_center/doris_release/doris/be/src/http/action/stream_load.cpp:0\n\t3# doris::StreamLoadAction::handle(doris::HttpRequest*) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1291\n\t4# ?\n\t5# bufferevent_run_readcb\n\t6# ?\n\t7# ?\n\t8# ?\n\t9# ?\n\t10# std::_Function_handler<void (), doris::EvHttpServer::start()::$_0>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/atomicity.h:98\n\t11# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/util/threadpool.cpp:0\n\t12# doris::Thread::supervise_thread(void*) at /var/local/ldb_toolchain/bin/../usr/include/pthread.h:562\n\t13# start_thread\n\t14# 
__clone\n","NumberTotalRows":5000,"NumberLoadedRows":5000,"NumberFilteredRows":0,"NumberUnselectedRows":0,"LoadBytes":4199999,"LoadTimeMs":630,"BeginTxnTimeMs":0,"StreamLoadPutTimeMs":16,"ReadDataTimeMs":15,"WriteDataTimeMs":607,"CommitAndPublishTimeMs":0}

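-------------------------Logs added 06/05---------------------------------------
BE-side log: on backend 10006 everything was still normal at 10:18, but the version error was reported at 10:23. Between 10:18 and 10:23 there are no other messages about tablet 557616, so the publishes for the versions between 48306 and 48310 are indeed missing.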
I20240601 10:18:35.661039 43762 engine_publish_version_task.cpp:395] publish version successfully on tablet, table_id=487075, tablet=557616, transaction_id=2083262, version=48306, num_rows=0, res=[OK], cost: 537055(us) [Publish Statistics: schedule time(us): 430552, lock wait time(us): 4, save meta time(us): 105569, calc delete bitmap time(us): 0, partial update write segment time(us): 0, add inc rowset time(us): 908]
...
...
W20240601 10:23:15.055142 44711 status.h:399] meet error status: [ANALYSIS_ERROR]TStatus: errCode = 2, detailMessage = Failed to commit txn 2083367, cause tablet 557616 succ replica num 1 < load required replica num 2. table 487075, partition: [ id=557594, commit version 48355, visible version 48310 ], this tablet detail: 1 replicas final succ: { [replicaId=557618, backendId=10005, backendAlive=true, version=48355, state=NORMAL] }; 1 replicas write data succ but miss previous version: { [replicaId=557617, backendId=10006, backendAlive=true, version=48306, lastFailedVersion=48310, lastSuccessVersion=48306, lastFailedTimestamp=1717208577486, state=NORMAL] }.
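
To make the replica detail in this message easier to read, here is a minimal Python sketch that pulls each `[replicaId=..., ...]` group out of the text and lists the replicas that carry a `lastFailedVersion`. The function names are made up for illustration, and it assumes the message keeps the `key=value, key=value` layout seen above and that a set `lastFailedVersion` marks the replica the FE considers behind.

```python
import re

def parse_replicas(message):
    """Return one dict per '[replicaId=..., ...]' group found in the message."""
    replicas = []
    for match in re.finditer(r"\[(replicaId=[^\]]*)\]", message):
        fields = {}
        for pair in match.group(1).split(", "):
            key, _, value = pair.partition("=")
            fields[key] = value
        replicas.append(fields)
    return replicas

def lagging_replicas(message):
    """List (backendId, version, lastFailedVersion) for replicas with a failed version."""
    result = []
    for replica in parse_replicas(message):
        # Assumption: replicas without the field are treated as having no failure.
        failed = int(replica.get("lastFailedVersion", "-1"))
        if failed > 0:
            result.append((replica["backendId"], int(replica["version"]), failed))
    return result

# Replica detail copied from the BE warning above.
sample = (
    "1 replicas final succ: { [replicaId=557618, backendId=10005, backendAlive=true, "
    "version=48355, state=NORMAL] }; 1 replicas write data succ but miss previous version: "
    "{ [replicaId=557617, backendId=10006, backendAlive=true, version=48306, "
    "lastFailedVersion=48310, lastSuccessVersion=48306, lastFailedTimestamp=1717208577486, "
    "state=NORMAL] }."
)

print(lagging_replicas(sample))  # [('10006', 48306, 48310)]
```

For the message above, this points at backend 10006: its replica is at version 48306 with a failed version of 48310, while the replica on 10005 is already at 48355.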

2 Answers

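The BE had written the load data, but during the publish phase that BE went down (or its publish was slow), which triggered "publish one succ". The main cause was that the source was loading too frequently, which brought the BE down. The load was later changed to accumulate rows into batches, or to use routine load instead, and it now works normally.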
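
The batching idea can be sketched on the client side roughly as follows. This is only an illustration under stated assumptions: `RowBatcher` and its parameters are hypothetical names, and the `flush` callback stands in for whatever code actually submits one stream load request. The point is to issue one load per batch instead of one per small group of rows.

```python
import time

class RowBatcher:
    """Collect rows and hand them to `flush` in batches (hypothetical helper)."""

    def __init__(self, flush, max_rows=5000, max_wait_seconds=5.0, clock=time.monotonic):
        self._flush = flush              # callback that submits one load per batch
        self._max_rows = max_rows
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._rows = []
        self._first_row_at = None

    def add(self, row):
        if not self._rows:
            self._first_row_at = self._clock()
        self._rows.append(row)
        too_many = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_row_at >= self._max_wait
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Submit whatever is buffered; does nothing when the buffer is empty."""
        if self._rows:
            batch, self._rows = self._rows, []
            self._flush(batch)

# Usage: the callback here only records batches instead of calling Doris.
sent = []
batcher = RowBatcher(flush=sent.append, max_rows=3, max_wait_seconds=60.0)
for i in range(7):
    batcher.add({"id": i})
batcher.flush()  # push the final partial batch

print([len(batch) for batch in sent])  # [3, 3, 1]
```

Note that the age limit is only checked when a new row arrives, so a real producer would also need a timer or a final `flush()` call to avoid holding rows indefinitely.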

You can search the BE log for tablet 557616 to see why the replica is missing versions.
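
One way to do that search is the Python sketch below. It assumes the publish lines look like the `publish version successfully on tablet` line quoted in the question; the helper names are hypothetical, and the second sample line is invented purely to show how a gap is reported.

```python
import re

# Matches the publish-success line format quoted in the question.
PUBLISH_RE = re.compile(
    r"publish version successfully on tablet.*?tablet=(\d+),.*?version=(\d+)"
)

def published_versions(lines, tablet_id):
    """Sorted versions that were published successfully for one tablet."""
    versions = set()
    for line in lines:
        match = PUBLISH_RE.search(line)
        if match and int(match.group(1)) == tablet_id:
            versions.add(int(match.group(2)))
    return sorted(versions)

def missing_versions(versions):
    """Versions absent between the lowest and highest published version."""
    if not versions:
        return []
    present = set(versions)
    return [v for v in range(versions[0], versions[-1] + 1) if v not in present]

lines = [
    # Shortened copy of the line from the question.
    "I20240601 10:18:35.661039 43762 engine_publish_version_task.cpp:395] publish version "
    "successfully on tablet, table_id=487075, tablet=557616, transaction_id=2083262, "
    "version=48306, num_rows=0, res=[OK]",
    # Invented line, for illustration only.
    "publish version successfully on tablet, table_id=487075, tablet=557616, "
    "transaction_id=0, version=48311, num_rows=0, res=[OK]",
]

versions = published_versions(lines, 557616)
print(versions)                    # [48306, 48311]
print(missing_versions(versions))  # [48307, 48308, 48309, 48310]
```

To run it against a real log, pass an open file object as `lines`. The log file name and location depend on the deployment.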