A large number of INSERT statements triggers a Doris transaction-handling bug


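A colleague who was unfamiliar with Doris's data import mechanism loaded data with a large number of INSERT ... VALUES statements. The import failed and stopped halfway through, after which the cluster no longer worked normally. Queries failed with the following error: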

ERROR 1105 (HY000): errCode = 2, detailMessage = (10.76.119.35)[CANCELLED]missed_versions is empty, spec_version 554, max_version 579, tablet_id 2291806

Writes failed with the following error:

Caused by: org.apache.doris.flink.exception.StreamLoadException: [ANALYSIS_ERROR]TStatus: errCode = 2, detailMessage = current running txns on db 11234 is 3000, larger than limit 3000
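Each standalone INSERT statement in Doris runs as its own load transaction, so thousands of single-row statements create thousands of transactions. A minimal sketch of one way to reduce that count is to group rows into multi-row INSERT statements. The class name, method name, and table name demo_tbl below are hypothetical, and the rows are assumed to be pre-rendered, trusted value tuples; for bulk loads a dedicated import method such as Stream Load is generally the better choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Doris: groups rows into multi-row INSERTs
// so that N rows produce ceil(N / batchSize) statements instead of N.
public class InsertBatchingSketch {

    // rowTuples are assumed to be already rendered as "(v1, v2, ...)" strings.
    // Illustration only: real code should bind values through prepared statements.
    public static List<String> buildBatchedInserts(String table, List<String> rowTuples, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> statements = new ArrayList<>();
        for (int start = 0; start < rowTuples.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rowTuples.size());
            // One statement (and therefore one transaction) per chunk of rows.
            statements.add("INSERT INTO " + table + " VALUES "
                    + String.join(", ", rowTuples.subList(start, end)));
        }
        return statements;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("(1, 'a')", "(2, 'b')", "(3, 'c')", "(4, 'd')", "(5, 'e')");
        for (String sql : buildBatchedInserts("demo_tbl", rows, 2)) {
            System.out.println(sql);
        }
    }
}
```

With five rows and a batch size of 2, the sketch emits three statements rather than five; in practice the batch size would be far larger.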

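Using SHOW PROC, we confirmed that there were indeed 3000 transactions in the COMMITTED state. We stopped all data-writing jobs and watched for a while, but the count stayed at 3000; not a single transaction cleared. As far as we knew, Doris offered no way to cancel a transaction manually, so we restarted every FE and BE node, which had no effect at all. We kept wondering why the transaction state never changed, until we found a suspicious error in the FE log: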

2024-07-03 13:56:46,336 ERROR (PUBLISH_VERSION|32) [PublishVersionDaemon.runAfterCatalogReady():66] errors while publish version to all backends
java.lang.NullPointerException: null
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_411]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_411]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_411]
	at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[?:1.8.0_411]
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_411]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_411]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_411]
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_411]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_411]
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[?:1.8.0_411]
	at org.apache.doris.transaction.PublishVersionDaemon.getBaseTabletIdsForEachBe(PublishVersionDaemon.java:229) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.transaction.PublishVersionDaemon.publishVersion(PublishVersionDaemon.java:103) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.transaction.PublishVersionDaemon.runAfterCatalogReady(PublishVersionDaemon.java:64) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]

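From the method name publishVersion, we concluded that the problem was in transaction processing. Reading the publishVersion logic, we found that version publishing fetches all ready transactions and loops over them to process their state. During that loop, possibly on the very first iteration, a NullPointerException was thrown. Because the exception was not caught inside the loop, it aborted the whole loop, so the transactions stayed stuck. The simplest fix we could think of was to catch the NullPointerException and skip the entry, mirroring the existing handling of MetaNotFoundException.
Code before the change: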

for (TableCommitInfo tableCommitInfo : transactionState.getIdToTableCommitInfos().values()) {
    partitionCommitInfos.addAll(tableCommitInfo.getIdToPartitionCommitInfo().values());

    try {
        beIdToBaseTabletIds.putAll(getBaseTabletIdsForEachBe(transactionState, tableCommitInfo));
    } catch (MetaNotFoundException e) {
        LOG.warn("exception occur when trying to get rollup tablets info", e);
    }
}

Code after the change:

for (TableCommitInfo tableCommitInfo : transactionState.getIdToTableCommitInfos().values()) {
    partitionCommitInfos.addAll(tableCommitInfo.getIdToPartitionCommitInfo().values());

    try {
        beIdToBaseTabletIds.putAll(getBaseTabletIdsForEachBe(transactionState, tableCommitInfo));
    } catch (MetaNotFoundException e) {
        LOG.warn("exception occur when trying to get rollup tablets info", e);
    } catch (NullPointerException e) {
        LOG.warn("exception occur when trying to get rollup tablets info NullPointerException", e);
    }
}
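Why a single uncaught exception can stall every pending transaction is easier to see in a small, self-contained sketch. It is not Doris code: the class, the method names, and the use of a null list entry to simulate the bad metadata are all assumptions made for illustration. The first method mimics a daemon pass that only catches exceptions outside the loop; the second catches per item and skips the bad entry, which is the behavior the change above aims for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a publish pass; none of these names exist in Doris.
public class PublishLoopSketch {

    // Stand-in for the per-transaction work. A null entry simulates the
    // missing metadata and triggers a NullPointerException.
    static String handle(String txn) {
        return txn.toUpperCase();
    }

    // Exception handled only outside the loop: the first failure ends the pass,
    // and later transactions are never reached.
    public static List<String> publishWithoutCatch(List<String> txns) {
        List<String> published = new ArrayList<>();
        try {
            for (String txn : txns) {
                published.add(handle(txn));
            }
        } catch (RuntimeException e) {
            // The daemon would log the error here and retry the same list next cycle.
        }
        return published;
    }

    // Exception handled per item: the bad entry is skipped, the rest proceed.
    public static List<String> publishWithPerItemCatch(List<String> txns) {
        List<String> published = new ArrayList<>();
        for (String txn : txns) {
            try {
                published.add(handle(txn));
            } catch (NullPointerException e) {
                // Skip this entry and keep processing the remaining ones.
            }
        }
        return published;
    }

    public static void main(String[] args) {
        List<String> txns = Arrays.asList(null, "txn-2", "txn-3");
        System.out.println(publishWithoutCatch(txns));     // prints []
        System.out.println(publishWithPerItemCatch(txns)); // prints [TXN-2, TXN-3]
    }
}
```

If the failing entry is first in the list and the list is retried unchanged every cycle, the first variant never makes progress, which matches the stuck count we observed.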

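After we built and deployed the change, the number of COMMITTED transactions dropped to 0 almost immediately, and reads and writes across the cluster returned to normal. The problem is resolved, but we never found the root cause that triggers it. We hope someone with more expertise can explain it, and that it gets fixed in a later release.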

1 Answer

This issue has been fixed. PR: https://github.com/apache/doris/pull/35475

To abort a transaction manually, see the abort demo (image.png).