【已解决】routine load 从kafka订阅日志,因transaction abort异常导致消费阻塞

Viewed 83

drois版本为 doris-2.1.1-rc05-9f2520537f
截取了一个报错的事务

ec2-user@drois_sg4 log]$ grep c56c95191388474c-92bbfc39bbe7f4cb fe.log.20240417-3
2024-04-17 11:32:15,105 INFO (Routine load task scheduler|40) [DatabaseTransactionMgr.beginTransaction():404] begin transaction: txn id 7071528 with label c56c95191388474c-92bbfc39bbe7f4cb from coordinator FE: 172.31.19.73, listener id: 76027
2024-04-17 11:32:19,036 WARN (thrift-server-pool-69|293295) [QeProcessorImpl.reportExecStatus():226] ReportExecStatus() runtime error, query c56c95191388474c-92bbfc39bbe7f4cb with type LOAD does not exist
2024-04-17 11:32:19,038 INFO (thrift-server-pool-71|294087) [DatabaseTransactionMgr.abortTransaction():1552] abort transaction: TransactionState. transaction id: 7071528, label: c56c95191388474c-92bbfc39bbe7f4cb, db id: 10358, table id list: 59038, callback id: 76027, coordinator: FE: 172.31.19.73, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1713353535105, commit time: -1, finish time: 1713353539036, reason: [CANCELLED][E-235]PStatus: (172.31.22.135)[E-235]failed to init rowset builder. version count: 56053, exceed limit: 2000, tablet: 59060 attactment: RLTaskTxnCommitAttachment [filteredRows=0, loadedRows=0, unselectedRows=0, receivedBytes=209715998, taskExecutionTimeMs=0, taskId=TUniqueId(hi:-4220834815818578100, lo:-7873422198631303989), jobId=76027, progress=KafkaProgress [partitionIdToOffset=1_21440496272|3_20665444312|5_19663004962|7_21434311959|9_18042560184|11_17908140484|13_17934827313|15_17973267464]] successfully
2024-04-17 11:59:05,356 INFO (leaderCheckpointer|94) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a ABORTED transaction TransactionState. transaction id: 7071528, label: c56c95191388474c-92bbfc39bbe7f4cb, db id: 10358, table id list: 59038, callback id: 76027, coordinator: FE: 172.31.19.73, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1713353535105, commit time: -1, finish time: 1713353539036, reason: [CANCELLED][E-235]PStatus: (172.31.22.135)[E-235]failed to init rowset builder. version count: 56053, exceed limit: 2000, tablet: 59060 attactment: RLTaskTxnCommitAttachment [filteredRows=0, loadedRows=0, unselectedRows=0, receivedBytes=209715998, taskExecutionTimeMs=0, taskId=null, jobId=0, progress=KafkaProgress [partitionIdToOffset=1_21440496272|3_20665444312|5_19663004962|7_21434311959|9_18042560184|11_17908140484|13_17934827313|15_17973267464]]

这个表是冷热存储分离,10天前的数据会同步到s3. 但是目前该表创建都还没有10天。
具体的建表语句如下:

CREATE TABLE `country_ip_agg` (
  `country` VARCHAR(20) NOT NULL COMMENT '国家',
  `last_ip` VARCHAR(60) NULL COMMENT '用户的IP',
  `os` VARCHAR(100) REPLACE_IF_NOT_NULL NULL COMMENT '操作系统-方便对数据',
  `ios_usr` INT MAX NULL COMMENT '是否在iOS系统出现过',
  `region` VARCHAR(100) REPLACE_IF_NOT_NULL NULL COMMENT 'region',
  `city` VARCHAR(100) REPLACE_IF_NOT_NULL NULL COMMENT 'city',
  `bid_date` DATE REPLACE NULL COMMENT '日期',
  `bid_date_time` DATETIME REPLACE NULL COMMENT '最后活跃时间',
  `active_cnt` INT SUM NULL DEFAULT "0" COMMENT '活跃数',
  `ip_hll` HLL HLL_UNION NOT NULL COMMENT 'ifa排重辅助'
) ENGINE=OLAP
AGGREGATE KEY(`country`, `last_ip`)
COMMENT 'OLAP'
PARTITION BY LIST(`country`)
(PARTITION p_BAD VALUES IN ("BR1","IT1")PROPERTIES ("STORAGE POLICY" = "s3_cold_down_10d"),
PARTITION p_KP VALUES IN ("KP")PROPERTIES ("STORAGE POLICY" = "s3_cold_down_10d"),
PARTITION p_OTHER VALUES IN ("RU","ID","TH","IN","PH","BR","MX","VN","KR","SA","GB","AE","MY","NG","SG","TK","NU","TV","ER","FK","MS","PN","BL","KI","VA","IO","WF","NR","PM","CK","AS","VU","SM","TC","LI","AX","MC","TO","SB","ST","FO","MH","GL","YT","GI","FM","VG","WS","PW","GG","CF","AD","AI","LS","VC","LC","AW","IM","JE","PG","NC","GW","GD","KM","KN","DJ","SZ","CV","BI","DM","KY","MO","FJ","TM","BW","GP","CW","UM","MQ","BS","IS","GQ","SS","AG","BT","BF","BM","MP","MW","PF","ME","GY","LR","NE","NA","BB","TL","SR","BQ","ZW","SL","SI","RE","CU","MG","GF","TD","MZ","BN","GM","CY","MT","HR","TT","CG","SK","BA","RW","VI","BJ","TG","GA","CR","SV","ML","GN","MR","HT","NI","MK","MF","ZM","TJ","PA","EE","AO","GU","TZ","UG","LU","KG","HN","BZ","LT","HU","CD","SD","AM","GR","JM","UZ","LV","KE","AT","PT","MV","SC","AZ","CM","SN","MN","MD","NO","BY","ET","SO","BG","AL","BH","NZ","UY","TN","IE","JO","DZ","MM","LA","SX","GH","BE","YE","CI","LK","CZ","PS","BO","CH","AF","RO","PY","VE","NF","DO","ES","OM","FI","PR","DK","SY","MU","LB","CL","IL","KH","SE","AU","KW","IR","LY","EC","MA","TR","NP","GE","QA","CO","AR","KZ","GT","HK","CN","EG","PE","BD","JP","PL","IT","UA","IQ","ZA","CA","RS","PK","FR","NL","TW","DE")PROPERTIES ("STORAGE POLICY" = "s3_cold_down_10d"),
PARTITION p_RU VALUES IN ("US")PROPERTIES ("STORAGE POLICY" = "s3_cold_down_10d"))
DISTRIBUTED BY HASH(`last_ip`) BUCKETS 6
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"min_load_replica_num" = "-1",
"is_being_synced" = "false",
"storage_medium" = "hdd",
"storage_format" = "V2",
"inverted_index_storage_format" = "V1",
"compression" = "ZSTD",
"light_schema_change" = "true",
"storage_policy" = "s3_cold_down_10d",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false",
"group_commit_interval_ms" = "10000",
"group_commit_data_bytes" = "134217728"
);

这个 transaction Abort异常会导致 routine load 的 abortedTaskNum 指标一直增加。

我应该怎么解决这个问题呢,因为业务对数据实时性要求比较高。要求数据不要堆积

2 Answers

看报错事由于事务过多导致的,修改be参数max_tablet_version_num为5000试试