The data shown in the Flink web UI doesn't match what was actually inserted into the table


1. Using the Flink Doris Connector with a 10 s batch interval; upsert is implemented via Stream Load in CSV format (a sketch of this kind of setup follows below).
2. After the job ran for 1 h, Flink reported 760k records sent, but only 740k landed in the table, leaving a gap of about 20k records.
3. There are no errors in the job logs.
Could anyone advise how to track this down? Where did the missing records go?
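
For reference, a minimal sketch of the setup described in point 1, using the flink-doris-connector DataStream API (FE address, table name, credentials, and the column separator are placeholders; package paths are as of connector 24.0.x, the version in the stack trace below):

```java
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.serializer.SimpleStringSerializer;

import java.util.Properties;

// Connection to the Doris FE; host, table, and credentials are hypothetical.
DorisOptions dorisOptions = DorisOptions.builder()
        .setFenodes("fe_host:8030")
        .setTableIdentifier("example_db.example_table")
        .setUsername("root")
        .setPassword("")
        .build();

// Stream Load properties: CSV format with an assumed comma separator.
Properties props = new Properties();
props.setProperty("format", "csv");
props.setProperty("column_separator", ",");

DorisExecutionOptions executionOptions = DorisExecutionOptions.builder()
        .setLabelPrefix("unique-label-stream-load-update") // label prefix as seen in the log
        .enableBatchMode()                 // the DorisBatchStreamLoad path in the stack trace
        .setBufferFlushIntervalMs(10_000)  // the 10 s batch interval described above
        .setStreamLoadProp(props)
        .build();

DorisSink<String> sink = DorisSink.<String>builder()
        .setDorisReadOptions(DorisReadOptions.builder().build())
        .setDorisOptions(dorisOptions)
        .setDorisExecutionOptions(executionOptions)
        .setSerializer(new SimpleStringSerializer()) // records are pre-formatted CSV lines
        .build();
// stream.sinkTo(sink);
```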

2 Answers

You can start from these two angles:
1. Enable strict mode to check whether rows are being filtered out for data-quality reasons (see FAQ item 10); a sketch of how to enable it through the connector follows after this list.
2. Check whether the table uses the Aggregate model; the missing rows may simply have been deduplicated or aggregated away.
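
In the DataStream API, strict mode is passed through as a Stream Load property; a minimal sketch (SQL jobs would use `'sink.properties.strict_mode' = 'true'` instead):

```java
import org.apache.doris.flink.cfg.DorisExecutionOptions;

import java.util.Properties;

Properties props = new Properties();
props.setProperty("format", "csv");
// Strict mode: rows whose column values fail type conversion are filtered
// instead of being silently loaded as NULL.
props.setProperty("strict_mode", "true");
// Assumption for illustration: zero tolerance, so the load fails as soon as
// any row is filtered, which makes data-quality problems visible in the log.
props.setProperty("max_filter_ratio", "0");

DorisExecutionOptions executionOptions = DorisExecutionOptions.builder()
        .setStreamLoadProp(props) // forwarded to Doris as Stream Load headers
        .build();
```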

After enabling strict mode there is still no explicit error; how should I investigate further?
Also, loading the same CSV directly on the server via Stream Load (a curl command) works fine.
It's the Flink job that reports the error below; the log is as follows:
2024-09-29 14:45:04,578 INFO org.apache.doris.flink.sink.batch.DorisBatchStreamLoad - load Result {
"TxnId": 8622,
"Label": "unique-label-stream-load-update_xxx_91db5eaa-5f4c-4b0b-92f7-324069715660",
"Comment": "",
"TwoPhaseCommit": "false",
"Status": "Fail",
"Message": "[DATA_QUALITY_ERROR]too many filtered rows",
"NumberTotalRows": 10,
"NumberLoadedRows": 0,
"NumberFilteredRows": 10,
"NumberUnselectedRows": 0,
"LoadBytes": 4224,
"LoadTimeMs": 71,
"BeginTxnTimeMs": 2,
"StreamLoadPutTimeMs": 3,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 64,
"CommitAndPublishTimeMs": 0
}

2024-09-29 14:45:04,581 ERROR org.apache.doris.flink.sink.batch.DorisBatchStreamLoad - stream load error with xxx:8030, to retry, cause by
org.apache.doris.flink.exception.DorisBatchLoadException: stream load error: [DATA_QUALITY_ERROR]too many filtered rows, see more in null
at org.apache.doris.flink.sink.batch.DorisBatchStreamLoad$LoadAsyncExecutor.load(DorisBatchStreamLoad.java:494) [flink-doris-connector-1.16-24.0.0.jar:24.0.0]
at org.apache.doris.flink.sink.batch.DorisBatchStreamLoad$LoadAsyncExecutor.run(DorisBatchStreamLoad.java:407) [flink-doris-connector-1.16-24.0.0.jar:24.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_301]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_301]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_301]
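
In this result, "NumberFilteredRows": 10 out of "NumberTotalRows": 10 means every row in that batch failed Doris's data-quality checks, which with CSV is most commonly a column-count or separator mismatch between what the job emits and what the table expects. The usual way to see the per-row reasons is the "ErrorURL" field of the Stream Load response, which points at an error log on the BE; here the connector prints "see more in null", i.e. no ErrorURL came back. A minimal sketch of fetching such a URL when one is present (the URL itself is a hypothetical placeholder):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FetchStreamLoadErrors {
    public static void main(String[] args) throws Exception {
        // Placeholder ErrorURL as returned in a Stream Load response;
        // the log above returned none ("see more in null").
        String errorUrl = "http://be_host:8040/api/_load_error_log?file=__shard_0/error_log_xxx";
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(errorUrl).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // each line explains why one row was filtered
            }
        }
    }
}
```

Since the same file imports cleanly via curl, a reasonable next step is to diff the CSV bytes the Flink job actually sends (column order, separator, line delimiter, trailing fields) against the payload used in the curl command.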