Background: when using Stream Load to import a file that contains a JSON column, the load fails with an error saying that the column count of certain rows does not match the column count of the target table. However, if the problematic rows are extracted and loaded on their own via Stream Load, they import without any issue. The file is a data file exported through a PG (Greenplum) external table (CSV format, delimiter comma, escape character ~, enclose character double quote). Since the data is internal company data, I can't paste it directly, so I have reproduced a bad case here instead.
Doris version: 2.1.0
Doris target table DDL:
-- show create table load_test_2
CREATE TABLE `load_test_2` (
`stat_dt` DATE NULL,
`session_hash` BIGINT NULL,
`name` VARCHAR(512) NULL,
`apply_province` VARCHAR(128) NULL,
`apply_city` VARCHAR(128) NULL,
`data_json` JSON NULL
) ENGINE=OLAP
DUPLICATE KEY(`stat_dt`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`session_hash`) BUCKETS 5
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"min_load_replica_num" = "-1",
"is_being_synced" = "false",
"storage_medium" = "hdd",
"storage_format" = "V2",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false",
"group_commit_interval_ms" = "10000",
"group_commit_data_bytes" = "134217728"
);
PG (Greenplum) external table DDL:
-- CSV format: delimiter ',', escape '~', quote '"'
CREATE WRITABLE EXTERNAL TABLE xue_test.ext_w_t_load_test_3_test (LIKE xue_test.load_test_3)
LOCATION (
'gpfdist://10.72.xxx.xxx:8081/ext_test/data_file_0','gpfdist://10.72.xxx.xxx:8082/ext_test/data_file_1','gpfdist://10.72.xxx.xxx:8083/ext_test/data_file_2'
)
FORMAT 'csv' ( DELIMITER ',' ESCAPE '~' QUOTE '"' NULL '')
DISTRIBUTED RANDOMLY;
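For context, the data files are produced the usual Greenplum way: a writable external table only accepts INSERT, so something like INSERT INTO xue_test.ext_w_t_load_test_3_test SELECT * FROM xue_test.load_test_3; makes gpfdist write the rows out as data_file_0 through data_file_2.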
Stream Load curl command:
curl --location-trusted -u root: -T data_file_2 -H "format:csv" -H "column_separator:," -H 'enclose:"' -H "trim_double_quotes:true" -H 'escape:~' -H "exec_mem_limit:68719476736" http://10.72.xxx.xxx:8030/api/gp_test/load_test_2/_stream_load
Every row in my data file is produced by repeating the following two lines (see the generation sketch after the original JSON):
2024-01-01,23412341,test,Beijing,Beijing,
,,,,,"{~"title~": ~"非常长的JSON数据示例~", ~"users~": [{~"id~": 1, ~"name~": ~"用户1~", ~"email~": ~"user1@example.com~", ~"orders~": [{~"orderId~": ~"ORD1234~", ~"products~": [{~"name~": ~"产品1~", ~"price~": 99.99, ~"productId~": ~"PRD0001~"}, {~"name~": ~"产品2~", ~"price~": 199.99, ~"productId~": ~"PRD0002~"}]}, {~"orderId~": ~"ORD5678~", ~"products~": [{~"name~": ~"产品3~", ~"price~": 49.99, ~"productId~": ~"PRD0003~"}]}], ~"hobbies~": [~"阅读~", ~"旅行~", ~"摄影~"], ~"profile~": {~"age~": 25, ~"gender~": ~"男~", ~"location~": ~"北京~"}}, {~"id~": 2, ~"name~": ~"用户2~", ~"email~": ~"user2@example.com~", ~"orders~": [{~"orderId~": ~"ORD9012~", ~"products~": [{~"name~": ~"产品4~", ~"price~": 249.99, ~"productId~": ~"PRD0004~"}]}], ~"hobbies~": [~"健身~", ~"音乐~", ~"烹饪~"], ~"profile~": {~"age~": 30, ~"gender~": ~"女~", ~"location~": ~"上海~"}}, {~"key1~": ~"这是一段非常长的文本,用于填充JSON数 据,使其达到所需的长度。你可以继续添加更多文本,以满足你的需求。~", ~"key2~": ~"这也是一段长文本,用于扩展JSON的长度。你可以根据需要修改和增加内容。~", ~"key3~": ~"还有更多内容可以添加,以满足长度要求。你可以复制粘贴这段文 本多次,或者添加其他类型的数据。~", ~"key4~": {~"subkey1~": ~"子键值1的内容~", ~"subkey2~": ~"子键值2的内容~", ~"subkey3~": {~"subsubkey1~": ~"子子键值1的内容~", ~"subsubkey2~": ~"子子键值2的内容~"}}, ~"key5~": [~"数组元素1~", ~"数组元素2~", ~"数组元素3~"]}], ~"description~": ~"这是一个用于测试目的的长JSON数据,包含了各种类型和结构的字段。~"}"
The original JSON (before CSV quoting/escaping):
{"title":"非常长的JSON数据示例","description":"这是一个用于测试目的的长JSON数据,包含了各种类型和结构的字段。","users":[{"id":1,"name":"用户1","email":"user1@example.com","profile":{"age":25,"gender":"男","location":"北京"},"hobbies":["阅读","旅行","摄影"],"orders":[{"orderId":"ORD1234","products":[{"productId":"PRD0001","name":"产品1","price":99.99},{"productId":"PRD0002","name":"产品2","price":199.99}]},{"orderId":"ORD5678","products":[{"productId":"PRD0003","name":"产品3","price":49.99}]}]},{"id":2,"name":"用户2","email":"user2@example.com","profile":{"age":30,"gender":"女","location":"上海"},"hobbies":["健身","音乐","烹饪"],"orders":[{"orderId":"ORD9012","products":[{"productId":"PRD0004","name":"产品4","price":249.99}]}]},{"key1":"这是一段非常长的文本,用于填充JSON数据,使其达到所需的长度。你可以继续添加更多文本,以满足你的需求。","key2":"这也是一段长文本,用于扩展JSON的长度。你可以根据需要修改和增加内容。","key3":"还有更多内容可以添加,以满足长度要求。你可以复制粘贴这段文本多次,或者添加其他类型的数据。","key4":{"subkey1":"子键值1的内容","subkey2":"子键值2的内容","subkey3":{"subsubkey1":"子子键值1的内容","subsubkey2":"子子键值2的内容"}},"key5":["数组元素1","数组元素2","数组元素3"]}]}
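A minimal sketch of how such a file can be built by repeating the two sample lines; seed.csv, the repeat count, and the output name are placeholders of mine:

# generate_data.py -- build a large test file by repeating the two seed rows.
# Assumption: seed.csv holds the two sample lines above (UTF-8, one record per line).
REPEATS = 350000  # 2 seed rows x 350000 = 700000 records

with open("seed.csv", "rb") as f:
    seed = f.read()
if not seed.endswith(b"\n"):
    seed += b"\n"  # make sure the last record is newline-terminated

with open("data_file_big", "wb") as out:
    for _ in range(REPEATS):
        out.write(seed)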
The problem:
With a smaller file (100,000 rows, per the response below), the import succeeds:
{
"TxnId": 189602,
"Label": "bc79e4f8-1c50-4233-9fb4-7f1f9f03c9cf",
"Comment": "",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 100000,
"NumberLoadedRows": 100000,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 87347289,
"LoadTimeMs": 1167,
"BeginTxnTimeMs": 0,
"StreamLoadPutTimeMs": 1,
"ReadDataTimeMs": 724,
"WriteDataTimeMs": 1144,
"CommitAndPublishTimeMs": 19
}
With a larger file (700,000 rows), the import fails with:
{
"TxnId": 189600,
"Label": "e6485240-3d5c-4e18-8de6-d6a98ada993d",
"Comment": "",
"TwoPhaseCommit": "false",
"Status": "Fail",
"Message": "[DATA_QUALITY_ERROR]too many filtered rows",
"NumberTotalRows": 700000,
"NumberLoadedRows": 699997,
"NumberFilteredRows": 3,
"NumberUnselectedRows": 0,
"LoadBytes": 611431023,
"LoadTimeMs": 5858,
"BeginTxnTimeMs": 1,
"StreamLoadPutTimeMs": 6,
"ReadDataTimeMs": 5142,
"WriteDataTimeMs": 5849,
"CommitAndPublishTimeMs": 0,
"ErrorURL": "http://xxx.xxx.xxx.xxx:8040/api/_load_error_log?file=__shard_7/error_log_insert_stmt_734108f57e5b4877-6a94362e1d370cba_734108f57e5b4877_6a94362e1d370cba"
}
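The filtered rows can be inspected by downloading the error log from the ErrorURL in the response; a sketch (the BE host is redacted above, so substitute the real address):

import urllib.request

# ErrorURL copied verbatim from the failed Stream Load response above;
# xxx.xxx.xxx.xxx is the redacted BE host.
url = ("http://xxx.xxx.xxx.xxx:8040/api/_load_error_log?file=__shard_7/"
       "error_log_insert_stmt_734108f57e5b4877-6a94362e1d370cba_734108f57e5b4877_6a94362e1d370cba")
print(urllib.request.urlopen(url).read().decode("utf-8", errors="replace"))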
The two files have exactly the same repeated content and differ only in row count (100,000 vs. 700,000 rows, per the responses above), yet the larger one makes the Stream Load fail.
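To rule out a genuinely malformed file, the file can be re-parsed with the same delimiter/quote/escape settings the Stream Load headers declare and every record's field count checked. A sketch using Python's csv module as a stand-in parser (not Doris's own CSV reader), with data_file_2 and the 6-column expectation taken from the repro above:

import csv
import sys

# check_columns.py -- count fields per record using the same CSV settings
# as the Stream Load headers: column_separator ',', enclose '"', escape '~'.
EXPECTED = 6  # load_test_2 has 6 columns

path = sys.argv[1] if len(sys.argv) > 1 else "data_file_2"
with open(path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"',
                        escapechar="~", doublequote=False)
    bad = 0
    for recno, row in enumerate(reader, start=1):
        if len(row) != EXPECTED:
            bad += 1
            print(f"record {recno}: {len(row)} fields")
    print(f"{bad} bad records" if bad else "all records have 6 fields")

Given that the identical rows load cleanly in the 100,000-row file, I would expect this check to report no mismatches, which is why the failure looks tied to file size rather than to the data itself.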