Background: when using Stream Load to import a file that contains a JSON column, the load fails with an error saying that the column count of certain rows does not match the column count of the target table. However, if the problematic rows are extracted and loaded on their own via Stream Load, they import without any issue. The file is a data file exported through a PG (Greenplum) external table (CSV format, delimiter comma, escape character ~, enclose character double quote). Since the data is internal company data, I can't paste it directly, so I have reproduced a bad case here instead.
Doris version: 2.1.0
Doris target table DDL:
-- show create table load_test_2
CREATE TABLE `load_test_2` (
`stat_dt` DATE NULL,
`session_hash` BIGINT NULL,
`name` VARCHAR(512) NULL,
`apply_province` VARCHAR(128) NULL,
`apply_city` VARCHAR(128) NULL,
`data_json` JSON NULL
) ENGINE=OLAP
DUPLICATE KEY(`stat_dt`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`session_hash`) BUCKETS 5
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"min_load_replica_num" = "-1",
"is_being_synced" = "false",
"storage_medium" = "hdd",
"storage_format" = "V2",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false",
"group_commit_interval_ms" = "10000",
"group_commit_data_bytes" = "134217728"
);
PG (Greenplum) external table DDL:
-- CSV format: delimiter ',', escape '~', quote '"'
CREATE WRITABLE EXTERNAL TABLE xue_test.ext_w_t_load_test_3_test (LIKE xue_test.load_test_3)
LOCATION (
'gpfdist://10.72.xxx.xxx:8081/ext_test/data_file_0','gpfdist://10.72.xxx.xxx:8082/ext_test/data_file_1','gpfdist://10.72.xxx.xxx:8083/ext_test/data_file_2'
)
FORMAT 'csv' ( DELIMITER ',' ESCAPE '~' QUOTE '"' NULL '')
DISTRIBUTED RANDOMLY;
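For context, the data files are produced the usual Greenplum way: a writable external table only accepts INSERT, so something like INSERT INTO xue_test.ext_w_t_load_test_3_test SELECT * FROM xue_test.load_test_3; makes gpfdist write the rows out as data_file_0 through data_file_2.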
Stream Load curl command:
curl --location-trusted -u root: -T data_file_2 -H "format:csv" -H "column_separator:," -H 'enclose:"' -H "trim_double_quotes:true" -H 'escape:~' -H "exec_mem_limit:68719476736" http://10.72.xxx.xxx:8030/api/gp_test/load_test_2/_stream_load
Every row in my data file is produced by repeating the following two lines (see the generation sketch after the original JSON):
2024-01-01,23412341,test,Beijing,Beijing,
,,,,,"{~"title~": ~"非常长的JSON数据示例~", ~"users~": [{~"id~": 1, ~"name~": ~"用户1~", ~"email~": ~"user1@example.com~", ~"orders~": [{~"orderId~": ~"ORD1234~", ~"products~": [{~"name~": ~"产品1~", ~"price~": 99.99, ~"productId~": ~"PRD0001~"}, {~"name~": ~"产品2~", ~"price~": 199.99, ~"productId~": ~"PRD0002~"}]}, {~"orderId~": ~"ORD5678~", ~"products~": [{~"name~": ~"产品3~", ~"price~": 49.99, ~"productId~": ~"PRD0003~"}]}], ~"hobbies~": [~"阅读~", ~"旅行~", ~"摄影~"], ~"profile~": {~"age~": 25, ~"gender~": ~"男~", ~"location~": ~"北京~"}}, {~"id~": 2, ~"name~": ~"用户2~", ~"email~": ~"user2@example.com~", ~"orders~": [{~"orderId~": ~"ORD9012~", ~"products~": [{~"name~": ~"产品4~", ~"price~": 249.99, ~"productId~": ~"PRD0004~"}]}], ~"hobbies~": [~"健身~", ~"音乐~", ~"烹饪~"], ~"profile~": {~"age~": 30, ~"gender~": ~"女~", ~"location~": ~"上海~"}}, {~"key1~": ~"这是一段非常长的文本,用于填充JSON数 据,使其达到所需的长度。你可以继续添加更多文本,以满足你的需求。~", ~"key2~": ~"这也是一段长文本,用于扩展JSON的长度。你可以根据需要修改和增加内容。~", ~"key3~": ~"还有更多内容可以添加,以满足长度要求。你可以复制粘贴这段文 本多次,或者添加其他类型的数据。~", ~"key4~": {~"subkey1~": ~"子键值1的内容~", ~"subkey2~": ~"子键值2的内容~", ~"subkey3~": {~"subsubkey1~": ~"子子键值1的内容~", ~"subsubkey2~": ~"子子键值2的内容~"}}, ~"key5~": [~"数组元素1~", ~"数组元素2~", ~"数组元素3~"]}], ~"description~": ~"这是一个用于测试目的的长JSON数据,包含了各种类型和结构的字段。~"}"
The original JSON (before CSV quoting/escaping):
{"title":"非常长的JSON数据示例","description":"这是一个用于测试目的的长JSON数据,包含了各种类型和结构的字段。","users":[{"id":1,"name":"用户1","email":"user1@example.com","profile":{"age":25,"gender":"男","location":"北京"},"hobbies":["阅读","旅行","摄影"],"orders":[{"orderId":"ORD1234","products":[{"productId":"PRD0001","name":"产品1","price":99.99},{"productId":"PRD0002","name":"产品2","price":199.99}]},{"orderId":"ORD5678","products":[{"productId":"PRD0003","name":"产品3","price":49.99}]}]},{"id":2,"name":"用户2","email":"user2@example.com","profile":{"age":30,"gender":"女","location":"上海"},"hobbies":["健身","音乐","烹饪"],"orders":[{"orderId":"ORD9012","products":[{"productId":"PRD0004","name":"产品4","price":249.99}]}]},{"key1":"这是一段非常长的文本,用于填充JSON数据,使其达到所需的长度。你可以继续添加更多文本,以满足你的需求。","key2":"这也是一段长文本,用于扩展JSON的长度。你可以根据需要修改和增加内容。","key3":"还有更多内容可以添加,以满足长度要求。你可以复制粘贴这段文本多次,或者添加其他类型的数据。","key4":{"subkey1":"子键值1的内容","subkey2":"子键值2的内容","subkey3":{"subsubkey1":"子子键值1的内容","subsubkey2":"子子键值2的内容"}},"key5":["数组元素1","数组元素2","数组元素3"]}]}
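A minimal sketch of how such a file can be built by repeating the two sample lines; seed.csv, the repeat count, and the output name are placeholders of mine:

# generate_data.py -- build a large test file by repeating the two seed rows.
# Assumption: seed.csv holds the two sample lines above (UTF-8, one record per line).
REPEATS = 350000  # 2 seed rows x 350000 = 700000 records

with open("seed.csv", "rb") as f:
    seed = f.read()
if not seed.endswith(b"\n"):
    seed += b"\n"  # make sure the last record is newline-terminated

with open("data_file_big", "wb") as out:
    for _ in range(REPEATS):
        out.write(seed)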
The problem:
With a smaller file (100,000 rows, per the response below), the import succeeds:
{
"TxnId": 189602,
"Label": "bc79e4f8-1c50-4233-9fb4-7f1f9f03c9cf",
"Comment": "",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 100000,
"NumberLoadedRows": 100000,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 87347289,
"LoadTimeMs": 1167,
"BeginTxnTimeMs": 0,
"StreamLoadPutTimeMs": 1,
"ReadDataTimeMs": 724,
"WriteDataTimeMs": 1144,
"CommitAndPublishTimeMs": 19
}
With a larger file (700,000 rows), the import fails with:
{
"TxnId": 189600,
"Label": "e6485240-3d5c-4e18-8de6-d6a98ada993d",
"Comment": "",
"TwoPhaseCommit": "false",
"Status": "Fail",
"Message": "[DATA_QUALITY_ERROR]too many filtered rows",
"NumberTotalRows": 700000,
"NumberLoadedRows": 699997,
"NumberFilteredRows": 3,
"NumberUnselectedRows": 0,
"LoadBytes": 611431023,
"LoadTimeMs": 5858,
"BeginTxnTimeMs": 1,
"StreamLoadPutTimeMs": 6,
"ReadDataTimeMs": 5142,
"WriteDataTimeMs": 5849,
"CommitAndPublishTimeMs": 0,
"ErrorURL": "http://xxx.xxx.xxx.xxx:8040/api/_load_error_log?file=__shard_7/error_log_insert_stmt_734108f57e5b4877-6a94362e1d370cba_734108f57e5b4877_6a94362e1d370cba"
}
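The filtered rows can be inspected by downloading the error log from the ErrorURL in the response; a sketch (the BE host is redacted above, so substitute the real address):

import urllib.request

# ErrorURL copied verbatim from the failed Stream Load response above;
# xxx.xxx.xxx.xxx is the redacted BE host.
url = ("http://xxx.xxx.xxx.xxx:8040/api/_load_error_log?file=__shard_7/"
       "error_log_insert_stmt_734108f57e5b4877-6a94362e1d370cba_734108f57e5b4877_6a94362e1d370cba")
print(urllib.request.urlopen(url).read().decode("utf-8", errors="replace"))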
The two files have exactly the same repeated content and differ only in row count (100,000 vs. 700,000 rows, per the responses above), yet the larger one makes the Stream Load fail.
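To rule out a genuinely malformed file, the file can be re-parsed with the same delimiter/quote/escape settings the Stream Load headers declare and every record's field count checked. A sketch using Python's csv module as a stand-in parser (not Doris's own CSV reader), with data_file_2 and the 6-column expectation taken from the repro above:

import csv
import sys

# check_columns.py -- count fields per record using the same CSV settings
# as the Stream Load headers: column_separator ',', enclose '"', escape '~'.
EXPECTED = 6  # load_test_2 has 6 columns

path = sys.argv[1] if len(sys.argv) > 1 else "data_file_2"
with open(path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"',
                        escapechar="~", doublequote=False)
    bad = 0
    for recno, row in enumerate(reader, start=1):
        if len(row) != EXPECTED:
            bad += 1
            print(f"record {recno}: {len(row)} fields")
    print(f"{bad} bad records" if bad else "all records have 6 fields")

Given that the identical rows load cleanly in the 100,000-row file, I would expect this check to report no mismatches, which is why the failure looks tied to file size rather than to the data itself.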