[Resolved] Errors with HDFS-based hot/cold tiering in version 2.1


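[Version] Apache Doris 2.1.0
[Background] While validating Doris hot/cold tiering to HDFS on a test cluster, both FE and BE keep logging errors. The table data can still be queried, and SHOW DATA shows that the data has been uploaded to HDFS.
[CREATE statements]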

create resource "remote_hdfs" properties(
	"type" = "hdfs",
	"fs.defaultFS"="masters", -- taken from Hadoop's core-site file
	"hadoop.username"="hadoop",
	"hadoop.password"="",
	"dfs.nameservices"="masters",
	"dfs.ha.namenodes.masters"="h1,h2",
	"dfs.namenode.rpc-address.masters.h1"="h1_ip:9000",
	"dfs.namenode.rpc-address.masters.h2"="h2_ip:9000",
	"dfs.client.failover.proxy.provider.masters"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);

create storage policy test_policy
properties(
	"storage_resource" = "remote_hdfs",
	"cooldown_ttl" = "300"
);

CREATE TABLE test.ods_storage_policy_test1(
`DATA_TIME` datetime not null ,
`ID` varchar(30),
`VALUE` DOUBLE,
`TIME2` datetime,
`TIME3` datetime,
`SOURCESTR` varchar(50)
) 
duplicate key (DATA_TIME,ID)
auto partition by range date_trunc(DATA_TIME,"day") ()
DISTRIBUTED BY HASH(DATA_TIME,ID) BUCKETS 5
properties(
	"storage_policy"="test_policy"
);

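[Details]
1. After creating the resource, I created a storage policy with a 5-minute cooldown (cooldown_ttl = 300), attached it to the table, and loaded 10,000 rows of test data.
2. After 5 minutes, SHOW DATA shows the data synced to HDFS, but the Size and RemoteSize values do not match.
3. The be.WARN log keeps reporting the following: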

W20240329 10:51:07.714134 32569 file_reader.cpp:34] [INTERNAL_ERROR]cancelled: sender is gone
W20240329 10:51:07.714203 32569 scanner_scheduler.cpp:276] Scan thread read VScanner failed: [INTERNAL_ERROR]cancelled: sender is gone
W20240329 10:51:07.714301 32490 task_scheduler.cpp:348] Pipeline task failed. query_id: ec470c99f75916ea-7841c4481a7934c0|ec470c99f75916ea-7841c4481a7934bf reason: [INTERNAL_ERROR]cancelled: sender is gone
W20240329 10:51:07.714361 32490 pipeline_fragment_context.cpp:177] PipelineFragmentContext ec470c99f75916ea-7841c4481a7934c0|ec470c99f75916ea-7841c4481a7934bf is canceled, cancel message: cancelled: sender is gone
W20240329 10:51:07.714432 32588 vtablet_writer.cpp:598] cancel node channel VNodeChannel[10190-10063], load_id=ec470c99f75916ea-7841c4481a7934bf, txn_id=214, node=10.163.26.125:8060, error message: [CANCELLED]cancelled: sender is gone
W20240329 10:51:07.714550 32588 vtablet_writer.cpp:598] cancel node channel VNodeChannel[10190-10062], load_id=ec470c99f75916ea-7841c4481a7934bf, txn_id=214, node=BE_IP:8060, error message: [CANCELLED]cancelled: sender is gone
W20240329 10:51:07.714603 32588 vtablet_writer.cpp:598] cancel node channel VNodeChannel[10190-10066], load_id=ec470c99f75916ea-7841c4481a7934bf, txn_id=214, node=10.163.26.128:8060, error message: [CANCELLED]cancelled: sender is gone
W20240329 10:51:07.714661 32588 vtablet_writer.cpp:598] cancel node channel VNodeChannel[10190-10065], load_id=ec470c99f75916ea-7841c4481a7934bf, txn_id=214, node=10.163.26.127:8060, error message: [CANCELLED]cancelled: sender is gone
W20240329 10:51:07.714699 32588 vtablet_writer.cpp:598] cancel node channel VNodeChannel[10190-10064], load_id=ec470c99f75916ea-7841c4481a7934bf, txn_id=214, node=10.163.26.126:8060, error message: [CANCELLED]cancelled: sender is gone
W20240329 10:51:07.714753 29595 fragment_mgr.cpp:391] report error status: cancelled: sender is gone to coordinator: TNetworkAddress(hostname=BE_IP, port=9020), query id: ec470c99f75916ea-7841c4481a7934bf, instance id: ec470c99f75916ea-7841c4481a7934c0
W20240329 10:51:07.715348 29595 stream_load_executor.cpp:113] fragment execute failed, query_id=0000000000000000-0000000000000000, err_msg=[CANCELLED]cancelled: sender is gone, id=ec470c99f75916ea-7841c4481a7934bf, job_id=-1, txn_id=214, label=_1_35, elapse(s)=0
W20240329 10:59:44.768100 30337 status.h:380] meet error status: [IO_ERROR]failed to open data/10661/10662.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10661/10662.0.meta
W20240329 10:59:44.768211 30337 file_system.cpp:35] [IO_ERROR]failed to open data/10661/10662.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10661/10662.0.meta
W20240329 10:59:44.768249 30337 olap_server.cpp:1121] failed to cooldown, tablet: 10661 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.768481 30338 status.h:380] meet error status: [IO_ERROR]failed to open data/10741/10744.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10741/10744.0.meta
W20240329 10:59:44.768509 30340 status.h:380] meet error status: [IO_ERROR]failed to open data/10447/10450.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10447/10450.0.meta
W20240329 10:59:44.768548 30341 status.h:380] meet error status: [IO_ERROR]failed to open data/10686/10689.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10686/10689.0.meta
W20240329 10:59:44.768575 30339 status.h:380] meet error status: [IO_ERROR]failed to open data/10694/10696.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10694/10696.0.meta
W20240329 10:59:44.768584 30340 file_system.cpp:35] [IO_ERROR]failed to open data/10447/10450.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10447/10450.0.meta
W20240329 10:59:44.768533 30338 file_system.cpp:35] [IO_ERROR]failed to open data/10741/10744.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10741/10744.0.meta
W20240329 10:59:44.768661 30341 file_system.cpp:35] [IO_ERROR]failed to open data/10686/10689.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10686/10689.0.meta
W20240329 10:59:44.768700 30339 file_system.cpp:35] [IO_ERROR]failed to open data/10694/10696.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10694/10696.0.meta
W20240329 10:59:44.768878 30340 olap_server.cpp:1121] failed to cooldown, tablet: 10447 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.768893 30339 olap_server.cpp:1121] failed to cooldown, tablet: 10694 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.768893 30341 olap_server.cpp:1121] failed to cooldown, tablet: 10686 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.768885 30338 olap_server.cpp:1121] failed to cooldown, tablet: 10741 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.774300 30341 status.h:380] meet error status: [IO_ERROR]failed to open data/10753/10754.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10753/10754.0.meta
W20240329 10:59:44.774545 30341 file_system.cpp:35] [IO_ERROR]failed to open data/10753/10754.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10753/10754.0.meta
W20240329 10:59:44.774588 30341 olap_server.cpp:1121] failed to cooldown, tablet: 10753 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.778764 30339 status.h:380] meet error status: [IO_ERROR]failed to open data/10669/10671.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10669/10671.0.meta
W20240329 10:59:44.778861 30339 file_system.cpp:35] [IO_ERROR]failed to open data/10669/10671.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10669/10671.0.meta
W20240329 10:59:44.778900 30339 olap_server.cpp:1121] failed to cooldown, tablet: 10669 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.786492 30341 status.h:380] meet error status: [IO_ERROR]failed to open data/10547/10549.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10547/10549.0.meta
W20240329 10:59:44.786592 30341 file_system.cpp:35] [IO_ERROR]failed to open data/10547/10549.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10547/10549.0.meta
W20240329 10:59:44.786652 30341 olap_server.cpp:1121] failed to cooldown, tablet: 10547 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.788599 30337 status.h:380] meet error status: [IO_ERROR]failed to open data/10552/10553.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10552/10553.0.meta
W20240329 10:59:44.788707 30337 file_system.cpp:35] [IO_ERROR]failed to open data/10552/10553.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10552/10553.0.meta
W20240329 10:59:44.788756 30337 olap_server.cpp:1121] failed to cooldown, tablet: 10552 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.788976 30339 status.h:380] meet error status: [IO_ERROR]failed to open data/10543/10546.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10543/10546.0.meta
W20240329 10:59:44.789033 30339 file_system.cpp:35] [IO_ERROR]failed to open data/10543/10546.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10543/10546.0.meta
W20240329 10:59:44.789098 30339 olap_server.cpp:1121] failed to cooldown, tablet: 10543 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.796890 30341 status.h:380] meet error status: [IO_ERROR]failed to open data/10535/10536.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10535/10536.0.meta
W20240329 10:59:44.796991 30341 file_system.cpp:35] [IO_ERROR]failed to open data/10535/10536.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10535/10536.0.meta
W20240329 10:59:44.797056 30341 olap_server.cpp:1121] failed to cooldown, tablet: 10535 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 10:59:44.799446 30339 status.h:380] meet error status: [IO_ERROR]failed to open data/10514/10516.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10514/10516.0.meta
W20240329 10:59:44.799530 30339 file_system.cpp:35] [IO_ERROR]failed to open data/10514/10516.0.meta: (2), 没有那个文件或目录), reason: RemoteException: File does not exist: /user/hadoop/data/10514/10516.0.meta
W20240329 10:59:44.799574 30339 olap_server.cpp:1121] failed to cooldown, tablet: 10514 err: [INTERNAL_ERROR]cannot read cooldown meta


W20240329 11:15:05.637538 30339 file_reader.cpp:34] [INTERNAL_ERROR]Read hdfs file failed. (BE: BE_IP) namenode:HDFS_NAMENODE_IP:9000, path:data/10673/10674.0.meta, err: (255), 未知的错误 255), reason: IOException: Blocklist for /user/hadoop/data/10673/10674.0.meta has changed!

        0#  doris::io::HdfsFileReader::read_at_impl(unsigned long, doris::Slice, unsigned long*, doris::io::IOContext const*) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
        1#  doris::io::FileReader::read_at(unsigned long, doris::Slice, unsigned long*, doris::io::IOContext const*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
        2#  doris::Tablet::_read_cooldown_meta(std::shared_ptr<doris::io::RemoteFileSystem> const&, doris::TabletMetaPB*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
        3#  doris::Tablet::_follow_cooldowned_data() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
        4#  doris::Tablet::cooldown(std::shared_ptr<doris::Rowset>) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
        5#  std::_Function_handler<void (), doris::StorageEngine::_cooldown_tasks_producer_callback()::$_1>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        6#  doris::WorkThreadPool<true>::work_thread(int) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:646
        7#  execute_native_thread_routine at /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:85
        8#  start_thread
        9#  clone
W20240329 11:15:05.637604 30339 olap_server.cpp:1121] failed to cooldown, tablet: 10673 err: [INTERNAL_ERROR]cannot read cooldown meta
W20240329 11:15:06.383301 30337 status.h:380] meet error status: [INTERNAL_ERROR]Read hdfs file failed. (BE: BE_IP) namenode:HDFS_NAMENODE_IP:9000, path:data/10451/10452.0.meta, err: (255), 未知的错误 255), reason: IOException: Blocklist for /user/hadoop/data/10451/10452.0.meta has changed!

        0#  doris::io::HdfsFileReader::read_at_impl(unsigned long, doris::Slice, unsigned long*, doris::io::IOContext const*) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
        1#  doris::io::FileReader::read_at(unsigned long, doris::Slice, unsigned long*, doris::io::IOContext const*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
        2#  doris::Tablet::_read_cooldown_meta(std::shared_ptr<doris::io::RemoteFileSystem> const&, doris::TabletMetaPB*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
        3#  doris::Tablet::_follow_cooldowned_data() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
        4#  doris::Tablet::cooldown(std::shared_ptr<doris::Rowset>) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:449
        5#  std::_Function_handler<void (), doris::StorageEngine::_cooldown_tasks_producer_callback()::$_1>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        6#  doris::WorkThreadPool<true>::work_thread(int) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:646
        7#  execute_native_thread_routine at /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:85
        8#  start_thread
        9#  clone
*********************************************************************** FE errors ***********************************************************************

2024-03-29 10:51:07,712 WARN (thrift-server-pool-23|452) [QeProcessorImpl.reportExecStatus():225] ReportExecStatus() runtime error, query f842235f87bb8711-22bc92904341c691 with type LOAD does not exist
2024-03-29 10:51:07,713 WARN (thrift-server-pool-15|227) [QeProcessorImpl.reportExecStatus():225] ReportExecStatus() runtime error, query 6548bcf4e0641151-c4c012c3809e83af with type LOAD does not exist
2024-03-29 10:51:07,715 WARN (thrift-server-pool-24|492) [QeProcessorImpl.reportExecStatus():225] ReportExecStatus() runtime error, query ec470c99f75916ea-7841c4481a7934bf with type LOAD does not exist
2024-03-29 10:51:07,716 WARN (thrift-server-pool-15|227) [QeProcessorImpl.reportExecStatus():225] ReportExecStatus() runtime error, query 7e4a880010080fcf-74ce0b275f20c194 with type LOAD does not exist
2024-03-29 10:51:07,716 WARN (thrift-server-pool-18|269) [QeProcessorImpl.reportExecStatus():225] ReportExecStatus() runtime error, query 564f662b92759d4b-36a029bb65c94483 with type LOAD does not exist
2024-03-29 10:51:07,718 WARN (thrift-server-pool-25|629) [QeProcessorImpl.reportExecStatus():225] ReportExecStatus() runtime error, query 2b479e8c3df137b3-b21dee2f04d73bf with type LOAD does not exist
2024-03-29 10:54:47,544 WARN (Thread-65|123) [TabletInvertedIndex.handleCooldownConf():438] failed to get tablet. tabletId=10770
2024-03-29 10:54:47,544 WARN (Thread-65|123) [TabletInvertedIndex.handleCooldownConf():438] failed to get tablet. tabletId=10774
2024-03-29 10:54:47,544 WARN (Thread-65|123) [TabletInvertedIndex.handleCooldownConf():438] failed to get tablet. tabletId=10787
2024-03-29 10:54:47,544 WARN (Thread-65|123) [TabletInvertedIndex.handleCooldownConf():438] failed to get tablet. tabletId=10795
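
As a rough way to cross-check the Size versus RemoteSize mismatch noted in step 2, statements along the following lines could be run from a MySQL client connected to the FE. This is only a sketch: it reuses the table name from the CREATE statement and one tablet ID from the BE log (10661), the exact output columns depend on the Doris version and are an assumption here, and none of it addresses the root cause of the cooldown errors.

```sql
-- Table-level view: compare the Size column with RemoteSize (cooled-down data).
SHOW DATA FROM test.ods_storage_policy_test1;

-- Per-tablet view: 2.x builds are expected to report local and remote data
-- sizes per replica (column names are an assumption; verify on your version).
SHOW TABLETS FROM test.ods_storage_policy_test1;

-- Map a tablet ID from the "failed to cooldown" log lines back to its
-- database, table, and partition.
SHOW TABLET 10661;
```

If the remote size stays at zero for the tablets named in the "cannot read cooldown meta" warnings while other tablets show remote data, that would suggest only some replicas completed the upload.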
2 Answers

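[Status] Handled
[Resolution] See the PR in the comments

To look into the issue in more detail (I will update this reply once it is handled), you can add me on W: yz-jayhua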

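Is the reason I can still query the data despite this error that a copy is cached locally when cold data is uploaded to HDFS? Once that local cache is cleared, will my table become unqueryable? I ran into this error before and ignored it; two days later the table could no longer be queried, with the error: failed to initialize storage reader.tablet=34202,res=[E-206]get fs failed
The BE keeps reporting: cannot get storage policy write tablet tabletid cooldown meta failed because :[not found] could not find storage_policy,storage_policy_id=xxx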
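
Since the message above says the BE could not find the storage policy, one possible first check is whether the policy and its resource still exist on the FE. The sketch below assumes the policy and resource names from the original question (test_policy, remote_hdfs), which are placeholders for whatever names this second cluster actually uses; the tablet ID is the one from the error text.

```sql
-- List all storage policies and confirm the expected one is still present.
SHOW STORAGE POLICY;

-- Show which tables and partitions reference the policy.
SHOW STORAGE POLICY USING FOR test_policy;

-- Confirm the HDFS resource behind the policy still exists.
SHOW RESOURCES WHERE NAME = "remote_hdfs";

-- Identify which table the failing tablet belongs to.
SHOW TABLET 34202;
```

If the policy or resource is missing, or the policy ID differs from the storage_policy_id in the BE log, that would be consistent with the "could not find storage_policy" message, though it does not by itself confirm the cause.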