[Resolved] How do I resolve the cooldown warning logs in Doris 2.0.5?

First type of warning:

W0611 08:28:23.461499  2013 olap_server.cpp:1080] failed to cooldown, tablet: 239732 err: [INTERNAL_ERROR]cannot read cooldown meta
W0611 08:28:23.464896  2011 olap_server.cpp:1080] failed to cooldown, tablet: 239876 err: [INTERNAL_ERROR]cooldowned version is not aligned
W0611 08:28:23.464947  2012 olap_server.cpp:1080] failed to cooldown, tablet: 239900 err: [INTERNAL_ERROR]cooldowned version is not aligned
W0611 08:28:23.465143  2010 olap_server.cpp:1080] failed to cooldown, tablet: 239804 err: [INTERNAL_ERROR]cooldowned version is not aligned
W0611 08:28:23.465533  2009 olap_server.cpp:1080] failed to cooldown, tablet: 239692 err: [INTERNAL_ERROR]cooldowned version is not aligned

Second type of warning:

W0611 08:28:23.461438  2013 status.h:395] meet error status: [IO_ERROR]failed to get file size xxxx/data/239732/239735.0.meta, (endpoint: xxxxxx, bucket: cool, key:xxx/data/239732/239735.0.meta, ), No response body., error code 404

	0#  doris::io::S3FileSystem::file_size_impl(std::filesystem::__cxx11::path const&, long*) const at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
	1#  doris::io::S3FileSystem::open_file_internal(doris::io::FileDescription const&, std::filesystem::__cxx11::path const&, std::shared_ptr<doris::io::FileReader>*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:445
	2#  doris::io::RemoteFileSystem::open_file_impl(doris::io::FileDescription const&, std::filesystem::__cxx11::path const&, doris::io::FileReaderOptions const&, std::shared_ptr<doris::io::FileReader>*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:445
	3#  doris::io::FileSystem::open_file(doris::io::FileDescription const&, doris::io::FileReaderOptions const&, std::shared_ptr<doris::io::FileReader>*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:356
	4#  doris::Tablet::_read_cooldown_meta(std::shared_ptr<doris::io::RemoteFileSystem> const&, doris::TabletMetaPB*) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
	5#  doris::Tablet::_follow_cooldowned_data() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:445
	6#  doris::Tablet::cooldown() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:445
	7#  std::_Function_handler<void (), doris::StorageEngine::_cooldown_tasks_producer_callback()::$_1>::_M_invoke(std::_Any_data const&) at /home/zcp/repo_center/doris_release/doris/be/src/olap/olap_server.cpp:1076
	8#  doris::WorkThreadPool<true>::work_thread(int) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:646
	9#  execute_native_thread_routine at /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:85
	10# ?
	11# clone

1 Answer

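With multiple replicas, the hot/cold tiered storage feature picks one replica as the cooldown replica and uses it as the baseline: the data and versions in object storage are those of the cooldown replica. Here is an example.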

Suppose we have three replicas, A, B, and C, with the following version distribution:
A [0,2], [3,5], [6,10]
B [0,7], [8,10]
C [0,3], [4,5], [6,7], [8,10]

Suppose the cooldown replica is A, and it has cooled down through version 5.

Warning analysis 1

The error "cannot read cooldown meta" means that the cooldown replica's meta cannot be found in object storage.

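In this scenario, B and C try to read the cooldown replica's information from object storage and find that cooldown replica A has not uploaded its cooldown meta yet, so this round of follow cooldown fails. Check A's log to see whether A's own cooldown failed.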

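The "failed to get file size" error has the same cause: the meta file was not found in object storage, which is why the request returns error code 404. Check the log in the same way as described above.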
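
To see at a glance which tablets hit which error, the warning lines can be grouped by error message. The following is a minimal sketch, not a Doris tool: the function name group_cooldown_failures and the regular expression are assumptions written against the log format shown in the question, and a different BE version may format these lines differently.

```python
import re
from collections import defaultdict

# Matches BE warning lines of the form shown in the question, for example:
#   ... olap_server.cpp:1080] failed to cooldown, tablet: 239732 err: [INTERNAL_ERROR]cannot read cooldown meta
# The pattern is an assumption based on those sample lines only.
COOLDOWN_FAILURE = re.compile(r"failed to cooldown, tablet: (\d+) err: \[(\w+)\](.*)$")


def group_cooldown_failures(lines):
    """Map each error message to the sorted list of tablet ids that reported it."""
    grouped = defaultdict(set)
    for line in lines:
        match = COOLDOWN_FAILURE.search(line.rstrip("\n"))
        if match:
            tablet_id, _error_code, message = match.groups()
            grouped[message.strip()].add(int(tablet_id))
    return {message: sorted(ids) for message, ids in grouped.items()}


# Three of the warning lines from the question, used as sample input.
sample = [
    "W0611 08:28:23.461499  2013 olap_server.cpp:1080] failed to cooldown, tablet: 239732 err: [INTERNAL_ERROR]cannot read cooldown meta",
    "W0611 08:28:23.464896  2011 olap_server.cpp:1080] failed to cooldown, tablet: 239876 err: [INTERNAL_ERROR]cooldowned version is not aligned",
    "W0611 08:28:23.464947  2012 olap_server.cpp:1080] failed to cooldown, tablet: 239900 err: [INTERNAL_ERROR]cooldowned version is not aligned",
]

failures = group_cooldown_failures(sample)
print(failures)
```

In practice the lines would come from the BE warning log on each node (its location depends on the deployment). Tablets listed under "cannot read cooldown meta" are the ones whose cooldown replica log is worth checking first.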

Warning analysis 2

The error "cooldowned version is not aligned" means, literally, that the cooldowned version cannot be aligned.

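Take replicas A and B as an example. A has cooled down through version 5, so the meta in object storage also records that cooldown has reached version 5. B reads that meta from object storage and compares it with its own local versions, and finds that its version 5 has already been compacted away: B's rowset [0,7] spans version 5, so B has no rowset boundary at version 5. B therefore cannot cool down directly using the version information in the meta. It can retry once A's cooldown version advances further; the BE retries this periodically and automatically.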
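
The comparison can be illustrated with the replica layout from the example above. This is a simplified model for illustration only, not Doris's implementation: the helper is_aligned is a hypothetical name, and it assumes a follower counts as aligned when one of its local rowsets ends exactly at the cooldowned version.

```python
from typing import Dict, List, Tuple

# A rowset is modelled as an inclusive (start_version, end_version) pair.
Rowset = Tuple[int, int]


def is_aligned(local_rowsets: List[Rowset], cooldowned_version: int) -> bool:
    """Return True if some local rowset ends exactly at cooldowned_version.

    Simplified assumption: a replica can follow the cooldown replica only
    when it has a rowset boundary at the cooldowned version.
    """
    return any(end == cooldowned_version for _start, end in local_rowsets)


# Version distribution from the example in this answer.
replicas: Dict[str, List[Rowset]] = {
    "A": [(0, 2), (3, 5), (6, 10)],
    "B": [(0, 7), (8, 10)],
    "C": [(0, 3), (4, 5), (6, 7), (8, 10)],
}

# A is the cooldown replica and has cooled down through version 5.
cooldowned_version = 5

alignment = {name: is_aligned(rowsets, cooldowned_version)
             for name, rowsets in replicas.items()}
print(alignment)  # {'A': True, 'B': False, 'C': True}
```

Under this model B is not aligned at version 5 because [0,7] swallows the boundary, while it would be aligned again if A's cooldown advanced to version 7 or 10.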