failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=*****

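Doris is deployed on Kubernetes in compute-storage decoupled (disaggregated) mode. After running for a while, queries started failing with timeout errors. The pod 'test-disaggregated-cluster-ms-*' logs the following errors: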

RuntimeLogger W20250115 02:17:44.783914   363 txn_kv.cpp:389] virtual TxnErrorCode doris::cloud::fdb::Transaction::get(std::string_view, std::string *, bool) failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=011076657273696f6e00011031383134383031373133000110706172746974696f6e000112000000000000271212000000000000355e120000000000003565
RuntimeLogger I20250115 02:17:44.796782   325 meta_service_helper.h:81] begin get_obj_store_info from 192.164.11.22:39088 request=cloud_unique_id: "1:1814801713:uXZDfzBa"
RuntimeLogger W20250115 02:17:46.505126   360 txn_kv.cpp:389] virtual TxnErrorCode doris::cloud::fdb::Transaction::get(std::string_view, std::string *, bool) failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=01106d657461000110313831343830313731330001107461626c65745f696e64657800011200000000000561aa
RuntimeLogger W20250115 02:17:46.752558   330 txn_kv.cpp:431] Operation aborted because the transaction timed out
RuntimeLogger W20250115 02:17:46.820328   363 txn_kv.cpp:389] virtual TxnErrorCode doris::cloud::fdb::Transaction::get(std::string_view, std::string *, bool) failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=0110696e7374616e6365000110313831343830313731330001
RuntimeLogger I20250115 02:17:46.820472   363 meta_service_resource.cpp:238] get instance_key=0110696e7374616e6365000110313831343830313731330001
RuntimeLogger I20250115 02:17:46.820581   363 meta_service_helper.h:147] finish get_obj_store_info from 192.164.248.254:49466 response=status {
  code: KV_TXN_GET_ERR
  msg: "failed to get instance, instance_id=1814801713 err=Timeout"
}
RuntimeLogger I20250115 02:17:47.380452   355 meta_service_helper.h:81] begin get_obj_store_info from 192.164.248.254:49466 request=cloud_unique_id: "1:1814801713:FxU0_gTN"
RuntimeLogger W20250115 02:17:47.413942   360 txn_kv.cpp:389] virtual TxnErrorCode doris::cloud::fdb::Transaction::get(std::string_view, std::string *, bool) failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=0110696e7374616e6365000110313831343830313731330001
RuntimeLogger I20250115 02:17:47.414059   360 meta_service_resource.cpp:238] get instance_key=0110696e7374616e6365000110313831343830313731330001
RuntimeLogger I20250115 02:17:47.414126   360 meta_service_helper.h:147] finish get_obj_store_info from 192.164.248.234:53870 response=status {
  code: KV_TXN_GET_ERR
  msg: "failed to get instance, instance_id=1814801713 err=Timeout"
}
RuntimeLogger W20250115 02:17:47.562582   156 txn_kv.cpp:389] virtual TxnErrorCode doris::cloud::fdb::Transaction::get(std::string_view, std::string *, bool) failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=021073797374656d0001106d6574612d7365727669636500011072656769737472790001
RuntimeLogger W20250115 02:17:47.972550   330 txn_kv.cpp:389] virtual TxnErrorCode doris::cloud::fdb::Transaction::get(std::string_view, std::string *, bool) failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=011076657273696f6e00011031383134383031373133000110706172746974696f6e000112000000000000271212000000000000355e120000000000003565
RuntimeLogger I20250115 02:17:48.119001   355 meta_service_helper.h:81] begin get_obj_store_info from 192.164.248.234:53870 request=cloud_unique_id: "1:1814801713:ZHVqa89N"
RuntimeLogger I20250115 02:17:48.208321   323 main.cpp:296] Periodically log for recycler
RuntimeLogger W20250115 02:17:48.482789   363 txn_kv.cpp:389] virtual TxnErrorCode doris::cloud::fdb::Transaction::get(std::string_view, std::string *, bool) failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=0110696e7374616e6365000110313831343830313731330001
RuntimeLogger I20250115 02:17:48.482911   363 meta_service_resource.cpp:238] get instance_key=0110696e7374616e6365000110313831343830313731330001
RuntimeLogger I20250115 02:17:48.482997   363 meta_service_helper.h:147] finish get_obj_store_info from 192.164.11.24:34020 response=status {
  code: KV_TXN_GET_ERR
  msg: "failed to get instance, instance_id=1814801713 err=Timeout"
}
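
The key= values in these log lines are hex-encoded byte strings. As a rough aid for reading them, the Python sketch below pulls the printable ASCII runs out of a key. It assumes nothing about the Doris key encoding beyond the hex dump itself, and the helper name printable_runs is made up for illustration.

```python
import re


def printable_runs(hex_key, min_len=3):
    """Return the printable ASCII runs found in a hex-encoded key.

    This is only a reading aid: it does not parse the real key layout,
    it just skips bytes outside the printable range 0x20..0x7e.
    """
    raw = bytes.fromhex(hex_key)
    pattern = re.compile(rb"[ -~]{" + str(min_len).encode("ascii") + rb",}")
    return [m.group().decode("ascii") for m in pattern.finditer(raw)]


if __name__ == "__main__":
    # Key copied from the get_obj_store_info failure in the log above.
    key = "0110696e7374616e6365000110313831343830313731330001"
    print(printable_runs(key))
```

For the instance key above this yields the strings instance and 1814801713, which matches the instance_id reported in the KV_TXN_GET_ERR response. In other words, even the most basic metadata read is timing out.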

Restarting the FoundationDB components at this point has no effect.
If I restart the pod 'test-disaggregated-cluster-ms-*', it fails to start and logs the following:

LIBHDFS3_CONF=
starts doris_cloud with args: 
Wed Jan 15 02:17:12 UTC 2025
process working directory: "/opt/apache-doris/ms"
pid=149 written to file=./bin/doris_cloud.pid
RuntimeLogger I20250115 02:17:12.232849   149 main.cpp:214] try to start doris_cloud
RuntimeLogger I20250115 02:17:12.233088   149 main.cpp:215] version:{doris-3.0.3-rc04-release} code_version:{commit=62a58bff4c2f640f1afcba8c754058d5f77d420f time=2024-12-08 05:42:14 +0800} build_info:{initiator=root@vm-70 build_at=2024-12-08 05:42:14 +0800 build_on=PRETTY_NAME="Ubuntu 22.04.4 LTS" NAME="Ubuntu" }
version:{doris-3.0.3-rc04-release} code_version:{commit=62a58bff4c2f640f1afcba8c754058d5f77d420f time=2024-12-08 05:42:14 +0800} build_info:{initiator=root@vm-70 build_at=2024-12-08 05:42:14 +0800 build_on=PRETTY_NAME="Ubuntu 22.04.4 LTS" NAME="Ubuntu" }

RuntimeLogger I20250115 02:17:12.233106   149 main.cpp:221] meta_service and recycler are both not specified, run doris_cloud as meta_service and recycler by default
run doris_cloud as meta_service and recycler by default
RuntimeLogger I20250115 02:17:12.233132   149 main.cpp:243] begin to init txn kv
RuntimeLogger I20250115 02:17:12.235663   149 main.cpp:251] successfully init txn kv, elapsed milliseconds: 2
RuntimeLogger W20250115 02:17:22.239355   149 txn_kv.cpp:389] virtual TxnErrorCode doris::cloud::fdb::Transaction::get(std::string_view, std::string *, bool) failed to fdb_future_get_error err=Operation aborted because the transaction timed out key=021073797374656d0001106d6574612d73657276696365000110656e6372797074696f6e5f6b65795f696e666f0001
RuntimeLogger W20250115 02:17:22.239964   149 encryption_util.cpp:560] failed to get key of encryption_key_info err=Timeout
RuntimeLogger W20250115 02:17:22.240048   149 encryption_util.cpp:708] failed to generate random root key
RuntimeLogger W20250115 02:17:22.240077   149 main.cpp:255] failed to init global encryption key map
RuntimeLogger W20250115 02:17:22.240223   149 txn_kv.cpp:253] fdb_stop_network
RuntimeLogger W20250115 02:17:22.240303   153 txn_kv.cpp:248] exit fdb_run_network

The Doris image version is 3.0.3 and the FoundationDB image version is 7.1.65.
FE, BE, MS, and FDB mostly use their default configuration.

How can I solve this problem?

1 Answer

The problem is solved.
In cluster.yaml, the installation file for fdb-kubernetes-operator, add the following setting:
useDNSInClusterFile: true

For the exact location, see the following excerpt:

  routing:
    defineDNSLocalityFields: true
    useDNSInClusterFile: true
  sidecarContainer:
    imageConfigs:
      - baseImage: harbor.yoocar.com.cn/middleware/foundationdb/foundationdb-kubernetes-sidecar
        tag: 7.1.65-1
    enableLivenessProbe: true
    enableReadinessProbe: false
  useExplicitListenAddress: true
  version: 7.1.65

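With this setting, the FDB coordination servers are reached through DNS names rather than pod IPs, which change when a pod is recreated. Even if all coordination server pods are killed, fdb-kubernetes-operator schedules new pods with the same names. FDB is unavailable while this happens, but it becomes available again after a while.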
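
To check whether a cluster file already lists DNS names, something like the following Python sketch can classify each coordinator entry. It assumes the usual cluster file layout description:id@host:port,host:port (optionally with a :tls suffix). The function name and the sample addresses are hypothetical.

```python
import ipaddress


def classify_coordinators(cluster_file_text):
    """Return (entry, kind) pairs, where kind is 'ip' or 'dns'."""
    _, _, coords = cluster_file_text.strip().partition("@")
    result = []
    for entry in coords.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if entry.endswith(":tls"):
            entry = entry[: -len(":tls")]
        # Split off the port; strip brackets used around IPv6 literals.
        host, _, _port = entry.rpartition(":")
        host = host.strip("[]")
        try:
            ipaddress.ip_address(host)
            kind = "ip"
        except ValueError:
            kind = "dns"
        result.append((entry, kind))
    return result


if __name__ == "__main__":
    # Hypothetical cluster file contents, for illustration only.
    sample = "test_cluster:abc123@10.0.0.1:4501,fdb-log-1.fdb.svc.cluster.local:4501"
    for entry, kind in classify_coordinators(sample):
        print(kind, entry)
```

If every entry is classified as ip, the clients still depend on pod IPs and will lose the coordinators whenever those pods are rescheduled.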

For details, see:
https://github.com/FoundationDB/fdb-kubernetes-operator/blob/v1.52.0/docs/manual/customization.md
specifically the "Using DNS" section.