[Solved] K8S BE nodes restart for unknown reasons

I deployed Doris on an EKS cluster, and the BE nodes are now restarting for unknown reasons. Before a restart:

  • The BE Pod prints the following to standard output:
I0428 15:42:42.264484  1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962264473, "cf_name": "meta", "job": 8042, "event": "table_file_creation", "file_number": 39374, "file_size": 5145236, "table_properties": {"data_size": 5113986, "index_size": 130133, "filter_size": 0, "raw_key_size": 334885, "raw_average_key_size": 93, "raw_value_size": 38483542, "raw_average_value_size": 10698, "num_data_blocks": 1817, "num_entries": 3597, "filter_policy_name": "", "kDeletedKeys": "0", "kMergeOperands": "0"}}
I0428 15:42:42.266206  1582 olap_meta.cpp:68] [Rocksdb] [db/compaction_job.cc:1287] [meta] [JOB 8042] Compacted 4@0 + 1@1 files to L1 => 5145236 bytes
I0428 15:42:42.268757  1582 olap_meta.cpp:68] [Rocksdb] (Original Log Time 2024/04/28-15:42:42.268716) [db/compaction_job.cc:685] [meta] compacted to: base level 1 max bytes base 268435456 files[0 1 0 0 0 0 0] max score 0.02, MB/sec: 160.3 rd, 54.0 wr, level 1, files in(4, 1) out(1) MB in(12.2, 2.3) out(4.9), read-write-amplify(1.6) write-amplify(0.4) OK, records in: 3760, records dropped: 163 output_compression: Snappy
I0428 15:42:42.268774  1582 olap_meta.cpp:68] [Rocksdb] (Original Log Time 2024/04/28-15:42:42.268729) EVENT_LOG_v1 {"time_micros": 1714318962268723, "job": 8042, "event": "compaction_finished", "compaction_time_micros": 95211, "output_level": 1, "num_output_files": 1, "total_output_size": 5145236, "num_input_records": 3760, "num_output_records": 3597, "num_subcompactions": 1, "output_compression": "Snappy", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 1, 0, 0, 0, 0, 0]}
I0428 15:42:42.269580  1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962269576, "job": 8042, "event": "table_file_deletion", "file_number": 39373}
I0428 15:42:42.270237  1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962270236, "job": 8042, "event": "table_file_deletion", "file_number": 39371}
I0428 15:42:42.270915  1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962270912, "job": 8042, "event": "table_file_deletion", "file_number": 39369}
I0428 15:42:42.271484  1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962271483, "job": 8042, "event": "table_file_deletion", "file_number": 39367}
I0428 15:42:42.272046  1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962272044, "job": 8042, "event": "table_file_deletion", "file_number": 39365}
/opt/apache-doris/be/bin/start_be.sh: line 360:   588 Killed                  ${LIMIT:+${LIMIT}} "${DORIS_HOME}/lib/doris_be" "$@" 2>&1 < /dev/null
  • be.out contains no stack trace

  • be.WARNING contains many entries like the following before the restart, about once per hour:
W0428 19:44:01.763976   912 status.h:393] meet error status: [IO_ERROR]failed to list /opt/apache-doris/be/storage/mini_download: (2), No such file or directory

        0#  doris::io::LocalFileSystem::list_impl(std::filesystem::__cxx11::path const&, bool, std::vector<doris::io::FileInfo, std::allocator<doris::io::FileInfo> >*, bool*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
        1#  doris::io::FileSystem::list(std::filesystem::__cxx11::path const&, bool, std::vector<doris::io::FileInfo, std::allocator<doris::io::FileInfo> >*, bool*) at /root/src/doris-2.0/be/src/common/status.h:354
        2#  doris::LoadPathMgr::clean_one_path(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:360
        3#  std::_Function_handler<void (), doris::LoadPathMgr::init()::$_0>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_iterator.h:1034
        4#  doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
        5#  ?
        6#  ?
  • BE image: selectdb/doris.be-ubuntu:2.0.3
  • Helm chart values file used for the deployment:
dorisCluster:
  name: msgcenter
  password: "*******"

feSpec:
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      service.beta.kubernetes.io/aws-load-balancer-name: msgcenter-doris-fe
      service.beta.kubernetes.io/aws-load-balancer-scheme: internal
  resource:
    requests:
      cpu: 1
      memory: 4Gi
    limits:
      cpu: 1
      memory: 4Gi
  nodeSelector:
    node-group: doris
  tolerations:
  - key: "node-group"
    operator: "Equal"
    value: "doris"
    effect: "NoSchedule"
  persistentVolumeClaim:
    metaPersistentVolume:
      storageClassName: "gp3"
      storage: "200Gi"
    logsPersistentVolume:
      storageClassName: "gp3"
      storage: "20Gi"

beSpec:
  replicas: 3
  configMap:
    be.conf: |
      # Licensed to the Apache Software Foundation (ASF) under one
      # or more contributor license agreements.  See the NOTICE file
      # distributed with this work for additional information
      # regarding copyright ownership.  The ASF licenses this file
      # to you under the Apache License, Version 2.0 (the
      # "License"); you may not use this file except in compliance
      # with the License.  You may obtain a copy of the License at
      #
      #   http://www.apache.org/licenses/LICENSE-2.0
      #
      # Unless required by applicable law or agreed to in writing,
      # software distributed under the License is distributed on an
      # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
      # KIND, either express or implied.  See the License for the
      # specific language governing permissions and limitations
      # under the License.

      CUR_DATE=`date +%Y%m%d-%H%M%S`

      PPROF_TMPDIR="$DORIS_HOME/log/"

      JAVA_OPTS="-Xmx1024m -DlogPath=$DORIS_HOME/log/jni.log -Xloggc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"

      # For jdk 9+, this JAVA_OPTS will be used as default JVM options
      JAVA_OPTS_FOR_JDK_9="-Xmx1024m -DlogPath=$DORIS_HOME/log/jni.log -Xlog:gc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"

      # since 1.2, the JAVA_HOME need to be set to run BE process.
      # JAVA_HOME=/path/to/jdk/

      # https://github.com/apache/doris/blob/master/docs/zh-CN/community/developer-guide/debug-tool.md#jemalloc-heap-profile
      # https://jemalloc.net/jemalloc.3.html
      JEMALLOC_CONF="percpu_arena:percpu,background_thread:true,metadata_thp:auto,muzzy_decay_ms:15000,dirty_decay_ms:15000,oversize_threshold:0,lg_tcache_max:20,prof:false,lg_prof_interval:32,lg_prof_sample:19,prof_gdump:false,prof_accum:false,prof_leak:false,prof_final:false"
      JEMALLOC_PROF_PRFIX=""

      # INFO, WARNING, ERROR, FATAL
      sys_log_level = INFO

      # ports for admin, web, heartbeat service
      be_port = 9060
      webserver_port = 8040
      heartbeat_service_port = 9050
      brpc_port = 8060

      # HTTPS configures
      enable_https = false
      # path of certificate in PEM format.
      ssl_certificate_path = "$DORIS_HOME/conf/cert.pem"
      # path of private key in PEM format.
      ssl_private_key_path = "$DORIS_HOME/conf/key.pem"

      # enable auth check
      enable_auth = false

      # Choose one if there are more than one ip except loopback address.
      # Note that there should at most one ip match this list.
      # If no ip match this rule, will choose one randomly.
      # use CIDR format, e.g. 10.10.10.0/24 or IP format, e.g. 10.10.10.1
      # Default value is empty.
      # priority_networks = 10.10.10.0/24;192.168.0.0/16

      # data root path, separate by ';'
      # you can specify the storage medium of each root path, HDD or SSD
      # you can add capacity limit at the end of each root path, separate by ','
      # eg:
      # storage_root_path = /home/disk1/doris.HDD,50;/home/disk2/doris.SSD,1;/home/disk2/doris
      # /home/disk1/doris.HDD, capacity limit is 50GB, HDD;
      # /home/disk2/doris.SSD, capacity limit is 1GB, SSD;
      # /home/disk2/doris, capacity limit is disk capacity, HDD(default)
      #
      # you also can specify the properties by setting '<property>:<value>', separate by ','
      # property 'medium' has a higher priority than the extension of path
      #
      # Default value is ${DORIS_HOME}/storage, you should create it by hand.
      # storage_root_path = ${DORIS_HOME}/storage
      storage_root_path = ${DORIS_HOME}/storage,medium:SSD

      # Default dirs to put jdbc drivers,default value is ${DORIS_HOME}/jdbc_drivers
      # jdbc_drivers_dir = ${DORIS_HOME}/jdbc_drivers

      # Advanced configurations
      # sys_log_dir = ${DORIS_HOME}/log
      # sys_log_roll_mode = SIZE-MB-1024
      # sys_log_roll_num = 10
      # sys_log_verbose_modules = *
      # log_buffer_level = -1
      # palo_cgroups
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      service.beta.kubernetes.io/aws-load-balancer-name: msgcenter-doris-be
      service.beta.kubernetes.io/aws-load-balancer-scheme: internal
  env:
  - name: PASSWD
    value: "*******"
  resource:
    requests:
      cpu: 2700m
      memory: 10Gi
    limits:
      cpu: 2700m
      memory: 10Gi
  nodeSelector:
    node-group: doris
  tolerations:
  - key: "node-group"
    operator: "Equal"
    value: "doris"
    effect: "NoSchedule"
  persistentVolumeClaim:

1 Answer

${LIMIT:+${LIMIT}} "${DORIS_HOME}/lib/doris_be" "$@" 2>&1 < /dev/null
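
The `Killed` message on this line is what the shell prints when a child process is terminated by SIGKILL. SIGKILL cannot be caught, which is consistent with be.out containing no stack trace. As a minimal sketch for decoding the exit status that a shell or container runtime reports afterwards: `describe_exit` is a hypothetical helper, and the status 137 used below is an illustrative value, not one taken from the logs above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Doris): describe a process exit status.
# By shell convention, a status above 128 means the process was
# terminated by signal number (status - 128).
describe_exit() {
  status=$1
  if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # `kill -l N` prints the name of signal N without the SIG prefix.
    echo "terminated by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exited normally with status $status"
  fi
}
```

For example, `describe_exit 137` prints `terminated by signal 9 (SIGKILL)`. Who sent the signal still has to be established separately, for instance from the Pod's last termination state.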

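Possible causes:

  1. The BE package does not match the machine's CPU architecture
  2. Required kernel parameters such as max_map_count have been changed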
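
Both items in this list can be checked from a shell on the node or inside the BE container. The following is a minimal sketch rather than an official Doris tool: the function names are hypothetical, and the default threshold of 2000000 is assumed from the value recommended in the Doris deployment documentation, so verify it against your version.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Doris) for the two checks above.

# Compare the machine architecture with the one the BE package targets.
# $1: expected architecture, e.g. x86_64 or aarch64
# $2: actual architecture (defaults to the output of `uname -m`)
check_arch() {
  expected=$1
  actual=${2:-$(uname -m)}
  if [ "$actual" = "$expected" ]; then
    echo "arch ok: $actual"
  else
    echo "arch mismatch: machine is $actual, package expects $expected"
    return 1
  fi
}

# Check that vm.max_map_count is at least the required minimum.
# $1: minimum value (assumed default: 2000000)
# $2: file to read (defaults to /proc/sys/vm/max_map_count)
check_max_map_count() {
  min=${1:-2000000}
  file=${2:-/proc/sys/vm/max_map_count}
  current=$(cat "$file") || return 2
  if [ "$current" -ge "$min" ]; then
    echo "max_map_count ok: $current"
  else
    echo "max_map_count too low: $current < $min"
    return 1
  fi
}
```

The toolchain paths in the stack trace from the question contain `x86_64-linux-gnu`, which suggests calling `check_arch x86_64` for this image. Calling `check_max_map_count` with no arguments reads the live kernel value.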

Also check whether JAVA_HOME is configured correctly.
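
In the be.conf from the question, the `JAVA_HOME=/path/to/jdk/` line is commented out, so the BE process depends on whatever the container environment provides. A minimal sketch for verifying it, assuming a valid JAVA_HOME is simply a directory with an executable `bin/java` (`check_java_home` is a hypothetical helper, not part of Doris):

```shell
#!/bin/sh
# Hypothetical helper (not part of Doris).
# Succeeds only if the given directory contains an executable bin/java.
# $1: directory to check (defaults to $JAVA_HOME)
check_java_home() {
  dir=${1:-${JAVA_HOME:-}}
  if [ -z "$dir" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ -x "$dir/bin/java" ]; then
    echo "JAVA_HOME ok: $dir"
  else
    echo "no executable java under $dir/bin"
    return 1
  fi
}
```

Run `check_java_home` with no arguments inside the BE container to test the value the process would actually inherit.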