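I deployed Doris on an EKS cluster, and the BE nodes are now restarting for no apparent reason. Before the restart:
- The BE Pod printed the following to standard output: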
I0428 15:42:42.264484 1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962264473, "cf_name": "meta", "job": 8042, "event": "table_file_creation", "file_number": 39374, "file_size": 5145236, "table_properties": {"data_size": 5113986, "index_size": 130133, "filter_size": 0, "raw_key_size": 334885, "raw_average_key_size": 93, "raw_value_size": 38483542, "raw_average_value_size": 10698, "num_data_blocks": 1817, "num_entries": 3597, "filter_policy_name": "", "kDeletedKeys": "0", "kMergeOperands": "0"}}
I0428 15:42:42.266206 1582 olap_meta.cpp:68] [Rocksdb] [db/compaction_job.cc:1287] [meta] [JOB 8042] Compacted 4@0 + 1@1 files to L1 => 5145236 bytes
I0428 15:42:42.268757 1582 olap_meta.cpp:68] [Rocksdb] (Original Log Time 2024/04/28-15:42:42.268716) [db/compaction_job.cc:685] [meta] compacted to: base level 1 max bytes base 268435456 files[0 1 0 0 0 0 0] max score 0.02, MB/sec: 160.3 rd, 54.0 wr, level 1, files in(4, 1) out(1) MB in(12.2, 2.3) out(4.9), read-write-amplify(1.6) write-amplify(0.4) OK, records in: 3760, records dropped: 163 output_compression: Snappy
I0428 15:42:42.268774 1582 olap_meta.cpp:68] [Rocksdb] (Original Log Time 2024/04/28-15:42:42.268729) EVENT_LOG_v1 {"time_micros": 1714318962268723, "job": 8042, "event": "compaction_finished", "compaction_time_micros": 95211, "output_level": 1, "num_output_files": 1, "total_output_size": 5145236, "num_input_records": 3760, "num_output_records": 3597, "num_subcompactions": 1, "output_compression": "Snappy", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 1, 0, 0, 0, 0, 0]}
I0428 15:42:42.269580 1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962269576, "job": 8042, "event": "table_file_deletion", "file_number": 39373}
I0428 15:42:42.270237 1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962270236, "job": 8042, "event": "table_file_deletion", "file_number": 39371}
I0428 15:42:42.270915 1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962270912, "job": 8042, "event": "table_file_deletion", "file_number": 39369}
I0428 15:42:42.271484 1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962271483, "job": 8042, "event": "table_file_deletion", "file_number": 39367}
I0428 15:42:42.272046 1582 olap_meta.cpp:68] [Rocksdb] EVENT_LOG_v1 {"time_micros": 1714318962272044, "job": 8042, "event": "table_file_deletion", "file_number": 39365}
/opt/apache-doris/be/bin/start_be.sh: line 360: 588 Killed ${LIMIT:+${LIMIT}} "${DORIS_HOME}/lib/doris_be" "$@" 2>&1 < /dev/null
- be.out did not print any stack trace.
- be.WARNING contains many entries like the following before the restart, at a rate of about once per hour:
W0428 19:44:01.763976 912 status.h:393] meet error status: [IO_ERROR]failed to list /opt/apache-doris/be/storage/mini_download: (2), No such file or directory
0# doris::io::LocalFileSystem::list_impl(std::filesystem::__cxx11::path const&, bool, std::vector<doris::io::FileInfo, std::allocator<doris::io::FileInfo> >*, bool*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
1# doris::io::FileSystem::list(std::filesystem::__cxx11::path const&, bool, std::vector<doris::io::FileInfo, std::allocator<doris::io::FileInfo> >*, bool*) at /root/src/doris-2.0/be/src/common/status.h:354
2# doris::LoadPathMgr::clean_one_path(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:360
3# std::_Function_handler<void (), doris::LoadPathMgr::init()::$_0>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_iterator.h:1034
4# doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
5# ?
6# ?
- BE image: selectdb/doris.be-ubuntu:2.0.3
- Helm chart values file used for the deployment:
dorisCluster:
name: msgcenter
password: "*******"
feSpec:
service:
type: LoadBalancer
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: nlb
service.beta.kubernetes.io/aws-load-balancer-name: msgcenter-doris-fe
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
resource:
requests:
cpu: 1
memory: 4Gi
limits:
cpu: 1
memory: 4Gi
nodeSelector:
node-group: doris
tolerations:
- key: "node-group"
operator: "Equal"
value: "doris"
effect: "NoSchedule"
persistentVolumeClaim:
metaPersistentVolume:
storageClassName: "gp3"
storage: "200Gi"
logsPersistentVolume:
storageClassName: "gp3"
storage: "20Gi"
beSpec:
replicas: 3
configMap:
be.conf: |
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
CUR_DATE=`date +%Y%m%d-%H%M%S`
PPROF_TMPDIR="$DORIS_HOME/log/"
JAVA_OPTS="-Xmx1024m -DlogPath=$DORIS_HOME/log/jni.log -Xloggc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"
# For jdk 9+, this JAVA_OPTS will be used as default JVM options
JAVA_OPTS_FOR_JDK_9="-Xmx1024m -DlogPath=$DORIS_HOME/log/jni.log -Xlog:gc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"
# since 1.2, the JAVA_HOME need to be set to run BE process.
# JAVA_HOME=/path/to/jdk/
# https://github.com/apache/doris/blob/master/docs/zh-CN/community/developer-guide/debug-tool.md#jemalloc-heap-profile
# https://jemalloc.net/jemalloc.3.html
JEMALLOC_CONF="percpu_arena:percpu,background_thread:true,metadata_thp:auto,muzzy_decay_ms:15000,dirty_decay_ms:15000,oversize_threshold:0,lg_tcache_max:20,prof:false,lg_prof_interval:32,lg_prof_sample:19,prof_gdump:false,prof_accum:false,prof_leak:false,prof_final:false"
JEMALLOC_PROF_PRFIX=""
# INFO, WARNING, ERROR, FATAL
sys_log_level = INFO
# ports for admin, web, heartbeat service
be_port = 9060
webserver_port = 8040
heartbeat_service_port = 9050
brpc_port = 8060
# HTTPS configures
enable_https = false
# path of certificate in PEM format.
ssl_certificate_path = "$DORIS_HOME/conf/cert.pem"
# path of private key in PEM format.
ssl_private_key_path = "$DORIS_HOME/conf/key.pem"
# enable auth check
enable_auth = false
# Choose one if there are more than one ip except loopback address.
# Note that there should at most one ip match this list.
# If no ip match this rule, will choose one randomly.
# use CIDR format, e.g. 10.10.10.0/24 or IP format, e.g. 10.10.10.1
# Default value is empty.
# priority_networks = 10.10.10.0/24;192.168.0.0/16
# data root path, separate by ';'
# you can specify the storage medium of each root path, HDD or SSD
# you can add capacity limit at the end of each root path, separate by ','
# eg:
# storage_root_path = /home/disk1/doris.HDD,50;/home/disk2/doris.SSD,1;/home/disk2/doris
# /home/disk1/doris.HDD, capacity limit is 50GB, HDD;
# /home/disk2/doris.SSD, capacity limit is 1GB, SSD;
# /home/disk2/doris, capacity limit is disk capacity, HDD(default)
#
# you also can specify the properties by setting '<property>:<value>', separate by ','
# property 'medium' has a higher priority than the extension of path
#
# Default value is ${DORIS_HOME}/storage, you should create it by hand.
# storage_root_path = ${DORIS_HOME}/storage
storage_root_path = ${DORIS_HOME}/storage,medium:SSD
# Default dirs to put jdbc drivers,default value is ${DORIS_HOME}/jdbc_drivers
# jdbc_drivers_dir = ${DORIS_HOME}/jdbc_drivers
# Advanced configurations
# sys_log_dir = ${DORIS_HOME}/log
# sys_log_roll_mode = SIZE-MB-1024
# sys_log_roll_num = 10
# sys_log_verbose_modules = *
# log_buffer_level = -1
# palo_cgroups
service:
type: LoadBalancer
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: nlb
service.beta.kubernetes.io/aws-load-balancer-name: msgcenter-doris-be
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
env:
- name: PASSWD
value: "*******"
resource:
requests:
cpu: 2700m
memory: 10Gi
limits:
cpu: 2700m
memory: 10Gi
nodeSelector:
node-group: doris
tolerations:
- key: "node-group"
operator: "Equal"
value: "doris"
effect: "NoSchedule"
persistentVolumeClaim: