[Solved] Doris 2.1.2: load is severely imbalanced across BE nodes


Version: Doris 2.1.2

BE nodes:

+-----------+---------------+---------------+--------+----------+----------+--------------------+---------------------+---------------------+-------+----------------------+-----------+------------------+--------------------+---------------+---------------+---------+----------------+--------------------+--------------------------+--------+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------+-------------------------+----------+
| BackendId | Host          | HeartbeatPort | BePort | HttpPort | BrpcPort | ArrowFlightSqlPort | LastStartTime       | LastHeartbeat       | Alive | SystemDecommissioned | TabletNum | DataUsedCapacity | TrashUsedCapcacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | RemoteUsedCapacity | Tag                      | ErrMsg | Version                     | Status                                                                                                                        | HeartbeatFailureCounter | NodeRole |
+-----------+---------------+---------------+--------+----------+----------+--------------------+---------------------+---------------------+-------+----------------------+-----------+------------------+--------------------+---------------+---------------+---------+----------------+--------------------+--------------------------+--------+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------+-------------------------+----------+
| 10236     | 172.31.21.21  | 9050          | 9060   | 8040     | 8060     | -1                 | 2024-05-01 18:55:28 | 2024-05-06 03:00:06 | true  | false                | 7311      | 47.891 GB        | 303.316 GB         | 1.508 TB      | 2.001 TB      | 24.67 % | 29.80 %        | 1.348 KB           | {"location" : "default"} |        | doris-2.1.2-rc04-b130df2488 | {"lastSuccessReportTabletsTime":"2024-05-06 02:59:38","lastStreamLoadTime":-1,"isQueryDisabled":false,"isLoadDisabled":false} | 0                       | mix      |
| 10312     | 172.31.21.229 | 9050          | 9060   | 8040     | 8060     | -1                 | 2024-05-03 19:14:57 | 2024-05-06 03:00:06 | true  | true                 | 544       | 0.000            | 386.919 GB         | 553.132 GB    | 1.025 TB      | 47.32 % | 47.32 %        | 0.000              | {"location" : "default"} |        | doris-2.1.2-rc04-b130df2488 | {"lastSuccessReportTabletsTime":"2024-05-06 02:59:18","lastStreamLoadTime":-1,"isQueryDisabled":false,"isLoadDisabled":false} | 0                       | mix      |
| 10313     | 172.31.22.135 | 9050          | 9060   | 8040     | 8060     | -1                 | 2024-05-03 13:26:37 | 2024-05-06 03:00:06 | true  | false                | 8939      | 70.386 GB        | 25.741 GB          | 828.069 GB    | 1.025 TB      | 21.13 % | 21.13 %        | 0.000              | {"location" : "default"} |        | doris-2.1.2-rc04-b130df2488 | {"lastSuccessReportTabletsTime":"2024-05-06 02:59:26","lastStreamLoadTime":-1,"isQueryDisabled":false,"isLoadDisabled":false} | 0                       | mix      |
| 67726     | 172.31.24.252 | 9050          | 9060   | 8040     | 8060     | -1                 | 2024-04-28 05:15:07 | 2024-05-06 03:00:06 | true  | false                | 11196     | 80.226 GB        | 76.163 GB          | 729.414 GB    | 1.025 TB      | 30.51 % | 36.36 %        | 1.348 KB           | {"location" : "default"} |        | doris-2.1.2-rc04-b130df2488 | {"lastSuccessReportTabletsTime":"2024-05-06 02:59:48","lastStreamLoadTime":-1,"isQueryDisabled":false,"isLoadDisabled":false} | 0                       | mix      |
| 93095     | 172.31.20.199 | 9050          | 9060   | 8040     | 8060     | -1                 | 2024-04-30 03:20:50 | 2024-05-06 03:00:06 | true  | false                | 13367     | 144.911 GB       | 35.441 GB          | 661.337 GB    | 1.025 TB      | 37.00 % | 40.57 %        | 1.348 KB           | {"location" : "default"} |        | doris-2.1.2-rc04-b130df2488 | {"lastSuccessReportTabletsTime":"2024-05-06 02:59:51","lastStreamLoadTime":-1,"isQueryDisabled":false,"isLoadDisabled":false} | 0                       | mix      |
| 276075    | 172.31.19.176 | 9050          | 9060   | 8040     | 8060     | -1                 | 2024-04-26 17:45:02 | 2024-05-06 03:00:06 | true  | false                | 19181     | 149.240 GB       | 136.598 GB         | 1.112 TB      | 1.513 TB      | 26.49 % | 32.57 %        | 0.000              | {"location" : "default"} |        | doris-2.1.2-rc04-b130df2488 | {"lastSuccessReportTabletsTime":"2024-05-06 02:59:52","lastStreamLoadTime":-1,"isQueryDisabled":false,"isLoadDisabled":false} | 0                       | mix      |
+-----------+---------------+---------------+--------+----------+----------+--------------------+---------------------+---------------------+-------+----------------------+-----------+------------------+--------------------+---------------+---------------+---------+----------------+--------------------+--------------------------+--------+-----------------------------+-------------------------------------------------------------------------------------------------------------------------------+-------------------------+----------+
6 rows in set (0.00 sec)

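I plan to decommission 172.31.21.229. Across the other five nodes, both the tablet counts and the system load are unbalanced, and 172.31.20.199 occasionally spikes to a load average above 40.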

The load and disk layout of the other nodes are listed below.
Every node has the same hardware: 8 cores and 64 GB of memory.


172.31.22.135
    Disks: one 1 TB SSD, 1 TB total
    Load: load average: 5.91, 6.37, 5.58
172.31.21.21
    Disks: two 1 TB SSDs, 2 TB total
    Load: load average: 4.37, 4.85, 4.86
172.31.24.252
    Disks: two 500 GB SSDs, 1 TB total
    Load: load average: 6.33, 8.69, 7.59
172.31.20.199
    Disks: two 500 GB SSDs, 1 TB total
    Load: load average: 11.16, 14.90, 16.88
172.31.19.176
    Disks: three 500 GB SSDs, 1.5 TB total
    Load: load average: 7.25, 9.50, 9.46

What should I do to make the load across the cluster more balanced?

3 Answers

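If you do not need hot/cold data tiering, remove the explicitly specified storage medium type from be.conf on every BE, then restart the BEs; they will rebalance automatically.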
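As a rough illustration of the change suggested above, the following Python sketch strips the medium attribute from a `storage_root_path` value. It assumes the answer refers to the `medium:` attribute of `storage_root_path` in be.conf, and it only handles the `path,medium:XXX` form. The paths shown are hypothetical, since the thread does not include the real be.conf, and `strip_medium` is a helper name invented for this example.

```python
def strip_medium(storage_root_path: str) -> str:
    """Return the storage_root_path value with every `medium:...` attribute removed.

    Entries are separated by ';' and attributes within an entry by ','.
    Other attributes (for example `capacity:...`) are left untouched.
    """
    entries = []
    for entry in storage_root_path.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        parts = [p.strip() for p in entry.split(",")]
        kept = [p for p in parts if not p.lower().startswith("medium:")]
        entries.append(",".join(kept))
    return ";".join(entries)


if __name__ == "__main__":
    # Hypothetical value; the real paths were not given in the thread.
    before = "/data1/doris,medium:SSD;/data2/doris,medium:SSD"
    print("storage_root_path =", strip_medium(before))
```

The rewritten value would go back into be.conf on each BE before the restart. Keep a copy of the original file so the change can be reverted.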


SHOW PROC '/cluster_balance/cluster_load_stat/location_default/SSD';
+--------+-----------+--------------+---------------+---------+-------------+------------+--------------------+--------------------+--------------------+-------+
| BeId   | Available | UsedCapacity | Capacity      | MaxDisk | UsedPercent | ReplicaNum | CapCoeff           | ReplCoeff          | Score              | Class |
+--------+-----------+--------------+---------------+---------+-------------+------------+--------------------+--------------------+--------------------+-------+
| 10313  | true      | 278328541184 | 1127337742336 | MID     | 24.689      | 12498      | 0.7031344590428831 | 0.2968655409571169 | 0.9294483900123862 | MID   |
| 10236  | true      | 585190752256 | 2200555278336 | MID     | 26.593      | 11137      | 0.7031344590428831 | 0.2968655409571169 | 0.9408153643734463 | MID   |
| 67726  | true      | 315596603392 | 1127079792640 | MID     | 28.001      | 11244      | 0.7031344590428831 | 0.2968655409571169 | 0.9781173333725808 | MID   |
| 276075 | true      | 447697612800 | 1663688560640 | MID     | 26.910      | 17064      | 0.7031344590428831 | 0.2968655409571169 | 1.1023389135806465 | MID   |
| 93095  | true      | 451429429248 | 1127079792640 | MID     | 40.053      | 5277       | 0.7031344590428831 | 0.2968655409571169 | 1.1187729398148103 | HIGH  |
+--------+-----------+--------------+---------------+---------+-------------+------------+--------------------+--------------------+--------------------+-------+
5 rows in set (0.01 sec)

There are no records for HDD.
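To help read the `Score` and `Class` columns above, here is a small Python sketch that recomputes them from the other columns. The formula is inferred from the numbers in this output, not taken from the Doris source: each BE's used fraction is divided by the cluster-wide used fraction, its replica count is divided by the mean replica count, and the two ratios are weighted by `CapCoeff` and `ReplCoeff`. The 10% band used for `Class` assumes the FE setting `balance_load_score_threshold` is at its default of 0.1. Worked through by hand, the result agrees with the reported scores to within about 1e-4.

```python
# (BeId, UsedCapacity in bytes, Capacity in bytes, ReplicaNum, reported Score)
ROWS = [
    (10313, 278328541184, 1127337742336, 12498, 0.9294483900123862),
    (10236, 585190752256, 2200555278336, 11137, 0.9408153643734463),
    (67726, 315596603392, 1127079792640, 11244, 0.9781173333725808),
    (276075, 447697612800, 1663688560640, 17064, 1.1023389135806465),
    (93095, 451429429248, 1127079792640, 5277, 1.1187729398148103),
]
CAP_COEFF = 0.7031344590428831   # copied from the CapCoeff column
REPL_COEFF = 0.2968655409571169  # copied from the ReplCoeff column
THRESHOLD = 0.1                  # assumption: default balance_load_score_threshold


def load_scores(rows):
    """Recompute the per-BE score as {BeId: score} (formula inferred from the data)."""
    avg_used = sum(r[1] for r in rows) / sum(r[2] for r in rows)
    avg_repl = sum(r[3] for r in rows) / len(rows)
    return {
        be_id: CAP_COEFF * (used / cap) / avg_used + REPL_COEFF * repl / avg_repl
        for be_id, used, cap, repl, _ in rows
    }


def classify(scores, threshold=THRESHOLD):
    """Label each BE LOW / MID / HIGH relative to the mean score."""
    avg = sum(scores.values()) / len(scores)
    classes = {}
    for be_id, s in scores.items():
        if s > avg * (1 + threshold):
            classes[be_id] = "HIGH"
        elif s < avg * (1 - threshold):
            classes[be_id] = "LOW"
        else:
            classes[be_id] = "MID"
    return classes


if __name__ == "__main__":
    scores = load_scores(ROWS)
    classes = classify(scores)
    for be_id, _, _, repl, reported in ROWS:
        print(be_id, repl, round(scores[be_id], 4), round(reported, 4), classes[be_id])
```

Under these assumptions, BE 93095 is the only `HIGH` node even though it holds the fewest replicas (5277), because its disk usage of about 40% carries roughly 70% of the weight. This suggests the balancer is evening out this combined score, not tablet counts or CPU load.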

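I ran into a similar situation: three hosts with identical configuration, running only BE, with SSDs of the same model and size, yet disk usage is not balanced across the nodes. I later decommissioned the BE with the lowest usage and added a new BE, but the result was the same.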