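Doris version: 2.1.0
The cluster has three Doris FE nodes, and Prometheus is configured on all of them to scrape Doris metrics for cluster monitoring.
After running for a while, the FE Master node stops serving metrics (Prometheus can no longer scrape it), and the Doris Web UI also becomes unreachable. The FE process itself keeps running, though, and JDBC reads and writes still work normally.
The other two (non-Master) FE nodes can still be scraped normally.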
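What we know so far:
- No errors appear in the FE logs.
- The three FE nodes sit behind a load balancer, and the request counts across them are roughly equal.
- Monitoring shows that the thread count on the FE Master node keeps growing, and metrics collection starts failing once it reaches about 10,000 threads. We suspect this is what causes the problem.
Almost all of these threads are in the WAITING state, for example: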
"Thread-52171" #115023 daemon prio=5 os_prio=0 tid=0x00007f541499e000 nid=0x102485 waiting on condition [0x00007f5058fd0000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007f66c4007878> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47)
	at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
	at java.lang.Thread.run(Thread.java:750)

"Thread-52170" #115022 daemon prio=5 os_prio=0 tid=0x00007f541499c000 nid=0x102484 waiting on condition [0x00007f50597d1000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007f66c4007878> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47)
	at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
	at java.lang.Thread.run(Thread.java:750)

"Thread-52169" #115021 daemon prio=5 os_prio=0 tid=0x00007f541480b800 nid=0x102483 waiting on condition [0x00007f5059fd2000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007f66c4007878> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47)
	at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
	at java.lang.Thread.run(Thread.java:750)
The other two FE nodes do not show this surge in thread count.
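With ~10,000 threads, it can help to aggregate the full dump of the Master FE (for example, the output of `jstack <pid>` saved to a file) instead of reading it by hand. Below is a minimal sketch, not part of Doris: the class name `ThreadDumpSummary` and its grouping key are hypothetical, and it assumes the standard HotSpot jstack text format. It groups threads by state, by their entry frame (the deepest `at ...` frame other than `java.lang.Thread.run`), and by the object they are parked on. In the three threads shown above, all of them enter through `com.lmax.disruptor.WorkProcessor.run` and park on the same object `<0x00007f66c4007878>`. If most of the dump collapses into that one bucket, that would suggest the leaked threads are all workers waiting on the same Disruptor wait strategy, which narrows down where to look.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: summarizes a jstack-style thread dump.
public class ThreadDumpSummary {

    // Counts threads per "state | entry frame | parked-on object".
    // The entry frame is the deepest "at ..." frame other than java.lang.Thread.run,
    // which is usually the Runnable the thread was started with.
    public static Map<String, Integer> summarize(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        String state = null, entry = null, parkedOn = null;
        boolean inThread = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A quoted header line starts a new thread; record the previous one.
                if (inThread) {
                    counts.merge(key(state, entry, parkedOn), 1, Integer::sum);
                }
                inThread = true;
                state = "UNKNOWN";
                entry = "-";
                parkedOn = "-";
            } else if (!inThread) {
                continue; // skip the "Full thread dump ..." preamble
            } else if (line.startsWith("java.lang.Thread.State:")) {
                state = line.substring("java.lang.Thread.State:".length()).trim();
            } else if (line.startsWith("- parking to wait for")) {
                int open = line.indexOf('<');
                int close = line.indexOf('>', open + 1);
                if (open >= 0 && close > open) {
                    parkedOn = line.substring(open + 1, close);
                }
            } else if (line.startsWith("at ") && !line.startsWith("at java.lang.Thread.run(")) {
                // Frames are listed top-down, so the last match is the deepest frame.
                entry = line.substring(3);
            }
        }
        if (inThread) {
            counts.merge(key(state, entry, parkedOn), 1, Integer::sum);
        }
        return counts;
    }

    private static String key(String state, String entry, String parkedOn) {
        return state + " | " + entry + " | " + parkedOn;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ThreadDumpSummary <jstack-output.txt>");
            System.exit(1);
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        // Print the largest groups first.
        summarize(lines).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

Comparing the same summary for the Master and a non-Master FE would show whether the extra threads on the Master all come from that one group.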
Could someone take a look at what is causing this and whether there is a fix?