Doris 2.1.0: Metrics requests to the FE Master node time out


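Doris version: 2.1.0

The cluster has three Doris FE nodes, and Prometheus is configured to scrape Metrics from each of them for cluster monitoring.

After running for a while, the FE Master node can no longer be scraped for Metrics. At that point the Doris Web UI is also unreachable, but the FE process is still running normally and JDBC reads and writes work fine.

Metrics can still be scraped normally from the other two non-Master FE nodes.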

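What I know so far:

  1. No errors were found in the FE logs.
  2. The three FE nodes are load balanced, and the request counts do not differ much between them.
  3. Monitoring shows that the thread count on the FE Master node keeps growing. Once it reaches about 10,000, Metrics can no longer be scraped. I suspect this is what causes the failure.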
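For anyone who wants to track the thread growth from item 3 without a full monitoring stack, here is a minimal sketch. It assumes a Linux host, where `/proc/<pid>/status` has a `Threads:` field; the function names are made up for illustration, and the FE process ID has to be supplied by you.

```python
import sys


def parse_thread_count(status_text):
    """Extract the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' field found")


def read_thread_count(pid):
    """Read the current thread count of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_thread_count(f.read())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python thread_count.py <fe_pid>
    print(read_thread_count(int(sys.argv[1])))
```

Running it periodically (for example from cron) on each FE node and comparing the numbers should show whether only the Master is growing.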

The thread dump shows that almost all of these threads are in the WAITING state:

"Thread-52171" #115023 daemon prio=5 os_prio=0 tid=0x00007f541499e000 nid=0x102485 waiting on condition [0x00007f5058fd0000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f66c4007878> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47)
        at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
        at java.lang.Thread.run(Thread.java:750)

"Thread-52170" #115022 daemon prio=5 os_prio=0 tid=0x00007f541499c000 nid=0x102484 waiting on condition [0x00007f50597d1000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f66c4007878> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47)
        at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
        at java.lang.Thread.run(Thread.java:750)

"Thread-52169" #115021 daemon prio=5 os_prio=0 tid=0x00007f541480b800 nid=0x102483 waiting on condition [0x00007f5059fd2000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f66c4007878> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:47)
        at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
        at java.lang.Thread.run(Thread.java:750)

The other two FE nodes show no such surge in thread count.

Could someone take a look at what the problem is and whether there is a fix?

2 Answers

What exactly is the symptom here? Is it that Metrics cannot be scraped, so the monitoring data is incomplete?

The symptom is that the FE Master's HTTP interface becomes unavailable some time after startup: both Metrics scraping and Web UI access time out.
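To pin down when the HTTP interface stops answering, a small probe with an explicit timeout can be run against each FE node. This is only a sketch: the URL in the usage comment assumes the FE serves Metrics at `/metrics` on its HTTP port (8030 unless `http_port` was changed), and `fe-master-host` is a placeholder, not a name from this cluster.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout_s=5.0):
    """Issue one GET and return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()  # a hung server can also stall while sending the body
            return True, time.monotonic() - start, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, time.monotonic() - start, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, connection refused, and socket timeouts.
        return False, time.monotonic() - start, repr(exc)


# Example usage (placeholder host, assumed default port):
# ok, elapsed, detail = probe("http://fe-master-host:8030/metrics", timeout_s=10)
# print(ok, round(elapsed, 2), detail)
```

Logging the result together with the thread count at the same moment would show whether the timeouts really begin near the 10,000-thread mark.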

The cause is most likely a large volume of INSERT INTO ... VALUES(...) requests.

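The metrics show that the FE Master node generates image files frequently, and the frequency of image generation follows the same curve as the growth in thread count.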
