Doris alert: Image Write failed

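1. Monitoring shows the Image Write failed count is growing quickly (screenshot 1720149497243.jpg).
2. We have 10 Routine Load import jobs. Over the past week, half of them have shown OtherMsg entries:
2024-06-29 08:57:17:errCode = 2, detailMessage = failed to send task: Socket is closed by peer.
2024-07-04 18:00:08:errCode = 2, detailMessage = failed to send task: java.net.SocketException: Broken pipe (Write failed)
Help wanted:
1. Image Write failed is close to 500. What causes it, what is the impact, and how do I fix it?
2. What causes the OtherMsg entries on the routine load jobs, and how do I fix them?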

3 Answers

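On these questions:

  1. Image Write failed: look at this together with the FE JVM memory monitoring. The log says "the memory used percent 72 exceed the checkpoint memory threshold: 70". When memory usage exceeds 70% of the JVM heap, the FE skips the checkpoint, so no image is generated and the write is counted as failed. You therefore need to check whether the FE has a memory leak. If you run a lot of load jobs, try the following adjustments:
    1. Check whether profile is enabled. If it is, disable it globally.

    2. With many load jobs, labels may be piling up. You can shorten the label retention time (14400 seconds is 4 hours):
      fe.conf
      label_keep_max_second = 14400
      streaming_label_keep_max_second = 14400

    3. Change the FE JVM GC algorithm to G1
      Switch from the CMS collector to the G1 collector: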

JAVA_OPTS="-Djavax.security.auth.useSubjectCredsOnly=false -Xss4m -Xmx8192m -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:$DORIS_HOME/log/fe.gc.log.$CUR_DATE -Dlog4j2.formatMsgNoLookups=true"

Keep -Xmx the same as your current setting. Also watch the CUR_DATE environment variable; in older versions it may be called DATE.

2. The routine load error: this usually means the connection to the BE could not be established or there was network jitter. The task should be retried.
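To make the 70% rule from point 1 concrete, here is a minimal sketch of the arithmetic behind that log line. It is not Doris source code: the function names are made up for illustration, and the rule "skip the checkpoint when the integer percent is above the threshold" is an assumption read off the log message. With -Xmx8192m, the 70% line sits at roughly 5734 MB of used heap.

```python
MIB = 1024 * 1024

def used_percent(used_bytes: int, max_bytes: int) -> int:
    """Heap usage as a whole-number percent (hypothetical helper)."""
    return used_bytes * 100 // max_bytes

def checkpoint_allowed(used_bytes: int, max_bytes: int,
                       threshold_percent: int = 70) -> bool:
    """Assumed rule: the checkpoint is skipped once usage exceeds the threshold."""
    return used_percent(used_bytes, max_bytes) <= threshold_percent

heap_max = 8192 * MIB  # matches -Xmx8192m in the JAVA_OPTS above

for used_mib in (5600, 5900):
    pct = used_percent(used_mib * MIB, heap_max)
    ok = checkpoint_allowed(used_mib * MIB, heap_max)
    print(f"{used_mib} MiB used -> {pct}% -> checkpoint allowed: {ok}")
```

So a heap that stays only a few hundred MB above the line is enough to block every checkpoint, which is why reducing steady-state FE memory (profiles, labels) matters more than the GC change alone.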

Do you have the FE log from that point in time? We need to see whether there is a more detailed error message.

Over the weekend the Image Write failed count went up again. The WARN-level entries in fe.log look roughly like this:

2024-07-06 02:57:15,879 WARN (Routine load task scheduler|51) [RoutineLoadTaskScheduler.scheduleOneTask():223] failed to submit routine load task f9c9be259fb94810-a66e924c09639c8c to BE: 10072, error: errCode = 2, detailMessage = failed to send task: java.net.SocketException: Broken pipe (Write failed)

The warning above appears quite often.

2024-07-06 05:52:27,716 WARN (leaderCheckpointer|327) [Env.replayJournal():2591] replay journal cost too much time: 3693 replayedJournalId: 19941604

The warning above appeared once.

2024-07-06 09:55:29,522 WARN (leaderCheckpointer|327) [Checkpoint.checkMemoryEnoughToDoCheckpoint():327] the memory used percent 72 exceed the checkpoint memory threshold: 70

Whenever the Image Write failed count increases, this warning is always there: memory is over the 70% limit. Both Young GC and Old GC runs are frequent.

2024-07-06 16:30:09,418 WARN (mysql-nio-pool-26706|1148759) [ConnectProcessor.processOnce():843] Null packet received from network. remote: 10.188.100.31:38094
2024-07-06 16:30:09,418 WARN (mysql-nio-pool-26706|1148759) [ReadListener.lambda$handleEvent$0():60] Exception happened in one session(org.apache.doris.qe.ConnectContext@599ce481).
java.io.IOException: Error happened when receiving packet.
        at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:844) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_351]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_351]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_351]

This warning shows up as well.
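To check how closely the checkpoint memory warning tracks the times the Image Write failed counter goes up, the WARN lines can be tallied per hour and per code location. The following is a rough sketch, assuming fe.log lines follow the same layout as the excerpts in this post; the regex and the `count_warnings` helper are illustrative, not part of Doris.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time>,<ms> WARN (<thread>|<id>) [<Class.method>():<line>] <message>"
WARN_RE = re.compile(
    r"^(?P<hour>\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} WARN "
    r"\(.*?\) \[(?P<where>.+?)\(\):\d+\]"
)

def count_warnings(lines):
    """Count WARN entries keyed by (hour, Class.method); other lines are ignored."""
    counts = Counter()
    for line in lines:
        m = WARN_RE.match(line)
        if m:
            counts[(m.group("hour"), m.group("where"))] += 1
    return counts

# Sample lines copied from the excerpts in this post.
sample = [
    "2024-07-06 02:57:15,879 WARN (Routine load task scheduler|51) [RoutineLoadTaskScheduler.scheduleOneTask():223] failed to submit routine load task f9c9be259fb94810-a66e924c09639c8c to BE: 10072, error: errCode = 2, detailMessage = failed to send task: java.net.SocketException: Broken pipe (Write failed)",
    "2024-07-06 05:52:27,716 WARN (leaderCheckpointer|327) [Env.replayJournal():2591] replay journal cost too much time: 3693 replayedJournalId: 19941604",
    "2024-07-06 09:55:29,522 WARN (leaderCheckpointer|327) [Checkpoint.checkMemoryEnoughToDoCheckpoint():327] the memory used percent 72 exceed the checkpoint memory threshold: 70",
    "2024-07-06 16:30:09,418 WARN (mysql-nio-pool-26706|1148759) [ConnectProcessor.processOnce():843] Null packet received from network. remote: 10.188.100.31:38094",
    "java.io.IOException: Error happened when receiving packet.",
]

counts = count_warnings(sample)
for (hour, where), n in sorted(counts.items()):
    print(hour, where, n)
```

On a real log, replace `sample` with `open("fe.log", encoding="utf-8")` (the path is a placeholder) and compare the hours where `Checkpoint.checkMemoryEnoughToDoCheckpoint` shows up against the monitoring graph.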