Doris 2.1.4版本Arrow flight SQL超时连接无法Kill,导致FE日志大量报错

Viewed 82

通过Arrow Flight SQL的查询如下,并且有执行时间超过10天的查询
image.png
image.png

arrow_flight_token_cache_size参数已手动修改为50
arrow_flight_token_alive_time还是默认3天没有修改
image.png

复现代码如下:

import adbc_driver_manager
import adbc_driver_flightsql.dbapi as flight_sql

if __name__ == '__main__':
    conn = flight_sql.connect(uri='grpc://localhost:9090', db_kwargs={
        adbc_driver_manager.DatabaseOptions.USERNAME.value: "xxx",
        adbc_driver_manager.DatabaseOptions.PASSWORD.value: "xxx"
    })
    cursor = conn.cursor()
    cursor.execute("SELECT 1;")
    print(cursor.fetchallarrow().to_pandas())
    cursor.close()
    conn.close()

FE日志出现大量报错

2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [Coordinator.cancel():1496] Query cd6353f35d20406a-a19aa1ded2683b12 already in abnormal status Status [errorCode=CANCELLED, errorMsg=cancelled], but received cancel again,so that send cancel to BE again
java.lang.Exception: cancel failed
        at org.apache.doris.qe.Coordinator.cancel(Coordinator.java:1498) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.cancel(StmtExecutor.java:1422) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectContext.killByTimeout(ConnectContext.java:918) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectContext.checkTimeout(ConnectContext.java:962) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectScheduler$TimeoutChecker.run(ConnectScheduler.java:74) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_281]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_281]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_281]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_281]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_281]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_281]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_281]
2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [Coordinator.cancel():1503] Cancel execution of query cd6353f35d20406a-a19aa1ded2683b12, this is a outside invoke, cancelReason TIMEOUT
2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [ResultReceiver.updateCancelReason():203] Query cd6353f35d20406a-a19aa1ded2683b12 already has cancel reason: TIMEOUT, new reason TIMEOUT will be ignored
2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=cd6353f35d20406a-a19aa1ded2683b1d, reason: TIMEOUT
2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14066, fragment instance id=cd6353f35d20406a-a19aa1ded2683b13, reason: TIMEOUT
2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14066, fragment instance id=cd6353f35d20406a-a19aa1ded2683b14, reason: TIMEOUT
2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14066, fragment instance id=cd6353f35d20406a-a19aa1ded2683b15, reason: TIMEOUT
2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14066, fragment instance id=cd6353f35d20406a-a19aa1ded2683b16, reason: TIMEOUT
2024-10-10 13:21:09,756 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14047, fragment instance id=cd6353f35d20406a-a19aa1ded2683b17, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14047, fragment instance id=cd6353f35d20406a-a19aa1ded2683b18, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14047, fragment instance id=cd6353f35d20406a-a19aa1ded2683b19, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=cd6353f35d20406a-a19aa1ded2683b1a, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=cd6353f35d20406a-a19aa1ded2683b1b, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=cd6353f35d20406a-a19aa1ded2683b1c, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [ConnectContext.checkTimeout():940] kill wait timeout connection, remote: 0.0.0.0:0, wait timeout: 60
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [ConnectContext.killByTimeout():906] kill query from 0.0.0.0:0, kill mysql connection: true reason time out
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator.cancel():1496] Query 365e6365d24146ac-a9370f09a8e207da already in abnormal status Status [errorCode=CANCELLED, errorMsg=cancelled], but received cancel again,so that send cancel to BE again
java.lang.Exception: cancel failed
        at org.apache.doris.qe.Coordinator.cancel(Coordinator.java:1498) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.cancel(StmtExecutor.java:1422) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectContext.killByTimeout(ConnectContext.java:918) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectContext.checkTimeout(ConnectContext.java:962) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectScheduler$TimeoutChecker.run(ConnectScheduler.java:74) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_281]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_281]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_281]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_281]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_281]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_281]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_281]
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator.cancel():1503] Cancel execution of query 365e6365d24146ac-a9370f09a8e207da, this is a outside invoke, cancelReason TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [ResultReceiver.updateCancelReason():203] Query 365e6365d24146ac-a9370f09a8e207da already has cancel reason: TIMEOUT, new reason TIMEOUT will be ignored
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=365e6365d24146ac-a9370f09a8e207e5, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14066, fragment instance id=365e6365d24146ac-a9370f09a8e207db, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14066, fragment instance id=365e6365d24146ac-a9370f09a8e207dc, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14066, fragment instance id=365e6365d24146ac-a9370f09a8e207dd, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14047, fragment instance id=365e6365d24146ac-a9370f09a8e207de, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14047, fragment instance id=365e6365d24146ac-a9370f09a8e207df, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 14047, fragment instance id=365e6365d24146ac-a9370f09a8e207e0, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=365e6365d24146ac-a9370f09a8e207e1, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=365e6365d24146ac-a9370f09a8e207e2, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=365e6365d24146ac-a9370f09a8e207e3, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator$BackendExecState.cancelFragmentInstance():3194] cancelRemoteFragments initiated=true done=false backend: 64996, fragment instance id=365e6365d24146ac-a9370f09a8e207e4, reason: TIMEOUT
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [ConnectContext.checkTimeout():940] kill wait timeout connection, remote: 0.0.0.0:0, wait timeout: 60
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [ConnectContext.killByTimeout():906] kill query from 0.0.0.0:0, kill mysql connection: true reason time out
2024-10-10 13:21:09,757 WARN (connect-scheduler-check-timer-0|186) [Coordinator.cancel():1496] Query bb6e0d4d1d1045bd-b3106f6406995c6b already in abnormal status Status [errorCode=CANCELLED, errorMsg=cancelled], but received cancel again,so that send cancel to BE again
java.lang.Exception: cancel failed
        at org.apache.doris.qe.Coordinator.cancel(Coordinator.java:1498) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.cancel(StmtExecutor.java:1422) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectContext.killByTimeout(ConnectContext.java:918) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectContext.checkTimeout(ConnectContext.java:962) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectScheduler$TimeoutChecker.run(ConnectScheduler.java:74) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_281]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_281]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_281]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_281]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_281]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_281]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_281]

尝试KILL QUERY connectionId; 无法成功
image.png

2 Answers

image.png
image.png
KILL (CONNECTION) 而不是 KILL QUERY,可以正确 cancel 掉 connection,错误日志是合理的,因为此时这个链接没用在跑query,所以 kill query 失败

image.png
https://doris.apache.org/zh-CN/docs/dev/db-connect/arrow-flight-sql-connect/
目前 arrow flight 的连接,只会在连接数超限或者超时后被删除,也可以用上面的方法手动删除。

链接能否正常超时后被删除,我这会再确认下

这个问题之前修复过,建议升级到2.1.6看看