Spring Boot reports an error when querying Doris
ERROR [http-nio-10020-exec-12] [http-nio-10020-exec-12raceId] [] [5] @@GlobalExceptionAdvice@@ | server error
org.springframework.dao.RecoverableDataAccessException:
### Error querying database. Cause: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 426 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.
; Communications link failure
The last packet successfully received from the server was 426 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.; nested exception is com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 426 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.
A scheduled insert into select task on Doris also reports an error
ERROR 2013 (HY000) at line 7: Lost connection to MySQL server during query
Analysis
Slow queries may be putting heavy pressure on the cluster.
Several slow queries run for 120s to 400s, which the Doris cluster cannot sustain. Since the global query_timeout parameter is 60, presumably someone set the session variable for those tasks to 600s or higher.
Ask the developers to take the slow query tasks offline and tune the SQL.
After the slow query tasks running longer than 100 seconds were taken offline, everything worked normally.
But after a while, the Spring Boot service raised alarms and the errors came back.
Doris parameters
interactive_timeout=3880000
wait_timeout=3880000
Doris FE node warning log
2021-06-03 16:00:08,398 WARN (Connect-Scheduler-Check-Timer-0|79) [ConnectContext.checkTimeout():365] kill wait timeout connection, remote: 1.1.1.1:57399, wait timeout: 3880000
2021-06-03 16:00:08,398 WARN (Connect-Scheduler-Check-Timer-0|79) [ConnectContext.kill():339] kill timeout query, 1.1.1.1.1:57399, kill connection: true
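To count how many connections the FE killed and with which timeout, the warning lines can be scanned with a short script. This is a minimal sketch, not a Doris tool: the regular expression is written against the single log line shown above, and the function name parse_wait_timeout_kills is hypothetical.

```python
import re

# Pattern derived only from the FE warning line quoted in this post.
# Other Doris versions may format the message differently (assumption).
KILL_PATTERN = re.compile(
    r"kill wait timeout connection, remote: (?P<remote>[^,\s]+), "
    r"wait timeout: (?P<timeout>\d+)"
)


def parse_wait_timeout_kills(lines):
    """Return (remote_address, wait_timeout_seconds) for each matching line."""
    results = []
    for line in lines:
        match = KILL_PATTERN.search(line)
        if match:
            results.append((match.group("remote"), int(match.group("timeout"))))
    return results


sample = [
    "2021-06-03 16:00:08,398 WARN (Connect-Scheduler-Check-Timer-0|79) "
    "[ConnectContext.checkTimeout():365] kill wait timeout connection, "
    "remote: 1.1.1.1:57399, wait timeout: 3880000",
    "2021-06-03 16:00:08,398 WARN (Connect-Scheduler-Check-Timer-0|79) "
    "[ConnectContext.kill():339] kill timeout query, 1.1.1.1.1:57399, "
    "kill connection: true",
]

kills = parse_wait_timeout_kills(sample)
print(kills)  # [('1.1.1.1:57399', 3880000)]
```

Only the first sample line matches, because the second one reports the kill itself and carries no timeout value.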
Doris monitoring
The number of connections drops sharply at 15:44.
ELK log
The alarms and error messages from the Spring Boot service querying Doris also start at 15:44.
So what operation or variable change affected the cluster at 15:44?
According to the error log, wait_timeout is 3880000s, which is about 44.9 days, but the default in the source code is 28800s (8 hours).
interactive_timeout=3880000
wait_timeout=3880000
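As a quick sanity check on those numbers, the two timeout values can be converted from seconds into days and hours. A minimal sketch; the helper name describe_seconds is made up for illustration.

```python
def describe_seconds(seconds):
    """Split a duration in seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return days, hours, minutes, secs


modified = describe_seconds(3880000)  # value found on the cluster
default = describe_seconds(28800)     # default cited from the source code

print(modified)  # (44, 21, 46, 40)
print(default)   # (0, 8, 0, 0)
```

So the modified value is just under 45 days, while the default is exactly 8 hours.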
Nothing had been released and no traffic had been switched, and the cluster administrator account was in my hands. I had not changed the parameters, so I could not tell why they had changed. I went to the fe.audit audit log to check the operation records. Sure enough, someone (an insider) using DataGrip 2020.2.3 ran set global at 15:44 and modified these parameters:
interactive_timeout=3880000
wait_timeout=3880000
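Searching the audit log for set global statements can be scripted. The sketch below is an assumption-laden illustration: the sample lines use an invented layout rather than the real fe.audit format, and find_set_global is a hypothetical helper that only looks for the statement text, wherever it appears in a line.

```python
import re

# Matches "set global <name> = <value>" anywhere in a line, case-insensitively.
# Only the first assignment of a statement is captured (simplification).
SET_GLOBAL = re.compile(r"\bset\s+global\s+(\w+)\s*=\s*([^\s;,]+)", re.IGNORECASE)


def find_set_global(lines):
    """Return (variable, value) pairs for lines containing a SET GLOBAL."""
    found = []
    for line in lines:
        match = SET_GLOBAL.search(line)
        if match:
            found.append((match.group(1).lower(), match.group(2)))
    return found


# Hypothetical audit lines; the real log layout will differ.
sample = [
    "2021-06-03 15:44:01 stmt=set global wait_timeout=3880000",
    "2021-06-03 15:44:02 stmt=SET GLOBAL interactive_timeout = 3880000",
    "2021-06-03 15:44:03 stmt=select 1",
]

changes = find_set_global(sample)
print(changes)
# [('wait_timeout', '3880000'), ('interactive_timeout', '3880000')]
```

Matching on the statement text alone keeps the script independent of the exact field layout of the log.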
After the two parameters were set back to 28800s, the cluster's connections recovered immediately.
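The rollback amounts to two set global statements. As a rough sketch, they can be generated from the observed values so that only parameters differing from the 28800s default are touched; build_reset_statements is a hypothetical helper, and the statements still have to be run manually through a MySQL client connected to the FE.

```python
DEFAULT_TIMEOUT_SECONDS = 28800  # default value cited in this post


def build_reset_statements(current, default=DEFAULT_TIMEOUT_SECONDS):
    """Build SET GLOBAL statements for variables that differ from the default."""
    statements = []
    for name in sorted(current):
        if current[name] != default:
            statements.append(f"SET GLOBAL {name} = {default};")
    return statements


# Values observed on the cluster after the accidental change.
current = {"interactive_timeout": 3880000, "wait_timeout": 3880000}

statements = build_reset_statements(current)
for statement in statements:
    print(statement)
# SET GLOBAL interactive_timeout = 28800;
# SET GLOBAL wait_timeout = 28800;
```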
Note that, according to discussion with the community, only wait_timeout takes effect in Doris; interactive_timeout exists only for MySQL compatibility and has no effect.
Question: why does an overly large wait_timeout in Doris cause the connection error Communications link failure, while reducing it restores normal behavior? Answering this requires going through the code and examining the logic.
Please see: Doris connection error Communications link failure