The bot seems to be restarting or throwing errors a lot without reason.
We've had 4 unexplained restarts/network erros in 7 days.
Our normal uptime can be weeks unless we reboot it.
We should see if there's anything we can do.
The bot seems to be restarting or throwing errors a lot without reason.
We've had 4 unexplained restarts/network erros in 7 days.
Our normal uptime can be weeks unless we reboot it.
We should see if there's anything we can do.
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | RhinosF1 | T248960 Sopel is not recording last event correctly causing server timeout | |||
Declined | None | T250861 Raise timeout value for IRC Bots | |||
Declined | None | T255261 Extended Test Window: ZppixBot-Test |
We have:
{P10830}
2020-03-26 17:01:55 ⇐ +ZppixBot quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 256 seconds 17:02:33 → ZppixBot joined (~ZppixBot@miraheze/bot/Zppix)
2020-03-30 22:34:57 ⇐ +ZppixBot quit (~ZppixBot@miraheze/bot/Zppix) Remote host closed the connection 22:35:33 → ZppixBot joined (~ZppixBot@miraheze/bot/Zppix) 22:35:33 +ZppixBot was voiced (+v) by @ChanServ 23:03:18 ⇐ +ZppixBot-test quit (~ZppixBot@miraheze/bot/Zppix) Remote host closed the connection 23:10:14 → ZppixBot-test joined (~ZppixBot@miraheze/bot/Zppix) 23:10:14 +ZppixBot-test was voiced (+v) by @ChanServ
2020-04-01 06:53:31 ZppixBot-test has quit: Ping timeout: 265 seconds
2020-04-01 06:57:09 RhinosF1 joined channel
2020-04-01 07:00:00 ** ZppixBot-test joined channel
{P10852}
strace-ing:
root@tools-k8s-worker-54:/# strace -qq -fp 23 -s 1024 -t -e execve,read,write,send,recv,socket,connect,network [pid 23] 10:07:15 recvfrom(3, ":zhuyifei1999_!sid97661@wikimedia/zhuyifei1999 PART #ZppixBot\r\n", 65536, 0, NULL, NULL) = 63 [pid 23] 10:07:15 write(4, "2020-04-02 10:07:15,467 <<\t:zhuyifei1999_!sid97661@wikimedia/zhuyifei1999 PART #ZppixBot\n", 89) = 89 [pid 23] 10:07:20 recvfrom(3, ":zhuyifei1999_!sid97661@wikimedia/zhuyifei1999 JOIN #ZppixBot zhuyifei1999 :My name is hidden, dude\r\n", 65536, 0, NULL, NULL) = 101 [pid 23] 10:07:20 write(4, "2020-04-02 10:07:20,110 <<\t:zhuyifei1999_!sid97661@wikimedia/zhuyifei1999 JOIN #ZppixBot zhuyifei1999 :My name is hidden, dude\n", 127) = 127 [pid 23] 10:08:08 recvfrom(3, "PING :weber.freenode.net\r\n", 65536, 0, NULL, NULL) = 26 [pid 23] 10:08:08 write(4, "2020-04-02 10:08:08,988 <<\tPING :weber.freenode.net\n", 52) = 52 [pid 23] 10:08:08 sendto(3, "PONG weber.freenode.net\r\n", 25, 0, NULL, 0) = 25 [pid 23] 10:08:08 write(4, "2020-04-02 10:08:08,989 >>\tPONG weber.freenode.net\n", 51) = 51
Please ping me when it fails so I can look at the strace logs. This is to be 100% sure that the issue is indeed outside of the tool itself (kernel / network).
Mentioned in SAL (#wikimedia-cloud) [2020-04-02T10:11:35Z] <RhinosF1> approved zhuyifei1999_ to debug network issues - T248960
Mentioned in SAL (#wikimedia-cloud) [2020-04-02T10:17:56Z] <RhinosF1> strace requested on zppixbot tool as well - T248960
20:20:13 ⇐ +ZppixBot quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 250 seconds 20:20:36 → ZppixBot joined (~ZppixBot@miraheze/bot/Zppix) 20:20:36 +ZppixBot was voiced (+v) by @ChanServ 20:21:24 ⇐ +ZppixBot-test quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 265 seconds 20:22:03 → ZppixBot-test joined (~ZppixBot@miraheze/bot/Zppix) 20:22:03 +ZppixBot-test was voiced (+v) by @ChanServ 20:58:01 ⇐ +Voidwalker quit (~Voidwalke@wikipedia/The-Voidwalker) Quit: back later 22:15:32 ⇐ +ZppixBot quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 256 seconds 22:15:55 ⇐ +ZppixBot-test quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 250 seconds 22:15:56 → ZppixBot joined (~ZppixBot@miraheze/bot/Zppix) 22:15:56 +ZppixBot was voiced (+v) by @ChanServ 22:16:54 → ZppixBot-test joined (~ZppixBot@miraheze/bot/Zppix) 22:16:54 +ZppixBot-test was voiced (+v) by @ChanServ
All based on the raw logs show server timeouts so that's us not getting incoming traffic from Freenode and then assuming the connection dead (so freenode then kills us as we stop attempting to talk to it.
^ above may have been the openstack upgrade that seemed to have affected the network based on conversation with andrew
Happened again today, our timeout seems to be ~120s and might be worth upping to 200-300s to see if it mitigates the impact as we seem to be too suspectible to network interuptions.
Still happening and the timeout value just make it weirder when it crashes to auto-recover.
Caught a trace of zppixbot-test failing: P11192. Keeping it private because there are way too many lines for me to look through to see what need redacting
The test instance seems to be having a memory issue, I think I've fixed that by rebooting it but if not, I'll do a full reset.
Zppixbot's main instance seems to be more stable.
If the -test instance is stable for a few days from now, I will close this. If not, I'm nervous as there is no documentation on fixing this.
This is likely a race condition but the whole start/stop sequence is broken.
I've filled https://github.com/sopel-irc/sopel/issues/1868 and https://github.com/sopel-irc/sopel/issues/1865 to get it fixed.
There are issues with the logging & jobs/memory backend but they don't cause this and are syptomns of it (why we get bot.memory logspam)
The real cause is that _check_timeout isn't seeing the update last_event_at under some conditions / randomly.
It might be improved if _check_timeout waited for a bit after _send_ping finished but we aren't yet sure of the full cause.