Page MenuHomePhabricator

Sopel is not recording last event correctly causing server timeout
Closed, ResolvedPublic

Description

The bot seems to be restarting or throwing errors a lot without reason.

We've had 4 unexplained restarts/network erros in 7 days.

Our normal uptime can be weeks unless we reboot it.

We should see if there's anything we can do.

Event Timeline

RhinosF1 updated the task description. (Show Details)

We have:
{P10830}

2020-03-26
17:01:55 ⇐ +ZppixBot quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 256 seconds
17:02:33 → ZppixBot joined (~ZppixBot@miraheze/bot/Zppix)
2020-03-30
22:34:57 ⇐ +ZppixBot quit (~ZppixBot@miraheze/bot/Zppix) Remote host closed the connection
22:35:33 → ZppixBot joined (~ZppixBot@miraheze/bot/Zppix)
22:35:33 +ZppixBot was voiced (+v) by @ChanServ
23:03:18 ⇐ +ZppixBot-test quit (~ZppixBot@miraheze/bot/Zppix) Remote host closed the connection
23:10:14 → ZppixBot-test joined (~ZppixBot@miraheze/bot/Zppix)
23:10:14 +ZppixBot-test was voiced (+v) by @ChanServ

2020-04-01 06:53:31 ZppixBot-test has quit: Ping timeout: 265 seconds
2020-04-01 06:57:09
RhinosF1 joined channel
2020-04-01 07:00:00 ** ZppixBot-test joined channel

{P10852}

strace-ing:

root@tools-k8s-worker-54:/# strace -qq -fp 23 -s 1024 -t -e execve,read,write,send,recv,socket,connect,network
[pid    23] 10:07:15 recvfrom(3, ":zhuyifei1999_!sid97661@wikimedia/zhuyifei1999 PART #ZppixBot\r\n", 65536, 0, NULL, NULL) = 63
[pid    23] 10:07:15 write(4, "2020-04-02 10:07:15,467 <<\t:zhuyifei1999_!sid97661@wikimedia/zhuyifei1999 PART #ZppixBot\n", 89) = 89
[pid    23] 10:07:20 recvfrom(3, ":zhuyifei1999_!sid97661@wikimedia/zhuyifei1999 JOIN #ZppixBot zhuyifei1999 :My name is hidden, dude\r\n", 65536, 0, NULL, NULL) = 101
[pid    23] 10:07:20 write(4, "2020-04-02 10:07:20,110 <<\t:zhuyifei1999_!sid97661@wikimedia/zhuyifei1999 JOIN #ZppixBot zhuyifei1999 :My name is hidden, dude\n", 127) = 127
[pid    23] 10:08:08 recvfrom(3, "PING :weber.freenode.net\r\n", 65536, 0, NULL, NULL) = 26
[pid    23] 10:08:08 write(4, "2020-04-02 10:08:08,988 <<\tPING :weber.freenode.net\n", 52) = 52
[pid    23] 10:08:08 sendto(3, "PONG weber.freenode.net\r\n", 25, 0, NULL, 0) = 25
[pid    23] 10:08:08 write(4, "2020-04-02 10:08:08,989 >>\tPONG weber.freenode.net\n", 51) = 51

Please ping me when it fails so I can look at the strace logs. This is to be 100% sure that the issue is indeed outside of the tool itself (kernel / network).

Mentioned in SAL (#wikimedia-cloud) [2020-04-02T10:11:35Z] <RhinosF1> approved zhuyifei1999_ to debug network issues - T248960

Mentioned in SAL (#wikimedia-cloud) [2020-04-02T10:17:56Z] <RhinosF1> strace requested on zppixbot tool as well - T248960

Trace stopped, alive for 10 and 11 days respectively

We apparently had on 14th March as well. Which is now the first. No new though.

20:20:13 ⇐ +ZppixBot quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 250 seconds
20:20:36 → ZppixBot joined (~ZppixBot@miraheze/bot/Zppix)
20:20:36 +ZppixBot was voiced (+v) by @ChanServ
20:21:24 ⇐ +ZppixBot-test quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 265 seconds
20:22:03 → ZppixBot-test joined (~ZppixBot@miraheze/bot/Zppix)
20:22:03 +ZppixBot-test was voiced (+v) by @ChanServ
20:58:01 ⇐ +Voidwalker quit (~Voidwalke@wikipedia/The-Voidwalker) Quit: back later
22:15:32 ⇐ +ZppixBot quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 256 seconds
22:15:55 ⇐ +ZppixBot-test quit (~ZppixBot@miraheze/bot/Zppix) Ping timeout: 250 seconds
22:15:56 → ZppixBot joined (~ZppixBot@miraheze/bot/Zppix)
22:15:56 +ZppixBot was voiced (+v) by @ChanServ
22:16:54 → ZppixBot-test joined (~ZppixBot@miraheze/bot/Zppix)
22:16:54 +ZppixBot-test was voiced (+v) by @ChanServ

All based on the raw logs show server timeouts so that's us not getting incoming traffic from Freenode and then assuming the connection dead (so freenode then kills us as we stop attempting to talk to it.

^ above may have been the openstack upgrade that seemed to have affected the network based on conversation with andrew

RhinosF1 lowered the priority of this task from High to Medium.Apr 21 2020, 8:08 PM

Happened again today, our timeout seems to be ~120s and might be worth upping to 200-300s to see if it mitigates the impact as we seem to be too suspectible to network interuptions.

RhinosF1 changed the task status from Open to Stalled.Apr 21 2020, 8:53 PM

Stalling on deploying mitigations to network interuption as described in subtask.

RhinosF1 changed the task status from Stalled to Open.May 11 2020, 5:30 PM

Still happening and the timeout value just make it weirder when it crashes to auto-recover.

Caught a trace of zppixbot-test failing: P11192. Keeping it private because there are way too many lines for me to look through to see what need redacting

The test instance seems to be having a memory issue, I think I've fixed that by rebooting it but if not, I'll do a full reset.

Zppixbot's main instance seems to be more stable.

If the -test instance is stable for a few days from now, I will close this. If not, I'm nervous as there is no documentation on fixing this.

RhinosF1 renamed this task from Debug random ZppixBot(-test) restarts to ZppixBot-test restarting due to memory not saving ping/pong - not loaded?.May 20 2020, 10:23 AM

This is likely a race condition but the whole start/stop sequence is broken.

I've filled https://github.com/sopel-irc/sopel/issues/1868 and https://github.com/sopel-irc/sopel/issues/1865 to get it fixed.

RhinosF1 changed the task status from Open to Stalled.May 21 2020, 8:43 AM
RhinosF1 lowered the priority of this task from Medium to Lowest.
RhinosF1 added a project: Upstream.
RhinosF1 renamed this task from ZppixBot-test restarting due to memory not saving ping/pong - not loaded? to Multiple race conditions causing ZppixBot(-test) to disconnect and poorly restart.May 21 2020, 8:49 AM
RhinosF1 renamed this task from Multiple race conditions causing ZppixBot(-test) to disconnect and poorly restart to Sopel does not recording last event correctly causing server timeout.Jun 12 2020, 1:30 PM

There are issues with the logging & jobs/memory backend but they don't cause this and are syptomns of it (why we get bot.memory logspam)

The real cause is that _check_timeout isn't seeing the update last_event_at under some conditions / randomly.

It might be improved if _check_timeout waited for a bit after _send_ping finished but we aren't yet sure of the full cause.

RhinosF1 renamed this task from Sopel does not recording last event correctly causing server timeout to Sopel is not recording last event correctly causing server timeout.Jun 12 2020, 1:35 PM

I may have found the cause of this and fixed it. Monitoring for the next week.

RhinosF1 raised the priority of this task from Lowest to Medium.