
wikibugs having a hard time staying connected to libera.chat IRC network
Closed, Resolved · Public · Bug report

Description

Starting around 2024-02-08 (when I first noticed it), the redis2irc component of wikibugs has been having difficulty staying connected to the libera.chat IRC network.

Excerpts from the tool's $HOME/redis2irc.out log file related to this issue look like:

Tue 06 Feb 2024 06:43:24 AM UTC -- Starting redis2irc
Tue 06 Feb 2024 12:06:18 PM UTC -- Starting redis2irc
Tue 06 Feb 2024 01:01:00 PM UTC -- Starting redis2irc
Wed 07 Feb 2024 04:23:57 PM UTC -- Starting redis2irc
Wed 07 Feb 2024 06:02:28 PM UTC -- Starting redis2irc
Thu 08 Feb 2024 10:29:28 PM UTC -- Starting redis2irc
Thu 08 Feb 2024 10:37:06 PM UTC -- Starting redis2irc
Thu 08 Feb 2024 10:37:12 PM UTC -- Starting redis2irc
Fri 09 Feb 2024 10:18:26 AM UTC -- Starting redis2irc
Sat 10 Feb 2024 05:14:37 PM UTC -- Starting redis2irc
Sun 11 Feb 2024 10:29:42 AM UTC -- Starting redis2irc
Sun 11 Feb 2024 09:16:13 PM UTC -- Starting redis2irc
Mon 12 Feb 2024 11:05:06 PM UTC -- Starting redis2irc
Mon 12 Feb 2024 11:24:05 PM UTC -- Starting redis2irc
Wed 14 Feb 2024 02:11:13 PM UTC -- Starting redis2irc
Thu 15 Feb 2024 10:49:10 AM UTC -- Starting redis2irc
Thu 15 Feb 2024 11:28:55 AM UTC -- Starting redis2irc

In the $HOME/redis2irc.log for 2024-02-15, we can see events like:

2024-02-15 10:49:11,272 - wikibugs.wb2-irc - INFO - Started
...snip...
2024-02-15 10:49:11,434 - irc3.wikibugs - DEBUG - Starting wikibugs...
2024-02-15 10:49:11,992 - irc3.wikibugs - DEBUG - Connected
2024-02-15 10:49:11,993 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2024-02-15 10:49:15,279 - irc3.wikibugs - DEBUG - Trying to regain nickname in 30s...
2024-02-15 10:49:52,295 - irc3.wikibugs - CRITICAL - connection lost (140684783865568): None
2024-02-15 10:49:52,295 - irc3.wikibugs - CRITICAL - closing old transport (140684783865568)
2024-02-15 10:49:54,298 - irc3.wikibugs - DEBUG - Starting wikibugs...
2024-02-15 10:49:54,899 - irc3.wikibugs - DEBUG - Connected
2024-02-15 10:49:54,899 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2024-02-15 10:50:25,884 - irc3.wikibugs - CRITICAL - connection lost (140684783862592): None
2024-02-15 10:50:25,884 - irc3.wikibugs - CRITICAL - closing old transport (140684783862592)
2024-02-15 10:50:27,892 - irc3.wikibugs - DEBUG - Starting wikibugs...
2024-02-15 10:50:28,415 - irc3.wikibugs - DEBUG - Connected
2024-02-15 10:50:28,416 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2024-02-15 10:51:16,311 - irc3.wikibugs - CRITICAL - connection lost (140684783579680): None
2024-02-15 10:51:16,311 - irc3.wikibugs - CRITICAL - closing old transport (140684783579680)
2024-02-15 10:51:18,314 - irc3.wikibugs - DEBUG - Starting wikibugs...
2024-02-15 10:51:18,904 - irc3.wikibugs - DEBUG - Connected
2024-02-15 10:51:18,905 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2024-02-15 10:51:57,964 - irc3.wikibugs - CRITICAL - connection lost (140684783579872): None
2024-02-15 10:51:57,964 - irc3.wikibugs - CRITICAL - closing old transport (140684783579872)
$ for f in redis2irc.log*; do echo -n $f " "; grep 'connection lost' $f | wc -l; done
redis2irc.log  51
redis2irc.log.2024-02-08  12
redis2irc.log.2024-02-09  0
redis2irc.log.2024-02-10  512
redis2irc.log.2024-02-11  1588
redis2irc.log.2024-02-12  29
redis2irc.log.2024-02-13  0
redis2irc.log.2024-02-14  0

Event Timeline

Adding @greg, @TheresNoTime, and @LucasWerkmeister for visibility. These folks have been kindly restarting the bot job recently.

These storms of lost connections to libera.chat are a bit confusing to me, as I have not seen similar issues with Stashbot, Tool-bridgebot, or Jouncebot in the same period. This could be random luck in which libera.chat nodes these various IRC bots are connected to, or it could point to some quirk of configuration or implementation in redis2irc.py that makes it more sensitive to some sort of network delay or interruption. It could also just be a bias in what I happen to notice.

The SULWatchers have had some connection problems lately as well, but not with the same frequency.

Is anything visible from other IRC clients? I.e., do they see a connection reset by peer, a forced close due to X, ...?

There could also be a server-specific component; from the SGE era I vaguely remember that wikibugs had trouble connecting to IRC from some specific hosts. E.g. too many IRC bots connecting from the same external IP and the server aggressively throttling.

Feb 06 07:06:09 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 06 08:00:59 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 07 10:49:59 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Ping timeout: 272 seconds)
Feb 08 00:21:46 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (*.net *.split)
Feb 08 13:00:54 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Read error: Connection reset by peer)
Feb 08 17:29:56 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 08 17:37:37 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 09 05:18:38 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 10 04:55:49 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Read error: Connection reset by peer)
Feb 10 12:14:50 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 12 16:16:00 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Read error: Connection reset by peer)
Feb 12 18:05:13 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 14 09:11:21 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 15 04:07:59 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Read error: Connection reset by peer)
Feb 15 05:49:23 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)
Feb 16 04:16:54 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Read error: Connection reset by peer)
Feb 16 06:16:41 ◀━━ Quits:	wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection)

(times UTC-5)

> There could also be a server-specific component; from the SGE era I vaguely remember that wikibugs had trouble connecting to IRC from some specific hosts. E.g. too many IRC bots connecting from the same external IP and the server aggressively throttling.

We dropped per-exec-node public IPs several years ago, so libera.chat sees every IRC bot in Toolforge as coming from the same IP. There was a brief experiment with adding ident proxies, but the additional Kubernetes network translation layer makes that pretty close to impossible, even if we had been able to make it work as hoped on the old grid engine nodes. T278584: Promote use of SASL for Cloud VPS/Toolforge hosted Libera.chat / Freenode IRC bots is something that I would like to see done eventually to make individual bots easier for ircops to target when they misbehave. As far as I can tell, redis2irc.py is already using SASL auth itself.
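For context, wiring SASL into an irc3-based bot is roughly the following. This is a minimal sketch, not the tool's actual configuration: I'm assuming irc3's bundled sasl plugin and its sasl_username/sasl_password settings, and the nick, channel, and credentials are placeholders.

```
import irc3

config = dict(
    nick="wikibugs",
    host="irc.libera.chat",
    port=6697,
    ssl=True,
    includes=[
        "irc3.plugins.core",
        "irc3.plugins.sasl",  # performs the SASL handshake before registration
    ],
    # NickServ account credentials used for SASL PLAIN (placeholders)
    sasl_username="wikibugs",
    sasl_password="not-the-real-password",
    autojoins=["#example-channel"],
)

if __name__ == "__main__":
    bot = irc3.IrcBot.from_config(config)
    bot.run(forever=True)
```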

bd808 added a subscriber: Legoktm.

Claiming this as a signal that I'm actively looking at the code and the logs to see if there is anything that I can do to make things more stable. @Legoktm accepted my recent offer to become a co-maintainer in this project, so this is at least partially my problem now. Until @valhallasw gets annoyed by me and kicks me out anyway. ;)

> Until @valhallasw gets annoyed by me and kicks me out anyway. ;)

I, for one, welcome our new unicorn overlords.

The list from @AntiCompositeNumber seems a lot more limited than the logs from the bot - that suggests the connection dies even before the bot has a chance to join a channel (consistent with the

2024-02-15 10:49:54,899 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2024-02-15 10:50:25,884 - irc3.wikibugs - CRITICAL - connection lost (140684783862592): None

logs, which suggest the connection may be lost before the full login cycle has completed?)

Is there any way to get a tcpdump for the bot? There's obviously no root access in the container, but maybe Kubernetes has some magic for it. That would let us see whether a random RST is somehow happening (and from which direction).

> Is there any way to get a tcpdump for the bot? There's obviously no root access in the container, but maybe Kubernetes has some magic for it. That would let us see whether a random RST is somehow happening (and from which direction).

It would be technically possible using Toolforge root super powers and the magic of nsenter. Tutorials like https://www.digitalocean.com/community/tutorials/how-to-inspect-kubernetes-networking and https://www.suse.com/support/kb/doc/?id=000021060 talk about how to get started with those types of investigations.
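A rough sketch of what that could look like with admin access on the worker node hosting the pod is below. The namespace, pod lookup, container ID, and paths are all assumptions; on a containerd-based node the container PID typically comes from crictl inspect.

```
#!/usr/bin/env python3
"""Sketch: capture the wikibugs pod's IRC traffic from its Kubernetes worker node.

Assumes root on a containerd-based node with crictl, nsenter, and tcpdump
installed, and that the pod's container ID has already been looked up (for
example with `kubectl -n tool-wikibugs get pod ... -o jsonpath=...`).
All names here are placeholders.
"""
import json
import subprocess

CONTAINER_ID = "0123456789ab"  # placeholder container ID from kubectl/crictl

# Ask the container runtime for the container's init process PID on the node.
inspect = subprocess.run(
    ["crictl", "inspect", CONTAINER_ID],
    check=True, capture_output=True, text=True,
)
pid = json.loads(inspect.stdout)["info"]["pid"]

# Run tcpdump inside that process's network namespace, watching the usual
# IRC ports, and write a capture file to pull off the node afterwards.
subprocess.run(
    ["nsenter", "--target", str(pid), "--net",
     "tcpdump", "-ni", "any", "port", "6667", "or", "port", "6697",
     "-w", "/tmp/wikibugs-irc.pcap"],
    check=True,
)
```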

Mentioned in SAL (#wikimedia-cloud) [2024-03-02T19:20:09Z] <wmbot~bd808@tools-sgebastion-11> Deployed znc and service (T357729)

> Mentioned in SAL (#wikimedia-cloud) [2024-03-02T19:20:09Z] <wmbot~bd808@tools-sgebastion-11> Deployed znc and service (T357729)

I rolled this back for the moment. I'm having a hard time debugging the Python IRC bot with or without znc in the mix.

The bot is now running with a znc instance between it and libera.chat. The hope here is that znc has more robust algorithms for detecting connection issues and reacting by attempting to reconnect. Many other small things have changed as well with the deployment of T357851: bd808's big pile of refactoring ideas. Let's see how things settle in over the next few days.
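From the bot's point of view the change is mostly a matter of pointing the IRC connection at the local znc listener instead of libera.chat and letting znc own the upstream connection. Something roughly like the following; this is a sketch only, where the hostname, port, and credentials are made-up placeholders, and I'm assuming irc3 forwards a configured password as the server PASS, which is how znc expects clients to authenticate.

```
# Hypothetical irc3 connection settings aimed at a local znc bouncer rather
# than libera.chat directly. All names, ports, and credentials here are
# placeholders, not the real deployment values.
config = dict(
    nick="wikibugs",
    host="znc.wikibugs.svc.cluster.local",  # assumed in-cluster service name
    port=6697,
    ssl=True,
    # znc convention: "<znc user>/<network>:<znc password>" sent as the server password
    password="wikibugs/libera:not-the-real-password",
    includes=["irc3.plugins.core"],
)
```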

bd808 changed the task status from Open to Stalled. Mar 4 2024, 12:11 AM

Marking as stalled while we wait to see how the bot+znc combo behaves. I'm setting a calendar reminder for myself to check back on 2024-03-11 (DST switch hangover day for those of us here in the US).

A random observation: some IRC notifications are notably delayed. I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008513 at 09:15 CET, and the IRC notification was not sent to #wikimedia-operations until 10:18 CET.

> A random observation: some IRC notifications are notably delayed. I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008513 at 09:15 CET, and the IRC notification was not sent to #wikimedia-operations until 10:18 CET.

That delay was most likely related to T359097: Frequent "Redis listener crashed; restarting in a few seconds." errors logged. Let's keep our eyes open, however, to see whether long delays recur now that, in theory, that particular bug is fixed.

The "connection lost" log event metric is looking much better, but with znc as a proxy I would have expected this no matter what.

$ for f in irc.log*; do echo -n $f " "; grep 'connection lost' $f | wc -l; done
irc.log  0
irc.log.2024-03-03  1
irc.log.2024-03-04  0
irc.log.2024-03-05  0

In the tools.wikibugs-testing deployment, the problem I am seeing now is the IRC bot losing its connection to Redis rather than to znc & libera.chat. I am starting to look at reusing the AsyncRedisQueue I added as part of T359096: Bot does not detect when ssh connection to Gerrit is interrupted in the IRC bot. I have been able to demonstrate experimentally that AsyncRedisQueue reconnects to Redis robustly via local testing: I shut down the Redis instance, let the bot thrash for a minute trying to reconnect, and then brought the instance back up, which restored the connection. The IRC bot has logic that tries to accomplish the same thing, but it seems to be less robust.
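For reference, the reconnect pattern in question is roughly the following. This is a minimal sketch of the retry-on-failure idea using redis-py's asyncio client, not the actual AsyncRedisQueue code; the queue name, URL, and delays are made up.

```
import asyncio
import logging

from redis import asyncio as aioredis  # redis-py >= 4.2
from redis.exceptions import ConnectionError as RedisConnectionError

log = logging.getLogger("redis2irc.sketch")


async def listen(url: str = "redis://localhost:6379", queue: str = "wikibugs-events"):
    """Consume items from a Redis list, reconnecting whenever the link drops."""
    while True:
        client = aioredis.Redis.from_url(url)
        try:
            while True:
                # BLPOP blocks until an item arrives (or 30s passes), so a dead
                # connection surfaces as an exception instead of silently hanging.
                item = await client.blpop(queue, timeout=30)
                if item is not None:
                    _key, payload = item
                    log.info("got event: %r", payload)
        except (RedisConnectionError, OSError) as exc:
            log.warning("Redis connection lost (%s); retrying in a few seconds", exc)
            await asyncio.sleep(5)
        finally:
            await client.aclose()  # redis-py 5.x; on 4.x this is .close()


# asyncio.run(listen())
```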

Let's try calling this "fixed", or at least remediated, for the moment. We didn't have strong criteria for what "better" looks like other than user satisfaction:

[17:04]  <    bd808> greg-g: y'all are my canary for wikibugs health. Has it felt more stable to you over the last week? There are other things I still want to improve, but I'm trying to guess if T357729 can be closed yet or not.
[17:04]  < stashbot> T357729: wikibugs having a hard time staying connected to libera.chat IRC network - https://phabricator.wikimedia.org/T357729
[17:10]  <   greg-g> bd808: yeah I think so, I haven't noticed any problems in the last week at least

Reopen if things go back to being crappy in the near term. If you are thinking about reopening this months after 2024-03-11, consider filing a new task instead and referencing this one if there is a strong correlation with something discussed here previously.