IRC disconnect/reconnect loops led to libera.chat ban on bot
Closed, Resolved (Public)

Description

time="2021-10-06T14:47:34Z" level=debug msg="":copper.libera.chat 900 wm-bb wm-bb!wm-bb@wikimedia/bot/wm-bridgebot wm-bridgebot :You are now logged in as wm-bridgebot"" func=handleOther file="bridge/irc/handlers.go:171" prefix=irc
time="2021-10-06T14:47:34Z" level=debug msg="":copper.libera.chat 903 wm-bb :SASL authentication successful"" func=handleOther file="bridge/irc/handlers.go:171" prefix=irc
time="2021-10-06T14:47:34Z" level=debug msg="":copper.libera.chat 465 wm-bb :You are banned from this server- Your client is repeatedly reconnecting. Please email bans@libera.chat when fixed. (2021/10/6 02.24)"" func=handleOther file="bridge/irc/handlers.go:171" prefix=irc

Event Timeline

From the point of view of the bot, what seems to happen is that a ping loss or other disconnect from libera.chat leads to a reconnect failure storm. I've never caught it in the act though, just seen the aftereffects.

Trying some new settings (https://github.com/42wim/matterbridge/wiki/Settings):

  • JoinDelay=1000 (was unset) -- wait 1000 ms (1 second) between channel joins
  • RejoinDelay=60 (was RejoinDelay=5) -- wait 60 seconds before attempting to rejoin a kicked channel
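
For reference, the two settings would sit in the IRC account section of matterbridge.toml roughly as sketched below. The account name `libera` and the `Server` value are assumptions for illustration and are not copied from the bot's real config.

```toml
# Sketch only: the account name and Server value are assumed,
# not taken from the bot's actual configuration.
[irc.libera]
Server="irc.libera.chat:6697"
UseTLS=true
# Wait 1000 ms between channel joins (was unset).
JoinDelay=1000
# Wait 60 seconds before rejoining a channel after a kick (was 5).
RejoinDelay=60
```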

Adding a BNC to sit between the bot and libera.chat is probably our best bet in the long term. I have reached a point of low confidence in the golang irc library used by matterbridge. This is probably in no small part a reflection of my lack of comfort (and joy) in debugging golang.

[15:02]  <  majavah> bd808: stupid hack: put a behaving bouncer between wm-bb and libera, that way the bridge bot is not the one controlling reconnects
[15:02]  <    bd808> majavah: that might be the best "fix" honestly
[15:02]  <AntiComposite> yup
[15:03]  <AntiComposite> works pretty well for the wm-bots

Email sent to bans@ asking for another chance to attach the bot before trying to set up a BNC.

Mentioned in SAL (#wikimedia-cloud) [2021-10-06T15:54:56Z] <wm-bot> <bd808> Shutting down bot until we hear back from libera.chat (T292640)

Mentioned in SAL (#wikimedia-cloud) [2021-10-06T16:00:12Z] <wm-bot> <bd808> Restarting to reconnect to irc after libera.chat lifted the account ban (T292640)

Mentioned in SAL (#wikimedia-cloud) [2021-10-06T23:49:19Z] <wm-bot> <bd808> Restarting to place bnc between bot and libera.chat IRC network (T292640)

Mentioned in SAL (#wikimedia-cloud) [2021-10-07T00:00:19Z] <wm-bot> <bd808> IRC directly connected again. Need to work on bnc setup some more. (T292640)

Mentioned in SAL (#wikimedia-cloud) [2021-10-07T00:16:36Z] <wm-bot> <bd808> Bot is now connected to libera.chat via a ZNC bouncer (T292640)

bd808 changed the task status from Open to In Progress. (Oct 7 2021, 12:18 AM)
bd808 claimed this task.
bd808 triaged this task as High priority.

So... the magic that I am now attempting is [matterbridge] -> [ZNC] -> [libera.chat]. I compiled ZNC from source inside a ruby27 container (it had all the -dev libraries I needed installed) and installed it to $HOME/.local. Then I created a Kubernetes deployment to run ZNC in a pod using the bullseye base image (no need for -dev libs once ZNC was compiled). Then I added a Service to the namespace so that the matterbridge pod can find and connect to the ZNC pod. Finally I changed the matterbridge config to connect to the ZNC server instead of directly to libera.chat.
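
The last step above, pointing matterbridge at ZNC, might look something like the fragment below. This is a hedged sketch, not the deployed config: the Service hostname `znc`, the port, the ZNC user and network names, and the password are all hypothetical placeholders. The `user/network:password` form is the usual way a client identifies itself to a ZNC bouncer.

```toml
# Sketch only: hostname, port, and credentials are hypothetical placeholders.
[irc.libera]
# In-cluster Service name and port for the ZNC pod (assumed values).
Server="znc:6667"
# ZNC server password in "user/network:password" form (placeholder values).
Password="bridgebot/libera:CHANGEME"
Nick="wm-bb"
```

With this arrangement ZNC holds the long-lived connection to libera.chat, so a matterbridge restart or reconnect only touches the bouncer.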

I'm going to leave this task open for a couple of days until I'm confident that this new pile of neat hacks will hold. I need to update https://wikitech.wikimedia.org/wiki/Tool:Bridgebot too so that I and others will be able to grok this mess later.