time="2021-10-06T14:47:34Z" level=debug msg="":copper.libera.chat 900 wm-bb wm-bb!wm-bb@wikimedia/bot/wm-bridgebot wm-bridgebot :You are now logged in as wm-bridgebot"" func=handleOther file="bridge/irc/handlers.go:171" prefix=irc
time="2021-10-06T14:47:34Z" level=debug msg="":copper.libera.chat 903 wm-bb :SASL authentication successful"" func=handleOther file="bridge/irc/handlers.go:171" prefix=irc
time="2021-10-06T14:47:34Z" level=debug msg="":copper.libera.chat 465 wm-bb :You are banned from this server- Your client is repeatedly reconnecting. Please email bans@libera.chat when fixed. (2021/10/6 02.24)"" func=handleOther file="bridge/irc/handlers.go:171" prefix=irc
Description
Related Objects
- Duplicates Merged Here
- T264212: Bridgebot gets stuck in bad state when kicked from IRC channels
Event Timeline
From the bot's point of view, what seems to happen is that a ping loss or other disconnect from libera.chat leads to a reconnect failure storm. I've never caught it in the act, though; I've only seen the aftereffects.
Trying some new settings (https://github.com/42wim/matterbridge/wiki/Settings):
- JoinDelay=1000 (was unset) -- wait 1000 ms (1 second) between channel joins
- RejoinDelay=60 (was RejoinDelay=5) -- wait 60 seconds before attempting to rejoin a kicked channel
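For anyone following along, the two settings above would sit in the IRC account section of matterbridge.toml roughly as follows. This is a sketch, not the bot's actual config: the section name `irc.libera` is an assumption, and every other key in the section is left out.

```toml
# Sketch only: the section name "irc.libera" is assumed for illustration.
[irc.libera]
# Wait 1000 ms (1 second) between channel joins (was unset).
JoinDelay=1000
# Wait 60 seconds before trying to rejoin a channel after a kick (was 5).
RejoinDelay=60
```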
Adding a BNC to sit between the bot and libera.chat is probably our best bet in the long term. I have reached a point of low confidence in the golang irc library used by matterbridge. This is probably in no small part a reflection of my lack of comfort (and joy) in debugging golang.
[15:02] < majavah> bd808: stupid hack: put a behaving bouncer between wm-bb and libera, that way the bridge bot is not the one controlling reconnects
[15:02] < bd808> majavah: that might be the best "fix" honestly
[15:02] <AntiComposite> yup
[15:03] <AntiComposite> works pretty well for the wm-bots
Email sent to bans@ asking for another chance to attach the bot before trying to set up a BNC.
Mentioned in SAL (#wikimedia-cloud) [2021-10-06T15:54:56Z] <wm-bot> <bd808> Shutting down bot until we hear back from libera.chat (T292640)
Mentioned in SAL (#wikimedia-cloud) [2021-10-06T16:00:12Z] <wm-bot> <bd808> Restarting to reconnect to irc after libera.chat lifted the account ban (T292640)
Mentioned in SAL (#wikimedia-cloud) [2021-10-06T23:49:19Z] <wm-bot> <bd808> Restarting to place bnc between bot and libera.chat IRC network (T292640)
Mentioned in SAL (#wikimedia-cloud) [2021-10-07T00:00:19Z] <wm-bot> <bd808> IRC directly connected again. Need to work on bnc setup some more. (T292640)
Mentioned in SAL (#wikimedia-cloud) [2021-10-07T00:16:36Z] <wm-bot> <bd808> Bot is now connected to libera.chat via a ZNC bouncer (T292640)
So... the magic that I am now attempting is [matterbridge] -> [ZNC] -> [libera.chat]. I compiled ZNC from source inside a ruby27 container (it had all the -dev libraries I needed installed) and installed it to $HOME/.local. Then I created a Kubernetes deployment to run ZNC in a pod using the bullseye base image (no need for -dev libs once ZNC was compiled). Then I added a Service to the namespace so that the matterbridge pod can find and connect to the ZNC pod. Finally I changed the matterbridge config to connect to the ZNC server instead of directly to libera.chat.
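The last step, pointing matterbridge at the bouncer, might look something like the fragment below. Treat it as a hedged sketch: the section name, the Service hostname `znc`, the port, and the ZNC user and network names are all placeholders rather than the values actually deployed. ZNC accepts a server password of the form `user/network:password`, which is what the `Password` line assumes. With a bouncer in the middle, the login to libera.chat would normally be handled by ZNC rather than by matterbridge.

```toml
# Sketch only: section name, host, port, and credentials are placeholders.
[irc.libera]
# Hypothetical Kubernetes Service name and port for the ZNC pod,
# used in place of the direct libera.chat address.
Server="znc:6667"
# ZNC login passed as the IRC server password ("user/network:password").
Password="zncuser/libera:CHANGE_ME"
Nick="wm-bb"
```

The point of this arrangement is that ZNC, not the bridge, owns the upstream connection, so a misbehaving reconnect loop in the bot only hits the bouncer inside the namespace.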
I'm going to leave this task open for a couple of days until I'm confident that this new pile of neat hacks will hold. I need to update https://wikitech.wikimedia.org/wiki/Tool:Bridgebot too so that I and others will be able to grok this mess later.
The ZNC setup has been documented on https://wikitech.wikimedia.org/wiki/Tool:Bridgebot