Page MenuHomePhabricator

Bridgebot gets stuck in bad state when kicked from IRC channels
Open, HighPublic

Description

[16:11]  <   Waggie> bd808: I have Freenode staff asking about wm-bb and some sort of flood/reconnect issue..
[16:12]  <    bd808> Waggie: ugh. Did they say what channel?
[16:12]  <   Waggie> Well, flood/reconnect would be any channel that it's in.
[16:12]  <    bd808> let me see if the bot's logs tell me anything
[16:14]  <   wm-bot> !log tools.bridgebot <bd808> Restarting. IRC join failures
[16:14]  < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[16:15]  wm-bb (~wm-bb@wikimedia/bot/wm-bridgebot) joined the channel
[16:15]  <    bd808> Waggie: trying a full restart to see if that fixes the bot. I think it gets into strange states when it netsplits
[16:15]  <    bd808> looks like it at least joined properly here
[16:15]  <    bd808> Waggie: bot's logs look good now. Thank you for the poke
[16:17]  <   Waggie> Ok, thanks
time="2020-09-30T12:43:42Z" level=debug msg="<= Sending JOIN_LEAVE event from irc.bridgebot to gateway" func=handleJoinPart file="bridge/irc/handlers.go:98" prefix=irc
time="2020-09-30T12:43:42Z" level=debug msg="<= Message is config.Message{Text:"wm-bb quits", Channel:"excess flood", Username:"system", UserID:"", Avatar:"", Account:"irc.bridgebot", Event:"join_leave", Protocol:"", Gateway:"", ParentID:"", Timestamp:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, ID:"", Extra:map[string][]interface {}(nil)}" func=handleJoinPart file="bridge/irc/handlers.go:100" prefix=irc
time="2020-09-30T12:43:42Z" level=error msg="disconnect: error: EOF" func=doConnect file="bridge/irc/irc.go:184" prefix=irc
time="2020-09-30T12:43:42Z" level=info msg="reconnecting in 30 seconds..." func=doConnect file="bridge/irc/irc.go:192" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikimedia-cloud (ID: #wikimedia-cloudirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-abstract (ID: #wikipedia-abstractirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-it-sysop (ID: #wikipedia-it-sysopirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wmhack (ID: #wmhackirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining ##stashbot (ID: ##stashbotirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikimedia-cloud (ID: #wikimedia-cloudirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-abstract (ID: #wikipedia-abstractirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-it-sysop (ID: #wikipedia-it-sysopirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wmhack (ID: #wmhackirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining ##stashbot (ID: ##stashbotirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining ##stashbot (ID: ##stashbotirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikimedia-cloud (ID: #wikimedia-cloudirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-abstract (ID: #wikipedia-abstractirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-it-sysop (ID: #wikipedia-it-sysopirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wmhack (ID: #wmhackirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikimedia-cloud (ID: #wikimedia-cloudirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-abstract (ID: #wikipedia-abstractirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-it-sysop (ID: #wikipedia-it-sysopirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wmhack (ID: #wmhackirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining ##stashbot (ID: ##stashbotirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining ##stashbot (ID: ##stashbotirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikimedia-cloud (ID: #wikimedia-cloudirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-abstract (ID: #wikipedia-abstractirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wikipedia-it-sysop (ID: #wikipedia-it-sysopirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=info msg="irc.bridgebot: joining #wmhack (ID: #wmhackirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:17Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T12:44:22Z" level=debug msg="<= Sending JOIN_LEAVE event from irc.bridgebot to gateway" func=handleJoinPart file="bridge/irc/handlers.go:98" prefix=irc
time="2020-09-30T12:44:22Z" level=debug msg="<= Message is config.Message{Text:"wm-bb quits", Channel:"excess flood", Username:"system", UserID:"", Avatar:"", Account:"irc.bridgebot", Event:"join_leave", Protocol:"", Gateway:"", ParentID:"", Timestamp:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, ID:"", Extra:map[string][]interface {}(nil)}" func=handleJoinPart file="bridge/irc/handlers.go:100" prefix=irc

Event Timeline

bd808 triaged this task as High priority.Sep 30 2020, 4:47 PM

This happened a week or so ago as well. There appear to be some tuneable parameters for the irc gateway that might be helpful:

The RejoinDelay for sure needs to be tuned. Freenode is never fond of immediate auto-rejoining.

Mentioned in SAL (#wikimedia-cloud) [2020-09-30T16:53:54Z] <wm-bot> <bd808> Upgrade to 1.18.3; Add RejoinDelay=5 for IRC channels (T264212)

bd808 lowered the priority of this task from High to Medium.

I'll leave this open for a while to see if the excess flood problem is worked around with the settings I added:

[irc.bridgebot]
# max messages to hold in queue for sending
MessageQueue=30
# delay between each message in milliseconds
MessageDelay=1500
# delay in seconds to rejoin a channel when kicked
RejoinDelay=5
bd808 moved this task from To Do to Volunteer time on the User-bd808 board.

Mentioned in SAL (#wikimedia-cloud) [2020-09-30T21:12:35Z] <wm-bot> <bd808> Another restart for irc flood crash (T264212)

bd808 raised the priority of this task from Medium to High.Sep 30 2020, 9:16 PM

The most recent irc reconnect crash looks to have started with:

time="2020-09-30T19:26:24Z" level=error msg="disconnect: error: read tcp 192.168.231.6:37976->149.56.134.238:6697: read: connection reset by peer" func=doConnect file="bridge/irc/irc.go:184" prefix=irc
time="2020-09-30T19:26:24Z" level=info msg="reconnecting in 30 seconds..." func=doConnect file="bridge/irc/irc.go:192" prefix=irc
2020/09/30 19:26:30 Post "https://api.telegram.org/bot608908347:AAGyDriC2_wylLOjKHBJJNc0DS8TcYyoTd4/getUpdates": read tcp 192.168.231.6:48246->149.154.167.220:443: read: connection reset by peer
2020/09/30 19:26:30 Failed to get updates, retrying in 3 seconds...
time="2020-09-30T19:27:04Z" level=info msg="irc.bridgebot: joining #wikimedia-cloud (ID: #wikimedia-cloudirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T19:27:04Z" level=info msg="irc.bridgebot: joining #wikipedia-abstract (ID: #wikipedia-abstractirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T19:27:04Z" level=info msg="irc.bridgebot: joining #wikipedia-it-sysop (ID: #wikipedia-it-sysopirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T19:27:04Z" level=info msg="irc.bridgebot: joining #wmhack (ID: #wmhackirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T19:27:04Z" level=info msg="irc.bridgebot: joining ##stashbot (ID: ##stashbotirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc
time="2020-09-30T19:27:04Z" level=info msg="irc.bridgebot: joining #wikimedia-cloud (ID: #wikimedia-cloudirc.bridgebot)" func=joinChannels file="bridge/bridge.go:77" prefix=irc

The IRC join messages repeat quite a few times and eventually:

time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:04Z" level=debug msg="empty channel" func=handleMessage file="gateway/handlers.go:213" prefix=gateway
time="2020-09-30T19:27:09Z" level=debug msg="<= Sending JOIN_LEAVE event from irc.bridgebot to gateway" func=handleJoinPart file="bridge/irc/handlers.go:98" prefix=irc
time="2020-09-30T19:27:09Z" level=debug msg="<= Message is config.Message{Text:"wm-bb_ quits", Channel:"excess flood", Username:"system", UserID:"", Avatar:"", Account:"irc.bridgebot", Event:"join_leave", Protocol:"", Gateway:"", ParentID:"", Timestamp:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, ID:"", Extra:map[string][]interface {}(nil)}" func=handleJoinPart file="bridge/irc/handlers.go:100" prefix=irc
time="2020-09-30T19:27:09Z" level=error msg="disconnect: error: EOF" func=doConnect file="bridge/irc/irc.go:184" prefix=irc
time="2020-09-30T19:27:09Z" level=info msg="reconnecting in 30 seconds..." func=doConnect file="bridge/irc/irc.go:192" prefix=irc

My current hunch is that the disconnects are "normal" and caused by network connectivity interruptions. I think the real problem is the rate at which the matterbridge software tries to reconnect to Freenode. I haven't found a join rate setting for matterbidge yet, so I may need to report that as a missing feature upstream.

bd808 renamed this task from Bridgebot gets stuck in bad state when kicked from Freenode to Bridgebot gets stuck in bad state when kicked from IRC channels.Jul 7 2021, 2:52 PM