Page MenuHomePhabricator

switchdc SAL log entries are getting cut off because long lines are being split over IRC
Open, Needs TriagePublic

Description

From today's switchover:

07:36:44 <+logmsgbot> !log jayme@cumin1001 Switching services kartotherian, proton, wdqs-internal, wikifeeds, zotero, recommendation-api, swift-ro, linkrecommendation, mobileapps, citoid, eventgate-analytics, push-notifications, eventstreams-internal, mathoid, similar-users, schema, apertium, restbase-async, shellbox, termbox, wdqs, ores, eventgate-analytics-external, swift, helm-charts, restbase, cxserver, search, sessionstore, eventstreams, api-gate
07:36:44 <+logmsgbot> ore, eventgate-logging-external: eqiad => codfw

Ideally something would split at some line limit and prefix the second line with !log as well.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I actually would disagree that is something that spicerack should do. Doesn't sound more logic to do it at logmsgbot level?
It will fix also the case of a human !log-ging a long line.

Legoktm renamed this task from cookbook/spicerack needs to wrap long lines when posting SAL messages to IRC to switchdc SAL log entries are getting cut off because long lines are being split over IRC.Jun 30 2021, 7:09 PM
Legoktm updated the task description. (Show Details)

I actually would disagree that is something that spicerack should do. Doesn't sound more logic to do it at logmsgbot level?

Yeah, that seems reasonable. I adjusted the bug summary so it describes the problem rather than suggesting a solution. Not sure what Phab project logmsgbot falls under.

So, logmsgbot (that is, tcpircbot) thinks it's already doing this, and that's why the single long message from the cookbook turned into two IRC messages: modules/tcpircbot/files/tcpircbot.py#54

BUFSIZE = 460  # Read from socket in IRC-message-sized chunks.

and then that's used as the buffer size for each recv() call on the socket. So we need to lower the limit a bit. The IRC line length limit (512 bytes for the entire PRIVMSG line, not just the message text) is fixed in stone in the RFC, but maybe something changed in the prefix before the PRIVMSG command, e.g. in the switch to Libera -- it probably doesn't really matter exactly why 460 isn't the right number anymore. For whatever reason, 460 is too long, so the first of the two IRC messages got truncated.

For the example in the task description, the original message read "...eventstreams, api-gateway, eventgate-main, echostore, eventgate-logging-external..." We're missing 27 bytes ("way, eventgate-main, echost") in between the two lines that got through, so we could lower BUFSIZE to 460 - 27 = 433 -- or a bit lower, if we wanted to leave some extra safety margin.

But there's a problem with that. It would ensure we don't lose bytes in between IRC lines like this, but note that only the first line actually has a !log command -- so stashbot ignores the second line and it never makes it into the SAL at all, only IRC. And logmsgbot isn't the right place to fix that, because it's not tcpircbot that adds the !log in the first place -- it's just reciting exactly the message it received over TCP. SALSocketHandler in wmflib (wmflib/irc.py) is where the initial !log comes from.

So we should adjust logmsgbot's BUFSIZE, but I think we also need to add a length limit to SALSocketHandler. If the log record is too long, that handler should split it into multiple messages to logmsgbot, each with !log. (And maybe the first gets !log user@host ... but the rest get !log (continued) ... or something.) That way we get two appropriately-sized lines in IRC, and also two messages in the SAL.

@Volans What do you think?

(Side note:

It will fix also the case of a human !log-ging a long line.

I don't think that's right -- the human's IRC client will still split or truncate the message, and there's nothing logmsgbot can do about that. We could get stashbot to buffer and combine messages sent within some time interval, but it would be tricky to get right, and I don't think it's the right solution to this. Some related discussion at T176880, which was about many short lines instead of one long one, but it's hard for the same reasons.)

Change 703787 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] tcpircbot: Reduce max length of an IRC message.

https://gerrit.wikimedia.org/r/703787

Change 703787 merged by RLazarus:

[operations/puppet@production] tcpircbot: Reduce max length of an IRC message.

https://gerrit.wikimedia.org/r/703787

Could we avoid IRC from the chain to store the SAL message?
Maybe make tcpircbot/logmsgbot go directly to stashbot/whatever records the entry in SAL? (and keep the irc message as a 'side effect')

That would allow recording SAL messages even if the bot/irc system is misbehaving and bypass the message size limitations (and open a bunch of new possibilities).

Could we avoid IRC from the chain to store the SAL message?

In principle, yes, we should definitely do that. But it's a longer conversation and I want to fix this bug before resolving that -- a "good is better than perfect" situation.

Change 720816 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/software/pywmflib@master] irc: Split long !log lines

https://gerrit.wikimedia.org/r/720816