Diacritics from IRC are sometimes encoded incorrectly in telegram
Closed, Resolved, Public

Description

The IRC channel #wikipedia-it-sysop is bridged to a Telegram channel with bridgebot. We noticed that messages containing diacritics sent from IRC are sometimes decoded incorrectly by the bot, and the output on Telegram is mangled. For example, "è" shows up as garbled characters instead of "è", so apparently the text is not being interpreted as UTF-8.

I'd be happy to tell you exactly what causes the encoding to break, but unfortunately there doesn't seem to be an easily identifiable pattern of breakage. At least it seems to be deterministic, that is, a given message is always either broken or correct.

Examples of broken messages, one per line:

Forse allora è il bot che incasina gli accenti
Forse è il bot che incasina gli accenti
è il bot che incasina gli accenti
è che
SÌ

Examples of correctly-encoded messages:

Forse allora è
è il bot c
è il bot
è e
Perché farà così, però, orsù!
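
The kind of mangling described above can be reproduced in a few lines of Python. This is only an illustrative sketch: it assumes the wrong guess was Latin-1, which this task does not confirm, and the helper names `mangle` and `repair` are made up for the example.

```python
# Illustration only. The "wrong" charset (Latin-1) is an assumption;
# the task does not say which encoding the bridge actually guessed.

def mangle(text: str, wrong_charset: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


def repair(mojibake: str, wrong_charset: str = "latin-1") -> str:
    """Undo mangle() by reversing the wrong decode and decoding as UTF-8."""
    return mojibake.encode(wrong_charset).decode("utf-8")


if __name__ == "__main__":
    original = "è che"
    broken = mangle(original)
    # Each accented letter becomes two unrelated characters.
    print(repr(broken))
    print(repr(repair(broken)))
```

The round trip only works because Latin-1 maps every byte to a character; with other wrong guesses the damage may not be reversible.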

Event Timeline

bd808 changed the task status from Open to In Progress. Sep 23 2022, 6:20 PM
bd808 claimed this task.
bd808 triaged this task as Medium priority.
bd808 moved this task from To Do to In Dev/Progress on the Tool-bridgebot board.
bd808 subscribed.

We are not currently setting an explicit charset (which can be done with the Charset setting, see https://github.com/42wim/matterbridge/wiki/Settings#charset) and are instead letting matterbridge guess the inbound IRC message encoding for each message. This detection is done by the https://github.com/saintfish/chardet library.

I think the first thing we could try is adding a Charset="utf-8" setting under the assumption that more often than not utf-8 will be the correct encoding.
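
To show why per-message guessing is fragile, here is a rough Python sketch. It is not matterbridge's Go code and not chardet's algorithm; the candidate list and the function names are assumptions made for illustration. The point is that the same bytes decode without error under several single-byte charsets, so a detector has to pick one statistically, whereas a fixed charset removes the guess.

```python
# Hypothetical candidate list, chosen only for this illustration.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "iso8859_2"]


def plausible_decodings(raw: bytes) -> dict:
    """Return every candidate charset under which raw decodes without error."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this charset, so skip it.
            pass
    return results


def decode_fixed(raw: bytes) -> str:
    """Always decode as UTF-8; invalid bytes become U+FFFD instead of a guess."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    raw = "è che".encode("utf-8")
    # Several charsets accept these bytes, but only one gives the intended text.
    for name, text in plausible_decodings(raw).items():
        print(f"{name:10} {text!r}")
    print(repr(decode_fixed(raw)))
```

The trade-off of the fixed approach is that a client really sending a legacy charset will produce replacement characters, which matches the stated assumption that UTF-8 is correct more often than not.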

Mentioned in SAL (#wikimedia-cloud) [2022-09-23T18:24:10Z] <wm-bot> <bd808> Assume that IRC messages are always encoded in UTF-8 (T318161)

[18:24]  <   wm-bot> !log tools.bridgebot <bd808> Assume that IRC messages are always encoded in UTF-8 (T318161)
[18:24]  < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[18:26]  <    bd808> (testing) T318161 said that "è che" was not being seen as utf-8 when crossing over to telegram. What happens now? 
[18:26]  < stashbot> T318161: Diacritics from IRC are sometimes encoded incorrectly in telegram - https://phabricator.wikimedia.org/T318161
[18:27]  <    wm-bb> <bd808> It looks ok in this context. (re @wmtelegram_bot: <bd808> (testing) T318161 said that "è che" was not being seen as utf-8 when crossing over to telegram. What happens now?)

@Daimona please do try out the new setting in your channel and reopen this if you continue to see mojibake issues.

Yup, it seems to be working now, thank you :)

It doesn't really matter at this point, but for the bug to happen, each example posted above needs to be the whole message. If you add or remove text, it's not guaranteed to remain broken. I guess it had to do with how the library that you linked would guess the encoding.