Diacritics from IRC are sometimes encoded incorrectly in telegram
Closed, Resolved · Public

Authored By

	Daimona
	Sep 20 2022, 1:54 PM

Description

The IRC channel #wikipedia-it-sysop is bridged to a Telegram channel by bridgebot. We noticed that some IRC messages containing diacritics are encoded incorrectly by the bot, so the output on Telegram is mangled. An example is "è" being shown as "Ã¨", so apparently the text is not being interpreted as UTF-8.
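
The "Ã¨" symptom is what appears when the two UTF-8 bytes of "è" are decoded with a single-byte Western encoding. A minimal Python sketch of that effect, assuming Latin-1 as the wrongly guessed encoding (the task does not confirm which encoding the bot actually picked):

```python
# Reproduce the mojibake: UTF-8 bytes decoded as a single-byte encoding.
# Latin-1 is an assumption; the encoding the bot really guessed is not known.
original = "è che"
raw = original.encode("utf-8")      # b'\xc3\xa8 che'
mangled = raw.decode("latin-1")     # each byte becomes one character
print(mangled)                      # Ã¨ che
```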

I'd be happy to tell you exactly what causes the encoding to break, but unfortunately there doesn't seem to be an easily identifiable pattern of breakage. At least it seems to be deterministic, that is, a given message is always either broken or correct.

Examples of broken messages, one per line:

Forse allora è il bot che incasina gli accenti
Forse è il bot che incasina gli accenti
è il bot che incasina gli accenti
è che
SÌ

Examples of correctly-encoded messages:

Forse allora è
è il bot c
è il bot
è e
Perché farà così, però, orsù!

Event Timeline

Daimona created this task. Sep 20 2022, 1:54 PM

Restricted Application added a subscriber: Aklapper. Sep 20 2022, 1:54 PM

We are not currently setting an explicit charset (which can be done with the Charset setting, see https://github.com/42wim/matterbridge/wiki/Settings#charset) and are instead letting matterbridge guess the inbound IRC message encoding for each message. This detection is done by the https://github.com/saintfish/chardet library.

I think the first thing we could try is adding a Charset="utf-8" setting, on the assumption that more often than not UTF-8 will be the correct encoding.
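
As a rough illustration of what "assume UTF-8" means, the sketch below decodes incoming bytes with a fixed charset and no detection step. This is not matterbridge's code: `decode_irc_line` is a hypothetical helper, and replacing invalid bytes with U+FFFD is an assumption about a reasonable failure mode.

```python
def decode_irc_line(raw: bytes, charset: str = "utf-8") -> str:
    """Decode an IRC line with a fixed charset instead of guessing one.

    Hypothetical helper for illustration only; invalid byte sequences
    are replaced with U+FFFD rather than raising an exception.
    """
    return raw.decode(charset, errors="replace")


print(decode_irc_line("è che".encode("utf-8")))  # è che
```

The trade-off is that a client that really does send a legacy encoding will now produce replacement characters instead of being guessed at per message.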

Restricted Application added a project: User-bd808. Sep 23 2022, 6:20 PM

Mentioned in SAL (#wikimedia-cloud) [2022-09-23T18:24:10Z] <wm-bot> <bd808> Assume that IRC messages are always encoded in UTF-8 (T318161)

[18:24]  <   wm-bot> !log tools.bridgebot <bd808> Assume that IRC messages are always encoded in UTF-8 (T318161)
[18:24]  < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[18:26]  <    bd808> (testing) T318161 said that "è che" was not being seen as utf-8 when crossing over to telegram. What happens now? 
[18:26]  < stashbot> T318161: Diacritics from IRC are sometimes encoded incorrectly in telegram - https://phabricator.wikimedia.org/T318161
[18:27]  <    wm-bb> <bd808> It looks ok in this context. (re @wmtelegram_bot: <bd808> (testing) T318161 said that "è che" was not being seen as utf-8 when crossing over to telegram. What happens now?)

@Daimona please do try out the new setting in your channel and reopen this if you continue to see mojibake issues.

Yup, it seems to be working now, thank you :)

It doesn't really matter at this point, but for the bug to happen, each example posted above needs to be the whole message. If you add or remove text, it's not guaranteed to remain broken. I guess it had to do with how the library that you linked would guess the encoding.
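
One way to see why a detector has to guess: the UTF-8 bytes of both a broken and a correct example are also valid in a legacy single-byte encoding, so byte validity alone cannot settle the question. A small sketch, using Windows-1252 as an assumed candidate encoding (the task does not say which one was chosen); `decodes_as` is a hypothetical helper:

```python
# Two whole messages from the description: the first was reported broken,
# the second was reported correct.
messages = ["è che", "è e"]


def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw is a valid byte sequence in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


for text in messages:
    raw = text.encode("utf-8")
    # Both encodings accept these bytes, so a detector must rely on statistics.
    print(repr(text), decodes_as(raw, "utf-8"), decodes_as(raw, "cp1252"))
```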
