
EventStreams butcher up some Unicode characters
Open, MediumPublic

Description

@Iluvatar mentioned to me today that he is using EventStreams and that some Cyrillic characters (usually one per message) are received incorrectly:
https://ru.wikipedia.org/?diff=93764185

{{User:IluvatarBot/Подозрительный источник|Миру — �ир! (скульптура)|93761090|93764184| livejournal.com|Ksenya1|1530789469}}

This is eerily similar to a problem I had for months with the XmlRcs library (for a Discord bot, so I am going to quote its messages), in which some symbols also frequently disappear in exactly the same way:

(разн.) . . (+966) . . QBA-II-bot (обсуждение | вклад) (Правки участника массово отменяются: special:contribs/83.219.133.32 - новый запро�)
(разн.) . . (+914) . . Рейму Хакур�й (обсуждение | вклад) (special:contribs/93.73.174.149 - новый запрос)
(разн.) . . (+918) . . Рейму Хакурей (обсуждение | вклад) (special:contribs/149.126.168.135 - новый запро�)
(разн.) . . (+967) . . QBA-II-bot (обсуждение | вклад) (Удаление тек�та со служебных страниц: special:contribs/109.63.182.26 - новый запрос)

Could the maintainers of EventStreams please investigate what might cause such problems with Unicode in different fields and libraries?

Event Timeline

stjn created this task. Jul 6 2018, 7:42 PM
Restricted Application added a subscriber: Aklapper. Jul 6 2018, 7:42 PM
SerDIDG added a subscriber: SerDIDG. Jul 6 2018, 8:18 PM
MBH added a subscriber: MBH. Jul 7 2018, 1:15 AM
fdans triaged this task as Medium priority. Jul 9 2018, 4:14 PM
fdans raised the priority of this task from Medium to High.
fdans lowered the priority of this task from High to Medium.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
fdans moved this task from Operational Excellence to Modern Event Platform on the Analytics board.

Hey sorry, just saw this!

Hm, if I understand correctly, the diff you linked to has the same incorrect characters: '�ир!'. Is that right?
If so, this is a problem with how the string was saved in MediaWiki, not EventStreams delivery. EventStreams is giving you the string as MediaWiki had it.

Please correct me if I am wrong. :)

Restricted Application added a subscriber: Liuxinyu970226. Thu, Nov 14, 8:50 PM
stjn added a comment. Edited Fri, Nov 15, 12:09 PM

In �ир the missing character was м, in запро� it was с, and in Хакур�й it was е. The article in the linked diff is called Миру — мир! (скульптура), but EventStreams emitted that title with the м missing. The same error occurred in other fields (Хакур�й is from a username, запро� is from a comment).

Today, in November 2019, I can’t really say if this still persists; maybe others can chip in. When I filed the task, this was much more frequent, so perhaps some unknown changes have fixed it since.
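
The substitutions listed above can be recovered mechanically by comparing a received string with the correct one. The following is a minimal Python sketch, assuming both strings have the same length, which holds when each damaged character arrives as exactly one U+FFFD, as in the samples in this task. `find_replaced` is a hypothetical helper name, not part of EventStreams or any library.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, shown as "�"


def find_replaced(received, reference):
    """Return (index, original_char) for each position where `received`
    has U+FFFD but `reference` has a real character.

    Assumes one-for-one replacement, so both strings must be equally long.
    """
    if len(received) != len(reference):
        raise ValueError("strings differ in length; cannot align them")
    return [
        (i, ref)
        for i, (got, ref) in enumerate(zip(received, reference))
        if got == REPLACEMENT and ref != REPLACEMENT
    ]


# The title from the linked diff, as received and as stored in MediaWiki.
print(find_replaced("Миру — \ufffdир! (скульптура)", "Миру — мир! (скульптура)"))
```

For the title above, this reports the single damaged position together with the м that should have been there.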

Hm, right, but what I mean is, the missing м looks like it is actually missing from the saved content:

curl -s 'https://ru.wikipedia.org/w/api.php?action=compare&fromrev=93764185&torelative=prev&format=json' | jq .
{
  "compare": {
    "*": "<tr>\n  <td colspan=\"2\" class=\"diff-lineno\">Строка 1:</td>\n  <td colspan=\"2\" class=\"diff-lineno\">Строка 1:</td>\n</tr>\n<tr>\n  <td colspan=\"2\" class=\"diff-empty\">&#160;</td>\n  <td class=\"diff-marker\">+</td>\n  <td class=\"diff-addedline\"><div>{{User:IluvatarBot/Подозрительный источник|Миру — �ир! (скульптура)|93761090|93764184| livejournal.com|Ksenya1|1530789469}}</div></td>\n</tr>\n<tr>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"><div>{{User:IluvatarBot/Подозрительный источник|Ярославское восстание|93668421|93763600| livejournal.com|Tpyvvikky|1530787434}}</div></td>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"><div>{{User:IluvatarBot/Подозрительный источник|Ярославское восстание|93668421|93763600| livejournal.com|Tpyvvikky|1530787434}}</div></td>\n</tr>\n<tr>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"><div>{{User:IluvatarBot/Подозрительный источник|60-й отдельный бронепоезд|93759489|93759498| forum|Семен Владимиров|1530768880}}</div></td>\n  <td class=\"diff-marker\">&#160;</td>\n  <td class=\"diff-context\"><div>{{User:IluvatarBot/Подозрительный источник|60-й отдельный бронепоезд|93759489|93759498| forum|Семен Владимиров|1530768880}}</div></td>\n</tr>\n\n<!-- diff cache key ruwiki:diff:wikidiff2:1.12:old-93763603:rev-93764185:1.9.0 -->\n",
    "fromid": 7404545,
    "fromrevid": 93763603,
    "fromns": 2,
    "fromtitle": "Участник:IluvatarBot/Badlinks/raport",
    "toid": 7404545,
    "torevid": 93764185,
    "tons": 2,
    "totitle": "Участник:IluvatarBot/Badlinks/raport"
  }
}

So EventStreams is giving you exactly what MediaWiki has. This wouldn't be a problem with EventStreams; it would have happened somewhere in MediaWiki.

stjn added a comment. Fri, Nov 15, 2:39 PM

I think you misunderstood the purpose of that diff. The bot in the diff consumes EventStreams data and posts the result. These omissions were present in the EventStreams data, not in MediaWiki. As further evidence, my Discord bot regularly encountered the exact same problem, despite not touching MediaWiki at all in the process.

Ah! I did misunderstand that, thank you. OK, I think I get it now.
So the bot consumed a revision-create event for revision 93764184 (right?), which is for a page whose title contains мир:

curl -s 'https://ru.wikipedia.org/w/api.php?action=compare&fromrev=93764184&torelative=prev&format=json' | jq .compare.totitle
"Миру — мир! (скульптура)"

but the м was missing when the bot got the title from EventStreams.
That does indeed sound like a problem with EventStreams.

I can’t really say if this still persists

Yeah, I'm not sure either. I just created a test page on test.wikipedia.org with these characters and got the correct ones in EventStreams, so I can't reproduce directly.
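
Because the corruption is intermittent and hard to reproduce on demand, a consumer could flag affected events as they arrive so that the raw payload can be attached to a report. The following is a minimal Python sketch. It assumes each event arrives as one JSON object and checks only top-level string fields. The field names in the sample (`title`, `user`, `comment`) are assumptions based on the fields mentioned in this task, and `corrupted_fields` is a hypothetical helper.

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def corrupted_fields(event):
    """Return the sorted names of top-level string fields containing U+FFFD."""
    return sorted(
        key
        for key, value in event.items()
        if isinstance(value, str) and REPLACEMENT in value
    )


# Hypothetical payload shaped like the examples quoted in this task.
sample = json.loads(
    '{"title": "Миру — \\ufffdир! (скульптура)",'
    ' "user": "Ksenya1", "comment": "", "timestamp": 1530789469}'
)
print(corrupted_fields(sample))
```

A field that legitimately contains U+FFFD in MediaWiki would also be flagged, so each hit still needs to be compared with the stored value.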

Iluvatar added a comment. Edited Fri, Nov 15, 5:10 PM

The bug was not tied to specific symbols. Periodically (not always!) EventStreams corrupted random characters in different scripts (not only Cyrillic). One time the error would be in "м"; another time "м" would be fine, but the error would be in "y" or "ф".

See another very old screenshot, in which "М" is fine: https://imgur.com/a/zQHCitr

The bug disappeared a few months after this task was submitted and no longer occurs.
Sorry for my English.
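
This thread never identifies a root cause. One well-known way to get sporadic damage that depends on position rather than on the character is decoding a UTF-8 byte stream chunk by chunk, so that a multi-byte character split across two chunks cannot be decoded. The Python sketch below illustrates that class of bug and the usual fix (an incremental decoder). It is an assumption for illustration, not a diagnosis of EventStreams or XmlRcs. Note also that naive per-chunk decoding produces two replacement characters per split character, whereas the reports above show one.

```python
import codecs


def decode_naive(chunks):
    # Decodes every chunk on its own; a character whose bytes straddle
    # two chunks is turned into U+FFFD replacement characters.
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_incremental(chunks):
    # Buffers an incomplete trailing byte sequence until the next chunk.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    return text + decoder.decode(b"", final=True)


data = "мир".encode("utf-8")
chunks = [data[:1], data[1:]]  # the boundary falls inside the two-byte "м"

print(repr(decode_naive(chunks)))
print(repr(decode_incremental(chunks)))
```

Where the chunk boundary falls depends on network timing and message length, which would be consistent with errors that hit a different character each time.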