Page MenuHomePhabricator

Language::hebrewNumeral() returns invalid UTF-8 when passed a multiple of 1000
Closed, ResolvedPublic1 Estimated Story Points

Description

It's using bytewise string-manipulation functions and apparently assuming every character in the string is 2 bytes. But multiples of 1000 include U+0027 in the string generated to that point, which is only one byte.

Wikitext such as {{#time:xhxjj xjx xhxjY|4239-10-12}} will produce such a call, which may result in oddly truncated or empty output from the parser.

Event Timeline

Maniphest changed the visibility from "Public (No Login Required)" to "Custom Policy".Apr 28 2015, 5:51 PM
Maniphest changed the edit policy from "All Users" to "Custom Policy".
eranroz updated the task description. (Show Details)
eranroz changed Security from None to Software security bug.
eranroz edited subscribers, added: eranroz; removed: Aklapper.

Nothing in particular to do with the parser function. The problem is that that particular #time invocation is calling Language::hebrewNumeral( 8000 ), which returns a string containing invalid UTF-8. Then later on this invalid UTF-8 is causing something to reject the parser output.

It should be easy enough to fix for someone who understands what exactly the existing code is supposed to be doing; I'd guess changing all the strlen and substr calls to mb_strlen and mb_substr plus changing all the constant 2 to 1 would probably do it, but that's just guessing.

Anomie renamed this task from Invalid #time can kill the parser to Language::hebrewNumeral() returns invalid UTF-8 when passed a multiple of 1000.Apr 28 2015, 6:47 PM
Anomie updated the task description. (Show Details)
Anomie edited projects, added Language-Team, I18n; removed ParserFunctions.

I'm going to leave this as a security bug for the moment, as I'm not sure that this couldn't somehow be exploited to produce hard-to-detect or hard-to-fix vandalism.

I spoke with Amir about this. Using mb_substr should be sufficient to fix the issue (strlen is probably not needed at all). If fixing this is not urgent, we can probably work on this during next LangEng sprint to write unit tests and clean up the code.

An amusing bug :)

The first couple of things that are definitely needed are

  1. writing unit tests for hebrewNumeral() - looks like there aren't any.
  2. Refresh its code a bit, to fix == if nothing else.

As for the actual solution, I can think of several ways:

  1. Replace the usage of <'> and <"> with <׳> and <״>. It's more elegant, more correct for Hebrew and doesn't have the single-byte problem, because they are in the same Unicode block as Hebrew letters, but it also has a little downside - people may have trouble searching for them using Ctrl-F. @eranroz, how bad do you think it is? I don't think that it's actually used much in article content, but I might be wrong.
  2. Find a way to get rid of strlen in this function entirely. I suspect that it's possible.
  3. Use mb_strlen.

@Amire80 -I like the idea of ׳ and ״ it is a bit more "correct" to use those characters, and I'm not aware to use in article content.

If someone can manage to Ctrl-F search for "א'רל"ד", surely they can also type "א׳רל״ד" to search for that?

If someone can manage to Ctrl-F search for "א'רל"ד", surely they can also type "א׳רל״ד" to search for that?

Actually, no, because many keyboards don't include "״" and "׳".

In any case, I did in https://gerrit.wikimedia.org/r/#/c/212730/ without these characters. Maybe it can be changed later.

Is this resolved? If so, can it be made a public task?

Up to @csteipp, I guess. I don't mind its being public. (It's not deployed to production, if it's important.)

csteipp changed the visibility from "Custom Policy" to "Public (No Login Required)".
csteipp changed the edit policy from "Custom Policy" to "All Users".
csteipp changed Security from Software security bug to None.

I'm pretty sure there are no security implications here.