Language::hebrewNumeral() returns invalid UTF-8 when passed a multiple of 1000
Closed, ResolvedPublic1 Estimated Story Points
Actions

Assigned To

Authored By

	eranroz
	Apr 28 2015, 5:51 PM

Description

It's using bytewise string-manipulation functions and apparently assuming every character in the string is 2 bytes. But multiples of 1000 include U+0027 in the string generated to that point, which is only one byte.

Wikitext such as {{#time:xhxjj xjx xhxjY|4239-10-12}} will produce such a call, which may result in oddly truncated or empty output from the parser.

Details

	Subject	Repo	Branch	Lines +/-
	Rewrite Language::hebrewNumeral()	mediawiki/core	master	+60 -33
	Add tests for Language::hebrewNumeral()	mediawiki/core	master	+62 -0

Customize query in gerrit

Related Objects

Mentioned In: rMWc9678525eb0c: Rewrite Language::hebrewNumeral()
rMW399d00e7e7f1: Add tests for Language::hebrewNumeral()

Event Timeline

Maniphest changed the visibility from "Public (No Login Required)" to "Custom Policy".Apr 28 2015, 5:51 PM

Maniphest changed the edit policy from "All Users" to "Custom Policy".

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 28 2015, 5:51 PM

eranroz created this task.Apr 28 2015, 5:51 PM

eranroz updated the task description. (Show Details)

eranroz added projects: ParserFunctions, acl*security.

eranroz changed Security from None to Software security bug.

eranroz edited subscribers, added: eranroz; removed: Aklapper.

Nothing in particular to do with the parser function. The problem is that that particular #time invocation is calling Language::hebrewNumeral( 8000 ), which returns a string containing invalid UTF-8. Then later on this invalid UTF-8 is causing something to reject the parser output.

It should be easy enough to fix for someone who understands what exactly the existing code is supposed to be doing; I'd guess changing all the strlen and substr calls to mb_strlen and mb_substr plus changing all the constant 2 to 1 would probably do it, but that's just guessing.

I'm going to leave this as a security bug for the moment, as I'm not sure that this couldn't somehow be exploited to produce hard-to-detect or hard-to-fix vandalism.

• csteipp assigned this task to Nikerabbit.Apr 28 2015, 7:28 PM

Nikerabbit added a subscriber: Amire80.Apr 30 2015, 2:10 PM

I spoke with Amir about this. Using mb_substr should be sufficient to fix the issue (strlen is probably not needed at all). If fixing this is not urgent, we can probably work on this during next LangEng sprint to write unit tests and clean up the code.

An amusing bug :)

The first couple of things that are definitely needed are

writing unit tests for hebrewNumeral() - looks like there aren't any.
Refresh its code a bit, to fix == if nothing else.

As for the actual solution, I can think of several ways:

Replace the usage of <'> and <"> with <׳> and <״>. It's more elegant, more correct for Hebrew and doesn't have the single-byte problem, because they are in the same Unicode block as Hebrew letters, but it also has a little downside - people may have trouble searching for them using Ctrl-F. @eranroz, how bad do you think it is? I don't think that it's actually used much in article content, but I might be wrong.
Find a way to get rid of strlen in this function entirely. I suspect that it's possible.
Use mb_strlen.

@Amire80 -I like the idea of ׳ and ״ it is a bit more "correct" to use those characters, and I'm not aware to use in article content.

If someone can manage to Ctrl-F search for "א'רל"ד", surely they can also type "א׳רל״ד" to search for that?

• csteipp triaged this task as Low priority.May 21 2015, 9:57 PM

Amire80 added a project: LE-Sprint-86.May 21 2015, 9:58 PM

In T97444#1249233, @Anomie wrote:

If someone can manage to Ctrl-F search for "א'רל"ד", surely they can also type "א׳רל״ד" to search for that?

Actually, no, because many keyboards don't include "״" and "׳".

In any case, I did in https://gerrit.wikimedia.org/r/#/c/212730/ without these characters. Maybe it can be changed later.

Amire80 added a subscriber: Arrbee.May 25 2015, 9:36 AM

Arrbee edited projects, added LE-Sprint-87; removed LE-Sprint-86.May 25 2015, 9:48 AM

Amire80 moved this task from Backlog to In Review on the LE-Sprint-87 board.May 28 2015, 2:37 PM

Nikerabbit mentioned this in rMW399d00e7e7f1: Add tests for Language::hebrewNumeral().Jun 2 2015, 12:35 PM

Nikerabbit moved this task from In Review to Done on the LE-Sprint-87 board.Jun 2 2015, 1:22 PM

Nikerabbit mentioned this in rMWc9678525eb0c: Rewrite Language::hebrewNumeral().Jun 2 2015, 1:29 PM

Is this resolved? If so, can it be made a public task?

Up to @csteipp, I guess. I don't mind its being public. (It's not deployed to production, if it's important.)

I'm pretty sure there are no security implications here.

• santhosh edited a custom field.Jun 9 2015, 8:04 AM

Arrbee closed this task as Resolved.Jun 9 2015, 9:18 AM

Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:43 PM

Language::hebrewNumeral() returns invalid UTF-8 when passed a multiple of 1000Closed, ResolvedPublic1 Estimated Story PointsActions

Description

Details

Related Objects

Event Timeline

Language::hebrewNumeral() returns invalid UTF-8 when passed a multiple of 1000
Closed, ResolvedPublic1 Estimated Story Points
Actions