
Bengali handling can cause issues with API editing/MD5 checking
Closed, Resolved, Public

Description

https://github.com/alexz-enwp/wikitools/issues/40 was reported as a bug for the wikitools library. However, the problem appears to be in MediaWiki.

When trying to do an API edit containing the character য়, MD5 verification fails. This is because MediaWiki decomposes this single character into 2 characters: য (U+09AF) followed by the combining mark ় (U+09BC). The end result will look the same - য় - but it is 2 different characters when calculating the MD5 hash, so the hash no longer matches what the client sent.
So md5( "য়" ) (the single precomposed character, U+09DF) -> 29a76e95913204ab68f10a1c34784a7c
And md5( "য়" ) (its NFC form, the decomposed pair U+09AF U+09BC) -> 1bb92266e7298fb3aa4aa5c641562f10

I've verified that Python is sending a single character in the POST data, and PHP has a single character in the $_POST variable, so the decomposition is done in MediaWiki somewhere. It's easy enough to work around by just not using MD5 verification with Bengali text, but it seems like unexpected behavior.

I've also verified this is the case with other precomposed Bengali characters such as ড়, but it is not the case with precomposed characters from other scripts, such as ў.

Simple replication code:

# -*- coding: utf-8 -*-
from wikitools import wiki, page

site = wiki.Wiki('http://localhost/w/api.php')
p = page.Page(site, 'Test 123')
p.edit(text='য়')  # precomposed য় (U+09DF); this edit fails the MD5 check

This fails with a "badmd5" error.
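
For illustration, the mismatch can be reproduced without MediaWiki at all. A minimal Python 3 sketch using only hashlib and unicodedata (the code point values are the ones discussed in the comments below):

import hashlib
import unicodedata

precomposed = '\u09df'                                   # য় as a single code point (what the client sends)
decomposed = unicodedata.normalize('NFC', precomposed)   # NFC yields the pair U+09AF U+09BC

print(hashlib.md5(precomposed.encode('utf-8')).hexdigest())  # 29a76e95913204ab68f10a1c34784a7c
print(hashlib.md5(decomposed.encode('utf-8')).hexdigest())   # 1bb92266e7298fb3aa4aa5c641562f10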

Event Timeline

Schnark subscribed.

MediaWiki transforms all input into NFC before further processing. Though it doesn't seem obvious, the NFC of য় (U+09DF) is <য ়> (<U+09AF U+09BC>), because it has the Comp_Ex property set, as do some other Bengali characters. This isn't the case for most other precomposed characters, so the NFC of ў is ў itself.
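
(For illustration, a quick Python 3 check of this asymmetry, printing each character's NFC code points; nothing here is MediaWiki-specific:)

import unicodedata

for ch in ('\u09df', '\u045e'):   # য় (U+09DF) and ў (U+045E)
    nfc = unicodedata.normalize('NFC', ch)
    print('U+%04X -> %s' % (ord(ch), ' '.join('U+%04X' % ord(c) for c in nfc)))

# U+09DF -> U+09AF U+09BC   (composition-excluded, so NFC leaves it decomposed)
# U+045E -> U+045E          (stays precomposed)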

So when using the md5 parameter, you should make sure that all data is already in NFC. Of course it would be nice if MediaWiki could check the original input instead, or at least explicitly document this behavior, e.g. on https://www.mediawiki.org/wiki/API:Edit

In Python, you can use unicodedata.normalize('NFC', text) to convert to NFC beforehand.
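
For example (a minimal sketch of that client-side workaround; the md5 parameter is the hex MD5 of the UTF-8 bytes of the text being submitted):

import hashlib
import unicodedata

text = unicodedata.normalize('NFC', '\u09df')    # normalize before hashing and sending
md5 = hashlib.md5(text.encode('utf-8')).hexdigest()
# pass both text and md5 to the edit request; they now agree with what MediaWiki computes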

Thanks. I agree it would be nice if it compared the original input, in keeping with the purpose of the MD5 check. But documentation somewhere would be helpful.

MediaWiki transforms all input into NFC before further processing. Though it doesn't seem obvious, the NFC of য় (U+09DF) is <য ়> (<U+09AF U+09BC>), because it has the Comp_Ex property set, as do some other Bengali characters. This isn't the case for most other precomposed characters, so the NFC of ў is ў itself.

This is really good information to have, thank you. Do you know where MediaWiki is doing this? I see includes/compat/normal/UtfNormal.php, but it seems to be referencing a Validator of some kind(?) and I'm not sure I see the data flow properly.

So when using the md5 parameter, you should make sure that all data is already in NFC. Of course it would be nice if MediaWiki could check the original input instead, or at least explicitly document this behavior, e.g. on https://www.mediawiki.org/wiki/API:Edit

I definitely think we need better documentation here. There may be existing pages that document these transformations, but I checked https://www.mediawiki.org/wiki/NFC and it was a dead end. :-( We should consider both MediaWiki developer-focused documentation (people building api.php) and user-focused documentation (people using api.php).

Thank you for filing this task!

I verified yesterday that MediaWiki is indeed hitting the check if ( !is_null( $params['md5'] ) && md5( $toMD5 ) !== $params['md5'] ) { in includes/api/ApiEditPage.php and producing the two hashes mentioned in the task description: $params['md5'] is 29a76e95913204ab68f10a1c34784a7c and md5( $toMD5 ) is 1bb92266e7298fb3aa4aa5c641562f10 for the provided input.

I'm not sure I understand the purpose and behavior of the MD5 hash used with api.php. If the hash is supposed to check the post-transformed content, wouldn't it always fail with wikitext such as ~~~~ and {{subst:}}? If the hash is supposed to check pre-transformed content, why is it failing here?

Do you know where MediaWiki is doing this? I see includes/compat/normal/UtfNormal.php, but it seems to be referencing a Validator of some kind(?) and I'm not sure I see the data flow properly.

Poking at this a bit more, https://github.com/wikimedia/utfnormal is what I was missing.

MediaWiki transforms all input into NFC before further processing. Though it doesn't seem obvious, the NFC of য় (U+09DF) is <য ়> (<U+09AF U+09BC>), because it has the Comp_Ex property set, as do some other Bengali characters. This isn't the case for most other precomposed characters, so the NFC of ў is ў itself.

This is really good information to have, thank you. Do you know where MediaWiki is doing this? I see includes/compat/normal/UtfNormal.php, but it seems to be referencing a Validator of some kind(?) and I'm not sure I see the data flow properly.

It happens in WebRequest::normalizeUnicode() (and should happen in every other place where we ingest data, e.g. maintenance/importImages.php, etc.).

That calls UtfNormal\Validator::cleanUp(), which usually just calls normalizer_normalize (or implements NFC itself if it's not available).
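
Roughly, and keeping in mind this is only a Python approximation of that server-side step rather than the actual PHP: cleanUp() validates the UTF-8 (replacing invalid byte sequences with U+FFFD) and normalizes to NFC before the rest of MediaWiki sees the data:

import unicodedata

def cleanup_approximation(raw_bytes):
    # Rough analogue of UtfNormal\Validator::cleanUp(): invalid byte sequences
    # become U+FFFD, then the text is normalized to NFC.
    text = raw_bytes.decode('utf-8', errors='replace')
    return unicodedata.normalize('NFC', text)

print([hex(ord(c)) for c in cleanup_approximation('\u09df'.encode('utf-8'))])  # ['0x9af', '0x9bc']
print(cleanup_approximation(b'\xff'))                                          # prints '�' (U+FFFD), as mentioned below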

I'm not sure I understand the purpose and behavior of the MD5 hash used with api.php. If the hash is supposed to check the post-transformed content, wouldn't it always fail with wikitext such as ~~~~ and {{subst:}}? If the hash is supposed to check pre-transformed content, why is it failing here?

The normalization happens long before the rest of MediaWiki sees any of the data. It's an entirely separate thing from page text parsing and from title parsing.

So when using the md5 parameter, you should make sure that all data is already in NFC. Of course it would be nice if MediaWiki could check the original input instead, or at least explicitly document this behavior, e.g. on https://www.mediawiki.org/wiki/API:Edit

https://gerrit.wikimedia.org/r/#/c/306491/ will add a warning when submitted input is normalized, which should tip off someone trying to debug this sort of issue in the future. The md5 parameter will still check the post-normalized data, in part because some normalization results in � characters or potentially other non-equivalent changes.

As for documentation, I'll update that patch to add a short note at https://en.wikipedia.org/w/api.php#main/datatypes that input should be in NFC-normalized UTF-8 and that MediaWiki's conversion may cause operations such as edits with MD5 to fail.
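
Putting the client-side advice together, a hedged end-to-end sketch using the requests library (the endpoint and page title are placeholders; a real client would log in and fetch a CSRF token via action=query&meta=tokens, which is only stubbed out here with the anonymous token "+\"):

import hashlib
import unicodedata
import requests

API = 'http://localhost/w/api.php'     # placeholder endpoint
session = requests.Session()
token = '+\\'                          # anonymous edit token; fetch a real CSRF token for logged-in edits

text = unicodedata.normalize('NFC', '\u09df')   # NFC-normalize before computing the hash
resp = session.post(API, data={
    'action': 'edit',
    'title': 'Test 123',
    'text': text,
    'md5': hashlib.md5(text.encode('utf-8')).hexdigest(),
    'token': token,
    'format': 'json',
})
print(resp.json())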

I'm not sure I understand the purpose and behavior of the MD5 hash used with api.php. If the hash is supposed to check the post-transformed content, wouldn't it always fail with wikitext such as ~~~~ and {{subst:}}? If the hash is supposed to check pre-transformed content, why is it failing here?

The purpose is to guard against data corruption between when the client generates the output and when it gets into ApiEditPage. The Unicode normalization counts as data corruption.

Change 306491 had a related patch set uploaded (by Anomie):
API: Warn when input parameters are normalized

https://gerrit.wikimedia.org/r/306491

Change 306491 merged by jenkins-bot:
API: Warn when input parameters are normalized

https://gerrit.wikimedia.org/r/306491

Anomie claimed this task.

From the action API perspective this is resolved as much as it's going to be: the warning and documentation update should go out with 1.28.0-wmf.17, see https://www.mediawiki.org/wiki/MediaWiki_1.28/Roadmap for the schedule. Edits using non-NFC-normalized Unicode will continue to fail the md5 check, but at least the reason for it failing should be clearer.