
Bengali handling can cause issues with API editing/MD5 checking
Closed, Resolved, Public

Description

https://github.com/alexz-enwp/wikitools/issues/40 was reported as a bug for the wikitools library. However, the problem appears to be in MediaWiki.

When trying to do an API edit containing the character য়, MD5 verification fails. This is because MediaWiki decomposes this single character into 2 characters: য (U+09AF) followed by the combining mark ় (U+09BC). The end result will look the same - য় - but it is 2 different characters when calculating the MD5 hash, so the hash no longer matches what the client sent.
So md5( "য়" ) (the single precomposed character, U+09DF) -> 29a76e95913204ab68f10a1c34784a7c
And md5( "য়" ) (its NFC form, the decomposed pair U+09AF U+09BC) -> 1bb92266e7298fb3aa4aa5c641562f10

I've verified that Python is sending a single character in the POST data, and PHP has a single character in the $_POST variable, so the decomposition is done in MediaWiki somewhere. It's easy enough to work around by just not using MD5 verification with Bengali text, but it seems like unexpected behavior.

I've also verified this is the case with other precomposed Bengali characters such as ড়, but it is not the case with precomposed characters from other scripts, such as ў.

Simple replication code:

# -*- coding: utf-8 -*-
from wikitools import wiki, page

site = wiki.Wiki('http://localhost/w/api.php')
p = page.Page(site, 'Test 123')
p.edit(text='য়')  # precomposed য় (U+09DF); this edit fails the MD5 check

This fails with a "badmd5" error.
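
For illustration, the mismatch can be reproduced without MediaWiki at all. A minimal Python 3 sketch using only hashlib and unicodedata (the code point values are the ones discussed in the comments below):

import hashlib
import unicodedata

precomposed = '\u09df'                                   # য় as a single code point (what the client sends)
decomposed = unicodedata.normalize('NFC', precomposed)   # NFC yields the pair U+09AF U+09BC

print(hashlib.md5(precomposed.encode('utf-8')).hexdigest())  # 29a76e95913204ab68f10a1c34784a7c
print(hashlib.md5(decomposed.encode('utf-8')).hexdigest())   # 1bb92266e7298fb3aa4aa5c641562f10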

Event Timeline

Schnark subscribed.

MediaWiki transforms all input into NFC before further processing. Though it doesn't seem obvious, the NFC of য় (U+09DF) is <য ়> (<U+09AF U+09BC>), because it has the Comp_Ex property set, as do some other Bengali characters. This isn't the case for most other precomposed characters, so the NFC of ў is ў itself.
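
(For illustration, a quick Python 3 check of this asymmetry, printing each character's NFC code points; nothing here is MediaWiki-specific:)

import unicodedata

for ch in ('\u09df', '\u045e'):   # য় (U+09DF) and ў (U+045E)
    nfc = unicodedata.normalize('NFC', ch)
    print('U+%04X -> %s' % (ord(ch), ' '.join('U+%04X' % ord(c) for c in nfc)))

# U+09DF -> U+09AF U+09BC   (composition-excluded, so NFC leaves it decomposed)
# U+045E -> U+045E          (stays precomposed)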

So when using the md5 parameter, you should make sure that all data is already in NFC. Of course it would be nice if MediaWiki could check the original input instead, or at least explicitly document this behavior, e.g. on https://www.mediawiki.org/wiki/API:Edit

In Python, you can use unicodedata.normalize('NFC', text) to convert to NFC beforehand.
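
For example (a minimal sketch of that client-side workaround; the md5 parameter is the hex MD5 of the UTF-8 bytes of the text being submitted):

import hashlib
import unicodedata

text = unicodedata.normalize('NFC', '\u09df')    # normalize before hashing and sending
md5 = hashlib.md5(text.encode('utf-8')).hexdigest()
# pass both text and md5 to the edit request; they now agree with what MediaWiki computes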

Thanks. I agree it would be nice if it compared the original input, in keeping with the purpose of the MD5 check. But documentation somewhere would be helpful.

MediaWiki transforms all input into NFC before further processing. Though it doesn't seem obvious, the NFC of য় (U+09DF) is <য ়> (<U+09AF U+09BC>), because it has the Comp_Ex property set, as do some other Bengali characters. This isn't the case for most other precomposed characters, so the NFC of ў is ў itself.

This is really good information to have, thank you. Do you know where MediaWiki is doing this? I see includes/compat/normal/UtfNormal.php, but it seems to be referencing a Validator of some kind(?) and I'm not sure I see the data flow properly.

So when using the md5 parameter, you should make sure that all data is already in NFC. Of course it would be nice if MediaWiki could check the original input instead, or at least explicitly document this behavior, e.g. on https://www.mediawiki.org/wiki/API:Edit

I definitely think we need better documentation here. There may be existing pages that document these transformations, but I checked https://www.mediawiki.org/wiki/NFC and it was a dead end. :-( We should consider both MediaWiki developer-focused documentation (people building api.php) and user-focused documentation (people using api.php).

Thank you for filing this task!

I verified yesterday that MediaWiki is indeed hitting the check if ( !is_null( $params['md5'] ) && md5( $toMD5 ) !== $params['md5'] ) { in includes/api/ApiEditPage.php and producing the two hashes mentioned in the task description: $params['md5'] is 29a76e95913204ab68f10a1c34784a7c and md5( $toMD5 ) is 1bb92266e7298fb3aa4aa5c641562f10 for the provided input.

I'm not sure I understand the purpose and behavior of the MD5 hash used with api.php. If the hash is supposed to check the post-transformed content, wouldn't it always fail with wikitext such as ~~~~ and {{subst:}}? If the hash is supposed to check pre-transformed content, why is it failing here?

Do you know where MediaWiki is doing this? I see includes/compat/normal/UtfNormal.php, but it seems to be referencing a Validator of some kind(?) and I'm not sure I see the data flow properly.

Poking at this a bit more, https://github.com/wikimedia/utfnormal is what I was missing.

MediaWiki transforms all input into NFC before further processing. Though it doesn't seem obvious, the NFC of য় (U+09DF) is <য ়> (<U+09AF U+09BC>), because it has the Comp_Ex property set, as do some other Bengali characters. This isn't the case for most other precomposed characters, so the NFC of ў is ў itself.

This is really good information to have, thank you. Do you know where MediaWiki is doing this? I see includes/compat/normal/UtfNormal.php, but it seems to be referencing a Validator of some kind(?) and I'm not sure I see the data flow properly.

It happens in WebRequest::normalizeUnicode() (and should happen in every other place where we ingest data, e.g. maintenance/importImages.php, etc.).

That calls UtfNormal\Validator::cleanUp(), which usually just calls normalizer_normalize (or implements NFC itself if it's not available).
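
Roughly, and keeping in mind this is only a Python approximation of that server-side step rather than the actual PHP: cleanUp() validates the UTF-8 (replacing invalid byte sequences with U+FFFD) and normalizes to NFC before the rest of MediaWiki sees the data:

import unicodedata

def cleanup_approximation(raw_bytes):
    # Rough analogue of UtfNormal\Validator::cleanUp(): invalid byte sequences
    # become U+FFFD, then the text is normalized to NFC.
    text = raw_bytes.decode('utf-8', errors='replace')
    return unicodedata.normalize('NFC', text)

print([hex(ord(c)) for c in cleanup_approximation('\u09df'.encode('utf-8'))])  # ['0x9af', '0x9bc']
print(cleanup_approximation(b'\xff'))                                          # prints '�' (U+FFFD), as mentioned below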

I'm not sure I understand the purpose and behavior of the MD5 hash used with api.php. If the hash is supposed to check the post-transformed content, wouldn't it always fail with wikitext such as ~~~~ and {{subst:}}? If the hash is supposed to check pre-transformed content, why is it failing here?

The normalization happens long before the rest of MediaWiki sees any of the data. It's an entirely separate thing from page text parsing and from title parsing.

So when using the md5 parameter, you should make sure that all data is already in NFC. Of course it would be nice if MediaWiki could check the original input instead, or at least explicitly document this behavior, e.g. on https://www.mediawiki.org/wiki/API:Edit

https://gerrit.wikimedia.org/r/#/c/306491/ will add a warning when submitted input is normalized, which should tip off someone trying to debug this sort of issue in the future. The md5 parameter will still check the post-normalized data, in part because some normalization results in � characters or potentially other non-equivalent changes.

As for documentation, I'll update that patch to add a short note at https://en.wikipedia.org/w/api.php#main/datatypes that input should be in NFC-normalized UTF-8 and that MediaWiki's conversion may cause operations such as edits with MD5 to fail.
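
Putting the client-side advice together, a hedged end-to-end sketch using the requests library (the endpoint and page title are placeholders; a real client would log in and fetch a CSRF token via action=query&meta=tokens, which is only stubbed out here with the anonymous token "+\"):

import hashlib
import unicodedata
import requests

API = 'http://localhost/w/api.php'     # placeholder endpoint
session = requests.Session()
token = '+\\'                          # anonymous edit token; fetch a real CSRF token for logged-in edits

text = unicodedata.normalize('NFC', '\u09df')   # NFC-normalize before computing the hash
resp = session.post(API, data={
    'action': 'edit',
    'title': 'Test 123',
    'text': text,
    'md5': hashlib.md5(text.encode('utf-8')).hexdigest(),
    'token': token,
    'format': 'json',
})
print(resp.json())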

I'm not sure I understand the purpose and behavior of the MD5 hash used with api.php. If the hash is supposed to check the post-transformed content, wouldn't it always fail with wikitext such as ~~~~ and {{subst:}}? If the hash is supposed to check pre-transformed content, why is it failing here?

The purpose is to guard against data corruption between when the client generates the output and when it gets into ApiEditPage. The Unicode normalization counts as data corruption.

Change 306491 had a related patch set uploaded (by Anomie):
API: Warn when input parameters are normalized

https://gerrit.wikimedia.org/r/306491

Change 306491 merged by jenkins-bot:
API: Warn when input parameters are normalized

https://gerrit.wikimedia.org/r/306491

Anomie claimed this task.

From the action API perspective this is resolved as much as it's going to be: the warning and documentation update should go out with 1.28.0-wmf.17, see https://www.mediawiki.org/wiki/MediaWiki_1.28/Roadmap for the schedule. Edits using non-NFC-normalized Unicode will continue to fail the md5 check, but at least the reason for it failing should be clearer.