
Greek Ano Teleia symbol causes "The supplied MD5 hash was incorrect"
Closed, Duplicate (Public)

Description

The symbol with UTF-8 encoding 0xCE 0x87 (U+0387, Greek Ano Teleia) causes "The supplied MD5 hash was incorrect" when it is posted through https://www.wikidata.org/w/api.php?action=edit.

The response says "NFC-normalized Unicode without C0 control characters other than...", but the symbol looks valid, see: https://unicode-table.com/en/0387/

The issue can be reproduced with a 2-byte document containing only "0xCE 0x87".

Request:

POST https://www.wikidata.org/w/api.php?action=edit&bot=1&assert=bot&format=json&utf8=true&md5=b80d5a5d9193d69ce0b1009c31587da5&notminor&nocreate&basetimestamp=2017-04-24T17:57:35Z&starttimestamp=2017-04-24T17:57:35Z&title=Wikidata:Database%20reports%2FConstraint%20violations%2FP274 HTTP/1.1
Content-Type: multipart/form-data; boundary=---------------------------15841t4258657059076
Content-Length: 550
User-Agent: C++ WikiAPI
Host: www.wikidata.org
Connection: Keep-Alive
Cache-Control: no-cache
Cookie: WMF-Last-Access=24-Apr-2017; wikidatawikiUserName=KrBot; wikidatawikiSession=<cut>; forceHTTPS=true; wikidatawikiUserID=<cut>; centralauth_User=KrBot; centralauth_Token=<cut>; centralauth_Session=<cut>; WMF-Last-Access-Global=24-Apr-2017; GeoIP=<cut>

-----------------------------15841t4258657059076
Content-Disposition: form-data; name="text"
Content-Type: application/x-www-form-urlencoded

·
-----------------------------15841t4258657059076
Content-Disposition: form-data; name="summary"
Content-Type: application/x-www-form-urlencoded

update
-----------------------------15841t4258657059076
Content-Disposition: form-data; name="token"
Content-Type: application/x-www-form-urlencoded

<cut>+\
-----------------------------15841t4258657059076--

Response:

HTTP/1.1 200 OK
Date: Mon, 24 Apr 2017 17:57:36 GMT
Content-Type: application/json; charset=utf-8
Connection: keep-alive
Server: mw2221.codfw.wmnet
X-Powered-By: HHVM/3.12.14
X-Content-Type-Options: nosniff
Cache-control: private, must-revalidate, max-age=0
MediaWiki-API-Error: badmd5
X-Frame-Options: DENY
Vary: Accept-Encoding
Backend-Timing: D=<cut> t=<cut>
X-Varnish: <cut>, <cut>, <cut>
Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
Accept-Ranges: bytes
Age: 0
X-Cache: cp2019 pass, cp3033 pass, cp3041 pass
X-Cache-Status: pass
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Analytics: WMF-Last-Access=24-Apr-2017;WMF-Last-Access-Global=24-Apr-2017;https=1
X-Client-IP: <cut>
Content-Length: 565

{"error":{"code":"badmd5","info":"The supplied MD5 hash was incorrect.","*":"See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes."},"warnings":{"edit":{"*":"The value passed for \"text\" contains invalid or non-normalized data. Textual data should be valid, NFC-normalized Unicode without C0 control characters other than HT (\\t), LF (\\n), and CR (\\r)."}},"servedby":"mw2221"}

Event Timeline

I had a look. From what I understand, this is an edit by KrBot to a page in the project namespace (not an entity namespace), see https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P274. MediaWiki core's API rejects the edit because the supplied MD5 checksum does not match the one MediaWiki core expects. The error is triggered by this line in core's ApiBase.php: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/api/ApiBase.php;b0e34dab688b64eae52108b1fa937704a615c9cc$1124. This might happen because the WebRequest::normalizeUnicode call a few lines above does not support this character. That is as far as I got; I would like developers more familiar with this UTF-8 clean-up, normalization, and validation code to take over and assign this ticket to the proper project.

This is not a Wikidata ticket.

Anomie subscribed.

Unicode Normalization Form C converts U+0387 into U+00B7.
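The conversion can be checked locally. The following is a minimal sketch in Python (not the bot's C++ code, and not MediaWiki's own normalizer); it only uses the standard `unicodedata` module to apply NFC to the 2-byte document from the description.

```python
import unicodedata

# The 2-byte document from the report: UTF-8 bytes 0xCE 0x87 (U+0387).
ano_teleia = b"\xce\x87".decode("utf-8")

# Apply Unicode Normalization Form C using Python's standard library.
normalized = unicodedata.normalize("NFC", ano_teleia)

# Show the code points before and after, and the new UTF-8 bytes.
print(f"U+{ord(ano_teleia):04X} -> U+{ord(normalized):04X}")  # U+0387 -> U+00B7
print(normalized.encode("utf-8").hex())                       # c2b7
```

After normalization the text is a different byte sequence (0xC2 0xB7 instead of 0xCE 0x87), so a hash computed over the original bytes no longer describes it.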

The warning is telling you what you need to know: "The value passed for "text" contains invalid or non-normalized data. Textual data should be valid, NFC-normalized Unicode without C0 control characters other than HT (\t), LF (\n), and CR (\r)."
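One possible client-side fix, sketched in Python rather than the bot's C++: normalize the text to NFC first, then compute the `md5` parameter over the UTF-8 bytes of the normalized text and send that same text. The helper name `prepare_text` is hypothetical, and the sketch assumes the server hashes the normalized UTF-8 text, which is what the `badmd5` error together with the warning suggests.

```python
import hashlib
import unicodedata
from typing import Tuple


def prepare_text(text: str) -> Tuple[str, str]:
    """Return (NFC-normalized text, hex MD5 of its UTF-8 bytes).

    Hypothetical helper: the returned pair is meant to be sent as the
    `text` and `md5` parameters of action=edit.
    """
    normalized = unicodedata.normalize("NFC", text)
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return normalized, digest


# Example with the character from this report (U+0387).
text, md5 = prepare_text("\u0387")
print(repr(text), md5)
```

Hashing after normalization keeps the checksum consistent with the text the client actually submits, so the check still guards against corruption in transit.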