Wikidata does not accept characters ending in \x85 (Cyrillic х, Armenian Յ, Arabic م etc.) in labels/aliases/descriptions
Closed, Resolved
Public

Authored By
NickK, Mar 23 2017

Description

The problem was first reported here: https://www.wikidata.org/wiki/Wikidata:Project_chat#Cannot_add_an_alias . Any attempt to add a label, a description or an alias containing a character whose UTF-8 encoding ends in the byte \x85 (Cyrillic х, Armenian Յ, Arabic م etc.) fails with this error message:

Could not save due to an error.
Malformed input: (alias here)

For example, adding an alias "Венк фон Венкхейм" to https://www.wikidata.org/wiki/Q830544 fails.

Other letters (those whose UTF-8 encoding does not contain \x85) do not seem to be affected.

This is critical: it limits the ability to edit Wikidata, because one or more letters of some alphabets (for example Armenian) cannot be used at all.

Restricted Application added subscribers: Base, Aklapper. Mar 23 2017, 10:08 PM
NickK triaged this task as "Unbreak Now!" priority. Mar 23 2017, 10:08 PM
Restricted Application added subscribers: Jay8g, TerraCodes. Mar 23 2017, 10:08 PM
NickK changed the title from "Wikidata does not accept lowercase Cyrillic х in aliases" to "Wikidata does not accept lowercase Cyrillic х in labels/aliases". Mar 23 2017, 10:10 PM
NickK edited the task description.

It clearly worked just a few days ago, as I and many other users added labels or aliases containing this letter.

NickK added a comment (edited). Mar 23 2017, 10:21 PM

Clarification: labels, descriptions and aliases are all affected. Editing a description already containing this letter (i.e. not adding this letter but just keeping a letter added before) is not possible either.

NickK changed the title from "Wikidata does not accept lowercase Cyrillic х in labels/aliases" to "Wikidata does not accept lowercase Cyrillic х in labels/aliases/descriptions". Mar 23 2017, 10:22 PM
NickK edited the task description.

This character, х (U+0445), as well as অ (U+0985), Յ (U+0545), օ (U+0585), ׅ (U+05C5), अ (U+0905), Ӆ (U+04C5), and ԅ (U+0505), possibly among others, all trigger this error. I was attempting to use the second one, অ, which produced the error. Noticing that the UTF-8 representations of both অ and х end in 0x85, I tried the rest of the characters in that list.

NickK changed the title from "Wikidata does not accept lowercase Cyrillic х in labels/aliases/descriptions" to "Wikidata does not accept lowercase Cyrillic х and several characters of other alphabets in labels/aliases/descriptions". Mar 24 2017, 9:32 AM
NickK edited the task description.

Maybe it's because the Russian keyboard has the lowercase Russian "х" on the same key as "[" on the English keyboard. I.e. some filter prevents "[" from being added to parameter values, and it affects all other language symbols that share a key with "[".

Not really. The blocked Armenian letters are օ (on the O key) and Յ (on the J key). In addition, [ is accepted as input.

One more affected character: م from the Arabic alphabet (reported in T161297).

Drbug added a subscriber: Drbug (edited). Mar 24 2017, 10:10 AM

Could it be related to the fact that the Unicode NEL (Next Line) character is U+0085?
In UTF-8 it should be encoded as 0xC2 0x85, but some code that checks for new lines might mistakenly check against the single byte 0x85 instead.

https://en.wikipedia.org/wiki/Newline#Unicode
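
The hypothesis above can be illustrated with a minimal Python sketch. The two function names are hypothetical and are not taken from Wikibase; the sketch only shows how a byte-level check for 0x85 would differ from a check made after decoding UTF-8, and does not claim that this is what the actual code does.

```python
NEL = "\u0085"  # Unicode Next Line


def naive_contains_nel(raw: bytes) -> bool:
    """Hypothetical mistaken check: looks for the single byte 0x85."""
    return b"\x85" in raw


def decoded_contains_nel(raw: bytes) -> bool:
    """Decode the UTF-8 input first, then compare code points."""
    return NEL in raw.decode("utf-8")


term = "\u0445".encode("utf-8")  # Cyrillic х
print(NEL.encode("utf-8"))       # b'\xc2\x85'
print(term)                      # b'\xd1\x85'
print(naive_contains_nel(term), decoded_contains_nel(term))  # True False
```

Under this assumption the byte-level check reports a Next Line character in х even though the text contains none.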

NickK added a comment (edited). Mar 24 2017, 10:32 AM

Yes, the thing in common is \x85 in UTF-8 encoding:
х = \xD1\x85
Ӆ = \xD3\x85
ԅ = \xD4\x85
Յ = \xD5\x85
օ = \xD6\x85
م = \xD9\x85
ׅ = \xD7\x85
अ = \xE0\xA4\x85
অ = \xE0\xA6\x85
Thus other letters are also affected:
Ѕ (Cyrillic) = \xD0\x85
҅ (Old Church Slavonic) = \xD2\x85
څ (Pashto) = \xDA\x85
etc.
Update: Greek is affected as well, with
υ = \xCF\x85
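
The list above can be reproduced with a short Python sketch. The helper names `utf8_hex` and `ends_in_0x85` are made up for illustration; the code points are the ones named in this task.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch in the \\xNN notation used above."""
    return "".join(f"\\x{b:02X}" for b in ch.encode("utf-8"))


def ends_in_0x85(ch: str) -> bool:
    """True if the last UTF-8 byte of ch is 0x85."""
    return ch.encode("utf-8")[-1] == 0x85


# Code points reported in this task.
AFFECTED = (0x0445, 0x04C5, 0x0505, 0x0545, 0x0585, 0x0645,
            0x05C5, 0x0905, 0x0985, 0x0405, 0x0485, 0x0685, 0x03C5)

for cp in AFFECTED:
    ch = chr(cp)
    print(f"U+{cp:04X} {utf8_hex(ch)} {ends_in_0x85(ch)}")
```

In a multi-byte UTF-8 sequence the last byte is 0x80 plus the low six bits of the code point, so every non-ASCII code point whose value modulo 64 equals 5 ends in 0x85. That is why letters from so many unrelated alphabets are affected.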

NickK changed the title from "Wikidata does not accept lowercase Cyrillic х and several characters of other alphabets in labels/aliases/descriptions" to "Wikidata does not accept characters ending in \x85 (Cyrillic х, Armenian Յ, Arabic م etc.) in labels/aliases/descriptions". Mar 24 2017, 10:35 AM
NickK edited the task description.

Reproduced:
More details:

Change 344597 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Change bad ASCII to UTF-8 validation in terms/value validators

https://gerrit.wikimedia.org/r/344597

thiemowmde claimed this task.
thiemowmde moved this task from incoming to in current sprint on the Wikidata board.
thiemowmde moved this task from Proposed to Review on the Wikidata-Sprint board.
Drbug added a comment (edited). Mar 24 2017, 11:37 AM

I'm now even more convinced that the problem is with code that replaces 0x85 (incorrectly treated as NEL) with 0x0D 0x0A (CR+LF),
because 0xD1 0x0D (or 0xD1 0x0A), 0xD3 0x0D (or 0xD3 0x0A), etc. are indeed malformed UTF-8 sequences.

@Drbug, thanks a lot for investigating. However, the conclusion is not correct. @Lea_Lacroix_WMDE told us about the problem this morning, and https://gerrit.wikimedia.org/r/344597 will fix the underlying issue. We are also backporting this fix right now, which means it will go live much sooner than with the regular deployment. We expect it to be fixed on the live site on Monday, hopefully earlier.

Change 344599 had a related patch set uploaded (by Ladsgroup; owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.17] Change bad ASCII to UTF-8 validation in terms/value validators

https://gerrit.wikimedia.org/r/344599

Change 344600 had a related patch set uploaded (by Aleksey Bekh-Ivanov (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Forgotten unicode regex flag

https://gerrit.wikimedia.org/r/344600

Change 344597 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Change bad ASCII to UTF-8 validation in terms/value validators

https://gerrit.wikimedia.org/r/344597

Maybe this helps:
Changing the label to "хеширование" at https://www.wikidata.org/wiki/Special:SetLabelDescriptionAliases/Q11079271/ru gives this error: "Validation failed: Negative pattern matched: /^\s|[\v\t]|\s$/"
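
In PCRE, \v stands for the vertical-whitespace class, which includes NEL (U+0085). One plausible reading of that error, consistent with the patch titles in this task but not confirmed here, is that the pattern was matched against raw bytes rather than decoded UTF-8. The Python sketch below emulates that assumption. Python's own \v means only VT, so the class is spelled out by hand, and the function names are hypothetical.

```python
import re

# PCRE \v = LF, VT, FF, CR, NEL (U+0085), U+2028, U+2029.
# Byte mode: only the members below 256 apply, so byte 0x85 is in the class.
BYTE_PATTERN = re.compile(rb"^\s|[\x0a-\x0d\x85\t]|\s$")
# Text mode: the same class expressed as code points.
TEXT_PATTERN = re.compile(r"^\s|[\n\x0b\x0c\r\x85\u2028\u2029\t]|\s$")


def rejected_bytewise(term: str) -> bool:
    """Emulates matching without a Unicode flag (assumed buggy behaviour)."""
    return BYTE_PATTERN.search(term.encode("utf-8")) is not None


def rejected_unicode(term: str) -> bool:
    """Emulates matching on decoded text (assumed fixed behaviour)."""
    return TEXT_PATTERN.search(term) is not None


for term in ("хеширование", "Wikidata", " leading space"):
    print(repr(term), rejected_bytewise(term), rejected_unicode(term))
```

With these assumptions, "хеширование" is rejected only by the byte-wise variant, because the second byte of х is 0x85, while the leading-space term is rejected by both.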

revi added a subscriber: revi. Mar 24 2017, 12:38 PM

For the record, this is failing for Korean too. Screenshots are on-wiki.

Change 344599 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.17] Change bad ASCII to UTF-8 validation in terms/value validators

https://gerrit.wikimedia.org/r/344599

Change 344600 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Forgotten unicode regex flag

https://gerrit.wikimedia.org/r/344600

Do I understand correctly that this will not be deployed before Monday due to "no deployments on Friday" rule?

In that case, will there be a simple way to fix the typos that people had to introduce into labels/descriptions/categories just to be able to save their changes? Four days of edits (Thu-Mon) sounds like too much to fix manually.

revi awarded a token. Mar 25 2017, 7:26 AM

https://tools.wmflabs.org/pltools/rech/ should help. Filter by term and language code.

No, that does not help. This shows only unpatrolled edits, i.e. those by new users. New users are more likely to stop contributing when they face such an obstacle. The users most likely to add typos intentionally in order to save the page are experienced users, whose edits are patrolled.

Base awarded a token. Mar 26 2017, 5:14 AM

Do I understand correctly that this will not be deployed before Monday due to "no deployments on Friday" rule?

Sounds like a good rule. This way you can deploy a fix on Friday if you break it on Thursday.

Mentioned in SAL (#wikimedia-operations) [2017-03-27T09:52:12Z] <Amir1> start of ladsgroup@tin:/srv/mediawiki-staging/php-1.29.0-wmf.17$ scap sync-dir php-1.29.0-wmf.17/extensions/Wikidata "Update Wikidata - fix term validation (T161263)"

Mentioned in SAL (#wikimedia-operations) [2017-03-27T09:54:03Z] <ladsgroup@tin> Synchronized php-1.29.0-wmf.17/extensions/Wikidata: Update Wikidata - fix term validation (T161263) (duration: 02m 22s)

Mentioned in SAL (#wikimedia-operations) [2017-03-27T10:15:00Z] <Amir1> start of ladsgroup@tin:/srv/mediawiki-staging/php-1.29.0-wmf.17$ scap sync-dir php-1.29.0-wmf.17/extensions/Wikidata "Second try for Update Wikidata - fix term validation (T161263)"

Mentioned in SAL (#wikimedia-operations) [2017-03-27T10:16:30Z] <ladsgroup@tin> Synchronized php-1.29.0-wmf.17/extensions/Wikidata: Second try for Update Wikidata - fix term validation (T161263) (duration: 02m 05s)

Now it works. https://www.wikidata.org/w/index.php?title=Q211760&diff=470686073&oldid=470531099. If you can't do it, it might be because of varnish caching or something similar. Give it another try in several minutes. If it persists, feel free to reopen this task.

Ladsgroup closed this task as "Resolved". Mon, Mar 27, 10:22 AM
Ladsgroup removed a project: Patch-For-Review.
Ladsgroup moved this task from Review to Done on the Wikidata-Sprint board.
Mr.Ibrahem added a subscriber: Mr.Ibrahem (edited). Mon, Mar 27, 11:17 AM

I always get the error "Malformed input" when I try to add Arabic labels/aliases/descriptions.

I just tried again and it worked: https://www.wikidata.org/w/index.php?title=Q37617&diff=470696095&oldid=470516941. @Mr.Ibrahem Can you provide more details? Most importantly, when was the last time you tried and can you try again?

Just tried to add the label Λουδοβίκος του Εβρέ to Q352940 and I get a pop-up saying invalid input.
ApiSandbox fails as well: Malformed input: \u039b\u03bf\u03c5\u03b4\u03bf\u03b2\u03af\u03ba\u03bf\u03c2 \u03c4\u03bf\u03c5 \u0395\u03b2\u03c1\u03ad

@Mbch331 It seems you were able to add the label on the second (or a later) try: https://www.wikidata.org/w/index.php?title=Q352940&diff=prev&oldid=470697022 The same happens to me. Sometimes it works, sometimes it doesn't.

Mentioned in SAL (#wikimedia-operations) [2017-03-27T12:16:57Z] <Amir1> start of ladsgroup@tin:/srv/mediawiki-staging$ scap sync-file php-1.29.0-wmf.17/extensions/Wikidata/composer.lock 'Third try for Update Wikidata - fix term validation (T161263) Part I'

Mentioned in SAL (#wikimedia-operations) [2017-03-27T12:17:33Z] <ladsgroup@tin> Synchronized php-1.29.0-wmf.17/extensions/Wikidata/composer.lock: Third try for Update Wikidata - fix term validation (T161263) Part I (duration: 00m 44s)

Mentioned in SAL (#wikimedia-operations) [2017-03-27T12:19:40Z] <ladsgroup@tin> Synchronized php-1.29.0-wmf.17/extensions/Wikidata/extensions/Wikibase/: Third try for Update Wikidata - fix term validation (T161263) Part II (duration: 01m 32s)

Mentioned in SAL (#wikimedia-operations) [2017-03-27T12:21:09Z] <ladsgroup@tin> Synchronized php-1.29.0-wmf.17/extensions/Wikidata/vendor/composer/installed.json: Third try for Update Wikidata - fix term validation (T161263) Part III (duration: 00m 43s)

Okay, I manually logged into one random MediaWiki node and saw that it wasn't synced there. It seems sync-dir in scap is not as fun as it looks. I have manually tested it several times now and it works just fine, but please test again and tell me if anything is not correct.

@Mbch331 It seems you were able to add the label on the second (or a later) try: https://www.wikidata.org/w/index.php?title=Q352940&diff=prev&oldid=470697022 The same happens to me. Sometimes it works, sometimes it doesn't.

I was talking about the label, whereas the linked edit is an update of the description.
I just tested Q352940 and I can add the label now.

Arbnos added a subscriber: Arbnos. Tue, Mar 28, 11:22 PM