
Certain characters outside Unicode plane 0 are blocked for uploading by mw.QuickTitleChecker.js
Open, Needs Triage | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Go to Wikimedia Commons
  • Upload any file containing a Warang Citi character in the filename

What happens?:

  • Upload fails and shows notification "Please choose a different, descriptive title"

What should have happened instead?:
The file should be uploaded like any other file.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:
Firefox 92.0

Ho.png (900×1 px, 241 KB)

Event Timeline

This does not seem to be related to Lingua Libre. Or was this bug discovered whilst trying to record words on Lingua Libre, and the process therefore failed due to this Commons limitation?

This was detected while many people were trying to contribute pronunciations using Lingua Libre. We then created a few sample recordings (Ho-aandi.wav, Ho-andhratayed.wav, Ho-andun.wav) with Audacity and tried to upload them directly to Commons using Upload Wizard. That failed as well, because the Warang Citi characters in the filenames were not accepted. So we had to rename the files with transliterated Latin names before uploading them. This is the suspected reason for all uploads failing when recording and uploading through Lingua Libre.

We reported this issue on the day of a workshop. To our frustration, uploading did not work for any of the participants when they tried to upload their pronunciations. If this bug remains unattended and the problem is not fixed, the participants will probably forget about it. I'd like to request that the developers help find a way to address this issue.

Amire80 subscribed.

I tried debugging this a bit. The problem is probably in the UploadWizard extension, in resources/mw.QuickTitleChecker.js, which says:

		invalid: [
			/[\u00A0\u1680\u180E\u2000-\u200B\u2028\u2029\u202F\u205F\u3000]/, // NBSP and other unusual spaces
			/[\u202A-\u202E]/, // BiDi overrides
			// eslint-disable-next-line no-control-regex
			/[\x00-\x1f]/, // Control characters
			/\uFEFF/, // Byte order mark
			/\u00AD/, // Soft-hyphen
			/[\uD800-\uDFFF\uE000-\uF8FF\uFFF0-\uFFFF]/, // Surrogates, Private Use Area and Specials, including the Replacement Character U+FFFD
			/[^\0-\uFFFF]/, //  Very few characters outside the Basic Multilingual Plane are useful in titles
			/''/
		],

I first thought that removing the /[^\0-\uFFFF]/ rule would resolve it. It probably should be removed in any case, because these days there are writing systems beyond FFFF in which there can be legitimate Commons filenames, including Warang Citi, which is the script we're discussing here. (Most emojis are also beyond FFFF, and perhaps we should block them, but blocking emojis shouldn't come at the expense of blocking legitimate strings in some languages.)

However, this is not the rule that makes it fail. That's actually /[\uD800-\uDFFF\uE000-\uF8FF\uFFF0-\uFFFF]/. I don't know the guts of JavaScript well enough to solve this, but I suspect that JavaScript's usual regex matching just doesn't work beyond FFFF.

Add a breakpoint in the mw.QuickTitleChecker.checkTitle function to step through the code.

Or try this code:

const regex = /[\uD800-\uDFFF\uE000-\uF8FF\uFFF0-\uFFFF]/;
const title = '𑢢';
title.match( regex );

In both Firefox and Chrome, the result is '\uD806', even though the character is something else entirely: U+118A2 WARANG CITI CAPITAL LETTER WI.

I think that the newest ECMAScript versions handle characters beyond FFFF better than old-school JS does, but I'm not sure that this can be used here. Someone who knows JavaScript better than I do should take it from here...
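For reference, ES2015 added a "u" (Unicode) flag that makes a regular expression operate on whole code points instead of UTF-16 code units. The following is a minimal sketch of the difference for the two rules discussed here. It has not been tried inside UploadWizard, and the variable names are only illustrative.

```javascript
// U+118A2 WARANG CITI CAPITAL LETTER WI, written as an escaped surrogate pair.
const title = '\uD806\uDCA2';

// Without the u flag the regex sees two separate code units, so the
// surrogate range matches the first half of the pair.
const surrogatesLegacy = /[\uD800-\uDFFF]/;
// With the u flag the pair is read as one code point (U+118A2), which is
// outside the surrogate range. Only unpaired surrogates would match.
const surrogatesUnicode = /[\uD800-\uDFFF]/u;

// Without the u flag this class can never match: every code unit is <= 0xFFFF.
const nonBmpLegacy = /[^\0-\uFFFF]/;
// With the u flag it matches any code point above U+FFFF.
const nonBmpUnicode = /[^\0-\uFFFF]/u;

const results = {
	surrogatesLegacy: surrogatesLegacy.test( title ),
	surrogatesUnicode: surrogatesUnicode.test( title ),
	nonBmpLegacy: nonBmpLegacy.test( title ),
	nonBmpUnicode: nonBmpUnicode.test( title )
};
console.log( results );
```

This also shows why removing the /[^\0-\uFFFF]/ rule alone changes nothing: without the u flag, that rule never matches any string.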

Note: JavaScript stores strings as UTF-16, i.e. characters beyond BMP are represented as surrogate pairs.

The match returns \uD806 because the regex works on UTF-16 code units and returns the first one that matches. Non-BMP characters (i.e. those beyond U+FFFF) are represented as surrogate pairs, so U+118A2 becomes U+D806 U+DCA2, and U+D806 falls inside the surrogate range of that rule.
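The surrogate pair for U+118A2 can be worked out by hand. The sketch below follows the standard UTF-16 encoding arithmetic; the helper name toSurrogatePair is made up for illustration.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair( codePoint ) {
	const offset = codePoint - 0x10000;
	const high = 0xD800 + ( offset >> 10 ); // top 10 bits of the offset
	const low = 0xDC00 + ( offset & 0x3FF ); // bottom 10 bits of the offset
	return [ high, low ];
}

const [ high, low ] = toSurrogatePair( 0x118A2 );
const asHex = ( n ) => n.toString( 16 ).toUpperCase();
console.log( asHex( high ), asHex( low ) ); // D806 DCA2

// The same character as a JavaScript string occupies two code units.
const title = String.fromCodePoint( 0x118A2 );
console.log( title.length ); // 2
console.log( title.charCodeAt( 0 ) === high, title.charCodeAt( 1 ) === low ); // true true
console.log( asHex( title.codePointAt( 0 ) ) ); // 118A2
```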

To allow all non-BMP characters, I think changing

/[\uD800-\uDFFF\uE000-\uF8FF\uFFF0-\uFFFF]/, // Surrogates, Private Use Area and Specials, including the Replacement Character U+FFFD
/[^\0-\uFFFF]/, //  Very few characters outside the Basic Multilingual Plane are useful in titles

to

/[\uE000-\uF8FF\uFFF0-\uFFFF]/, // Private Use Area and Specials, including the Replacement Character U+FFFD
/[\uD800-\uDBFF][^\uDC00-\uDFFF]/, // High surrogate not followed by a low surrogate
/[^\uD800-\uDBFF][\uDC00-\uDFFF]/, // Low surrogate not preceded by a high surrogate

would work.
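One caveat with the two surrogate patterns above: each of them requires a neighbouring character, so a lone high surrogate at the very end of a title, or a lone low surrogate at the very start, would not be caught. Below is a minimal sketch of one possible way to cover those cases, using a negative lookahead and a start-of-string alternative. The helper name hasLoneSurrogate is hypothetical and not part of UploadWizard.

```javascript
// The patterns as proposed above, for comparison.
const proposedHigh = /[\uD800-\uDBFF][^\uDC00-\uDFFF]/;
const proposedLow = /[^\uD800-\uDBFF][\uDC00-\uDFFF]/;

// Variants that also work at the string boundaries.
// High surrogate not followed by a low surrogate (including at the end).
const loneHigh = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
// Low surrogate not preceded by a high surrogate (including at the start).
const loneLow = /(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]/;

// Hypothetical helper: true if the title contains an unpaired surrogate.
function hasLoneSurrogate( title ) {
	return loneHigh.test( title ) || loneLow.test( title );
}

const validTitle = 'Ho-\uD806\uDCA2.wav'; // contains U+118A2 as a proper pair
const trailingHigh = 'Ho-\uD806'; // unpaired high surrogate at the end
const leadingLow = '\uDCA2.wav'; // unpaired low surrogate at the start

console.log( hasLoneSurrogate( validTitle ) );
console.log( hasLoneSurrogate( trailingHigh ), proposedHigh.test( trailingHigh ) );
console.log( hasLoneSurrogate( leadingLow ), proposedLow.test( leadingLow ) );
```

Whether such malformed strings can reach the checker in practice is a separate question; the sketch only shows the difference in behaviour at the boundaries.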

If you want to exclude some ranges of non-BMP characters or only allow some ranges, that can be done by matching the corresponding ranges of surrogate pairs... but first someone needs to decide which ranges should/shouldn't be allowed.
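As an illustration of matching a range through surrogate pairs: the Warang Citi block (U+118A0 to U+118FF) sits entirely under the high surrogate U+D806, with low surrogates U+DCA0 to U+DCFF. The sketch below only detects characters from that block and assumes nothing about which ranges should eventually be allowed or blocked. The variable names are illustrative.

```javascript
// Warang Citi block matched as UTF-16 surrogate pairs (no u flag needed).
const warangCitiPairs = /\uD806[\uDCA0-\uDCFF]/;
// The same block matched by code point, which requires the ES2015 u flag.
const warangCitiCodePoints = /[\u{118A0}-\u{118FF}]/u;

const warangCitiTitle = 'Ho-' + String.fromCodePoint( 0x118A2 ) + '.wav';
const latinTitle = 'Ho-aandi.wav';
// U+11005 BRAHMI LETTER A, a non-BMP character from a different block.
const brahmiTitle = String.fromCodePoint( 0x11005 ) + '.png';

for ( const title of [ warangCitiTitle, latinTitle, brahmiTitle ] ) {
	console.log(
		JSON.stringify( title ),
		warangCitiPairs.test( title ),
		warangCitiCodePoints.test( title )
	);
}
```

Both forms give the same answer for every title; the surrogate-pair form simply avoids depending on the u flag.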

Thank you @Amire80, @Bugreporter and @Nikki for helping out. As neither I nor, probably, @Biswajeet3 know JS well enough to fix this, I am going to wait for someone who does. To address the last part of Nikki's comment, I think Warang Citi's range (118A0–118FF) is the one relevant to this particular bug. But as Amir rightly points out, many other writing systems also suffer from the larger issue. For testing, and for Biswajeet's and my own selfish reasons, it would be really great to make things work for Warang Citi first. I am a bit worried that the participants of a workshop that Biswajeet and I conducted, who form a potential new set of contributors, might completely forget about this if things take too long. Thanks in advance to whoever is helping out.

1234qwer1234qwer4 renamed this task from "Warang Citi (Ho-language writing system) characters not detected on Wikimedia Commons" to "Writing systems outside of Unicode plane 0 are blocked by Upload Wizard". Apr 24 2022, 8:30 PM
1234qwer1234qwer4 awarded a token.
1234qwer1234qwer4 subscribed.
Yug triaged this task as Low priority. Jul 6 2022, 10:24 PM
Yug moved this task from Backlog to Datasets and mass download on the Lingua-Libre-Legacy board.
Yug renamed this task from "Writing systems outside of Unicode plane 0 are blocked by Upload Wizard" to "Unicode: characters outside Unicode plane 0 are blocked by Upload Wizard". Jul 7 2022, 11:37 AM
Yug updated the task description.

@Aklapper @Amire80, this issue is preventing non-Western minorities with fewer human resources from contributing their cultural content to Wikimedia projects.
It therefore de facto blocks Wikimedia's core mission on the diversity side.
We need to investigate (partially done) and solve this ticket rapidly.
Is there a path to increase the priority of this ticket and get professional JS and i18n support from WMF?

Aklapper renamed this task from "Unicode: characters outside Unicode plane 0 are blocked by Upload Wizard" to "Certain characters outside Unicode plane 0 are blocked for uploading by mw.QuickTitleChecker.js". Jul 11 2022, 12:18 PM

@Yug: TitleBlacklist is a separate codebase so that's a separate issue.

I wanted to highlight that it's been nearly 1.5 years, and this issue remains unsolved. An entire historically and socio-economically marginalized community cannot even upload a file in its native script. Ganesh Birua shared something about this issue earlier, and it reminded me of the frustration of a group of young people who once gathered excitedly to record pronunciations of words using Lingua Libre, only to fail collectively. Ganesh has been building a dictionary, brick by brick, in his limited time, and found out that he can record the pronunciations. I tried it myself (see screenshot, Firefox 111.0.1), and it failed just like before.

issue-unable-to-upload-file-with-Warang-Citi-character-2023-04-22.png (1×2 px, 157 KB)

Yug raised the priority of this task from Low to High. Apr 22 2023, 6:46 PM

@Yug: TitleBlacklist is a separate codebase so that's a separate issue.

The initial priority level within #lingua_libre was set to low because the issue is out of our scope. Now that the ticket has rightfully been moved into Commons, UploadWizard, TitleBlacklist (?) and I18n, it deserves high priority. Whole cultures are prevented from contributing and from making their cultural assets visible, which breaks the core mission of the Wikimedia Foundation.

Aklapper raised the priority of this task from High to Needs Triage. Apr 23 2023, 8:16 AM
Aklapper updated the task description.

The Brahmi script also appears to be blacklisted.
Please allow filenames using the Brahmi script.

Unrelated to title blacklist as made clear in the title.

This issue seems stalled. @Amire80 made some notable progress and a suggestion (T297351#7591178) to fix it. Is there someone who could help move this issue forward on triage, priority, and assigning it or taking it over?
This year's Wikimania theme is “Diversity. Collaboration. Future.” It would be good news if we could fix this issue.

@Yug: Anybody is very welcome to contribute and propose a code change (patch) in Gerrit for review based on the previous comments in this ticket.

I've made a request to change the title blacklist on Commons at https://commons.wikimedia.org/wiki/MediaWiki_talk:Titleblacklist#Non-BMP_characters, if anyone wants to comment there.

I think changing it there would make it work for Lingua Libre. The upload wizard will probably still need changes in the file @Amire80 mentioned.

The change to MediaWiki:Titleblacklist that I proposed has been made. I think the upload wizard and Lingua Libre should both work now. Please try uploading files using the scripts which weren't accepted and let me know if it works.

The upload wizard will probably still need changes in the file @Amire80 mentioned.

I looked at that file (link) and it says "These checks are ignored when the TitleBlacklist API is available". If that's accurate, changing MediaWiki:Titleblacklist should fix the upload wizard on Commons, and mw.QuickTitleChecker.js would probably only need editing to keep it in sync for third-party users.