
legaltitlechars in mw.config and from API don't match (\xFF is not the same as \uFFFF)
Closed, Invalid · Public

Description

The value mw.config.get('wgLegalTitleChars') in the JS interface and the legaltitlechars value from ApiQuerySiteInfo are neither identical nor equivalent.

Running the following in the JavaScript console:

new mw.Api().get( {
	"meta": "siteinfo",
	"siprop": "general"
} ).then( function( data ) {
	console.log( data.query.general.legaltitlechars + '\t\tAPI version' );
	console.log( mw.config.get('wgLegalTitleChars') + '\tmw.config version' );
} );

gives:

%!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+		API version
%!"$&'()*,\-./0-9:;=?@A-Z\\\^_`a-z~+\u0080-\uFFFF	mw.config version

Apart from the apparent differences in the number of \ characters used for escaping, the upper bound of the final range differs: \xFF isn't the same as \uFFFF.

Shouldn't these be equivalent?

The mw.config version seems to be the correct one. There are some titles such as File:Michał Cieślak Sejm 2016.JPG containing \u0142 and \u015b, which shouldn't be allowed according to the API version.
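A quick way to see the discrepancy is to test only the non-ASCII ranges from the two values against the characters in that title. This is a minimal sketch that assumes both classes are interpreted by a JavaScript RegExp, which may not be how the API value is meant to be used; the variable names are hypothetical.

```javascript
// Only the non-ASCII ranges from the two values, anchored to the whole string.
// Assumption: both are evaluated as JavaScript regexes (UTF-16 code units).
const apiRange = /^[\x80-\xFF]+$/;        // range as written in the API value
const configRange = /^[\u0080-\uFFFF]+$/; // range as written in the mw.config value

const sample = '\u0142\u015b'; // "łś", taken from the file title above

const apiMatches = apiRange.test( sample );
const configMatches = configRange.test( sample );

console.log( 'API range matches:', apiMatches );
console.log( 'mw.config range matches:', configMatches );
```

In JavaScript, \x80-\xFF covers only U+0080 to U+00FF, so U+0142 and U+015B fall outside it.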

Event Timeline

Restricted Application added a subscriber: Aklapper.
Aklapper renamed this task from "legaltitlechars in mw.config and from API don't match" to "legaltitlechars in mw.config and from API don't match (\xFF is not the same as \uFFFF)". · May 22 2020, 9:25 AM
Aklapper added a subscriber: AMooney.

@AMooney: Assuming that "Set projects" was accidentally used instead of "Add projects", hence restoring some previous project tags.

Ammarpad closed this task as Invalid. · Edited · Aug 17 2020, 10:51 AM
Ammarpad subscribed.

> The mw.config version seems to be the correct one.

The API version is the canonical one as it just echoes the original source of the regex.

> The value mw.config.get('wgLegalTitleChars') in the JS interface and the legaltitlechars value from ApiQuerySiteInfo are not same or equivalent.

It's intentional that they are different. The value from the mw.config object (meant for JS) is a modified version of the canonical value: it gives the Unicode code point representation of the UTF-8 byte ranges (as used in PHP). cf. dc9c9ee7fc6d
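The following sketch illustrates why a byte-level range and a code-unit range can describe the same characters. It assumes, as stated above, that the PHP side applies the class to UTF-8 bytes; the helpers utf8Bytes and allHighBytes are hypothetical names for illustration and are not part of MediaWiki.

```javascript
// TextEncoder is a standard global in browsers and in Node.js; it encodes to UTF-8.
const encoder = new TextEncoder();

// Return the UTF-8 bytes of a string as a plain array of numbers.
function utf8Bytes( str ) {
	return Array.from( encoder.encode( str ) );
}

// True if every byte lies in 0x80-0xFF, i.e. the byte-level class \x80-\xFF.
function allHighBytes( bytes ) {
	return bytes.every( function ( b ) {
		return b >= 0x80 && b <= 0xFF;
	} );
}

const bytes = utf8Bytes( '\u0142' ); // "ł" is encoded as the two bytes 0xC5 0x82
const byteLevelMatch = allHighBytes( bytes );

// JavaScript strings are sequences of UTF-16 code units, so the same
// character has to be matched with a code-unit range instead.
const codeUnitMatch = /^[\u0080-\uFFFF]$/.test( '\u0142' );

console.log( bytes, byteLevelMatch, codeUnitMatch );
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so \x80-\xFF applied to bytes accepts any non-ASCII character. Without the u flag, a JavaScript regex matches individual UTF-16 code units, and \u0080-\uFFFF covers every non-ASCII code unit, including the surrogates used for characters beyond U+FFFF.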