Page MenuHomePhabricator

Scribunto string comparison works case insensitive while the standard Lua case sensitive
Closed, ResolvedPublic

Description

Scribunto string comparison works case insensitive while the standard Lua case sensitive.

Scribunto debug console on various WMF wikis:

print( ("bar"):byte(1, -1) )   print( ("Foo"):byte(1, -1) )   print( "bar" < "Foo" )
98	97	114
70	111	111
true

If it is "feature", then it is nowhere documented and I don't see any reason for it when it's trivial to use [mw.u]string.lower() for case-insensitive comparison.

Event Timeline

Danny_B raised the priority of this task from to High.
Danny_B updated the task description. (Show Details)
Danny_B added a project: Scribunto.
Danny_B added subscribers: Danny_B, Anomie, tstarling.

Lua calls strcoll(), which according to the Linux Programmers Manual is equivalent to strcmp() in the C locale. But we presumably use an English locale.

But we presumably use an English locale.

Specifically $wgShellLocale, which was added for T16944: escapeshellcmd does not work properly with php security update. Default is "en_US.utf8", and is unchanged on WMF sites.

I notice there's a "C.UTF-8" available (wish I had noticed that 2+ years ago), which should work to fix both this and T16944. Is that widespread enough to make it the default $wgShellLocale, or do we need to keep en_US.utf8 as the default and just recommend people use C.UTF-8 where available?

Also, is there any particular reason for wfInitShellLocale() to set LC_CTYPE instead of LC_ALL? It'd be nice to just use that instead of having to manually set LC_ALL in Scribunto.

@Anomie maybe this could answer your questions a bit?


do we need to keep en_US.utf8 as the default and just recommend people use C.UTF-8 where available?

Scribunto modules must operate in "C" locale by default, since otherwise there is no way for bytewise comparison (which is naturally expected) except for writing some own set of functions. OTOH "en_US" can be easily achieved by using existing functions as mentioned in the bug description.

I'm pretty sure I've submitted a request to allow setting of locale in Scribunto, but after the migration to Phabricator I can't find it and Bugzilla is no longer available, so I can't look to my bugs there. Maybe some of you will have luck. IIRC it was closed in favour of having one single locale/collation over all WMF wikis, however, the reasoning was not strong IIRC and I would be in favour of and strongly encouraging allowing setting of collation by os.setlocale() (or our own function, maybe tied to mw.language library if the original function would be considered insecure) so it would be easy to run sorts naturally without necessity to write own collation rules, which might be very difficult (i.e. Czech has two-level sorting).

@Anomie maybe this could answer your questions a bit?

Doesn't help at all, sorry.

Scribunto modules must operate in "C" locale by default

Please read the discussion on Gerrit change 49084; in short, we can't allow os.setlocale because in LuaSandbox mode that will screw up the locale being used by PHP too, possibly even in different threads. And since we can't allow it in LuaSandbox, we can't allow it in LuaStandalone either because we want the two to behave the same.

Similarly, we can't set a different locale for different wikis (since they're all served by the same apache processes), and trying to do it for Lua code only is prohibitively complex.

so it would be easy to run sorts naturally without necessity to write own collation rules, which might be very difficult (i.e. Czech has two-level sorting).

Someone certainly could write a library (as part of mw.language or separately) that exposed appropriate methods that could be used for sorting. You might want to base it on includes/Collation.php. But that would be a matter for a different task, and won't manage to change the string comparison operators or the like.

Change 251280 had a related patch set uploaded (by Anomie):
Change default $wgShellLocale to C.UTF-8

https://gerrit.wikimedia.org/r/251280

Could this be deployed, please?

Could this be deployed, please?

https://gerrit.wikimedia.org/r/#/c/251280/ has a -1 and needs discussion...

@Anomie Could you please take a look on those inline comments from @tstarling? Thanks.

Change 251280 merged by jenkins-bot:
[mediawiki/core@master] Change default $wgShellLocale to C.UTF-8, and use it to set LC_ALL

https://gerrit.wikimedia.org/r/251280

Change 351871 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Scribunto@master] Call wfInitShellLocale() from Scribunto_LuaSandboxInterpreter

https://gerrit.wikimedia.org/r/351871

It doesn't seem like that change will be enough, since wfInitShellLocale() isn't necessarily going to be called. Live-hacking the relevant pieces of the patch onto mwdebug1001 didn't result in a changed locale or changed behavior from Scribunto until I added a wfInitShellLocale() call into Setup.php.

Or we could put the call into Scribunto_LuaSandboxInterpreter if we don't want to put it in Setup.php.

Change 352861 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Call wfInitShellLocale() from Setup.php

https://gerrit.wikimedia.org/r/352861

Change 353228 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/puppet@production] For HHVM set LANG=C.UTF-8

https://gerrit.wikimedia.org/r/353228

Change 351871 abandoned by Anomie:
Call wfInitShellLocale() from Scribunto_LuaSandboxInterpreter

Reason:
Doing I02943803d26d5b1b3ac00ef9216f69cdfa149585 instead

https://gerrit.wikimedia.org/r/351871

Change 352861 merged by jenkins-bot:
[mediawiki/core@master] Apply $wgShellLocale in Setup.php

https://gerrit.wikimedia.org/r/352861

Anomie claimed this task.

The performance optimization https://gerrit.wikimedia.org/r/353228 is still pending, but the task itself is resolved now.

This may be worth a user notice in case modules subtly break; I'll tag it with User-notice and people can remove the tag if they disagree.

An appropriate announcement might be "String comparisons in Scribunto modules are now consistently done case-insensitively by byte order. Previously they were sometimes ordered according to a case-sensitive US-English collation." Note that it was always a US-English collation on WMF wikis, even on wikis for other languages.

BTW, if people want correct collation in Scribunto, the relevant task is probably T152845.

Change 353228 merged by Giuseppe Lavagetto:
[operations/puppet@production] For HHVM set LANG=C.UTF-8

https://gerrit.wikimedia.org/r/353228

The patch has been merged, but hhvm restarts will not be performed but on the mwdebug servers first, and the canaries later. The change will be rolled out by the weekly hhvm restarts anyways, but we can accelerate the process once our smoke-test confirms it's not creating issues.

Mentioned in SAL (#wikimedia-operations) [2017-10-04T09:09:53Z] <_joe_> restarting hhvm on mwdebug*, T107128

Mentioned in SAL (#wikimedia-operations) [2017-10-04T09:54:50Z] <_joe_> restarting hhvm on canaries (both appservers and canaries) T107128

It is worth noticing that some non-WMF wikis need to set a locale other than en_US.utf8 to allow for the proper interaction with data stored in external sources.

An example would be using Extension:Data Transfer to import data from CSV or XML files. Having an incorrect locale may cause the imported data to be mis-decoded, thus the text corrupted.

Perhaps other extensions and use-cases are affected too.

Chances are the code accessing data from external sources should be managing the encoding issues directly rather than relying on the process's locale setting.