CentralAuth API list=globalallusers should capitalize the first letter
Open, Needs Triage · Public

Description

These two requests do not return the same results:
https://meta.wikimedia.org/w/api.php?action=query&list=globalallusers&agufrom=d
and
https://meta.wikimedia.org/w/api.php?action=query&list=globalallusers&agufrom=D

The former is incorrect since usernames that begin with a lowercase letter are bugs. To resolve this, the API should internally uppercase the first letter (by using User::getCanonicalName()).

The only alternative is to have every client uppercase the first letter before sending over the request.

This should also be applied to any other endpoint where you can specify the username.
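For reference, the client-side alternative mentioned above might look roughly like the following sketch. It assumes Python's built-in `str.upper()` is an acceptable stand-in for MediaWiki's own uppercasing, which it is not in every case (for example, Python turns "ß" into "SS" and knows nothing about the wiki's content language). The helper names `ucfirst` and `globalallusers_url` are hypothetical.

```python
from urllib.parse import urlencode

API = "https://meta.wikimedia.org/w/api.php"


def ucfirst(name: str) -> str:
    """Uppercase only the first character (language-neutral approximation)."""
    return name[:1].upper() + name[1:]


def globalallusers_url(start: str) -> str:
    """Build a list=globalallusers request that starts at the uppercased name."""
    params = {
        "action": "query",
        "list": "globalallusers",
        "agufrom": ucfirst(start),
    }
    return f"{API}?{urlencode(params)}"


# Both spellings now produce the same request.
url_lower = globalallusers_url("d")
url_upper = globalallusers_url("D")
```

Every client would need its own copy of this logic, which is the duplication the task is trying to avoid.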

Workaround
T180084: Interaction Timeline V1: The first character of usernames should not be case sensitive

Restricted Application added a subscriber: Aklapper. Nov 17 2017, 11:28 PM

Change 392176 had a related patch set uploaded (by Niharika29; owner: Niharika29):
[mediawiki/extensions/CentralAuth@master] Turn lowercase initial letter in usernames to uppercase before querying

https://gerrit.wikimedia.org/r/392176

Anomie added a subscriber: Anomie. Nov 20 2017, 4:03 PM

If this is done, it should be done everywhere relevant and not just in globalallusers. The only place in core that I know of is ApiQueryAllUsers.

As for whether this should be done, I see arguments either way. None seem particularly compelling to me; it comes down to DWIM versus breaking weirdly in some rare edge cases.

For:

  • It's more likely to be DWIM.
  • We do do this sort of normalization for title parts in modules like ApiQueryAllPages, despite the possibility of the "Against" issues there. OTOH, it could as well be argued that those are buggy too.
  • We did it for user names in ApiQueryAllUsers from 2007 until 2012, when it was lost in the fix for T35602. Again, though, it could be argued that that was a bug fix.

Against:

  • There's the possibility of strange behavior in the cases where these "invalid" names with lowercase first letters are in the database. Besides bugs, that can happen when the Unicode version is updated, as for example the case of user ɱ.
  • The values being considered here are positions in the space of all possible names, not necessarily valid names themselves. Particularly as the end point of a range, a lowercase letter may make more sense than having to figure out the last possible name beginning with the previous character (e.g. "a" instead of "` followed by 63 U+10FFFF followed by U+07FF", assuming we never increase the 255-byte hard limit on usernames).
  • Uppercasing in MediaWiki varies by language, e.g. uppercase of i is I for most languages, but for Kazakh, Azerbaijani, Karakalpak, and Turkish it's İ instead. That could be unexpected, or completely expected, depending. Especially on multilingual sites where it's going to use the content language rather than the user language.
    • Note the sorting is always in Unicode order, not by any localized collation. That's a restriction from the database layer.
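The "Against" points can be checked with a small sketch. It assumes that Python's string comparison (which is by code point) matches the Unicode-order sort described above, and it uses a hypothetical helper `ucfirst_for_language` with the language codes tr, az, kk, and kaa to stand in for MediaWiki's language-dependent uppercasing.

```python
# Code point order puts uppercase ASCII first, then "`", then lowercase ASCII.
order_holds = "Z" < "`" < "a"

# The range end point from the example above: "`", 63 x U+10FFFF, then U+07FF.
last_before_a = "`" + "\U0010FFFF" * 63 + "\u07FF"
# UTF-8 sizes: 1 byte + 63 * 4 bytes + 2 bytes.
byte_len = len(last_before_a.encode("utf-8"))
sorts_before_a = last_before_a < "a"

# Languages where the uppercase of "i" is the dotted capital I (U+0130).
DOTTED_I_LANGS = {"tr", "az", "kk", "kaa"}


def ucfirst_for_language(name: str, lang: str) -> str:
    """Hypothetical language-aware ucfirst covering only the dotted-I case."""
    if name[:1] == "i" and lang in DOTTED_I_LANGS:
        return "\u0130" + name[1:]
    return name[:1].upper() + name[1:]


english = ucfirst_for_language("iazak", "en")  # starts with U+0049
turkish = ucfirst_for_language("iazak", "tr")  # starts with U+0130
```

The two results differ in their first character, so the same `agufrom` value would map to different positions in the sort order depending on the content language.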

As a compromise, we could have a flag to disable the behavior. But I think given T180084, the expected behavior is DWIM and I don't think each client should have to re-implement the "proper" behavior.

From my product manager's perspective, I agree. But if this ticket is declined, I would greatly appreciate guidance on how to emulate this functionality for our user-facing products.

Tgr added a subscriber: Tgr. Nov 29 2017, 10:49 PM

I think what you really want here is a new API parameter for limiting usernames by a prefix. If I type in AAA as a username and get BBB as an autosuggest result, I would probably be surprised. You can discard mismatching results on the client side, but 1) as you said each client shouldn't reimplement the proper behavior, 2) it's nontrivial due to the language issues mentioned in T180858#3774735 (if I type in iaz and get back İazak, that's actually a valid result in a Turkish language context, but the client would probably mistakenly throw it away). It's better to let the API figure out what the correct results are.
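To illustrate the second point, here is a minimal sketch of the kind of client-side filter a caller might write. The function `naive_prefix_filter` is hypothetical, and the result list is made up for the example; the sketch relies only on Python's default, language-neutral case mapping.

```python
def naive_prefix_filter(typed: str, names: list[str]) -> list[str]:
    """Keep names whose lowercased form starts with the lowercased input."""
    wanted = typed.lower()
    return [name for name in names if name.lower().startswith(wanted)]


# Made-up API results for the typed prefix "iaz".
results = ["İazak", "Iazak", "Bbb"]
kept = naive_prefix_filter("iaz", results)
```

Python lowercases İ (U+0130) to "i" followed by a combining dot (U+0307), so "İazak" no longer starts with "iaz" and is dropped, even though it would be a valid match in a Turkish context.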

Also, that way aufrom/gaufrom would stay as a continuation parameter that accepts any value and thus avoids the invalid username corner cases Anomie mentioned.

Who would be the most appropriate person to build such an API, and what is the best way to get it on their radar?

I think what you really want here is a new API parameter for limiting usernames by a prefix.

aguprefix already exists.

Tgr added a comment. Nov 30 2017, 7:57 PM

So maybe only canonicalize that and declare agufrom/aguto to be continuation parameters and leave them as is? Uppercasing a continuation parameter can be really bad (in the worst case it can send clients into an infinite loop, given that uppercase characters can have lower Unicode code points than their lowercase versions), and even if a new continuation parameter were introduced, as you recommend on the patch, existing clients use agufrom/aguto for paging already. OTOH the prefix not working right for invalid usernames does not seem like a big deal.

Anything using continuation properly is going to use whatever the module returns, so it would only be clients doing manual paging that would be affected.

Tgr added a comment. Nov 30 2017, 8:14 PM

I don't think that's the case. Try something like https://meta.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=globalallusers&agufrom=%60%EF%B9%8F%E7%89%99%E6%80%A1 , the response will have

"continue": {
    "agufrom": "apfeldieb",
    "continue": "-||"
}

If the client feeds that back faithfully, and the API canonicalizes it, it will loop back to Apfeldieb which is probably a few million users earlier.
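The loop can be reproduced with a toy model. This is only a simulation under stated assumptions: `list_users` is a hypothetical in-memory stand-in for the API, the name list is a tiny made-up sample sorted by code point, and canonicalization is approximated by uppercasing the first character.

```python
from bisect import bisect_left

# Tiny made-up user table, sorted by code point like the real listing.
NAMES = sorted(["Apfeldieb", "Bob", "Zed", "`name", "apfeldieb", "ɱ"])


def list_users(start: str, limit: int = 2, canonicalize: bool = False):
    """Return one page of names plus the value to continue from (or None)."""
    if canonicalize and start:
        start = start[0].upper() + start[1:]
    index = bisect_left(NAMES, start)
    page = NAMES[index:index + limit]
    more = index + limit < len(NAMES)
    return page, (NAMES[index + limit] if more else None)


def crawl(canonicalize: bool, max_requests: int = 20):
    """Follow continuation faithfully; give up after max_requests."""
    seen, start = [], ""
    for _ in range(max_requests):
        page, next_start = list_users(start, canonicalize=canonicalize)
        seen.extend(page)
        if next_start is None:
            return seen, True   # reached the end of the list
        start = next_start
    return seen, False          # never finished: stuck in a loop


plain_names, plain_done = crawl(canonicalize=False)
canon_names, canon_done = crawl(canonicalize=True)
```

Without canonicalization the crawl visits every name once. With it, the continuation value "apfeldieb" is turned into "Apfeldieb", the crawl jumps back to the start, and it never reaches "apfeldieb" or anything after it.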

Tgr added a comment. Nov 30 2017, 8:16 PM

Or did you mean that clients properly handling continuation will use the new parameter? That's true, but I wouldn't be surprised if there were still pre-new-continuation-style clients around.

I mean that if the code is changed to use a new continuation parameter instead of agufrom, then the response will have

"continue": {
    "agucontinue": "apfeldieb",
    "continue": "-||"
}

Any client that isn't totally broken with respect to continuation would feed that back faithfully and things would just work. Any client that somehow manages to use agufrom instead is already broken.

And it works the same way for the old-style continuation: they'd start seeing

"query-continue": {
    "globalallusers": {
        "agucontinue": "apfeldieb"
    }
},

and would therefore feed agucontinue back just as new-style continuation users would.
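As a rough sketch of what "feeding it back faithfully" means for a new-style client: the loop below merges whatever the `continue` block contains into the next request without inspecting it. `query_all` and `fake_fetch` are hypothetical names, `fake_fetch` replaces the real HTTP call, the response shape is an assumption, and `agucontinue` is the parameter proposed in this thread rather than one that is known to exist.

```python
def query_all(fetch, params):
    """Yield each response, passing the `continue` block back unchanged."""
    request = dict(params)
    while True:
        response = fetch(request)
        yield response
        cont = response.get("continue")
        if not cont:
            return
        # Merge continuation values over the original parameters as-is;
        # the client never needs to know which parameter names are used.
        request = {**params, **cont}


def fake_fetch(request):
    """Stub standing in for the API: two pages, linked by agucontinue."""
    if request.get("agucontinue") == "apfeldieb":
        return {"query": {"globalallusers": [{"name": "apfeldieb"}]}}
    return {
        "continue": {"agucontinue": "apfeldieb", "continue": "-||"},
        "query": {"globalallusers": [{"name": "`﹏牙怡"}]},
    }


BASE = {"action": "query", "list": "globalallusers", "agufrom": "`﹏牙怡"}
names = [
    user["name"]
    for page in query_all(fake_fetch, BASE)
    for user in page["query"]["globalallusers"]
]
```

Because the client copies the block verbatim, switching the module from agufrom to a dedicated continuation parameter would not require any client change.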

Who would be the most appropriate person to build such an API, and what is the best way to get it on their radar?

What's being suggested is adding this functionality (adding a new param) to the existing API. I have a patch but I submitted that when this was supposedly a one-liner task. Would @dbarratt be interested in taking over the patch and making the changes being suggested?

dbarratt updated the task description. Feb 14 2018, 8:43 PM
TBolliger moved this task from Backlog to Defects on the InteractionTimeline board. Mar 1 2018, 5:02 PM