
API can exceed GenderCache's miss limit, thus returning titles with the default-gendered namespace prefix
Open, Low, Public

Description

Russian Wikipedia has two standard forms of the User: namespace: Участник: for male users and Участница: for female users. If you pass a page name with the wrong gender to the MediaWiki API, it is normalized into the correct form. But if you send a request that returns page names, those page names are not guaranteed to come back in the correct (standard) form. So a bot that gets a list of user page names from the API and then, for instance, asks the API for the categories of those pages will probably fail while parsing the result: it assumes that page names coming from the API are always normalized, so it does not check the normalized section. My bot that sorts unreviewed files in Russian Wikipedia (https://github.com/Facenapalm/NapalmBot/blob/master/scripts/sort_unrev_files.py) has hit this problem.

Unfortunately, I can't illustrate this with a simple example, because the problem is somewhat random and I can't catch it with small requests. I hit it with a complicated request that uses a generator, two continuations, and more than 90000 results. Here it is:

https://ru.wikipedia.org/w/api.php?action=query&prop=fileusage&fulimit=5000&generator=unreviewedpages&gurnamespace=6&gurfilterredir=nonredirects&gurlimit=5000&format=json&formatverison=2

In my case, one of the 90000 returned pages was Участник:Miss Amber. I then asked for its categories with a request like the following and failed to find this page in the response:

https://ru.wikipedia.org/w/api.php?action=query&prop=categories&crlimit=500&titles=Участник:Miss%20Amber&format=json&formatversion=2

The problem appeared between June 29 and July 6. My bot runs this script every Friday; it ran successfully on June 29 and has failed every week since.

I've already updated my bot so that it now parses the normalized section to pick up corrected page names. But the expected behaviour is "the API should always return results in normalized form", and that is still broken.
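As a minimal sketch of the workaround described above (not the bot's actual code), the normalized section of a query response can be turned into a lookup table from the title as sent to the title as returned. The sample response below is hypothetical and only mimics the formatversion=2 shape; the helper names build_title_map and resolve_title are made up for illustration.

```python
def build_title_map(response):
    """Map each title as it was sent to the title the API reports back.

    Only titles that the API had to normalize appear in the map.
    """
    query = response.get("query", {})
    # Each entry of "normalized" carries the original ("from") and the
    # corrected ("to") title.
    return {entry["from"]: entry["to"] for entry in query.get("normalized", [])}


def resolve_title(title_map, title):
    """Return the normalized title, or the input if nothing was normalized."""
    return title_map.get(title, title)


# Hypothetical response for titles=Участник:Miss Amber (illustrative data only).
sample_response = {
    "query": {
        "normalized": [
            {
                "fromencoded": False,
                "from": "Участник:Miss Amber",
                "to": "Участница:Miss Amber",
            }
        ],
        "pages": [
            {"ns": 2, "title": "Участница:Miss Amber", "categories": []}
        ],
    }
}

title_map = build_title_map(sample_response)
pages_by_title = {p["title"]: p for p in sample_response["query"]["pages"]}

# Look the page up under the title the API actually used.
page = pages_by_title[resolve_title(title_map, "Участник:Miss Amber")]
```

Looking pages up through resolve_title() keeps the client working whether or not the title it sent was already in canonical form.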

Event Timeline

Restricted Application added a subscriber: Aklapper. (Jul 24 2018, 12:03 AM)
Facenapalm renamed this task from "MediaWiki API not always returns pagenames in standart form" to "MediaWiki API not always returns pagenames in standard form". (Jul 24 2018, 12:11 AM)
Facenapalm updated the task description. (Jul 24 2018, 12:13 AM)
Anomie changed the task status from Open to Stalled.
Anomie added a subscriber: Anomie.

You'll have to figure out how to give a specific simple reproduction case, rather than linking queries with limits of 5000 that might have failed a month ago. Everything I tried to reproduce this returns "Участница:Miss Amber" for the relevant page.

Checking the normalized section is the proper way to handle the case where non-normalized input might have been given.

> that might have failed a month ago

It has failed (on different pages) every time I tried it in the last month: three times on Toolforge and about 10 times on my PC while I was debugging my script. None of those runs was successful. This is not a problem that happened once a long time ago. I can't reproduce it with smaller requests, but this request always returns several pages in the incorrect form. Maybe the size of the response, the use of a generator, or the FlaggedRevs extension is the cause; I don't know.

> Checking the normalized section is the proper way to handle the case where non-normalized input might have been given.

Sure, if you're talking about user input or something like that. But in this case I get data from the API and pass it back to the API without any changes, so I expect no normalization to be needed.

Anomie changed the task status from Stalled to Open. (Jul 24 2018, 2:43 PM)
Anomie moved this task from Needs details or plan to Needs Code on the MediaWiki-API board.
Anomie triaged this task as Low priority.

I think I worked it out. GenderCache stops doing its thing after 1000 cache misses, so with limit=5000 if you have enough user pages in the mix from prop=fileusage it'll stop resolving genders partway through.
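The effect can be illustrated with a toy model. This is a simplified Python sketch, not MediaWiki's GenderCache: the class ToyGenderCache, its fields, and the small limit of 3 are assumptions chosen only to show how lookups fall back to the default gender once the miss limit is reached.

```python
class ToyGenderCache:
    """Toy model of a gender cache with a miss limit (not MediaWiki code)."""

    def __init__(self, stored_genders, miss_limit=1000, default="unknown"):
        self.stored_genders = stored_genders  # stands in for the database
        self.miss_limit = miss_limit
        self.default = default
        self.cache = {}
        self.misses = 0

    def get_gender_of(self, user):
        if user not in self.cache:
            if self.misses >= self.miss_limit:
                # Limit reached: stop querying and report the default gender.
                return self.default
            self.misses += 1
            self.cache[user] = self.stored_genders.get(user, self.default)
        return self.cache[user]


def user_page_title(cache, user):
    """Build a user page title with the gendered namespace prefix."""
    prefix = {"female": "Участница", "male": "Участник", "unknown": "Участник"}
    return f"{prefix[cache.get_gender_of(user)]}:{user}"


# Five hypothetical users, all stored as female, but only 3 misses allowed.
stored = {f"User{i}": "female" for i in range(5)}
cache = ToyGenderCache(stored, miss_limit=3)
titles = [user_page_title(cache, f"User{i}") for i in range(5)]
```

In this model the first three titles get the female prefix, while the last two fall back to the default prefix even though the stored gender is female, which matches the symptom reported for large result sets.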

A possible solution:

  • Add a needsQuery( $title ) method to GenderCache, which returns true if the title passed is in a namespace with gender distinction and it's not already cached.
  • Make some changes to ApiResult:
    • public static function setTitleInfo( array &$arr, LinkTarget $title, $prefix = '' ) mostly works just like ApiQueryBase::addTitleInfo().
      • If GenderCache->needsQuery( $title ) is false, it does exactly like ApiQueryBase::addTitleInfo().
      • If it's true, it populates the fields forcing the 'unknown'-gender namespace prefix and sets a new metadata item indicating the title field needs gender transformation.
    • public function ApiResult::addTitleInfo( $path, LinkTarget $title, $prefix = '' ) will use $this->path() then call self::setTitleInfo(), like most of the other "add" functions.
    • ApiResult::applyTransformations() will somehow or other use that metadata item to collect all the titles, pass them through GenderCache->doTitlesArray() or the like, then replace every flagged title with the genderized version.
  • Deprecate ApiQueryBase::addTitleInfo() in favor of ApiResult::setTitleInfo(), reimplementing it to just call that method.
  • Unit testing should involve overriding GenderCache with a version with missLimit = 0 or the like. Languages 'aln' and 'mwl' have three-way gender distinction for NS_USER, or maybe it's easier to just use $wgExtraGenderNamespaces.
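
The flag-then-transform idea in the list above can be sketched in Python as follows. This is only an illustration of the control flow under simplifying assumptions (user namespace only, a dict standing in for the cache, a hypothetical "_needs_gender" metadata key, hypothetical user names); it is not MediaWiki's ApiResult code. Titles whose gender is not cached are written with the 'unknown' prefix and flagged, and a single batched lookup at the end rewrites every flagged title.

```python
GENDERED_PREFIX = {"male": "Участник", "female": "Участница", "unknown": "Участник"}


def set_title_info(entry, user, cached_genders):
    """Fill entry["title"]; flag the entry when the gender is not cached."""
    gender = cached_genders.get(user)
    if gender is None:
        # Not cached: use the 'unknown' prefix for now and leave a marker.
        entry["title"] = f"{GENDERED_PREFIX['unknown']}:{user}"
        entry["_needs_gender"] = user
    else:
        entry["title"] = f"{GENDERED_PREFIX[gender]}:{user}"


def apply_transformations(entries, batch_lookup):
    """Resolve all flagged titles with one batched lookup, then rewrite them."""
    pending = [e for e in entries if "_needs_gender" in e]
    genders = batch_lookup({e["_needs_gender"] for e in pending})
    for e in pending:
        user = e.pop("_needs_gender")
        e["title"] = f"{GENDERED_PREFIX[genders.get(user, 'unknown')]}:{user}"


stored = {"Miss Amber": "female", "Example": "male"}
entries = [{}, {}]
set_title_info(entries[0], "Miss Amber", cached_genders={})
set_title_info(entries[1], "Example", cached_genders={"Example": "male"})
apply_transformations(
    entries, lambda users: {u: stored[u] for u in users if u in stored}
)
```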

Anomie renamed this task from "MediaWiki API not always returns pagenames in standard form" to "API can exceed GenderCache's miss limit, thus returning titles with the default-gendered namespace prefix". (Jul 24 2018, 2:44 PM)

The magic transformation sounds good, because then the cache does not need to be filled in each API module. But for a quick win it seems easier to fill the cache in fileusage: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/462122/

Change 462122 had a related patch set uploaded (by Umherirrender; owner: Umherirrender):
[mediawiki/core@master] Fill GenderCache for used pages in action=query&prop=fileusage

https://gerrit.wikimedia.org/r/462122
