
Ensure retrieval and storage of article metadata
Closed, Resolved · Public


Follow-up to a regression caused by T154719 / T199699, reported in T202815: [wmf.18] enwiki NPP page - no scroll

  • Improve hook coverage so that metadata is generated and saved on POST requests for all circumstances where items are added to the PageTriage queue
  • Allow graceful failure in the UI if metadata is missing
  • Add fallback mechanism to generate and save metadata via the job queue

Scenarios to handle

As far as I can see, the following scenarios need to be addressed as part of this task/patchset:

1. New article is created (POST), user is redirected to view the article (GET) and metadata is not in replica

This scenario was reported in T154719: PageTriage opens master connection on GET for ArticleMetadata cache misses. PageTriageHooks::isArticleNew() is called to determine whether the noindex,nofollow robot policy should be set. isArticleNew() is a bit of a misnomer, as is the field name creation_date, because we're looking at when the article was added to the PageTriage queue, not when it was created.

In any case, for this scenario the user has POSTed an article, so metadata is in the process of being compiled via a deferred update. They are then redirected to view the page, PageTriageHooks::isArticleNew() is called, and the metadata isn't in the replica yet.

Proposed solution

This request will not return any metadata for the new article. That's OK, because metadata isn't needed for the page view that follows page creation; the compiled metadata will be available in the replica shortly after this request.
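A minimal sketch of what that graceful degradation could look like in PageTriageHooks::isArticleNew(); the getMetadataFromReplica() helper and the age threshold are assumptions for illustration, not what the patchset actually does:

```php
// Hypothetical sketch: tolerate a metadata cache miss on GET instead of
// falling back to a master connection. getMetadataFromReplica() is an
// assumed helper name, not the real PageTriage API.
public static function isArticleNew( $pageId ) {
	$metadata = self::getMetadataFromReplica( $pageId );
	if ( !isset( $metadata['creation_date'] ) ) {
		// Metadata hasn't replicated yet (e.g. right after page creation).
		// Treat the article as new; the deferred update will catch up.
		return true;
	}
	$ageInDays = ( time() - wfTimestamp( TS_UNIX, $metadata['creation_date'] ) ) / 86400;
	return $ageInDays <= 30; // threshold is illustrative
}
```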

Potential problems

The only problem I can see is that the robot policy for <meta name="robots" content="noindex,follow"/> would not be in the Varnish-cached copy of the page, and I'm not sure about the process for invalidating the Varnish cache for this page so that the robots policy can get set.
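For what it's worth, one possible invalidation mechanism (an assumption on my part, not something this patchset does) would be an explicit CDN purge once the metadata has been saved:

```php
// Assumption: purge the Varnish/CDN copy after metadata is saved, so the
// next request renders the page with the correct robots policy.
$title = Title::newFromID( $pageId );
if ( $title ) {
	$title->purgeSquid(); // sends PURGE requests for the page's URLs
}
```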

2. Move sandbox page to main namespace, then edit sandbox page to remove redirect

This is the scenario that caused PageTriage to break in T202815. Here's an example edit.

The page that had the redirect to the main namespace is in the PageTriage queue, so it is returned by ApiPageTriageStats via PageTriageUtil::getArticleFilterStat() and, most importantly, in ApiPageTriageList::execute() via $pages = self::getPageIds( $opts );. getPageIds() doesn't check whether an individual page has metadata, only whether it's in the queue. So ApiPageTriageList might return 15 results instead of 20, and the UI then concludes that there are no more articles to load.

Proposed solution
  • Additional coverage in the hooks implementation in the patchset for this task should ensure that metadata is compiled for the page.
  • ApiPageTriageList has been updated to add a warning to the response listing pages that don't have metadata; the JS has been updated to look for this warning in the response and still allow loading more results
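Roughly, the API side of that change could look like the following sketch; the getMetadataForPageIds() helper and the warning message key are assumptions for illustration, not the patchset's actual names:

```php
// Hypothetical sketch of the ApiPageTriageList change: surface queued
// pages that are missing metadata instead of silently dropping them.
$pages = self::getPageIds( $opts );
$metadata = ArticleMetadata::getMetadataForPageIds( $pages ); // assumed helper
$missing = array_diff( $pages, array_keys( $metadata ) );
if ( $missing ) {
	// The JS checks for this warning and keeps offering "load more".
	$this->addWarning( [
		'pagetriage-api-missing-metadata', // assumed message key
		implode( ', ', $missing )
	] );
}
```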
Potential problems


3. Rollback an edit

This one is documented in T202735: Prevent article metadata compilation on rollback actions

Proposed solution

Similar to what the patchset introduces for getMetadata() when the metadata can't be returned from the replica: in a POST request, wait for the replica, compile, then save to the DB immediately; in a GET request (as is the case with a rollback action), don't do any compilation, just queue a job to compile it later on.
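A sketch of that branching, assuming a compileAndSave() helper and a compileArticleMetadata job type, which are not necessarily what the patchset names them:

```php
// Hypothetical sketch: compile immediately on POST, defer to the job
// queue on GET (e.g. rollback). Helper and job names are assumptions.
if ( RequestContext::getMain()->getRequest()->wasPosted() ) {
	// POST: safe to wait for replication, compile, and save right away.
	self::compileAndSave( $pageId );
} else {
	// GET: no expensive work on a read request; queue a job instead.
	JobQueueGroup::singleton()->lazyPush(
		new JobSpecification( 'compileArticleMetadata', [ 'pageId' => $pageId ] )
	);
}
```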

Potential problems


Event Timeline

Change 455870 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/PageTriage@master] Ensure retrieval and storage of page metadata

Change 455870 merged by jenkins-bot:
[mediawiki/extensions/PageTriage@master] Ensure retrieval and storage of page metadata