
Refactor Category::refreshCounts logic to a job and simplify
Open, Medium, Public

Description

As of https://gerrit.wikimedia.org/r/506032 we now have four ways of updating category counts:

1. If a non-locking master read says the stale count is zero, we do a full recount.

This is used after an edit to a page, for the categories that were in the previous revision, but not in the new one. (From LinksUpdate, via WikiPage::updateCategoryCounts).

2. If a non-locking master read says the stale count is <= 200, we do a full recount.

This is used for a category after its category description page is deleted.

3. If no row exists yet, or it appears corrupt, we do a full recount.

This can happen through any of the following scenarios:

  • Reading a category page.
  • Viewing "Page information" (action=info) for a category page.
  • Parsing wikitext containing {{pagesincategory}}.
  • Viewing search results on Special:Search for a match that is a category page.
  • UploadWizard/ApiQueryAllCampaigns for querying the file count from a campaign's category.

This is triggered whenever one of these methods is called on a Category object: getPageCount(), getSubcatCount(), getFileCount(), or getTitle(). This then uses the path via Category->initialize( Category::LAZY_INIT_ROW ).
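
As a hedged illustration of that lazy-init path (the real implementation is PHP in Category.php; the function shape and the exact corruption check below are assumptions, not the actual code):

```python
def initialize(row, refresh_counts):
    """Sketch of the Category::LAZY_INIT_ROW path: return (pages, subcats,
    files), falling back to a full recount when the row is missing or
    looks corrupt."""
    if row is None:
        # Case 3: no category row exists yet.
        return refresh_counts()
    pages, subcats, files = row
    # Assumed corruption check: a negative count, or a total smaller than
    # its subcategory + file breakdown (cat_pages includes both).
    if min(pages, subcats, files) < 0 or pages < subcats + files:
        return refresh_counts()
    return pages, subcats, files
```

Under this sketch, every read path listed above funnels through one function, which is why any of those read contexts can trigger a recount.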

4. Relative increments/decrements (including creation/deletion of the row)

From WikiPage::updateCategoryCounts after edits for categories associated with that page.
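
Taken together, the trigger conditions for cases 1 and 2 can be sketched as follows (Python for illustration only; the real code is PHP, and the helper name and flags here are hypothetical):

```python
RECOUNT_THRESHOLD = 200  # case 2 threshold from the description

def needs_full_recount(stale_count, *, removed_from_page=False, page_deleted=False):
    """Decide whether a stale (non-locking master read) count warrants
    a full recount."""
    if removed_from_page and stale_count == 0:
        return True  # case 1: category removed from a page, count hit zero
    if page_deleted and stale_count <= RECOUNT_THRESHOLD:
        return True  # case 2: category description page was deleted
    return False
```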


I'd like to re-explore whether we still need use case 3. It seems to me that, at least in theory, it shouldn't be needed. If we can validate that relatively easily, I would propose we remove it in favour of logging a warning with a stack trace, so that we can find out why it happens and whether it is preventable.

Alternatively, if it cannot be prevented within reason (e.g. too costly or impossible to get right given scale requirements), then I suggest we move the recount to a job and reduce use cases 1, 2, and 3 to queuing that job.
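
A minimal sketch of that consolidation, assuming a deduplicating queue (names are illustrative; MediaWiki's actual JobQueue and its deduplication work differently):

```python
class CategoryRefreshQueue:
    """Cases 1-3 all reduce to pushing the category name; duplicate pushes
    collapse into a single pending refresh job."""

    def __init__(self):
        self._pending = set()

    def push(self, category):
        # A set makes repeated pushes for the same category idempotent.
        self._pending.add(category)

    def run_all(self, recount):
        """Run one authoritative recount per pending category."""
        results = {name: recount(name) for name in sorted(self._pending)}
        self._pending.clear()
        return results
```

Pushing the same category from several code paths in one request would then cost a single recount when the job runs.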

  • Document and/or reference from the code how case 2 is possible.
    • If rare/unlikely:
      • Consider removing in favour of a manual recount admins can trigger via purge of the category page.
    • If common and not easily preventable:
      • Move to the job queue as a "validate recount" job, and emit a log warning if the result turns out different.
  • Determine whether case 3 is still probable.
    • If so:
      • Move refresh logic (recount, auto-create, auto-delete) to a job and queue that for case 1, 2, and 3.
    • If not:
      • Replace recount with a log warning from case 1 and 3.
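
The "validate recount" idea above could look roughly like this (hedged sketch; the logging mechanism and signature are assumptions):

```python
def validate_recount(stored_count, recount, log_warning):
    """Recount authoritatively; warn (for later investigation) if the stored
    value disagreed, and return the corrected count either way."""
    actual = recount()
    if actual != stored_count:
        log_warning(
            f"category count mismatch: stored={stored_count} actual={actual}"
        )
    return actual
```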

After this is done and we have confidence in its logic, we may want to consider removing the refreshing cronjob (ref T299823), especially once users are able to trigger an ad-hoc purge (ref T85696).

Related Objects

Event Timeline


The problem persists. Any plans to have this resolved? Any chance to trigger the recount manually?

I moved T224321 in the hierarchy of tasks so that the current task would be its parent. @Superyetkin that is essentially what you are asking for. @Krinkle I think it would be nice to start over in a clean state; this can help with identifying the root causes of mismatches, whether before or after #3 is refactored.

aaron triaged this task as Low priority.Jun 6 2019, 10:45 AM
Anomie raised the priority of this task from Low to Medium.Mar 24 2020, 9:00 PM

Same bug? On hu:Kategória:Tudományos egyértelműsítő lapok, there are 189 pages and one subcategory listed, however, its parent category hu:Kategória:Egyértelműsítő lapok lists it as having 190 pages and zero subcategories. The subcategory hu:Kategória:Biológiai egyértelműsítő lapok was first accidentally created in the main namespace as hu:Biológiai egyértelműsítő lapok, then moved to the category namespace, then deleted and recreated in the category space.

Just to be clear, the expectation of Wikipedians is that maintenance categories are up to date with low delay. The remaining categories can be less accurate, as long as each change to a categorylink changes the count eventually. So, around 10 categories should be really well up to date, and the rest need not be. This probably should have been said a month ago.

FWIW, this system has caused an incident where several million articles were not editable because this process locked 3M rows in pagelinks. Fixing that would improve our resilience.

T352628 for reference

Questions from @Bmueller:

  • What are the dependencies (code, and/or people)?
    • "Categories" are a fairly small feature. It builds on general platform concepts like JobQueue, DeferredUpdates, and utility functions, but I don't think this task requires changes to other components or feature behaviours. As such, while not exactly standalone, I'd say it has no code dependencies that we need to be mindful of in this context. In terms of teams, the feature is unowned. The task does not require making behaviour changes besides obvious bug fixes, so no controversial decisions or indirect product impact expected there.
  • What is the impact of this?
    • Sustainability: the current implementation is needlessly complex and redundant, and yet still seems to regularly produce inaccurate counts. It is prone to production errors and load problems, e.g. T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError", per @Ladsgroup above.
    • Product experience: The most prominent use of category counts, where the number itself is important/sensitive, is Wikipedia maintenance categories and similar "meta" categories on other wikis. The counts are often used to power templates, gadgets, and bots, which need to know when a particular kind of issue affects more than 0 articles. The problem is usually that the count is either zero when it shouldn't be, or above zero when it should be zero. To clarify: the problem is not, e.g., a category containing 100+ items with the number off by a few; that would not be a big problem. We have ~15 duplicate bug reports about this at the moment, dating back several years.
  • What would it cost to fix? TBD.

Might involve:

  • Investigate the two hypotheses in the task description, thinking through how this is triggered and confirming whether a simpler approach would indeed produce an equal-or-more-reliable outcome compared to today.
  • Come up with steps to reproduce the issue (preferably on a plain local install, but possibly in prod if specific to certain factors).
  • Understand why it continues to happen today.
  • Re-investigate T352628 to understand why it caused deadlocks, and consider removing the workarounds added there if the simpler Job-based approach is believed to not need them.
  • Look at how we solved user_editcount increments/decrements, and see if there is something we can learn from that. Share on-task why or why not.
  • Look at how we do compute+cache BetaFeatures usage counts (highlighted as good example in https://wikitech.wikimedia.org/wiki/MediaWiki_Engineering/Guides/Backend_performance_practices). If not a good example for this, share learnings on-task why not.

Perhaps 1 month (real time) for myself and 1 person from the team, where we both spend ~4 hours a week for 2-3 weeks (i.e. not as main priority), plus 1 extra week buffer for comms/follow-ups.