
Refactor Category::refreshCounts logic to a job and simplify
Open, High, Public

Description

Background

Previous incidents and performance issues:

Misc bugs:

Status quo

As of https://gerrit.wikimedia.org/r/506032 we now have four ways of updating category counts:

1. If a non-locking master read says the stale count is zero, we do a full recount.

This is used after an edit to a page, for the categories that were in the previous revision, but not in the new one. (From LinksUpdate, via WikiPage::updateCategoryCounts).

2. If a non-locking master read says the stale count is <= 200, we do a full recount.

This is used for a category after its category description page is deleted.

3. If no row exists yet, or it appears corrupt, we do a full recount.

This can happen through any of the following scenarios:

  • Reading a category page.
  • Viewing "Page information" (action=info) for a category page.
  • Parsing wikitext containing {{pagesincategory}}.
  • Viewing search results on Special:Search for a match that is a category page.
  • UploadWizard/ApiQueryAllCampaigns for querying the file count from a campaign's category.

This is triggered whenever one of these methods is called on a Category object: getPageCount(), getSubcatCount(), getFileCount(), or getTitle(). These all go through Category->initialize( Category::LAZY_INIT_ROW ).

4. Relative increments/decrements (including creation/deletion of the row)

From WikiPage::updateCategoryCounts after edits for categories associated with that page.
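The four paths above can be condensed into a single decision function. This is a hedged sketch in Python rather than MediaWiki PHP; the function name, parameters, and threshold constant are illustrative stand-ins, not real MediaWiki APIs:

```python
# Illustrative model of the four update paths described above.
# Names here do not correspond to MediaWiki's actual code.

RECOUNT_THRESHOLD = 200  # path 2 only recounts "small" categories


def decide_action(path, stale_count=None, row_exists=True):
    """Return 'recount', 'delta', or 'none' for a given update path."""
    if path == 1:   # LinksUpdate: category removed from a page
        return 'recount' if stale_count == 0 else 'none'
    if path == 2:   # category description page deleted
        return 'recount' if stale_count <= RECOUNT_THRESHOLD else 'none'
    if path == 3:   # lazy read (page view, {{pagesincategory}}, ...)
        return 'recount' if not row_exists else 'none'
    if path == 4:   # relative increment/decrement after an edit
        return 'delta'
    raise ValueError(f"unknown path {path}")
```

Seen this way, paths 1–3 are all "maybe do a full recount, depending on a staleness heuristic", which is what the proposal below wants to unify.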

Proposal 1

I'd like to re-explore whether we still need use case 3. At least in theory, it shouldn't be needed. If we can validate that relatively easily, I would propose removing it in favour of logging a warning with a stack trace, so that we can find out why it happens and whether it is preventable.

Alternatively, if it cannot be prevented within reason (e.g. too costly or impossible to get right given scale requirements), then I suggest we move it to a job, and have use cases 1, 2, and 3 be reduced to queuing a job that takes care of things.

  • Document and/or reference from the code how case 2 is possible.
    • If rare/unlikely:
      • Consider removing in favour of a manual recount admins can trigger via purge of the category page.
    • If common and not easily preventable:
      • Move to job queue as a "validate recount", emit log warning if result turned out different.
  • Determine whether case 3 is still probable.
    • If so:
      • Move refresh logic (recount, auto-create, auto-delete) to a job and queue that for case 1, 2, and 3.
    • If not:
      • Replace the recount with a log warning for cases 1 and 3.
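The job-based consolidation in the checklist might look roughly like this. This is a hypothetical Python sketch; `RefreshCategoryCountsJob` and the queue helpers are stand-ins, not real MediaWiki classes:

```python
from collections import deque


class RefreshCategoryCountsJob:
    """Hypothetical job: fully recounts one category, creating or
    deleting its row as needed, and logs when the stored count had
    drifted (the 'validate recount' idea above)."""

    def __init__(self, category):
        self.category = category

    def run(self, actual_counts, stored_counts, log):
        actual = actual_counts.get(self.category, 0)
        stored = stored_counts.get(self.category)
        if stored is not None and stored != actual:
            # Emit a warning so we can learn why counts drifted.
            log.append(f"{self.category}: stored={stored} actual={actual}")
        if actual == 0:
            stored_counts.pop(self.category, None)  # auto-delete empty row
        else:
            stored_counts[self.category] = actual   # auto-create / correct


def queue_refresh(queue, queued, category):
    # Cases 1-3 would all reduce to this: one deduplicated job per category,
    # so N triggers for the same category cost a single recount.
    if category not in queued:
        queued.add(category)
        queue.append(RefreshCategoryCountsJob(category))
```

The deduplication is the key property: however many code paths decide a recount might be needed, only one recount per category actually runs.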

Proposal 2

After this is done and we have confidence in its logic, we may want to consider removing the refreshing cron job (ref T299823), especially once users are able to purge ad hoc (ref T85696).

Related Objects

Event Timeline

aaron triaged this task as Low priority. Jun 6 2019, 10:45 AM
Anomie raised the priority of this task from Low to Medium. Mar 24 2020, 9:00 PM

Same bug? On hu:Kategória:Tudományos egyértelműsítő lapok, there are 189 pages and one subcategory listed; however, its parent category hu:Kategória:Egyértelműsítő lapok lists it as having 190 pages and zero subcategories. The subcategory hu:Kategória:Biológiai egyértelműsítő lapok was first accidentally created in the main namespace as hu:Biológiai egyértelműsítő lapok, then moved to the category namespace, then deleted and recreated in the category space.

Just to be clear, the expectation of Wikipedians is that maintenance categories are up to date with low delay. The remaining categories can be less accurate, as long as each change to a categorylink eventually updates the count. So roughly 10 categories should be really well up to date, the rest not. This probably should have been said a month ago.

FWIW, this system has caused an incident where several million articles were not editable because this process locked 3M rows in pagelinks. Fixing that would improve our resilience.

FWIW, this system has caused an incident where several million articles were not editable because this process locked 3M rows in pagelinks. Fixing that would improve our resilience.

T352628 for reference

Questions from @Bmueller:

  • What are the dependencies (code, and/or people)?
    • "Categories" are a fairly small feature. It builds on general platform concepts like JobQueue, DeferredUpdates, and utility functions, but I don't think this task requires changes to other components or feature behaviours. As such, while not exactly standalone, I'd say it has no code dependencies that we need to be mindful of in this context. In terms of teams, the feature is unowned. The task does not require making behaviour changes besides obvious bug fixes, so no controversial decisions or indirect product impact expected there.
  • What is the impact of this?
    • Sustainability: the current implementation is needlessly complex and redundant, and yet still seems to regularly produce inaccurate counts. It is prone to production errors and load problems, e.g. T352628: Fatal DBTransactionSizeError during category update via "Lock wait timeout exceeded" from WikiPage::lockAndGetLatest, per @Ladsgroup above.
    • Product experience: The most prominent use of category counts, where the number itself is important/sensitive, is Wikipedia maintenance categories and similar "meta" categories on other wikis. The counts are often used to power templates, gadgets, and bots that need to know when a particular kind of issue affects more than 0 articles. The problem is typically that the category count is zero when it shouldn't be, or above zero when it should be zero. To clarify: the problem is not, say, a category containing 100+ items with a count that is off by a few; that would not be a big problem. We have ~15 duplicate bug reports about this at the moment, dating back several years.
  • What would it cost to fix? TBD.

Might involve:

  • Investigate the two hypotheses in the task description, thinking through the way this is triggered and confirming whether a simpler approach would indeed produce an equal-or-more-reliable outcome compared to today.
  • Come up with steps to reproduce the issue (preferably on a plain local install, but possibly in prod if specific to certain factors).
  • Understand why it continues to happen today.
  • Re-investigate T352628 to understand why it caused deadlocks, and consider removing the workarounds added there if the simpler Job-based approach is believed to not need them.
  • Look at how we solved user_editcount increments/decrements, and see if there is something we can learn from that. Share on-task why or why not.
  • Look at how we do compute+cache BetaFeatures usage counts (highlighted as good example in https://wikitech.wikimedia.org/wiki/MediaWiki_Engineering/Guides/Backend_performance_practices). If not a good example for this, share learnings on-task why not.
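One pattern worth comparing against when looking at the increment/decrement path (path 4 above) is a relative update that clamps at zero and treats any attempted underflow as a signal of drift. This is an illustrative Python sketch, not MediaWiki code, and whether clamping is the right behaviour is exactly the kind of question the investigation should answer:

```python
def apply_delta(stored_counts, category, delta):
    """Apply a relative increment/decrement to a category count.

    Clamping at zero avoids visibly negative counts, but it also
    hides drift -- which is why a periodic validate-recount (or a
    logged warning here) would still be useful."""
    new = stored_counts.get(category, 0) + delta
    if new < 0:
        new = 0  # drift detected: a recount job could be queued here
    stored_counts[category] = new
    return new
```

A decrement on a missing or already-zero row is precisely the situation where the stored count and reality have diverged, so this is a natural hook for the "log a warning with stack trace" idea from the proposal.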

Perhaps 1 month (real time) for myself and 1 person from the team, where we both spend ~4 hours a week for 2-3 weeks (i.e. not as main priority), plus 1 extra week buffer for comms/follow-ups.

Afaik T365303 has not changed this task. That task deals with an urgent performance issue in a job (RefreshLinks, I assume), and moved a contentious part of that job (category member counts) to a dedicated new job.

This task is about the bigger picture of category counts often being left in an inaccurate state that doesn't self-correct, and there being half a dozen ways and places where we (need to, try to) recount things. This includes, for example, category page views calling Category->refreshCountsIfSmall, and various scenarios lazily calling Category->refreshCounts from a DeferredUpdate.

I suppose the timing of things could make race conditions or bugs less likely. Do we have a specific theory as to how certain counts became inaccurate and how T365303 prevents that?

Afaik T365303 has not changed this task. That task deals with an urgent performance issue in a job (RefreshLinks, I assume), and moved a contentious part of that job (category member counts) to a dedicated new job.

It was something a bit different. It wasn't contention caused by the jobs; many edits happened to move a lot of pages from one category to another, and the deferred updates running right after each edit caused the contention.

Now the category count update doesn't happen after an edit as a deferred update, but as a job.
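To illustrate why moving this to a job can help with contention: per-category jobs let N edits touching the same category collapse into one net update, instead of N deferred updates each contending on the same category row. This is an illustrative Python sketch of the batching idea, not the actual job implementation:

```python
from collections import Counter


def batch_category_updates(edits):
    """Collapse a burst of (category, delta) edits into one net
    delta per category, so each category row is written once
    rather than once per edit."""
    deltas = Counter()
    for category, delta in edits:
        deltas[category] += delta
    return dict(deltas)
```

With deferred updates, each of those edits would have issued its own row update inside its own request, which is the lock pileup described in T352628.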

This task is about the bigger picture of category counts often being left in an inaccurate state that doesn't self-correct, and there being half a dozen ways and places where we (need to, try to) recount things. This includes, for example, category page views calling Category->refreshCountsIfSmall, and various scenarios lazily calling Category->refreshCounts from a DeferredUpdate.

Okay, it's different then. But I think it helps, since you can outsource the update to the job.

I suppose the timing of things could make race conditions or bugs less likely. Do we have a specific theory as to how certain counts became inaccurate and how T365303 prevents that?

It could make things better in some ways: it's no longer bound to all deferred updates succeeding. But jobs can also fail for various reasons, so it's hard to say.