
Refactor Category::refreshCounts logic to a job and simplify
Open, High, Public

Description

Background

Previous incidents and performance issues:

Misc bugs:

Status quo

As of https://gerrit.wikimedia.org/r/506032 we now have four ways of updating category counts:

1. If a non-locking master read says the stale count is zero, we do a full recount.

This is used after an edit to a page, for the categories that were in the previous revision, but not in the new one. (From LinksUpdate, via WikiPage::updateCategoryCounts).

2. If a non-locking master read says the stale count is <= 200, we do a full recount.

This is used for a category after its category description page is deleted.

3. If no row exists yet, or it appears corrupt, we do a full recount.

This can happen through any of the following scenarios:

  • Reading a category page.
  • Viewing "Page information" (action=info) for a category page.
  • Parsing wikitext containing {{pagesincategory}}.
  • Viewing search results on Special:Search for a match that is a category page.
  • UploadWizard/ApiQueryAllCampaigns for querying the file count from a campaign's category.

This is triggered whenever one of these methods is called on a Category object: getPageCount(), getSubcatCount(), getFileCount(), or getTitle(). These all go through Category->initialize( Category::LAZY_INIT_ROW ).

4. Relative increments/decrements (including creation/deletion of the row)

From WikiPage::updateCategoryCounts after edits for categories associated with that page.
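The four paths above can be condensed into a single decision function. This is a hedged sketch in Python rather than MediaWiki PHP; the function name, parameters, and threshold constant are illustrative stand-ins, not real MediaWiki APIs:

```python
# Illustrative model of the four update paths described above.
# Names here do not correspond to MediaWiki's actual code.

RECOUNT_THRESHOLD = 200  # path 2 only recounts "small" categories


def decide_action(path, stale_count=None, row_exists=True):
    """Return 'recount', 'delta', or 'none' for a given update path."""
    if path == 1:   # LinksUpdate: category removed from a page
        return 'recount' if stale_count == 0 else 'none'
    if path == 2:   # category description page deleted
        return 'recount' if stale_count <= RECOUNT_THRESHOLD else 'none'
    if path == 3:   # lazy read (page view, {{pagesincategory}}, ...)
        return 'recount' if not row_exists else 'none'
    if path == 4:   # relative increment/decrement after an edit
        return 'delta'
    raise ValueError(f"unknown path {path}")
```

Seen this way, paths 1–3 are all "maybe do a full recount, depending on a staleness heuristic", which is what the proposal below wants to unify.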

Proposal 1

I'd like to re-explore whether we still need use case 3. At least in theory, it shouldn't be needed. If we can validate that relatively easily, I would propose removing it in favour of logging a warning with a stack trace, so that we can find out why it happens and whether it is preventable.

Alternatively, if it cannot be prevented within reason (e.g. too costly or impossible to get right given scale requirements), then I suggest we move it to a job, and have use cases 1, 2, and 3 be reduced to queuing a job that takes care of things.

  • Document and/or reference from the code how case 2 is possible.
    • If rare/unlikely:
      • Consider removing in favour of a manual recount admins can trigger via purge of the category page.
    • If common and not easily preventable:
      • Move to job queue as a "validate recount", emit log warning if result turned out different.
  • Determine whether case 3 is still probable.
    • If so:
      • Move refresh logic (recount, auto-create, auto-delete) to a job and queue that for case 1, 2, and 3.
    • If not:
      • Replace the recount with a log warning for cases 1 and 3.
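The job-based consolidation in the checklist might look roughly like this. This is a hypothetical Python sketch; `RefreshCategoryCountsJob` and the queue helpers are stand-ins, not real MediaWiki classes:

```python
from collections import deque


class RefreshCategoryCountsJob:
    """Hypothetical job: fully recounts one category, creating or
    deleting its row as needed, and logs when the stored count had
    drifted (the 'validate recount' idea above)."""

    def __init__(self, category):
        self.category = category

    def run(self, actual_counts, stored_counts, log):
        actual = actual_counts.get(self.category, 0)
        stored = stored_counts.get(self.category)
        if stored is not None and stored != actual:
            # Emit a warning so we can learn why counts drifted.
            log.append(f"{self.category}: stored={stored} actual={actual}")
        if actual == 0:
            stored_counts.pop(self.category, None)  # auto-delete empty row
        else:
            stored_counts[self.category] = actual   # auto-create / correct


def queue_refresh(queue, queued, category):
    # Cases 1-3 would all reduce to this: one deduplicated job per category,
    # so N triggers for the same category cost a single recount.
    if category not in queued:
        queued.add(category)
        queue.append(RefreshCategoryCountsJob(category))
```

The deduplication is the key property: however many code paths decide a recount might be needed, only one recount per category actually runs.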

Proposal 2

After this is done and we have confidence in its logic, we may want to consider removing the refreshing cron job (ref T299823), especially once users are able to purge ad hoc (ref T85696).

Related Objects

Event Timeline

aaron triaged this task as Low priority. Jun 6 2019, 10:45 AM
Anomie raised the priority of this task from Low to Medium. Mar 24 2020, 9:00 PM

Same bug? On hu:Kategória:Tudományos egyértelműsítő lapok, there are 189 pages and one subcategory listed; however, its parent category hu:Kategória:Egyértelműsítő lapok lists it as having 190 pages and zero subcategories. The subcategory hu:Kategória:Biológiai egyértelműsítő lapok was first accidentally created in the main namespace as hu:Biológiai egyértelműsítő lapok, then moved to the category namespace, then deleted and recreated in the category space.

Just to be clear, the expectation of Wikipedians is that maintenance categories are up to date with low delay. The remaining categories can be less accurate, as long as each change to a categorylink eventually updates the count. So roughly 10 categories should be really well up to date, the rest not. This probably should have been said a month ago.

FWIW, this system has caused an incident where several million articles were not editable because this process locked 3M rows in pagelinks. Fixing that would improve our resilience.

FWIW, this system has caused an incident where several million articles were not editable because this process locked 3M rows in pagelinks. Fixing that would improve our resilience.

T352628 for reference

Questions from @Bmueller:

  • What are the dependencies (code, and/or people)?
    • "Categories" are a fairly small feature. It builds on general platform concepts like JobQueue, DeferredUpdates, and utility functions, but I don't think this task requires changes to other components or feature behaviours. As such, while not exactly standalone, I'd say it has no code dependencies that we need to be mindful of in this context. In terms of teams, the feature is unowned. The task does not require making behaviour changes besides obvious bug fixes, so no controversial decisions or indirect product impact expected there.
  • What is the impact of this?
    • Sustainability: the current implementation is needlessly complex and redundant, and yet still seems to regularly produce inaccurate counts. It is prone to production errors and load problems, e.g. T352628: Fatal DBTransactionSizeError during category update via "Lock wait timeout exceeded" from WikiPage::lockAndGetLatest, per @Ladsgroup above.
    • Product experience: The most prominent use of category counts, where the number itself is important/sensitive, is Wikipedia maintenance categories and similar "meta" categories on other wikis. The counts are often used to power templates, gadgets, and bots that need to know when a particular kind of issue affects more than 0 articles. The problem is typically that the category count is zero when it shouldn't be, or above zero when it should be zero. To clarify: the problem is not, say, a category containing 100+ items with a count that is off by a few; that would not be a big problem. We have ~15 duplicate bug reports about this at the moment, dating back several years.
  • What would it cost to fix? TBD.

Might involve:

  • Investigate the two hypotheses in the task description, thinking through the way this is triggered and confirming whether a simpler approach would indeed produce an equal-or-more-reliable outcome compared to today.
  • Come up with steps to reproduce the issue (preferably on a plain local install, but possibly in prod if specific to certain factors).
  • Understand why it continues to happen today.
  • Re-investigate T352628 to understand why it caused deadlocks, and consider removing the workarounds added there if the simpler Job-based approach is believed to not need them.
  • Look at how we solved user_editcount increments/decrements, and see if there is something we can learn from that. Share on-task why or why not.
  • Look at how we do compute+cache BetaFeatures usage counts (highlighted as good example in https://wikitech.wikimedia.org/wiki/MediaWiki_Engineering/Guides/Backend_performance_practices). If not a good example for this, share learnings on-task why not.
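One pattern worth comparing against when looking at the increment/decrement path (path 4 above) is a relative update that clamps at zero and treats any attempted underflow as a signal of drift. This is an illustrative Python sketch, not MediaWiki code, and whether clamping is the right behaviour is exactly the kind of question the investigation should answer:

```python
def apply_delta(stored_counts, category, delta):
    """Apply a relative increment/decrement to a category count.

    Clamping at zero avoids visibly negative counts, but it also
    hides drift -- which is why a periodic validate-recount (or a
    logged warning here) would still be useful."""
    new = stored_counts.get(category, 0) + delta
    if new < 0:
        new = 0  # drift detected: a recount job could be queued here
    stored_counts[category] = new
    return new
```

A decrement on a missing or already-zero row is precisely the situation where the stored count and reality have diverged, so this is a natural hook for the "log a warning with stack trace" idea from the proposal.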

Perhaps 1 month (real time) for myself and 1 person from the team, where we both spend ~4 hours a week for 2-3 weeks (i.e. not as main priority), plus 1 extra week buffer for comms/follow-ups.

Afaik T365303 has not changed this task. That task deals with an urgent performance issue in a job (RefreshLinks, I assume), and moved a contentious part of that job (category member counts) to a dedicated new job.

This task is about the bigger picture of category counts often being left in an inaccurate state that doesn't self-correct, and there being half a dozen ways and places where we (need to, try to) recount things. This includes, for example, category page views calling Category->refreshCountsIfSmall, and various scenarios lazily calling Category->refreshCounts from a DeferredUpdate.

I suppose the timing of things could make race conditions or bugs less likely. Do we have a specific theory as to how certain counts became inaccurate and how T365303 prevents that?

Afaik T365303 has not changed this task. That task deals with an urgent performance issue in a job (RefreshLinks, I assume), and moved a contentious part of that job (category member counts) to a dedicated new job.

It was something a bit different. It wasn't contention caused by the jobs; many edits happened to move a lot of pages from one category to another, and the deferred updates running right after each edit caused the contention.

Now the category count update doesn't happen after an edit as a deferred update, but as a job.
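To illustrate why moving this to a job can help with contention: per-category jobs let N edits touching the same category collapse into one net update, instead of N deferred updates each contending on the same category row. This is an illustrative Python sketch of the batching idea, not the actual job implementation:

```python
from collections import Counter


def batch_category_updates(edits):
    """Collapse a burst of (category, delta) edits into one net
    delta per category, so each category row is written once
    rather than once per edit."""
    deltas = Counter()
    for category, delta in edits:
        deltas[category] += delta
    return dict(deltas)
```

With deferred updates, each of those edits would have issued its own row update inside its own request, which is the lock pileup described in T352628.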

This task is about the bigger picture of category counts often being left in an inaccurate state that doesn't self-correct, and there being half a dozen ways and places where we (need to, try to) recount things. This includes, for example, category page views calling Category->refreshCountsIfSmall, and various scenarios lazily calling Category->refreshCounts from a DeferredUpdate.

Okay, it's different then. But I think it helps, since you can outsource the update to the job.

I suppose the timing of things could make race conditions or bugs less likely. Do we have a specific theory as to how certain counts became inaccurate and how T365303 prevents that?

It could make things better in some ways: it's no longer bound to all deferred updates succeeding. But jobs can also fail for various reasons, so it's hard to say.