
Allow action=purge to recalculate the number of pages/subcats/files in a category
Open, Lowest, Public

Description

Let's allow action=purge to recalculate the number of pages/subcats/files in a category. We continuously get reports about these counts being incorrect, even for categories with as few as ~300 pages.

Adding/removing pages to/from categories only increments/decrements the counts, never recalculating them from scratch, so as time goes on they drift further and further out of date. Curiously, there isn't even a maintenance script to rebuild them; they are only reset if a count becomes negative (so obviously incorrect) and sometimes when there are <200 pages.

Currently the only way to force a recount is to manually remove the relevant row from the 'category' table. That's suboptimal.

This would need to be limited to categories with, say, fewer than 5000 entries; for pathological cases like Category:All stub articles on en.wp with nearly two million pages, the query (see Category::refreshCounts()) takes several minutes to run (I tried on labs and it took 6m18s to count 1,877,668 pages). I also tried Category:Year of birth unknown and it took only 4.5s for 20,125 pages, so only doing this for categories where the current count is <5000 should be safe?
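For illustration, the proposed gate could be sketched as follows. This is a hypothetical sketch, not MediaWiki code (the real check would live in the PHP purge handler), and it gates on the *stored* count, which is the cheap value available before recounting:

```python
# Hypothetical sketch of the proposed behaviour: on action=purge, only
# recount categories whose stored count is below a threshold, so
# pathological multi-million-row categories are skipped.
REFRESH_THRESHOLD = 5000

def should_refresh_counts(stored_total: int, threshold: int = REFRESH_THRESHOLD) -> bool:
    """Gate the expensive COUNT(*) recount on the cheap stored total."""
    return stored_total < threshold

print(should_refresh_counts(300))        # True: small, often-wrong categories
print(should_refresh_counts(1_877_668))  # False: e.g. All stub articles
```

Note that because the gate reads the stored count rather than the real one, a category whose stored count is wrongly tiny would still be recounted even if it is actually huge.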

See also:

Event Timeline

matmarex created this task.Jan 2 2015, 12:45 PM
matmarex raised the priority of this task from to Needs Triage.
matmarex updated the task description. (Show Details)
matmarex added a project: MediaWiki-Categories.
matmarex updated the task description. (Show Details)
matmarex set Security to None.
valhallasw updated the task description. (Show Details)Jan 2 2015, 12:49 PM

Well, there's a maintenance script to rebuild the counts, apparently: populateCategory.php

Krenair added a subscriber: Krenair.Jan 2 2015, 3:19 PM
Aklapper triaged this task as Lowest priority.Jan 5 2015, 12:33 PM
matmarex updated the task description. (Show Details)Nov 9 2015, 2:29 PM
Ankry added a subscriber: Ankry.Dec 11 2015, 7:56 PM

so only doing this for categories where the current count is <5000 should be safe?

IMO, yes, it should be safe given that limit, and only recounting on purge, not on read (although I guess @jcrespo is the person to ask).

There is the pathological case where the category contains 5 billion entries but the category table says 1; that would be bad, but it seems extraordinarily unlikely.

The only thing to watch out for is, I think, that the recount method gets a shared lock on all the categorylinks entries for that category. I'm unsure how that affects performance.

I do not think I am very useful here. I do not think counting in the foreground is a good idea; having the purge action create a background job that asynchronously counts them would be better, and it would allow for a larger limit if using the "slow" slave.

Aside from that, I would test it to see how long it takes with real examples, and set the limit accordingly. I can do that, but others could too. My only fear would be allowing spamming of the purge action.

the recount method gets a shared lock on all the categorylinks entries for that category

That is worrying: with InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.

aaron added a subscriber: aaron.Dec 11 2015, 9:50 PM

I do not think I am very useful here. I do not think counting in the foreground is a good idea; having the purge action create a background job that asynchronously counts them would be better, and it would allow for a larger limit if using the "slow" slave.
Aside from that, I would test it to see how long it takes with real examples, and set the limit accordingly. I can do that, but others could too. My only fear would be allowing spamming of the purge action.

the recount method gets a shared lock on all the categorylinks entries for that category

That is worrying: with InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.

Without locking, pages being added to and removed from the category while the SELECT gathers the new counts will not be counted correctly.

Without locking, pages being added to and removed from the category while the SELECT gathers the new counts will not be counted correctly.

I disagree in the general case, but it all depends on the definition of "counting" and on the actual problem to be solved, which I am not familiar with. Maybe SERIALIZABLE is the right approach there; I do not know.

the recount method gets a shared lock on all the categorylinks entries for that category

That is worrying: with InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.

Without locking, pages being added to and removed from the category while the SELECT gathers the new counts will not be counted correctly.

There is a way to address this concern, and I already have a patch (not quite ready for review, so not uploaded yet) that does the following:

  1. Acquires a named lock on the category to prevent concurrent count refreshes (normal links updates are unaffected)
  2. Ensures that subsequent queries will not read a stale snapshot (one taken before the lock was acquired)
  3. Counts all categorylinks rows of each type, and retrieves the old set of counts, in a single query (thus using a consistent snapshot, regardless of READ COMMITTED vs. REPEATABLE READ, assuming MVCC)
  4. If old and new counts are not the same, applies the necessary change(s) as difference(s) between old and new counts
  5. Releases the named lock
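A minimal sketch of steps 3 and 4 (the diff-based update), using SQLite stand-ins for the categorylinks and category tables. The named lock and snapshot handling of steps 1, 2 and 5 are omitted, and the sketch assumes (as in MediaWiki's schema) that cat_pages counts subcats and files as well; it is an illustration, not the actual patch:

```python
import sqlite3

def refresh_counts(conn, cat_name):
    """Recompute a category's counters from categorylinks and apply only
    the differences (steps 3-4 above; lock/snapshot handling omitted)."""
    cur = conn.cursor()
    # Step 3: count rows of each cl_type in a single query, so all three
    # counts come from one consistent snapshot.
    cur.execute(
        "SELECT cl_type, COUNT(*) FROM categorylinks "
        "WHERE cl_to = ? GROUP BY cl_type", (cat_name,))
    new = {"page": 0, "subcat": 0, "file": 0}
    new.update(dict(cur.fetchall()))
    cur.execute(
        "SELECT cat_pages, cat_subcats, cat_files FROM category "
        "WHERE cat_title = ?", (cat_name,))
    old_pages, old_subcats, old_files = cur.fetchone()
    # cat_pages conventionally includes subcats and files.
    new_pages = new["page"] + new["subcat"] + new["file"]
    # Step 4: apply only the deltas, and only if something changed, so
    # concurrent increments/decrements are not clobbered.
    if (new_pages, new["subcat"], new["file"]) != (old_pages, old_subcats, old_files):
        cur.execute(
            "UPDATE category SET cat_pages = cat_pages + ?, "
            "cat_subcats = cat_subcats + ?, cat_files = cat_files + ? "
            "WHERE cat_title = ?",
            (new_pages - old_pages, new["subcat"] - old_subcats,
             new["file"] - old_files, cat_name))
    conn.commit()
```

The key design point is that the UPDATE adds differences rather than overwriting with absolute values, which is what makes it tolerant of concurrent normal links updates.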

Here's an analogy:

You are a professor who teaches in a gigantic lecture hall. You notice that a clock high on the wall reads 11:00, but it is actually 11:05, according to your wristwatch, which is synchronized to a national time standard. You call maintenance and ask them to fix the time on the clock. At 12:05, a maintenance worker arrives and sets up his slow and rather noisy lift. Unfortunately, the worker's watch is broken, so he needs some help.

New method: You compare the clock's time (12:00) with your wristwatch's time (12:05). Because the clock's time is five minutes earlier than your wristwatch's time, you tell the worker to advance the clock by five minutes, making sure no other worker arrives until he is done. He takes one minute to go up in the lift, then adjusts the clock (to 12:06), then goes back down.

Current method: You stop time, because otherwise the time might change (after you check your wristwatch, but before the clock is adjusted). You check your wristwatch, then tell the worker to set the clock to 12:05. He starts going up in the lift. One minute later, he reaches the clock, though because you had stopped time, it is still 12:05. You only restart time once the worker is done adjusting the clock.

This would need to be limited to categories with, say, fewer than 5000 entries; for pathological cases like Category:All stub articles on en.wp with nearly two million pages, the query (see Category::refreshCounts()) takes several minutes to run (I tried on labs and it took 6m18s to count 1,877,668 pages). I also tried Category:Year of birth unknown and it took only 4.5s for 20,125 pages, so only doing this for categories where the current count is <5000 should be safe?

Currently, the query joins against the page table, checking page_namespace to determine what is already available as cl_type. The following query (a simplified version of that from my patch):

select cl_type, count(*) from categorylinks where cl_to = 'All_stub_articles' group by cl_type;

when run against an October 2015 dump of enwiki, takes only 1.35 sec on my desktop PC (running MariaDB 5.5.47).
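To illustrate why the join can be dropped: cl_type already encodes what page_namespace would tell us (NS_FILE = 6, NS_CATEGORY = 14), so the two query shapes agree. Here is a toy SQLite sketch; the join-based query only approximates the real Category::refreshCounts() SQL:

```python
import sqlite3

# Toy schema: cl_type duplicates what page_namespace encodes
# (namespace 14 = category -> 'subcat', 6 = file -> 'file', else 'page').
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INT PRIMARY KEY, page_namespace INT);
CREATE TABLE categorylinks (cl_from INT, cl_to TEXT, cl_type TEXT);
INSERT INTO page VALUES (1, 0), (2, 0), (3, 14), (4, 6);
INSERT INTO categorylinks VALUES
  (1, 'All_stub_articles', 'page'),
  (2, 'All_stub_articles', 'page'),
  (3, 'All_stub_articles', 'subcat'),
  (4, 'All_stub_articles', 'file');
""")

# Join-based counts, roughly how refreshCounts() derives them from
# page_namespace (approximation, not the exact MediaWiki query):
with_join = conn.execute("""
  SELECT SUM(page_namespace NOT IN (6, 14)),
         SUM(page_namespace = 14),
         SUM(page_namespace = 6)
  FROM categorylinks JOIN page ON cl_from = page_id
  WHERE cl_to = 'All_stub_articles'
""").fetchone()

# Join-free counts using the denormalized cl_type column:
by_type = dict(conn.execute("""
  SELECT cl_type, COUNT(*) FROM categorylinks
  WHERE cl_to = 'All_stub_articles' GROUP BY cl_type
""").fetchall())
without_join = (by_type.get('page', 0), by_type.get('subcat', 0),
                by_type.get('file', 0))

print(with_join == without_join)
```

The join-free variant scans only the categorylinks index on cl_to, which is why it is so much faster on large categories.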

However, I first need to work on improving the consistency between the two fields, at the very least by making it possible for users to fix inconsistencies themselves (e.g. by purging the affected pages, or the affected category).