
Allow action=purge to recalculate the number of pages/subcats/files in a category
Open, LowPublic

Description

Let's allow action=purge to recalculate the number of pages/subcats/files in a category. We continuously get reports about these counts being incorrect, even for categories with as few as ~300 pages.

Adding/removing pages to/from categories only increments/decrements the counts, never recalculating them from scratch, so as time goes on they get more and more out of date. Curiously, there isn't even a maintenance script to rebuild them; they are only reset if a count becomes negative (and is therefore obviously incorrect), and sometimes when there are fewer than 200 pages.

Currently the only way to force a recount is to manually remove the relevant row from the 'category' table. That's suboptimal.
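For reference, the manual workaround amounts to something like the following (a sketch against MediaWiki's schema; 'Some_category' is a placeholder, and the exact recreation behavior depends on the MediaWiki version):

```sql
-- Hypothetical example: drop the cached counts for one category.
-- MediaWiki recreates the row, with freshly calculated counts, the
-- next time the counts are needed.
DELETE FROM category
WHERE cat_title = 'Some_category';
```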

This would need to be limited to categories with, say, fewer than 5000 entries; for pathological cases like Category:All stub articles on en.wp with nearly two million pages, the query (see Category::refreshCounts()) takes several minutes to run (I tried on labs and it took 6m18s to count 1,877,668 pages). I also tried Category:Year of birth unknown and it took only 4.5s for 20,125 pages, so only doing this for categories where the current count is <5000 should be safe?

Related Objects

Event Timeline

matmarex raised the priority of this task to Needs Triage.
matmarex updated the task description.
matmarex added a project: MediaWiki-Categories.
matmarex updated the task description.
matmarex set Security to None.

Well, there's a maintenance script to rebuild the counts, apparently: populateCategory.php

Aklapper triaged this task as Lowest priority. Jan 5 2015, 12:33 PM

> so only doing this for categories where the current count is <5000 should be safe?

IMO, yes, it should be safe given that limit, and only recounting on purge, not on read (although I guess @jcrespo is the person to ask).

There is the pathological case where the category contains 5 billion entries but the category table says 1; that would be bad, but it seems extraordinarily unlikely.

The only thing to watch out for, I think, is that the recount method gets a shared lock on all the categorylinks entries for that category. I'm unsure how that affects performance.
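The locking query being discussed is roughly of this shape (a sketch, not the exact statement from Category::refreshCounts(); 'Some_category' is a placeholder, and the real query also joins the page table):

```sql
-- Counting while taking a shared (S) lock on every matching row:
-- concurrent transactions can still read these rows, but cannot
-- modify them until this transaction commits.
SELECT cl_type, COUNT(*)
FROM categorylinks
WHERE cl_to = 'Some_category'
GROUP BY cl_type
LOCK IN SHARE MODE;
```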

I do not think I am very useful here. I do not think counting in the foreground is a good idea; a purge action creating a background job that asynchronously counts them would be better, and it would allow for a larger limit if using the "slow" slave.

Aside from that, I would test it to see how long it takes with real examples, and set the limit accordingly. I can do that, but others could too. My only fear would be allowing spamming of the purge action.

> the recount method gets a shared lock on all the categorylinks entries for that category

That is worrying. With InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.

> I do not think I am very useful here. I do not think counting in the foreground is a good idea; a purge action creating a background job that asynchronously counts them would be better, and it would allow for a larger limit if using the "slow" slave.
>
> Aside from that, I would test it to see how long it takes with real examples, and set the limit accordingly. I can do that, but others could too. My only fear would be allowing spamming of the purge action.
>
>> the recount method gets a shared lock on all the categorylinks entries for that category
>
> That is worrying. With InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.

Without locking, pages getting added and removed from the category while the SELECT gets the new counts will not be counted correctly.

> Without locking, pages getting added and removed from the category while the SELECT gets the new counts will not be counted correctly.

I disagree in the general case, but it all depends on the definition of "counting", and on what the actual problem to be solved is, which I am not familiar with. Maybe SERIALIZABLE is the right approach there, I do not know.

>>> the recount method gets a shared lock on all the categorylinks entries for that category
>>
>> That is worrying. With InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.
>
> Without locking, pages getting added and removed from the category while the SELECT gets the new counts will not be counted correctly.

There is a way to address this concern, and I already have a patch (not quite ready for review, so not uploaded yet) that does the following:

  1. Acquires a named lock on the category to prevent concurrent counts refresh updates (normal links updates are unaffected)
  2. Ensures that subsequent queries will not read a stale snapshot (one taken before the lock was acquired)
  3. Counts all categorylinks rows of each type, and retrieves the old set of counts, in a single query (thus using a consistent snapshot, regardless of READ COMMITTED vs. REPEATABLE READ, assuming MVCC)
  4. If old and new counts are not the same, applies the necessary change(s) as difference(s) between old and new counts
  5. Releases the named lock
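The steps above might look roughly like this in raw SQL (a sketch only: the table and column names are MediaWiki's, but the lock name and 'Some_category' are placeholders, GET_LOCK/RELEASE_LOCK are MySQL/MariaDB-specific, and the actual patch would go through MediaWiki's database abstraction rather than literal statements):

```sql
-- 1. Named lock: only one counts refresh per category at a time;
--    ordinary incremental links updates are not blocked by this.
SELECT GET_LOCK('category-refresh:Some_category', 5);

-- 3. Old counts and fresh counts from one statement, i.e. from one
--    consistent MVCC snapshot, regardless of isolation level.
SELECT cat_pages, cat_subcats, cat_files,
       SUM(cl_type = 'page')   AS fresh_pages,
       SUM(cl_type = 'subcat') AS fresh_subcats,
       SUM(cl_type = 'file')   AS fresh_files
FROM category
LEFT JOIN categorylinks ON cl_to = cat_title
WHERE cat_title = 'Some_category'
GROUP BY cat_pages, cat_subcats, cat_files;

-- 4. Apply only the *difference* between old and fresh counts, so
--    concurrent incremental updates made since the snapshot survive
--    (values shown are illustrative).
UPDATE category
SET cat_pages   = cat_pages   + (4998 - 5000),
    cat_subcats = cat_subcats + (12 - 12),
    cat_files   = cat_files   + (3 - 2)
WHERE cat_title = 'Some_category';

-- 5. Release the named lock.
SELECT RELEASE_LOCK('category-refresh:Some_category');
```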

Here's an analogy:

You are a professor who teaches in a gigantic lecture hall. You notice that a clock high on the wall reads 11:00, but it is actually 11:05, according to your wristwatch, which is synchronized to a national time standard. You call maintenance and ask them to fix the time on the clock. At 12:05, a maintenance worker arrives and sets up his slow and rather noisy lift. Unfortunately, the worker's watch is broken, so he needs some help.

New method: You compare the clock's time (12:00) with your wristwatch's time (12:05). Because the clock's time is five minutes earlier than your wristwatch's time, you tell the worker to advance the clock by five minutes, making sure no other worker arrives until he is done. He takes one minute to go up in the lift, then adjusts the clock (to 12:06), then goes back down.

Current method: You stop time, because otherwise the time might change (after you check your wristwatch, but before the clock is adjusted). You check your wristwatch, then tell the worker to set the clock to 12:05. He starts going up in the lift. One minute later, he reaches the clock, though because you had stopped time, it is still 12:05. You only restart time once the worker is done adjusting the clock.

> This would need to be limited to categories with, say, fewer than 5000 entries; for pathological cases like Category:All stub articles on en.wp with nearly two million pages, the query (see Category::refreshCounts()) takes several minutes to run (I tried on labs and it took 6m18s to count 1,877,668 pages). I also tried Category:Year of birth unknown and it took only 4.5s for 20,125 pages, so only doing this for categories where the current count is <5000 should be safe?

Currently, the query joins against the page table, checking page_namespace to determine what is already available as cl_type. The following query (a simplified version of that from my patch):

select cl_type, count(*) from categorylinks where cl_to = 'All_stub_articles' group by cl_type;

when run against an October 2015 dump of enwiki, takes only 1.35 sec on my desktop PC (running MariaDB 5.5.47).

However, I first need to work on improving the consistency between the two fields, at the very least by making it possible for users to fix inconsistencies themselves (e.g. by purging the affected pages, or the affected category).

SUPPORT. Re-evaluate the category, including the sort order.

This is still a problem: too many or too few, off-by-one or off-by-many, even negative (-12 is the record so far).

A problem probably related to the wrong counts is that the sort order in categories does not update without a "null edit" on the affected pages in the category ("action=purge" does NOT help; deleting and undeleting the cat page does NOT help).

There are tracking categories that are supposed to be empty, or categories supposed to contain only subcategories, but no pages. Well, sometimes they contain ghost entries that cannot be removed, not even with enhanced sysop rights. This is more than a cosmetic issue.

This seems to be (almost) the last such task not yet closed as a "duplicate", so getting it "fixed" would be nice.

Taylor raised the priority of this task from Lowest to High. May 22 2021, 9:09 PM
Taylor updated the task description.
Aklapper lowered the priority of this task from High to Low. May 22 2021, 10:27 PM

Hi @Taylor, since you increased the priority of this task, do you plan to work on fixing it?
You are welcome to increase the priority and set yourself as task assignee if you plan to work on fixing this, as the Priority field summarizes and reflects reality rather than causing it. Thanks for your understanding!

IMHO the priority of this task should be increased, given the large number of complaints and duplicate tasks about the very same problem.

Possible fixes:

  • a) run the script now on all public WMF wikis (fixes the currently broken categories, but will not prevent the wrong counts from reappearing)
  • b) run the script now, and re-run it regularly or at least occasionally, or establish a way to request running it on a particular wiki
  • c) add a check against ZERO before decrementing the counter, making it impossible for the counts to become negative (this removes the most prominent and most silly symptom of the bug, particularly if a) is performed subsequently, but will not fix falsely positive counts for empty categories; less useful alone, sufficiently useful together with b))
  • d) allow "action=purge" to recalculate the number of pages/subcats/files in a category (very useful: lets users fix the categories that bother them)
  • e) fix the flawed logic causing this bug (see T85696#2079552 by "PleaseStand" above; the most difficult strategy?) and then perform a)
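Option c) is a small change in principle. A hedged sketch of the decrement guard in SQL ('Some_category' is a placeholder, GREATEST is MySQL/MariaDB, and the real change would go through MediaWiki's update helpers):

```sql
-- Clamp the decrement so the cached count can never go below zero.
-- Note: this does not fix a wrongly positive count; it only prevents
-- the impossible negative ones.
UPDATE category
SET cat_pages = GREATEST(cat_pages - 1, 0)
WHERE cat_title = 'Some_category';
```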