
Allow action=purge to recalculate the number of pages/subcats/files in a category
Open, Lowest, Public

Description

Let's allow action=purge to recalculate the number of pages/subcats/files in a category. We continuously get reports about these counts being incorrect, even for categories with as few as ~300 pages.

Adding/removing pages to/from categories only increments/decrements the counts, never recalculating them from scratch, so as time goes on they drift further and further out of date. Curiously, there isn't even a maintenance script to rebuild them; they are only reset if a count becomes negative (so obviously incorrect) and sometimes when there are <200 pages.

Currently the only way to force a recount is to manually remove the relevant row from the 'category' table. That's suboptimal.

This would need to be limited to categories with, say, fewer than 5000 entries; for pathological cases like Category:All stub articles on en.wp with nearly two million pages, the query (see Category::refreshCounts()) takes several minutes to run (I tried on labs and it took 6m18s to count 1,877,668 pages). I also tried Category:Year of birth unknown and it took only 4.5s for 20,125 pages, so only doing this for categories where the current count is <5000 should be safe?
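For illustration, the proposed gate could be sketched as follows. This is a hypothetical sketch, not MediaWiki code (the real check would live in the PHP purge handler), and it gates on the *stored* count, which is the cheap value available before recounting:

```python
# Hypothetical sketch of the proposed behaviour: on action=purge, only
# recount categories whose stored count is below a threshold, so
# pathological multi-million-row categories are skipped.
REFRESH_THRESHOLD = 5000

def should_refresh_counts(stored_total: int, threshold: int = REFRESH_THRESHOLD) -> bool:
    """Gate the expensive COUNT(*) recount on the cheap stored total."""
    return stored_total < threshold

print(should_refresh_counts(300))        # True: small, often-wrong categories
print(should_refresh_counts(1_877_668))  # False: e.g. All stub articles
```

Note that because the gate reads the stored count rather than the real one, a category whose stored count is wrongly tiny would still be recounted even if it is actually huge.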

See also:

Event Timeline

matmarex created this task.Jan 2 2015, 12:45 PM
matmarex raised the priority of this task from to Needs Triage.
matmarex updated the task description. (Show Details)
matmarex added a project: MediaWiki-Categories.
matmarex updated the task description. (Show Details)
matmarex set Security to None.
valhallasw updated the task description. (Show Details)Jan 2 2015, 12:49 PM

Well, there's a maintenance script to rebuild the counts, apparently: populateCategory.php

Krenair added a subscriber: Krenair.Jan 2 2015, 3:19 PM
Aklapper triaged this task as Lowest priority.Jan 5 2015, 12:33 PM
matmarex updated the task description. (Show Details)Nov 9 2015, 2:29 PM
Ankry added a subscriber: Ankry.Dec 11 2015, 7:56 PM

so only doing this for categories where the current count is <5000 should be safe?

IMO, yes, it should be safe given that limit, and only recounting on purge, not on read (although I guess @jcrespo is the person to ask).

There is the pathological case where the category contains 5 billion entries but the category table says 1; that would be bad, but it seems extraordinarily unlikely.

The only thing to watch out for is, I think, that the recount method gets a shared lock on all the categorylinks entries for that category. I'm unsure how that affects performance.

I do not think I am very useful here. I do not think counting in the foreground is a good idea; having the purge action create a background job that asynchronously counts them would be better, and it would allow for a larger limit if using the "slow" slave.

Aside from that, I would test it to see how long it takes with real examples, and set the limit accordingly. I can do that, but others could too. My only fear would be allowing spamming of the purge action.

the recount method gets a shared lock on all the categorylinks entries for that category

That is worrying: with InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.

aaron added a subscriber: aaron.Dec 11 2015, 9:50 PM

I do not think I am very useful here. I do not think counting in the foreground is a good idea; having the purge action create a background job that asynchronously counts them would be better, and it would allow for a larger limit if using the "slow" slave.
Aside from that, I would test it to see how long it takes with real examples, and set the limit accordingly. I can do that, but others could too. My only fear would be allowing spamming of the purge action.

the recount method gets a shared lock on all the categorylinks entries for that category

That is worrying: with InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.

Without locking, pages being added to and removed from the category while the SELECT gathers the new counts will not be counted correctly.

Without locking, pages being added to and removed from the category while the SELECT gathers the new counts will not be counted correctly.

I disagree in the general case, but it all depends on the definition of "counting" and on the actual problem to be solved, which I am not familiar with. Maybe SERIALIZABLE is the right approach there; I do not know.

the recount method gets a shared lock on all the categorylinks entries for that category

That is worrying: with InnoDB, while inside a transaction, you do not need to lock the rows; it should get consistent reads automatically.

Without locking, pages being added to and removed from the category while the SELECT gathers the new counts will not be counted correctly.

There is a way to address this concern, and I already have a patch (not quite ready for review, so not uploaded yet) that does the following:

  1. Acquires a named lock on the category to prevent concurrent count refreshes (normal links updates are unaffected)
  2. Ensures that subsequent queries will not read a stale snapshot (one taken before the lock was acquired)
  3. Counts all categorylinks rows of each type, and retrieves the old set of counts, in a single query (thus using a consistent snapshot, regardless of READ COMMITTED vs. REPEATABLE READ, assuming MVCC)
  4. If old and new counts are not the same, applies the necessary change(s) as difference(s) between old and new counts
  5. Releases the named lock
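A minimal sketch of steps 3 and 4 (the diff-based update), using SQLite stand-ins for the categorylinks and category tables. The named lock and snapshot handling of steps 1, 2 and 5 are omitted, and the sketch assumes (as in MediaWiki's schema) that cat_pages counts subcats and files as well; it is an illustration, not the actual patch:

```python
import sqlite3

def refresh_counts(conn, cat_name):
    """Recompute a category's counters from categorylinks and apply only
    the differences (steps 3-4 above; lock/snapshot handling omitted)."""
    cur = conn.cursor()
    # Step 3: count rows of each cl_type in a single query, so all three
    # counts come from one consistent snapshot.
    cur.execute(
        "SELECT cl_type, COUNT(*) FROM categorylinks "
        "WHERE cl_to = ? GROUP BY cl_type", (cat_name,))
    new = {"page": 0, "subcat": 0, "file": 0}
    new.update(dict(cur.fetchall()))
    cur.execute(
        "SELECT cat_pages, cat_subcats, cat_files FROM category "
        "WHERE cat_title = ?", (cat_name,))
    old_pages, old_subcats, old_files = cur.fetchone()
    # cat_pages conventionally includes subcats and files.
    new_pages = new["page"] + new["subcat"] + new["file"]
    # Step 4: apply only the deltas, and only if something changed, so
    # concurrent increments/decrements are not clobbered.
    if (new_pages, new["subcat"], new["file"]) != (old_pages, old_subcats, old_files):
        cur.execute(
            "UPDATE category SET cat_pages = cat_pages + ?, "
            "cat_subcats = cat_subcats + ?, cat_files = cat_files + ? "
            "WHERE cat_title = ?",
            (new_pages - old_pages, new["subcat"] - old_subcats,
             new["file"] - old_files, cat_name))
    conn.commit()
```

The key design point is that the UPDATE adds differences rather than overwriting with absolute values, which is what makes it tolerant of concurrent normal links updates.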

Here's an analogy:

You are a professor who teaches in a gigantic lecture hall. You notice that a clock high on the wall reads 11:00, but it is actually 11:05, according to your wristwatch, which is synchronized to a national time standard. You call maintenance and ask them to fix the time on the clock. At 12:05, a maintenance worker arrives and sets up his slow and rather noisy lift. Unfortunately, the worker's watch is broken, so he needs some help.

New method: You compare the clock's time (12:00) with your wristwatch's time (12:05). Because the clock's time is five minutes earlier than your wristwatch's time, you tell the worker to advance the clock by five minutes, making sure no other worker arrives until he is done. He takes one minute to go up in the lift, then adjusts the clock (to 12:06), then goes back down.

Current method: You stop time, because otherwise the time might change (after you check your wristwatch, but before the clock is adjusted). You check your wristwatch, then tell the worker to set the clock to 12:05. He starts going up in the lift. One minute later, he reaches the clock, though because you had stopped time, it is still 12:05. You only restart time once the worker is done adjusting the clock.

This would need to be limited to categories with, say, fewer than 5000 entries; for pathological cases like Category:All stub articles on en.wp with nearly two million pages, the query (see Category::refreshCounts()) takes several minutes to run (I tried on labs and it took 6m18s to count 1,877,668 pages). I also tried Category:Year of birth unknown and it took only 4.5s for 20,125 pages, so only doing this for categories where the current count is <5000 should be safe?

Currently, the query joins against the page table, checking page_namespace to determine what is already available as cl_type. The following query (a simplified version of that from my patch):

select cl_type, count(*) from categorylinks where cl_to = 'All_stub_articles' group by cl_type;

when run against an October 2015 dump of enwiki, takes only 1.35 sec on my desktop PC (running MariaDB 5.5.47).
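To illustrate why the join can be dropped: cl_type already encodes what page_namespace would tell us (NS_FILE = 6, NS_CATEGORY = 14), so the two query shapes agree. Here is a toy SQLite sketch; the join-based query only approximates the real Category::refreshCounts() SQL:

```python
import sqlite3

# Toy schema: cl_type duplicates what page_namespace encodes
# (namespace 14 = category -> 'subcat', 6 = file -> 'file', else 'page').
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INT PRIMARY KEY, page_namespace INT);
CREATE TABLE categorylinks (cl_from INT, cl_to TEXT, cl_type TEXT);
INSERT INTO page VALUES (1, 0), (2, 0), (3, 14), (4, 6);
INSERT INTO categorylinks VALUES
  (1, 'All_stub_articles', 'page'),
  (2, 'All_stub_articles', 'page'),
  (3, 'All_stub_articles', 'subcat'),
  (4, 'All_stub_articles', 'file');
""")

# Join-based counts, roughly how refreshCounts() derives them from
# page_namespace (approximation, not the exact MediaWiki query):
with_join = conn.execute("""
  SELECT SUM(page_namespace NOT IN (6, 14)),
         SUM(page_namespace = 14),
         SUM(page_namespace = 6)
  FROM categorylinks JOIN page ON cl_from = page_id
  WHERE cl_to = 'All_stub_articles'
""").fetchone()

# Join-free counts using the denormalized cl_type column:
by_type = dict(conn.execute("""
  SELECT cl_type, COUNT(*) FROM categorylinks
  WHERE cl_to = 'All_stub_articles' GROUP BY cl_type
""").fetchall())
without_join = (by_type.get('page', 0), by_type.get('subcat', 0),
                by_type.get('file', 0))

print(with_join == without_join)
```

The join-free variant scans only the categorylinks index on cl_to, which is why it is so much faster on large categories.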

However, I first need to work on improving the consistency between the two fields, at the very least by making it possible for users to fix inconsistencies themselves (e.g. by purging the affected pages, or the affected category).