
Number of category members (PAGESINCATEGORY) is inaccurate for large categories
Closed, Duplicate · Public

Description

*[[Category:All articles proposed for deletion]]
*[[Category:All disputed non-free images]]
*[[Category:Articles for deletion]]

The page counts given by PAGESINCATEGORY for the above categories are 3034, 2460 and 2776 respectively. Manually paginating through the category contents gives 583, 241 and 745.

All three categories are populated by templates, and some entries leave the category when the page is deleted.


Version: unspecified
Severity: normal

Related Objects

Event Timeline


per comment 11, adding "shell" keyword

(In reply to comment #12)

http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=436694300#Category:Images_lacking_a_description
is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories with less than 200 pages in them, and 63 < 200.

(In reply to comment #14)

(In reply to comment #12)

http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=436694300#Category:Images_lacking_a_description
is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories with less than 200 pages in them, and 63 < 200.

I see why. We don't refresh the cat counts if everything is 0 (or, specifically, if everything is right except for the images, and the image count is really 0 or >200). This behaviour is kind of weird. (Also, we only check whether a specific section is correct, which could result in recounting a 200,000-member category just because it happens to have 5 images in it and the image count is wrong.) [This all seems kind of wrong, but it's off topic for this bug, so I'll stop talking about it and either fix it or split it off into another bug.]

Anyhow, the upshot of all this is: if you add a single image to that category, view the category page (this is the important part: you need to view the category page while it has the single image in it), and then remove that image from the category, it will probably reset the file count for that category. No shell required.
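
To make the behaviour being described easier to follow, here is a rough sketch of the recount-on-view heuristic as these comments describe it. This is a reader's reconstruction, not MediaWiki's actual code; all names are illustrative:

```python
# Rough sketch of the recount-on-view heuristic described above; all
# names are illustrative, not actual MediaWiki identifiers.
RECOUNT_THRESHOLD = 200  # the "less than 200 pages" cutoff mentioned earlier

def should_recount_on_view(stored_count: int, shown_count: int) -> bool:
    """Recount only when the stored tally is both small and visibly wrong."""
    if stored_count == shown_count:
        return False  # stored count agrees with what the page shows
    # Large categories are considered too expensive to recount on a view.
    return shown_count < RECOUNT_THRESHOLD
```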

Has any progress been made on resolving this issue?

Betacommand: No, otherwise it would be mentioned here. :)

Does this need doing everywhere?

hercule.wikipedia wrote:

I confirm this is not only for deletion categories. An example on fr.wiki:

http://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Bon_article

Today it counts 1,850 pages plus 3 subcategories, but via the API I retrieve 1,863 items.

hercule.wikipedia wrote:

(In reply to comment #20)

http://commons.wikimedia.org/wiki/Category:Non-empty_disambiguation_categories,
http://commons.wikimedia.org/wiki/Category:Non-empty_category_redirects and
http://commons.wikimedia.org/wiki/Category:Broken_category_redirects sometimes take weeks before getting correct. It became much worse last month.

I don't think this is related to this bug. Your problem is due to cache management; that's something else.

This morning I updated every page in http://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Article_de_qualit%C3%A9 and then updated the category page itself. Neither changed the article count.

  • Bug 68240 has been marked as a duplicate of this bug.
tomasz set Security to None.

In the distant past, refreshLinks ran in autocommit mode, which is partly why there are so many discrepancies in the counts (due to partial updates).

I wonder how often new drift occurs. In any case, it might be useful to have a cron job refresh the counts.

Just for the record, here are some stats that I've dug up while working on T18765: Write a maintenance script to refresh category member counts:

Out of 1,555,008 rows in enwiki's category table, there are 15,476 rows with an incorrect cat_pages value. 57 of these have drifted significantly (more than 200 away from the real value), and 45 have cat_pages < 0. Fewer than 300 categories have a cat_subcats miscount, and a similar number have a cat_files miscount.

I wonder how often new drift occurs.

wikidatawiki is relatively new (started in late 2012). There are 19 miscounts for cat_pages out of 4,511 total rows in the category table, and no miscounts for cat_subcats or cat_files. Figures are similar for enwikivoyage, a site of a similar age. lrcwiki (a small wiki started in mid-2015) has 12 miscounts out of 861 category rows. So they still happen.

matmarex renamed this task from PAGESINCATEGORY inaccurate for large categories to Number of category members (PAGESINCATEGORY) is inaccurate for large categories.Feb 1 2017, 3:35 PM

BTW, I have no proof of this, but it was discovered recently and it would fit (maybe?) the issues mentioned here: T163337 (some kind of refresh being executed twice and subtracting more than once). I am speaking without having seen the code; I do not know whether an addition/subtraction is done or whether a full count is done each time.

I don't understand why counters are not simply refreshed via a scheduled job on the job queue: the current counter value can help schedule this refresh at some point in the future (if no other refresh has been scheduled, use a delay based on the last time the counter was refreshed plus a delay depending on the current counter value; if that still falls in the past, schedule the job to run immediately).

When the scheduled job runs, it will compute a COUNT(*) via an SQL query, set the actual value, and mark the category page as "touched" (to invalidate its cached rendering if the page uses PAGESINCATEGORY, and to allow recategorization of the category page itself, since it will be re-rendered). The job queue will also record the last date the job was run; the job does not need to reschedule itself (rescheduling happens when member pages change their own categorisation).

This way, the counters will not stay out of sync for long: they will eventually be synchronized and cleaned up regularly, but not so often as to cause costly accesses (the SQL query to SELECT COUNT(*) of the members of a specific category should already be optimized by indexes and should not be very costly). The job could use its own schedule queue, allowing counters for multiple categories to be recomputed in the same SQL session: the standard job queue would not manage individual categories itself, but the dedicated job would perform multiple updates in a batch from its own queue, giving priority to categories whose last estimated count was smallest or whose refresh has been pending longest (e.g. more than 24 hours).

This job could also be scheduled manually by system admins for specific forgotten categories, using some admin tool, an SQL stored procedure called from the SQL console, or a restricted maintenance special page (restricted only to avoid DoS attacks that try to hurt server performance with too many concurrent costly requests across a large number of categories). I understand that recounting a heavily populated category may be costly, but the SQL backend should still be optimized to perform a COUNT(*) of the members of a specific category at small cost even if that category is very populated, while basic page edits simply use a fast but optimistic "UPDATE [table] SET counter=counter[+/-]1 WHERE [categoryname]" and also schedule the category on the job queue.
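
A minimal sketch of the recount job proposed above, assuming direct SQL access through a DB-API-style handle. The categorylinks and category tables and their cl_*/cat_* columns are the real MediaWiki schema; the function itself and its scheduling are hypothetical:

```python
# Minimal sketch of the proposed recount job. The categorylinks and
# category tables (cl_to, cl_type, cat_pages, cat_subcats, cat_files)
# follow the MediaWiki schema; the wrapper itself is hypothetical.

def refresh_category_counts(db, cat_title: str) -> None:
    """Recompute one category's member counts and write them back."""
    row = db.execute(
        """SELECT COUNT(*),
                  COALESCE(SUM(cl_type = 'subcat'), 0),
                  COALESCE(SUM(cl_type = 'file'), 0)
           FROM categorylinks
           WHERE cl_to = ?""",
        (cat_title,),
    ).fetchone()
    pages, subcats, files = (int(v) for v in row)
    # cat_pages counts all members, including subcategories and files.
    db.execute(
        """UPDATE category
           SET cat_pages = ?, cat_subcats = ?, cat_files = ?
           WHERE cat_title = ?""",
        (pages, subcats, files, cat_title),
    )
    # A real job would also mark the category page as touched here, so
    # cached renderings that use PAGESINCATEGORY get invalidated.
```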

I don't understand why counters are not simply refreshed via a scheduled job on the job queue
the SQL query to SELECT COUNT(*) of the members of a specific category should already be optimized by indexes and should not be very costly

How do we count 10M rows with the current schema? https://commons.wikimedia.org/wiki/Category:CC-BY-SA-3.0 If we count without locking, by the time we finish counting the counter is already outdated. If we count with locking (SERIALIZABLE / SELECT ... FOR UPDATE), we block new items from being added, causing high contention. Of course, this is not a new problem that only we have, but 1) we have a very inefficient structure that makes things very slow, and 2) we have a structure that favours high contention.

Note that I am not disagreeing with you; more or less, what you propose was already going to be done at https://gerrit.wikimedia.org/r/333917 (with the non-locking model). What I am saying is that we should maybe start planning for these kinds of issues and change the references model to "solve" such problems or make them trivial.
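
For illustration, a sketch of what non-locking counting can look like: walk the category in primary-key chunks so that no long-running transaction holds locks, accepting that the result may be slightly stale by the time it is written back. cl_from is the page-id half of the categorylinks primary key; the chunk size is arbitrary:

```python
# Sketch of a non-locking count: read the category in chunks keyed on
# cl_from so no single long-running query holds locks. Rows added or
# removed behind the cursor make the final figure slightly stale,
# which is exactly the trade-off discussed above.
CHUNK = 10_000  # arbitrary batch size

def count_members_nonlocking(db, cat_title: str) -> int:
    total, last_page_id = 0, 0
    while True:
        rows = db.execute(
            """SELECT cl_from FROM categorylinks
               WHERE cl_to = ? AND cl_from > ?
               ORDER BY cl_from
               LIMIT ?""",
            (cat_title, last_page_id, CHUNK),
        ).fetchall()
        if not rows:
            return total
        total += len(rows)
        last_page_id = rows[-1][0]
```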

but the SQL backend should still be optimized to perform a COUNT(*) of the members of a specific category at small cost even if that category is very populated

OK, let's design that optimization :-) We can start by normalizing titles out of references. Should we also do some kind of sharding, to avoid contention and maybe allow parallel counting? More ideas?
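
To make the sharding idea concrete, here is one common shape for it, a sharded counter: each category's tally is split across N rows, writers increment a random row, and readers sum all of them. The category_count_shard table and its columns are hypothetical, not part of the MediaWiki schema:

```python
import random

N_SHARDS = 16  # arbitrary; more shards means less write contention

def bump_count(db, cat_id: int, delta: int) -> None:
    """Apply +1/-1 to one randomly chosen shard, so concurrent edits
    to the same category rarely contend on the same row."""
    db.execute(
        """UPDATE category_count_shard
           SET ccs_pages = ccs_pages + ?
           WHERE ccs_cat = ? AND ccs_shard = ?""",
        (delta, cat_id, random.randrange(N_SHARDS)),
    )

def read_count(db, cat_id: int) -> int:
    """Sum the shards; the per-shard rows could also be counted in parallel."""
    (total,) = db.execute(
        """SELECT COALESCE(SUM(ccs_pages), 0)
           FROM category_count_shard
           WHERE ccs_cat = ?""",
        (cat_id,),
    ).fetchone()
    return int(total)
```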

Is there any workaround to force recount for a given category?

No, there isn't (other than removing all pages from it, but please don't do that).

Reedy changed the task status from Open to Stalled.Feb 10 2019, 1:32 AM

No, there isn't (other than removing all pages from it, but please don't do that).

"Removing all pages from it" would not help. https://commons.wikimedia.org/wiki/Category:Flickr_review_needed is constantly emptied and filled up, but {{PAGESINCAT:Flickr review needed|R|files}} still returns a wrong tally (+42 from the actual number).

That was not the case when I made that comment. At the time, a category would be recounted whenever someone viewed the category description page, if the tally was <200 and did not match the actual number of items shown on the page. The commits rMWde75c4e63bd6: Avoid triggering Category::refreshCounts() on HTTP GET requests and rMW9a2ba8e21d82: Reduce frequency of refreshCounts() calls in LinksDeletionUpdate changed this logic; it seems categories are now only recounted when a page is removed from the category and the tally is <=0, so that category will indeed never be recounted.

Change 506032 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Reinstate small category refresh logic in LinksDeletionUpdate

https://gerrit.wikimedia.org/r/506032

Change 506032 merged by jenkins-bot:
[mediawiki/core@master] Reinstate small category refresh logic in LinksDeletionUpdate

https://gerrit.wikimedia.org/r/506032

With this patch, small categories (up to 100 pages) should be once again recounted whenever a page is removed from them. This should be deployed to Wikimedia wikis next week, per the usual schedule. Larger categories may still remain inaccurate forever.

With this patch, small categories (up to 100 pages) should be once again recounted whenever a page is removed from them. This should be deployed to Wikimedia wikis next week, per the usual schedule. Larger categories may still remain inaccurate forever.

I'd like to suggest a different logic to consider:

We would like to trigger a recount every now and then, but not too often, in order to avoid server load. A simple solution is to trigger a recount after p·n changes (additions or removals of a page from a category), where n is the category size and p is a constant factor, e.g. 2%. For small categories, a recount will take place every time. For a category with about 1000 pages, a recount will take place after 20 operations.

However, such a solution requires storing the accumulated number of operations. To avoid changes to the DB, we can apply a stochastic trigger: pick r, a random number between 0 and 1, and execute a recount if r < 1/(p·n).
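
A minimal sketch of that stochastic trigger, using the p = 2% from above (the function name and surrounding wiring are illustrative):

```python
import random

P = 0.02  # the constant factor p suggested above (2%)

def should_recount(n: int) -> bool:
    """Recount with probability 1/(p*n): on average once every p*n
    membership changes, without storing an operation counter in the DB."""
    if n <= 0:
        return True  # empty or drifted-negative tallies are cheap to recount
    # When p*n <= 1 (categories of ~50 pages or fewer) the probability
    # saturates at 1, so small categories are recounted on every change.
    return random.random() < 1 / (P * n)
```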

Superyetkin raised the priority of this task from Low to High.May 3 2019, 3:29 PM

The patch does not seem to have any positive effect on trwiki. The call {{PAGESINCATEGORY:Vikipedi silinecek sayfalar}} returns 12 while the category is empty (as of May 3, 2019).

Aklapper lowered the priority of this task from High to Low.May 3 2019, 4:06 PM

@Superyetkin: Please do not change task priority if the situation has not changed and has not suddenly become more urgent, as requested before.

@Aklapper, I think the fact that the applied patch failed does change the situation. The problem persists and needs to be worked on more closely.

On a closer look, my description of @aaron's patch was incorrect: rather than being recounted whenever a page is removed from them, categories are only recounted when a page that belongs to them is deleted.

I am actually not sure which is the intended behavior…

This task does not appear to describe a specific technical problem with the way counts happen or are updated.

Rather, it seems we assume that updating counts sometimes breaks, and we want to do full recounts more often to mitigate this. The logic around that is being discussed at T221795. Input there is welcome :)

I don't think that task is about the same thing. We've gone a bit off the topic in the last few comments, but the main point of this task is that "Number of category members is inaccurate for large categories". I've just read T221795 and I don't see how it seeks to improve this.

  • By having log warnings we can better understand why these numbers diverge and whether we can fix that.
  • By using the job queue we could afford doing recounts more liberally.

Specific to this task, I don't see what action could resolve it. Is there a specific problem with steps to reproduce that you'd like this task to focus on? In that case it could depend on T221795 and deprioritised until after that. But a general task for the feature having non-zero bugs doesn't seem particularly actionable.

Problems with counting start from T224209.

The failure to fix this bug has a negative effect on maintenance, because using PAGESINCATEGORY to report the size of tracking categories gives inaccurate results.

It's ridiculous that this problem remains unresolved after 9 years.

@BrownHairedGirl: This task is closed as a duplicate, see T18036#5163022. Please see https://www.mediawiki.org/wiki/Bug_management/Development_prioritization for why it is not "ridiculous" but unfortunately rather common that many issues do not get solved, as long as there is no unlimited workforce and as long as you or anyone else does not contribute a patch to fix an issue... Thanks :)