Page MenuHomePhabricator

Number of category members (PAGESINCATEGORY) is inaccurate for large categories
Closed, DuplicatePublic


*[[Category:All articles proposed for deletion]]
*[[Category:All disputed non-free images]]
*[[Category:Articles for deletion]]

The page count as given by PAGESINCATEGORY for the above categories are 3034, 2460 and 2776 respectively. Manually paginating through the category contents gives 583, 241 and 745.

All three categories are populated by templates and some entries exit the category via deletion.

Version: unspecified
Severity: normal



Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Does anyone have a clue what's causing this? The error seems to go in both directions (sometimes there are too many articles in the category, sometimes too few). Perhaps we don't put article saves and category updates in one transaction and if one succeeds and the other fails we get wrong numbers?

Here's a simple query to see where there are wrong numbers in the database:

SELECT cat_title as title, cat_pages as storedcount, realcount FROM category,
(SELECT cl_to, count(cl_to) as realcount FROM categorylinks GROUP BY cl_to) as rc
WHERE cl_to=cat_title AND rc.realcount!=cat_pages LIMIT 10;

(In reply to comment #5)

See also bug 12188

That's not really the same bug.

From what I've tested so far, this problem seems to be two different problems:

a) categorylinks contains obsolete entries. See cl_to='1284_establishments' for on dewiki an example. This was an imported article located at [[:de:Portal Diskussion:Peterhouse, Cambridge]]. Even though the article was moved and the category removed ( ), the reference in categorylinks remains.

b) cat_pages is inaccurate. See for example which clearly contains 32 pages + 5 subcats = 37 cats. Yet, cat_pages says only 36 pages are in that category.
(What I don't understand in this case is why PAGESINCATEGORY returns the correct amount of 37. This seems to be inconsistent with the cat_pages result on the toolserver).

By the way, all in all there are 30 results on dewiki for a cat_pages <-> count mismatches.

Does this only happen for template-populated categories?
Possibly related bugs:
Bug 16686
Bug 16765
Bug 21230

it looks like there are stale entries in table categorylinks. This is a referential constraint : one deletion occured in table category, which probably did not occur at the same time in table cateforylinks, probably due to concurrent accesses and incorrecct recovery after a failure.

So we need now a SQL-enabled bot to check the counts and cleanup the stale categorylinks entries. This causes another bug: in debuggging categories (populated by templates, such as those detecting pages using obsoleted or missing parameters or incorrect template parameter values), we can cleanup the listed pages by correcting these template invokations (most of these cleanup actions are performed by bots).

However, the page reappears in the categoy some minutes later, because apparently it is found with a request on "categorylinks" only. When we nulledit these pages, they disappear immedirately from the category, because the query cache will only list the entries found in the "category" page. A few minutes later (about 5 minutes), when no edit was really performed, the same pages reappear without any reason (no edit was performed), because the stale entries in "categorylinks" are still there and where not deleted by the nulledit.

Checking now all categories could take a lot of time and SQL rssources. However, a null-edit action should be able to cleanup the categories that the null-edit is trying to refresh : not only it should look at what is in table "category" but also if the null-edited page is found somewhere in "categorylinks", to delete the unwanted stale entries: if the two "SELECT count(*)" from the two tables are incoherent, then it should list the existing entries and compare them to what is expected for the null-edited page, and delete those superfluous entries.

Note that MySQL does not properly rollbacks transactions (a full support for transactions is very costly for the server and could cause dealys and huge consumption of ressources, plus additional deadlocks: to avoid this, you should make sure that deletions or additions in any of the two tables are always performed in the same order (first "category", then "categorylinks"): this will remain true if transaction support is enabled (in another engine than MySQL, such as Oracle or Postgres).

In my opinion, the full support for transaction only works with Oracle, Sybase, MS-SQL, Informix, Ingres, PostGres, but if transactions is not enabled (i.e. all requests are autocommitted), then you need a better management of errors and their recovery (by inserting insert/update/delete requests in a queue list with a callback to handle the retires with a timed delay, for example by reinserting the page that failed to update completely in the existing update queue).

If this is not specific to deletion categories, the bug title should be changed.

Per Aryeh, these counts can be fixed by running populateCategory.php with the --force parameter.

per comment 11, adding "shell" keyword

(In reply to comment #12)
is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories with less than 200 pages in them, and 63 < 200.

(In reply to comment #14)

(In reply to comment #12)
is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories
with less than 200 pages in them, and 63 < 200.

I see why. We don't refresh the cat counts if everything is 0 (or specifically if everything is right except for the images, and the image section is really 0 or >200). This behaviour is kind of weird (Also, we only check if a specific section is correct, which could result in recounting a 200000 big category if it only happens to have 5 images in it, and the image count is wrong) [this all seems kind of wrong, but off topic to this bug so I'll stop talking about that and eiter fix it or split it off to some other bug]

Anyhow, the upshot of all this is - If you add a single image to that category, view that category page (this is the important part, you need to view the category page while it has the single image in it), and then remove that image from the category, it will probably reset the number of files in that category count - no shell required.

Has any progress been made on resolving this issue?

Betacommand: No, otherwise it would be mentioned here. :)

Reedy added a comment.May 16 2013, 9:24 PM

Does this need doing everywhere?

hercule.wikipedia wrote:

I confirm this is not only for deletion categorie. An example on is :

Today it counts 1 850 pages + 3 sub-categorys and via the api I retrieve 1863 items.

hercule.wikipedia wrote:

(In reply to comment #20)
empty_disambiguation_categories and take
sometimes weeks before getting correct. It became much worse last month.

I don't think this is related to this bug. Your problem is due to the cache management. That's something else.

Today I made an update of every pages in this morning, and update the category page. None changed the count of articles.

  • Bug 68240 has been marked as a duplicate of this bug. ***
tomasz removed a project: Shell.Feb 23 2015, 7:51 PM
tomasz set Security to None.
aaron added a subscriber: aaron.Dec 11 2015, 9:51 PM

In the distant past, refreshLinks ran in autocommit mode, which partly why there are so many discrepancies with counts (due to partial updates).

I wonder how often new drift occurs. In any case, it might be useful to have a chron refresh the counts.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 11 2015, 9:51 PM
Meno25 removed a subscriber: Meno25.Feb 22 2016, 6:06 PM
TTO added a comment.Jan 24 2017, 1:37 PM

Just for the record, here are some stats that I've dug up while working on T18765: Write a maintenance script to refresh category member counts:

Out of 1,555,008 rows in enwiki's category table, there are 15,476 rows with an incorrect cat_pages value. 57 of these have drifted significantly (more than 200 away from the real value), and 45 have cat_pages < 0. Fewer than 300 categories have a cat_subcats miscount, and a similar number for cat_files.

I wonder how often new drift occurs.

wikidatawiki is relatively new (started in late 2012). There are 19 miscounts for cat_pages out of 4,511 total rows in the category table, and no miscounts for cat_subcats or cat_files. Figures are similar for enwikivoyage, a site of a similar age. lrcwiki (a small wiki started in mid-2015) has 12 miscounts out of 861 category rows. So they still happen.

matmarex renamed this task from PAGESINCATEGORY inaccurate for large categories to Number of category members (PAGESINCATEGORY) is inaccurate for large categories.Feb 1 2017, 3:35 PM

BTW, I have no proof of that, but this was discovered recently and it would fit (maybe?) the issues mentioned here: T163337 (some kind of refresh being executed twice and substracting more than once). I am speaking without seeing the code- I do not know if an addition/substraction is done or just full count is done each time.

Verdy_p added a comment.EditedMay 20 2017, 2:13 AM

I don't understand why counters are not just refreshed using a scheduled date on the job queue: the current counter value will help schedule this refresh in some future (if no other schedule has been programmed, using a delay based on the last time the counter was refreshed + a delay depending on the current counter value; if this still falls in the past, schedule the job to run immmediately).

When the scheduled job will run, it will compute a COUNT(*) from SQL query and will set the actual value, and mark the category page as "touched" (to invalidate its rendering in cache if the page was using PAGESINCATEGORY and allow recategorization of the page as it will be freshed if it was the category page itself). The job queue will also record the last date it was run and trhe job does not need to reschedule itself (rescheduling will be made when member pages will change their own categorisation.

This way, the counters will not staty out of sync very long, they will fainlly be synchronized and cleaned regularly, but not too often to avoid costly accesses (however the SQL queries to perform a SELECT COUNT(*) for members in a specific category should be already optimized using indexes and should not be very costly: the job queue may use a job specific schedule queue allowing counters for multiple categories to be run in the same SQL session, the standard job queue not managing itself individual categories but only when the dedicate job will run to perform multiple updates in a batch from its own specific queue, probably with priorities for categories whole last estimated counters was the smallest or requested since long enough (e.g. since more than 24 hours, before every others).

This job may also be scheduled manually by system admins if needed for specific forgotten categories using some admin tool or by calling some SQL stored procedure from the SQL console or some restricted PHP or MediaWiki "SpecialPage :" for maintenance (restricted only to avoid DoS attacks that attempt to affect the performance of servers with too many concurrent costly requests on a large number of categories: I understand that recounting some heavily populated categories may be costly, but the SQL backend should still be optimized to perform a COUNT(*) of members in a specific category with a small cost even if that category is very populated, while basic editing of pages will simply use a fast but optimistic "UPDATE [table] SET counter=counter[+/-]1 WHERE [categoryname]" while also scheduling the category in the job queue).

jcrespo added a comment.EditedMay 22 2017, 9:03 AM

I don't understand why counters are not just refreshed using a scheduled date on the job queue
COUNT(*) for members in a specific category should be already optimized using indexes and should not be very costly

How to count 10M rows with the current schema? If we do things not-locking, by the time we finish the counting the counter is outdated. If we do things locking (SERIALIZABLE/SELECT FOR UPDATE), we block new items to be added, causing high contention. Of course, this is not a new problem that only we have, but 1) we have a very inefficient structure that makes things very slow 2) we have a structure that favours high contention.

Note that I am not disagreeing with you- more or less, what you propose was already going to be done at: (then non-locking model). What I am saying is that we should maybe start planing for those kind of issues and change the references model to "solve" or make trivial problems like this.

but the SQL backend should still be optimized to perform a COUNT(*) of members in a specific category with a small cost even if that category is very populated

Ok, let's design that optimization :-) - we can start by normalizing titles out of references. Should we also do some kind of sharding to avoid contention and maybe allow parallel counting? More ideas?

TTO changed the status of subtask T170737: Run recountCategories.php on Wikimedia wikis from Stalled to Open.Jul 22 2017, 1:09 AM
Fuzzy added a comment.Oct 22 2017, 2:57 PM

Is there any workaround to force recount for a given category?

No, there isn't (other than removing all pages from it, but please don't do that).

JLJ001 added a subscriber: JLJ001.May 31 2018, 9:21 PM
Jeff_G added a subscriber: Jeff_G.Aug 28 2018, 12:48 PM
Izno added a subscriber: Izno.Nov 7 2018, 2:08 PM
Reedy changed the task status from Open to Stalled.Feb 10 2019, 1:32 AM
Roy17 removed a subscriber: Roy17.Feb 24 2019, 10:39 PM

No, there isn't (other than removing all pages from it, but please don't do that).

"Removing all pages from it" would not help. is constantly emptied and filled up, but {{PAGESINCAT:Flickr review needed|R|files}} still returns a wrong tally (+42 from the actual number).

That was not the case when I made that comment. At the time, a category would be recounted whenever someone viewed the category description page if the tally was <200 and if it did not match the actual number of items shown on the page. Commits rMWde75c4e63bd6: Avoid triggering Category::refreshCounts() on HTTP GET requests and then rMW9a2ba8e21d82: Reduce frequency of refreshCounts() calls in LinksDeletionUpdate changed this logic, and it seems currently categories are only recounted whenever a page is removed from the category and the tally is <=0, so that category will indeed never be recounted.

matmarex added subscribers: Fastily, Srittau, Superyetkin and 6 others.

The changes I linked in my last comment certainly made this issue much more common. @aaron Can that logic be reconsidered?

Change 506032 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Reinstall small category refresh logic in LinksDeletionUpdate

Change 506032 merged by jenkins-bot:
[mediawiki/core@master] Reinstate small category refresh logic in LinksDeletionUpdate

With this patch, small categories (up to 100 pages) should be once again recounted whenever a page is removed from them. This should be deployed to Wikimedia wikis next week, per the usual schedule. Larger categories may still remain inaccurate forever.

Fuzzy added a comment.Apr 28 2019, 1:17 PM

With this patch, small categories (up to 100 pages) should be once again recounted whenever a page is removed from them. This should be deployed to Wikimedia wikis next week, per the usual schedule. Larger categories may still remain inaccurate forever.

I'd like to suggest a different logic to consider:

We would like to trigger a recount every now and then, but not too often in order to avoid server load. A simple solution is to trigger a recount after pn changes (addition/removal of page from a category), where n is the category size and p is a constant factor, i.e. 2%. For small categories, recount will take place each time. For a category with about 1000 pages, recount will take place after 20 operations.

However, such a solution requires storing the accumulated number of operations. In order to avoid changes to the DB, we can apply a stochastic trigger: Pick r, a random number between 0 and 1, and execute recount if r < 1/(p·n).

Superyetkin raised the priority of this task from Low to High.May 3 2019, 3:29 PM

The patch does not seem to have any positive effect on trwiki. The call {{PAGESINCATEGORY:Vikipedi silinecek sayfalar}}
returns 12 while the category is empty (as of May 3, 2019).

Aklapper lowered the priority of this task from High to Low.May 3 2019, 4:06 PM

@Superyetkin: Please do not change task priority if the situation has not changed and has not suddenly become more urgent, as requested before.

@Aklapper, I think the applied patch having failed changes the situation. The problem persists and needs to be worked on more closely.

On a closer look, my description of @aaron's patch was incorrect – rather than being recounted whenever a page is removed from them, categories are only recounted when a page that belongs to them is deleted.

I am actually not sure which is the intended behavior…

There does not appear to be a description of a specific technical problem in this task with the way counts happen or are updated.

Rather, it seems we assume that updating counts breaks sometimes and are wanting to do full recounts more often to mitigate this. The logic around that is being discussed at T221795. Input there is welcome :)

I don't think that task is about the same thing. We've gone a bit off the topic in the last few comments, but the main point of this task is that "Number of category members is inaccurate for large categories". I've just read T221795 and I don't see how it seeks to improve this.

Krinkle added a comment.EditedMay 7 2019, 12:40 AM
  • By having log warnings we can better understand why these numbers diverge and whether we can fix that.
  • By using the job queue we could afford doing recounts more liberally.

Specific to this task, I don't see what action could resolve it. Is there a specific problem with steps to reproduce that you'd like this task to focus on? In that case it could depend on T221795 and deprioritised until after that. But a general task for the feature having non-zero bugs doesn't seem particularly actionable.

Wargo added a subscriber: Wargo.May 29 2019, 8:02 AM

Problems with counting starts from T224209