
Number of category members (PAGESINCATEGORY) is inaccurate for large categories
Open, Stalled, Low, Public

Description

*[[Category:All articles proposed for deletion]]
*[[Category:All disputed non-free images]]
*[[Category:Articles for deletion]]

The page counts given by PAGESINCATEGORY for the above categories are 3034, 2460 and 2776 respectively. Manually paginating through the category contents gives 583, 241 and 745.

All three categories are populated by templates, and some entries leave the categories via page deletion.


Version: unspecified
Severity: normal

Details

Reference
bz16036

Related Objects

Event Timeline

bzimport raised the priority of this task from to Low.
bzimport set Reference to bz16036.
bzimport added a subscriber: Unknown Object (MLST).
MER-C created this task.Oct 19 2008, 5:27 AM

ForoaW wrote:

Similar problems on [[:commons:category:Halo]] and [[:commons:category:Duplicate]]. Mostly emptied by bot moves and deletions.

demon added a comment.Jun 3 2009, 3:56 PM
  • Bug 19068 has been marked as a duplicate of this bug.
IAlex added a comment.Oct 4 2009, 6:54 AM
  • Bug 20977 has been marked as a duplicate of this bug.
APPER added a comment.Jan 6 2010, 9:29 AM

On de.wikipedia there may be the same problem: "SELECT cat_pages FROM category WHERE cat_title='Mann';" gives 260098, but "SELECT count(*) FROM categorylinks WHERE cl_to='Mann';" gives 259087. I don't know if it's the same bug, because the category "Mann" (man) is normally not added by a template but directly (I don't know of any article where it is added by a template).

But wrong numbers make PAGESINCATEGORY not very usable...

See also bug 12188

Does anyone have a clue what's causing this? The error seems to go in both directions (sometimes there are too many articles in the category, sometimes too few). Perhaps we don't put article saves and category updates in one transaction, and if one succeeds and the other fails we get wrong numbers?

Here's a simple query to see where there are wrong numbers in the database:

SELECT cat_title as title, cat_pages as storedcount, realcount FROM category,
(SELECT cl_to, count(cl_to) as realcount FROM categorylinks GROUP BY cl_to) as rc
WHERE cl_to=cat_title AND rc.realcount!=cat_pages LIMIT 10;
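
Note that the inner join above misses one case: a category row whose real member count is zero (no categorylinks rows left) but whose cat_pages is still non-zero. A LEFT JOIN variant (just a sketch against the same two tables) catches those as well:

SELECT cat_title AS title, cat_pages AS storedcount, COALESCE(rc.realcount, 0) AS realcount
FROM category
LEFT JOIN (SELECT cl_to, COUNT(*) AS realcount FROM categorylinks GROUP BY cl_to) AS rc
  ON rc.cl_to = cat_title
WHERE COALESCE(rc.realcount, 0) != cat_pages
LIMIT 10;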

(In reply to comment #5)

See also bug 12188

That's not really the same bug.

From what I've tested so far, this seems to be two distinct problems:

a) categorylinks contains obsolete entries. See cl_to='1284_establishments' on dewiki for an example. This was an imported article located at [[:de:Portal Diskussion:Peterhouse, Cambridge]]. Even though the article was moved and the category removed ( http://de.wikipedia.org/w/index.php?title=Peterhouse&diff=next&oldid=72561479&uselang=en&diffonly=1 ), the reference in categorylinks remains.

b) cat_pages is inaccurate. See for example http://de.wikipedia.org/wiki/Kategorie:Unternehmen_%28Saarland%29 which clearly contains 32 pages + 5 subcategories = 37 entries. Yet cat_pages says only 36 pages are in that category.
(What I don't understand in this case is why PAGESINCATEGORY returns the correct count of 37. This seems to be inconsistent with the cat_pages result on the toolserver.)

By the way, all in all there are 30 cat_pages <-> real count mismatches on dewiki.
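
One class of stale categorylinks rows, those whose source page no longer exists at all, can be listed with a query like this (a sketch assuming the standard schema, where cl_from stores the member's page_id); it does not catch rows left behind by ordinary edits that removed the category:

SELECT cl.cl_from, cl.cl_to
FROM categorylinks AS cl
LEFT JOIN page AS p ON p.page_id = cl.cl_from
WHERE p.page_id IS NULL
LIMIT 10;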

Does this only happen for template-populated categories?
Possibly related bugs:
Bug 16686
Bug 16765
Bug 21230

It looks like there are stale entries in the categorylinks table. This is a referential-integrity problem: a deletion occurred in the category table that probably did not happen at the same time in categorylinks, most likely because of concurrent access and incorrect recovery after a failure.

So we now need an SQL-enabled bot to check the counts and clean up the stale categorylinks entries. This causes another problem: in debugging categories (populated by templates, such as those detecting pages using obsolete or missing parameters or incorrect template parameter values), we can clean up the listed pages by correcting these template invocations (most of these cleanup actions are performed by bots).

However, the page reappears in the category some minutes later, because apparently it is found by a query against "categorylinks" only. When we null-edit these pages, they disappear from the category immediately, because the query cache will only list the entries found in the "category" table. A few minutes later (about 5 minutes), even though no edit was performed, the same pages reappear for no apparent reason, because the stale entries in "categorylinks" are still there and were not deleted by the null edit.

Checking all categories now could take a lot of time and SQL resources. However, a null-edit action should be able to clean up the categories that the null edit is trying to refresh: it should not only look at what is in the "category" table but also check whether the null-edited page appears somewhere in "categorylinks", and delete the unwanted stale entries. If the two "SELECT count(*)" results from the two tables are inconsistent, it should list the existing entries, compare them to what is expected for the null-edited page, and delete the superfluous entries.
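
As a sketch of that per-page cleanup (the page_id and category names below are hypothetical placeholders; the expected list would come from reparsing the null-edited page):

-- delete categorylinks rows the fresh parse no longer justifies
DELETE FROM categorylinks
WHERE cl_from = 12345  -- page_id of the null-edited page (placeholder)
  AND cl_to NOT IN ('Expected_category_1', 'Expected_category_2');  -- categories from the fresh parse (placeholders)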

Note that MySQL does not properly roll back transactions (full transaction support is very costly for the server and could cause delays, huge resource consumption, and additional deadlocks). To avoid this, you should make sure that deletions or additions in the two tables are always performed in the same order (first "category", then "categorylinks"); this remains true even if transaction support is enabled (in an engine other than MySQL, such as Oracle or Postgres).

In my opinion, full transaction support only works with Oracle, Sybase, MS-SQL, Informix, Ingres or Postgres; but if transactions are not enabled (i.e. all requests are autocommitted), then you need better management of errors and their recovery (by inserting insert/update/delete requests into a queue with a callback to handle retries after a timed delay, for example by reinserting the page that failed to update completely into the existing update queue).

If this is not specific to deletion categories, the bug title should be changed.

Per Aryeh, these counts can be fixed by running populateCategory.php with the --force parameter.
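
For a single category, the recount such a script performs amounts to roughly the following (only a sketch against the standard schema, not what the script literally executes; namespace 14 = Category, 6 = File; the category name is one of the examples from the description):

UPDATE category
SET cat_pages = (SELECT COUNT(*) FROM categorylinks WHERE cl_to = cat_title),
    cat_subcats = (SELECT COUNT(*) FROM categorylinks JOIN page ON page_id = cl_from
                   WHERE cl_to = cat_title AND page_namespace = 14),
    cat_files = (SELECT COUNT(*) FROM categorylinks JOIN page ON page_id = cl_from
                 WHERE cl_to = cat_title AND page_namespace = 6)
WHERE cat_title = 'All_articles_proposed_for_deletion';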

per comment 11, adding "shell" keyword

(In reply to comment #12)

http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=436694300#Category:Images_lacking_a_description
is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories with less than 200 pages in them, and 63 < 200.

(In reply to comment #14)

(In reply to comment #12)
> http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=436694300#Category:Images_lacking_a_description
> is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories
with less than 200 pages in them, and 63 < 200.

I see why. We don't refresh the category counts if everything is 0 (or specifically if everything is right except for the images, and the image section is really 0 or >200). This behaviour is kind of weird. (Also, we only check whether a specific section is correct, which could result in recounting a 200,000-page category just because it happens to have 5 images in it and the image count is wrong.) [This all seems kind of wrong, but it's off topic for this bug, so I'll stop talking about that and either fix it or split it off to some other bug.]

Anyhow, the upshot of all this is: if you add a single image to that category, view the category page (this is the important part, you need to view it while it has the single image in it), and then remove that image from the category, it will probably reset the file count for that category - no shell required.

Has any progress been made on resolving this issue?

Betacommand: No, otherwise it would be mentioned here. :)

Reedy added a comment.May 16 2013, 9:24 PM

Does this need doing everywhere?

hercule.wikipedia wrote:

I confirm this is not only for deletion categories. An example on fr.wiki is:

http://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Bon_article

Today it counts 1850 pages + 3 subcategories, but via the API I retrieve 1863 items.

hercule.wikipedia wrote:

(In reply to comment #20)

http://commons.wikimedia.org/wiki/Category:Non-empty_disambiguation_categories
http://commons.wikimedia.org/wiki/Category:Non-empty_category_redirects and
http://commons.wikimedia.org/wiki/Category:Broken_category_redirects sometimes take
weeks before becoming correct. It became much worse last month.

I don't think this is related to this bug. Your problem is due to the cache management. That's something else.

This morning I updated every page in http://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Article_de_qualit%C3%A9 and also updated the category page. Neither changed the article count.

  • Bug 68240 has been marked as a duplicate of this bug.
tomasz removed a project: Shell.Feb 23 2015, 7:51 PM
tomasz set Security to None.
aaron added a subscriber: aaron.Dec 11 2015, 9:51 PM

In the distant past, refreshLinks ran in autocommit mode, which is partly why there are so many discrepancies with counts (due to partial updates).

I wonder how often new drift occurs. In any case, it might be useful to have a cron job refresh the counts.

Restricted Application added a subscriber: Aklapper. Dec 11 2015, 9:51 PM
Meno25 removed a subscriber: Meno25.Feb 22 2016, 6:06 PM
TTO added a comment.Jan 24 2017, 1:37 PM

Just for the record, here are some stats that I've dug up while working on T18765: Write a maintenance script to refresh category member counts:

Out of 1,555,008 rows in enwiki's category table, there are 15,476 rows with an incorrect cat_pages value. 57 of these have drifted significantly (more than 200 away from the real value), and 45 have cat_pages < 0. Fewer than 300 categories have a cat_subcats miscount, and a similar number for cat_files.

I wonder how often new drift occurs.

wikidatawiki is relatively new (started in late 2012). There are 19 miscounts for cat_pages out of 4,511 total rows in the category table, and no miscounts for cat_subcats or cat_files. Figures are similar for enwikivoyage, a site of a similar age. lrcwiki (a small wiki started in mid-2015) has 12 miscounts out of 861 category rows. So they still happen.

matmarex renamed this task from PAGESINCATEGORY inaccurate for large categories to Number of category members (PAGESINCATEGORY) is inaccurate for large categories.Feb 1 2017, 3:35 PM

BTW, I have no proof of this, but it was discovered recently and it would fit (maybe?) the issues mentioned here: T163337 (some kind of refresh being executed twice and subtracting more than once). I am speaking without seeing the code; I do not know whether an addition/subtraction is done or a full count is done each time.

Verdy_p added a comment.EditedMay 20 2017, 2:13 AM

I don't understand why the counters are not simply refreshed via a scheduled job on the job queue: the current counter value would help schedule this refresh at some point in the future (if no other schedule has been programmed, using a delay based on the last time the counter was refreshed plus a delay depending on the current counter value; if this still falls in the past, schedule the job to run immediately).

When the scheduled job runs, it will compute a COUNT(*) via an SQL query, set the actual value, and mark the category page as "touched" (to invalidate its cached rendering if the page uses PAGESINCATEGORY, and to allow the page to be recategorized if it is the category page itself, since it will be refreshed). The job queue will also record the last date the job ran, and the job does not need to reschedule itself (rescheduling happens when member pages change their own categorisation).

This way the counters will not stay out of sync for very long; they will eventually be synchronized and cleaned regularly, but not so often that accesses become costly. (The SQL query to COUNT(*) members of a specific category should already be optimized using indexes and should not be very costly.) The job queue could use a job-specific schedule queue so that counters for multiple categories can be refreshed in the same SQL session: the standard job queue would not manage individual categories itself, but only trigger the dedicated job, which performs multiple updates in a batch from its own queue, probably prioritizing categories whose last estimated counter was smallest or whose refresh was requested long enough ago (e.g. more than 24 hours, before all others).

This job could also be scheduled manually by system admins for specific forgotten categories, using some admin tool, an SQL stored procedure run from the SQL console, or a restricted PHP or MediaWiki maintenance special page (restricted only to avoid DoS attacks that try to degrade server performance with too many concurrent costly requests over a large number of categories). I understand that recounting some heavily populated categories may be costly, but the SQL backend should still be able to perform a COUNT(*) of members of a specific category cheaply even if that category is very populated, while basic page edits would simply use a fast but optimistic "UPDATE [table] SET counter=counter[+/-]1 WHERE [categoryname]" and also schedule the category on the job queue.
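
In this scheme a page save would only issue the cheap optimistic adjustment, and the scheduled job would later run a full recount like the one sketched earlier. As a sketch (category name reused from an earlier comment, purely as an example):

-- on page save: fast but optimistic, may drift if the update is lost
UPDATE category SET cat_pages = cat_pages + 1 WHERE cat_title = 'Mann';

-- on page removal: guard against driving the stored count negative
UPDATE category SET cat_pages = cat_pages - 1 WHERE cat_title = 'Mann' AND cat_pages > 0;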

jcrespo added a comment.EditedMay 22 2017, 9:03 AM

I don't understand why counters are not just refreshed using a scheduled date on the job queue
COUNT(*) for members in a specific category should be already optimized using indexes and should not be very costly

How do we count 10M rows with the current schema? https://commons.wikimedia.org/wiki/Category:CC-BY-SA-3.0 If we do it without locking, by the time we finish counting, the counter is already outdated. If we do it with locking (SERIALIZABLE/SELECT FOR UPDATE), we block new items from being added, causing high contention. Of course, this is not a new problem that only we have, but 1) we have a very inefficient structure that makes things very slow, and 2) we have a structure that favours high contention.

Note that I am not disagreeing with you; more or less, what you propose was already going to be done at https://gerrit.wikimedia.org/r/333917 (the non-locking model). What I am saying is that we should maybe start planning for these kinds of issues and change the references model to "solve" problems like this or make them trivial.

but the SQL backend should still be optimized to perform a COUNT(*) of members in a specific category with a small cost even if that category is very populated

Ok, let's design that optimization :-) - we can start by normalizing titles out of references. Should we also do some kind of sharding to avoid contention and maybe allow parallel counting? More ideas?
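
One non-locking approach (only a sketch, not what the Gerrit change above implements) is to count in primary-key batches from application code, so no single statement scans all ten million rows in one long transaction; the total is only approximate if members change while the loop runs:

-- run repeatedly from application code, feeding MAX(cl_from) of the previous
-- batch back in as the lower bound (0 on the first iteration), and summing batch_count
SELECT COUNT(*) AS batch_count, MAX(cl_from) AS last_seen
FROM (
    SELECT cl_from
    FROM categorylinks
    WHERE cl_to = 'CC-BY-SA-3.0' AND cl_from > 0
    ORDER BY cl_from
    LIMIT 10000
) AS batch;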

TTO changed the status of subtask T170737: Run recountCategories.php on Wikimedia wikis from Stalled to Open.Jul 22 2017, 1:09 AM
Fuzzy added a comment.Oct 22 2017, 2:57 PM

Is there any workaround to force recount for a given category?

No, there isn't (other than removing all pages from it, but please don't do that).

JLJ001 added a subscriber: JLJ001.May 31 2018, 9:21 PM
Jeff_G added a subscriber: Jeff_G.Aug 28 2018, 12:48 PM
Izno added a subscriber: Izno.Nov 7 2018, 2:08 PM
Reedy changed the task status from Open to Stalled.Feb 10 2019, 1:32 AM
Roy17 removed a subscriber: Roy17.Feb 24 2019, 10:39 PM

No, there isn't (other than removing all pages from it, but please don't do that).

"Removing all pages from it" would not help. https://commons.wikimedia.org/wiki/Category:Flickr_review_needed is constantly emptied and filled up, but {{PAGESINCAT:Flickr review needed|R|files}} still returns a wrong tally (+42 from the actual number).

That was not the case when I made that comment. At the time, a category would be recounted whenever someone viewed the category description page if the tally was <200 and if it did not match the actual number of items shown on the page. Commits rMWde75c4e63bd6: Avoid triggering Category::refreshCounts() on HTTP GET requests and then rMW9a2ba8e21d82: Reduce frequency of refreshCounts() calls in LinksDeletionUpdate changed this logic, and it seems currently categories are only recounted whenever a page is removed from the category and the tally is <=0, so that category will indeed never be recounted.