PAGESINCATEGORY inaccurate for large categories
Open, LowPublic


*[[Category:All articles proposed for deletion]]
*[[Category:All disputed non-free images]]
*[[Category:Articles for deletion]]

The page count as given by PAGESINCATEGORY for the above categories are 3034, 2460 and 2776 respectively. Manually paginating through the category contents gives 583, 241 and 745.

All three categories are populated by templates and some entries exit the category via deletion.

Version: unspecified
Severity: normal

bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz16036.
MER-C created this task.Oct 19 2008, 5:27 AM

ForoaW wrote:

Simular problems on [[:commons:category:Halo]] and [[:commons:category:Duplicate]]. Mostly emptied by bot move and deletion.

demon added a comment.Jun 3 2009, 3:56 PM
  • Bug 19068 has been marked as a duplicate of this bug. ***
IAlex added a comment.Oct 4 2009, 6:54 AM
  • Bug 20977 has been marked as a duplicate of this bug. ***
APPER added a comment.Jan 6 2010, 9:29 AM

On de.wikipedia there is maybe the same problem: "SELECT cat_pages FROM category WHERE cat_title='Mann';" gives 260098, but "SELECT count(*) FROM categorylinks WHERE cl_to='Mann';" gives 259087. I don't know if it's the same bug, because the category "Mann" (man) is normally not inserted by template but directly (I don't know any article, where it's added by template).

But wrong numbers in PAGESINCATEGORY makes this not very usable...

See also bug 12188

Does anyone have a clue what's causing this? The error seems to go in both directions (sometimes there are too many articles in the category, sometimes too few). Perhaps we don't put article saves and category updates in one transaction and if one succeeds and the other fails we get wrong numbers?

Here's a simple query to see where there are wrong numbers in the database:

SELECT cat_title as title, cat_pages as storedcount, realcount FROM category,
(SELECT cl_to, count(cl_to) as realcount FROM categorylinks GROUP BY cl_to) as rc
WHERE cl_to=cat_title AND rc.realcount!=cat_pages LIMIT 10;

(In reply to comment #5)

See also bug 12188

That's not really the same bug.

From what I've tested so far, this problem seems to be two different problems:

a) categorylinks contains obsolete entries. See cl_to='1284_establishments' for on dewiki an example. This was an imported article located at [[:de:Portal Diskussion:Peterhouse, Cambridge]]. Even though the article was moved and the category removed ( ), the reference in categorylinks remains.

b) cat_pages is inaccurate. See for example which clearly contains 32 pages + 5 subcats = 37 cats. Yet, cat_pages says only 36 pages are in that category.
(What I don't understand in this case is why PAGESINCATEGORY returns the correct amount of 37. This seems to be inconsistent with the cat_pages result on the toolserver).

By the way, all in all there are 30 results on dewiki for a cat_pages <-> count mismatches.

Does this only happen for template-populated categories?
Possibly related bugs:
Bug 16686
Bug 16765
Bug 21230

it looks like there are stale entries in table categorylinks. This is a referential constraint : one deletion occured in table category, which probably did not occur at the same time in table cateforylinks, probably due to concurrent accesses and incorrecct recovery after a failure.

So we need now a SQL-enabled bot to check the counts and cleanup the stale categorylinks entries. This causes another bug: in debuggging categories (populated by templates, such as those detecting pages using obsoleted or missing parameters or incorrect template parameter values), we can cleanup the listed pages by correcting these template invokations (most of these cleanup actions are performed by bots).

However, the page reappears in the categoy some minutes later, because apparently it is found with a request on "categorylinks" only. When we nulledit these pages, they disappear immedirately from the category, because the query cache will only list the entries found in the "category" page. A few minutes later (about 5 minutes), when no edit was really performed, the same pages reappear without any reason (no edit was performed), because the stale entries in "categorylinks" are still there and where not deleted by the nulledit.

Checking now all categories could take a lot of time and SQL rssources. However, a null-edit action should be able to cleanup the categories that the null-edit is trying to refresh : not only it should look at what is in table "category" but also if the null-edited page is found somewhere in "categorylinks", to delete the unwanted stale entries: if the two "SELECT count(*)" from the two tables are incoherent, then it should list the existing entries and compare them to what is expected for the null-edited page, and delete those superfluous entries.

Note that MySQL does not properly rollbacks transactions (a full support for transactions is very costly for the server and could cause dealys and huge consumption of ressources, plus additional deadlocks: to avoid this, you should make sure that deletions or additions in any of the two tables are always performed in the same order (first "category", then "categorylinks"): this will remain true if transaction support is enabled (in another engine than MySQL, such as Oracle or Postgres).

In my opinion, the full support for transaction only works with Oracle, Sybase, MS-SQL, Informix, Ingres, PostGres, but if transactions is not enabled (i.e. all requests are autocommitted), then you need a better management of errors and their recovery (by inserting insert/update/delete requests in a queue list with a callback to handle the retires with a timed delay, for example by reinserting the page that failed to update completely in the existing update queue).

If this is not specific to deletion categories, the bug title should be changed.

Per Aryeh, these counts can be fixed by running populateCategory.php with the --force parameter.

per comment 11, adding "shell" keyword

(In reply to comment #12)
is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories with less than 200 pages in them, and 63 < 200.

(In reply to comment #14)

(In reply to comment #12)
> is another example where this is causing issues

Something weird is happening. This bug should no longer happen on categories
with less than 200 pages in them, and 63 < 200.

I see why. We don't refresh the cat counts if everything is 0 (or specifically if everything is right except for the images, and the image section is really 0 or >200). This behaviour is kind of weird (Also, we only check if a specific section is correct, which could result in recounting a 200000 big category if it only happens to have 5 images in it, and the image count is wrong) [this all seems kind of wrong, but off topic to this bug so I'll stop talking about that and eiter fix it or split it off to some other bug]

Anyhow, the upshot of all this is - If you add a single image to that category, view that category page (this is the important part, you need to view the category page while it has the single image in it), and then remove that image from the category, it will probably reset the number of files in that category count - no shell required.

Has any progress been made on resolving this issue?

Betacommand: No, otherwise it would be mentioned here. :)

Reedy added a comment.May 16 2013, 9:24 PM

Does this need doing everywhere?

hercule.wikipedia wrote:

I confirm this is not only for deletion categorie. An example on is :

Today it counts 1 850 pages + 3 sub-categorys and via the api I retrieve 1863 items.

hercule.wikipedia wrote:

(In reply to comment #20)
empty_disambiguation_categories and take
sometimes weeks before getting correct. It became much worse last month.

I don't think this is related to this bug. Your problem is due to the cache management. That's something else.

Today I made an update of every pages in this morning, and update the category page. None changed the count of articles.

  • Bug 68240 has been marked as a duplicate of this bug. ***
tomasz removed a project: Shell.Feb 23 2015, 7:51 PM
tomasz set Security to None.
aaron added a subscriber: aaron.Dec 11 2015, 9:51 PM

In the distant past, refreshLinks ran in autocommit mode, which partly why there are so many discrepancies with counts (due to partial updates).

I wonder how often new drift occurs. In any case, it might be useful to have a chron refresh the counts.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 11 2015, 9:51 PM
Meno25 removed a subscriber: Meno25.Feb 22 2016, 6:06 PM

Add Comment