Run populateCategory.php
Closed, Duplicate (Public)

Description

Somehow the counts of pages in categories are inaccurate for a small number of categories. The parent task is about finding out how this happened in the first place. This child task focuses on fixing the data for WMF wikis.

This query shows that there are 654 categories on fawiki where category.cat_pages does not match the number of entries in the categorylinks table. On arwiki this number is 2201, and this query for enwiki will show the number there.
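For readers without access to the linked queries, the check described above can be sketched roughly as follows. This is a minimal illustration, not the actual Quarry query: it assumes a simplified schema keeping only the cat_title, cat_pages, cl_from and cl_to columns, uses an in-memory SQLite database rather than a wiki replica, and the function names and sample rows are hypothetical.

```python
import sqlite3


def find_drifted_categories(conn):
    """Return (title, stored_count, actual_count) for every category whose
    stored cat_pages differs from the number of categorylinks rows."""
    sql = """
        SELECT c.cat_title, c.cat_pages, COUNT(cl.cl_from) AS actual
        FROM category AS c
        LEFT JOIN categorylinks AS cl ON cl.cl_to = c.cat_title
        GROUP BY c.cat_id, c.cat_title, c.cat_pages
        HAVING c.cat_pages <> COUNT(cl.cl_from)
        ORDER BY c.cat_title
    """
    return conn.execute(sql).fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up rows (assumption:
    simplified schema, not the full MediaWiki table definitions)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE category (
            cat_id INTEGER PRIMARY KEY,
            cat_title TEXT UNIQUE,
            cat_pages INTEGER
        );
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        INSERT INTO category (cat_title, cat_pages) VALUES
            ('Accurate', 2), ('Off_by_one', 3), ('Empty_but_counted', 1);
        INSERT INTO categorylinks VALUES
            (1, 'Accurate'), (2, 'Accurate'),
            (3, 'Off_by_one'), (4, 'Off_by_one');
    """)
    return conn


if __name__ == "__main__":
    for title, stored, actual in find_drifted_categories(demo_connection()):
        print(f"{title}: cat_pages={stored}, categorylinks rows={actual}")
```

The LEFT JOIN is there so that a category with a non-zero cat_pages but no remaining categorylinks rows is still reported.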

It seems that on all wikis a drift has occurred at some point or over time. Therefore, I am requesting that the populateCategory.php script be re-run on all WMF wikis to correct the errors. Thereafter, we can monitor these tables to see whether the error recurs, and investigate more deeply why and how.

Event Timeline

I ran it for fawiki last night, rather than just running it everywhere without knowing if it actually fixes the problem...

Timings for interest

real    78m20.925s
user    30m46.988s
sys     2m23.496s

Forked the query and re-ran it just now... https://quarry.wmflabs.org/query/36401 - 70 results

Just over 89% fewer rows... I don't know whether these weren't fixed by the script run, or whether they've regressed in the ~18 hours since the script finished. Most of them look to be off by one (though some are tens out)

FYI, query for enwiki timed out (no surprise there)

I compared the output of the two Quarry queries; there is very little overlap. Some of the overlapping cases (e.g. Category:خطای_CS1:_تاریخ ) are categories that I know are actively changing (I have a bot working on this one, for instance) so if I were to guess I would say that the script fixed these but they drifted again.
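The comparison described above can be sketched as plain set operations on category titles. This is only an illustration: the function names and the sample titles are hypothetical, and the two inputs stand for the title columns exported from the "before" and "after" query runs.

```python
def compare_runs(before, after):
    """Split category titles by how their drift status changed between runs."""
    before, after = set(before), set(after)
    return {
        "fixed": before - after,            # drifted before, fine now
        "overlap": before & after,          # drifted in both runs
        "newly_drifted": after - before,    # fine before, drifted now
    }


def percent_fewer(before_count, after_count):
    """Percentage reduction in the number of drifted categories."""
    return 100.0 * (before_count - after_count) / before_count


if __name__ == "__main__":
    # Hypothetical titles, for illustration only.
    result = compare_runs(["A", "B", "C"], ["C", "D"])
    print(result)
    # The fawiki counts quoted in this task: 654 before, 70 after.
    print(f"{percent_fewer(654, 70):.1f}% fewer rows")
```

Note that the overlap alone cannot distinguish a category that was never fixed from one that was fixed and then drifted again; that still needs knowledge of which categories are actively changing.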

My hope was that these drifts were few and far between, and that updating the category table would give us ample time to investigate before a new drift occurred. Now I am not sure.

I'm running it on arwiki out of interest too (as we have a number for comparison)...

I guess that on more active wikis, trying to update it while edits, updates and everything else are going on will potentially cause more drift; it's only going to be right at one specific point in time

I'm happy to run it everywhere for some consistency, but it's clear it's not going to fix the problem :)

Reedy triaged this task as Low priority. May 25 2019, 6:18 PM

I don't think you should run it everywhere. At least not until we either fully fix the problem, or fix it in such a way that these drifts happen rarely (say, once a day, or once a week). Only in that setting is it useful to have this script executed, so that we can identify those rare drifts quickly and try to figure out how they happened.

Reedy changed the task status from Open to Stalled. May 25 2019, 7:34 PM

For arwiki...

real    100m41.125s
user    36m38.304s
sys     3m1.004s

Query forked to https://quarry.wmflabs.org/query/36403 ... 742 rows

Could you please run it for trwiki? Thanks.

No, there's no point. It doesn't fix the problem

This comment was removed by jcrespo.

Sorry, my last comment was on the wrong ticket, apologies (ignore).

Running this script will not prevent the problem from re-appearing after some time, but it will at least make the values correct for now. There are other ideas around, most notably "Allow action=purge to recalculate the number of pages/subcats/files in a category". From the discussion above, it's unclear why this is "stalled". It is assigned to nobody and nobody is working on it, so I am closing it as a duplicate of T85696.

The title of this task, "Run populateCategory.php", is presumably incorrect (T170737):

"populateCategory.php" was designed to initially populate the category table when upgrading to MW 1.13. It was deleted in rMW0dacf7d68d8d517cada731375f9612d8e060db58.
"recountCategories.php" is the script that should be used.

See T85696 for continued discussion.