Run populateCategory.php
Open, Stalled, Low, Public

Description

Somehow the counts of pages in categories are inaccurate for a small number of categories. The parent task is about finding out how this happened in the first place. This child task focuses on fixing the data for WMF wikis.

This query shows that there are 654 categories on fawiki where category.cat_pages does not match the number of entries in the categorylinks table. On arwiki this number is 2201, and this query for enwiki will show the number there.
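For readers without access to the linked queries, the comparison can be sketched roughly as follows. This is a minimal illustration, not the Quarry query itself: it uses an in-memory SQLite database with toy stand-ins for the category and categorylinks tables (only the cat_title, cat_pages, cl_from and cl_to columns), and the sample rows and the find_mismatches helper are hypothetical.

```python
import sqlite3

# Toy stand-ins for the MediaWiki tables; only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (cat_title TEXT PRIMARY KEY, cat_pages INTEGER);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    -- Hypothetical sample data: 'Dogs' claims 3 members but has only 1 link.
    INSERT INTO category VALUES ('Cats', 2), ('Dogs', 3), ('Birds', 0);
    INSERT INTO categorylinks VALUES (1, 'Cats'), (2, 'Cats'), (3, 'Dogs');
""")

# Categories whose stored counter differs from the number of link rows.
MISMATCH_SQL = """
    SELECT c.cat_title, c.cat_pages, COUNT(cl.cl_from) AS actual
    FROM category AS c
    LEFT JOIN categorylinks AS cl ON cl.cl_to = c.cat_title
    GROUP BY c.cat_title, c.cat_pages
    HAVING c.cat_pages <> COUNT(cl.cl_from)
    ORDER BY c.cat_title
"""

def find_mismatches(connection):
    """Return (title, stored count, actual count) for each drifted category."""
    return connection.execute(MISMATCH_SQL).fetchall()

mismatches = find_mismatches(conn)
print(mismatches)  # [('Dogs', 3, 1)]
```

Note that this sketch only covers categories that have a row in the category table; links pointing at a title with no category row would need a separate check.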

It seems that on all wikis a drift has occurred at some point or over time. Therefore, I am requesting that the populateCategory.php script be re-run on all WMF wikis to correct the errors. Thereafter, we can monitor these tables to identify whether the error recurs, and do a deeper investigation into why and how.
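The kind of correction being requested amounts to recomputing each stored counter from the link rows. The sketch below shows that idea on toy tables in SQLite; it is not the implementation of populateCategory.php, and the sample data and the recount helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (cat_title TEXT PRIMARY KEY, cat_pages INTEGER);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    -- Hypothetical data: 'Dogs' has drifted (stored 3, actual 1).
    INSERT INTO category VALUES ('Cats', 2), ('Dogs', 3);
    INSERT INTO categorylinks VALUES (1, 'Cats'), (2, 'Cats'), (3, 'Dogs');
""")

def recount(connection):
    """Reset cat_pages to the real link count; return how many rows changed."""
    with connection:  # commit on success, roll back on error
        cur = connection.execute("""
            UPDATE category
            SET cat_pages = (SELECT COUNT(*) FROM categorylinks
                             WHERE cl_to = category.cat_title)
            WHERE cat_pages <> (SELECT COUNT(*) FROM categorylinks
                                WHERE cl_to = category.cat_title)
        """)
        return cur.rowcount

fixed = recount(conn)
rows = conn.execute(
    "SELECT cat_title, cat_pages FROM category ORDER BY cat_title"
).fetchall()
print(fixed, rows)  # 1 [('Cats', 2), ('Dogs', 1)]
```

A recount like this is only correct as of the moment it runs, so on a live wiki the counters can start drifting again immediately afterwards.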

Event Timeline

Huji created this task. May 24 2019, 9:44 PM
Reedy added a subscriber: Reedy. Edited May 25 2019, 5:25 PM

I ran it for fawiki last night, rather than just running it everywhere without knowing if it actually fixes the problem...

Timings for interest

real    78m20.925s
user    30m46.988s
sys     2m23.496s

Forked the query and re-ran it just now... https://quarry.wmflabs.org/query/36401 - 70 results

Just over 89% fewer rows (654 down to 70)... I don't know whether these weren't fixed by the script run or have regressed in the ~18 hours since the script finished. Most of them look to be off by one (though some are tens out)

FYI, query for enwiki timed out (no surprise there)

Huji added a comment. May 25 2019, 6:15 PM

I compared the output of the two Quarry queries; there is very little overlap. Some of the overlapping cases (e.g. Category:خطای_CS1:_تاریخ ) are categories that I know are actively changing (I have a bot working on this one, for instance) so if I were to guess I would say that the script fixed these but they drifted again.
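A comparison like the one described above can be done with a simple set intersection over the category titles returned by the two query runs. The following is a minimal sketch; the overlap helper and the placeholder titles are hypothetical and are not taken from the actual Quarry results.

```python
def overlap(before, after):
    """Return the titles present in both runs and their share of the later run."""
    before_set, after_set = set(before), set(after)
    common = before_set & after_set
    share = len(common) / len(after_set) if after_set else 0.0
    return sorted(common), share

# Placeholder titles standing in for the two result sets.
run_before = ["A", "B", "C", "D"]
run_after = ["C", "E"]

common, share = overlap(run_before, run_after)
print(common, share)  # ['C'] 0.5
```

A small overlap suggests that most of the later mismatches are new drift rather than rows the script failed to fix.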

My hope was that these drifts were few and far between, and that updating the category table would give us ample time before a new drift occurred, so we could investigate it. Now I am not sure.

Reedy added a comment. May 25 2019, 6:18 PM

I'm running it on arwiki out of interest too (as we have a number for comparison)...

I guess that on more active wikis, updating it while edits, updates, and everything else are going on could itself cause more drift; the counts are only going to be right at one specific point in time

I'm happy to run it everywhere for some consistency, but it's clear it's not going to fix the problem :)

Reedy triaged this task as Low priority. May 25 2019, 6:18 PM
Huji added a comment. May 25 2019, 6:20 PM

I don't think you should run it everywhere. At least not until we either fully fix the problem, or fix it in a way that these drifts happen rarely (say, once a day, or once a week). Only in that setting is it useful to have this script executed, so we can identify those rare drifts quickly and try to figure out how they happened.

Reedy changed the task status from Open to Stalled. May 25 2019, 7:34 PM

For arwiki...

real    100m41.125s
user    36m38.304s
sys     3m1.004s

Query forked to https://quarry.wmflabs.org/query/36403 ... 742 rows
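For comparison, the before and after figures quoted in this thread (654 to 70 on fawiki, 2201 to 742 on arwiki) work out as follows. The reduction_pct helper is just an illustrative name.

```python
def reduction_pct(before, after):
    """Percentage drop in mismatched categories between two query runs."""
    return 100.0 * (before - after) / before

fawiki = reduction_pct(654, 70)    # counts quoted earlier in this task
arwiki = reduction_pct(2201, 742)

print(f"fawiki: {fawiki:.1f}% fewer; arwiki: {arwiki:.1f}% fewer")
# fawiki: 89.3% fewer; arwiki: 66.3% fewer
```

The smaller reduction on arwiki is consistent with the guess above that a busier wiki drifts again faster, although these two data points alone do not establish that.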

Could you please run it for trwiki? Thanks.

Reedy added a comment. May 25 2019, 8:54 PM

> Could you please run it for trwiki? Thanks.

No, there's no point. It doesn't fix the problem

This comment was removed by jcrespo.

Sorry, my last comment was on the wrong ticket, apologies (ignore).

Base added a subscriber: Base. Jun 11 2020, 4:49 AM