
Schedule initSiteStats.php maintenance script regularly
Closed, Resolved (Public)

Description

Please run "initSiteStats.php" on all wikis (or at least en.voy) periodically. See [[voy:Wikivoyage:Maintenance_panel/Ghost_articles]] and the talk page.

Event Timeline

bzimport raised the priority of this task from to Lowest. Nov 22 2014, 2:33 AM
bzimport set Reference to bz57788.
bzimport added a subscriber: Unknown Object (MLST).

Considering that no one has solved bug 40009 in 1.5 years, is it possible to reset/reprocess the various Special:Statistics pages such as https://en.wikivoyage.org/wiki/Special:Statistics (I mean for each language)? The problem is that those numbers do not reflect the real number of articles, images, etc.

At least the current substantial discrepancy would be mitigated.

Thanks

With bug 64370, initSiteStats.php has been run one more time for all the wikis, so the script now works (there was a bug that blocked it).

This bug is about scheduling that script periodically, and the script is now working fine. Can the schedule be set up so that this bug can be closed too?

This schedule should keep running until bug 40009 and bug 64333 are solved, because they affect the article count.

Yes, please do this until the current open bugs affecting on-wiki stats are fixed.

I help to track stats at [[m:Wikimedia_News]] and I regularly come across "milestone" changes that I can't explain looking at on-wiki activity (new pages, imports, deletions, and often even the raw volume of RecentChanges activity can't account for the changes).

Just today, for example, q:ca: reached 2000 (increasing by 365 content pages) without corresponding on-wiki activity to account for the change. Took me half an hour to notice that a single page with 360 revisions was just imported into the wiki. So this looks like bug 40009, which just got fixed (but apparently no wikis have benefitted from the fix yet).

Anyway, more importantly, I believe there are still WM wikis that have been thousands of articles off in their article counts for years now -- although I haven't made a thorough check of the situation since May 2012. Interested parties can see [[m:User:Dcljr/Article_counts]] for a lot of information about article counts that never really went anywhere.

Waiting for that Great Day when the on-wiki stats actually mean something....

Bug 64333 has just been solved. So initSiteStats.php could be run again to eliminate the discrepancy that was produced during its life cycle.

tomasz set Security to None.

Isn't this task actually different than T68867? That task only addresses article counts. Do we know that the other site statistics (total pages, editors, etc.) don't also need fixing?

It's different, but in 99 % of cases people only care about article count. Total pages tends to be more correct and active editors is refreshed in other ways (T89027).

Last year initSiteStats.php fixed all the discrepancies in the image counts on Wikivoyage as well.

Although we have found (and corrected) the two (major?) causes of the article count discrepancies, I don't know the cause for the images, partly because under the https://it.wikivoyage.org policy we have chosen to use only Commons images, and so we don't manage such content.

EddieGP added a project: Performance-Team.
EddieGP subscribed.

Given the amount of changes that happened when Aaron ran initSiteStats.php on s3 for T186947, I think we should reconsider this. I saw the list at https://meta.wikimedia.org/wiki/Wikimedia_News#February_2018 and found the numbers quite interesting - imho this effectively means that these statistics ain't worth much if you have to expect them to be 90% off (for smaller wikis, but even for bigger ones like enwikibooks a 37% change is a lot).

Also T187585 once again shows that we're all humans and mistakes happen - which means we shouldn't rely on the incremental updates just never™ being broken.

We can surely bikeshed whether we really need to do this each month, or whether once per quarter, every six months or even once per year is enough. But I think it's reasonable to request that it happens with some kind of regularity. The 'do it manually on request' approach we've used so far seems to have led to not doing it at all for years.

Afaics there's not even much needed to resolve this besides a puppet patch that enables a foreachwiki maintenance/initSiteStats.php --update cron, and I am more than willing to do that part. The only question I'm aware of that needs resolving first is whether the Performance-Team has any concerns about that maintenance script (and thus about running it unattended).

Keep in mind that some of the large differences between old and new counts seen at m:Wikimedia_News#February_2018 may be due to bugs that have been fixed for many years. The changes seen with this latest recount (T186947#3974067) reflect the cumulative inaccuracy in stats since the affected wikis were first created (or at least since they were last recounted, which [AFAIK] most Wikibooks wikis have never been); they don't necessarily reflect the current inaccuracy in stats.

OTOH, note that the English and Portuguese Wikibooks use the 'comma' article-counting criterion, and according to T29256#305915 the 'comma' counting method has not been implemented as advertised since at least 2011 (it merely checks whether pages have positive page length), so bugs still exist with respect to article counting (at least) that simply running a script periodically will not fix.

That being said, I'm not necessarily against running initSiteStats.php periodically in addition to, or maybe instead of, the monthly running of updateArticleCount.php....

BTW, for some additional context about article counting in particular, see (if you haven't read these pages already) m:User:Dcljr/Article_counts (about the mass article-count updates of 10 May 2012) and m:Article_counts_revisited (about the mass article-count updates of 29 March 2015).

BTW, more interesting to me (since I was already familiar with the ugly history of article counting) were the large changes in total pages and page edits seen in some wikis. Why would those counts be so far off? Importing bugs? Page-deletion bugs?

There was a serious bug relating to counts of imported pages. I forget the task number or the exact details, but each revision was being counted as a new page, or something like that. The bug was fixed a few years ago.

In T59788#3980751, @TTO wrote:

There was a serious bug relating to counts of imported pages. I forget the task number or the exact details, but each revision was being counted as a new page, or something like that. The bug was fixed a few years ago.

It's the parent task to this one, T42009, but that was about the article count, not page count. Though that bug seems to have been really opaque, so it might as well have had another side-effect on page counts.

That being said, I'm not necessarily against running initSiteStats.php periodically in addition to, or maybe instead of, the monthly running of updateArticleCount.php....

I don't think instead of is a great idea unless we really schedule initSiteStats.php with a frequency of monthly, in which case there'd be no point in running both of them. If we end up with initSiteStats.php quarterly (or less often), we should keep the article recounts monthly.

EddieGP renamed this task from Run or schedule initSiteStats.php maintenance script to Schedule initSiteStats.php maintenance script regularly. Feb 21 2018, 4:06 PM
Imarlier subscribed.

Perf team talked about this a bit today, and we think it's fine to go ahead.

We just ran initSiteStats.php on all wikis over on T187845. There's no reason we couldn't do this like every other week.

Change 415066 had a related patch set uploaded (by Chad; owner: Chad):
[operations/puppet@production] Run initSiteStats twice a month

https://gerrit.wikimedia.org/r/415066

We're recounting articles on the content wikis monthly (T68867) but recounting all stats on all wikis semimonthly? That don't seem right…

Well, we should do all wikis or no wikis--no reason to omit some. We could swap to monthly; I figured twice a month would be nice. I'd suggest replacing the monthly article count update you mention with a monthly full site stats run.

Right. We can kill the job to recount articles monthly once this one (recounting everything every other week) is active; there's no need for that redundancy.

With the current implementations of updateArticleCount.php and initSiteStats.php in MediaWiki core, it is indeed redundant to run updateArticleCount.php given it is a full subset of what initSiteStats.php does.

However, running either of these frequently on all wikis seems problematic given the wgArticleCountMethod feature in MediaWiki core. The SiteStatsInit class (used by both maintenance scripts) supports the any and link methods of counting, but does not support the comma method of counting. It approximates it (for performance) with a crude test that checks whether the revision text size is more than 0 bytes.
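To make the mismatch concrete, here is a minimal Python sketch of the two rules as described in this comment. It is not MediaWiki code: the `Page` class and both function names are hypothetical, the `[[` test for the link method is a simplification, and it assumes redirects are never counted.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a content-namespace page (not a MediaWiki class)."""
    text: str
    is_redirect: bool = False


def counts_on_save(page: Page, method: str) -> bool:
    """Per-page rule applied during normal editing, as described in this thread."""
    if page.is_redirect:  # assumption: redirects never count
        return False
    if method == "any":
        return True
    if method == "link":
        return "[[" in page.text  # simplified test for an internal link
    if method == "comma":
        return "," in page.text
    raise ValueError(f"unknown method: {method}")


def counts_on_recount(page: Page, method: str) -> bool:
    """Recount rule: 'comma' is approximated by 'text is longer than 0 bytes'."""
    if method == "comma":
        return not page.is_redirect and len(page.text.encode("utf-8")) > 0
    return counts_on_save(page, method)


pages = [Page("Paris, France"), Page("Stub without punctuation"), Page("")]
on_save = sum(counts_on_save(p, "comma") for p in pages)
on_recount = sum(counts_on_recount(p, "comma") for p in pages)
print(on_save, on_recount)  # 1 2
```

The non-empty page without a comma is the one that inflates the recount relative to the on-save rule.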

We previously had an unwritten rule (as far as I knew) not to run these maintenance scripts for wikis configured with wgArticleCountMethod = 'comma' for this reason: it inflates the count too much.

The current puppet/cron for update-article-count explicitly skips these wikis (see puppet code). Rather than relying on that, we should probably update the maintenance scripts in core to skip the article update (possibly overridable with a force option), and let Puppet just iterate over all wikis instead of needing to be aware of this. Even now it's quite poor, as the puppet cron just excludes a large group of wikis to be safe, rather than hardcoding the handful of wikis that use the comma-counting method.

See wmf-config/InitialiseSettings#wgArticleCountMethod.

If we were to run these with even greater frequency, the article count on comma-method wikis would oscillate. During the normal routine of saving edits and creating pages we follow the comma rule correctly, so newly created no-comma articles correctly do not increase the counter and the count stays lower between runs; then every 2 weeks the recount would inflate the counter back up again.

I see two options here:

  • Drop the feature (logic in WikiPage/WikitextContent, gitiles).
  • Or: update the maintenance scripts so that, if the comma method is used, they programmatically iterate through all pages of the entire wiki, fetch the raw wikitext, and look for a comma (matching the logic in WikiPage used on page save). While this is in theory possible, it would require two things: 1) dedicated time (and likely WMF funds) to evaluate how this could be implemented in a way that is not very expensive or stressful on foundation infrastructure, and whether it can actually complete a full count within a reasonable timeframe; and 2) dedicated time and resourcing to implement, review and deploy the feature.
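For a rough idea of what the second option involves, the following Python sketch walks a wiki in batches and applies the comma test to each page text. Everything here is hypothetical: `fetch` stands in for whatever would load raw wikitext, the batch size is arbitrary, and redirects and namespaces are ignored. It says nothing about the real cost on production databases, which is the open question above.

```python
from typing import Callable, Iterator, List

# Hypothetical signature: fetch(offset, limit) returns up to `limit` page texts.
Fetch = Callable[[int, int], List[str]]


def iter_batches(fetch: Fetch, batch_size: int) -> Iterator[List[str]]:
    """Yield page texts in batches until fetch returns an empty list."""
    offset = 0
    while True:
        batch = fetch(offset, batch_size)
        if not batch:
            return
        yield batch
        offset += len(batch)


def recount_comma_articles(fetch: Fetch, batch_size: int = 500) -> int:
    """Count pages whose text contains a comma, one batch at a time."""
    total = 0
    for batch in iter_batches(fetch, batch_size):
        total += sum(1 for text in batch if "," in text)
    return total


# In-memory stand-in for a wiki, for illustration only.
WIKI = ["Rome, Italy", "no punctuation here", "a, b, c", ""]


def fake_fetch(offset: int, limit: int) -> List[str]:
    return WIKI[offset:offset + limit]


result = recount_comma_articles(fake_fetch, batch_size=2)
print(result)  # 2
```

Batching keeps memory bounded, but every page text still has to be read once per recount, which is why this is far more expensive than the length check.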

Well, since enwikibooks and ptwikibooks just got recounted using initSiteStats.php, their article counts are, in effect, no longer based on the 'comma' criterion. (!)

See T188472: Give up on 'comma' article-count method, which I just created.

@Krinkle could you please post (a version of) your last comment in this task to T188472, so it can inform the conversation over there?

Krinkle raised the priority of this task from Lowest to Low. Edited Mar 22 2018, 3:43 AM

Per T188472, this is now unblocked.

Change 415066 merged by Jcrespo:
[operations/puppet@production] Run initSiteStats twice a month

https://gerrit.wikimedia.org/r/415066

I'm sorry, does this mean this is now "live"? So there will be a full recount on April 15th?

EddieGP claimed this task.

I'm sorry, does this mean this is now "live"? So there will be a full recount on April 15th?

Yes, it'll start to run at 05:39 UTC on the 1st and 15th of each month from now on. So the first run will be this Sunday.
If that first run fails for whatever reason, the cron will be reverted until the problems can be fixed. However, if that happens, you'd hear about it on this task.
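For anyone who wants to work out when the next recount is due, here is a small Python sketch of the schedule described above (05:39 UTC on the 1st and 15th). In cron terms that would be something like `39 5 1,15 * *`, though the exact entry in the Puppet patch is not reproduced here. The function name and constants are illustrative, and the sample date is made up.

```python
from datetime import datetime, timezone

RUN_DAYS = (1, 15)            # days of the month, per the comment above
RUN_HOUR, RUN_MINUTE = 5, 39  # 05:39 UTC


def next_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now`.

    `now` should be timezone-aware; it is converted to UTC first.
    """
    now = now.astimezone(timezone.utc)
    year, month = now.year, now.month
    while True:
        for day in RUN_DAYS:
            candidate = datetime(year, month, day, RUN_HOUR, RUN_MINUTE,
                                 tzinfo=timezone.utc)
            if candidate > now:
                return candidate
        # No run left this month: move on to the next one.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


sample = datetime(2018, 3, 29, 12, 0, tzinfo=timezone.utc)  # hypothetical "now"
print(next_run(sample).isoformat())  # 2018-04-01T05:39:00+00:00
```

April 1, 2018 was a Sunday, which matches "the first run will be this Sunday".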