Please run "initSiteStats.php" on all wikis (or at least en.voy) periodically. See [[voy:Wikivoyage:Maintenance_panel/Ghost_articles]] and the talk page.
|Resolved||TTO||T42009 Special:Import increases NUMBEROFARTICLES for each Revision instead of each Article|
|Resolved||EddieGP||T192139 Remove monthly run of updateArticleCount.php|
|Resolved||EddieGP||T59788 Schedule initSiteStats.php maintenance script regularly|
Mentioned In
- T220936: Article count does not work on hyw.wiki
- T192139: Remove monthly run of updateArticleCount.php
- T188472: Give up on 'comma' article-count method
- T187585: Add test for WikiPage post-edit stats update
- T90468: Sprint: Wikimedia Site Requests triage/cleanup/process
Mentioned Here
- T188472: Give up on 'comma' article-count method
- T187845: Run initSiteStats.php for large.dblist wikis
- T42009: Special:Import increases NUMBEROFARTICLES for each Revision instead of each Article
- T29256: Correcting content page count at en.wikibooks and pt.wikibooks
- T186947: many statistics have fallen to 0 on azwiktionary, ruwikiquote, and ptwikisource
- T187585: Add test for WikiPage post-edit stats update
- T89027: Caching of Special:ActiveUsers is broken on small wikis
- T68867: Run updateArticleCount.php regularly
Considering that no one has solved bug 40009 in 1.5 years, is it possible to reset/reprocess the various https://en.wikivoyage.org/wiki/Special:Statistics pages (I mean for each language)? The problem is that those numbers do not reflect the real number of articles, images, etc.
That would at least mitigate the current persistent discrepancy.
With bug 64370, initSiteStats.php has been run once more for all the wikis, so the script now works (a bug had been blocking it).
Since this bug concerns the periodic scheduling of this script (which now works fine), can the schedule be set up and this bug closed too?
The schedule should keep running for as long as bug 40009 and bug 64333 remain unsolved, because they affect the article count.
Yes, please do this until the current open bugs affecting on-wiki stats are fixed.
I help to track stats at [[m:Wikimedia_News]] and I regularly come across "milestone" changes that I can't explain looking at on-wiki activity (new pages, imports, deletions, and often even the raw volume of RecentChanges activity can't account for the changes).
Just today, for example, q:ca: reached 2000 (increasing by 365 content pages) without corresponding on-wiki activity to account for the change. Took me half an hour to notice that a single page with 360 revisions was just imported into the wiki. So this looks like bug 40009, which just got fixed (but apparently no wikis have benefitted from the fix yet).
Anyway, more importantly, I believe there are still WM wikis that have been thousands of articles off in their article counts for years now -- although I haven't made a thorough check of the situation since May 2012. Interested parties can see [[m:User:Dcljr/Article_counts]] for a lot of information about article counts that never really went anywhere.
Waiting for that Great Day when the on-wiki stats actually mean something....
Last year, initSiteStats.php also fixed all the image discrepancies on Wikivoyage.
Although we have found (and corrected) the two (major?) causes of the article count discrepancies, I don't know the cause for the images, partly because under the https://it.wikivoyage.org policy we have chosen to use only Commons images, so we don't manage such content.
Given the number of changes that resulted from Aaron running initSiteStats.php on s3 for T186947, I think we should reconsider this. I saw the list at https://meta.wikimedia.org/wiki/Wikimedia_News#February_2018 and found the numbers quite interesting. IMHO this effectively means that these statistics aren't worth much if you have to expect them to be 90% off (for smaller wikis, but even for bigger ones like enwikibooks a 37% change is a lot).
Also, T187585 once again shows that we're all human and mistakes happen, which means we shouldn't rely on the incremental updates just never™ being broken.
We can surely bikeshed whether we really need to do this each month, or whether once per quarter, every six months or even once per year is enough. But I think it's reasonable to request that it happens with some kind of regularity. The 'do it manually on request' approach we've used so far seems to have led to not doing it at all for years.
Afaics there's not much needed to resolve this besides a puppet patch that enables a foreachwiki maintenance/initSiteStats.php --update cron, and I am more than willing to do that part. The only question I'm aware of that needs resolving first is whether the Performance-Team has any concerns about that maintenance script (and thus about running it unattended).
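To make the proposal concrete, here is a minimal Python sketch of what such a job boils down to: one initSiteStats.php --update invocation per wiki. Only the script name and the --update flag come from the comment above. The `php` entry point, the `--wiki=` option form, and the wiki names are assumptions for illustration, and the real foreachwiki wrapper may work differently. The sketch only builds the commands; it does not run anything.

```python
from typing import Iterable, List

SCRIPT = "maintenance/initSiteStats.php"


def build_commands(wikis: Iterable[str]) -> List[List[str]]:
    """Return one argv list per wiki, in input order, skipping blank names."""
    commands = []
    for wiki in wikis:
        name = wiki.strip()
        if not name:
            continue  # ignore empty lines, e.g. from a wiki list file
        # Assumption: the target wiki is selected with --wiki=<dbname>.
        commands.append(["php", SCRIPT, f"--wiki={name}", "--update"])
    return commands


if __name__ == "__main__":
    # Hypothetical wiki list, for illustration only.
    for argv in build_commands(["enwikivoyage", "itwikivoyage"]):
        print(" ".join(argv))
```

Building the argument lists separately from executing them keeps a dry run trivial, which seems useful while the unattended-run question is still open.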
Keep in mind that some of the large differences between old and new counts seen at m:Wikimedia_News#February_2018 may be due to bugs that have been fixed for many years. The changes seen with this latest recount (T186947#3974067) reflect the cumulative inaccuracy in stats since the affected wikis were first created (or at least since they were last recounted, which [AFAIK] most Wikibooks wikis never have been); they don't necessarily reflect the current inaccuracy in stats.
OTOH, note that the English and Portuguese Wikibooks use the 'comma' article-counting criterion, and according to T29256#305915 the 'comma' counting method has not been implemented as advertised since at least 2011 (it merely checks whether pages have positive page length), so bugs still exist with respect to article counting (at least) that simply running a script periodically will not fix.
That being said, I'm not necessarily against running initSiteStats.php periodically in addition to, or maybe instead of, the monthly running of updateArticleCount.php....
BTW, for some additional context about article counting in particular, see (if you haven't read these pages already) m:User:Dcljr/Article_counts (about the mass article-count updates of 10 May 2012) and m:Article_counts_revisited (about the mass article-count updates of 29 March 2015).
There was a serious bug relating to counts of imported pages. I forget the task number or the exact details, but each revision was being counted as a new page, or something like that. The bug was fixed a few years ago.
It's the parent task to this one, T42009, but that was about the article count, not the page count. That bug seems to have been really opaque, though, so it may well have had another side effect on page counts.
I don't think "instead of" is a great idea unless we really schedule initSiteStats.php monthly, in which case there'd be no point in running both of them. If we end up running initSiteStats.php quarterly (or less often), we should keep the monthly article recounts.
Well, we should do all wikis or no wikis--no reason to omit some. We could switch to monthly, though I figured twice a month would be nice. I'd suggest replacing the monthly article count update you mention with a full site stats run.
With the current implementations of updateArticleCount.php and initSiteStats.php in MediaWiki core, it is indeed redundant to run updateArticleCount.php, given that what it does is entirely a subset of what initSiteStats.php does.
However, running either of these frequently on all wikis seems problematic given the wgArticleCountMethod feature in MediaWiki core. The SiteStatsInit class (used by both maintenance scripts) supports the any and link methods of counting, but does not support the comma method. It approximates it (for performance) with a crude test that checks whether the revision text size is more than 0 bytes.
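The gap between the two checks can be illustrated with a small Python sketch. It models only what is described above (a comma test on save, a "more than 0 bytes" test in the recount) and is not the actual MediaWiki code. The function names and the sample pages are made up for illustration.

```python
from typing import Iterable, Tuple


def counts_on_save_comma(wikitext: str) -> bool:
    """Save-time rule as described above: the text must contain a comma."""
    return "," in wikitext


def counts_in_recount_comma(wikitext: str) -> bool:
    """Recount approximation as described above: text longer than 0 bytes."""
    return len(wikitext.encode("utf-8")) > 0


def compare_counts(pages: Iterable[str]) -> Tuple[int, int]:
    """Return (count under the save-time rule, count under the approximation)."""
    pages = list(pages)
    on_save = sum(counts_on_save_comma(text) for text in pages)
    recount = sum(counts_in_recount_comma(text) for text in pages)
    return on_save, recount


if __name__ == "__main__":
    sample = ["Paris is big, busy, and old.", "Stub without punctuation", ""]
    print(compare_counts(sample))  # (1, 2)
```

Every non-empty page without a comma is counted by the approximation but not by the save-time rule, which is where the inflation comes from.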
We previously had an unwritten rule (as far as I knew) not to run these maintenance scripts for wikis configured with wgArticleCountMethod = 'comma', because doing so inflates the count too much.
The current puppet/cron for update-article-count explicitly skips these wikis (see puppet code). Rather than relying on that, we should probably update the maintenance scripts in core to skip the article count update on such wikis (possibly overridable with a force option), and let Puppet just iterate over all wikis instead of needing to be aware of this. Even now the approach is quite poor, as the puppet cron just excludes a large group of wikis to be safe, rather than hardcoding the handful of wikis that use the comma-counting method.
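As a rough Python sketch of the skip-unless-forced idea: the decision lives in one function, and the caller simply iterates over every wiki. The function names, the `force` parameter, and the configuration mapping are hypothetical. The only fact taken from this discussion is that English Wikibooks uses the comma method.

```python
from typing import Dict


def should_update_article_count(method: str, force: bool = False) -> bool:
    """Decide whether a recount may overwrite the article count.

    'comma' wikis are skipped unless forced, because the recount only
    approximates the comma rule.
    """
    if method not in ("any", "link", "comma"):
        raise ValueError(f"unknown article count method: {method!r}")
    return force or method != "comma"


def plan_recount(methods: Dict[str, str], force: bool = False) -> Dict[str, bool]:
    """Map each wiki to whether its article count would be updated."""
    return {
        wiki: should_update_article_count(method, force)
        for wiki, method in methods.items()
    }


if __name__ == "__main__":
    # Hypothetical configuration; 'examplewiki' is a made-up name.
    config = {"enwikibooks": "comma", "examplewiki": "link"}
    print(plan_recount(config))
    print(plan_recount(config, force=True))
```

With the check inside the script, the scheduler no longer needs its own exclusion list of wikis.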
If we were to run these with even greater frequency, the counter would swing back and forth. During the normal routine of saving edits and creating pages, we follow the comma rule correctly, so newly created no-comma articles correctly do not increase the counter and the count stays lower throughout the week; then every 2 weeks the recount inflates the counter back up again.
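A toy simulation shows the pattern. All numbers are invented, and the model is deliberately crude: every new page is non-empty, only some contain a comma, and each recount resets the counter to the number of non-empty pages.

```python
from typing import List


def simulate_counter(days: int, new_pages_per_day: int,
                     comma_pages_per_day: int, recount_every: int) -> List[int]:
    """Return the displayed article count at the end of each day."""
    counter = 0          # what the statistics page shows
    nonempty_pages = 0   # what the approximate recount would report
    history = []
    for day in range(1, days + 1):
        nonempty_pages += new_pages_per_day
        counter += comma_pages_per_day   # save-time rule: comma pages only
        if day % recount_every == 0:
            counter = nonempty_pages     # recount inflates the counter
        history.append(counter)
    return history


if __name__ == "__main__":
    # 10 new pages a day, 6 of them with a comma, recount every 2 days.
    print(simulate_counter(4, 10, 6, 2))  # [6, 20, 26, 40]
```

Between recounts the counter grows by 6 a day; at each recount it jumps to the inflated total.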
I see two options here:
- Drop the feature (logic in WikiPage/WikitextContent, gitiles).
- Or: update the maintenance scripts so that, if the comma method is used, they programmatically iterate through all pages of the entire wiki, fetch the raw wikitext, and look for a comma (matching the logic in WikiPage used on page save). While this is possible in theory, it would require two things: 1) dedicated time (and likely WMF funds) to evaluate how this could be implemented in a way that is not very expensive or stressful for foundation infrastructure, and whether it can actually finish a complete count within a reasonable timeframe; and 2) dedicated time and resourcing to implement, review and deploy the feature.
- https://meta.wikimedia.org/wiki/Article_counts_revisited#What_is_to_be_done.3F, specifically how it mentions the two Wikibooks project which still use the comma count method.
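For the second option, the core loop could look roughly like the following Python sketch: walk all pages in fixed-size batches and count those whose raw text contains a comma. The `fetch_batch` callable, the batch size, and the in-memory page list are all hypothetical stand-ins; how to fetch raw wikitext cheaply at scale is exactly the open question raised above, and the sketch does not answer it.

```python
from typing import Callable, List

# fetch_batch(offset, limit) returns up to `limit` raw page texts,
# or an empty list once there are no more pages.
FetchBatch = Callable[[int, int], List[str]]


def count_comma_pages(fetch_batch: FetchBatch, batch_size: int = 500) -> int:
    """Count pages containing a comma, reading them batch by batch."""
    total = 0
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            break
        total += sum(1 for text in batch if "," in text)
        offset += len(batch)
    return total


if __name__ == "__main__":
    # In-memory stand-in for a wiki's content pages.
    pages = ["a, b", "no punctuation", "", "x,y", "plain"]
    print(count_comma_pages(lambda off, lim: pages[off:off + lim], batch_size=2))  # 2
```

Batching keeps memory use bounded, but the total work still grows with the number of pages, so the cost concern in the second option remains.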
Yes, it'll start to run at 05:39 UTC on the 1st and 15th of each month from now on. So the first run will be this Sunday.
If that first run fails for whatever reason, the cron will be reverted until the problems can be fixed. If that happens, you'll hear about it on this task.
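For anyone wanting to work out when the next run is due, here is a small Python sketch of the schedule as stated (05:39 UTC on the 1st and 15th). In standard five-field cron syntax that would correspond to `39 5 1,15 * *`, though the actual puppet definition may be written differently. The function name and the example date are illustrative only.

```python
from datetime import datetime, timezone

RUN_DAYS = (1, 15)
RUN_HOUR, RUN_MINUTE = 5, 39


def next_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now`.

    `now` should be timezone-aware; it is converted to UTC first.
    """
    now = now.astimezone(timezone.utc)
    year, month = now.year, now.month
    while True:
        for day in RUN_DAYS:
            candidate = datetime(year, month, day, RUN_HOUR, RUN_MINUTE,
                                 tzinfo=timezone.utc)
            if candidate > now:
                return candidate
        # No run left this month: move on to the next one.
        month += 1
        if month > 12:
            year, month = year + 1, 1


if __name__ == "__main__":
    example = datetime(2030, 1, 10, 12, 0, tzinfo=timezone.utc)
    print(next_run(example))  # 2030-01-15 05:39:00+00:00
```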