Unexplained variation in the output of {{NUMBEROFEDITS}} on nowiki.
Closed, Invalid; Public

Description

Nowiki users noticed that their project's edit counter was showing (via WikiApiary - https://wikiapiary.com/wiki/Wikipedia_(no)#chart1) a massive drop of more than 1 million edits. This should be impossible, since the figure comes from Special:Statistics, which simply counts the number of rows in the revision and archive tables. Investigation found no discrepancy between the current data (more than 1 million below the peak) and the database, and none between the current data and the most recent dumps, which leaves the question of how and why Special:Statistics overcounted.

updateArticleCount.php or equivalent sitestats refresh was run on some wikis around 2014-11-29, correcting some mistakes in the on-wiki statistics (cf. https://bugzilla.wikimedia.org/show_bug.cgi?id=10834 ). We don't yet know what triggered the update.


Copy/pasting report by user:Jeblad on Meta:
"It seems that on 29 Nov. or shortly before there was a drop in the number of articles at nowiki and fiwiki. For nowiki it was from about 435k to 401k articles. If this drop is just a change in how the database counts the articles then it doesn't matter much, but if this is a real issue then it should be fixed. This is much too large a drop to be neglected. [...] Compare wikiapiary:Wikipedia (no)#chart1 with wikiapiary:Wikipedia (da)#chart1 or wikiapiary:Wikipedia (fi)#chart1."
https://wikiapiary.com/wiki/Wikipedia_(no)#chart1
https://wikiapiary.com/wiki/Wikipedia_(da)#chart1
https://wikiapiary.com/wiki/Wikipedia_(fi)#chart1

Event Timeline

Elitre raised the priority of this task to Needs Triage.
Elitre updated the task description.
Elitre changed Security from none to None.
Elitre subscribed.
Stryn renamed this task from Drop in articles on nowiki and fiwiki to Drop in articles on nowiki. Dec 1 2014, 8:36 PM
Aklapper renamed this task from Drop in articles on nowiki to Sudden drop in number of articles on nowiki on Nov29 (by 34k articles). Dec 3 2014, 8:12 PM

I admit I have no idea who could investigate this - probably someone with shell access first? Or is this already Analytics territory?

Aklapper triaged this task as Medium priority. Dec 3 2014, 8:15 PM

This isn't a minor glitch; the drop in the article count is comparable to one year's production on nowiki. The loss in edits is about 5% of the complete production.

One idea was that it could be DISAMBIG, but that magic word was added to templates several weeks earlier. The change in pages vs. articles doesn't add up either. Another thing is that the actual number of edits has also changed, which should not happen after such a change.

A first check of whether this is a real problem could be to compare a sorted list of page titles against a backup; if there are missing titles, we know for sure that this is a real problem. Then the revision IDs should be checked against a backup.
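The title comparison suggested here could be sketched roughly as follows. This is only an illustration: the function name and the sample titles are hypothetical, and a real check would read the title lists from a dump and from the live database.

```python
def missing_titles(backup_titles, current_titles):
    """Return titles present in the backup but absent from the current list."""
    # Set difference ignores ordering and duplicates; sort for stable output.
    return sorted(set(backup_titles) - set(current_titles))


# Hypothetical sample data, not taken from nowiki.
backup = ["Bergen", "Oslo", "Trondheim"]
current = ["Bergen", "Oslo", "Trondheim", "Stavanger"]

# An empty result means no title from the backup has disappeared.
lost = missing_titles(backup, current)
```

The same function could be reused for revision IDs by passing lists of IDs instead of titles.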

Some pointers to discussions

The drop in articles and subsequent increase in pages can be found on Pages and Articles, the very strange increase in editors on Active Users and Users, and the even stranger drop in number of edits on Edit Count.

I think the community at nowiki needs reassurance that there isn't a major loss of articles and content; that is the most important thing right now. Then the editors need reassurance that their own edits are not lost.

To be absolutely clear, we don't maintain WikiApiary. If the bug is only appearing there, you probably want to contact the people who do, because it's their problem. If the bug is also appearing on our analytics tools, well, that's our problem.

I have looked into the database records. I cannot find any unusual numbers of moves or delete actions since the beginning of November that would indicate WikiApiary's reporting is accurate. I cannot find any substantial variation in the number of edits per month, in the last year, that have not been deleted. I cannot find any substantial variation in the number of edits, per month, in the last year, that have been deleted.

I don't think this is a nowiki problem. I think this is a problem with WikiApiary. If you can point me to some non-third-party data that suggests a problem has occurred, I will dig into it. But right now I don't see any evidence at our end to indicate that anything has happened - both because of the data checks I have run, and the fact that for there to be an actual problem, 5% of the wiki would have to have vanished and not a single editor report that an article they wrote has gone into the aether.

To quote myself from Erik Möller's page on Meta:

It seems that on 29 Nov. or shortly before there was a drop in the number of articles at nowiki and fiwiki. For nowiki it was from about 435k to 401k articles. If this drop is just a change in how the database counts the articles then it doesn't matter much, but if this is a real issue then it should be fixed. This is much too large a drop to be neglected. From m:User talk:Eloquence#Phabricator is down, drop in articles on nowiki and fiwiki

and the reply on my user page at meta

The number that has dropped is reported in the thread w:no:Wikipedia:Torget#30 000 artikler borte? and is about the number {{NUMBEROFARTICLES}} as used on w:no:Special:RecentChanges (w:no:MediaWiki:Watchlist-summary) and w:no:Spesial:Statistikk. As a special page this is not cached. The drop is noticed by several users, including me, and the previous number is also noticed and referred by news outlets, for example this one Ønsker kvinner på Wikipedia (it is about a seminar on gender gap) is using the number 435 526 while the number right now is 401 062. It is also worth noting that wikiapiary:Wikipedia (no)#chart1 shows a similar strange discontinuity. From m:User talk:Jeblad#Re: no.wiki article count

and the original post by ooo86 at our Bazaar

What happened to the article number here? Special:Statistics says we now only have 401,013 content pages. Earlier today it was well over 435,000 articles and has increased uniformly until now. Why have over 30,000 articles disappeared in the afternoon? (I have both deleted the cache and scratched my head) From w:no:Wikipedia:Torget#30 000 artikler borte?

Unless a lot of people, including newspapers, are making bogus claims, the drop reported by Special:Statistics, the magic word NUMBEROFARTICLES, and WikiApiary did happen. I think it is a waste of time to discuss whether this happened; the only thing we should focus on is why and how. The simple solution is to make some diffs between greps of backups and dumps from the database.

The last database dump was on 11 November, so that wouldn't really help. Again, there is no unusual move activity or delete activity; I suspect a bug in NUMBEROFARTICLES, which is replicated to/from Special:Statistics (so, only a single source there) which is, in fact, cached; sure, the page is not cached at the server layer, but the underlying data is.

The only alternative is that:

  1. 5% of the articles, and correspondingly the revisions, on the wiki, vanished
  2. Not one reader noticed a missing article that was previously there
  3. Not one editor noticed an article they were reading that was missing.

Can anyone point to an article that is missing? Liaising with some people on the Platform team to see if they can think of an explanation, here.

https://github.com/wikimedia/mediawiki/blob/master/includes/SiteStats.php#L306-L334

[15:43:37] <Reedy> reedy@tin:~$ mwscript updateArticleCount.php nowiki --update
[15:43:37] <Reedy> Counting articles...found 401204.
[15:43:37] <Reedy> Updating site statistics table... done.
[15:43:37] <Reedy> reedy@tin:~$
Nemo_bis subscribed.

No articles vanished, Special:Statistics/NUMBEROFARTICLES was simply wrong.
https://meta.wikimedia.org/w/index.php?title=User_talk:Jeblad&diff=10674624&oldid=10050043

To be entirely clear, for Jeblad's sake: Special:Statistics' article count is not generated by "number of articles", it's generated by the number of articles with _links to them_. Accordingly, the removal of templates, for example, can make a big difference.

The number I refer to is the number on the top right side in Special:RecentChanges which is generated by a template that uses {{NUMBEROFARTICLES}}, which is the same number generated by Special:Statistics.

Any change to a template should not change the number of edits. I have heard some say, some even several times, that there is no problem. Still, I have not seen any analysis that supports their claims, although I see in several places that there is a notable drop. Sorry, but the explanation is not good enough.

I cannot give the community this explanation; if any of you want to do it, feel free.

Again, NUMBEROFARTICLES and Special:Statistics' count are the same count, generated through the same method. This is based on the number of articles with links to them, not the number of articles that exist. Accordingly, if you delete a template that links multiple articles together, and that template is the only link for each article, those articles will vanish from the NUMBEROFARTICLES number.

I have analysed data, as I told you. Not the data that wikiapiary has, which is based on Special:Statistics. Not the data that Special:Statistics has, which is based on the number of inbound links, but the actual data from the actual database on the number of pages, the number of edits, the delete and move actions, so on and so forth. There was no unusual variation in deletes or moves, there is no indication there is any database instability, and Special:Statistics does not represent the actual number of pages. It never has.

As for the number of edits, well, if the template was a particularly prominent one it has a lot of edits associated with it that no longer exist for the purpose of counting them. So it would make sense. To put this to bed, I'm going to run another query looking at exactly this. Would that help?

Hmn. Actually, it looks like the edit number should include edits in the archive table. This is interesting.

Opened a WikiApiary issue here to see if they have any ideas. I still suspect this is a caching or data artefact, but we'll see.

Huh; the database agrees with the new number, using the same methodology. So, either we dropped rows, or at some point the script massively overcounted.
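The "same methodology" here is, per the task description, a count of the rows in the revision and archive tables. A minimal sketch of that count, using an in-memory SQLite database as a stand-in (the toy schema and the helper names are assumptions, not the production setup):

```python
import sqlite3


def count_edits(conn):
    """Total edits = live revisions + archived (deleted) revisions."""
    (live,) = conn.execute("SELECT COUNT(*) FROM revision").fetchone()
    (archived,) = conn.execute("SELECT COUNT(*) FROM archive").fetchone()
    return live + archived


def make_demo_db():
    """Build a toy database with 5 live and 2 archived revisions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE archive (ar_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO revision (rev_id) VALUES (?)",
                     [(i,) for i in range(1, 6)])
    conn.executemany("INSERT INTO archive (ar_id) VALUES (?)",
                     [(i,) for i in range(1, 3)])
    return conn


total = count_edits(make_demo_db())
```

If a recount like this agrees with the lower number, the stored counter was the thing that was off, not the tables.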

Okay, will check the dumps; this looks like it could be a problem. I'll report back; if it /is/, this becomes Ops's issue. Hrm.

Went through the most recent XML dump (October) and found <=13,470,713 revisions.*

*Grabbed the archive and rev_deleted revisions from the db and added those in too, since we don't put those in the dumps. This matches expectations based on the current total, and torpedoes the idea that 5% of content vanished and nobody noticed, so it's no longer an analytics or db issue. I imagine Core, though, should really look into how exactly Special:Statistics overcounted edits for months on end. Will reassign on that basis.

Ironholds removed Ironholds as the assignee of this task.

Thanks, the important thing is that nothing is lost. If that means we told some journalist we had more articles/edits than we in fact did, then we can live with that!

it's generated by the number of articles with _links to them

This is not correct.
https://www.mediawiki.org/wiki/Manual:Article_count

Nemo_bis claimed this task.

I imagine Core, though, should really look into how exactly Special:Statistics overcounted edits for months on end. Will reassign on that basis.

And closing as invalid accordingly, because this is expected behaviour.

Nemo_bis changed the task status from Declined to Invalid. Dec 6 2014, 1:22 PM
Nemo_bis edited projects, added MediaWiki-General; removed MediaWiki-Core-Team.

Ah. What? Nemo, please read my actual commentary. This is not to address the articles issue, this is to address the fact that Special:Statistics was /definitely/ overcounting edits by around 1m. I'm reopening the bug and will modify accordingly.

Hello all. I'm the person that runs WikiApiary with the help of a number of other contributors (kghbln, nemo_bis, others).

Just to clarify how WikiApiary gets all of its data: it uses the MediaWiki API. The bots do collect Special:Statistics for very, very old wikis with no API or ones with the API disabled. But certainly all Wikimedia wikis are collected via the API.

FWIW, there have been no changes to the bots recently.

Thanks for the comment! I imagine the API probably gets its data from the same underlying source. Hrn.

Ironholds renamed this task from Sudden drop in number of articles on nowiki on Nov29 (by 34k articles) to Unexplained variation in the output of {{NUMBEROFEDITS}} on nowiki. Dec 7 2014, 2:41 AM

All of the data WikiApiary collects is in MySQL in a time-series format. Would there be any queries that would be helpful for me to run? Worth seeing if other wikis have had drops in edit count?

That would be tremendously helpful! Thanks so much!

Also, FWIW, this edit shows what commit was running before and after the edit count changed.

https://wikiapiary.com/w/index.php?title=Wikipedia_%28no%29%2FGeneral&diff=1526811&oldid=1513562

This comment was removed by thingles.

Okay, turns out I didn’t need SQL. Semantic MediaWiki to the rescue! WikiApiary currently knows of only 15 wikis with negative edit drops. Wikipedia (no) is the ONLY one in Wikimedia showing this. I can tell this using my "Has statistic edit index" and checking if it is negative.

The data is here (view source of page to see the ask queries): https://wikiapiary.com/wiki/User:Thingles/Negative_edits

I certainly wouldn't assume that all negatives are an indicator of a bug. Sometimes the wiki admins muck with things. It is worth noting that some of these wikis show patterns just like Wikipedia (no), including changes to pages and articles.

https://wikiapiary.com/wiki/A_Complete_Guide_to_Super_Metroid_Speedrunning_(Incomplete)

Others clearly are admins resetting databases.

https://wikiapiary.com/wiki/Botnets

Hurm. Okay, this is weird. Good find on the Metroid one, though - just the same pattern.

Just more data, but the number of registered users has a jump at this time too.

https://wikiapiary.com/wiki/Wikipedia_(no)

I also looked at the graph of Images and it’s fairly odd, with a notable spike at that same time.

screenshot.png (296×986 px, 23 KB)

Suspicious that the drop in number of edits seems to happen almost precisely at 12:10 am according to the graphs on wikiapiary

That may just have been the WA runtime. The problem is not really the drop, in my mind - it's the rise. There were never 14m edits on nowiki, unless the db dropped a million and nobody noticed. And yet there are many runs in a row where the number creeps closer and closer to 14 before plummeting. So the question is not only "what was broken at 12:10?" but also "what was broken before that?"

The stats counter is a rough estimate. Being off by a million is more than I would expect, but not shockingly more. Having the count go down unexpectedly is really weird and would be a sign of something going wrong. (Unless someone runs the updateArticleCount.php script. Not sure if Reedy's comment above is meant to imply that the update script was run on that date, in which case mystery solved, or if that was a later run.)

it's generated by the number of articles with _links to them

This is not correct.
https://www.mediawiki.org/wiki/Manual:Article_count

See the relevant code - https://github.com/wikimedia/mediawiki/blob/master/includes/SiteStats.php#L306-L334

Which confirms the manual is correct. :) pl_from

Nemo_bis unsubscribed.

OTOH, that's not actually the relevant code... That's only used when regenerating stats. You're looking for WikitextContent::isCountable and WikiPage::isCountable. Either way, Nemo is correct. Articles are counted that have links on them (not links to them), except on cswikinews, enwikibooks, ptwikibooks, and guwikisource.

The number of images has dropped due to a guy messing around with the local images, insisting on uploading them to Commons. I would not spend too much time on that.

There is a ticket in OTRS, Ticket#2014120710006825 - Something is wrong with Wikipedia. It is just a report that something is wrong with the count.

This code seems really weird... An article should be a page with a sufficient number of codepoints within a codepage for alphanumeric and ideographic characters, unless the page is identified as a redirect or a disambiguation page. The number of inbound/outbound links has nothing to do with whether a page is an article, except that articles should not be without parents and should not be a dead end.

I am tempted to say that all pages that are not identified as a redirect, a disambiguation page, or some other specially marked page should be identified as a content page, and any attempt to identify the page as an article should be under community control.

Is there any maintenance job, any updates, etc, that could have triggered a run of updateArticleCount.php, or is this script only run manually?

It's run manually. The only exceptions are if the number of pages is less than the number of "articles", if the number of edits is less than the number of pages, if a counter goes over 2 billion, or if a counter has an integer overflow (which should only happen if the counter goes over 2^63, except for the image counter, which seems to be limited to 2^31).
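The exceptions listed in this comment amount to a sanity check on the stored counters. A rough sketch of that logic, following only what the comment says (the function name, parameter names, and the use of a negative value as a stand-in for overflow are assumptions, not MediaWiki's actual code):

```python
# Assumed limit, taken from the "2 billion" figure in the comment above.
COUNTER_LIMIT = 2_000_000_000


def needs_recount(pages, articles, edits, limit=COUNTER_LIMIT):
    """Return True if the stored site stats look implausible."""
    if pages < articles:
        # Articles are a subset of pages, so this cannot be right.
        return True
    if edits < pages:
        # Every page has at least one edit.
        return True
    # A counter above the limit, or negative (assumed here to signal
    # an overflow), is treated as broken.
    return any(n > limit or n < 0 for n in (pages, articles, edits))
```

Note that a counter that is merely too high by a million, as discussed in this task, would pass every one of these checks.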

This code seems really weird... An article should be a page with a sufficient number of codepoints within a codepage for alphanumeric and ideographic characters, unless the page is identified as a redirect or a disambiguation page. The number of inbound links has nothing to do with whether a page is an article, except that articles usually have inbound links.

I am tempted to say that all pages that are not identified as a redirect, a disambiguation page, or some other specially marked page should be identified as a _content page_, and any attempt to identify the page as an _article_ should be under community control.

Number of outbound links. The metric is optimized for Wikipedia (and specifically Wikipedia in the early 2000s). An article with no outbound links is probably a stub. Which metric to use is configurable. There are currently three: any page, any page with a comma, and any page with a link (not counting redirects or non-content namespaces). We could potentially add some other metric if need be (keep in mind that things like how many characters of a certain class are on the page are hard to rebuild), but that's a separate issue.

Communities can identify if something is an article all they want, however, if you want to have an automated count of the total number of articles, "ask a human if X is an "article" to be counted" is not a workable metric.
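To make the three metrics concrete, here is a simplified sketch. The `Page` class and the method names "any", "comma", and "link" are illustrative assumptions; in particular, testing for `[[` in the text is only a stand-in for however links are really detected.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str
    is_redirect: bool = False
    in_content_namespace: bool = True


def is_countable(page, method="link"):
    """Decide whether a page counts as an 'article' under a given metric."""
    # Redirects and pages outside content namespaces never count
    # (assumed here to apply to all three metrics).
    if page.is_redirect or not page.in_content_namespace:
        return False
    if method == "any":
        return True
    if method == "comma":
        return "," in page.text
    if method == "link":
        # Approximation: treat any wikilink opener as an outbound link.
        return "[[" in page.text
    raise ValueError(f"unknown method: {method}")


def count_articles(pages, method="link"):
    """Count the pages that qualify as articles under the chosen metric."""
    return sum(1 for p in pages if is_countable(p, method))
```

Under the "link" metric, removing the only link from a page drops it from the count even though the page still exists, which is one way the article count and the page count can move in opposite directions.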

There are several maintenance jobs that fail regularly on nowiki, as visible on requested pages (not run since 26 Nov. 2014 at 05:08). This is quite common, so common in fact that users have stopped complaining. Other examples are old pages, dead ends, and few revisions.

This is not about "ask a human if X is an "article" to be counted", it is about a working metric. Ask any editor at Wikipedia what is an article, and he or she will never describe the actual metric in use. That is anyhow a separate issue.

The pressing issue is why the present system fails so utterly and completely. The good thing is that it happened now; having this issue after a press release saying that we had passed half a million would be BAD.

First of all, that's not the type of maintenance job I mean. updateArticleCount.php is a script to reset the article count, in case something goes wrong with it. It should only be triggered by a human (plus the exceptions that will probably never happen that I mentioned above), and only if something went wrong. If nothing goes wrong, it should never be triggered, ever.

Second, those aren't failing. They are running at the intentional rate. Different special pages update at different rates.

For wikis other than enwikipedia, there are six special pages that take a lot of time to generate and are run twice a month (which day depends on the special page). Special:WantedPages should run on the 12th and 26th of every month, old pages runs on the 8th and 22nd, dead end pages runs on the 9th and 23rd, and so on.

Other special pages that are quick to generate run more often, usually every 3 days.

You can argue that they should be updated more often if you want, but to say they are failing is incorrect. They are working as intended.
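The twice-monthly schedule described in this comment can be written down as a small lookup. Only the three pages and dates named in the comment are included; the dictionary layout and function name are assumptions for illustration.

```python
from datetime import date

# Days of the month on which each slow special page is regenerated,
# as stated in the comment (the other three pages are not listed there).
SCHEDULE = {
    "WantedPages": (12, 26),
    "old pages": (8, 22),
    "dead end pages": (9, 23),
}


def pages_refreshed_on(day):
    """Return the names of the special pages scheduled to update on `day`."""
    return sorted(name for name, days in SCHEDULE.items() if day.day in days)


refreshed = pages_refreshed_on(date(2014, 11, 26))
```

A last-updated timestamp of 26 Nov. 2014 on a page with this schedule would mean the next run is due on 12 Dec., not that a job has failed.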

For performance reasons, the Special:Statistics counts should be considered approximate. (Updates to the site stats go in a different transaction than the actual edit, so if a server explodes halfway through, the edit might be made but not counted. Other bad things can happen.) It has always been this way, and it probably always will be. If you need a number you can stand behind, it would probably be best to pull it from Tool Labs.

That said, there could very well be problems with how the metric is counted (beyond the potential for occasional server failure), and if so, they should be fixed. Nobody has really presented evidence that that is the case, or any theory on what the problem would be. At first glance, though, it is surprising that the metric is being overcounted instead of undercounted; I would expect transient server failure to undercount articles.
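A toy simulation of the failure mode described here: the edit commits first, and the counter update commits separately and sometimes fails. The failure rate and all names are hypothetical; the point is only that this mechanism can make the counter lag behind, never run ahead.

```python
import random


def simulate(n_edits, failure_rate, seed=0):
    """Return (edits actually saved, edits counted in the stats)."""
    rng = random.Random(seed)
    saved = 0
    counted = 0
    for _ in range(n_edits):
        # The edit itself commits in its own transaction.
        saved += 1
        # The stats update is a separate step that may be lost.
        if rng.random() >= failure_rate:
            counted += 1
    return saved, counted


saved, counted = simulate(n_edits=10_000, failure_rate=0.01)
```

Whatever the failure rate, `counted` never exceeds `saved` in this model, so an overcount of about a million edits needs some other explanation.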

Nemo_bis claimed this task.

Hmpf, I was re-added as subscriber. I see this was moved back to MediaWiki-General-or-Unknown. Again, for MediaWiki this is not a bug, it's expected caching of stats.

P.s.: If you care about the article count being correct, I suggest that you force Edorfbir to stop the pointless importing https://no.wikipedia.org/wiki/Spesial:Logg/import

Could you explain, please, how caching would lead to the number of edits being incorrect in an upwards fashion, for weeks on end?

@Nemo_bis, see above? I'd appreciate if anybody could provide a TL;DR recap of this conversation. Thank you!