
Discrepancies in historical total active editor numbers
Closed, Declined · Public

Description

Recently, the stated number of total active editors on all projects (TAE, which is featured in the report card and the previous WMF monthly reports) appears to have considerably increased retroactively for past months:

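| Month | TAE as given today [1] | TAE as given on October 12, 2014 [2] | TAE as given on July 5, 2014 [3] |
| --- | --- | --- | --- |
| Aug 2014 | 78645 | 77173 | - |
| Jul 2014 | 78025 | 76594 | - |
| Jun 2014 | 76046 | 74598 | - |
| May 2014 | 81787 | 80176 | 80402 |
| Apr 2014 | 76859 | 75342 | 75413 |
| Mar 2014 | 78615 | 77072 | 77153 |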

Any explanation? Assuming that there wasn't some massive undeletion of pages (i.e. the opposite of the usual "deletion dwindle" that can explain the slight decrease between July 5 and October 12), the only possibilities seem to be an unannounced definition change or a bug. (It also happened long after this bug fix.)

[1] https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm = https://web.archive.org/web/20150128064549/https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
[2] https://web.archive.org/web/20141012040955/http://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
[3] https://web.archive.org/web/20140705052031/http://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm

Event Timeline

Tbayer raised the priority of this task from to Needs Triage.
Tbayer updated the task description.
Tbayer added a subscriber: Tbayer.
Restricted Application added a subscriber: Aklapper. · Jan 28 2015, 7:04 AM
Tbayer set Security to None. · Jan 28 2015, 7:04 AM
Tbayer added a subscriber: Eloquence.
Tbayer updated the task description. · Feb 6 2015, 2:41 AM
Tbayer added subscribers: ezachte, DarTar.

@ezachte: can you look into this? Neither Aaron nor I can generate/audit the TAE data using the legacy definition.

I can investigate, but not before the quarterly report this week.

OK - if you want us to mark these numbers as "preliminary estimate" or such, I can add something like a footnote in the report's scorecard.

Interestingly, there is almost no difference in the 1+ and 3+ levels.

> also happened long after this bug fix

Code or dataset changes may take longer to take effect, if the report is generated on partly-prefilled data.

The first step is to check whether most of the variation comes from a single project. I see similar changes for 2012 as well, so it's probably not Wikidata.
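
That check could be sketched roughly as follows, assuming per-project 5+ editor counts for the same month are available from two releases. The project names and counts below are made-up placeholders, not Wikistats data, and `variation_by_project` is a hypothetical helper:

```python
# Hypothetical per-project counts of 5+ editors for one month,
# as reported by two different releases. Placeholder values only.
release_a = {"enwiki": 1000, "dewiki": 400, "wikidatawiki": 50}
release_b = {"enwiki": 1010, "dewiki": 402, "wikidatawiki": 250}


def variation_by_project(old, new):
    """Return (project, delta) pairs sorted by absolute change, largest first."""
    projects = set(old) | set(new)
    deltas = {p: new.get(p, 0) - old.get(p, 0) for p in projects}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)


ranked = variation_by_project(release_a, release_b)
total = sum(delta for _, delta in ranked)
top_project, top_delta = ranked[0]
print(f"{top_project} accounts for {top_delta} of a net change of {total}")
```

Because TAE is deduplicated across projects, the per-project deltas need not add up to the change in TAE itself; the ranking only hints at where to look first.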

@ezachte, have you had a chance to investigate further since February?

I just took another look and extended the above table with the numbers for the same historical months as given on Wikistats today, see below. And it gets even weirder - the historical numbers have fallen again and now seem too low instead of too high (as they were in January).

The deletion dwindle is unlikely to explain this. That effect can be assumed to generally decrease in size with time; yet for the March 2014 numbers, for example, the drop was a mere 81 from July to October, but then we lost 418 more from October to April.

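| Month | TAE given April 27, 2015 [1] | TAE given January 28, 2015 [2] | TAE given October 12, 2014 [3] | TAE given July 5, 2014 [4] |
| --- | --- | --- | --- | --- |
| Aug 2014 | 76593 | 78645 | 77173 | - |
| Jul 2014 | 76904 | 78025 | 76594 | - |
| Jun 2014 | 74153 | 76046 | 74598 | - |
| May 2014 | 79728 | 81787 | 80176 | 80402 |
| Apr 2014 | 74957 | 76859 | 75342 | 75413 |
| Mar 2014 | 76654 | 78615 | 77072 | 77153 |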

[1] https://web.archive.org/web/20150427191304/https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
[2] https://web.archive.org/web/20150128064549/https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
[3] https://web.archive.org/web/20141012040955/http://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
[4] https://web.archive.org/web/20140705052031/http://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
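
The swings between snapshots can be recomputed from the table above. A minimal sketch, using only the three months that have a value in all four snapshots (the variable and function names are illustrative):

```python
# 5+ editor totals (TAE) for the same months, ordered by the date the
# figure was read from Wikistats. Values copied from the table above.
snapshots = ["2014-07-05", "2014-10-12", "2015-01-28", "2015-04-27"]
tae = {
    "Mar 2014": [77153, 77072, 78615, 76654],
    "Apr 2014": [75413, 75342, 76859, 74957],
    "May 2014": [80402, 80176, 81787, 79728],
}


def deltas(values):
    """Change from each snapshot to the next one."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


for month, values in tae.items():
    # Net change from the October 2014 snapshot to the April 2015 snapshot.
    oct_to_apr = values[3] - values[1]
    print(month, deltas(values), "Oct->Apr:", oct_to_apr)
```

For March 2014 this reproduces the drop of 81 between July and October and the further loss of 418 between October and April.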

Spreadsheet with deduped 5+ editors, like in Tbayer's table, but all months, and with extra column for new data from May 2015

Also a chart which shows strange pattern, where most differences for May 2015 vs April 2015 peak around 2006/2007

Csv file with edits on all Wikipedias (only) combined, from 5 releases
The data in Wikistats are presented with M for million and k for thousand. After these symbols were replaced with the numbers they stand for, a very rough difference between publications could be calculated.
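
A minimal sketch of that substitution, assuming values appear as strings such as `12k`, `1.5M` or `874` (the exact Wikistats formatting is an assumption here, and both helper names are hypothetical):

```python
import re

_MULTIPLIERS = {"k": 1_000, "M": 1_000_000}


def parse_abbreviated(text):
    """Convert strings such as '12k', '1.5M' or '874' to an integer."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([kM]?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised value: {text!r}")
    number, suffix = match.groups()
    # round() hides floating-point noise such as 1.1 * 1_000_000.
    return round(float(number) * _MULTIPLIERS.get(suffix, 1))


def rough_difference(old, new):
    """Difference between two abbreviated values from different releases."""
    return parse_abbreviated(new) - parse_abbreviated(old)


print(rough_difference("1.4M", "1.5M"))
```

The result is only as precise as the abbreviation: two releases that both show `1.4M` may still differ by tens of thousands of edits.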

Numbers for May 2015 are considerably lower. That could be deliberate deletion of some sort (in theory).
But a few numbers in earlier releases went up, if only by a small amount. To my knowledge there is nothing in Wikistats that can explain this, except of course a bug in Wikistats (which seems very unlikely to me, as those scripts haven't changed in a long time) or a bug in the dump job that makes it miss some data between releases.

I'll see if I can find edits-per-user numbers that went down and then up again between releases; then maybe I can find the exact revisions which went missing temporarily (but for now this is an unproven hypothesis).

Caveat: deduplication in Wikistats was based on user name only, and thus imprecise. We always assumed that accounts that were still active were nearly all unique. The recent action to make all non-SUL accounts unique will certainly show in the numbers, and could make some names drop below the 5+ threshold. Is that in line with Nemo_bis' observation that "Interestingly, there is almost no difference in the 1+ and 3+ levels"? Will check.

> Is that in line with Nemo_bis' observation that "Interestingly, there is almost no difference in the 1+ and 3+ levels."

I'd say it is, because users with a single edit probably don't edit on multiple wikis and therefore are not affected by the deduplication mechanism. Can't tell whether that's likely to be the cause though.

The TAE given on January 28, 2015 is clearly an outlier.
In subsequent releases the data almost returned to the TAE given on October 12, 2014.
See screenshot of top-left parts of TablesWikimediaAllProjects.htm on 5 releases.

Although there are small fluctuations upwards or downwards, I consider these within normal range.
For example for Mar 2014:
[1] 77072 reports up to Aug 2014
[2] 78615 reports up to Nov 2014 (outlier)
[3] 76654 reports up to Feb 2015
[4] 76703 reports up to Mar 2015
[5] 76705 reports up to April 2015

[1] -> [3] = - 0.54%
[3] -> [4] = + 0.06%
[4] -> [5] = + 0.00%
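
The relative changes between releases can be recomputed directly from the Mar 2014 counts listed above. A minimal sketch, leaving out release [2] as the outlier (`pct_change` is an illustrative helper name):

```python
# Mar 2014 counts of 5+ editors, keyed by release number as listed above.
counts = {1: 77072, 3: 76654, 4: 76703, 5: 76705}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


for a, b in [(1, 3), (3, 4), (4, 5)]:
    print(f"[{a}] -> [{b}] = {pct_change(counts[a], counts[b]):+.2f}%")
```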

[1] https://web.archive.org/web/20141012040955/http://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
[2] https://web.archive.org/web/20150128064549/https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
[3] https://web.archive.org/web/20150428040806/http://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
[4] online now
[5] to be published

I spent some hours trying to explain the outlier (for example, does it come from one project/wiki only?) but didn't get anywhere.
I have monthly backups of the csv files for 2014-10-06 / 2014-11-05 / 2015-01-12 (weekly backups occur but get thinned out later on). I could restore all csv files for some release, rerun the dedup tool, do this three times, and compare intermediate results per project, and still be unlucky with the dates of the backups. If trends are stable again (and they seem to be), that would be over the top.

Thanks, @ezachte! Agree that the January 28 data ([2] in your overview) looks like an outlier. That said, the discrepancies between the April 27 data ([3]) and the May 12 data ([4]) charted in your first spreadsheet look pretty concerning too: up to almost 600 editors were retroactively gained for some months in 2006/2007. In other words, the much-discussed editor decline in subsequent years (recently halted/reversed) looks a bit more severe in [4]. I think it's worth investigating a bit further.

Note that SUL renaming could not only decrease TAE numbers as mentioned above (by splitting wrongly merged accounts into ones that fall below the 5 edits threshold separately) but also increase them (if the separated accounts each still remain above the threshold). Come to think of it, even without the observed discrepancies it would have seemed prudent anyway to gauge the impact of T37707 (SUL finalization) on TAE stats.
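
Both directions can be illustrated with a toy model. It assumes that accounts are merged by user name only and that the 5-edit threshold is applied to the merged total; the accounts and edit counts are hypothetical, and `count_active` is not part of Wikistats:

```python
from collections import Counter

THRESHOLD = 5  # edits per month needed to count as an active editor


def count_active(edits_by_account):
    """Count active editors when accounts are merged by user name only."""
    edits_by_name = Counter()
    for (_wiki, name), edits in edits_by_account.items():
        edits_by_name[name] += edits
    return sum(1 for total in edits_by_name.values() if total >= THRESHOLD)


# Case 1: two different people share a name, each below the threshold.
merged_low = {("enwiki", "Alex"): 3, ("dewiki", "Alex"): 3}
split_low = {("enwiki", "Alex"): 3, ("dewiki", "Alex~dewiki"): 3}

# Case 2: two different people share a name, each above the threshold.
merged_high = {("enwiki", "Sam"): 7, ("frwiki", "Sam"): 6}
split_high = {("enwiki", "Sam"): 7, ("frwiki", "Sam~frwiki"): 6}

print(count_active(merged_low), "->", count_active(split_low))
print(count_active(merged_high), "->", count_active(split_high))
```

In the first case the rename removes one active editor from the count; in the second it adds one, so the net effect on TAE depends on how the two kinds of name clashes are distributed.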

Something that I could imagine being really helpful for debugging: modify the script to also generate the actual list of editors who are counted as active in each month (or at least for some example months), and then look directly at which users were counted for one dump but not the other. (The difference between two such sets can be calculated pretty easily, for example in Python, whose set difference operation should handle sets with up to 100k members without trouble. I would be happy to do that myself for a few of them if we can make these lists available.)
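
A minimal sketch of such a comparison, assuming the script could write one user name per line for a given month and dump. The file layout, the helper names and the sample names are all hypothetical:

```python
def load_editors(path):
    """Read one user name per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def compare_editor_sets(old, new):
    """Return the names counted in only one of the two dumps."""
    return {"dropped": old - new, "gained": new - old}


# In-memory stand-ins for two lists covering the same month.
dump_a = {"Alice", "Bob", "Carol"}
dump_b = {"Alice", "Carol", "Dave", "Erin"}

result = compare_editor_sets(dump_a, dump_b)
print("dropped:", sorted(result["dropped"]))
print("gained:", sorted(result["gained"]))
```

With real lists, `load_editors` would be called once per dump; sets of roughly 100k names are small enough for this to run in well under a second.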

I see DragonBot~pmswiki in https://stats.wikimedia.org/EN/TablesWikipediaPMS.htm#sleeping
Are more bots being counted as users, perhaps?

Nemo_bis triaged this task as Medium priority. · Aug 23 2015, 8:36 AM

The historical numbers for the same month continue to jump up and down merrily, by very implausible amounts, e.g.:

| | given on Oct 21, 2015 | ... on Sep 5, 2015 | ... on Jul 13, 2015 | ... on May 15, 2015 |
| --- | --- | --- | --- | --- |
| TAE for Jan 2009 | 88935 | 87165 | 88980 | 88657 |

As suggested in June, I think that lists of users who were counted in one dump but not another might be useful for debugging.

Independent debugging could also be performed by checking whether any individual wiki exhibits such a fluctuation in the same month and whether there are significant fluctuations in the ZeitGeist for that wiki in that month.

Particularly puzzling for me are Wikisource statistics: for Italian Wikisource in November 2013, the count was 111 and is now 24. Most activity that month was from an editing drive whose contributions have certainly not been deleted.

Restricted Application added a project: Internet-Archive. · Nov 3 2015, 5:13 PM

> As suggested in June, I think that lists of users who were counted in one dump but not another might be useful for debugging.

As this task is assigned to me, let me say I don't disagree, but currently my focus is entirely on pageviews: getting those upgraded and/or debugged. And that may take a while (months).

> I see DragonBot~pmswiki in https://stats.wikimedia.org/EN/TablesWikipediaPMS.htm#sleeping
> Are more bots being counted as users, perhaps?

DragonBot~pmswiki is now counted among bots: https://stats.wikimedia.org/EN/TablesWikipediaPMS.htm#bots

Beyond the already mentioned case of deletions, see also a detailed discussion of possible causes for such discrepancies - for very active editor numbers, but they should apply here too - at https://en.wikipedia.org/wiki/User:WereSpielChequers/100%2B_editors (however, as noted on the talk page there, I'm not sure if redirects matter in this context).

I continue to think it's unlikely that such causes could explain a retroactive "growth" of more than 1700 editors within two months six years after the fact.

Nemo_bis added a comment. · Edited · Jan 24 2016, 10:40 PM

> I continue to think it's unlikely that such causes could explain a retroactive "growth" of more than 1700 editors within two months six years after the fact.

I agree. On the other hand, Hale calculated 15k multilingual editors per month in http://arxiv.org/abs/1508.07266 . Hence I still believe that the most likely cause (among those mentioned so far) is some shift in deduplication and exclusion of usernames.

Aklapper removed ezachte as the assignee of this task. · Mon, Nov 18, 11:52 AM

Removing assignee @ezachte as that Phabricator account has been deactivated. (If there are questions, it seems that @erik_zachte could be contacted.)

Restricted Application added a project: Analytics. · Mon, Nov 18, 11:52 AM
Ottomata closed this task as Declined. · Mon, Nov 18, 4:46 PM
Ottomata added a subscriber: Ottomata.

WikiStats 1 is no longer maintained.

Restricted Application removed a subscriber: Liuxinyu970226. · Mon, Nov 18, 4:46 PM