Discrepancies between wikistats and bot collected data
Closed, ResolvedPublic

Description

Reported briefly on irc:

Between http://wikistats.wmflabs.org/display.php?t=wp and https://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias/Table&curid=99149&diff=12048692&oldid=12041961 there are discrepancies in the article counts for enwiki and ruwiki (while the other projects match), and the total article count on wikistats is below 35 million even though the on-wiki total exceeds 35 million.

Creating a ticket to track this per request.

Details

Related Gerrit Patches:
operations/debs/wikistats (master): hack needed for ru.wp being https-only

Event Timeline

JohnLewis raised the priority of this task from to Medium.
JohnLewis updated the task description.
JohnLewis added subscribers: JohnLewis, Dzahn.
Restricted Application added a subscriber: Aklapper.Apr 28 2015, 9:17 PM
Dzahn added a comment.EditedApr 28 2015, 9:18 PM

thanks. this was reported by @HaeB, is he on phab?

Dzahn claimed this task.Apr 28 2015, 9:18 PM
Dzahn lowered the priority of this task from Medium to Low.Apr 29 2015, 1:58 AM
Dzahn added a comment.May 2 2015, 12:52 AM

how about:

http://wikistats.wmflabs.org/wikipedias_wiki.php

this is/should be the source for the HTML table on meta.

also, the wiki and HTML tables on wikistats.wm should be identical since they get their data from the same db backend.

Dzahn added a comment.May 2 2015, 12:55 AM

i also see over 35 million "good" articles in the grand total on

http://wikistats.wmflabs.org/display.php?t=wp

is it already fixed meanwhile? (somebody needs to manually paste the wiki code to meta to update)

Dzahn added a comment.May 2 2015, 12:56 AM

eh, unless the "EmausBot" does that. Does it really copy/paste from the correct URL?

Yes, it was me (HaeB) who reported that on #wikimedia-operations.

It's of course natural that there will be slight differences between the current wikistats list and the numbers that EmausBot captured from there some hours earlier; however, what was unusual here is that most numbers in (at least) the top 10 were identical while others differed, and that the Russian WP had 25k articles more in the EmausBot version. The latter difference is still present right now, if smaller:

wikistats: ru 1191074

EmausBot: ru 1,217,926

Dzahn added a comment.May 4 2015, 8:18 PM

Ah. So I think the first step is to verify the source EmausBot uses. Is EmausBot copying from http://wikistats.wmflabs.org/wikipedias_wiki.php ? That would be the expected behaviour, but let's check that first.

According to the bot operator, it is indeed using that data from wikistats.wmflabs. I've notified them about this ticket.

Dzahn added a comment.May 6 2015, 7:34 PM

He says he is getting it from http://wikistats.wmflabs.org/ but doesn't specify a full URL. And he seems to make changes to the wiki code syntax, so it sounds like it's not just using the wiki code I generate? I'm not sure where the discrepancies come from.

Dzahn reassigned this task from Dzahn to Emaus.May 7 2015, 5:17 PM
Dzahn added a subscriber: Emaus.

Hi @Emaus , you are the author of EmausBot, right?

Emaus added a comment.May 11 2015, 8:59 PM

Hello!

My bot gets the Wikipedia and Wikivoyage statistics directly from the wiki projects. For example, here you can see the article count of Russian Wikipedia (see the first line with numbers). And it differs from the data on Wmflabs.

As far as I know, the same problem exists on other projects (wikinews, wikibooks, etc). Some users have reported incorrect data in the wmflabs stats to me: 1, 2. It is probably caused by the switch from http to https or by some MediaWiki update on these projects.

Thanks Emaus! Good to know (I had assumed from your May 2013 talk page comment that it was still retrieving data from wikistats).

Actually I just noticed that the numbers in the table at https://wikistats.wmflabs.org/display.php?t=wp link to sources. For the Russian Wikipedia, the number in the "Good" column is currently 1191074 , but the linked source (https://ru.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&maxlag=5 ) says "'articles': 1219955".

and now it says "'articles': 1220143", so the number of articles is just growing quickly?

@Emaus Hi! So.. the whole purpose of creating wikistats originally was to do the same thing you are now doing with a different bot. It gets the statistics directly from the projects: it fetches the data from the API and is updated once daily.

https://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%A1%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D0%BA%D0%B0

and

https://ru.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&maxlag=5

are showing the same data and that is also the same thing that wikistats gets the information from.
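For reference, reading the article count out of that API response can be sketched roughly as below. This is only an illustration, not the wikistats update script: the function names are made up, `format=json` is added to the URL from this task so the response is machine-readable, and the sample body is trimmed to the one field discussed here.

```python
import json
from urllib.parse import urlencode


def statistics_url(host):
    """Build the siteinfo statistics URL for a wiki host such as 'ru.wikipedia.org'."""
    query = urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "maxlag": 5,
        "format": "json",  # assumption: not part of the URL quoted in this task
    })
    return f"https://{host}/w/api.php?{query}"


def article_count(response_text):
    """Extract the 'articles' value from a siteinfo statistics JSON response."""
    data = json.loads(response_text)
    return data["query"]["statistics"]["articles"]


# Sample response body, trimmed to the single field discussed above.
sample = '{"query": {"statistics": {"articles": 1220143}}}'
print(statistics_url("ru.wikipedia.org"))
print(article_count(sample))
```

Both wikistats and a bot reading the API directly would end up with the same "articles" value, as long as the request actually succeeds.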

... but i just found the issue. It is indeed related to ru.wp's switch to https-only. As can be seen in my HTML tables, ru.wp was last updated ~167 hrs ago, while all others have been updated within the last 24 hours.
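A check like the one that gave this away could be sketched as follows. It is a rough illustration under assumptions: the function name and the timestamps are hypothetical, and only the ~167 hour age for "ru" and the 24 hour expectation come from this task.

```python
from datetime import datetime, timedelta


def stale_wikis(last_updated, now, max_age=timedelta(hours=24)):
    """Return {prefix: age_in_hours} for wikis whose last update is older than max_age."""
    stale = {}
    for prefix, updated_at in last_updated.items():
        age = now - updated_at
        if age > max_age:
            stale[prefix] = round(age.total_seconds() / 3600)
    return stale


# Hypothetical timestamps; only the ~167 hour age for "ru" is taken from this task.
now = datetime(2015, 5, 12, 22, 0)
last_updated = {
    "en": now - timedelta(hours=3),
    "de": now - timedelta(hours=20),
    "ru": now - timedelta(hours=167),
}
print(stale_wikis(last_updated, now))
```

Any wiki that shows up in the result has silently failed to update and is worth looking at before comparing numbers.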

@Tbayer @Emaus

So yes, the issue was due to ru.wp being https-only. The update script got an error because of that when trying http. After adding a hack to use https for ru.wp the discrepancy is gone.
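The idea behind the hack, roughly sketched. This is not the actual Gerrit change (wikistats has its own code); the set name, the function name, and the http default are assumptions made for illustration.

```python
# Hypothetical list of hosts that only answer over https.
# "ru.wikipedia.org" is the case from this task.
HTTPS_ONLY_HOSTS = {"ru.wikipedia.org"}


def api_url(host, default_scheme="http"):
    """Pick https for hosts known to be https-only, else the default scheme."""
    scheme = "https" if host in HTTPS_ONLY_HOSTS else default_scheme
    return (
        f"{scheme}://{host}/w/api.php"
        "?action=query&meta=siteinfo&siprop=statistics&maxlag=5"
    )


print(api_url("ru.wikipedia.org"))
print(api_url("en.wikipedia.org"))
```

A per-host exception like this is only a stopgap; passing `default_scheme="https"` for every wiki removes the need for the special case.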

1220143 is also on http://wikistats.wmflabs.org/display.php?t=wp

So really there is no difference whether wikistats gets it from the project API or EmausBot gets it from the API. Same thing. This is the source for the wiki table:

http://wikistats.wmflabs.org/wikipedias_wiki.php

Change 210617 had a related patch set uploaded (by Dzahn):
hack needed for ru.wp being https-only

https://gerrit.wikimedia.org/r/210617

Change 210617 merged by Dzahn:
hack needed for ru.wp being https-only

https://gerrit.wikimedia.org/r/210617

@Tbayer i checked both "en" and "ru" numbers after running an update and it looks all identical to me now. I'll claim it's resolved and the issue was only caused by the special case of "ru" failing to update until now.

Dzahn closed this task as Resolved.May 12 2015, 10:59 PM
Dzahn claimed this task.

P.S. @Emaus feel free to make it fetch from http://wikistats.wmflabs.org/wikipedias_wiki.php again. That was the purpose of the tool originally.

Emaus added a comment.May 13 2015, 3:12 PM

@Dzahn , could you please update your script for wiktionaries, wikibooks, wikiquotes, etc.? There is the same problem with the Russian-language projects.

And, in addition, some of these tables are missing some newly created projects. For example, Maithili Wikipedia and Oriya Wikisource are not listed.

Dzahn reopened this task as Open.Jul 14 2015, 10:17 PM

@Emaus sorry i didn't see the notification because the bug was closed. reopened it now. I will check if something is still missing for the other projects.

> @Dzahn , could you please update your script for wiktionaries, wikibooks, wikiquotes, etc.? There is the same problem with the Russian-language projects.

@Emaus this was done here. All WMF wikis use https URLs meanwhile.

> And, in addition, some of these tables are missing some newly created projects. For example, Maithili Wikipedia and Oriya Wikisource are not listed.

Maithili Wikipedia has also been added meanwhile. (http://wikistats.wmflabs.org/detail.php?t=wp&id=306). Same with Oriya Wikisource (http://wikistats.wmflabs.org/detail.php?t=ws&id=74)

Dzahn added a comment.EditedJul 14 2015, 10:47 PM

compared number of wikis per project to https://www.mediawiki.org/wiki/Special:SiteMatrix

project: SiteMatrix / Wikistats

wikipedias: 290/290 [x]
wiktionaries: 172/171 (!) one missing
wikibooks: 121/121 [x]
wikinews: 33/33 [x]
wikiquotes: 89/89 [x]
wikisources: 65/65 [x]
wikiversities: 15/16 (!) one too many?
wikivoyages: 17/16 (!) one missing

wikiversities: it's the "beta"/multilingual site that counts extra

wikivoyages: added missing "fa" (Persian)

wiktionaries: https://meta.wikimedia.org/wiki/Wiktionary#List_of_Wiktionaries actually just lists 171, why 172 on the Special page?

wiktionaries: added missing "pnb" (thanks @JohnLewis)

wmspecials: added missing "ca.wikimedia" (Canada) and "cn.wikimedia" (China)
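Comparing only the counts shows that something is off, but not which wiki. A comparison of the language prefixes themselves could be sketched like this. The function name and the sample lists are hypothetical and shortened; only the "fa" and "beta" cases are taken from the notes above.

```python
def compare_wiki_lists(sitematrix, wikistats):
    """Return (missing_from_wikistats, extra_in_wikistats) as sorted lists."""
    sitematrix, wikistats = set(sitematrix), set(wikistats)
    return sorted(sitematrix - wikistats), sorted(wikistats - sitematrix)


# Shortened, hypothetical prefix lists for illustration.
sitematrix_wikivoyages = ["de", "en", "fa", "ru"]
wikistats_wikivoyages = ["de", "en", "ru"]
print(compare_wiki_lists(sitematrix_wikivoyages, wikistats_wikivoyages))

sitematrix_wikiversities = ["de", "en", "ru"]
wikistats_wikiversities = ["beta", "de", "en", "ru"]
print(compare_wiki_lists(sitematrix_wikiversities, wikistats_wikiversities))
```

The first list in each result names what still has to be added to wikistats; the second names entries, such as the multilingual "beta" site, that wikistats counts in addition.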

Dzahn closed this task as Resolved.Jul 14 2015, 11:12 PM
Dzahn added a subscriber: RobiH.Jul 14 2015, 11:15 PM

@RobiH also see above.. fyi. added missing projects

-jem- added a subscriber: -jem-.Aug 29 2015, 8:37 PM