
Document missing project types in pagecount dumps
Closed, Resolved, Public

Description

The pagecount dump script puts a header into dump files saying

# Project-code is
#
# b:wikibooks,
# k:wiktionary,
# n:wikinews,
# q:wikiquote,
# s:wikisource,
# v:wikiversity,
# wo:wikivoyage,
# z:wikipedia (z added by merge script: wikipedia happens to be sorted last in dammit.lt files, but without suffix)

but actually there are way more project codes than that, as evidenced by this StackExchange question. The documentation header should be expanded to include those.

UPDATE: merged in from T219914: "However, looking at the actual data, it seems that Wiktionary has code d so the comment about k seems wrong. Likewise, Wikivoyage seems to have code voy, not wo. It might be good to check the others, too."

Event Timeline

Milimetric triaged this task as High priority.
Milimetric updated the task description.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric added subscribers: Sascha, Aklapper.

I checked https://dumps.wikimedia.org/other/pagecounts-ez/merged (April 2020), and these are the correct codes:

# Project-code is
#
# b:wikibooks,  ✅
# d:wiktionary, ==> NOT K
# n:wikinews,  ✅
# q:wikiquote,  ✅
# s:wikisource,  ✅
# v:wikiversity,  ✅
# voy:wikivoyage, ==> NOT WO
# z:wikipedia ==> BUT NOT MOBILE
# m:mobile wikipedia
  • The StackExchange question says they found occurrences of k but I haven't found any. d is the correct code for Wiktionary.
  • wd has two sites: www.wd and m.wd, which correspond to the desktop and mobile versions of Wikidata.
  • w, like Wikidata, has www.w and m.w. w is MediaWiki.org.
  • I find y a little more puzzling. The only site associated to it is test.y. @Tgr perhaps you can think of a test site that isn't Wikipedia?

Leaving the best part for the end. At first I thought sites like en.m, test.m, etc. were simply the mobile Wikipedias, just without the z that the docs say they should have. The z only applies to Wikipedia desktop sites.

But it's a bit more complex than that. If a .m site has a .m.m counterpart, it is a non-encyclopedic wiki project. Most of these are regional chapters (e.g. dk.wikimedia.org appears as dk.m and dk.m.m).

Here's where things get weird: what happens when the code for a chapter matches the code for a language? Let's take a look at the Dutch language/chapter. These are all the nl wiki codes pagecounts-ez has:

nl.b       // Wikibooks
nl.d       // Wiktionary
nl.m       // 🛑 Mobile Wikipedia?? Wikimedia Nederland?? Both??
nl.m.b     // Mobile Wikibooks
nl.m.d     // Mobile Wiktionary
nl.m.m     // Mobile Wikimedia Nederland
nl.m.n     // Mobile Wikinews
nl.m.q     // Mobile Wikiquote
nl.m.s     // Mobile Wikisource
nl.m.voy   // Mobile Wikivoyage
nl.n       // Wikinews
nl.q       // Wikiquote
nl.s       // Wikisource
nl.voy     // Wikivoyage
nl.z       // Wikipedia
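
To make the pattern in this list explicit, here is a rough Python sketch that splits a code into its prefix, a mobile flag, and a project name. The names `SUFFIXES`, `ProjectCode` and `parse_project_code` are made up for illustration and are not part of pagecounts-ez. It only covers the shapes seen in the nl list above (wd, w and y are not handled), and it reports a bare `.m` as ambiguous rather than guessing.

```python
from typing import NamedTuple, Optional

# Suffix -> project, as observed in the April 2020 merged files listed above.
SUFFIXES = {
    "b": "wikibooks",
    "d": "wiktionary",
    "n": "wikinews",
    "q": "wikiquote",
    "s": "wikisource",
    "v": "wikiversity",
    "voy": "wikivoyage",
    "z": "wikipedia",
}


class ProjectCode(NamedTuple):
    prefix: str              # e.g. "nl"
    mobile: Optional[bool]   # None when the code alone cannot tell
    project: str


def parse_project_code(code: str) -> ProjectCode:
    """Split a code such as 'nl.m.voy' (hypothetical helper, see assumptions above)."""
    prefix, *rest = code.split(".")
    if rest == ["m"]:
        # A bare ".m" may be mobile Wikipedia, the wikimedia.org site, or both.
        return ProjectCode(prefix, None, "mobile wikipedia or wikimedia (ambiguous)")
    if rest == ["m", "m"]:
        return ProjectCode(prefix, True, "wikimedia")
    mobile = len(rest) == 2 and rest[0] == "m"
    suffix = rest[-1] if rest else ""
    return ProjectCode(prefix, mobile, SUFFIXES.get(suffix, "unknown"))
```

For example, `parse_project_code("nl.m.voy")` yields prefix `nl`, mobile `True`, project `wikivoyage`, while `parse_project_code("nl.m")` comes back with mobile `None`.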

Let's find the pageviews for a popular article in Dutch mobile Wikipedia:
https://nl.m.wikipedia.org/wiki/Michelle_Obama
nl.m Michelle_Obama 47 A5B1F2H1I2J3K3L3M1N1O5P2Q4R1S3T2V1W4X3

So far so good. Now let's check whether pageviews for a page in the Dutch Wikimedia chapter are there:
https://nl.wikimedia.org/wiki/Over_ons
nl.m Over_ons 10 G1I1L1N2O2T2V1

Whoa, so it seems the dataset stores both the chapter and mobile Wikipedia under nl.m. How about a page that both wikis have? Let's search for Hoofdpagina (main page):

nl.m Hoofdpagina 42294 A805B548C428D382E346F501G976H1462I1817J2069K2061L2108M2172N2139O2123P2209Q2295R2622S2841T2933U2934V2963W2285X1275

There's only one entry, so nl.wikimedia's stats have either been added to Wikipedia's or been overwritten.
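
For anyone who wants to double-check rows like the ones quoted above, here is a small Python sketch that decodes the last column. It assumes each letter A to X stands for one hour of the day (A = hour 0, X = hour 23) and that the number after it is the count for that hour. I inferred that from the rows themselves, not from the script's documentation. `decode_hourly` and `parse_line` are illustrative names only.

```python
import re


def decode_hourly(encoded: str) -> dict:
    """Map hour index (0-23) to views; assumes A = hour 0 ... X = hour 23."""
    return {
        ord(letter) - ord("A"): int(count)
        for letter, count in re.findall(r"([A-X])(\d+)", encoded)
    }


def parse_line(line: str):
    """Split one row into (project, title, daily_total, hourly_counts)."""
    project, title, total, hourly = line.split(" ")
    return project, title, int(total), decode_hourly(hourly)


project, title, total, hourly = parse_line(
    "nl.m Michelle_Obama 47 A5B1F2H1I2J3K3L3M1N1O5P2Q4R1S3T2V1W4X3"
)
# Under the assumption above, the hourly counts should add up to the daily total.
assert sum(hourly.values()) == total
```

The same check holds for the Over_ons and Hoofdpagina rows, which at least suggests the reading of the format is right. It cannot tell us which wiki the views came from.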

As discussed with @Milimetric, this is something we should probably correct going forward as we move pagecounts-ez to Hadoop, but it seems pretty difficult to change retroactively. Will bring it up during standup and see what we can come up with.

The list of all wiki project codes was obtained by running egrep -o "^\S*" pagecounts-2020-04-01 | sort --unique
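
If someone prefers to reproduce that list without the shell, a rough Python equivalent could look like the sketch below. `unique_project_codes` and `codes_from_file` are made-up helper names, and the encoding handling is an assumption on my part. Note that Python sorts by code point, so the order can differ from `sort --unique` depending on locale.

```python
import re


def unique_project_codes(lines):
    """Collect the leading non-whitespace token of each line, like egrep -o '^\\S*'."""
    codes = set()
    for line in lines:
        match = re.match(r"\S+", line)
        if match:  # skip empty lines and lines that start with whitespace
            codes.add(match.group(0))
    return sorted(codes)


def codes_from_file(path):
    # errors="replace" is an assumption, in case some titles are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return unique_project_codes(handle)
```

Calling `codes_from_file("pagecounts-2020-04-01")` should then give the same set of codes, header comment markers included.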

Thanks @fdans! That's a fascinating piece of software archeology :)

I find y a little more puzzling. The only site associated to it is test.y. @Tgr perhaps you can think of a test site that isn't Wikipedia?

test.wikidata.org maybe?

Unearthing more things, from Webstatscollector's docs:

Requests to mobile sites get aggregated across projects. So “en.mw” does not refer to “PageViews to mobile site of enwiki”, but “PageViews of mobile sites of counted english wikis”. So requests to each of http://en.m.wikivoyage.org/wiki/Main_Page, and http://en.m.wikipedia.org/wiki/Main_Page will get counted towards en.mw.

Seems unwise to say the least

Oh yeah, btw, something I discovered last week: webhosts are not being passed through a toLowerCase function in the current form of Pagecounts-EZ, so view counts per article are actually being split, in the case of enwiki, across three different project codes: en.z, En.z and EN.z, which is just wonderful.
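
As a rough idea of what the fix could look like on the consuming side, here is a Python sketch that lowercases the project code and sums the split counts back together. This is not the actual Pagecounts-EZ code. `merge_case_variants` is a hypothetical helper, and it deliberately leaves page titles untouched, since titles are case-sensitive.

```python
from collections import defaultdict


def merge_case_variants(rows):
    """Sum counts for rows whose project codes differ only by letter case.

    rows: iterable of (project_code, title, count) tuples.
    Returns a dict keyed by (lowercased_project_code, title).
    """
    merged = defaultdict(int)
    for project, title, count in rows:
        # Only the project code is normalised; titles stay as they are.
        merged[(project.lower(), title)] += count
    return dict(merged)
```

With rows for en.z, En.z and EN.z on the same title, the result has a single `("en.z", title)` key holding the combined count.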