Update Pageview UDF with dialect-specific directories {hawk}
Closed, Declined (Public)

Description

Missing dialect-specific directories in the pageviews definition

One of the big improvements of the new definition over the old one is
that the new one is not limited to /wiki/. It includes all of the
Chinese and Serbian dialects that have their own folder names and, as a
result, were not appearing in the old pageview counts.

James F (thanks James!) recently pointed out to me that there are
other wikis that do this. See the list at
https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems#With_Automatic_Conversion_System
These need to be factored into the new pageviews definition to avoid
culturally and nationally biased undercounting.

Event Timeline

Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description.
Ironholds subscribed.

A more useful thing to do than updating the documentation might be
actually updating the definition.

kevinator renamed this task from Missing dialect-specific directories in the pageviews definition to Update Pageview UDF with dialect-specific directories. Mar 12 2015, 10:41 PM
kevinator updated the task description.
kevinator updated the task description.

Not "per definition"; the changes are not in the definition. The
Phabricator ticket gets registered as a definition update, then patched
and logged.

kevinator renamed this task from Update Pageview UDF with dialect-specific directories to Update Pageview UDF with dialect-specific directories {hawk}. Mar 31 2015, 4:28 AM
kevinator triaged this task as High priority.

Discussed with Oliver: on each wiki listed on the https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems#With_Automatic_Conversion_System page, except for the Anglo-Saxon one, the third tab at the top of the page has a drop-down that allows changing the language.
Using that feature, here are the URL patterns I get:

  • Chinese Wikipedia: /(wiki)|(zh-(cn|hk|mo|sg|tw))/
  • Serbian Wikipedia: /(wiki)|(sr-(ec|el))/
  • Kazakh Wikipedia: /(wiki/)|(w/index.php)
  • Kurdish Wikipedia: /(wiki/)|(w/index.php)
  • Inuktitut Wikipedia: /(wiki/)|(w/index.php)
  • Anglo-Saxon Wikipedia (only runic language): /wiki/

Given the current regular expression for matching the first folder of the URL, it seems we count at least enough pageviews, and might even count more pageviews than expected for Chinese Wikipedia:

"^(/sr(-(ec|el))?|/w(iki)?|/zh(-(cn|hans|hant|hk|mo|my|sg|tw))?)/"
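As a quick sanity check on the expression above, here is a minimal Python sketch. It assumes Python's `re` module treats this pattern the same way the UDF's regex engine does, which should hold for a pattern this simple but is not verified here. The helper name `matches_first_folder` and the sample paths are made up for illustration.

```python
import re

# First-folder pattern copied verbatim from the comment above.
FIRST_FOLDER = re.compile(
    r"^(/sr(-(ec|el))?|/w(iki)?|/zh(-(cn|hans|hant|hk|mo|my|sg|tw))?)/"
)


def matches_first_folder(path: str) -> bool:
    """Return True if the URL path starts with a folder the pattern accepts."""
    return FIRST_FOLDER.match(path) is not None


# Hypothetical paths, chosen only to exercise each branch of the pattern.
sample_paths = [
    "/wiki/Foo",       # classic article path
    "/w/index.php",    # index.php path
    "/zh-cn/Foo",      # Chinese variant shown in the drop-down
    "/zh-hans/Foo",    # Chinese variant not shown in the drop-down
    "/sr-ec/Foo",      # Serbian variant
    "/zh-yue/Foo",     # not in the variant list
    "/static/logo.png",
]

for path in sample_paths:
    print(path, matches_first_folder(path))
```

Note that the pattern accepts both /wiki/ and /w/, so the Kazakh, Kurdish and Inuktitut patterns listed above are already covered by the /w(iki)? branch.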

China

Why are we overcounting Chinese?

I'd say so:
we match zh(-(cn|hans|hant|hk|mo|my|sg|tw))? while, with the language drop-down we saw yesterday, I can only access zh(-(cn|hk|mo|sg|tw))?
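The gap between the two variant lists can be made explicit with a small set difference. This is only a sketch: both lists are copied from the comment above, and the variable names are made up.

```python
# Variants accepted by the current regular expression.
matched_variants = {"cn", "hans", "hant", "hk", "mo", "my", "sg", "tw"}

# Variants reachable through the language drop-down.
dropdown_variants = {"cn", "hk", "mo", "sg", "tw"}

# Variants the expression accepts but the drop-down does not offer.
extra_variants = sorted(matched_variants - dropdown_variants)
print(extra_variants)
```

So the possible overcount is limited to the zh-hans, zh-hant and zh-my folders.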

Well, https://zh.wikipedia.org/zh-hans/ and https://zh.wikipedia.org/zh-hant/ do exist; they're presumably just not common enough to be linked that prominently. We should poke a zh-wiki user.

Discussed in standup this morning with the team: we are going to leave it as it is now.
The zh-(hans|hant) folders were not present in pageviews on the 2015-102[16-22], so it's not worthwhile changing the definition.
Documentation has been added to the pageview definition discussion page.
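For reference, a check like the one described above (whether the zh-hans and zh-hant folders show up in the data at all) could be sketched as follows. This is not the query that was actually run: the `first_folder` helper is hypothetical, and the sample paths are invented for illustration and are not real pageview data.

```python
from collections import Counter


def first_folder(path: str) -> str:
    """Return the first path segment, e.g. '/zh-hans/Foo' -> 'zh-hans'."""
    parts = path.split("/")
    return parts[1] if len(parts) > 1 else ""


# Invented sample paths, for illustration only; not real pageview data.
paths = ["/wiki/Foo", "/zh-cn/Bar", "/zh-tw/Baz", "/wiki/Qux"]

# Count requests per first folder, then look up the two folders in question.
counts = Counter(first_folder(p) for p in paths)
rare = {folder: counts.get(folder, 0) for folder in ("zh-hans", "zh-hant")}
print(rare)
```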