
Update Pageview UDF with dialect-specific directories {hawk}
Closed, Declined · Public

Description

Missing dialect-specific directories in the pageviews definition

One of the big improvements of the new definition over the old one is
that the new one is not limited to /wiki/. It includes all of the
Chinese and Serbian dialects that have their own folder names and, as
a result, were not appearing in the old pageview counts.

James F (thanks James!) recently pointed out to me that there are
other wikis that do this; see the list at
https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems#With_Automatic_Conversion_System
These need to be factored into the new pageviews definition to avoid
culturally and nationally biased undercounting.

Event Timeline

Ironholds created this task. Mar 9 2015, 9:30 PM
Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description.
Ironholds added a subscriber: Ironholds.
Restricted Application added a subscriber: Aklapper. Mar 9 2015, 9:30 PM
kevinator edited projects, added Analytics-Kanban; removed Analytics-Engineering.
kevinator set Security to None.

A more useful thing to do than updating the documentation might be
actually updating the definition.

kevinator renamed this task from Missing dialect-specific directories in the pageviews definition to Update Pageview UDF with dialect-specific directories. Mar 12 2015, 10:41 PM
kevinator updated the task description.
kevinator updated the task description.

Not "per definition"; the changes are not in the definition. The
phabricator ticket gets registered as a def update, thence patched and
logged.

kevinator renamed this task from Update Pageview UDF with dialect-specific directories to Update Pageview UDF with dialect-specific directories {hawk}. Mar 31 2015, 4:28 AM
kevinator triaged this task as High priority.

Discussed with Oliver: on each wiki listed on the https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems#With_Automatic_Conversion_System page, except for the Anglo-Saxon one, there is a drop-down on the third tab at the top of the page that allows changing the language.
Using that feature, here are the URL patterns I get:

  • Chinese Wikipedia: /(wiki)|(zh-(cn|hk|mo|sg|tw))/
  • Serbian Wikipedia: /(wiki)|(sr-(ec|el))/
  • Kazakh Wikipedia: /(wiki/)|(w/index.php)
  • Kurdish Wikipedia: /(wiki/)|(w/index.php)
  • Inuktitut Wikipedia: /(wiki/)|(w/index.php)
  • Anglo-Saxon Wikipedia (only runic language): /wiki/

Given the current regular expression for matching the first folder of the URL, it seems we count at least all the expected pageviews, and might even count more pageviews than expected for Chinese Wikipedia.

"^(/sr(-(ec|el))?|/w(iki)?|/zh(-(cn|hans|hant|hk|mo|my|sg|tw))?)/"
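To make it easier to see what the expression above accepts, here is a minimal sketch that runs it against a few paths. The pattern is copied verbatim from the comment; the function name and the sample paths are made up for illustration, and applying the pattern as a prefix match on the URI path is an assumption about how the UDF uses it.

```python
import re

# Pattern copied verbatim from the comment above.
FIRST_FOLDER = re.compile(
    r"^(/sr(-(ec|el))?|/w(iki)?|/zh(-(cn|hans|hant|hk|mo|my|sg|tw))?)/"
)


def matches_first_folder(uri_path: str) -> bool:
    """Return True if the path starts with one of the accepted first folders.

    Hypothetical helper: assumes the pattern is applied as a prefix match.
    """
    return FIRST_FOLDER.match(uri_path) is not None


# Illustrative paths only, not taken from real request logs.
sample_paths = [
    "/wiki/Foo",       # standard article path
    "/w/index.php",    # script path
    "/zh-cn/Foo",      # variant reachable from the drop-down
    "/zh-hans/Foo",    # variant matched by the regex but not in the drop-down
    "/sr-ec/Foo",      # Serbian variant
    "/kk-cyrl/Foo",    # a folder the pattern does not list
]

for path in sample_paths:
    print(path, matches_first_folder(path))
```

Under that assumption, every pattern in the list above (/wiki/, /w/index.php, zh-* and sr-*) is covered, which is why the current expression should not undercount those wikis.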

China

Why are we overcounting Chinese?

I'd say so:
we match zh(-(cn|hans|hant|hk|mo|my|sg|tw))? while, with the language drop-down we saw yesterday, I can only access zh(-(cn|hk|mo|sg|tw))?
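For reference, the difference between the two sets of zh suffixes can be spelled out as a small sketch. The two sets are transcribed from the comment above; the variable names are only illustrative.

```python
# Suffixes accepted by the current regular expression.
matched_by_regex = {"cn", "hans", "hant", "hk", "mo", "my", "sg", "tw"}

# Suffixes reachable through the language drop-down.
seen_in_dropdown = {"cn", "hk", "mo", "sg", "tw"}

# Suffixes the regex accepts that the drop-down does not offer.
extra_suffixes = sorted(matched_by_regex - seen_in_dropdown)
print(extra_suffixes)  # ['hans', 'hant', 'my']
```

So the possible overcount is limited to the zh-hans, zh-hant and zh-my folders.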

Well, https://zh.wikipedia.org/zh-hans/ and https://zh.wikipedia.org/zh-hant/ do exist; they're just presumably not common enough to be linked that prominently(?). We should poke a zh-wiki user.

Discussed in standup this morning with the team: we are going to leave it as it is now.
The zh-(hans|hant) folders were not present in pageviews on the 2015-102[16-22], so it's not worthwhile changing the definition.
Documentation has been added to the pageview definition discussion page.

JAllemandou closed this task as Declined. Apr 3 2015, 7:08 PM