
Make plan for counting Global South edits and editors
Closed, Resolved · Public · Nov 3 2018

Description

When calculating Global South reading and editing metrics, we treat an unknown location as being in the Global South. This doesn't make a huge difference for active editors (20,000 → 23,000), but it turns out it makes a huge difference when counting non-bot edits (2 M → 9 M).

So when looking at what proportion of total edits comes from the Global South, including unknown locations puts it at something like 55%! Excluding them gives a much more reasonable 20%.

Since pageviews are more than half Global South overall, treating unknown as Global South seems reasonable. However, we're pretty confident that Global South edits are still a fairly small minority, so it's much less reasonable to apply the same approach.

@Tbayer and I need to discuss what to do about this.

Details

Due Date
Nov 3 2018, 1:00 AM

Event Timeline

nshahquinn-wmf changed Due Date from Oct 12 2018, 7:00 AM to Nov 3 2018, 1:00 AM.
nshahquinn-wmf raised the priority of this task from Medium to High. Oct 18 2018, 8:35 PM
nshahquinn-wmf moved this task from Triage to Next Up on the Product-Analytics board.

@Tbayer and I discussed this yesterday and came to the following conclusions:

  • Unknown countries account for roughly:
    • 0.3% of pageviews
    • 0.9% of editors (technically, of (wiki, country, editor) entities)
    • 27% of edits (my estimate in the description was too high)
  • We should investigate the small group of editors producing all these unknown-country edits.
  • Starting with October board metrics (T206895, etc.), we will treat unknown countries as a third region alongside the Global North and the Global South. I will produce some shared code to help with this; a rough sketch follows below.
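For example, the shared code could boil down to something roughly like the following HiveQL (sketch only: the source table and its columns are placeholders, and the GN country list is the one discussed further down in this task):

    -- Rough sketch: classify each row into one of three regions, then sum non-bot edits.
    -- `geoeditors_edits`, `country_code`, and `edit_count` are placeholder names.
    SELECT economic_region, SUM(edit_count) AS edits
    FROM (
        SELECT
            CASE
                WHEN country_code IS NULL OR country_code = '--' THEN 'Unknown'
                WHEN FIND_IN_SET(country_code, 'AD,AL,AT,AX,BA,BE,BG,CH,CY,CZ,DE,DK,EE,ES,FI,FO,FR,FX,GB,GG,GI,GL,GR,HR,HU,IE,IL,IM,IS,IT,JE,LI,LU,LV,MC,MD,ME,MK,MT,NL,NO,PL,PT,RO,RS,RU,SE,SI,SJ,SK,SM,TR,VA,AU,CA,HK,MO,NZ,JP,SG,KR,TW,US') > 0 THEN 'Global North'
                ELSE 'Global South'
            END AS economic_region,
            edit_count
        FROM geoeditors_edits
    ) region_edits
    GROUP BY economic_region;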

+1. And to clarify, we haven't yet decided whether we should report all three of these regions separately in the board deck and in https://www.mediawiki.org/wiki/Wikimedia_Audiences (it would amount to adding two new sections, as only GS is listed currently). But in any case we resolved to remove the "unknown" country part from the existing GS metrics.

Oh yeah—good point. I wasn't thinking that we'd add Global North and unknown to the board deck, just that in the future we'd give all three groups rather than two whenever we gave a full breakdown.

I think separating out the unknowns makes the most sense until we can properly categorize them. It may be helpful to put them into the board deck for comparison to Global South, but we can talk about how to balance adding that data without overcomplicating the slides.

For additional context, we are talking about IPs that the MaxMind GeoIP database classifies as coming from an "unknown" country - it's probably not worth investing effort into coming up with a more precise classification for them ourselves.

@Tbayer,

The list of Global North countries I've been using is:

(
    "AD", "AL", "AT", "AX", "BA", "BE", "BG", "CH", "CY", "CZ",
    "DE", "DK", "EE", "ES", "FI", "FO", "FR", "FX", "GB", "GG",
    "GI", "GL", "GR", "HR", "HU", "IE", "IL", "IM", "IS", "IT",
    "JE", "LI", "LU", "LV", "MC", "MD", "ME", "MK", "MT", "NL",
    "NO", "PL", "PT", "RO", "RS", "RU", "SE", "SI", "SJ", "SK",
    "SM", "TR", "VA", "AU", "CA", "HK", "MO", "NZ", "JP", "SG",
    "KR", "TW", "US"
)

To check whether this was accurate, I pulled all the countries present in a month of pageviews data that didn't match this list or "--". That's in this spreadsheet and it looks good to me.

Is that what you had in mind when we talked about shareable code? I'm also interested in making a custom Hive UDF and a Git-tracked CSV file that contains our Global North/Global South definitions, but I don't have time to do that right now.
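For reference, a query along these lines would reproduce that check (the exact query may have differed; it assumes the country_code and view_count columns in projectview_hourly and a single month's partitions):

    -- Sketch of the validation described above: list any country codes in a month of
    -- pageviews that are neither "--" nor in the GN list. Month chosen arbitrarily.
    SELECT country_code, SUM(view_count) AS views
    FROM wmf.projectview_hourly
    WHERE year = 2018 AND month = 10
      AND country_code != '--'
      AND FIND_IN_SET(country_code, 'AD,AL,AT,...,US') = 0  -- abbreviated; full GN list as above
    GROUP BY country_code
    ORDER BY views DESC;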

In T206898#4735113, @Neil_P._Quinn_WMF wrote:

@Tbayer,

The list of Global North countries I've been using is:

(
    "AD", "AL", "AT", "AX", "BA", "BE", "BG", "CH", "CY", "CZ",
    "DE", "DK", "EE", "ES", "FI", "FO", "FR", "FX", "GB", "GG",
    "GI", "GL", "GR", "HR", "HU", "IE", "IL", "IM", "IS", "IT",
    "JE", "LI", "LU", "LV", "MC", "MD", "ME", "MK", "MT", "NL",
    "NO", "PL", "PT", "RO", "RS", "RU", "SE", "SI", "SJ", "SK",
    "SM", "TR", "VA", "AU", "CA", "HK", "MO", "NZ", "JP", "SG",
    "KR", "TW", "US"
)

To check whether this was accurate, I pulled all the countries present in a month of pageviews data that didn't match this list or "--". That's in this spreadsheet and it looks good to me.

Oh, it looks like that Global North definition leaves out Lithuania (LT) and Bermuda (BM). Let me just go ahead and make that damn CSV file.

In T206898#4735113, @Neil_P._Quinn_WMF wrote:

@Tbayer,

The list of Global North countries I've been using is:

(
    "AD", "AL", "AT", "AX", "BA", "BE", "BG", "CH", "CY", "CZ",
    "DE", "DK", "EE", "ES", "FI", "FO", "FR", "FX", "GB", "GG",
    "GI", "GL", "GR", "HR", "HU", "IE", "IL", "IM", "IS", "IT",
    "JE", "LI", "LU", "LV", "MC", "MD", "ME", "MK", "MT", "NL",
    "NO", "PL", "PT", "RO", "RS", "RU", "SE", "SI", "SJ", "SK",
    "SM", "TR", "VA", "AU", "CA", "HK", "MO", "NZ", "JP", "SG",
    "KR", "TW", "US"
)

OK, here is the HiveQL expression for determining GN that I have been using for the past few years:

FIND_IN_SET(country_code, 'AD,AL,AT,AX,BA,BE,BG,CH,CY,CZ,DE,DK,EE,ES,FI,FO,FR,FX,GB,GG,GI,GL,GR,HR,HU,IE,IL,IM,IS,IT,JE,LI,LU,LV,MC,MD,ME,MK,MT,NL,NO,PL,PT,RO,RS,RU,SE,SI,SJ,SK,SM,TR,VA,AU,CA,HK,MO,NZ,JP,SG,KR,TW,US') > 0

To check whether this was accurate, I pulled all the countries present in a month of pageviews data that didn't match this list or "--". That's in this spreadsheet and it looks good to me.

Is that what you had in mind when we talked about shareable code?

Yes, quite - except that the one for GS would be more interesting right now, considering that's what we are now reporting directly (and no longer want to calculate as total - GN).
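The GS counterpart would then presumably just be the complement of that, minus the unknowns, i.e. something like:

    -- Sketch of the corresponding GS condition: not unknown and not in the GN list.
    country_code != '--'
    AND FIND_IN_SET(country_code, 'AD,AL,AT,...,US') = 0  -- abbreviated; same GN list as above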

I'm also interested in making a custom Hive UDF and a Git-tracked CSV file that contains our Global North/Global South definitions, but I don't have time to do that right now.

@Tbayer, I've created a CSV file with country names, ISO codes, Global North/South classification, and MaxMind continents, tracked in a new wikimedia-research/canonical-data repo. It contains all the countries which appear in projectview_hourly, and I've carefully checked it to make sure the Global North/South classifications match the ones at meta:List of countries by regional classification.

This should give us a stable basis for our queries!

Actually, I'll go ahead and upload the CSV file into the Data Lake so we can join it to the other tables. That actually may be just as good as writing a UDF—you don't have to add a jar and create a function at the start of every query, at least.

In T206898#4737277, @Neil_P._Quinn_WMF wrote:

@Tbayer, I've created a CSV file with country names, ISO codes, Global North/South classification, and MaxMind continents, tracked in a new wikimedia-research/canonical-data repo. It contains all the countries which appear in projectview_hourly, and I've carefully checked it to make sure the Global North/South classifications match the ones at meta:List of countries by regional classification.

Cool, thanks for sorting this all out and vetting it!

In T206898#4737360, @Neil_P._Quinn_WMF wrote:

Actually, I'll go ahead and upload the CSV file into the Data Lake so we can join it to the other tables. That actually may be just as good as writing a UDF—you don't have to add a jar and create a function at the start of every query, at least.

What approach do you recommend for joining a CSV in Hive? If it involves creating a temporary Hive table, wouldn't it make sense to instead upload the CSV as a non-temporary table already? (similar to e.g. wmf.domain_abbrev_map or tbayer.country_name_vs_code)

Yeah, that's exactly what I was planning! It's now available as canonical_data.countries. My uploading code is in a notebook in the same repo. Currently, it's hard-coded for the countries dataset, but it won't be too hard to make it into general-purpose code that pushes all the latest datasets from the repo into the canonical-data database.
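As a usage sketch, a region breakdown now becomes a plain join (the countries column names here follow the CSV fields described above but may not match the actual schema exactly; the edits table is a placeholder):

    -- Usage sketch: join edits to canonical_data.countries for a three-region breakdown.
    -- `geoeditors_edits` is a placeholder; `iso_code` and `economic_region` are assumed columns.
    SELECT
        COALESCE(c.economic_region, 'Unknown') AS economic_region,
        SUM(e.edit_count) AS edits
    FROM geoeditors_edits e
    LEFT JOIN canonical_data.countries c
        ON e.country_code = c.iso_code
    GROUP BY COALESCE(c.economic_region, 'Unknown');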

Now that the data is easily available in Hive, I think this is done!