Wikistats Bug: Small countries not displayed on the map
Closed, Resolved · Public · 9 Estimated Story Points

Description

While doing a quick dive into the dataset for monthly pageviews in Singapore, I found that Singapore is not visually represented on the map, even though it has 88 million monthly page views.

stats link: https://stats.wikimedia.org/#/all-projects/reading/page-views-by-country/normal%7Cmap%7Clast-month%7C(access)~desktop*mobile-app*mobile-web%7Cmonthly

Can the map be corrected to add red dots to locations that cannot be displayed at the default zoomed out level?

Event Timeline

Restricted Application added a subscriber: Aklapper.
JArguello-WMF set the point value for this task to 9.

This table is filled monthly, subject to this disallow list. This ultimately makes it into Cassandra as editors_by_country, which is where AQS serves these endpoints to Wikistats. This old disallow list includes Singapore, which is one of the reasons why it's not showing up on Wikistats. We should switch to using the new disallow list, mentioned at https://wikitech.wikimedia.org/wiki/Country_protection_list. However, we should also productionize this to be in the refinery repo and maintained centrally so we don't rely on Hal's personal hive db.

The second reason it's not showing up is the TopoJSON resolution, as folks guessed in this task. Mikhail was exactly right in his analysis. I'm not 100% sure of the fix, but here's what he says about the problem:

"I tried looking up examples of d3 choropleths and all the ones I’ve encountered were missing Singapore.
The good news is that at least SG is present in isoLookup.js (used in world.js, which is used in MapChart.vue). The bad news is that it is indeed missing in the geometry data, world-50m.js: e.g. there is a polygon for Afghanistan (id 004) but not for Singapore (id 702). So we just have to add in the missing polygon data with that id and we should be good. It’s not a trivial thing to do but it’s good to know what needs to be done." (link to wikistats code referenced here)
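The check Mikhail describes (an id present in the lookup but absent from the geometry data) can be sketched roughly as follows. This is an illustration only: it assumes the geometry file follows the usual TopoJSON layout, with an `objects.countries.geometries` list whose entries carry an `id`. The function name and the toy topology are hypothetical and are not taken from the wikistats code.

```python
def missing_country_ids(topology, expected_ids, object_name="countries"):
    """Return the expected ids that have no geometry in the topology.

    Assumes a TopoJSON-style dict: topology["objects"][object_name]["geometries"].
    """
    geometries = topology["objects"][object_name]["geometries"]
    # Normalize ids to zero-padded 3-character strings so 4 and "004" compare equal.
    present = {str(g["id"]).zfill(3) for g in geometries if "id" in g}
    wanted = {str(i).zfill(3) for i in expected_ids}
    return sorted(wanted - present)


# Toy topology (hypothetical): Afghanistan (004) has a polygon, Singapore (702) does not.
toy_topology = {
    "type": "Topology",
    "objects": {
        "countries": {
            "type": "GeometryCollection",
            "geometries": [{"type": "Polygon", "id": "004", "arcs": [[0]]}],
        }
    },
    "arcs": [],
}

missing = missing_country_ids(toy_topology, ["004", "702"])
print(missing)  # ids that would need polygon data added
```

Running a check like this against every id in the lookup table would show which countries, beyond Singapore, are dropped at a given map resolution.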

However, we should also productionize this to be in the refinery repo and maintained centrally so we don't rely on Hal's personal hive db.

Fortunately that's already available in canonical_data.countries (is_protected column), jointly maintained by your team + my team. Source data: https://github.com/wikimedia-research/canonical-data/blob/master/country/countries.tsv

Change 929723 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Use canonical_data countries maintained by analytics-product

https://gerrit.wikimedia.org/r/929723

With those patches, I'm proposing to:

  • Use canonical_data.countries.is_protected instead of maintaining our own list in analytics-refinery
  • Add an Airflow job in the analytics-product instance to regularly (+manually) sync the Hive table with the TSV in the repo.
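As a rough sketch of what relying on the is_protected column could look like, the snippet below reads a countries TSV and drops rows for protected countries. Only the `is_protected` column name comes from this thread. The `iso_code` column, the `true`/`false` encoding, the `country_code` row key, and the sample countries are assumptions made for illustration, and the real job would run against the Hive table rather than in plain Python.

```python
import csv
import io


def protected_codes(tsv_text, code_column="iso_code", flag_column="is_protected"):
    """Return the set of country codes flagged as protected in a TSV string.

    Column names other than is_protected and the 'true' encoding are assumptions.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row[code_column]
        for row in reader
        if row[flag_column].strip().lower() == "true"
    }


def releasable(rows, protected, country_key="country_code"):
    """Keep only rows whose country is not in the protected set."""
    return [r for r in rows if r[country_key] not in protected]


# Hypothetical sample data; XA and XB are placeholder codes, not real countries.
sample_tsv = (
    "name\tiso_code\tis_protected\n"
    "Exampleland\tXA\ttrue\n"
    "Sampleton\tXB\tfalse\n"
)
sample_rows = [
    {"country_code": "XA", "views": 10},
    {"country_code": "XB", "views": 20},
]

protected = protected_codes(sample_tsv)
public_rows = releasable(sample_rows, protected)
print(public_rows)
```

Keeping the flag in one shared table means each published dataset applies the same filter, rather than each job carrying its own copy of the list.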

@mpopov @nshahquinn-wmf @Htriedman, what do you think about it?

Thank you @Antoine_Quhen! I really really like this idea.

I'm wondering if this would be a good opportunity to migrate the canonical data repo from GitHub to GitLab. If we're going to set up these cross-repo dependencies I think it would be better to have everything be on Wikimedia's premises.

@nshahquinn-wmf: What do you think?

And then as far as where it gets moved to, I actually think that instead of repos/product-analytics perhaps it should live under repos/data-engineering, with myself & Neil added as co-maintainers? This is the kind of dataset that I think will be primarily maintained by the new Movement Insights team, so I'm not sure if it should be under PA, but I'm also totally OK with having it under PA.

Change 929816 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/wikistats2@master] Increase world map resolution

https://gerrit.wikimedia.org/r/929816

Change 929816 merged by jenkins-bot:

[analytics/wikistats2@master] Increase world map resolution

https://gerrit.wikimedia.org/r/929816

With those patches, I'm proposing to:

  • Use canonical_data.countries.is_protected instead of maintaining our own list in analytics-refinery
  • Add an Airflow job in the analytics-product instance to regularly (+manually) sync the Hive table with the TSV in the repo.

@mpopov @nshahquinn-wmf @Htriedman, what do you think about it?

I also like this a lot! What do you think about generalizing it so both the wikis and countries get automatically imported (maybe weekly rather than monthly)? It looks like you're 90% of the way there, and it would be a big help for us. (Just to be clear, I'm not talking about automating the generation of the wikis table, just about automatically importing whatever's in the repo).

Thank you @Antoine_Quhen! I really really like this idea.

I'm wondering if this would be a good opportunity to migrate the canonical data repo from GitHub to GitLab. If we're going to set up these cross-repo dependencies I think it would be better to have everything be on Wikimedia's premises.

I agree. I don't think it does any harm to have a GitLab repo depend on a GitHub one, but we might as well migrate canonical data now so we skip the step of having to update it after migrating it to GitLab at some point in the future.

And then as far as where it gets moved to, I actually think that instead of repos/product-analytics perhaps it should live under repos/data-engineering with myself & Neil added co-maintainers? This is the kind of dataset that I think will be primarily maintained by the new Movement Insights team, so I'm not sure if it should be under PA but also I'm totally OK to have it under PA.

Yeah, I think moving under data-engineering makes sense, although I'm fine with other locations too.

Thanks all for the fix and the updated map. :)

Change 929723 merged by Aqu:

[analytics/refinery@master] Use canonical_data countries maintained by analytics-product

https://gerrit.wikimedia.org/r/929723

Three of our datasets are now going to use canonical_data.countries.is_protected:

  • Cassandra AQS pageview_top_percountry_daily
  • Cassandra AQS pageview_top_bycountry_monthly
  • Hive geoeditors_public_monthly

6 countries were previously not allowed to be released and will be released after this change: Burundi, Equatorial Guinea, Libya, Singapore, Somalia, Tajikistan
5 countries were previously allowed to be released and will be disallowed after this change: Bangladesh, Honduras, Kuwait, Nicaragua, Oman
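The two lines above amount to a set difference between the old and new disallow lists. A minimal sketch of that comparison follows. The sets contain only the countries named here plus one hypothetical shared placeholder, not the full lists, and the function name is made up for illustration.

```python
def diff_disallow_lists(old, new):
    """Compare two disallow lists (sets of country names).

    Countries only in the old list become releasable;
    countries only in the new list become disallowed.
    """
    return {
        "newly_released": sorted(old - new),
        "newly_disallowed": sorted(new - old),
    }


# Hypothetical placeholder for countries present in both lists.
shared = {"Exampleland"}

old_list = shared | {
    "Burundi", "Equatorial Guinea", "Libya", "Singapore", "Somalia", "Tajikistan",
}
new_list = shared | {
    "Bangladesh", "Honduras", "Kuwait", "Nicaragua", "Oman",
}

changes = diff_disallow_lists(old_list, new_list)
print(changes["newly_released"])
print(changes["newly_disallowed"])
```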