Page MenuHomePhabricator

Archive old geowiki data (editors per country) and make it easily available at WMF
Closed, ResolvedPublic13 Estimated Story Points

Description

From Analytics goals: https://www.mediawiki.org/w/index.php?title=Wikimedia_Technology/Goals/2017-18_Q4&action=edit&section=11

  • Sqoop from the old tables into the new format and right partitions (separate table).
  • Publishing documentation

Although calculations on last 30 days are happening daily, we should only consider the ones that match this pattern: yyyy-%-01.

Event Timeline

mforns created this task.Mar 27 2018, 5:55 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 27 2018, 5:55 PM
fdans triaged this task as Medium priority.Mar 29 2018, 4:58 PM
fdans moved this task from Incoming to Geowiki on the Analytics board.
fdans added a subscriber: fdans.EditedApr 23 2018, 4:07 PM

For this to happen we need to :

  • Sqoop the old geowiki tables into hadoop
  • Creating job to ingest data into druid
  • Remove old data.
  • Add schema docs to each table
fdans set the point value for this task to 13.Apr 23 2018, 4:10 PM
fdans added a project: Analytics-Kanban.
mforns updated the task description. (Show Details)May 7 2018, 4:22 PM
Milimetric removed Milimetric as the assignee of this task.May 14 2018, 2:51 PM
Milimetric added a subscriber: Milimetric.
fdans claimed this task.May 23 2018, 4:07 PM

Sooo a few decisions to make here in terms of naming/placement. They're these:

  • Location in HDFS (I suggest /wmf/data/archive/geowiki)
  • Location in Hive (maybe archive db, table geowiki_archive?)
  • Name of Druid dataSource (right now I'm using fdans_geowiki_archive_test, maybe geowiki_archive?)

Even though in its new version we've renamed the dataset to Geoeditors, I think the archive should still be called Geowiki.

Let's discuss this in grosking but maybe leave any objections here :) @Milimetric @Nuria @JAllemandou @Ottomata

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.May 24 2018, 4:15 PM
fdans added a comment.EditedMay 24 2018, 4:46 PM

In hive:
geowiki_archive_monthly_country
geowiki_archive_country
geowiki_archive_city

In druid:
geowiki_archive_monthly

fdans added a comment.Jun 1 2018, 6:51 PM

OK, the following tables have been created:

In hdfs:
/wmf/data/archive/geowiki/geowiki_archive_active_editors_world
/wmf/data/archive/geowiki/geowiki_archive_country
/wmf/data/archive/geowiki/geowiki_archive_edit_fraction_city
/wmf/data/archive/geowiki/geowiki_archive_edits_country
/wmf/data/archive/geowiki/geowiki_archive_monthly_country
/wmf/data/archive/geowiki/geowiki_archive_monthly_edits_country

In hive:

Raw tables (not modified from the erosen_* ones in mysql):
geowiki_archive_active_editors_world
geowiki_archive_country
geowiki_archive_edit_fraction_city
geowiki_archive_edits_country
Refined tables (turned to monthly rows)
geowiki_archive_monthly_country
geowiki_archive_monthly_edits_country

In druid
geowiki_archive_country

Access to this data is documented in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Geoeditors#Accessing_the_old_Geowiki_data

As discussed with @fdans we should re-sqoop analytics-slave::staging::erosen_geocode_active_editors_world into hive::geowiki_archive_active_editors_world, because it had an import problem. The other 3 sqooped tables seem to be fine!

Re-sqooping of geowiki_archive_active_editors_world done :)

geowiki_archive_active_editors_world looks good to me now!

mforns added a comment.EditedJun 20 2018, 2:51 PM

geowiki_archive_monthly_country Looks good to me!

There's only one small detail that we can discuss whether we want to change or not:
Sometimes, when normalizing the all cohort, its normalized value does not match the sum of the normalized values of its cohort parts.
An example of this is:

select * from geowiki_archive_monthly_country where month like '2012-08-01' and project='ca' and country='BR';
ca	BR	all	2012-08-01	16	2012-10-29 12:22:48.0
ca	BR	0-10	2012-08-01	13	2012-10-29 12:22:48.0
ca	BR	90-100	2012-08-01	1	2012-10-29 12:22:48.0
ca	BR	1	2012-08-01	8	2012-10-29 12:22:49.0
ca	BR	2	2012-08-01	3	2012-10-29 12:22:49.0
ca	BR	4	2012-08-01	1	2012-10-29 12:22:49.0
ca	BR	5+	2012-08-01	3	2012-12-05 19:29:13.0
ca	BR	50-60	2012-08-01	1	2012-10-29 12:22:49.0
ca	BR	9	2012-08-01	1	2012-10-29 12:22:49.0

The sum of 0-10, 50-60 and 90-100 is 15, but the all value is 16. If you query geowiki_archive_monthly_country you can see 15 is the correct value (non normalized). When normalizing values for August, we divide 15 by 30 and multiply by 31, which equals 15.5, which rounds up to 16. Lower values as 13 normalize to 13.43, and round down to 13.

Not sure we want to change that, but there's this difference.

geowiki_archive_monthly_edits_country Looks good to me overall as well.
There's another small difference in relation to geowiki_archive_monthly_country, which is the edits one has full country names, where as the editors one has country ISO codes. Is that expected?

The geowiki_archive_monthly data in Turnilo looks good to me overall and super useful!
There is an empty metric called Count though. From what I know, this metric is added by Druid no? It should count the number of rows.
However, there's another metric called Number of rows that seems to count that.
So, yea, probably everything is fine like that.

In Superset the 'Geowiki legacy archive' dashboard works well and shows correct data.
I had difficulties seeing the countries though, because there's no line delimiting them (or there's a white line).
The geoeditors dashboard has a blue line that helps. Can we use that as well for geowiki?
Also, the bottom right table shows row counts. So if you select a wiki and a cohort, the table will show always count=1.
Not sure that is of interest?

geowiki_archive_monthly_edits_country Looks good to me overall as well.
There's another small difference in relation to geowiki_archive_monthly_country, which is the edits one has full country names, where as the editors one has country ISO codes. Is that expected?

Yes, it's better for the map component in Superset to have standard country names, and the names in the original dataset were far from standard (e.g. for Curaçao I found three variations: Curaçao, Curacao and Cura?ao).

geowiki_archive_monthly_country Looks good to me!

There's only one small detail that we can discuss whether we want to change or not:
Sometimes, when normalizing the all cohort, its normalized value does not match the sum of the normalized values of its cohort parts.
An example of this is:

select * from geowiki_archive_monthly_country where month like '2012-08-01' and project='ca' and country='BR';
ca	BR	all	2012-08-01	16	2012-10-29 12:22:48.0
ca	BR	0-10	2012-08-01	13	2012-10-29 12:22:48.0
ca	BR	90-100	2012-08-01	1	2012-10-29 12:22:48.0
ca	BR	1	2012-08-01	8	2012-10-29 12:22:49.0
ca	BR	2	2012-08-01	3	2012-10-29 12:22:49.0
ca	BR	4	2012-08-01	1	2012-10-29 12:22:49.0
ca	BR	5+	2012-08-01	3	2012-12-05 19:29:13.0
ca	BR	50-60	2012-08-01	1	2012-10-29 12:22:49.0
ca	BR	9	2012-08-01	1	2012-10-29 12:22:49.0

The sum of 0-10, 50-60 and 90-100 is 15, but the all value is 16. If you query geowiki_archive_monthly_country you can see 15 is the correct value (non normalized). When normalizing values for August, we divide 15 by 30 and multiply by 31, which equals 15.5, which rounds up to 16. Lower values as 13 normalize to 13.43, and round down to 13.

Not sure we want to change that, but there's this difference.

This goes in line with the slight variations in numbers between the original dataset and geowiki_archive_monthly_country because of the monthly normalization. Since they're just variations of +/- 1 editor per country I think that is statistically insignificant.

In Superset the 'Geowiki legacy archive' dashboard works well and shows correct data.
I had difficulties seeing the countries though, because there's no line delimiting them (or there's a white line).
The geoeditors dashboard has a blue line that helps. Can we use that as well for geowiki?
Also, the bottom right table shows row counts. So if you select a wiki and a cohort, the table will show always count=1.
Not sure that is of interest?

I've just added a stroke to the countries, and corrected the country table so that it shows SUM(editors) instead of the row count.

@fdans

Yes, it's better for the map component in Superset to have standard country names, and the names in the original dataset were far from standard (e.g. for Curaçao I found three variations: Curaçao, Curacao and Cura?ao).

Great, then.

This goes in line with the slight variations in numbers between the original dataset and geowiki_archive_monthly_country because of the monthly normalization. Since they're just variations of +/- 1 editor per country I think that is statistically insignificant.

I pointed that out because it might cause confusion to a nitpicky user. However, I agree that it's statistically insignificant, and we can leave it like that.
Also, one can say it's a natural side-effect of the monthly normalization. We could document this detail on-wiki no?

I've just added a stroke to the countries, and corrected the country table so that it shows SUM(editors) instead of the row count.

Awesome, thanks :]

+2 on my side!

fdans added a comment.Jun 27 2018, 4:19 PM

We could document this detail on-wiki no?

The way of estimating monthly values is documented in the wikitech page (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Geoeditors#Differences_with_the_original_data)

Oh yea, I meant the tiny inconsistency created by the normalization rounding. I added a one-liner to the docs:
https://wikitech.wikimedia.org/w/index.php?title=Analytics%2FSystems%2FGeoeditors&type=revision&diff=1795553&oldid=1793682

fdans moved this task from In Code Review to Done on the Analytics-Kanban board.Jun 29 2018, 9:49 AM
Nuria closed this task as Resolved.Jul 11 2018, 10:02 PM