Data Pipeline for Unique Editor metrics by Country
Closed, ResolvedPublic5 Estimated Story Points
Actions

Assigned To

Authored By

	• EChetty
	Jul 29 2022, 1:58 PM

Description

User Story

As a Data User, I want to be able to count unique editors across different geographic regions regardless of the project they edited.

Scope

Build a data pipeline that modifies the existing insert_editors_daily_data job and aggregates editors across geographies while de-duplicating editors across wiki projects . The output should include a data table that de-duplicates active editors across wiki projects and aggregates them by geography.

Pipeline Description : Here

Success Criteria

A data pipeline deployed to the Data Engineering instance of Airflow that performs the functions described in the pipeline description
Outputs a queryable hive table that includes monthly counts of editors aggregate by geography.

{fad75b8e3a1322e824d6c8cd5fcd0116d2064c37}

Details

	Subject	Repo	Branch	Lines +/-
	Add table to compute unique editors per country on a monthly basis.	analytics/refinery	master	+142 -25

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	• EChetty	T310224 New Data Pipeline for Unique Editor metrics by Geo (Country & Region)
Resolved	xcollazo	T314147 Data Pipeline for Unique Editor metrics by Country
Resolved	Mayakp.wiki	T316398 Update PA job for changes to editors_daily table

Event Timeline

• EChetty reassigned this task from • EChetty to xcollazo.Jul 29 2022, 1:58 PM

• EChetty created this task.

• EChetty set the point value for this task to 5.

• EChetty moved this task from Ready to In progress on the Data-Engineering-Planning (Sprint 02) board.

Change 820478 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] Add table to compute unique editors per country on a monthly basis.

https://gerrit.wikimedia.org/r/820478

gerritbot added a project: Patch-For-Review.Aug 4 2022, 4:09 PM

Notes to do migration contained here:

migration notes.txt2 KBDownload

Merge request for the new pipeline https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/114

( After a conversation with @KCVelaga, we decided to keep things simple for now, thus the monthly table unique_editors_per_country_monthly will not be using GROUPING SETs .)

• EChetty moved this task from In progress to In code review on the Data-Engineering-Planning (Sprint 02) board.Aug 4 2022, 5:02 PM

Change 820478 merged by Milimetric:

[analytics/refinery@master] Add table to compute unique editors per country on a monthly basis.

https://gerrit.wikimedia.org/r/820478

xcollazo moved this task from In code review to Ready to deploy on the Data-Engineering-Planning (Sprint 02) board.Aug 5 2022, 5:25 PM

Status update:

Refinery changes are ready to deploy, but agreed with @Milimetric to hold till week of Aug 15 so that I can learn from the deployment.

Airflow changes still need a review: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/114

xcollazo moved this task from Ready to deploy to In code review on the Data-Engineering-Planning (Sprint 02) board.Aug 5 2022, 5:29 PM

Maintenance_bot removed a project: Patch-For-Review.Aug 5 2022, 5:30 PM

• EChetty moved this task from In code review to Ready to deploy on the Data-Engineering-Planning (Sprint 02) board.Aug 8 2022, 3:06 PM

Status update: Andrew and Dan tried to deploy this today, but ran into an issue with the query when backfilling. We altered the editors_daily table back to its original state. Dan and Xabriel to look into this when Xabriel is back.
Otto made a copy of the original editors_daily data (2022-06 and 2022-07) at /user/otto/editors_daily_backup_T314147_to_delete.

• EChetty moved this task from Ready to deploy to Paused/Blocked on the Data-Engineering-Planning (Sprint 02) board.Aug 11 2022, 4:04 PM

• EChetty edited projects, added Data Pipelines, Data-Engineering-Planning; removed Data-Engineering-Planning (Sprint 02).Aug 15 2022, 8:30 AM

• EChetty moved this task from Backlog to Sprint 00 on the Data Pipelines board.Aug 15 2022, 8:36 AM

• EChetty edited projects, added Data Pipelines (Sprint 00); removed Data Pipelines.

• EChetty moved this task from Backlog to Pipelines on the Data-Engineering-Planning board.Aug 15 2022, 8:43 AM

• EChetty moved this task from Ready to Ready to Deploy on the Data Pipelines (Sprint 00) board.Aug 15 2022, 8:57 AM

• EChetty updated the task description. (Show Details)Aug 15 2022, 4:29 PM

Copying some notes I sent thru email here for completeness:

The issue found by @Milimetric and @Ottomata while trying to backfill the editors_daily_monthly table seems to be a repro of https://issues.apache.org/jira/browse/SPARK-25271.

I *did* hit this issue while debugging my migration script, but I *only* hit it when I tried to re-CREATE the table, and not when ALTERing it (and so that is why I ALTER the table in the migration script, but as you folks show, I was wrong about it not showing while doing an ALTER, so my debugging must have gone wrong somewhere). The problem has to do with the Hive SerDe being used for writing parquet files not being happy with empty complex types (and so the problematic column is not the one I altered, it is user_is_bot_by). As an example, if the table is STORED AS ORC the error goes away.

We use Spark v2.4.4, and the bug linked above was fixed on v2.4.8 and v3.

I tried running the backfill on a copy of editors_daily_monthly and can confirm that I can't repro the issue with Spark 3.1.2 on our cluster. Since the Airflow jobs that do the monthly run actually run on Spark3 already, there should be no compatibility issues moving forward.

So let's retry the deployment in the near future using Spark3.

KCVelaga unsubscribed.Aug 16 2022, 5:16 AM

• EChetty moved this task from Ready to Deploy to Blocked/Paused on the Data Pipelines (Sprint 00) board.Aug 16 2022, 2:20 PM

• EChetty moved this task from Blocked/Paused to Ready to Deploy on the Data Pipelines (Sprint 00) board.

Updated migration notes to use Spark3.

migration notes.txt2 KBDownload

Successfully deployed the migration script and Airflow job today.

Next steps are:

Verify unique_editors_by_country_monthly table content.
Verify that the Airflow job is working as expected.

• EChetty moved this task from Ready to Deploy to Done on the Data Pipelines (Sprint 00) board.Aug 17 2022, 4:02 PM

• EChetty moved this task from Done to Ready to Deploy on the Data Pipelines (Sprint 00) board.

xcollazo updated the task description. (Show Details)Aug 17 2022, 7:10 PM

Verified that the table content and airflow job are working as designed. Closing.

@KCVelaga_WMF new table description below.

hive (default)> describe formatted wmf.unique_editors_by_country_monthly;
OK
col_name	data_type	comment
# col_name            	data_type           	comment             
	 	 
country_code        	string              	The 2-letter ISO country code this group of editors geolocated to, including Unknown (--)
users_are_anonymous 	boolean             	Whether or not this group of editors edited anonymously
activity_level      	string              	How many edits this group of editors performed, can be "1 to 4", "5 to 99", or "100 or more"
distinct_editors    	bigint              	Number of editors meeting this activity level
namespace_zero_distinct_editors	bigint              	Number of editors meeting this activity level with only namespace zero edits
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
month               	string              	The month in YYYY-MM format
	 	 
# Detailed Table Information	 	 
Database:           	wmf                 	 
...
Table Parameters:	 	 
...       
	comment             	This table contains unique editor counts by country, regardless of wiki project.

Thanks @xcollazo. From this commit, I see that editors_daily is also now using username instead of id. I have updated the Wikitech documentation for that. I am also wondering if we should document the new table as well on the same page. If it is the right page, I am happy to do that.

• EChetty moved this task from Ready to Deploy to Done on the Data Pipelines (Sprint 00) board.Aug 18 2022, 4:11 PM

I have updated the Wikitech documentation for that

Thanks @KCVelaga_WMF !

I am also wondering if we should document the new table as well on the same page. If it is the right page, I am happy to do that.

Agreed, we should document. @Milimetric can you please advise if that is the right location?

I wanted to note here that Product Analytics relies on the editors_daily table to calculate diversity metrics. While we were aware of the creation of a new aggregated table to calculate Unique Editor metrics by Country, this task doesn't explicitly mention that user_fingerprint_or_id will be changed to user_fingerprint_or_name .
Changes to the source table has impacted global north and global south metrics such as edits, editors, net new content etc. as well as the analytics-wmf-product-jobs that updates the wmf_product.active_editors table which provides data to the editors dashboard.

this task doesn't explicitly mention that user_fingerprint_or_id will be changed to user_fingerprint_or_name

@Mayakp.wiki sorry we broke your pipelines! We assumed that the change was low risk since there was no direct usage of this table in Superset, and we had made the necessary downstream table changes (of the tables that we control). We should have announced this change more broadly. I'll do so next time.

Changes to the source table has impacted global north and global south metrics such as edits, editors, net new content etc.

You can still use user_fingerprint_or_name in the same way you used user_fingerprint_or_id before, but now you will not get false positives due to id clashes between wikis.

Mayakp.wiki mentioned this in T316398: Update PA job for changes to editors_daily table.Aug 26 2022, 7:38 PM

Mayakp.wiki mentioned this in T316580: Retire the global market Editor metrics going forward .Aug 29 2022, 7:13 PM

Iflorez mentioned this in T316689: Data for impact analysis of Editors_Daily table update.Aug 30 2022, 5:08 PM

@xcollazo can you share documentation on the referenced editors_daily_monthly table? Where is that table located? I'm trying to understand the backfill that is referenced per T316689 (which looks at the impact of the switch from user_fingerprint_or_id to user_fingerprint_or_name.)

In T314147#8155713, @xcollazo wrote:

Copying some notes I sent thru email here for completeness:

The issue found by @Milimetric and @Ottomata while trying to backfill the editors_daily_monthly table seems to be a repro of https://issues.apache.org/jira/browse/SPARK-25271.

I *did* hit this issue while debugging my migration script, but I *only* hit it when I tried to re-CREATE the table, and not when ALTERing it (and so that is why I ALTER the table in the migration script, but as you folks show, I was wrong about it not showing while doing an ALTER, so my debugging must have gone wrong somewhere). The problem has to do with the Hive SerDe being used for writing parquet files not being happy with empty complex types (and so the problematic column is not the one I altered, it is user_is_bot_by). As an example, if the table is STORED AS ORC the error goes away.

We use Spark v2.4.4, and the bug linked above was fixed on v2.4.8 and v3.

I tried running the backfill on a copy of editors_daily_monthly and can confirm that I can't repro the issue with Spark 3.1.2 on our cluster. Since the Airflow jobs that do the monthly run actually run on Spark3 already, there should be no compatibility issues moving forward.

So let's retry the deployment in the near future using Spark3.

@xcollazo can you share documentation on the referenced editors_daily_monthly table? Where is that table located?

That was a typo, sorry, I meant to write editors_daily. This one is documented here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors#Editors_daily.

You can find this table on hive at wmf.editors_daily, and on HDFS at hdfs://analytics-hadoop/wmf/data/wmf/mediawiki_private/editors_daily.

I'm trying to understand the backfill that is referenced per T316689 (which looks at the impact of the switch from user_fingerprint_or_id to user_fingerprint_or_name.)

Right, the backfill on editors_daily did happen ( See T314147#8159492 ), and at that time we backfilled months 2022-06 and 2022-07. I just looked at the table and we do not have 2022-06 anymore:

hive (wmf)> show partitions editors_daily;
OK
partition
month=2022-07
month=2022-08
Time taken: 0.111 seconds, Fetched: 2 row(s)

This is due to the policy in this table being max 60 day retention. It gets scraped automatically.

@xcollazo thank you!

xcollazo mentioned this in T317923: Document wmf.unique_editors_by_country_monthly table.Sep 15 2022, 8:34 PM

Mayakp.wiki closed subtask T316398: Update PA job for changes to editors_daily table as Resolved.Oct 12 2022, 7:11 PM

mpopov mentioned this in T330436: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset.Feb 23 2023, 8:39 PM

	F35462093: migration notes.txt
	Aug 16 2022, 4:22 PM

	F35397141: migration notes.txt
	Aug 4 2022, 4:16 PM

Data Pipeline for Unique Editor metrics by CountryClosed, ResolvedPublic5 Estimated Story PointsActions