Page MenuHomePhabricator

Data Pipeline for Unique Editor metrics by Country
Closed, ResolvedPublic5 Estimated Story Points

Description

User Story
As a Data User, I want to be able to count unique editors across different geographic regions regardless of the project they edited.
Scope

Build a data pipeline that modifies the existing insert_editors_daily_data job and aggregates editors across geographies while de-duplicating editors across wiki projects . The output should include a data table that de-duplicates active editors across wiki projects and aggregates them by geography.

Pipeline Description : Here

Success Criteria
  • A data pipeline deployed to the Data Engineering instance of Airflow that performs the functions described in the pipeline description
  • Outputs a queryable hive table that includes monthly counts of editors aggregate by geography.

{fad75b8e3a1322e824d6c8cd5fcd0116d2064c37}

Event Timeline

EChetty created this task.
EChetty set the point value for this task to 5.

Change 820478 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] Add table to compute unique editors per country on a monthly basis.

https://gerrit.wikimedia.org/r/820478

Notes to do migration contained here:

( After a conversation with @KCVelaga, we decided to keep things simple for now, thus the monthly table unique_editors_per_country_monthly will not be using GROUPING SETs .)

Change 820478 merged by Milimetric:

[analytics/refinery@master] Add table to compute unique editors per country on a monthly basis.

https://gerrit.wikimedia.org/r/820478

Status update:

Refinery changes are ready to deploy, but agreed with @Milimetric to hold till week of Aug 15 so that I can learn from the deployment.

Airflow changes still need a review: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/114

Status update: Andrew and Dan tried to deploy this today, but ran into an issue with the query when backfilling. We altered the editors_daily table back to its original state. Dan and Xabriel to look into this when Xabriel is back.
Otto made a copy of the original editors_daily data (2022-06 and 2022-07) at /user/otto/editors_daily_backup_T314147_to_delete.

Copying some notes I sent thru email here for completeness:

The issue found by @Milimetric and @Ottomata while trying to backfill the editors_daily_monthly table seems to be a repro of https://issues.apache.org/jira/browse/SPARK-25271.

I *did* hit this issue while debugging my migration script, but I *only* hit it when I tried to re-CREATE the table, and not when ALTERing it (and so that is why I ALTER the table in the migration script, but as you folks show, I was wrong about it not showing while doing an ALTER, so my debugging must have gone wrong somewhere). The problem has to do with the Hive SerDe being used for writing parquet files not being happy with empty complex types (and so the problematic column is not the one I altered, it is user_is_bot_by). As an example, if the table is STORED AS ORC the error goes away.

We use Spark v2.4.4, and the bug linked above was fixed on v2.4.8 and v3.

I tried running the backfill on a copy of editors_daily_monthly and can confirm that I can't repro the issue with Spark 3.1.2 on our cluster. Since the Airflow jobs that do the monthly run actually run on Spark3 already, there should be no compatibility issues moving forward.

So let's retry the deployment in the near future using Spark3.

Updated migration notes to use Spark3.

Successfully deployed the migration script and Airflow job today.

Next steps are:

  • Verify unique_editors_by_country_monthly table content.
  • Verify that the Airflow job is working as expected.

Verified that the table content and airflow job are working as designed. Closing.

@KCVelaga_WMF new table description below.

hive (default)> describe formatted wmf.unique_editors_by_country_monthly;
OK
col_name	data_type	comment
# col_name            	data_type           	comment             
	 	 
country_code        	string              	The 2-letter ISO country code this group of editors geolocated to, including Unknown (--)
users_are_anonymous 	boolean             	Whether or not this group of editors edited anonymously
activity_level      	string              	How many edits this group of editors performed, can be "1 to 4", "5 to 99", or "100 or more"
distinct_editors    	bigint              	Number of editors meeting this activity level
namespace_zero_distinct_editors	bigint              	Number of editors meeting this activity level with only namespace zero edits
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
month               	string              	The month in YYYY-MM format
	 	 
# Detailed Table Information	 	 
Database:           	wmf                 	 
...
Table Parameters:	 	 
...       
	comment             	This table contains unique editor counts by country, regardless of wiki project.

Thanks @xcollazo. From this commit, I see that editors_daily is also now using username instead of id. I have updated the Wikitech documentation for that. I am also wondering if we should document the new table as well on the same page. If it is the right page, I am happy to do that.

I have updated the Wikitech documentation for that

Thanks @KCVelaga_WMF !

I am also wondering if we should document the new table as well on the same page. If it is the right page, I am happy to do that.

Agreed, we should document. @Milimetric can you please advise if that is the right location?

I wanted to note here that Product Analytics relies on the editors_daily table to calculate diversity metrics. While we were aware of the creation of a new aggregated table to calculate Unique Editor metrics by Country, this task doesn't explicitly mention that user_fingerprint_or_id will be changed to user_fingerprint_or_name .
Changes to the source table has impacted global north and global south metrics such as edits, editors, net new content etc. as well as the analytics-wmf-product-jobs that updates the wmf_product.active_editors table which provides data to the editors dashboard.

this task doesn't explicitly mention that user_fingerprint_or_id will be changed to user_fingerprint_or_name

@Mayakp.wiki sorry we broke your pipelines! We assumed that the change was low risk since there was no direct usage of this table in Superset, and we had made the necessary downstream table changes (of the tables that we control). We should have announced this change more broadly. I'll do so next time.

Changes to the source table has impacted global north and global south metrics such as edits, editors, net new content etc.

You can still use user_fingerprint_or_name in the same way you used user_fingerprint_or_id before, but now you will not get false positives due to id clashes between wikis.

@xcollazo can you share documentation on the referenced editors_daily_monthly table? Where is that table located? I'm trying to understand the backfill that is referenced per T316689 (which looks at the impact of the switch from user_fingerprint_or_id to user_fingerprint_or_name.)

Copying some notes I sent thru email here for completeness:

The issue found by @Milimetric and @Ottomata while trying to backfill the editors_daily_monthly table seems to be a repro of https://issues.apache.org/jira/browse/SPARK-25271.

I *did* hit this issue while debugging my migration script, but I *only* hit it when I tried to re-CREATE the table, and not when ALTERing it (and so that is why I ALTER the table in the migration script, but as you folks show, I was wrong about it not showing while doing an ALTER, so my debugging must have gone wrong somewhere). The problem has to do with the Hive SerDe being used for writing parquet files not being happy with empty complex types (and so the problematic column is not the one I altered, it is user_is_bot_by). As an example, if the table is STORED AS ORC the error goes away.

We use Spark v2.4.4, and the bug linked above was fixed on v2.4.8 and v3.

I tried running the backfill on a copy of editors_daily_monthly and can confirm that I can't repro the issue with Spark 3.1.2 on our cluster. Since the Airflow jobs that do the monthly run actually run on Spark3 already, there should be no compatibility issues moving forward.

So let's retry the deployment in the near future using Spark3.

@xcollazo can you share documentation on the referenced editors_daily_monthly table? Where is that table located?

That was a typo, sorry, I meant to write editors_daily. This one is documented here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors#Editors_daily.

You can find this table on hive at wmf.editors_daily, and on HDFS at hdfs://analytics-hadoop/wmf/data/wmf/mediawiki_private/editors_daily.


I'm trying to understand the backfill that is referenced per T316689 (which looks at the impact of the switch from user_fingerprint_or_id to user_fingerprint_or_name.)

Right, the backfill on editors_daily did happen ( See T314147#8159492 ), and at that time we backfilled months 2022-06 and 2022-07. I just looked at the table and we do not have 2022-06 anymore:

hive (wmf)> show partitions editors_daily;
OK
partition
month=2022-07
month=2022-08
Time taken: 0.111 seconds, Fetched: 2 row(s)

This is due to the policy in this table being max 60 day retention. It gets scraped automatically.