Merge visibility changes into hourly target table
Closed, Resolved | Public | 3 Estimated Story Points

Description

In T335860, we implemented a PySpark job that runs a MERGE INTO to transform event data into a table that will eventually hold the full MediaWiki revision history.

In that task, we are ingesting from event.rc1_mediawiki_page_content_change, which includes page deletion events.

However, other visibility changes at the comment and user level are not included in that stream.

In this task we should:

  • Incorporate event.mediawiki_revision_visibility_change into the same target table as in T335860.
  • Figure out whether it makes sense to run it as another component of the existing MERGE INTO, or as a separate MERGE INTO that runs after the one from T335860.
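
To make the second option concrete, here is a minimal sketch of what a separate, visibility-only MERGE INTO could look like. It is not the job that was implemented for this task. The function name, the target column names (`wiki_db`, `revision_id`, `revision_is_text_visible`, and so on), and the `year`/`month`/`day`/`hour` partition filter are all assumptions made for illustration; only the source table name comes from the description above. The sketch only builds the SQL string, so it can be inspected without a Spark session.

```python
def build_visibility_merge_sql(
    target_table: str,
    source_table: str,
    year: int,
    month: int,
    day: int,
    hour: int,
) -> str:
    """Build a MERGE INTO statement that applies one hour of visibility changes.

    All target column names and the hourly partition columns are hypothetical;
    adjust them to the real schemas before use.
    """
    return f"""
MERGE INTO {target_table} AS t
USING (
    -- Keep only the latest visibility change per revision within the hour,
    -- so that each target row matches at most one source row.
    SELECT wiki_db, rev_id, visibility
    FROM (
        SELECT
            database AS wiki_db,
            rev_id,
            visibility,
            ROW_NUMBER() OVER (
                PARTITION BY database, rev_id
                ORDER BY meta.dt DESC
            ) AS row_num
        FROM {source_table}
        WHERE year = {year:d} AND month = {month:d}
          AND day = {day:d} AND hour = {hour:d}
    ) AS ranked
    WHERE row_num = 1
) AS s
ON t.wiki_db = s.wiki_db AND t.revision_id = s.rev_id
WHEN MATCHED THEN UPDATE SET
    t.revision_is_text_visible = s.visibility.text,
    t.revision_is_user_visible = s.visibility.user,
    t.revision_is_comment_visible = s.visibility.comment
""".strip()


# Hypothetical usage inside a PySpark job, after the content-change MERGE has run:
#   spark.sql(build_visibility_merge_sql(
#       "some_db.some_target_table",
#       "event.mediawiki_revision_visibility_change",
#       2023, 8, 24, 14,
#   ))
```

A separate statement like this touches only rows that already exist in the target (there is no WHEN NOT MATCHED branch), which is why it would have to run after the content-change MERGE for the same hour. Folding it into the existing MERGE INTO would avoid a second pass over the target, at the cost of a more complex source query.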

Event Timeline

Change 937047 had a related patch set uploaded (by Jennifer Ebe; author: Jennifer Ebe):

[analytics/refinery@master] T340880 Merge visibility changes into hourly target table

https://gerrit.wikimedia.org/r/937047

Change 937047 abandoned by Jennifer Ebe:

[analytics/refinery@master] T340880 Merge visibility changes into hourly target table

Reason:

moved to dumps gitlab repo

https://gerrit.wikimedia.org/r/937047

WDoranWMF set the point value for this task to 3. (Aug 24 2023, 2:25 PM)

The airflow-dags MR associated to this task is at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/449.

I've just suggested that we hold off on merging that MR, given that in the interim we have moved to wikitext_raw_rc1's schema and to Spark 3.3.2.

The code was moved to wikitext_raw_rc1's schema, but migrating to Spark 3.3.2 is left as a follow-up task. Closing.