
Add MW table 'cu_log' to data lake
Open, Medium · Public · 5 Estimated Story Points

Description

Background/Goal

Product Analytics maintains a Superset dashboard of IP Masking (Temp Accounts) metrics for Trust & Safety Product. Most of these metrics are calculated from sqooped MediaWiki tables in the data lake (in wmf_raw). One table, cu_log, is missing from the data lake, so we have to calculate the Admin Requests metric by querying the MariaDB analytics replicas instead.

Currently, all of these metrics are calculated in a Jupyter notebook scheduled to run under @jwang's user account. We are starting to migrate the queries to Airflow (T364406) and will be able to migrate all of them except the one that depends on cu_log.

Please add it to the list of MW tables that are sqooped up and made available as monthly snapshots in the data lake. Thank you!

KR/Hypothesis(Initiative)

Temporary accounts for unregistered users

Success metrics

  • How we will measure success

In scope

  • Monthly snapshots of cu_log available in the data lake

Out of Scope

  • known boundaries

Artifacts & Resources

Event Timeline

mpopov triaged this task as Medium priority. May 7 2024, 3:34 PM
mpopov created this task.
lbowmaker set the point value for this task to 5. May 16 2024, 1:01 PM

Update: @Ahoelzl is going to sync with DPE about sqoop and will follow up after the group is aligned on making more MariaDB wiki replica data available in the Data Lake.

DPE DE started working on the table ingestion.

Change #1082515 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery@master] Add Query for cu_log table in sqoop.

https://gerrit.wikimedia.org/r/1082515

Change #1082800 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[operations/puppet@production] Add cu_log table to sqoop job

https://gerrit.wikimedia.org/r/1082800

Change #1084208 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery@master] Add create table statement file for wmf_raw.cu_log table

https://gerrit.wikimedia.org/r/1084208

Change #1082515 merged by Snwachukwu:

[analytics/refinery@master] Add Query for cu_log table in sqoop.

https://gerrit.wikimedia.org/r/1082515

Change #1084208 merged by Joal:

[analytics/refinery@master] Add create table statement file for wmf_raw.cu_log table

https://gerrit.wikimedia.org/r/1084208

Change #1082800 merged by Brouberol:

[operations/puppet@production] Add cu_log table to sqoop job

https://gerrit.wikimedia.org/r/1082800

As requested by @jwang and @mpopov, September's data for cu_log is now available in the data lake in the wmf_raw.mediawiki_private_cu_log table. The cu_log table has also been added to the list of tables that are sqooped monthly, so expect new monthly snapshots on the same schedule as the other tables.
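
For anyone who wants to sanity-check a snapshot before migrating the dependent query, here is a minimal sketch in Python that builds a row-count query for one monthly snapshot. Only the table name comes from this task. The `snapshot` and `wiki_db` partition columns and the `YYYY-MM` snapshot label are assumptions based on how other sqooped MediaWiki tables are commonly laid out, and the helper name `build_cu_log_count_query` is hypothetical. This is not the Admin Requests metric itself, only a check that data landed.

```python
import re

# Table name as given in this task. The partition columns used below
# (snapshot, wiki_db) are ASSUMPTIONS and should be verified against the
# actual table definition before use.
TABLE = "wmf_raw.mediawiki_private_cu_log"


def build_cu_log_count_query(snapshot: str) -> str:
    """Return SQL that counts cu_log rows per wiki for one monthly snapshot.

    `snapshot` is expected to look like 'YYYY-MM' (assumed label format).
    """
    # Validate the label so a malformed value never reaches the SQL string.
    if not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    return (
        "SELECT wiki_db, COUNT(*) AS n_log_entries\n"
        f"FROM {TABLE}\n"
        f"WHERE snapshot = '{snapshot}'\n"
        "GROUP BY wiki_db\n"
        "ORDER BY n_log_entries DESC"
    )


if __name__ == "__main__":
    # Print the query text; run it with whatever SQL client you normally use.
    print(build_cu_log_count_query("2024-09"))
```

The function only returns a query string, so it can be pasted into a notebook or handed to an existing SQL client without adding any dependency.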