
[Data Quality] Implement wiki completeness check for MediaWiki History
Closed, Resolved | Public | 5 Estimated Story Points

Description

User Story

As a data analyst working on content and contributor metrics using MediaWiki History, I need the most recently generated snapshot to have all the production wikis that were open/active at the time of the snapshot so that I'm working with data from all wikis – including languages/projects recently graduated from Incubator – allowing me to paint a more complete picture of growth and productivity in the Wikimedia movement.

Context:

Notes

Related:

Event Timeline

mpopov renamed this task from [Data Quality] Implement completeness check for MediaWiki History to [Data Quality] Implement wiki completeness check for MediaWiki History. May 16 2024, 9:12 PM

An idea for an automated quality check step:

Use @Hghani's site creation scraper to scrape the newest wikis from the Site Creation Log, then cross-check them against canonical_data.wikis.

(This could also help with quality-checking canonical_data.wikis itself.)
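
A minimal sketch of that cross-check, assuming both sources can be reduced to plain lists of database names. The function name `find_missing_wikis` and the sample values are hypothetical and only illustrate the set difference; they are not taken from the scraper or from canonical_data.wikis.

```python
def find_missing_wikis(recently_created, known_wikis):
    """Return wikis seen in the creation log but absent from the reference list."""
    # Normalize names so whitespace or case differences do not cause false alarms.
    known = {name.strip().lower() for name in known_wikis}
    created = {name.strip().lower() for name in recently_created}
    return sorted(created - known)


# Hypothetical sample data standing in for the scraper output and the table.
scraped_from_creation_log = ["enwiki", "annwiki", "igwiktionary"]
rows_in_canonical_wikis = ["enwiki", "igwiktionary", "dewiki"]

missing = find_missing_wikis(scraped_from_creation_log, rows_in_canonical_wikis)
print(missing)
```

A non-empty result would flag either a wiki that has not yet reached the reference table or a gap in the table itself.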

lbowmaker set the point value for this task to 5. May 29 2024, 2:19 PM

Notes from chat with @gmodena:

  • Should be easy enough for someone in the DE team to implement
  • Maybe 3 points: about 1 day for coding, then reviews and testing could bring it to between 3 days and 1 week of effort
  • We would also need to decide whether we want alerts too.

I’ll tentatively put this into Q1 plans as a 5; we wouldn’t get to this by the end of June.

To solve the missing wikis issue, we decided it's best to automate the sqoop list. There are three sources of truth under consideration:

  1. The canonical_data.wikis table (from the Wikimedia NOC website). Note there is ongoing work to automate this table: T339928.
  2. The Site Creation Log website.
  3. The project_namespace_map table.

In this document, I tried to get people's opinion on the preferred option.

Note that the project_namespace_map table was only recently added as an option, after discussing with Dan, so it wasn't covered in the document. The table is currently updated monthly, but we can change that to daily. I looked at the table and it's good, but it didn't have a status column (there is a new patch for this). I'm leaning towards this approach.
Whatever approach we pick, we'll need to validate that a wiki exists in the cloud replica before adding it to the sqoop list. For the project namespace map, we can add a column for this check. Now I just have to add a step to validate wikis against the cloud DB, pull the list, and use that list for sqoop.
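
The validation step described above could look roughly like the following sketch. It assumes the candidate wikis and the databases present in the cloud replica are both available as lists of database names; `build_sqoop_list` and the sample values are hypothetical, and the actual change implements this in HQL against the project namespace map rather than in Python.

```python
def build_sqoop_list(candidate_dbnames, replica_dbnames):
    """Split candidates into wikis that exist in the replica and wikis that do not."""
    replica = set(replica_dbnames)
    # De-duplicate and sort so the generated list is stable between runs.
    candidates = sorted(set(candidate_dbnames))
    to_sqoop = [db for db in candidates if db in replica]
    skipped = [db for db in candidates if db not in replica]
    return to_sqoop, skipped


# Hypothetical inputs: candidates from the chosen source of truth,
# and the databases actually present in the cloud replica.
candidates = ["enwiki", "annwiki", "dewiki"]
replica = ["dewiki", "enwiki", "frwiki"]

to_sqoop, skipped = build_sqoop_list(candidates, replica)
print(to_sqoop)  # wikis that can be added to the sqoop list
print(skipped)   # wikis held back until they appear in the replica
```

Returning the skipped wikis alongside the list makes it possible to alert on wikis that stay missing from the replica for longer than expected.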

Thanks Sandra!

I am not 100% on all the pros and cons of the solutions, but I'm sure you and Dan and Neil can work out the best thing. Whatever yall think!

Change #1125184 had a related patch set uploaded (by Snwachukwu; author: Milimetric):

[analytics/refinery@master] 1.Add a closed flag to the project namespace map dataset 2. Add a whether to sqoop flag by checking if wikidb exists in cloud replica.

https://gerrit.wikimedia.org/r/1125184

Change #1191398 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery@master] CREATE HQL SCRIPT TO UPDATE SCOOP WIKI LIST DATA FILE.

https://gerrit.wikimedia.org/r/1191398

Change #1125184 merged by Snwachukwu:

[analytics/refinery@master] Update project namespace map fields.

https://gerrit.wikimedia.org/r/1125184

Change #1191398 merged by Snwachukwu:

[analytics/refinery@master] Add HQL Script to update Mediawiki Ingestion wikis.

https://gerrit.wikimedia.org/r/1191398

Change #1193440 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Add check for wikis count to Mediawiki history data quality checks

https://gerrit.wikimedia.org/r/1193440

Change #1193440 merged by jenkins-bot:

[analytics/refinery/source@master] Add check for wikis count to Mediawiki history data quality checks

https://gerrit.wikimedia.org/r/1193440

Change #1195268 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Bug Fix: Add support for Deequ Metric value Distribution data type

https://gerrit.wikimedia.org/r/1195268

Change #1195268 merged by Snwachukwu:

[analytics/refinery/source@master] Bug Fix: Add support for Deequ Metric value Distribution data type

https://gerrit.wikimedia.org/r/1195268