
Keep canonical_data.wikis updated
Closed, Duplicate (Public)

Description

In testing https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/555578/ I looked at the difference between wmf_raw.mediawiki_project_namespace_map and canonical_data.wikis. The latter lists closed projects but is not up to date with the most recently added wikis. We should work together with Product-Analytics to establish a way to regularly update and hopefully centralize this dimension across our jobs/pipelines.

Event Timeline

Milimetric triaged this task as High priority.
Milimetric added a project: Analytics-Kanban.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.

ping @Neil_P._Quinn_WMF and @cchen to strategize on this. In my opinion, the ideal is that we update the tables in canonical_data regularly and use those in all our jobs. Any reason not to do that? For the specific differences here we could:

  • split up namespace information from project/wiki information
  • add a closed column to identify closed wikis in case they need to be filtered out
  • hook up the update process that works on project_namespace_map to update canonical_data.wikis instead, and refactor oozie jobs to use that.
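To make the first two bullets concrete, here is a rough sketch of the split, not the actual refinery code. The row shape (`MapRow`), the `split_map` helper, the sample dbnames and hostnames, and the way the set of closed wikis is supplied are all assumptions for illustration; the real project_namespace_map columns may differ.

```python
from typing import NamedTuple


class MapRow(NamedTuple):
    """Hypothetical shape of one row in a combined project/namespace map."""
    dbname: str
    hostname: str
    namespace: int
    namespace_name: str


def split_map(rows, closed_dbnames):
    """Split combined rows into per-wiki rows and per-namespace rows.

    closed_dbnames is an assumed input: a set of wiki db names known to be closed.
    """
    wikis = {}
    namespaces = []
    for r in rows:
        # One entry per wiki, carrying a `closed` flag so jobs can filter on it.
        wikis[r.dbname] = {
            "dbname": r.dbname,
            "hostname": r.hostname,
            "closed": r.dbname in closed_dbnames,
        }
        # Namespace info stays keyed by wiki db name in its own table.
        namespaces.append({
            "dbname": r.dbname,
            "namespace": r.namespace,
            "namespace_name": r.namespace_name,
        })
    return list(wikis.values()), namespaces


# Made-up sample rows; neither wiki name refers to a real project.
rows = [
    MapRow("openexamplewiki", "open.example.org", 0, ""),
    MapRow("openexamplewiki", "open.example.org", 1, "Talk"),
    MapRow("closedexamplewiki", "closed.example.org", 0, ""),
]
wikis, namespaces = split_map(rows, {"closedexamplewiki"})
open_wikis = [w["dbname"] for w in wikis if not w["closed"]]
```

Jobs that need to exclude closed wikis would then filter on `closed` rather than relying on which source table happened to list them.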

Currently, the canonical table is regenerated only occasionally, by running this notebook.

We can schedule some time this week to discuss the process for updating the canonical table.

A productive meeting with Connie and Neil resulted in the following draft proposal:

  • create a separate repository and iteratively migrate the contents of refinery/static_data to it. We'll call it analytics/canonical-data
  • combine manual and automated updates by maintaining manual tables in the repository and joining to them from the automated process. So, for example, the "mobile heavy" column in the wikis table could be a list of wiki db names that should have mobile_heavy: true. When we download the sitematrix, we would left join to the mobile_heavy table and select (join_key is not null) as mobile_heavy.
  • We need to do this iteratively, and we can start by splitting up project_namespace_map. We can merge the wiki info from there into the canonical_data.wikis table, along with the automatic updates. And we can keep the namespace info keyed by wiki db name in a separate table.
  • We almost always want the latest snapshot, so I'll brainbounce with Joseph to see if there's a way to have snapshot=latest or have two different table definitions with one of them wikis_historical or something like that
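The left-join idea from the second bullet could look roughly like the sketch below. It uses an in-memory SQLite database purely for illustration (the real job would run against Hive/Spark), and the table names, column names, the `build_wikis` helper, and the sample rows are assumptions rather than the actual schema.

```python
import sqlite3


def build_wikis(sitematrix_rows, mobile_heavy_dbnames):
    """Join automatically downloaded wiki rows to a manually maintained list.

    sitematrix_rows: iterable of (dbname, hostname) tuples (assumed shape).
    mobile_heavy_dbnames: manually curated db names that should be flagged.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sitematrix (dbname TEXT PRIMARY KEY, hostname TEXT)")
    conn.execute("CREATE TABLE mobile_heavy (dbname TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO sitematrix VALUES (?, ?)", sitematrix_rows)
    conn.executemany(
        "INSERT INTO mobile_heavy VALUES (?)",
        [(dbname,) for dbname in mobile_heavy_dbnames],
    )
    # A wiki is mobile heavy exactly when the left join found a matching row.
    rows = conn.execute(
        """
        SELECT s.dbname, s.hostname, m.dbname IS NOT NULL AS mobile_heavy
        FROM sitematrix AS s
        LEFT JOIN mobile_heavy AS m ON m.dbname = s.dbname
        ORDER BY s.dbname
        """
    ).fetchall()
    conn.close()
    return rows


# Made-up sample data; SQLite returns the flag as 0 or 1.
result = build_wikis(
    [("awiki", "a.example.org"), ("bwiki", "b.example.org")],
    ["bwiki"],
)
```

The appeal of this shape is that the manual table only ever lists the exceptions, so a newly added wiki picks up the default (not mobile heavy) without anyone editing the manual data.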

@JAllemandou, @Nuria thoughts?

Oh, and some more context on how data is manually updated: https://github.com/wikimedia-research/canonical-data/blob/master/generation/wikis.ipynb (see cells 22 and 28)

This feels like something we should do sooner rather than later. The data's not getting any more centralized by itself :)

Aklapper subscribed.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

JAnstee_WMF lowered the priority of this task from High to Medium. (Mar 6 2024, 9:52 PM)
JAnstee_WMF moved this task from Incoming to Watching on the Movement-Insights board.
JAnstee_WMF moved this task from Watching to Incoming on the Movement-Insights board.
JAnstee_WMF moved this task from Incoming to Backlog on the Movement-Insights board.

Change #1125184 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] Add a closed flag to the project namespace map dataset

https://gerrit.wikimedia.org/r/1125184