
Automate creation of sqoop list of wikis to import data for from sitematrix
Open, Needs Triage, Public

Description

Two options to implement this:

  • For snapshots, we could keep pulling from sitematrix as we do now, perhaps more frequently.
  • For event-based imports, we could get notified by this maintenance bot: T292419: Post-creation work for pwnwiki
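As a minimal sketch of the snapshot option, the function below turns an already-fetched response from the MediaWiki `action=sitematrix` API into a sorted list of database names. The function name `dbnames_from_sitematrix` and the `skip_flags` parameter are hypothetical. The sketch assumes that flags such as `private` or `closed` appear on a site entry only when they are set, and it does not claim to reproduce the filtering the current sqoop list uses.

```python
def dbnames_from_sitematrix(response, skip_flags=("private",)):
    """Return sorted dbnames from a parsed sitematrix API response.

    Assumption: a flag key (e.g. "private", "closed") is present on a
    site entry only when that flag is set for the wiki.
    """
    matrix = response["sitematrix"]
    dbnames = set()
    for key, entry in matrix.items():
        if key == "count":
            # Total number of wikis, not a group of sites.
            continue
        # "specials" is a flat list of sites; language groups nest
        # their sites under the "site" key.
        sites = entry if key == "specials" else entry.get("site", [])
        for site in sites:
            if any(site.get(flag) not in (None, False) for flag in skip_flags):
                continue
            dbnames.add(site["dbname"])
    return sorted(dbnames)
```

Whether closed or private wikis belong in the list is a policy decision, so the sketch leaves it to the caller through `skip_flags`.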

Event Timeline

fdans renamed this task from Automate creation of sqoop list of wikis to import data for to Automate creation of sqoop list of wikis to import data for from sitematrix. Mar 26 2018, 4:39 PM
fdans triaged this task as High priority.
fdans created this task.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
mforns lowered the priority of this task from High to Medium. Oct 15 2018, 4:35 PM
mforns raised the priority of this task from Medium to High. Dec 9 2019, 5:44 PM

Tasked this briefly. We want to coordinate with T239589 and make T239136 unnecessary. To that end:

  • Use project namespace map table as the main source of truth when it comes to wiki lists (maybe rename)
  • Add a boolean column to project namespace map to indicate whether the wiki database exists on cloud dbs
  • Coordinate with research and product analytics to share this metadata as part of their metadata database
  • Consider a table holding settings like per-table weights that we could use from sqoop to fine-tune how data is transferred
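One way to picture the last two ideas in the list above is sketched below: each wiki row carries a boolean saying whether its database exists on the cloud dbs, and a separate settings mapping holds per-table weights. Every name here (`WikiEntry`, `on_cloud_db`, `plan_imports`, `total_mappers`) is hypothetical, and turning weights into a proportional share of mappers is only one possible reading of "fine-tune how data is transferred", not something the task specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiEntry:
    dbname: str
    on_cloud_db: bool  # hypothetical boolean column from the proposal


def plan_imports(wikis, table_weights, total_mappers=64):
    """Return (dbname, table, mappers) tuples for wikis on the cloud dbs.

    Assumption: a table's weight is its proportional share of
    total_mappers, with a floor of one mapper per table.
    """
    weight_sum = sum(table_weights.values())
    mappers = {
        table: max(1, round(total_mappers * weight / weight_sum))
        for table, weight in table_weights.items()
    }
    return [
        (wiki.dbname, table, mappers[table])
        for wiki in wikis
        if wiki.on_cloud_db
        for table in sorted(table_weights)
    ]
```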
odimitrijevic lowered the priority of this task from High to Low. Jan 6 2022, 5:09 AM
Milimetric raised the priority of this task from Low to Needs Triage. Mar 8 2022, 5:00 PM
Milimetric added a subscriber: EChetty.

I think this should be much higher priority now. Usually we would move tasks back to Incoming to re-prioritize, but I'm not sure how to do that in the new process. cc @EChetty

This would only work for the snapshots, but a simple solution would be to just pull the sqoop list from canonical_data.wikis. There's no automated process keeping that up to date, but:

  1. I've been pretty diligent about keeping it up to date (adding new wikis within a few weeks)
  2. We are motivated to switch it to automatic generation as soon as possible, since the manual updates are a source of toil for the Movement Insights team
  3. It centralizes us further on a single canonical list of wikis, easing the maintenance burden (see T339928 for another example)
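Until generation is automatic, a small check could flag drift between the manually maintained list and the live set of wikis. The sketch below assumes both lists have already been loaded as plain database names (for example from canonical_data.wikis and from sitematrix). The function name `diff_wiki_lists` and the result keys are hypothetical.

```python
def diff_wiki_lists(canonical, live):
    """Compare the manually maintained list against the live list.

    Both arguments are iterables of database names; duplicates are ignored.
    """
    canonical, live = set(canonical), set(live)
    return {
        # Wikis that exist but have not yet been added by hand.
        "missing_from_canonical": sorted(live - canonical),
        # Wikis still listed that the live source no longer reports.
        "not_in_live": sorted(canonical - live),
    }
```

An empty `missing_from_canonical` would mean the manual list has caught up with newly created wikis.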