
Allow Superset to query ToolsDB public databases
Open, In Progress, Medium, Public

Description

Similar to T151158: Support queries against Quarry's own database and ToolsDB, we would like Superset to have access to ToolsDB databases, but only the public ones, which are identified by a name ending with _p.

Event Timeline

fnegri changed the task status from Open to In Progress. Mon, Jun 17, 10:27 AM
fnegri triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-24T15:43:49Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.add_user_to_project for user 'fnegri' in role 'member' (T367393)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-24T15:43:56Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'fnegri' in role 'member' (T367393)

At first look, it's not possible (at least not easily) to give Superset access to all ToolsDB _p databases at once; instead, we need to list each database individually.

@KCVelaga_WMF which database(s) were you interested in? We'll probably have to add them ad hoc, following user requests.

@fnegri ad hoc is fine. I don't need access to a specific database at the moment.

But let me explain the need: product teams often need to create dashboards for new features (T362610) or set up monitoring for existing ones (T365813). Both the product teams and the community need access to these dashboards, and the public Superset instance works well for this.

However, the way it currently works, I can only run a query, and build the dependent chart or dashboard, against a single database (wiki). The wiki should be a filter rather than a database selection; otherwise I would have to create a separate dashboard for each wiki the feature is on, potentially hundreds, which is not practical. For a filter to work, the data for all wikis has to come from a single table/database. My idea is to set up a pipeline that aggregates the required metrics from all wikis and writes them to a database in ToolsDB, which I can then use for the required dashboards. I want to be sure that Superset can do this before working on the data pipelines etc.
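To make the "wiki as a filter" idea concrete, here is a minimal sketch of the aggregated-table shape. It uses an in-memory SQLite database as a stand-in for ToolsDB so that it runs anywhere; the table name `feature_metrics`, the `reverts` column, and the sample numbers are all hypothetical and only illustrate the layout, not the actual pipeline.

```python
import sqlite3

# Hypothetical per-wiki results, as if collected by querying each wiki separately.
# Wiki names and numbers are made up for illustration.
PER_WIKI_ROWS = {
    "enwiki": [("2024-06-01", 12), ("2024-06-02", 15)],
    "idwiki": [("2024-06-01", 3)],
}


def build_aggregate(conn, per_wiki_rows):
    """Write every wiki's rows into ONE table, keeping the wiki as a column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature_metrics ("
        "wiki TEXT NOT NULL, day TEXT NOT NULL, reverts INTEGER NOT NULL, "
        "PRIMARY KEY (wiki, day))"
    )
    for wiki, rows in per_wiki_rows.items():
        conn.executemany(
            "INSERT OR REPLACE INTO feature_metrics (wiki, day, reverts) "
            "VALUES (?, ?, ?)",
            [(wiki, day, reverts) for day, reverts in rows],
        )
    conn.commit()


def reverts_for(conn, wiki):
    """The kind of query a dashboard filter on `wiki` would boil down to."""
    row = conn.execute(
        "SELECT COALESCE(SUM(reverts), 0) FROM feature_metrics WHERE wiki = ?",
        (wiki,),
    ).fetchone()
    return row[0]


# SQLite in memory stands in for a ToolsDB database here (an assumption for the sketch).
conn = sqlite3.connect(":memory:")
build_aggregate(conn, PER_WIKI_ROWS)
```

Because every row carries its wiki, a single chart can be built on `feature_metrics` and narrowed with a `WHERE wiki = ...` condition, instead of one chart per database.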

I might have to do it soon, as Automoderator gets deployed to more wikis. In the meantime, is it possible to test it? I can create a temporary DB within ToolsDB with test data and use that.

Let me know what you think.

@KCVelaga_WMF I think your plan should work, and I don't see any problem unless the aggregated data gets too big (we currently recommend a maximum size of 25 GB for each individual database in ToolsDB).

Feel free to create a temporary DB within ToolsDB to test it; please add _p to the end of the name. Let me know the full name and I'll add a rule to the Superset config.
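As a rough illustration of the per-database approach (not the actual Superset configuration, whose format is not shown in this task), the rule amounts to an explicit allow-list that only accepts names ending in _p. The function names and the example database name `s99999__test_p` below are hypothetical.

```python
# Hypothetical allow-list of ToolsDB databases that Superset may query.
# This is NOT Superset's real config format, only a sketch of the logic.
ALLOWED_DATABASES = set()


def is_public_name(db_name: str) -> bool:
    """Public ToolsDB databases are identified by a name ending with _p."""
    return db_name.endswith("_p") and len(db_name) > len("_p")


def allow(db_name: str) -> None:
    """Add one database to the allow-list, refusing non-public names."""
    if not is_public_name(db_name):
        raise ValueError(f"{db_name!r} does not end with _p, refusing to expose it")
    ALLOWED_DATABASES.add(db_name)


def can_query(db_name: str) -> bool:
    """Each database must be listed individually; the _p suffix alone is not enough."""
    return db_name in ALLOWED_DATABASES


# Example: a test database requested ad hoc (made-up name).
allow("s99999__test_p")
```

With this shape, a new public database stays invisible until someone adds it explicitly, which matches handling requests ad hoc.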

@fnegri I only plan to have final aggregated tables, so it should be much less than the 25 GB limit.

I will create a test DB and get back to you in a couple of days.