Page MenuHomePhabricator

Allow Superset to query ToolsDB public databases
Closed, ResolvedPublic

Description

Similar to T151158: Support queries against Quarry's own database and ToolsDB, we would like Superset to have access to ToolsDB databases, only the public ones identified by a name ending with _p.

Event Timeline

fnegri changed the task status from Open to In Progress.Jun 17 2024, 10:27 AM
fnegri triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-24T15:43:49Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.add_user_to_project for user 'fnegri' in role 'member' (T367393)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-24T15:43:56Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'fnegri' in role 'member' (T367393)

From a first look, it's not possible (at least not easily) to give superset access to all ToolsDB _p databases, but we need to list each database individually.

@KCVelaga_WMF which database(s) you were interested in? We'll probably have to add them "ad hoc" following user requests.

@fnegri ad-hoc is fine. I don't need access a specific database at the moment.

But let me explain the need: for product teams there is often need to create dashboards for new features (T362610) or setup monitoring for existing ones (T365813). We need access to the dashboards for both to the product teams and the community - the public Superset instance works well for this.

However, the way it currently works, I can only run a query, and the dependent chart/dashboard against a single database/wiki. Wikis should be a filter, rather than a database selection - otherwise I will have to create hundreds of dashboards for each wiki the feature is on, which is not practical. So for a filter to work, the data for all wikis should come from a single table/database. My idea is setup a pipeline that aggregates the required metrics from all wikis, and writes to a db in the ToolsDB, which I can then use for the required dashboards. I want to be sure that Superset can do this before working on the data pipelines etc.

I might have to do it soon as Automoderator gets deployed to more wikis. In the meantime, is possible to test it? I can create a temporary DB within ToolsDB with test data and use it.

Let me know what you think.

@KCVelaga_WMF I think your plan should work, and I don't see any problem unless the size of the aggregated data gets too big (we currently recommend a maximum of 25 GB of size for each individual database in ToolsDB).

Feel free to create a temporary DB within ToolsDB to test it, please add _p to the name. Let me know the full name and I'll add a rule to the Superset config.

@fnegri I only plan to have final aggregated tables, so it should be much less than 25 GB limit.

I will create a test DB and get back to you in a couple of days.

@fnegri I have created a db named s55986__automod_metrics_p with a couple of tables and some dummy data.

Please add it to Superset config for testing.

I created the user and grants in ToolsDB, similar to the Quarry ones I created in T348407.

MariaDB [(none)]> CREATE USER 'superset_readonly'@'172.16.%' IDENTIFIED BY '***';
Query OK, 0 rows affected (0.310 sec)

MariaDB [(none)]> GRANT SELECT, SHOW VIEW ON `%\_p`.* TO 'superset_readonly'@'172.16.%';
Query OK, 0 rows affected (0.115 sec)

MariaDB [(none)]> SHOW GRANTS FOR `superset_readonly`@`172.16.%`;
+-------------------------------------------------------------------------------------------------------------------------+
| Grants for superset_readonly@172.16.%                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `superset_readonly`@`172.16.%` IDENTIFIED BY PASSWORD '***' |
| GRANT SELECT, SHOW VIEW ON `%\_p`.* TO `superset_readonly`@`172.16.%`                                                   |
+-------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.001 sec)

Then in the PR https://github.com/toolforge/superset-deploy/pull/26, I added the credentials to Superset, and also the first database requested by @KCVelaga_WMF.

@Andrew @rook can you please review the PR at https://github.com/toolforge/superset-deploy/pull/26 and test if it works?

@KCVelaga_WMF I will be away until 15 July, but hopefully you will get something to test before I'm back.

@Andrew @rook can you please review the PR at https://github.com/toolforge/superset-deploy/pull/26 and test if it works?

Looks good. Deployed

@fnegri thank you! It is working as expected. I just created test dashboard with various wikis as filters. This is great.

Also when I select s55986__automod_metrics_p as database, it seems like all the available public tool databases are exposed. I am able to select those schemas and run queries. Not sure if this is expected or a bug.

@KCVelaga_WMF thanks for testing! Yes it is kind of expected that all dbs are exposed, as in the SQL permissions are for all dbs ending in _p. So maybe to solve the problem of manually listing the dbs, we could just call the db "ToolsDB" in Superset, and ask users to start all their queries with use some_db_p;. What do you think?

I got an error when i tried to query bigint columns

MySQL Error: Python int too large to convert to C long

How to reproduce:

  1. Select SQLLab: https://superset.wmcloud.org/sqllab/
  2. select s55986__automod_metrics_p as database
  3. select s55462__imagehashpublic_p as schema
  4. select imagehash as table

run query

SELECT max(phash) FROM imagehash

I updated the SQL query so that it will fetch large numbers instead of zeroes. (ie. SELECT max(phash) FROM imagehash) Also screenshot for consistency.

Screenshot 2024-08-01 at 18.01.36.png (1×3 px, 351 KB)

@Zache hmm this looks like a Superset bug to me, I found https://github.com/apache/superset/issues/15876 but they say it's resolved in Superset version 3, and we are running 3.11.

Can you please open a new "Bug" task in the superset.wikimedia.org project?

to solve the problem of manually listing the dbs, we could just call the db "ToolsDB" in Superset, and ask users to start all their queries with use some_db_p;

This is even easier than I imagined, there's no need to manually type use some_db_p as the SuperSet UI automatically detects all the available dbs and shows them under "schema", where you can pick the one you need:

Screenshot 2024-08-22 at 16.30.19.png (844×788 px, 93 KB)

There's one last quirk before we can resolve this task, I updated the database name to ToolsDB, but SuperSet has cached the old name (s55986__automod_metrics_p) somewhere and it's showing both:

Screenshot 2024-08-22 at 16.35.34.png (670×720 px, 56 KB)

@fnegri s55986__automod_metrics_p was just for testing, there is no real data in there. If it helps, I can delete it.

@KCVelaga_WMF the UI issue is fixed, but it's always good to clean up unused dbs, so feel free to delete it at your convenience!