Page MenuHomePhabricator

Allow Quarry to query ToolsDB public databases
Open, In Progress, MediumPublic

Description

In T151158, we plan to make public ToolsDB databases (the ones with a name ending in _p) queryable from Quarry. For this, Quarry needs a db user in which any write access to tools databases are explicitly denied.

s52788 is the existing user for querying replicas – maybe we could use the same after adjusting its permissions.

Event Timeline

fnegri subscribed.

I think the best solution here (both for security and performance) is to let Quarry connect to the read-only replica host (ToolsDB is running on two hosts, a read-write primary and a read-only replica). We don't have a DNS pointing to the ToolsDB replica but I can add it. In this way, any query from Quarry will not affect the performance of the primary ToolsDB host. I would probably also create a separate user dedicated to Quarry, it can have a more readable name (something like quarry_readonly) and it will make more obvious what it is used for.

I can take care of creating the DNS and the user, but I also want to consider another thing before opening this access: do we know of any tool that is currently storing any form of private/sensitive data in ToolsDB? For example, I imagine some tools might be using ToolsDB to store information related to user interactions with the tool itself, which I'm not sure we should make publicly available. Even if no tool is currently storing private/sensitive data in ToolsDB, we should clearly communicate on wiki and on mailing lists that this access will be opened, along with recommendations for how to store private/sensitive data if necessary.

I will raise this topic in the next Cloud Services and Toolforge meeting, I think it's a good feature to have but I want to make sure it doesn't result in exposing publicly any sensitive data.

but I also want to consider another thing before opening this access: do we know of any tool that is currently storing any form of private/sensitive data in ToolsDB? For example, I imagine some tools might be using ToolsDB to store information related to user interactions with the tool itself, which I'm not sure we should make publicly available.

Only the public tool databases (the ones with names ending in _p) are planned to be made accessible from Quarry. These are already accessible as of today by anyone with a toolforge account, so are not expected to contain private/sensitive data. Tools that store private data would use a non-_p-suffixed db name so that the db is accessible only by the tool.

Only the public tool databases (the ones with names ending in _p) are planned to be made accessible from Quarry.

Thanks, sorry for missing this part! I agree it should not be a security risk, even if it's still increasing the audience from "anyone with a toolforge account" to "anyone". :)

I will check if there are any objections in the WMCS team tomorrow, and if not I will proceed with creating the DNS for the ToolsDB replica host, and a dedicated user for Quarry.

Is this a feature we want to also make accessible to Superset? I suspect it could use the same technical implementation from the toolsdb side, and would make sense to do now if that is desired. cc @rook

Is this a feature we want to also make accessible to Superset? I suspect it could use the same technical implementation from the toolsdb side, and would make sense to do now if that is desired. cc @rook

If the solution is a user/pass/endpoint it should work fine to add to superset as well.

fnegri triaged this task as Medium priority.Oct 20 2023, 4:12 PM

There were no objections in the WMCS meeting, so we can proceed with creating the DNS record and the db user.

As this does not look too urgent, I plan to work on this next month, after I complete a few other things I'm currently working on.

Hi @fnegri, did you get a chance to get to this? Thanks!

@SD0001 not yet, sorry, too many other things! It's in my to-do list though!

@fnegri any approximate on when this might be prioritized. This will be very helpful for creating dashboards for Product teams on Superset. For example, the Content Translation key metrics dashboard (T325790)

@fnegri just verifying, the quarry_readonly user only has to have access to the public databases (not just read-only). There's a set of private databases for each tool that quarry should not have access to (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Privileges_on_the_database).

@KCVelaga_WMF I'm sorry there was no progress on this so far, it is still in my backlog. I plan to finally work on this before the end of April.

@dcaro thanks for the reminder. This was discussed in the comments above, and the plan is to give access only to the databases with a name ending with _p. Does it cover your concerns?

@fnegri that'll be amazing, thank you! Also, a quick question, does this also enable Superset's SQL Lab access to tool dbs, or is that seperate?

Superset must be configured separately, but it can reuse the same credentials.

@KCVelaga_WMF I'm sorry there was no progress on this so far, it is still in my backlog. I plan to finally work on this before the end of April.

@dcaro thanks for the reminder. This was discussed in the comments above, and the plan is to give access only to the databases with a name ending with _p. Does it cover your concerns?

I missed that, yes that should be it :)

Superset must be configured separately, but it can reuse the same credentials.

We might want to give a different user to avoid confusion (ex. who is running this huge query that is stuck for two days?), unless it's too cumbersome

We might want to give a different user to avoid confusion (ex. who is running this huge query that is stuck for two days?), unless it's too cumbersome

Makes sense, we can create two different users with the same privileges.

fnegri changed the task status from Open to In Progress.May 17 2024, 3:09 PM

I finally started working on this. As a first proof of concept, I tried creating a new user on ToolsDB:

CREATE USER 'quarry_readonly'@'172.16.%' IDENTIFIED BY '{***}';
GRANT SELECT, SHOW VIEW ON `%\_p`.* TO `quarry_readonly`@`172.16.%`;

I verified I can connect from a container running in the Quarry K8s cluster,

root@quarry-bastion:~# kubectl run --rm --tty -i --image mariadb test-t348407 -- mariadb -h 172.16.0.142 -u quarry_readonly -p$TOOLSDB_PASSWORD

MariaDB [(none)]>

I can query databases with a name ending in _p, but not other databases:

MariaDB [(none)]> SELECT count(*) FROM s52532__listeria_p.list;
+----------+
| count(*) |
+----------+
|      417 |
+----------+
1 row in set (0.002 sec)

MariaDB [(none)]> SELECT count(*) FROM s52532__listeria_bot.wikis;
ERROR 1142 (42000): SELECT command denied to user 'quarry_readonly'@'172.16.2.72' for table `s52532__listeria_bot`.`wikis`

Next steps:

  1. set up a DNS pointing to the ToolsDB replica host (currently tools-db-3 (172.16.0.142) but likely to change in the future)
  2. add the credentials to quarry/config-prod.yaml
  3. restart/redeploy Quarry
  4. figure out what we need for Superset (can be split to a separate task)

Change #1034042 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Add DNS for ToolsDB replica host

https://gerrit.wikimedia.org/r/1034042

The Quarry change above (adding the new credentials) has been merged and deployed, but I'm not sure how to ask Quarry to connect to ToolsDB instead of the wikireplicas.

@LucasWerkmeister pointed out in IRC we should announce more widely this feature before enabling access from Quarry to all the _p databases.

@Soda is testing the feature on the db s55771__wsstats_p so I restricted the access to that single database for now:

MariaDB [(none)]> REVOKE SELECT, SHOW VIEW ON `%\_p`.* FROM `quarry_readonly`@`172.16.%`;
Query OK, 0 rows affected (0.005 sec)

MariaDB [(none)]> GRANT SELECT, SHOW VIEW ON `s55771\_\_wsstats\_p`.* TO `quarry_readonly`@`172.16.%`;
Query OK, 0 rows affected (0.026 sec)

MariaDB [(none)]> SHOW GRANTS FOR `quarry_readonly`@`172.16.%`;
+-----------------------------------------------------------------------------------------------------------------------+
| Grants for quarry_readonly@172.16.%                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `quarry_readonly`@`172.16.%` IDENTIFIED BY PASSWORD '***' |
| GRANT SELECT, SHOW VIEW ON `s55771\_\_wsstats\_p`.* TO `quarry_readonly`@`172.16.%`                                |
+-----------------------------------------------------------------------------------------------------------------------+

@LucasWerkmeister pointed out in IRC we should announce more widely this feature before enabling access from Quarry to all the _p databases.

To elaborate a little bit: I’m worried that there might be data in public databases that tool maintainers are fine with being accessible to other tool maintainers, but not to the general public; or, perhaps worse, that there might be data that isn’t fine to be public at all, but that tool maintainers didn’t realize was public before. In a way, this feature could increase the “blast radius” of problems like T224725. (Quick summary of that PermanentlyPrivate task: a certain tool’s OAuth credentials were accidentally stored in a _p database, allowing anyone with access to the database to impersonate users of the tool and make edits on their behalf.)

@LucasWerkmeister pointed out in IRC we should announce more widely this feature before enabling access from Quarry to all the _p databases.

To elaborate a little bit: I’m worried that there might be data in public databases that tool maintainers are fine with being accessible to other tool maintainers, but not to the general public; or, perhaps worse, that there might be data that isn’t fine to be public at all, but that tool maintainers didn’t realize was public before.

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Privileges_on_the_database says "Database names ending with _p are granted read access for everyone." without any caveats that further restrict "everyone". We generally assume that anything that all Toolforge maintainers can see is also "public" in that there is no strong guarantee of vetting of Toolforge maintainers. There is even a task floating around somewhere in the backlog to consider removing approval of new accounts entirely when we are satisfied with our ability to clean up after accounts that turn out to be malicious.

I'm not sure how to ask Quarry to connect to ToolsDB instead of the wikireplicas.

I found there is some logic in Quarry to select ToolsDB if the dbname contains __ and ends in p. I tried creating a query entering s55771__wsstats_p as the db name, but it fails with:

Access denied for user 'quarry'@'172.16.2.72' (using password: NO)

This error is not coming from ToolsDB but instead from Quarry's own Trove database. The Trove instance logs show a bunch of these errors, but I'm not sure they are related:

2024-05-21 15:58:27 2781 [Warning] Aborted connection 2781 to db: 'quarry' user: 'quarry' host: 'quarry-123-2-tpbvtagbbp6m-node-1.quarry.eqiad1.wikimedia.clo' (Got an error reading communication packets)

Yes, that issue is T365374. Should be unrelated.

Yes, that issue is T365374. Should be unrelated.

T365374 started to happen yesterday when quarry/pull/40 was deployed, which is suspicious.

@fnegri I am curious about the status of this task, and especially the subsequent step for Superset to be able to access ToolsDB as I have a couple of use cases for dashboards dependent on that.

@KCVelaga_WMF it's not working but I'm struggling to understand why. I had a quick look into the Quarry source code but I plan on investigating more this week. If anyone has any hints on what could be the problem please let me know.

We could also try with Superset in the meantime, maybe that will be easier. I will have a look.

After the fixes in T365374 (thanks to @SD0001 for the fix and thanks to @rook for redeploying Quarry) I can now query ToolsDB successfully!

Screenshot 2024-06-13 at 11.51.50.png (1×1 px, 106 KB)

Access at the moment is limited to the database s55771__wsstats_p. As discussed above, I will send an email to cloud-announce to inform everyone that we're opening this type of access to the ToolsDB databases (only for the databases ending with _p).

I will give people a 2-week notice for this change, and enable access to all _p databases on Monday, July 1st. If you would like to have access to a specific database that you are managing and would like to have it enabled before that date, just leave a comment below.

Technical note: I had to also grant permissions on heartbeat_p.* because Quarry uses it to check for replag and inform the user if the replag is higher than a threshold:

MariaDB [(none)]> GRANT SELECT, SHOW VIEW ON heartbeat_p.* TO `quarry_readonly`@`172.16.%`;

Change #1034042 abandoned by FNegri:

[operations/puppet@production] Add DNS for ToolsDB replica host

Reason:

Not needed (see patch comments)

https://gerrit.wikimedia.org/r/1034042

fnegri renamed this task from Create db user for Quarry with readonly access to public ToolsDB databases to Allow Quarry to query ToolsDB public databases.Thu, Jun 13, 2:47 PM
fnegri updated the task description. (Show Details)