Page MenuHomePhabricator

Clean up wikiadmin GRANTs mess
Closed, ResolvedPublic

Description

A typical db host should have these GRANTs:

Grants for wikiadmin@10.%
GRANT PROCESS, REPLICATION CLIENT ON *.* TO `wikiadmin`@`10.%` IDENTIFIED BY PASSWORD '*redacted'
GRANT SELECT, EXECUTE ON `sys`.* TO `wikiadmin`@`10.%`
GRANT SELECT ON `performance_schema`.* TO `wikiadmin`@`10.%`
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER ON `%wik%`.* TO `wikiadmin`@`10.%`
GRANT SELECT ON `heartbeat`.`heartbeat` TO `wikiadmin`@`10.%`
Grants for wikiadmin@localhost
GRANT PROCESS, REPLICATION CLIENT ON *.* TO `wikiadmin`@`localhost` IDENTIFIED BY PASSWORD '*redacted'
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER ON `%wik%`.* TO `wikiadmin`@`localhost`

I wrote a script to analyze grants and here are the results of analyzing 210 dbs:

General issues

DB unavaiable
No further check was done on this db

  • db2098.codfw.wmnet:3317 (s7)
  • db2098.codfw.wmnet:3318 (s8)

10.% user missing
No further check was done on this grants of 10.% user in these dbs

  • db1133.eqiad.wmnet:3306 (s1)
  • db1128.eqiad.wmnet:3306 (s1)
  • db1125.eqiad.wmnet:3306 (s4)
  • db1124.eqiad.wmnet:3306 (s4)

Expected - Test hosts


localhost user missing
No further check was done on this grants of localhost user in these dbs

  • db2097.codfw.wmnet:3311 (s1)
  • db1163.eqiad.wmnet:3306 (s1)
  • db1140.eqiad.wmnet:3311 (s1)
  • db1139.eqiad.wmnet:3311 (s1)
  • db1105.eqiad.wmnet:3311 (s1)
  • db1102.eqiad.wmnet:3312 (s2)
  • db2101.codfw.wmnet:3315 (s5)
  • db1140.eqiad.wmnet:3316 (s6)

Expected


Localhost

appserver grant missing

  • db1100.eqiad.wmnet:3306 (s5)

Expected


replication grant missing
None!


Extra grant #1
The extra grant:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER ON `%wik%`.* TO `wikiadmin`@`localhost`
  • db1100.eqiad.wmnet:3306 (s5)

Expected


Extra grant #2
The extra grant:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER ON `centralauth`.* TO `wikiadmin`@`localhost`

Expected


10.%

appserver grant missing

  • db1100.eqiad.wmnet:3306 (s5)

Expected

replication grant missing
None!


sys grant missing
None!


heartbeat grant missing
None!


performance_schema grant missing
None!


Extra grant #1
The extra grant:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER ON `%wik%`.* TO `wikiadmin`@`10.%`
  • db1100.eqiad.wmnet:3306 (s5)

Expected


Extra grant #2
The extra grant:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER ON `centralauth`.* TO `wikiadmin`@`10.%`

Expected

Event Timeline

Ladsgroup updated the task description. (Show Details)
jcrespo added a parent task: Restricted Task.Tue, Nov 23, 8:48 AM

clouddb and in general servers that will never be used as applications servers should not have the production admin account- they are of a different realm. The same reason why they don't share root password.

Mentioned in SAL (#wikimedia-operations) [2021-11-23T09:16:44Z] <Amir1> dropping useless GRANTs on s6 eqiad master without replication (T296274)

Mentioned in SAL (#wikimedia-operations) [2021-11-23T09:27:39Z] <Amir1> dropping useless GRANTs on s6 eqiad replicas without replication (T296274)

Ladsgroup updated the task description. (Show Details)

I cleaned up the obviously useless GRANTs (covered by other GRANTs) to reduce the size of the report

clouddb and in general servers that will never be used as applications servers should not have the production admin account- they are of a different realm. The same reason why they don't share root password.

Feel free to mark them as expected in case you are sure the divergence is expected.

Feel free to mark them as expected

It is a bit difficult, as they are spread among the whole list, but I can list them for you- the following hosts:

  • cloud hosts: clouddb1013 clouddb1014 clouddb1015 clouddb1016 clouddb1017 clouddb1018 clouddb1019 clouddb1020 clouddb1021
  • sanitariums: db1154 db1155 db2094 db2095
  • analytics replicas: dbstore1003 dbstore1004 dbstore1005 dbstore1007

The above hosts are in a different realm/network (I guess except sanitariums) and should be, in principle, not accessible from the production network anyway- except for replication.

There are hosts that are not technically part of mediawiki, like the backups or testing hosts, but I think those should have the grants to make sure they can be setup quickly as part of it.

Should be fixed by removing the user completely (both the admin and the webrequests user), not by fixing it, as they should get the data through replication, but wiki should never try to access it (they are not part of mediawiki). They are likely to be have gotten the user through reimports and replication. Basically anything that doesn't have the mediawiki::core puppet role, mariadb::core_multiinstance, mariadb::sanitarium_master, mariadb::dbstore_multiinstance (the backups sources only) mariadb::core_test.

I removed those from the report (the script is quite easy to modify, we can bring it back or run different checks on them)

Marostegui triaged this task as Medium priority.Wed, Nov 24, 6:18 AM
Marostegui moved this task from Triage to In progress on the DBA board.

I think I fixed the most GRANT issues for wikiadmin and core dbs. Some left that needs a bit of discussion

Fixed db1105:3311, it looks like a mistake.

With regard of wikiadmin@localhost I think that needs to be removed. There're no maintenances running locally on each host, and especially not done using the wikiadmin grant.
What I would do would be to remove that grant from s6, monitor it for 2-3 days and then proceed to remove it everywhere.

Lastly, 10.% I don't think it is entirely needed, 10.192.% and 10.64.% should be enough. However, it is easier (and less risky) just to add 10.% to the hosts that don't have it (as they are a few only)

Mentioned in SAL (#wikimedia-operations) [2021-11-26T10:05:42Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance T296274

Mentioned in SAL (#wikimedia-operations) [2021-11-26T10:05:45Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance T296274

Mentioned in SAL (#wikimedia-operations) [2021-11-26T10:05:49Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1177 (T296274)', diff saved to https://phabricator.wikimedia.org/P17877 and previous config saved to /var/cache/conftool/dbconfig/20211126-100547-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2021-11-26T10:14:24Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repool after fixing users T296274', diff saved to https://phabricator.wikimedia.org/P17878 and previous config saved to /var/cache/conftool/dbconfig/20211126-101423-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2021-11-26T10:17:08Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on db1111.eqiad.wmnet with reason: Maintenance T296274

Mentioned in SAL (#wikimedia-operations) [2021-11-26T10:17:11Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1111.eqiad.wmnet with reason: Maintenance T296274

Mentioned in SAL (#wikimedia-operations) [2021-11-26T10:17:14Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1111 (T296274)', diff saved to https://phabricator.wikimedia.org/P17879 and previous config saved to /var/cache/conftool/dbconfig/20211126-101714-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2021-11-26T10:23:40Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repool after fixing users T296274', diff saved to https://phabricator.wikimedia.org/P17880 and previous config saved to /var/cache/conftool/dbconfig/20211126-102340-ladsgroup.json

I ran it again and we are basically done. Except:

  • We didn't check the centralauth uniformity in s7 (whether it's missing from one host or not)
  • Dropping localhost user makes sense, maybe do it as part of this task or in another task.
  • clouddb, dbstore and sanitarium masters: db1154 db1155 db2094 db2095 have been ignored. They are a different beast.
  • Speaking of beasts, wikiuser is also excluded and needs to be done.
  • I didn't check for extra wikiadmin users (e.g. wikiadmin@10.192.%). Let me see how big of a mess it is.
Ladsgroup claimed this task.
Ladsgroup moved this task from In progress to Done on the DBA board.

I ran it again and we are basically done. Except:

  • I didn't check for extra wikiadmin users (e.g. wikiadmin@10.192.%). Let me see how big of a mess it is.

Except for s6 with extra user for wikitech, the rest are clean \o/

This is done, I create a separate ticket for dropping localhost user from everywhere.

db1139 broke replication with:

Last_SQL_Error: Error 'There is no such grant defined for user 'wikiadmin' on host '10.%'' on query. Default database: ''. Query: 'REVOKE SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER ON `centralauth`.* FROM `wikiadmin`@`10.%`'

Was this removal done with replication enabled on the master?

I have fixed the above error and replication is flowing again.