Bring dbstore1008 into service to replace dbstore1003
Closed, Resolved, Public

Authored By: BTullis, Nov 24 2023, 10:22 AM

Description

We need to configure dbstore1008 as a replacement for dbstore1003.

dbstore1003 is a MariaDB server containing replicas of MediaWiki databases for analytics & research usage (mariadb::analytics_replica).

  • DB section s1 (alias: mysql.s1)
  • DB section s5 (alias: mysql.s5)
  • DB section s7 (alias: mysql.s7)

These are unredacted MariaDB replicas supporting analytics use cases: https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB

Once the databases have been instantiated on the new host, we will need to change the DNS CNAME and SRV records.
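
For example, once a section's DNS change has been merged and deployed, the switchover can be confirmed with a lookup like the following (a sketch, run from a production host; the record name is taken from the acceptance criteria below, and the expected answer assumes the CNAME now points at dbstore1008):

dig +short CNAME s1-analytics-replica.eqiad.wmnet
# expected to return dbstore1008.eqiad.wmnet. once the switch is live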

Acceptance criteria

  • These three sections must be served from dbstore1008:
    • s1-analytics-replica.eqiad.wmnet
    • s5-analytics-replica.eqiad.wmnet
    • s7-analytics-replica.eqiad.wmnet
  • The analytics-mysql script has been proven to work with the new server.
  • The wmfdata-python library has been proven to work with the new server (it should, since it calls analytics-mysql).

N.B. dbstore1003 is currently at around 90% capacity on /srv, so we should ensure that this work is done soon and that the new /srv volume on dbstore1008 is appropriately sized.

Event Timeline

There are a very large number of changes, so older changes are hidden.

I'm setting this to high priority because dbstore1003 is currently at 90% of capacity on /srv

btullis@dbstore1003:~$ df -h /srv
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/tank-data  4.4T  3.9T  470G  90% /srv
root@dbstore1003:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  4.0T  385G  92% /srv

I hope this doesn't page during the break

Marostegui raised the priority of this task from High to Unbreak Now! (Dec 27 2023, 1:53 PM)

And as expected...this got filled up.

/dev/mapper/tank-data xfs       4.4T  4.4T   20K 100% /srv

Oh dear, I'm looking at this now. Is there anything that is easy to delete from /srv/ ?

What about these two files in /srv/tmp.s1/? Can they be deleted?

btullis@dbstore1003:~$ ls -l /srv/tmp.s1/
total 386430080
-rw-rw---- 1 mysql mysql 395700494336 Dec 27 15:47 '#sql_75541_0.MAD'
-rw-rw---- 1 mysql mysql         8192 Dec 27 12:05 '#sql_75541_0.MAI'

I stopped the mariadb@s1 service which automatically removed the 400 GB of temporary files in /srv/tmp.s1.

I have now restarted the service and started the slave threads on s1, so it is catching up, but I haven't managed to do anything to fix the underlying issue with space.
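
For the record, the emergency sequence was roughly as follows (a sketch of the commands described above; the temporary files under /srv/tmp.s1 are removed automatically when the instance shuts down):

sudo systemctl stop mariadb@s1       # stopping the instance clears /srv/tmp.s1
df -h /srv                           # confirm the space has been reclaimed
sudo systemctl start mariadb@s1
echo "START SLAVE;" | sudo mysql.s1  # resume the replication threads so that s1 catches up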

Are there any suggested remedial actions which can be taken on the data in /srv? The only plan I had was to bring dbstore1008 into service, but unfortunately that hasn't happened yet.

The replica lag is dropping and the free space is now 8%.

btullis@dbstore1003:/srv$ df -h /srv
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/tank-data  4.4T  4.0T  369G  92% /srv

I will monitor for stability now and do some research into whether I can reclaim some more space by running OPTIMIZE TABLE commands or similar.
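
If I do go down that route, the shape of the command would be something like this (purely a sketch; the table is a hypothetical example, and an OPTIMIZE run rewrites the table, so it needs to be scheduled carefully and itself requires free space):

echo "OPTIMIZE TABLE enwiki.revision;" | sudo mysql.s1   # hypothetical table, for illustration only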

Unfortunately, it looks like some kind of query caused it to spill 400 GB or so to disk, so I don't know whether this will happen again in the next few days.

image.png (1×3 px, 219 KB)

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=dbstore1003&var-datasource=thanos&var-cluster=mysql&viewPanel=12&from=now-24h&to=now

I'll keep an eye on it for now and react if it happens again in the next few days.

BTullis lowered the priority of this task from Unbreak Now! to High (Dec 27 2023, 4:30 PM)

Change 987116 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring dbstore1008 into service

https://gerrit.wikimedia.org/r/987116

Change 987117 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbstore1008: Add sections

https://gerrit.wikimedia.org/r/987117

Mentioned in SAL (#wikimedia-analytics) [2024-01-02T09:24:23Z] <btullis> adding three days' downtime to dbstore1008, prior to switching its role to mariadb::analytics_replica for T351921

Change 987117 abandoned by Marostegui:

[operations/puppet@production] dbstore1008: Add sections

Reason:

https://gerrit.wikimedia.org/r/987117

Change 987116 merged by Btullis:

[operations/puppet@production] Bring dbstore1008 into service

https://gerrit.wikimedia.org/r/987116

I am running the following in a screen session on cumin1001 to recover the latest snapshot of s1 to dbstore1008.

sudo transfer.py --type=decompress dbprov1001.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s1.2024-01-01--00-00-00.tar.gz dbstore1008.eqiad.wmnet:/srv/sqldata.s1

I am creating a backup of the grants from dbstore1003 with the following command:

root@dbstore1003:~# sudo pt-show-grants -S /run/mysqld/mysqld.s1.sock > /root/dbstore1003.s1.grants.sql

After the replica on dbstore1008 is up to date, I will drop the users (excluding mariadb.sys@localhost and root@localhost) on dbstore1008 and then restore this backup. This is similar to the approach taken in: T343109#9275655

The data transfer completed successfully.

2024-01-02 11:52:04  WARNING: Original size is 475259205252 but transferred size is 1334541723635 for copy to dbstore1008.eqiad.wmnet
2024-01-02 11:52:05  INFO: Parallel checksum of source on dbprov1001.eqiad.wmnet and the transmitted ones on dbstore1008.eqiad.wmnet match.
2024-01-02 11:52:06  INFO: 1334541723635 bytes correctly transferred from dbprov1001.eqiad.wmnet to dbstore1008.eqiad.wmnet
2024-01-02 11:52:07  INFO: Cleaning up....

Starting the service:

btullis@dbstore1008:~$ sudo systemctl start mariadb@s1

Obtained the gtid position:

sudo cat /srv/sqldata.s1/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s1

Changed the master details:

CHANGE MASTER TO MASTER_HOST='db1184.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='redacted', MASTER_SSL=1, master_use_gtid = slave_pos;
START SLAVE;

Change 987139 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Upgrade dbstore100[89] to mariadb 10.6 with reimage

https://gerrit.wikimedia.org/r/987139

Change 987139 merged by Btullis:

[operations/puppet@production] Upgrade dbstore100[89] to mariadb 10.6 with reimage

https://gerrit.wikimedia.org/r/987139

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbstore1008.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbstore1008.eqiad.wmnet with OS bookworm completed:

  • dbstore1008 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401021437_btullis_1616611_dbstore1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

I have reimaged dbstore1008 as bookworm, which caused it to pull in version 10.6 of mariadb as well.

I have repeated the transfer and started the replication threads again.

Now starting the transfer of s5 with:

btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1003.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s5.2024-01-01--04-27-04.tar.gz dbstore1008.eqiad.wmnet:/srv/sqldata.s5
2024-01-02 16:16:15  INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s5.2024-01-01--04-27-04.tar.gz from dbprov1003.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/srv/sqldata.s5'] (168201935907 bytes)

The recovery of s5 has completed. I set the replication parameters with:

sudo cat /srv/sqldata.s5/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s5

Followed by:

CHANGE MASTER TO MASTER_HOST='db1183.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='redacted', MASTER_SSL=1, master_use_gtid = slave_pos;
START SLAVE;

Starting the recovery of s7 with:

btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s7.2024-01-01--04-51-34.tar.gz dbstore1008.eqiad.wmnet:/srv/sqldata.s7
2024-01-02 17:13:10  INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s7.2024-01-01--04-51-34.tar.gz from dbprov1002.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/srv/sqldata.s7'] (376242120944 bytes)

Setting recovery coordinates with:

btullis@dbstore1008:/srv/sqldata.s7$ sudo cat /srv/sqldata.s7/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s7

Followed by:

CHANGE MASTER TO MASTER_HOST='db1181.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='redacted', MASTER_SSL=1, master_use_gtid = slave_pos;
START SLAVE;

Change 987419 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable monitoring for dbstore1008

https://gerrit.wikimedia.org/r/987419

Change 987420 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable monitoring on dbstore1003

https://gerrit.wikimedia.org/r/987420

I have created the two remaining grants backups from dbstore1003 with:

root@dbstore1003:~# pt-show-grants -S /run/mysqld/mysqld.s5.sock > /root/dbstore1003.s5.grants.sql
root@dbstore1003:~# pt-show-grants -S /run/mysqld/mysqld.s7.sock > /root/dbstore1003.s7.grants.sql

I will transfer all three grants backup files (including the s1 backup created earlier) to dbstore1008.

I used transfer.py to copy these files from the old to the new host, as shown.

btullis@cumin1002:~$ sudo transfer.py dbstore1003.eqiad.wmnet:/root/dbstore1003.s1.grants.sql dbstore1008.eqiad.wmnet:/root/
2024-01-03 12:24:43  INFO: About to transfer /root/dbstore1003.s1.grants.sql from dbstore1003.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/root/'] (3683 bytes)
2024-01-03 12:27:00  INFO: Parallel checksum of source on dbstore1003.eqiad.wmnet and the transmitted ones on dbstore1008.eqiad.wmnet match.
2024-01-03 12:27:00  INFO: 3683 bytes correctly transferred from dbstore1003.eqiad.wmnet to dbstore1008.eqiad.wmnet
2024-01-03 12:27:01  INFO: Cleaning up....
btullis@cumin1002:~$ sudo transfer.py dbstore1003.eqiad.wmnet:/root/dbstore1003.s5.grants.sql dbstore1008.eqiad.wmnet:/root/
2024-01-03 12:27:42  INFO: About to transfer /root/dbstore1003.s5.grants.sql from dbstore1003.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/root/'] (3683 bytes)
2024-01-03 12:30:00  INFO: Parallel checksum of source on dbstore1003.eqiad.wmnet and the transmitted ones on dbstore1008.eqiad.wmnet match.
2024-01-03 12:30:00  INFO: 3683 bytes correctly transferred from dbstore1003.eqiad.wmnet to dbstore1008.eqiad.wmnet
2024-01-03 12:30:01  INFO: Cleaning up....
btullis@cumin1002:~$ sudo transfer.py dbstore1003.eqiad.wmnet:/root/dbstore1003.s7.grants.sql dbstore1008.eqiad.wmnet:/root/
2024-01-03 12:33:21  INFO: About to transfer /root/dbstore1003.s7.grants.sql from dbstore1003.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/root/'] (3918 bytes)
2024-01-03 12:35:38  INFO: Parallel checksum of source on dbstore1003.eqiad.wmnet and the transmitted ones on dbstore1008.eqiad.wmnet match.
2024-01-03 12:35:38  INFO: 3918 bytes correctly transferred from dbstore1003.eqiad.wmnet to dbstore1008.eqiad.wmnet
2024-01-03 12:35:39  INFO: Cleaning up....
btullis@cumin1002:~$

I have recreated the grants for the three sections.

In each case, I first dropped all grants except those for root@localhost and mariadb.sys@localhost.

pt-show-grants -S /run/mysqld/mysqld.s1.sock --drop --ignore root@localhost,mariadb.sys@localhost | grep "^DROP USER " | mysql.s1

pt-show-grants -S /run/mysqld/mysqld.s5.sock --drop --ignore root@localhost,mariadb.sys@localhost | grep "^DROP USER " | mysql.s5

pt-show-grants -S /run/mysqld/mysqld.s7.sock --drop --ignore root@localhost,mariadb.sys@localhost | grep "^DROP USER " | mysql.s7

I needed to create the research role, since this is not included in the pt-show-grants output that was used for the backup.

root@dbstore1008:s1[(none)]> CREATE ROLE research_role;

root@dbstore1008:s5[(none)]> CREATE ROLE research_role;

root@dbstore1008:s7[(none)]> CREATE ROLE research_role;

I also needed to re-order two lines in each of the dump files, since the file tried to grant rights to a user before creating it. Lines 10 and 11 of each file required swapping, so that they read as follows:

GRANT PROCESS, REPLICATION CLIENT ON *.* TO `nagios`@`localhost` IDENTIFIED VIA unix_socket;
GRANT ALTER, ALTER ROUTINE, CREATE, CREATE ROUTINE, CREATE TEMPORARY TABLES, CREATE VIEW, DELETE, DROP, EVENT, EXECUTE, INDEX, INSERT, LOCK TABLES, REFERENCES, SELECT, SHOW VIEW, TRIGGER, UPDATE ON `heartbeat`.* TO `nagios`@`localhost`;

Now I could import each of the dump files without error:

root@dbstore1008:~# mysql.s1 < dbstore1003.s1.grants.sql 

root@dbstore1008:~# mysql.s5 < dbstore1003.s5.grants.sql 

root@dbstore1008:~# mysql.s7 < dbstore1003.s7.grants.sql

I think that these are now complete, but I will do a bit of double-checking. I know that we have modules/role/templates/mariadb/grants/analytics-replica.sql as a kind of reference, but I'm not going to apply all of those grants until I know whether they are all required. If I can update that file once I know more, I will do so.
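
A rough way to compare the live grants with that reference file might be the following (a sketch; it assumes a checkout of operations/puppet, and the two formats won't match line-for-line, so this is only a starting point for eyeballing differences):

sudo pt-show-grants -S /run/mysqld/mysqld.s1.sock > /tmp/dbstore1008.s1.grants.live.sql
diff -u <(sort modules/role/templates/mariadb/grants/analytics-replica.sql) \
        <(sort /tmp/dbstore1008.s1.grants.live.sql)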

Change 987425 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Switch s7-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987425

Change 987426 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Switch s5-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987426

Change 987427 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Switch s1-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987427

Change 987419 merged by Btullis:

[operations/puppet@production] Enable monitoring for dbstore1008

https://gerrit.wikimedia.org/r/987419

I've just noticed something odd about Icinga, which I should probably look at before going any further. Icinga shows that the host is not a slave, and the Prometheus stats for replication lag aren't being shown in Grafana.

image.png (565×1 px, 191 KB)

image.png (773×944 px, 98 KB)

However, if I check the cli then replication looks OK.

btullis@dbstore1008:~$ sudo mysql.s1
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 681971
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@dbstore1008:s1[(none)]> show slave status\G;
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db1184.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: db1184-bin.006072
           Read_Master_Log_Pos: 153403990
                Relay_Log_File: dbstore1008-relay-bin.000046
                 Relay_Log_Pos: 153404290
         Relay_Master_Log_File: db1184-bin.006072
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 153403990
               Relay_Log_Space: 153404653
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 171978826
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 180355171-180355171-148310907,180359172-180359172-49702203,171970637-171970637-2116621969,171978826-171978826-333228372,180363268-180363268-2735766352,171974720-171974720-2572451842,171970745-171970745-3651346146,171978774-171978774-5,180355190-180355190-853292444,171970572-171970572-3935877275,171970661-171970661-3655324752,0-171970637-5484646134
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 41533014
1 row in set (0.001 sec)

ERROR: No query specified

root@dbstore1008:s1[(none)]>

prometheus exporters are failing with access denied, so I guess some users were not correctly migrated:

root@dbstore1008:~# systemctl status prometheus-mysqld-exporter@s7.service
● prometheus-mysqld-exporter@s7.service - Prometheus exporter for MySQL (instance s7)
     Loaded: loaded (/lib/systemd/system/prometheus-mysqld-exporter@.service; disabled; preset: enabled)
     Active: active (running) since Tue 2024-01-02 19:04:37 UTC; 5 days ago
       Docs: https://prometheus.io/docs/introduction/overview/
   Main PID: 159300 (prometheus-mysq)
      Tasks: 34 (limit: 618455)
     Memory: 29.2M
        CPU: 26min 48.610s
     CGroup: /system.slice/system-prometheus\x2dmysqld\x2dexporter.slice/prometheus-mysqld-exporter@s7.service
             └─159300 /usr/bin/prometheus-mysqld-exporter --web.listen-address :13317 --config.my-cnf /var/lib/prometheus/.my.s7.cnf --collect.global_status --collect.global_variables --collect.info_schema.processlist --collect.slave_status --no-collect.info_schema.tables

Jan 08 06:32:57 dbstore1008 prometheus-mysqld-exporter[159300]: ts=2024-01-08T06:32:57.018Z caller=exporter.go:174 level=error msg="Error from scraper" scraper=slave_status err="Error 1227: Access denied; you need (at least one of) the SUPER, SLAVE MONITOR privilege(s) f>
Jan 08 06:33:18 dbstore1008 prometheus-mysqld-exporter[159300]: ts=2024-01-08T06:33:18.649Z caller=exporter.go:174 level=error msg="Error from scraper" scraper=slave_status err="Error 1227: Access denied; you need (at least one of) the SUPER, SLAVE MONITOR privilege(s) f>

I think I have worked out what happened here. It is due to changes in the grants between MariaDB 10.4 and 10.6.

When I took the backup of the grants from dbstore1003 using pt-show-grants, it contained the following entry for prometheus@localhost:

-- Grants for 'prometheus'@'localhost'
GRANT PROCESS, REPLICATION CLIENT ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5;
GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`;

After I restored the grants from this backup file on dbstore1008, this had been automatically changed to:

GRANT PROCESS, BINLOG MONITOR ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5 |
GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`

The REPLICATION CLIENT privilege was replaced with its synonym, BINLOG MONITOR.

However, as per the docs here...

Starting in ES10.5 and CS10.5 BINLOG MONITOR is a synonym for REPLICATION CLIENT but the capabilities have changed.

  • If a user upgrades from ES10.4 or earlier to ES10.5.8-5 or later, any users with REPLICATION CLIENT and REPLICATION SLAVE privileges will automatically be granted the REPLICA MONITOR privilege. This privilege upgrade happens upon server startup, so mysql_upgrade is not required.

So I believe that by restoring the backup of privileges taken from a 10.4 server to a 10.6 server, I have inadvertently skipped this migration from REPLICATION CLIENT to REPLICA MONITOR and ended up with a reduced set of privileges as a result. This is also probably what happened with the nagios@localhost user.

I believe that I can fix this with the following grant:

root@dbstore1008:s7[(none)]> GRANT PROCESS, REPLICA MONITOR ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5;
Query OK, 0 rows affected (0.001 sec)

root@dbstore1008:s7[(none)]> flush privileges;
Query OK, 0 rows affected (0.002 sec)

I can then restart the exporter with:

btullis@dbstore1008:~$ sudo systemctl restart prometheus-mysqld-exporter@s7.service

It starts cleanly and we can see that the replica lag metric is now available for this section.

image.png (507×985 px, 52 KB)

I've added the REPLICA MONITOR permission to the nagios and prometheus users for all sections on dbstore1008.
I've subsequently re-enabled and restarted all of the prometheus exporters and they start cleanly.
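
For reference, the shape of the fix across the three sections was roughly the following (a sketch; the statements I actually ran also preserved each user's existing options, so treat this as illustrative rather than a verbatim record):

for s in s1 s5 s7; do
  echo "GRANT REPLICA MONITOR ON *.* TO 'prometheus'@'localhost'; GRANT REPLICA MONITOR ON *.* TO 'nagios'@'localhost'; FLUSH PRIVILEGES;" | sudo mysql.$s
  sudo systemctl restart prometheus-mysqld-exporter@$s.service
done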

Therefore, I think that dbstore1008 is ready to receive traffic. The only reason that I'm hesitant to do so now is whether we need to do T354411: Revert dbstore migration from puppet7 to puppet5 first, in order to allow for any pending schema updates to be applied from cumin1001.

Having discussed this further, I don't think that we need to consider this as a blocker now. It looks as though we should be able to fix forward and get db-mysql and orchestrator working with the new dbstore servers, so I think we can start migrating sections to the new hosts. I'll start with s7-analytics-replica with https://gerrit.wikimedia.org/r/c/operations/dns/+/987425

Change 987425 merged by Btullis:

[operations/dns@master] Switch s7-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987425

Mentioned in SAL (#wikimedia-analytics) [2024-01-08T15:56:53Z] <btullis> migrating s7-analytics-replica to dbstore1008 for T351921

That didn't work as hoped.

btullis@stat1004:~$ analytics-mysql ukwiki
ERROR 1698 (28000): Access denied for user 'research'@'2620:0:861:104:10:64:5:104'

I will revert.

dbstore hosts should not have IPv6 AAAA records, otherwise you'd be running into https://phabricator.wikimedia.org/T270101.

root@cumin1001:~# host dbstore1003
dbstore1003.eqiad.wmnet has address 10.64.0.137
root@cumin1001:~# host dbstore1008
dbstore1008.eqiad.wmnet has address 10.64.131.23
dbstore1008.eqiad.wmnet has IPv6 address 2620:0:861:10a:10:64:131:23

Thanks @Marostegui - yes, I see that now.

I'll check with Infrastructure-Foundations whether T270101#6688993 is still the right procedure for removing these AAAA records, and whether I'll need to reimage or whether I can just get away with editing /etc/network/interfaces.

That is indeed the right procedure, but double check with them to be sure. There's absolutely no need to reimage and/or edit the interfaces file. Editing Netbox would do it.
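
Once the AAAA record has been removed in Netbox and the change has propagated to the DNS servers, the result can be verified with the same kind of lookup as above (a sketch):

host dbstore1008.eqiad.wmnet                # should now list only the IPv4 A record
dig +short AAAA dbstore1008.eqiad.wmnet     # should return nothing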

Ugh, still not working.

btullis@stat1004:~$ analytics-mysql ukwiki
ERROR 1044 (42000): Access denied for user 'research'@'10.%' to database 'ukwiki'

The output from show grants for 'research'@'10.%'; appears to be identical on dbstore1008:s7 and dbstore1003:s7 though. I'll look for a few more minutes, but if I can't find it I'll revert again and do some more checking.

Hmm. research_role has a mysql_native_password on dbstore1003, whereas it has a unix_socket on dbstore1008. I wonder if that is the password being picked up by the analytics-mysql script.
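
A quick way to compare the two would be to look at the authentication plugin recorded for the role on each host, something like this (a sketch; it assumes the plugin and is_role columns exposed by MariaDB's mysql.user view):

echo "SELECT User, Host, plugin, is_role FROM mysql.user WHERE User LIKE 'research%';" | sudo mysql.s7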

The grants are not the same; the role doesn't have any grants:

mysql:root@localhost [ukwiki]> show grants for research_role;
+---------------------------------------+
| Grants for research_role              |
+---------------------------------------+
| GRANT USAGE ON *.* TO `research_role` |
+---------------------------------------+
1 row in set (0.000 sec)

Fixed:

btullis@stat1004:/root$ analytics-mysql ukwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1530950
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1008.eqiad.wmnet [ukwiki]>

Please note I have only fixed it for s7. Please fix it for the rest of instances

Great! Thanks for your help. So was it just these five additional grants from here that you added by hand? https://github.com/wikimedia/operations-puppet/blob/production/modules/role/templates/mariadb/grants/analytics-replica.sql#L7-L11

It's frustrating that these grants weren't included in the backup file generated by pt-show-grants.

pt-show-grants doesn't show role grants, as far as I remember. This is essentially what you need:

root@dbstore1003.eqiad.wmnet[(none)]>  show grants for research_role;
+-----------------------------------------------------------------------------------------------------------------------------+
| Grants for research_role                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `research_role`                                                                                       |
| GRANT SELECT ON `centralauth`.* TO `research_role`                                                                          |
| GRANT SELECT ON `wikishared`.* TO `research_role`                                                                           |
| GRANT SELECT ON `flowdb`.* TO `research_role`                                                                               |
| GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER, CREATE TEMPORARY TABLES ON `staging`.* TO `research_role` |
| GRANT SELECT ON `%wik%`.* TO `research_role`                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------+
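
A minimal way to apply the same grants on the remaining sections would be to replay those statements against each instance, roughly like this (a sketch for one section; it assumes research_role has already been created, as it was earlier in this task, and that the restored grants already attach the role to the research user):

sudo mysql.s1 <<'EOF'
GRANT SELECT ON `centralauth`.* TO research_role;
GRANT SELECT ON `wikishared`.* TO research_role;
GRANT SELECT ON `flowdb`.* TO research_role;
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER, CREATE TEMPORARY TABLES ON `staging`.* TO research_role;
GRANT SELECT ON `%wik%`.* TO research_role;
EOF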

Change 987426 merged by Btullis:

[operations/dns@master] Switch s5-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987426

Mentioned in SAL (#wikimedia-analytics) [2024-01-08T17:19:34Z] <btullis> migrated s5-analytics-replica to dbstore1008 for T351921

pt-show-grants doesn't show roles grants as far as I remember. This is essentially what you need:

Got it! Many thanks again. I've done this for the other sections on dbstore1008 and I'll make a note that it needs to be done manually on dbstore1009.

Change 987427 merged by Btullis:

[operations/dns@master] Switch s1-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987427

Mentioned in SAL (#wikimedia-analytics) [2024-01-08T17:22:52Z] <btullis> migrated s1-analytics-replica to dbstore1008 for T351921

I just double checked, and these role grants are also present in the grants file: https://github.com/wikimedia/operations-puppet/blob/production/modules/role/templates/mariadb/grants/analytics-replica.sql#L5

s5 and s1 sections are now also being served from dbstore1008.

btullis@stat1004:~$ analytics-mysql avkwiki 
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 971185
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1008.eqiad.wmnet [avkwiki]> Bye
btullis@stat1004:~$ analytics-mysql enwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2312309
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1008.eqiad.wmnet [enwiki]>

I've checked that the wmfdata-python library is working for wikis on s1, s5, and s7 as well.
e.g.

image.png (649×1 px, 108 KB)

Change 987420 merged by Btullis:

[operations/puppet@production] Disable monitoring on dbstore1003

https://gerrit.wikimedia.org/r/987420

This host is not yet visible in orchestrator due to the puppet7 migration (T352974); can we please revert it on the dbstore hosts (T354411)?

OK, I can do that, but I thought we were still trying to fix forward with orchestrator at the moment, given the comments from @ABran-WMF and @MoritzMuehlenhoff here: T352974#9445831