Bring dbstore1008 into service to replace dbstore1003
Closed, Resolved, Public

Authored By: BTullis, Nov 24 2023, 10:22 AM

Description

We need to configure dbstore1008 as a replacement for dbstore1003.

dbstore1003 is a MariaDB server containing replicas of MediaWiki databases for analytics & research usage (mariadb::analytics_replica).

  • DB section s1 (alias: mysql.s1)
  • DB section s5 (alias: mysql.s5)
  • DB section s7 (alias: mysql.s7)

These are unredacted MariaDB replicas supporting analytics use cases: https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB

Once the databases have been instantiated on the new host, we will need to change the DNS CNAME and SRV records.
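
For example, once a section's DNS change has been merged and deployed, the switchover can be confirmed with a lookup like the following (a sketch, run from a production host; the record name is taken from the acceptance criteria below, and the expected answer assumes the CNAME now points at dbstore1008):

dig +short CNAME s1-analytics-replica.eqiad.wmnet
# expected to return dbstore1008.eqiad.wmnet. once the switch is live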

Acceptance criteria

  • These three sections must be served from dbstore1008:
    • s1-analytics-replica.eqiad.wmnet
    • s5-analytics-replica.eqiad.wmnet
    • s7-analytics-replica.eqiad.wmnet
  • The analytics-mysql script has been proven to work with the new server.
  • The wmfdata-python library has been proven to work with the new server (it should, since it calls analytics-mysql).

N.B. dbstore1003 is currently at around 90% capacity on /srv, so we should ensure that this work is done soon and that the new /srv volume on dbstore1008 is appropriately sized.

Event Timeline

There are a very large number of changes, so older changes are hidden.

I'm setting this to high priority because dbstore1003 is currently at 90% of capacity on /srv

btullis@dbstore1003:~$ df -h /srv
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/tank-data  4.4T  3.9T  470G  90% /srv
root@dbstore1003:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  4.0T  385G  92% /srv

I hope this doesn't page during the break

Marostegui raised the priority of this task from High to Unbreak Now! (Dec 27 2023, 1:53 PM)

And as expected...this got filled up.

/dev/mapper/tank-data xfs       4.4T  4.4T   20K 100% /srv

Oh dear, I'm looking at this now. Is there anything that is easy to delete from /srv/ ?

What about these two files in /srv/tmp.s1/? Can they be deleted?

btullis@dbstore1003:~$ ls -l /srv/tmp.s1/
total 386430080
-rw-rw---- 1 mysql mysql 395700494336 Dec 27 15:47 '#sql_75541_0.MAD'
-rw-rw---- 1 mysql mysql         8192 Dec 27 12:05 '#sql_75541_0.MAI'

I stopped the mariadb@s1 service which automatically removed the 400 GB of temporary files in /srv/tmp.s1.

I have now restarted the service and started the slave threads on s1, so it is catching up, but I haven't managed to do anything to fix the underlying issue with space.
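
For the record, the emergency sequence was roughly as follows (a sketch of the commands described above; the temporary files under /srv/tmp.s1 are removed automatically when the instance shuts down):

sudo systemctl stop mariadb@s1       # stopping the instance clears /srv/tmp.s1
df -h /srv                           # confirm the space has been reclaimed
sudo systemctl start mariadb@s1
echo "START SLAVE;" | sudo mysql.s1  # resume the replication threads so that s1 catches up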

Are there any suggested remedial actions which can be taken on the data in /srv? The only plan I had was to bring dbstore1008 into service, but unfortunately that hasn't happened yet.

The replica lag is dropping and the free space is now 8%.

btullis@dbstore1003:/srv$ df -h /srv
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/tank-data  4.4T  4.0T  369G  92% /srv

I will monitor for stability now and do some research into whether I can reclaim some more space by running OPTIMIZE TABLE commands or similar.
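
If I do go down that route, the shape of the command would be something like this (purely a sketch; the table is a hypothetical example, and an OPTIMIZE run rewrites the table, so it needs to be scheduled carefully and itself requires free space):

echo "OPTIMIZE TABLE enwiki.revision;" | sudo mysql.s1   # hypothetical table, for illustration only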

Unfortunately, it looks like some kind of query caused it to spill 400 GB or so to disk, so I don't know whether this will happen again in the next few days.

image.png (1×3 px, 219 KB)

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=dbstore1003&var-datasource=thanos&var-cluster=mysql&viewPanel=12&from=now-24h&to=now

I'll keep an eye on it for now and react if it happens again in the next few days.

BTullis lowered the priority of this task from Unbreak Now! to High (Dec 27 2023, 4:30 PM)

Change 987116 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring dbstore1008 into service

https://gerrit.wikimedia.org/r/987116

Change 987117 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbstore1008: Add sections

https://gerrit.wikimedia.org/r/987117

Mentioned in SAL (#wikimedia-analytics) [2024-01-02T09:24:23Z] <btullis> adding three days' downtime to dbstore1008, prior to switching its role to mariadb::analytics_replica for T351921

Change 987117 abandoned by Marostegui:

[operations/puppet@production] dbstore1008: Add sections

Reason:

https://gerrit.wikimedia.org/r/987117

Change 987116 merged by Btullis:

[operations/puppet@production] Bring dbstore1008 into service

https://gerrit.wikimedia.org/r/987116

I am running the following in a screen session on cumin1001 to recover the latest snapshot of s1 to dbstore1008.

sudo transfer.py --type=decompress dbprov1001.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s1.2024-01-01--00-00-00.tar.gz dbstore1008.eqiad.wmnet:/srv/sqldata.s1

I am creating a backup of the grants from dbstore1003 with the following command:

root@dbstore1003:~# sudo pt-show-grants -S /run/mysqld/mysqld.s1.sock > /root/dbstore1003.s1.grants.sql

After the replica on dbstore1008 is up to date, I will drop the users (excluding mariadb.sys@localhost and root@localhost) on dbstore1008 and then restore this backup. This is similar to the approach taken in: T343109#9275655

The data transfer completed successfully.

2024-01-02 11:52:04  WARNING: Original size is 475259205252 but transferred size is 1334541723635 for copy to dbstore1008.eqiad.wmnet
2024-01-02 11:52:05  INFO: Parallel checksum of source on dbprov1001.eqiad.wmnet and the transmitted ones on dbstore1008.eqiad.wmnet match.
2024-01-02 11:52:06  INFO: 1334541723635 bytes correctly transferred from dbprov1001.eqiad.wmnet to dbstore1008.eqiad.wmnet
2024-01-02 11:52:07  INFO: Cleaning up....

Starting the service:

btullis@dbstore1008:~$ sudo systemctl start mariadb@s1

Obtained the gtid position:

sudo cat /srv/sqldata.s1/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s1

Changed the master details:

CHANGE MASTER TO MASTER_HOST='db1184.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='redacted', MASTER_SSL=1, master_use_gtid = slave_pos;
START SLAVE;

Change 987139 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Upgrade dbstore100[89] to mariadb 10.6 with reimage

https://gerrit.wikimedia.org/r/987139

Change 987139 merged by Btullis:

[operations/puppet@production] Upgrade dbstore100[89] to mariadb 10.6 with reimage

https://gerrit.wikimedia.org/r/987139

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbstore1008.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbstore1008.eqiad.wmnet with OS bookworm completed:

  • dbstore1008 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401021437_btullis_1616611_dbstore1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

I have reimaged dbstore1008 as bookworm, which caused it to pull in version 10.6 of mariadb as well.

I have repeated the transfer and started the replication threads again.

Now starting the transfer of s5 with:

btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1003.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s5.2024-01-01--04-27-04.tar.gz dbstore1008.eqiad.wmnet:/srv/sqldata.s5
2024-01-02 16:16:15  INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s5.2024-01-01--04-27-04.tar.gz from dbprov1003.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/srv/sqldata.s5'] (168201935907 bytes)

The recovery of s5 has completed. I set the replication parameters with:

sudo cat /srv/sqldata.s5/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s5

Followed by:

CHANGE MASTER TO MASTER_HOST='db1183.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='redacted', MASTER_SSL=1, master_use_gtid = slave_pos;
START SLAVE;

Starting the recovery of s7 with:

btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s7.2024-01-01--04-51-34.tar.gz dbstore1008.eqiad.wmnet:/srv/sqldata.s7
2024-01-02 17:13:10  INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s7.2024-01-01--04-51-34.tar.gz from dbprov1002.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/srv/sqldata.s7'] (376242120944 bytes)

Setting recovery coordinates with:

btullis@dbstore1008:/srv/sqldata.s7$ sudo cat /srv/sqldata.s7/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s7

Followed by:

CHANGE MASTER TO MASTER_HOST='db1181.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='redacted', MASTER_SSL=1, master_use_gtid = slave_pos;
START SLAVE;

Change 987419 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable monitoring for dbstore1008

https://gerrit.wikimedia.org/r/987419

Change 987420 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable monitoring on dbstore1003

https://gerrit.wikimedia.org/r/987420

I have created the two remaining grants backups from dbstore1003 with:

root@dbstore1003:~# pt-show-grants -S /run/mysqld/mysqld.s5.sock > /root/dbstore1003.s5.grants.sql
root@dbstore1003:~# pt-show-grants -S /run/mysqld/mysqld.s7.sock > /root/dbstore1003.s7.grants.sql

I will transfer all three grants backup files (including the s1 backup created earlier) to dbstore1008.

I used transfer.py to copy these files from the old to the new host, as shown.

btullis@cumin1002:~$ sudo transfer.py dbstore1003.eqiad.wmnet:/root/dbstore1003.s1.grants.sql dbstore1008.eqiad.wmnet:/root/
2024-01-03 12:24:43  INFO: About to transfer /root/dbstore1003.s1.grants.sql from dbstore1003.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/root/'] (3683 bytes)
2024-01-03 12:27:00  INFO: Parallel checksum of source on dbstore1003.eqiad.wmnet and the transmitted ones on dbstore1008.eqiad.wmnet match.
2024-01-03 12:27:00  INFO: 3683 bytes correctly transferred from dbstore1003.eqiad.wmnet to dbstore1008.eqiad.wmnet
2024-01-03 12:27:01  INFO: Cleaning up....
btullis@cumin1002:~$ sudo transfer.py dbstore1003.eqiad.wmnet:/root/dbstore1003.s5.grants.sql dbstore1008.eqiad.wmnet:/root/
2024-01-03 12:27:42  INFO: About to transfer /root/dbstore1003.s5.grants.sql from dbstore1003.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/root/'] (3683 bytes)
2024-01-03 12:30:00  INFO: Parallel checksum of source on dbstore1003.eqiad.wmnet and the transmitted ones on dbstore1008.eqiad.wmnet match.
2024-01-03 12:30:00  INFO: 3683 bytes correctly transferred from dbstore1003.eqiad.wmnet to dbstore1008.eqiad.wmnet
2024-01-03 12:30:01  INFO: Cleaning up....
btullis@cumin1002:~$ sudo transfer.py dbstore1003.eqiad.wmnet:/root/dbstore1003.s7.grants.sql dbstore1008.eqiad.wmnet:/root/
2024-01-03 12:33:21  INFO: About to transfer /root/dbstore1003.s7.grants.sql from dbstore1003.eqiad.wmnet to ['dbstore1008.eqiad.wmnet']:['/root/'] (3918 bytes)
2024-01-03 12:35:38  INFO: Parallel checksum of source on dbstore1003.eqiad.wmnet and the transmitted ones on dbstore1008.eqiad.wmnet match.
2024-01-03 12:35:38  INFO: 3918 bytes correctly transferred from dbstore1003.eqiad.wmnet to dbstore1008.eqiad.wmnet
2024-01-03 12:35:39  INFO: Cleaning up....
btullis@cumin1002:~$

I have recreated the grants for the three sections.

In each case, I first dropped all grants except those for root@localhost and mariadb.sys@localhost.

pt-show-grants -S /run/mysqld/mysqld.s1.sock --drop --ignore root@localhost,mariadb.sys@localhost | grep "^DROP USER " | mysql.s1

pt-show-grants -S /run/mysqld/mysqld.s5.sock --drop --ignore root@localhost,mariadb.sys@localhost | grep "^DROP USER " | mysql.s5

pt-show-grants -S /run/mysqld/mysqld.s7.sock --drop --ignore root@localhost,mariadb.sys@localhost | grep "^DROP USER " | mysql.s7

I needed to create the research role, since this is not included in the pt-show-grants output that was used for the backup.

root@dbstore1008:s1[(none)]> CREATE ROLE research_role;

root@dbstore1008:s5[(none)]> CREATE ROLE research_role;

root@dbstore1008:s7[(none)]> CREATE ROLE research_role;

I also needed to re-order two lines in each of the dump files, since the file tried to grant rights to a user before creating it. Lines 10 and 11 of each file required swapping, so that they read as follows:

GRANT PROCESS, REPLICATION CLIENT ON *.* TO `nagios`@`localhost` IDENTIFIED VIA unix_socket;
GRANT ALTER, ALTER ROUTINE, CREATE, CREATE ROUTINE, CREATE TEMPORARY TABLES, CREATE VIEW, DELETE, DROP, EVENT, EXECUTE, INDEX, INSERT, LOCK TABLES, REFERENCES, SELECT, SHOW VIEW, TRIGGER, UPDATE ON `heartbeat`.* TO `nagios`@`localhost`;

Now I could import each of the dump files without error:

root@dbstore1008:~# mysql.s1 < dbstore1003.s1.grants.sql 

root@dbstore1008:~# mysql.s5 < dbstore1003.s5.grants.sql 

root@dbstore1008:~# mysql.s7 < dbstore1003.s7.grants.sql

I think that these are now complete, but I will do a bit of double-checking. I know that we have modules/role/templates/mariadb/grants/analytics-replica.sql as a kind of reference, but I'm not going to apply all of those grants until I know whether they are all required. If I can update that file once I know more, I will do so.
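
A rough way to compare the live grants with that reference file might be the following (a sketch; it assumes a checkout of operations/puppet, and the two formats won't match line-for-line, so this is only a starting point for eyeballing differences):

sudo pt-show-grants -S /run/mysqld/mysqld.s1.sock > /tmp/dbstore1008.s1.grants.live.sql
diff -u <(sort modules/role/templates/mariadb/grants/analytics-replica.sql) \
        <(sort /tmp/dbstore1008.s1.grants.live.sql)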

Change 987425 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Switch s7-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987425

Change 987426 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Switch s5-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987426

Change 987427 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Switch s1-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987427

Change 987419 merged by Btullis:

[operations/puppet@production] Enable monitoring for dbstore1008

https://gerrit.wikimedia.org/r/987419

I've just noticed something odd about Icinga, which I should probably look at before going any further. Icinga shows that the host is not a slave, and the Prometheus stats for replication lag aren't being shown in Grafana.

image.png (565×1 px, 191 KB)

image.png (773×944 px, 98 KB)

However, if I check the cli then replication looks OK.

btullis@dbstore1008:~$ sudo mysql.s1
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 681971
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@dbstore1008:s1[(none)]> show slave status\G;
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db1184.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: db1184-bin.006072
           Read_Master_Log_Pos: 153403990
                Relay_Log_File: dbstore1008-relay-bin.000046
                 Relay_Log_Pos: 153404290
         Relay_Master_Log_File: db1184-bin.006072
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 153403990
               Relay_Log_Space: 153404653
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 171978826
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 180355171-180355171-148310907,180359172-180359172-49702203,171970637-171970637-2116621969,171978826-171978826-333228372,180363268-180363268-2735766352,171974720-171974720-2572451842,171970745-171970745-3651346146,171978774-171978774-5,180355190-180355190-853292444,171970572-171970572-3935877275,171970661-171970661-3655324752,0-171970637-5484646134
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 41533014
1 row in set (0.001 sec)

ERROR: No query specified

root@dbstore1008:s1[(none)]>

prometheus exporters are failing with access denied, so I guess some users were not correctly migrated:

root@dbstore1008:~# systemctl status prometheus-mysqld-exporter@s7.service
● prometheus-mysqld-exporter@s7.service - Prometheus exporter for MySQL (instance s7)
     Loaded: loaded (/lib/systemd/system/prometheus-mysqld-exporter@.service; disabled; preset: enabled)
     Active: active (running) since Tue 2024-01-02 19:04:37 UTC; 5 days ago
       Docs: https://prometheus.io/docs/introduction/overview/
   Main PID: 159300 (prometheus-mysq)
      Tasks: 34 (limit: 618455)
     Memory: 29.2M
        CPU: 26min 48.610s
     CGroup: /system.slice/system-prometheus\x2dmysqld\x2dexporter.slice/prometheus-mysqld-exporter@s7.service
             └─159300 /usr/bin/prometheus-mysqld-exporter --web.listen-address :13317 --config.my-cnf /var/lib/prometheus/.my.s7.cnf --collect.global_status --collect.global_variables --collect.info_schema.processlist --collect.slave_status --no-collect.info_schema.tables

Jan 08 06:32:57 dbstore1008 prometheus-mysqld-exporter[159300]: ts=2024-01-08T06:32:57.018Z caller=exporter.go:174 level=error msg="Error from scraper" scraper=slave_status err="Error 1227: Access denied; you need (at least one of) the SUPER, SLAVE MONITOR privilege(s) f>
Jan 08 06:33:18 dbstore1008 prometheus-mysqld-exporter[159300]: ts=2024-01-08T06:33:18.649Z caller=exporter.go:174 level=error msg="Error from scraper" scraper=slave_status err="Error 1227: Access denied; you need (at least one of) the SUPER, SLAVE MONITOR privilege(s) f>

I think I have worked out what happened here. It is due to changes in the grants between MariaDB 10.4 and 10.6.

When I took the backup of the grants from dbstore1003 using pt-show-grants, it contained the following entry for prometheus@localhost:

-- Grants for 'prometheus'@'localhost'
GRANT PROCESS, REPLICATION CLIENT ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5;
GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`;

After I restored the grants from this backup file on dbstore1008, this had been automatically changed to:

GRANT PROCESS, BINLOG MONITOR ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5 |
GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`

The REPLICATION CLIENT privilege was replaced with its synonym, BINLOG MONITOR.

However, as per the docs here...

Starting in ES10.5 and CS10.5 BINLOG MONITOR is a synonym for REPLICATION CLIENT but the capabilities have changed.

  • If a user upgrades from ES10.4 or earlier to ES10.5.8-5 or later, any users with REPLICATION CLIENT and REPLICATION SLAVE privileges will automatically be granted the REPLICA MONITOR privilege. This privilege upgrade happens upon server startup, so mysql_upgrade is not required.

So I believe that by restoring the backup of privileges taken from a 10.4 server to a 10.6 server, I have inadvertently skipped this migration from REPLICATION CLIENT to REPLICA MONITOR and ended up with a reduced set of privileges as a result. This is also probably what happened with the nagios@localhost user.

I believe that I can fix this with the following grant:

root@dbstore1008:s7[(none)]> GRANT PROCESS, REPLICA MONITOR ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5;
Query OK, 0 rows affected (0.001 sec)

root@dbstore1008:s7[(none)]> flush privileges;
Query OK, 0 rows affected (0.002 sec)

I can then restart the exporter with:

btullis@dbstore1008:~$ sudo systemctl restart prometheus-mysqld-exporter@s7.service

It starts cleanly and we can see that the replica lag metric is now available for this section.

image.png (507×985 px, 52 KB)

I've added the REPLICA MONITOR permission to the nagios and prometheus users for all sections on dbstore1008.
I've subsequently re-enabled and restarted all of the prometheus exporters and they start cleanly.
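
For reference, the shape of the fix across the three sections was roughly the following (a sketch; the statements I actually ran also preserved each user's existing options, so treat this as illustrative rather than a verbatim record):

for s in s1 s5 s7; do
  echo "GRANT REPLICA MONITOR ON *.* TO 'prometheus'@'localhost'; GRANT REPLICA MONITOR ON *.* TO 'nagios'@'localhost'; FLUSH PRIVILEGES;" | sudo mysql.$s
  sudo systemctl restart prometheus-mysqld-exporter@$s.service
done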

Therefore, I think that dbstore1008 is ready to receive traffic. The only reason that I'm hesitant to do so now is whether we need to do T354411: Revert dbstore migration from puppet7 to puppet5 first, in order to allow for any pending schema updates to be applied from cumin1001.

Having discussed this further, I don't think that we need to consider this as a blocker now. It looks as though we should be able to fix forward and get db-mysql and orchestrator working with the new dbstore servers, so I think we can start migrating sections to the new hosts. I'll start with s7-analytics-replica with https://gerrit.wikimedia.org/r/c/operations/dns/+/987425

Change 987425 merged by Btullis:

[operations/dns@master] Switch s7-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987425

Mentioned in SAL (#wikimedia-analytics) [2024-01-08T15:56:53Z] <btullis> migrating s7-analytics-replica to dbstore1008 for T351921

That didn't work as hoped.

btullis@stat1004:~$ analytics-mysql ukwiki
ERROR 1698 (28000): Access denied for user 'research'@'2620:0:861:104:10:64:5:104'

I will revert.

dbstore hosts should not have IPv6 AAAA records, otherwise you'd be running into https://phabricator.wikimedia.org/T270101.

root@cumin1001:~# host dbstore1003
dbstore1003.eqiad.wmnet has address 10.64.0.137
root@cumin1001:~# host dbstore1008
dbstore1008.eqiad.wmnet has address 10.64.131.23
dbstore1008.eqiad.wmnet has IPv6 address 2620:0:861:10a:10:64:131:23

Thanks @Marostegui - yes, I see that now.

I'll check with Infrastructure-Foundations whether T270101#6688993 is still the right procedure for removing these AAAA records, and whether I'll need to reimage or whether I can just get away with editing /etc/network/interfaces.

That is indeed the right procedure, but double check with them to be sure. There's absolutely no need to reimage and/or edit the interfaces file. Editing Netbox would do it.
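
Once the AAAA record has been removed in Netbox and the change has propagated to the DNS servers, the result can be verified with the same kind of lookup as above (a sketch):

host dbstore1008.eqiad.wmnet                # should now list only the IPv4 A record
dig +short AAAA dbstore1008.eqiad.wmnet     # should return nothing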

Ugh, still not working.

btullis@stat1004:~$ analytics-mysql ukwiki
ERROR 1044 (42000): Access denied for user 'research'@'10.%' to database 'ukwiki'

The output from show grants for 'research'@'10.%'; appears to be identical on dbstore1008:s7 and dbstore1003:s7 though. I'll look for a few more minutes, but if I can't find it I'll revert again and do some more checking.

Hmm. research_role has a mysql_native_password on dbstore1003, whereas it has a unix_socket on dbstore1008. I wonder if that is the password being picked up by the analytics-mysql script.
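
A quick way to compare the two would be to look at the authentication plugin recorded for the role on each host, something like this (a sketch; it assumes the plugin and is_role columns exposed by MariaDB's mysql.user view):

echo "SELECT User, Host, plugin, is_role FROM mysql.user WHERE User LIKE 'research%';" | sudo mysql.s7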

The grants are not the same; the role doesn't have any grants:

mysql:root@localhost [ukwiki]> show grants for research_role;
+---------------------------------------+
| Grants for research_role              |
+---------------------------------------+
| GRANT USAGE ON *.* TO `research_role` |
+---------------------------------------+
1 row in set (0.000 sec)

Fixed:

btullis@stat1004:/root$ analytics-mysql ukwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1530950
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1008.eqiad.wmnet [ukwiki]>

Please note I have only fixed it for s7. Please fix it for the rest of instances

Great! Thanks for your help. So was it just these five additional grants from here that you added by hand? https://github.com/wikimedia/operations-puppet/blob/production/modules/role/templates/mariadb/grants/analytics-replica.sql#L7-L11

It's frustrating that these grants weren't included in the backup file generated by pt-show-grants.

pt-show-grants doesn't show role grants, as far as I remember. This is essentially what you need:

root@dbstore1003.eqiad.wmnet[(none)]>  show grants for research_role;
+-----------------------------------------------------------------------------------------------------------------------------+
| Grants for research_role                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `research_role`                                                                                       |
| GRANT SELECT ON `centralauth`.* TO `research_role`                                                                          |
| GRANT SELECT ON `wikishared`.* TO `research_role`                                                                           |
| GRANT SELECT ON `flowdb`.* TO `research_role`                                                                               |
| GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER, CREATE TEMPORARY TABLES ON `staging`.* TO `research_role` |
| GRANT SELECT ON `%wik%`.* TO `research_role`                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------+
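
A minimal way to apply the same grants on the remaining sections would be to replay those statements against each instance, roughly like this (a sketch for one section; it assumes research_role has already been created, as it was earlier in this task, and that the restored grants already attach the role to the research user):

sudo mysql.s1 <<'EOF'
GRANT SELECT ON `centralauth`.* TO research_role;
GRANT SELECT ON `wikishared`.* TO research_role;
GRANT SELECT ON `flowdb`.* TO research_role;
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER, CREATE TEMPORARY TABLES ON `staging`.* TO research_role;
GRANT SELECT ON `%wik%`.* TO research_role;
EOF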

Change 987426 merged by Btullis:

[operations/dns@master] Switch s5-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987426

Mentioned in SAL (#wikimedia-analytics) [2024-01-08T17:19:34Z] <btullis> migrated s5-analytics-replica to dbstore1008 for T351921

pt-show-grants doesn't show roles grants as far as I remember. This is essentially what you need:

Got it! Many thanks again. I've done this for the other sections on dbstore1008 and I'll make a note that it needs to be done manually on dbstore1009.

Change 987427 merged by Btullis:

[operations/dns@master] Switch s1-analytics-replica to dbstore1008

https://gerrit.wikimedia.org/r/987427

Mentioned in SAL (#wikimedia-analytics) [2024-01-08T17:22:52Z] <btullis> migrated s1-analytics-replica to dbstore1008 for T351921

I just double checked, and these role grants are also present in the grants file: https://github.com/wikimedia/operations-puppet/blob/production/modules/role/templates/mariadb/grants/analytics-replica.sql#L5

s5 and s1 sections are now also being served from dbstore1008.

btullis@stat1004:~$ analytics-mysql avkwiki 
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 971185
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1008.eqiad.wmnet [avkwiki]> Bye
btullis@stat1004:~$ analytics-mysql enwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2312309
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1008.eqiad.wmnet [enwiki]>

I've checked that the wmfdata-python library is working for wikis on s1, s5, and s7 as well.
e.g.

image.png (649×1 px, 108 KB)

Change 987420 merged by Btullis:

[operations/puppet@production] Disable monitoring on dbstore1003

https://gerrit.wikimedia.org/r/987420

This host is not yet visible in orchestrator due to the puppet7 migration (T352974); can we please revert it on the dbstore hosts (T354411)?

OK, I can do that, but I thought we were still trying to fix forward with orchestrator at the moment, given the comments from @ABran-WMF and @MoritzMuehlenhoff here: T352974#9445831