db1108 needs to be decommissioned; please move the data to db1208 (already running Bullseye, which was still pending for db1108 in T304492).
Please talk to me if you need help with the migration process.
Description
Details
Related Objects
- Mentioned In
  - T336254: decommission db1108.eqiad.wmnet
  - T326669: Productionize db1206-db1225
  - T304492: Upgrade db1108 to Bullseye
- Mentioned Here
  - T336254: decommission db1108.eqiad.wmnet
  - P49601 slave status on db1108 analytics.meta T334055
  - P49602 slave status on db1108 matomo T334055
  - T304492: Upgrade db1108 to Bullseye
Event Timeline
Change 939653 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Install MariaDB to db1208
Change 939654 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Switch references from db1108 to db1208
My plan is to:
- add a silence in alertmanager for db1108 and db1208
- stop mariadb on db1108
- configure MariaDB on db1208 with: https://gerrit.wikimedia.org/r/c/939653
- use transfer.py to copy /srv/ from db1108 to db1208
- switch references from db1108 to db1208 with: https://gerrit.wikimedia.org/r/939654
- check permissions on the files in /srv on db1208
- start mariadb on db1208
- start replication threads on db1208
- remove silences
- move on to decommissioning db1108 when happy that it's OK.
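The silencing step above might look roughly like this from a cumin host. This is a sketch only: the cookbook name matches the log messages later in this task, but the exact flags are assumptions, so check `--help` before running.

```shell
# Hedged sketch: downtime/silence both hosts via the sre.hosts.downtime
# cookbook (run from a cumin host; flag names are assumptions).
sudo cookbook sre.hosts.downtime --hours 24 \
    -r "Migrating analytics_meta and matomo from db1108 to db1208" \
    'db1108.eqiad.wmnet,db1208.eqiad.wmnet'
```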
Does this seem like a reasonable plan, @Marostegui?
Does db1108 replicate from somewhere? If it does, you'd need to do some steps in between (I can help with).
Yes, and thanks. Both mariadb instances replicate from another source, using GTID.
- the analytics_meta instance replicates from an-coord1001
- the matomo instance replicates from matomo1002
Ok, then your steps would be:
- add a silence in alertmanager for db1108 and db1208
- connect to each mysql instance that has replication enabled from somewhere and run: stop slave; show slave status\G (and save that output somewhere)
- stop mariadb on db1108 (on all instances)
- configure MariaDB on db1208 with: https://gerrit.wikimedia.org/r/c/939653
- use transfer.py to copy /srv/ from db1108 to db1208
- switch references from db1108 to db1208 with: https://gerrit.wikimedia.org/r/939654
- check permissions on the files in /srv on db1208
- start mariadb on db1208
- configure replication on each thread (ping @Marostegui for this step)
- start replication threads on db1208
- remove silences
- add this host to zarcillo (ping @Marostegui once you get to this step)
- move on to decommissioning db1108 when happy that it's OK.
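The "stop slave / record status / reconfigure replication" steps above might look roughly like this on each instance. Treat this as a sketch: the socket paths are assumptions for a multi-instance host, and both instances replicate with GTID as noted earlier, so `slave_pos` should resume from where the copy left off.

```shell
# On db1108, for each instance: stop replication and save the coordinates
# somewhere safe before stopping mariadb.
sudo mysql -S /run/mysqld/mysqld.analytics_meta.sock \
    -e "STOP SLAVE; SHOW SLAVE STATUS\G" > ~/slave_status_analytics_meta.txt

# Later, on db1208: point the instance back at its source. With GTID,
# slave_pos resumes from the position recorded in the copied datadir.
sudo mysql -S /run/mysqld/mysqld.analytics_meta.sock -e "
    CHANGE MASTER TO
        MASTER_HOST='an-coord1001.eqiad.wmnet',
        MASTER_USE_GTID=slave_pos;
    START SLAVE;"
```

The same pattern would apply to the matomo instance, pointed at matomo1002.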
Silences added:
Slaves stopped:
MariaDB instances stopped:
btullis@db1108:/srv/sqldata.analytics_meta$ sudo systemctl stop mariadb@analytics_meta.service
btullis@db1108:/srv/sqldata.analytics_meta$ sudo systemctl stop mariadb@matomo.service
Change 939653 merged by Btullis:
[operations/puppet@production] Install MariaDB to db1208
Starting the transfer of data now.
btullis@cumin1001:~$ sudo transfer.py --no-encrypt db1108.eqiad.wmnet:/srv db1208.eqiad.wmnet:/srv
2023-07-19 12:36:12 INFO: About to transfer /srv from db1108.eqiad.wmnet to ['db1208.eqiad.wmnet']:['/srv'] (62619333481 bytes)
The sync finished successfully, with one warning about a small size mismatch.
2023-07-19 12:39:32 WARNING: Original size is 62619333481 but transferred size is 62619329409 for copy to db1208.eqiad.wmnet
2023-07-19 12:39:32 INFO: Parallel checksum of source on db1108.eqiad.wmnet and the transmitted ones on db1208.eqiad.wmnet match.
2023-07-19 12:39:33 INFO: 62619329409 bytes correctly transferred from db1108.eqiad.wmnet to db1208.eqiad.wmnet
2023-07-19 12:39:34 INFO: Cleaning up....
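The warning looks benign here because the parallel checksums match; the delta is tiny relative to the dataset. A quick way to quantify the mismatch from the logged byte counts:

```shell
# Byte counts taken from the transfer.py log above.
orig=62619333481
xfer=62619329409
echo $(( orig - xfer ))   # difference in bytes: 4072
```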
I had to move the files manually, because they ended up in a /srv/srv/ directory, but they're in the right place now and the permissions look fine.
btullis@db1208:/srv$ ls -l
total 12
drwxr-xr-x  2 mysql mysql    6 Jul 29  2020 sqldata
drwxr-xr-x 19 mysql mysql 4096 Jul 19 12:02 sqldata.analytics_meta
drwxr-xr-x  6 mysql mysql 4096 Jul 19 12:02 sqldata.matomo
drwxr-xr-x  6 mysql mysql 4096 Mar 15 11:18 sqldata.matomo.bak
drwxr-xr-x  2 mysql mysql    6 Jul 29  2020 tmp
drwxr-xr-x  2 mysql mysql    6 Jul 19 12:01 tmp.analytics_meta
drwxr-xr-x  2 mysql mysql    6 Jul 19 12:01 tmp.matomo
btullis@db1208:/srv$ sudo du -sh *
0	sqldata
43G	sqldata.analytics_meta
8.6G	sqldata.matomo
7.0G	sqldata.matomo.bak
0	tmp
0	tmp.analytics_meta
0	tmp.matomo
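For the "check permissions" step in the plan, one sketch is to search for anything under /srv not owned by mysql; an empty result means ownership survived the transfer intact.

```shell
# List any files under /srv whose owner or group is not mysql.
# No output means nothing needs fixing.
sudo find /srv \( -not -user mysql -o -not -group mysql \) -ls
```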
Change 939654 merged by Btullis:
[operations/puppet@production] Switch references from db1108 to db1208
I went to start the services, but they weren't listed, so I enabled them by name and started them by name:
btullis@db1208:/srv$ sudo systemctl enable mariadb@analytics_meta
Created symlink /etc/systemd/system/multi-user.target.wants/mariadb@analytics_meta.service → /lib/systemd/system/mariadb@.service.
btullis@db1208:/srv$ sudo systemctl enable mariadb@matomo
Created symlink /etc/systemd/system/multi-user.target.wants/mariadb@matomo.service → /lib/systemd/system/mariadb@.service.
btullis@db1208:/srv$ sudo systemctl start mariadb@analytics_meta
btullis@db1208:/srv$ sudo systemctl start mariadb@matomo
I wasn't sure if puppet was supposed to enable those units or not, but it seems OK.
The slave threads were started automatically, which I wasn't expecting.
Just checked why they started and worked: your setup is different from production. Your relay logs aren't tied to the hostname, which is why replication worked right away after copying the data. We have this set up differently in production.
OK, that's interesting, thanks.
The only outstanding issue might be that the prometheus-mysqld-exporter is still showing as failed, but enabled.
btullis@db1208:/srv$ systemctl status prometheus-mysqld-exporter
● prometheus-mysqld-exporter.service - Prometheus exporter for MySQL server
Loaded: loaded (/lib/systemd/system/prometheus-mysqld-exporter.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Wed 2023-07-19 14:39:05 UTC; 4min 13s ago
Docs: https://prometheus.io/docs/introduction/overview/
Process: 1890964 ExecStart=/usr/bin/prometheus-mysqld-exporter $ARGS (code=exited, status=1/FAILURE)
Main PID: 1890964 (code=exited, status=1/FAILURE)
CPU: 16ms
Jul 19 14:39:05 db1208 systemd[1]: prometheus-mysqld-exporter.service: Scheduled restart job, restart counter is at 5.
Jul 19 14:39:05 db1208 systemd[1]: Stopped Prometheus exporter for MySQL server.
Jul 19 14:39:05 db1208 systemd[1]: prometheus-mysqld-exporter.service: Start request repeated too quickly.
Jul 19 14:39:05 db1208 systemd[1]: prometheus-mysqld-exporter.service: Failed with result 'exit-code'.
Jul 19 14:39:05 db1208 systemd[1]: Failed to start Prometheus exporter for MySQL server.

The two other instances, prometheus-mysqld-exporter@analytics_meta.service and prometheus-mysqld-exporter@matomo.service, are fine.
I can't remember what is supposed to happen for the default prometheus exporter. Should I disable or mask the unit manually, or is it supposed to be automated?
I'm also not seeing it appear on here yet, after waiting a while and refreshing the dashboard.
I'll look to see if there's anywhere else that I might have to poke prometheus.
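If the default (non-instantiated) exporter unit isn't managed by puppet on multi-instance hosts, masking it by hand is one option. This is a sketch only; it's worth checking puppet first so it doesn't re-enable the unit on the next run.

```shell
# Stop and mask the default exporter; the per-instance units
# prometheus-mysqld-exporter@analytics_meta and @matomo keep running.
sudo systemctl stop prometheus-mysqld-exporter.service
sudo systemctl mask prometheus-mysqld-exporter.service
```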
@BTullis can I just install mariadb 10.6 on this host before it goes to production, so we don't have to do it at a later time when it might be in use?
I think everything is now done from our side, then. I'll proceed with T336254: decommission db1108.eqiad.wmnet shortly.
Apologies that it took so long for us to get around to doing it.
Change 940273 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db1208: Install MariaDB 10.6
Change 940273 merged by Marostegui:
[operations/puppet@production] db1208: Install MariaDB 10.6
MariaDB 10.6 has been installed and the host is showing all green on Icinga, so closing this.
Thanks!
Change 940276 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db1208: Enable notifications
Change 940276 merged by Marostegui:
[operations/puppet@production] db1208: Enable notifications
Icinga downtime and Alertmanager silence (ID=6df61f58-13ae-431b-ba26-03ec11d8c678) set by btullis@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: db1108 has been replaced with db1208 - leaving for a few days before decom
db1108.eqiad.wmnet
Icinga downtime and Alertmanager silence (ID=78c3d470-e1af-4646-8758-79672fccf2d6) set by btullis@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: db1108 has been replaced with db1208 - leaving for a few days before decom
db1108.eqiad.wmnet
