
Replace db1108 with db1208
Closed, ResolvedPublic

Description

db1108 needs to be decommissioned; please move the data to db1208 (already running Bullseye, which was still pending from T304492).
Please talk to me if you need help with the migration process.

Event Timeline

This is the ticket we decided to go for instead of T304492

Can we please give this some priority? This is the only Buster host we still have.

Change 939653 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Install MariaDB to db1208

https://gerrit.wikimedia.org/r/939653

Change 939654 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch references from db1108 to db1208

https://gerrit.wikimedia.org/r/939654

My plan is to:

  1. add a silence in Alertmanager for db1108 and db1208
  2. stop mariadb on db1108
  3. configure MariaDB on db1208 with: https://gerrit.wikimedia.org/r/c/939653
  4. use transfer.py to copy /srv/ from db1108 to db1208
  5. switch references from db1108 to db1208 with: https://gerrit.wikimedia.org/r/939654
  6. check permissions on the files in /srv on db1208
  7. start mariadb on db1208
  8. start replication threads on db1208
  9. remove silences
  10. move on to decommissioning db1108 when happy that it's OK.

Does this seem like a reasonable plan, @Marostegui?

Does db1108 replicate from somewhere? If it does, you'd need to do some steps in between (I can help with).

Yes, and thanks. Both mariadb instances replicate from another source, using GTID.

  • the analytics_meta instance replicates from an-coord1001
  • the matomo instance replicates from matomo1002
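
For reference, resuming GTID replication for each of these instances on the new host could look roughly like the following. This is a hypothetical sketch: the socket path is an assumption, and the real settings would come from the saved `SHOW SLAVE STATUS` output rather than from here.

```shell
# Hypothetical sketch, per instance (here: analytics_meta). The socket path
# is an assumption; real coordinates come from the saved SHOW SLAVE STATUS.
btullis@db1208:~$ sudo mysql --socket /run/mysqld/mysqld.analytics_meta.sock
mysql> CHANGE MASTER TO
    ->   MASTER_HOST = 'an-coord1001.eqiad.wmnet',
    ->   MASTER_USE_GTID = slave_pos;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G
```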

Ok, then your steps would be:

  1. add a silence in Alertmanager for db1108 and db1208
  2. connect to each mysql instance that has replication enabled from somewhere and run: stop slave; show slave status\G (and save that output somewhere)
  3. stop mariadb on db1108 (on all instances)
  4. configure MariaDB on db1208 with: https://gerrit.wikimedia.org/r/c/939653
  5. use transfer.py to copy /srv/ from db1108 to db1208
  6. switch references from db1108 to db1208 with: https://gerrit.wikimedia.org/r/939654
  7. check permissions on the files in /srv on db1208
  8. start mariadb on db1208
  9. configure replication on each thread (ping @Marostegui for this step)
  10. start replication threads on db1208
  11. remove silences
  12. add this host to zarcillo (ping @Marostegui once you reach this step)
  13. move on to decommissioning db1108 when happy that it's OK.
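
Step 7 above (checking permissions) can be sketched as a quick ownership check. The `check_owner` helper and the scratch directory below are illustrative, not commands that were actually run on the host:

```shell
# Hedged sketch of step 7: anything under the data directory not owned by
# the mysql user needs fixing before MariaDB starts. check_owner is a
# hypothetical helper; the demo runs against a scratch directory.
check_owner() {
  # List entries under $1 that are NOT owned by user $2.
  find "$1" ! -user "$2"
}

mkdir -p /tmp/permcheck_demo/sqldata.analytics_meta
check_owner /tmp/permcheck_demo "$(id -un)"   # empty output: ownership is consistent
# On the real host this would be, roughly:
#   sudo find /srv ! -user mysql        # expect no output
#   sudo chown -R mysql:mysql /srv      # only if anything is listed
```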

Silences added:
Slaves stopped:

  • Slave status of analytics_meta: P49601
  • Slave status of matomo: P49602

MariaDB instances stopped:

btullis@db1108:/srv/sqldata.analytics_meta$ sudo systemctl stop mariadb@analytics_meta.service 
btullis@db1108:/srv/sqldata.analytics_meta$ sudo systemctl stop mariadb@matomo.service

Change 939653 merged by Btullis:

[operations/puppet@production] Install MariaDB to db1208

https://gerrit.wikimedia.org/r/939653

Starting the transfer of data now.

btullis@cumin1001:~$ sudo transfer.py --no-encrypt db1108.eqiad.wmnet:/srv db1208.eqiad.wmnet:/srv
2023-07-19 12:36:12  INFO: About to transfer /srv from db1108.eqiad.wmnet to ['db1208.eqiad.wmnet']:['/srv'] (62619333481 bytes)

The sync finished successfully, with one warning about a small size mismatch.

2023-07-19 12:39:32  WARNING: Original size is 62619333481 but transferred size is 62619329409 for copy to db1208.eqiad.wmnet
2023-07-19 12:39:32  INFO: Parallel checksum of source on db1108.eqiad.wmnet and the transmitted ones on db1208.eqiad.wmnet match.
2023-07-19 12:39:33  INFO: 62619329409 bytes correctly transferred from db1108.eqiad.wmnet to db1208.eqiad.wmnet
2023-07-19 12:39:34  INFO: Cleaning up....

I had to move the files manually, because they ended up in a /srv/srv/ directory, but I've put them in the right place now and the permissions look fine.
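
The manual fix presumably looked something like the following (a reconstruction demonstrated against a scratch path, not the commands actually run on the host):

```shell
# Hedged reconstruction: transfer.py landed the data in /srv/srv/, so the
# contents had to be moved up one level. Demonstrated on a scratch path.
ROOT=/tmp/srv_demo
mkdir -p "$ROOT/srv/srv/sqldata.analytics_meta" "$ROOT/srv/srv/sqldata.matomo"

mv "$ROOT/srv/srv/"* "$ROOT/srv/"   # move contents up one level
rmdir "$ROOT/srv/srv"               # nested directory is now empty
ls "$ROOT/srv"                      # sqldata.analytics_meta  sqldata.matomo
```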

btullis@db1208:/srv$ ls -l
total 12
drwxr-xr-x  2 mysql mysql    6 Jul 29  2020 sqldata
drwxr-xr-x 19 mysql mysql 4096 Jul 19 12:02 sqldata.analytics_meta
drwxr-xr-x  6 mysql mysql 4096 Jul 19 12:02 sqldata.matomo
drwxr-xr-x  6 mysql mysql 4096 Mar 15 11:18 sqldata.matomo.bak
drwxr-xr-x  2 mysql mysql    6 Jul 29  2020 tmp
drwxr-xr-x  2 mysql mysql    6 Jul 19 12:01 tmp.analytics_meta
drwxr-xr-x  2 mysql mysql    6 Jul 19 12:01 tmp.matomo
btullis@db1208:/srv$ sudo du -sh *
0	sqldata
43G	sqldata.analytics_meta
8.6G	sqldata.matomo
7.0G	sqldata.matomo.bak
0	tmp
0	tmp.analytics_meta
0	tmp.matomo

Change 939654 merged by Btullis:

[operations/puppet@production] Switch references from db1108 to db1208

https://gerrit.wikimedia.org/r/939654

I went to start the services, but they weren't listed, so I enabled them by name and started them by name:

btullis@db1208:/srv$ sudo systemctl enable mariadb@analytics_meta
Created symlink /etc/systemd/system/multi-user.target.wants/mariadb@analytics_meta.service → /lib/systemd/system/mariadb@.service.
btullis@db1208:/srv$ sudo systemctl enable mariadb@matomo
Created symlink /etc/systemd/system/multi-user.target.wants/mariadb@matomo.service → /lib/systemd/system/mariadb@.service.
btullis@db1208:/srv$ sudo systemctl start mariadb@analytics_meta
btullis@db1208:/srv$ sudo systemctl start mariadb@matomo

I wasn't sure if puppet was supposed to enable those units or not, but it seems OK.

The slave threads were started automatically, which I wasn't expecting.

I just checked why they started and worked: your setup is different from production. Your relay logs aren't tied to the host name, which is why replication worked right away. We have this set up differently in production.

OK, that's interesting, thanks.
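
The relay-log difference described above can be illustrated with a my.cnf fragment. This is illustrative only, not the actual analytics or production configuration:

```
[mysqld]
# Hostname-independent relay logs (as on the analytics hosts): copied data
# keeps working after a move to a new host.
relay_log = relay-bin

# Hostname-derived naming (as described for production) would instead be
# something like the following, which breaks when data is copied to a host
# with a different name:
# relay_log = db1108-relay-bin
```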

The only outstanding issue might be that the prometheus-mysqld-exporter is still showing as failed, but enabled.

btullis@db1208:/srv$ systemctl status prometheus-mysqld-exporter
● prometheus-mysqld-exporter.service - Prometheus exporter for MySQL server
     Loaded: loaded (/lib/systemd/system/prometheus-mysqld-exporter.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Wed 2023-07-19 14:39:05 UTC; 4min 13s ago
       Docs: https://prometheus.io/docs/introduction/overview/
    Process: 1890964 ExecStart=/usr/bin/prometheus-mysqld-exporter $ARGS (code=exited, status=1/FAILURE)
   Main PID: 1890964 (code=exited, status=1/FAILURE)
        CPU: 16ms

Jul 19 14:39:05 db1208 systemd[1]: prometheus-mysqld-exporter.service: Scheduled restart job, restart counter is at 5.
Jul 19 14:39:05 db1208 systemd[1]: Stopped Prometheus exporter for MySQL server.
Jul 19 14:39:05 db1208 systemd[1]: prometheus-mysqld-exporter.service: Start request repeated too quickly.
Jul 19 14:39:05 db1208 systemd[1]: prometheus-mysqld-exporter.service: Failed with result 'exit-code'.
Jul 19 14:39:05 db1208 systemd[1]: Failed to start Prometheus exporter for MySQL server.

The two other instances prometheus-mysqld-exporter@analytics_meta.service and prometheus-mysqld-exporter@matomo.service are fine.

I can't remember what is supposed to happen for the default prometheus exporter. Should I disable or mask the unit manually, or is it supposed to be automated?

You can just disable it and reset it so it clears the alert
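
The advice above corresponds to something like the following (a sketch in the style of the ticket's other transcripts; `systemctl reset-failed` clears the failed state so the alert can resolve):

```shell
btullis@db1208:/srv$ sudo systemctl stop prometheus-mysqld-exporter
btullis@db1208:/srv$ sudo systemctl disable prometheus-mysqld-exporter
btullis@db1208:/srv$ sudo systemctl reset-failed prometheus-mysqld-exporter
```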

I'm also not seeing it appear on here yet, after waiting a while and refreshing the dashboard.
I'll look to see if there's anywhere else that I might have to poke prometheus.

image.png (347×585 px, 39 KB)

Done. Thanks for clarification.

I need to add it to zarcillo first

Thanks. I forgot that you mentioned that.

@BTullis can I just install MariaDB 10.6 on this host before it goes to production, so we don't have to do it at a later time when it might be in use?

Yes, feel free.

I think everything is now done from our side, then. I'll proceed with T336254: decommission db1108.eqiad.wmnet shortly.
Apologies that it took so long for us to get around to doing it.

The host is now showing up on Icinga.

I'll get it to MariaDB 10.6 tomorrow or Friday and close this task when done.

Change 940273 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1208: Install MariaDB 10.6

https://gerrit.wikimedia.org/r/940273

Change 940273 merged by Marostegui:

[operations/puppet@production] db1208: Install MariaDB 10.6

https://gerrit.wikimedia.org/r/940273

MariaDB 10.6 has been installed and the host is showing all green on Icinga, so closing this.
Thanks!

Change 940276 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1208: Enable notifications

https://gerrit.wikimedia.org/r/940276

Change 940276 merged by Marostegui:

[operations/puppet@production] db1208: Enable notifications

https://gerrit.wikimedia.org/r/940276

Icinga downtime and Alertmanager silence (ID=6df61f58-13ae-431b-ba26-03ec11d8c678) set by btullis@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: db1108 has been replaced with db1208 - leaving for a few days before decom

db1108.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=78c3d470-e1af-4646-8758-79672fccf2d6) set by btullis@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: db1108 has been replaced with db1208 - leaving for a few days before decom

db1108.eqiad.wmnet