Paste P11076

reimaging procedure for db hosts to buster+mariadb 10.4
Archived, Public

Authored by Kormat on Apr 29 2020, 9:36 AM.
Buster + 10.4 epic: https://phabricator.wikimedia.org/T250666
* Log reimage: `!log reimaging HOST to buster T250666`
* Send change against puppet repo:
** Disable notifications for host (e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/592876)
** Allow host to PXE install, but pause at the partitioning step (e.g. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/592884)
** Both changes are reverted at the end of the procedure.
* Run puppet agent on apt1001, apt2001, and icinga.
* Set host to install as buster (e.g. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/592887)
* Depool host (potentially from multiple sections)
* `systemctl stop mariadb && umount /srv`
** For ''multi-instance'', need to stop `mariadb@sX`.
* Take copy of `/srv` entry from `/etc/fstab`
* Connect to mgmt interface
* Attach to serial console
** On Dells (`/admin1->`), use `console com2`. The escape sequence is `^\`
* From cumin host, inside screen: `sudo -E wmf-auto-reimage --no-verify -p TICKET FQDN`
* When the install reaches the partitioning step, select "manual", format the 40G partition as ext4, and set its mountpoint to `/`
** The partitioner should only wipe `/` and `swap`. If it is about to wipe anything else, something has gone wrong: do not proceed.
* Tendril doesn't like this in-place upgrade, so it requires a disable + drop + add + enable after the upgrade; otherwise the Act. (last contact) field doesn't get updated. See [[https://wikitech.wikimedia.org/wiki/MariaDB#Stretch_+_10.1_-%3E_Buster_+_10.4_known_issues|known issues]].
** Check out the [[https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/tendril|tendril repo]] on a cumin host. (Use http, as you don't have your ssh key available there.)
** For ''multi-instance'', remember to run these for all ports
** Remove host from tendril:<div>
```
./tendril-host-drop.sh HOST PORT | sudo -i mysql -h db1115.eqiad.wmnet tendril
```
</div>
** After, re-add host to tendril:<div>
```
./tendril-host-add.sh HOST PORT ~/.my.cnf.tendril tendril | sudo -i mysql -h db1115.eqiad.wmnet tendril
./tendril-host-enable.sh HOST PORT | sudo -i mysql -h db1115.eqiad.wmnet tendril
```
</div>
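For ''multi-instance'' hosts, the commands above have to be repeated for every port. A minimal sketch that only prints the pipelines for review rather than running them: the function name `tendril_cmds` and its `drop`/`readd` actions are hypothetical, while the script names, arguments, and `db1115.eqiad.wmnet` are copied from the commands above.

```shell
#!/bin/bash
# Hypothetical helper: print (do not run) the tendril pipelines for one host.
# Usage: tendril_cmds drop|readd HOST PORT [PORT...]
tendril_cmds() {
    local action="$1" host="$2"
    shift 2
    # Target database invocation, copied from the commands in this paste.
    local tendril_db='sudo -i mysql -h db1115.eqiad.wmnet tendril'
    local port
    for port in "$@"; do
        case "$action" in
            drop)
                echo "./tendril-host-drop.sh $host $port | $tendril_db"
                ;;
            readd)
                # The tilde stays literal here because it is inside double quotes.
                echo "./tendril-host-add.sh $host $port ~/.my.cnf.tendril tendril | $tendril_db"
                echo "./tendril-host-enable.sh $host $port | $tendril_db"
                ;;
            *)
                echo "unknown action: $action" >&2
                return 1
                ;;
        esac
    done
}
```

Calling `tendril_cmds drop HOST PORT1 PORT2` prints one drop pipeline per port; `tendril_cmds readd HOST PORT1 PORT2` prints the add and enable pipelines for each port. Printing first makes it easy to check the host and ports before pasting the lines into a shell in the tendril checkout.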
* Wait for host to finish reimaging
* Check that wmf-mariadb104 is installed.
* Re-add `/srv` to `/etc/fstab`
* Mount `/srv`
* Check that the contents of `/srv` are owned by the `mysql` user; if not, fix the ownership.
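The `/srv` handling (copying the fstab entry before the reimage, checking ownership afterwards) can be sketched in shell. Both function names are hypothetical; only `/srv`, `/etc/fstab`, and the `mysql` user come from the steps above.

```shell
#!/bin/bash
# Print the fstab line(s) whose mount point (2nd field) equals $1.
# Reads from $2 (defaults to /etc/fstab); commented-out lines are skipped.
fstab_entry() {
    local mountpoint="$1" fstab="${2:-/etc/fstab}"
    awk -v mp="$mountpoint" '$1 !~ /^#/ && $2 == mp' "$fstab"
}

# Print the first path under $1 (including $1 itself) that is NOT owned by
# user $2 (defaults to mysql). No output means everything is owned by $2.
first_not_owned() {
    local dir="$1" user="${2:-mysql}"
    find "$dir" ! -user "$user" -print -quit
}
```

Before the reimage, run `fstab_entry /srv` and keep the output somewhere off the host, since `/` is wiped. After mounting `/srv`, `first_not_owned /srv` printing nothing means ownership is fine; otherwise something like `chown -R mysql: /srv` would correct it (that command is an assumption, the step above only says to fix it).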
* Disable replication while we run `mysql_upgrade`: `systemctl set-environment MYSQLD_OPTS="--skip-slave-start"`
** Does not need to be reverted.
* Start mariadb: `systemctl start mariadb`
** For ''multi-instance'', need to start `mariadb@sX`
* Check service logs: `journalctl -xe -u mariadb`. You should only see errors about internal tables, which `mysql_upgrade` will fix.
* Run the upgrade and restart replication. For ''multi-instance'', specify the socket for each command: `-S /run/mysqld/mysqld.sX.sock`
** Run `mysql_upgrade`
** Start slave: `mysql -e "start slave"`
** Check slave status: `mysql -e "show slave status\G"`
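The upgrade, start, and status commands above take the same socket option on ''multi-instance'' hosts, so they can be wrapped per instance. A sketch only: `sock_args`, `upgrade_instance`, and the `RUN` dry-run variable are hypothetical names; the commands and the socket path pattern are the ones listed above.

```shell
#!/bin/bash
# Print the extra client arguments for a section (e.g. s1).
# Prints nothing when no section is given (single-instance host).
sock_args() {
    [ -n "${1:-}" ] && printf -- '-S /run/mysqld/mysqld.%s.sock' "$1"
    return 0
}

# Run mysql_upgrade, start replication, and show its status for one instance.
# RUN is a hypothetical dry-run hook: RUN=echo prints the commands instead.
upgrade_instance() {
    local args
    args="$(sock_args "${1:-}")"
    # $args and ${RUN:-} are intentionally unquoted so they word-split
    # (or vanish entirely when empty).
    ${RUN:-} mysql_upgrade $args &&
    ${RUN:-} mysql $args -e "start slave" &&
    ${RUN:-} mysql $args -e "show slave status\G"
}
```

`RUN=echo upgrade_instance sX` shows what would run for that instance; `upgrade_instance` with no argument covers the single-instance case. The `&&` chain stops before starting replication if `mysql_upgrade` fails.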
* Restart the prometheus mysql exporter (see [[https://phabricator.wikimedia.org/T247290#5956794]])
** For ''multi-instance'', need to use per-instance service `prometheus-mysqld-exporter@sX.service`
* Re-add host to tendril (see above)
* Once it's back in tendril, revert partman change
* Wait for icinga to be fully green, then revert notifications change.
* Wait until replication lag is fully gone, then start slowly repooling the server. (If it's in codfw, you can go straight to full repooling.)
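One way to check that replication lag is gone is to poll `Seconds_Behind_Master` from the `show slave status\G` output used earlier. A minimal sketch, assuming that field is an acceptable measure of lag here (production monitoring may measure it differently); the function names and the 10-second poll interval are hypothetical.

```shell
#!/bin/bash
# Extract Seconds_Behind_Master from `show slave status\G` output on stdin.
# Prints the value: a number, or NULL when replication is not running.
slave_lag() {
    awk -F': *' '/^ *Seconds_Behind_Master:/ { print $2 }'
}

# Poll until the reported lag is exactly 0. Any arguments are passed to the
# mysql client (for example the -S socket option on multi-instance hosts).
wait_for_zero_lag() {
    local lag
    while :; do
        lag="$(mysql "$@" -e 'show slave status\G' | slave_lag)"
        [ "$lag" = "0" ] && return 0
        echo "lag: ${lag:-unknown}" >&2
        sleep 10
    done
}
```

For a ''multi-instance'' host this would be called once per instance, e.g. `wait_for_zero_lag -S /run/mysqld/mysqld.sX.sock`. A value of `NULL` never counts as caught up, so a stopped replica keeps the loop waiting instead of looking healthy.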