
Upgrade furud/flerovium to Debian Buster
Closed, ResolvedPublic

Description

Furud and Flerovium are two special Hadoop client nodes; they were created to host data backups that we still care about and need.

We can avoid copying data to another host before the reimage if we use the reuse-parts script that Stevie created for partman (and that we used for all the Hadoop workers). This will be a good use case for @razzi to learn how to reimage without losing data :)

High level plan:

  • Check what we do/use for other use cases, like the Hadoop workers.
    • Check the `analytics105[1-9]|analytics106[0-9]|analytics107[0-9]|an-worker*) echo reuse-parts.cfg partman/custom/reuse-analytics-hadoop-worker-12dev.cfg ;; \` entry in netboot.cfg (in puppet).
    • More specifically, the above points to the partman/custom/reuse-analytics-hadoop-worker-12dev.cfg partman recipe.
    • https://gerrit.wikimedia.org/r/c/operations/puppet/+/664788 for an example about how the above recipe was created.
  • Create the partman recipe to keep /srv for furud/flerovium, and set it in netboot.cfg.
  • The first time we should use `echo reuse-parts-test.cfg ...`, since it stops the Debian installer right before confirming the creation/destruction of partitions. This is extremely helpful for making sure that the recipe is right and that we won't wipe everything.
  • Use T231067#6863800 (if needed) before reimaging to set fixed uid/gids properly.
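
As a concrete sketch of the plan above, the netboot.cfg entry for these two hosts could mirror the Hadoop-worker one. The host patterns and the recipe filename `reuse-analytics-hadoop-client.cfg` are assumptions here, not final values:

```shell
# Hypothetical netboot.cfg case entry; the first pass uses reuse-parts-test.cfg
# so the installer stops right before partitioning:
flerovium*|furud*) echo reuse-parts-test.cfg partman/custom/reuse-analytics-hadoop-client.cfg ;; \
# after a successful dry run, replace reuse-parts-test.cfg with reuse-parts.cfg
```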

Event Timeline

Restricted Application added a subscriber: Aklapper.

@razzi let's start working on this, let me know doubts/suggestions/etc.. :)

Change 679607 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] netboot: WIP make flerovium reuse /srv directory

https://gerrit.wikimedia.org/r/679607

@elukey I figured the easiest way to see if I was on the right track was to make a wip patch - see https://gerrit.wikimedia.org/r/c/operations/puppet/+/679607. Will either of the approaches I called out in the commit message work?

Upgrade procedure:

  1. Rename the data logical volume to srv so it matches other hosts and the partman recipe

lvrename flerovium-vg data srv

Question: do we have to stop any services / unmount srv before doing this?

  2. Merge a puppet patch that preserves the srv volume and wipes /, something like:

reuse-analytics-hadoop-client.cfg

d-i	partman/reuse_partitions_recipe	string \
	/dev/sda|1 biosboot ignore none|2 mdraid ignore none, \  # question 1
	/dev/sdb|1 biosboot ignore none|2 mdraid ignore none, \
	/dev/md0|1 lvmpv ignore none, \
	/dev/mapper/*-root|1 ext4 format /, \   # question 2
	/dev/mapper/*-srv|1 ext4 keep /srv

Questions:

  1. Is mdraid the right option for sda and sdb? reuse-raid1-2dev.cfg uses raid, reuse-analytics-hadoop-test.cfg uses mdraid; not sure what the difference is.
  2. Should we add sda1 and sdb1 to a logical volume root so that this line matches?

@elukey I saw in your comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/679607 you'd expect something like

	/dev/md0|1 lvmpv ignore none, \
	/dev/md1|1 lvmpv ignore none, \

From what I understand md0|1 should handle both md0 and md1, so I think we wouldn't need the second line.

Upgrade procedure:

  1. Rename the data logical volume to srv so it matches other hosts and the partman recipe

lvrename flerovium-vg data srv

Question: do we have to stop any services / unmount srv before doing this?

Yes, `umount /srv` first; then /etc/fstab will need to be updated as well with the new LV name (the UUID should remain the same), and finally remount everything via `mount -a` just to be sure that everything works.
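
Putting the rename and the remount together, the sequence might look like this (a sketch, assuming the logical volume is `flerovium-vg/data` mounted at /srv, run as root):

```shell
umount /srv                       # stop anything writing to /srv first
lvrename flerovium-vg data srv    # rename the logical volume data -> srv
# If /etc/fstab references the old mapper path, update it; a mount by UUID
# needs no change, since the filesystem UUID is unaffected by the rename.
# e.g. /dev/mapper/flerovium--vg-data -> /dev/mapper/flerovium--vg-srv
mount -a                          # remount everything to verify it still works
```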

  2. Merge a puppet patch that preserves the srv volume and wipes /, something like:

reuse-analytics-hadoop-client.cfg

d-i	partman/reuse_partitions_recipe	string \
	/dev/sda|1 biosboot ignore none|2 mdraid ignore none, \  # question 1
	/dev/sdb|1 biosboot ignore none|2 mdraid ignore none, \
	/dev/md0|1 lvmpv ignore none, \
	/dev/mapper/*-root|1 ext4 format /, \   # question 2
	/dev/mapper/*-srv|1 ext4 keep /srv

Questions:

  1. Is mdraid the right option for sda and sdb? reuse-raid1-2dev.cfg uses raid, reuse-analytics-hadoop-test.cfg uses mdraid; not sure what the difference is.

reuse-recipe.sh in puppet has some documentation, but I think that in our case mdraid is more appropriate/descriptive (since we have md RAID arrays). Not sure about the difference between raid and mdraid, but I suspect there is none.

  2. Should we add sda1 and sdb1 to a logical volume root so that this line matches?

Not sure about this one, can you add more context?

In the code review, Stevie made a very good point: since there isn't much data stored, we could copy it elsewhere, do a full reimage to use the new partition scheme (the standard one introduced by SRE), and then copy the data back. I was convinced that all the data we needed was stored under /srv, but I found very few things.

@Ottomata I see that in the past you have been loading stuff under /mnt etc.; were those disks attached to the host only temporarily? If so, can we proceed with a deeper wipe (copying the content of /srv into a tarball first)?

were those disks attached to the host only temporarily?

Yes, IIRC the disks were unattached and put into a closet somewhere. I don't think there is anything here that needs to be saved.

@Ottomata perfect! Can you review the content of /srv to see if anything needs to be preserved, or if we can nuke the partitions and standardize them?

Just in case, I'll tarball them into an HDFS archive.
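
The exact commands are not recorded in the task, but the archiving step could look something like the following sketch (hypothetical paths; `hdfs dfs -put` is the standard HDFS upload command, and a throwaway directory stands in for /srv here):

```shell
set -e
# Stand-in directory for /srv (on the real host this would be /srv itself)
mkdir -p /tmp/demo-srv
echo "backup data" > /tmp/demo-srv/file.txt
# Create the compressed archive of the directory contents
tar -czf /tmp/flerovium-archive-demo.tar.gz -C /tmp/demo-srv .
# On the real host, the tarball would then be uploaded to HDFS, e.g.:
#   hdfs dfs -put /tmp/flerovium-archive-demo.tar.gz /wmf/data/archive/backup/...
tar -tzf /tmp/flerovium-archive-demo.tar.gz
```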

Done, archived and stored in HDFS:

-rw-r-----   3 hdfs analytics-admins   53282186 2021-04-20 13:19 /wmf/data/archive/backup/wmf-vs-nsa/flerovium-archive-2021-04.tar.gz
-rw-r-----   3 hdfs analytics-admins  166903846 2021-04-20 13:18 /wmf/data/archive/backup/wmf-vs-nsa/furud-archive-2021-04.tar.gz

I've deleted the files from furud and flerovium, re-image at will! ;)

+1 @razzi reimage when you have time :)

Change 679607 merged by Razzi:

[operations/puppet@production] netboot: update flerovium.eqiad.wmnet, furud.codfw.wmnet to install buster

https://gerrit.wikimedia.org/r/679607

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

flerovium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104202250_razzi_29887_flerovium_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

furud.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104202255_razzi_30448_furud_codfw_wmnet.log.

Completed auto-reimage of hosts:

['flerovium.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['furud.codfw.wmnet']

and were ALL successful.

razzi@flerovium:~$ cat /etc/issue
Debian GNU/Linux 10 \n \l

razzi@flerovium:~$ lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 447.1G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.9G  0 part
  └─md0          9:0    0 446.7G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.9G  0 lvm   /srv
sdb              8:16   0 447.1G  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0 446.9G  0 part
  └─md0          9:0    0 446.7G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.9G  0 lvm   /srv
razzi@furud:~$ cat /etc/issue
Debian GNU/Linux 10 \n \l

razzi@furud:~$ lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 447.1G  0 disk
├─sda1           8:1    0   285M  0 part
└─sda2           8:2    0 446.9G  0 part
  └─md0          9:0    0 446.7G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.9G  0 lvm   /srv
sdb              8:16   0 447.1G  0 disk
├─sdb1           8:17   0   285M  0 part
└─sdb2           8:18   0 446.9G  0 part
  └─md0          9:0    0 446.7G  0 raid1
    ├─vg0-root 253:0    0  74.5G  0 lvm   /
    ├─vg0-swap 253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv  253:2    0 281.9G  0 lvm   /srv

Disks look great! Thanks @elukey @Kormat @Ottomata!