
unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem
Open, Needs Triage, Public

Description

This is a node where the installer picked the wrong drive to build the md-raid set for / onto; this resulted in the installer failing at the grub installation step, and in a destroyed large swift filesystem that takes hours to rebuild (blocking further reimages in this cluster).

Pre-reimage state:

mvernon@ms-be2060:~$ df -lh
Filesystem      Size  Used Avail Use% Mounted on
udev            252G     0  252G   0% /dev
tmpfs            51G  4.1G   47G   8% /run
/dev/md0         55G  9.5G   43G  19% /
tmpfs           252G  4.0K  252G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           252G     0  252G   0% /sys/fs/cgroup
/dev/sdb4       297G  339M  297G   1% /srv/swift-storage/sdb4
/dev/sdb3        94G   30G   64G  32% /srv/swift-storage/sdb3
/dev/sda4       297G  338M  297G   1% /srv/swift-storage/sda4
/dev/sds1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sds1
/dev/sdc1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdc1
/dev/sdr1       7.3T  5.3T  2.0T  73% /srv/swift-storage/sdr1
/dev/sdv1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdv1
/dev/sdu1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdu1
/dev/sda3        94G   18G   76G  19% /srv/swift-storage/sda3
/dev/sdt1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdt1
/dev/sdq1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdq1
/dev/sdp1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdp1
/dev/sdo1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdo1
/dev/sdl1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdl1
/dev/sdm1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdm1
/dev/sdk1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdk1
/dev/sdn1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdn1
/dev/sdg1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdg1
/dev/sdh1       7.3T  5.2T  2.1T  72% /srv/swift-storage/sdh1
/dev/sdj1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdj1
/dev/sdi1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdi1
/dev/sdd1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdd1
/dev/sde1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sde1
/dev/sdf1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdf1
/dev/sdx1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdx1
/dev/sdw1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdw1
/dev/sdy1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdy1
/dev/sdz1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdz1
tmpfs            51G     0   51G   0% /run/user/33349
mvernon@ms-be2060:~$ sudo blkid
/dev/md0: UUID="305517c2-0808-4589-a213-4cc37ae382d4" TYPE="ext4"
/dev/sdc1: LABEL="swift-sdc1" UUID="17936d56-c4d9-451a-bf3a-dfd5ab0c3d07" TYPE="xfs" PARTLABEL="swift-sdc1" PARTUUID="36568373-61aa-4c7f-a1ff-9eafda6e5897"
/dev/sdb1: UUID="18da7c90-7e68-0e9d-56ee-e30021164f77" UUID_SUB="1ba5819c-fd43-b741-9b2b-92ea8bda6b24" LABEL="ms-be2060:0" TYPE="linux_raid_member" PARTUUID="48f4f878-01"
/dev/sdb2: UUID="2f0bd683-249f-8367-6eed-23ee26fda0e5" UUID_SUB="6652dcaf-e59b-f49b-5aa3-e0026eda7286" LABEL="ms-be2060:1" TYPE="linux_raid_member" PARTUUID="48f4f878-02"
/dev/sdb3: LABEL="swift-sdb3" UUID="a29c83a2-c844-40a6-9310-1c641f44b20d" TYPE="xfs" PARTUUID="48f4f878-03"
/dev/sdb4: LABEL="swift-sdb4" UUID="6b2d1529-e071-4b88-b087-f5378a384a4e" TYPE="xfs" PARTUUID="48f4f878-04"
/dev/sdj1: LABEL="swift-sdj1" UUID="4f5a63f1-49e7-4138-a574-62bbddcb15bb" TYPE="xfs" PARTLABEL="swift-sdj1" PARTUUID="36d62e38-69dc-4e33-a2cb-43e8ddb2bf55"
/dev/sdh1: LABEL="swift-sdh1" UUID="5f1e9875-b1db-403f-9a80-80adc8a1a99e" TYPE="xfs" PARTLABEL="swift-sdh1" PARTUUID="ed5f5a18-8732-4a88-8695-7e1837bd6e96"
/dev/sdd1: LABEL="swift-sdd1" UUID="ff341b95-7587-4ea6-8276-38a5cd03f41a" TYPE="xfs" PARTLABEL="swift-sdd1" PARTUUID="c9f91b02-50e5-4f56-8012-7b279bdfa389"
/dev/sde1: LABEL="swift-sde1" UUID="e3c45c18-3302-4eae-983c-136f2b5525fd" TYPE="xfs" PARTLABEL="swift-sde1" PARTUUID="13b51dc5-74da-4228-b628-fe73d89e9483"
/dev/sdm1: LABEL="swift-sdm1" UUID="4005dc6a-e96c-4646-8687-634dd73d065f" TYPE="xfs" PARTLABEL="swift-sdm1" PARTUUID="d5b6d685-e958-4468-912c-a52b1044251d"
/dev/sdf1: LABEL="swift-sdf1" UUID="bdea7f93-9602-448d-b15e-9be5c34ae583" TYPE="xfs" PARTLABEL="swift-sdf1" PARTUUID="31fede44-566a-4ef1-94c3-1640327a18c9"
/dev/sdr1: LABEL="swift-sdr1" UUID="fe58126d-c971-4531-b3ae-75fb38b1e2e4" TYPE="xfs" PARTLABEL="swift-sdr1" PARTUUID="12794c2a-d470-4cfb-91a7-959745d4f870"
/dev/sdp1: LABEL="swift-sdp1" UUID="c91ff1e8-39b8-4f31-8d44-126667f3304e" TYPE="xfs" PARTLABEL="swift-sdp1" PARTUUID="85c13299-e07e-4e43-8aab-79f62797ea2f"
/dev/sdv1: LABEL="swift-sdv1" UUID="8ff60693-4aba-4a0d-92b7-2e1deae53989" TYPE="xfs" PARTLABEL="swift-sdv1" PARTUUID="90958a37-bfa7-4700-88c5-9adc9da570d3"
/dev/sdn1: LABEL="swift-sdn1" UUID="aeb04028-c6ea-412a-ac85-d90fa2cef7b1" TYPE="xfs" PARTLABEL="swift-sdn1" PARTUUID="05031404-a10d-4307-9b44-d8ed95aa3ab2"
/dev/sdo1: LABEL="swift-sdo1" UUID="05a05184-c3e9-45fc-be50-d8e12079b4a4" TYPE="xfs" PARTLABEL="swift-sdo1" PARTUUID="224d5e68-7e31-4e25-aa22-4cfffd8fb3e3"
/dev/sdk1: LABEL="swift-sdk1" UUID="06935047-b818-4167-9e15-4581144d2c12" TYPE="xfs" PARTLABEL="swift-sdk1" PARTUUID="c8b89584-4015-4b44-a953-0ac48f7f446f"
/dev/sdi1: LABEL="swift-sdi1" UUID="66558e6d-720d-4616-8cac-d59b0bc7f15d" TYPE="xfs" PARTLABEL="swift-sdi1" PARTUUID="c37a6f7d-9ef2-4d99-8d5f-5b8c3cd92069"
/dev/sdg1: LABEL="swift-sdg1" UUID="ae70febe-303e-43cd-8182-6446a93a8e18" TYPE="xfs" PARTLABEL="swift-sdg1" PARTUUID="26f48171-136e-4b72-8602-0b2b3e614a15"
/dev/sdt1: LABEL="swift-sdt1" UUID="9aa4a95f-a5f6-461d-82ab-6393fa246f24" TYPE="xfs" PARTLABEL="swift-sdt1" PARTUUID="f91ce737-6cc7-451d-a4a5-21e39d2a6dc3"
/dev/sdl1: LABEL="swift-sdl1" UUID="d9bf2f08-3f39-4044-b054-8e678663dcd1" TYPE="xfs" PARTLABEL="swift-sdl1" PARTUUID="a9606e83-0129-49b8-9cc7-93dd243519c2"
/dev/sds1: LABEL="swift-sds1" UUID="a3f8b82b-a440-4a28-a7bf-c6ab50962ac0" TYPE="xfs" PARTLABEL="swift-sds1" PARTUUID="d3936cf9-43b4-445a-b57d-5ad19f59de47"
/dev/sdy1: LABEL="swift-sdy1" UUID="da7fcb1e-fa41-4c96-934c-26340a3fbb03" TYPE="xfs" PARTLABEL="swift-sdy1" PARTUUID="e379af10-9cb7-45eb-8354-a62ec8a59aa9"
/dev/sdx1: LABEL="swift-sdx1" UUID="7fc6d355-ee6f-4160-b046-cc0eef0922de" TYPE="xfs" PARTLABEL="swift-sdx1" PARTUUID="bc5260d4-84ab-48f4-8b08-66029e849416"
/dev/sdw1: LABEL="swift-sdw1" UUID="70612c8e-823f-4b71-b25b-0ce5c571d6f7" TYPE="xfs" PARTLABEL="swift-sdw1" PARTUUID="4329abcf-e4c4-4eb5-ac9c-8e265a81c2a1"
/dev/sdq1: LABEL="swift-sdq1" UUID="5a18629a-0f8a-46e9-be41-5eba631c98f3" TYPE="xfs" PARTLABEL="swift-sdq1" PARTUUID="d9c0d534-bea9-4cf4-a71c-8ec39e31bc66"
/dev/sdu1: LABEL="swift-sdu1" UUID="e35b9c89-5e2c-4035-a9c6-8c44d33388b4" TYPE="xfs" PARTLABEL="swift-sdu1" PARTUUID="1495cd77-a802-4dd7-b26d-55f18b00100e"
/dev/sdz1: LABEL="swift-sdz1" UUID="c589b4a3-f52e-4278-b401-709ebbcf015f" TYPE="xfs" PARTLABEL="swift-sdz1" PARTUUID="f501772c-f4d9-4e05-809c-7cb652efd053"
/dev/sda1: UUID="18da7c90-7e68-0e9d-56ee-e30021164f77" UUID_SUB="9341a737-7bb2-1a57-3520-4fddf17f3460" LABEL="ms-be2060:0" TYPE="linux_raid_member" PARTUUID="e09d002c-01"
/dev/sda2: UUID="2f0bd683-249f-8367-6eed-23ee26fda0e5" UUID_SUB="bb568916-62d8-0b5b-8ea6-902e33ab19ab" LABEL="ms-be2060:1" TYPE="linux_raid_member" PARTUUID="e09d002c-02"
/dev/sda3: LABEL="swift-sda3" UUID="d0a2b599-1010-45a6-839d-e634bcba151d" TYPE="xfs" PARTUUID="e09d002c-03"
/dev/sda4: LABEL="swift-sda4" UUID="ca13477b-f15b-4ab5-b7c7-230024aadac6" TYPE="xfs" PARTUUID="e09d002c-04"
/dev/md1: UUID="29f14cd2-8c33-4a3c-a68d-3f7039f6b6db" TYPE="swap"

State once the installer had failed:

~ # blkid | more
/dev/md0: UUID="1c189c09-a080-41e0-b816-c28cd1b06b98" BLOCK_SIZE="4096" TYPE="e"
/dev/md1: UUID="ffca2266-f2db-4512-8899-0e544081f6a8" TYPE="swap"
/dev/sda3: UUID="2677c651-8778-43b7-831c-3607f933bb58" BLOCK_SIZE="4096" TYPE=""
/dev/sda4: UUID="5875309a-ea75-4a89-9a92-d60e9316d71d" BLOCK_SIZE="4096" TYPE=""
/dev/sdb3: UUID="03caf1fe-1629-41ff-aa5e-8e5fb08149c5" BLOCK_SIZE="4096" TYPE=""
/dev/sdb4: UUID="324e82b2-c7b4-429c-b547-258fc4d14a8a" BLOCK_SIZE="4096" TYPE=""
/dev/sdc3: LABEL="swift-sdb3" UUID="a29c83a2-c844-40a6-9310-1c641f44b20d" BLOCK"
/dev/sdc4: LABEL="swift-sdb4" UUID="6b2d1529-e071-4b88-b087-f5378a384a4e" BLOCK"
/dev/sdd1: LABEL="swift-sdd1" UUID="ff341b95-7587-4ea6-8276-38a5cd03f41a" BLOCK"
/dev/sde1: LABEL="swift-sde1" UUID="e3c45c18-3302-4eae-983c-136f2b5525fd" BLOCK"
/dev/sdf1: LABEL="swift-sdg1" UUID="ae70febe-303e-43cd-8182-6446a93a8e18" BLOCK 
SIZE="4096" TYPE="xfs" PARTLABEL="swift-sdg1" PARTUUID="26f48171-136e-4b72-8602"
/dev/sdg1: LABEL="swift-sdh1" UUID="5f1e9875-b1db-403f-9a80-80adc8a1a99e" BLOCK"
/dev/sdh1: LABEL="swift-sdf1" UUID="bdea7f93-9602-448d-b15e-9be5c34ae583" BLOCK"
/dev/sdi1: LABEL="swift-sdi1" UUID="66558e6d-720d-4616-8cac-d59b0bc7f15d" BLOCK"
/dev/sdj1: LABEL="swift-sdj1" UUID="4f5a63f1-49e7-4138-a574-62bbddcb15bb" BLOCK"
/dev/sdk1: LABEL="swift-sdk1" UUID="06935047-b818-4167-9e15-4581144d2c12" BLOCK"
/dev/sdl1: LABEL="swift-sdm1" UUID="4005dc6a-e96c-4646-8687-634dd73d065f" BLOCK"
/dev/sdm1: LABEL="swift-sdn1" UUID="aeb04028-c6ea-412a-ac85-d90fa2cef7b1" BLOCK 
/dev/sdn1: LABEL="swift-sdl1" UUID="d9bf2f08-3f39-4044-b054-8e678663dcd1" BLOCK"
/dev/sdo1: LABEL="swift-sdo1" UUID="05a05184-c3e9-45fc-be50-d8e12079b4a4" BLOCK"
/dev/sdp1: LABEL="swift-sdq1" UUID="5a18629a-0f8a-46e9-be41-5eba631c98f3" BLOCK"
/dev/sdq1: LABEL="swift-sdp1" UUID="c91ff1e8-39b8-4f31-8d44-126667f3304e" BLOCK"
/dev/sdr1: LABEL="swift-sds1" UUID="a3f8b82b-a440-4a28-a7bf-c6ab50962ac0" BLOCK"
/dev/sds1: LABEL="swift-sdr1" UUID="fe58126d-c971-4531-b3ae-75fb38b1e2e4" BLOCK"
/dev/sdt1: LABEL="swift-sdu1" UUID="e35b9c89-5e2c-4035-a9c6-8c44d33388b4" BLOCK"
/dev/sdu1: LABEL="swift-sdt1" UUID="9aa4a95f-a5f6-461d-82ab-6393fa246f24" BLOCK"
/dev/sdv1: LABEL="swift-sdv1" UUID="8ff60693-4aba-4a0d-92b7-2e1deae53989" BLOCK"
/dev/sdw1: LABEL="swift-sdw1" UUID="70612c8e-823f-4b71-b25b-0ce5c571d6f7" BLOCK"
/dev/sdx1: LABEL="swift-sdx1" UUID="7fc6d355-ee6f-4160-b046-cc0eef0922de" BLOCK"
/dev/sdy1: LABEL="swift-sdy1" UUID="da7fcb1e-fa41-4c96-934c-26340a3fbb03" BLOCK"
/dev/sdz1: LABEL="swift-sdz1" UUID="c589b4a3-f52e-4278-b401-709ebbcf015f" BLOCK"
/dev/sdb1: UUID="91100df9-5d7a-0bf0-c298-8f4d242e724f" UUID_SUB="0e9cd793-bfaf-"
/dev/sdb2: UUID="bbd03217-5ffd-dd88-40dd-2916c499bbc9" UUID_SUB="41ae9a63-af0e- 
/dev/sda1: UUID="91100df9-5d7a-0bf0-c298-8f4d242e724f" UUID_SUB="7eaf1110-5fb6-"
/dev/sda2: UUID="bbd03217-5ffd-dd88-40dd-2916c499bbc9" UUID_SUB="57a42d7d-ab63-"
/dev/sdc1: PARTUUID="48f4f878-01"
/dev/sdc2: PARTUUID="48f4f878-02"

So the SSDs were sda and sdc. The following installer run succeeded, and the puppet run, reboot, and subsequent puppet run (i.e. the rest of the regular reimage playbook) ran to completion, with disks thus:

Filesystem      Size  Used Avail Use% Mounted on
udev            252G     0  252G   0% /dev
tmpfs            51G  2.0M   51G   1% /run
/dev/md0         55G  2.8G   50G   6% /
tmpfs           252G  4.0K  252G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda3        94G   16G   78G  17% /srv/swift-storage/sda3
/dev/sdb3        94G   27G   67G  29% /srv/swift-storage/sdb3
/dev/sdb4       297G  2.1G  295G   1% /srv/swift-storage/sdb4
/dev/sda4       297G  2.1G  295G   1% /srv/swift-storage/sda4
/dev/sdf1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sde1
/dev/sdd1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdd1
/dev/sde1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdf1
/dev/sdj1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdj1
/dev/sdh1       7.3T  5.2T  2.1T  72% /srv/swift-storage/sdh1
/dev/sdk1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdk1
/dev/sdg1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdg1
/dev/sdi1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdi1
/dev/sdn1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdn1
/dev/sdm1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdm1
/dev/sdl1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdl1
/dev/sdp1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdp1
/dev/sdo1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdo1
/dev/sdq1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdq1
/dev/sds1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sds1
/dev/sdr1       7.3T  5.3T  2.0T  73% /srv/swift-storage/sdr1
/dev/sdt1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdt1
/dev/sdu1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdu1
/dev/sdv1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdv1
/dev/sdx1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdx1
/dev/sdy1       7.3T  5.3T  2.1T  72% /srv/swift-storage/sdy1
/dev/sdw1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdw1
/dev/sdz1       7.3T  5.3T  2.1T  73% /srv/swift-storage/sdz1
tmpfs            51G     0   51G   0% /run/user/33349

Note that puppet runs OK despite /dev/sdc1 in fact now being part of a part-finished RAID array:

mvernon@ms-be2060:~$ cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md127 : inactive sdc1[1](S)
      58558464 blocks super 1.2
       
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      975872 blocks super 1.2 [2/2] [UU]
        resync=PENDING
      
md0 : active raid1 sda1[0] sdb1[1]
      58558464 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

This requires manual fixing: stop md127, remove the RAID superblock, make a new filesystem on /dev/sdc1, mount it, and re-run puppet. Then wait O(6) hours for swift to backfill.
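For the record, those recovery steps can be sketched roughly as follows. This is a hedged sketch, not the exact commands run on this host: the `run` wrapper and `recover_sdc1` name are made up for illustration, and device/label names are specific to this incident. With `DRY_RUN=1` the commands are only printed.

```shell
#!/bin/sh
# Sketch of the manual recovery for /dev/sdc1 described above; device and
# label names are from this incident and will differ on other hosts.
set -eu

run() {
    # Print each command; only execute it when DRY_RUN is empty/unset.
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

recover_sdc1() {
    run mdadm --stop /dev/md127              # stop the stale, inactive array
    run mdadm --zero-superblock /dev/sdc1    # remove the leftover raid superblock
    run mkfs.xfs -f -L swift-sdc1 /dev/sdc1  # make a fresh swift filesystem
    run mount /srv/swift-storage/sdc1        # remount (assumes puppet restored the fstab entry)
    run puppet agent --test                  # re-run puppet; swift then backfills
}

# Example (dry run): DRY_RUN=1 recover_sdc1
```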

Details

Repo                   Branch      Lines +/-
operations/cookbooks   master      +183 -169
operations/cookbooks   master      +85 -0
operations/puppet      production  +5 -1
operations/puppet      production  +2 -2
operations/puppet      production  +1 -1
operations/puppet      production  +12 -0
operations/puppet      production  +0 -2
operations/puppet      production  +33 -29
operations/puppet      production  +79 -0
operations/puppet      production  +8 -2
operations/puppet      production  +10 -6
operations/puppet      production  +9 -6
operations/puppet      production  +1 -0
operations/puppet      production  +16 -0
operations/puppet      production  +67 -5
operations/puppet      production  +5 -4
operations/puppet      production  +1 -1
operations/puppet      production  +0 -4
operations/puppet      production  +1 -15
operations/puppet      production  +2 -1

Event Timeline


Change 849595 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] R:swift::label_filesystem: jst check that any lable is on the disk

https://gerrit.wikimedia.org/r/849595

As currently set up, puppet is unhappy if the SSDs come up as anything other than sda and sdb.

I suspect that is related to the following

the label renaming aux drives i.e. https://phabricator.wikimedia.org/T308677#8340723

This change could make that a bit better, but it's not a great fix, as it's just looking for *a* label, i.e. any label.

The underlying issue is that these classes try to make the disk mounting stable by adding a label, but the label is based on the thing that is unstable to begin with, i.e. the device name. We need to mount these disks with something that doesn't change on every boot. Currently with the rust disks this works because we only ever add the label once; after that, if a label exists, no matter what it is, we don't relabel (see the issue highlighted by the audit cookbook). However, as mentioned above, for the SSD disks we relabel them every time they change, which means there is no benefit to labelling them in the first place other than adding a bit of confusion.
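The "only label once" behaviour for the rust disks boils down to a check like this (a minimal sketch; the `has_label` wrapper name is made up, and the actual puppet onlyif/unless wiring may differ):

```shell
#!/bin/sh
# Sketch: only add a label when the filesystem has none at all, regardless
# of what any existing label says. The blkid invocation is standard; the
# wrapper function name is illustrative.
has_label() {
    [ -n "$(blkid -s LABEL -o value "$1" 2>/dev/null)" ]
}

# e.g. label only if unlabelled (would fail anyway on a mounted xfs):
# has_label /dev/sdc1 || xfs_admin -L swift-sdc1 /dev/sdc1
```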

We can continue to mount things in /srv/swift-storage/sd* using something more stable than a label based on the device name, i.e.:

/dev/disk/by-path/pci-0000:3b:00.0-scsi-0:0:12:0-part3 - /srv/swift-storage/sda3
/dev/disk/by-path/pci-0000:3b:00.0-scsi-0:0:13:0-part3 - /srv/swift-storage/sdb3
/dev/disk/by-path/pci-0000:3b:00.0-scsi-0:0:12:0-part4 - /srv/swift-storage/sda4
/dev/disk/by-path/pci-0000:3b:00.0-scsi-0:0:13:0-part4 - /srv/swift-storage/sdb4

but I honestly don't see why we would proliferate this, as it's only going to cause confusion when, after a few weeks, months, or years of reboots, /dev/disk/by-path/pci-0000:3b:00.0-scsi-0:0:12:0-part3 is labelled as /dev/sdn3 but still mounted as /srv/swift-storage/sda3 and /dev/sda3 doesn't even exist (again, see the audit cookbook results)

I also note here that, as far as I can tell, the device names themselves (i.e. /dev/sda) can't be made stable: the kernel scans and names them as they show up, then udev queries the disks (with udevadm info /dev/sda) and creates the stable names under /dev/disk.

luckily puppet doesn't relabel them

I have noticed that the aux drives, which use swift::label_filesystem, do relabel the disks; this means you could get some really strange behaviour if /dev/sda came up as /dev/sdb or the other way round.

Sorry to come back to this, a nit: puppet tries to relabel them but fails (because xfs_admin -L won't adjust a mounted partition, I think). So the failure mode is usually puppet unhappiness rather than too much relabelling. [I don't think this really affects your main point, but it seemed worth noting in passing.]

...so I think your change might at least get us back to systems booting reliably, which would be quite a win!

Change 849595 merged by Jbond:

[operations/puppet@production] R:swift::label_filesystem: jst check that any lable is on the disk

https://gerrit.wikimedia.org/r/849595

Just putting a note here: after looking at the dell ansible module, it seems we should be able to change the disk state, e.g. setting (/redfish/v1/Systems/System.Embedded.1/Storage/RAID.Integrated.1-1/Drives/Disk.Bay.25:Enclosure.Internal.0-2:RAID.Integrated.1-1)["Oem"]["Dell"]["DellPhysicalDisk"]["RaidStatus"] = 'Offline'

or possibly

'post', "/redfish/v1/Systems/System.Embedded.1/Oem/Dell/DellRaidService/Actions/DellRaidService.ChangePDState"
 data={"TargetFQDD": drive_id, "State": "Offline"}

this unfortunately doesn't work as I expected

Change 848451 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] C:swift: add swift disks fact

https://gerrit.wikimedia.org/r/848451

Change 848418 merged by Jbond:

[operations/puppet@production] C:swift::storage: add variable for data directory

https://gerrit.wikimedia.org/r/848418

Change 848451 merged by Jbond:

[operations/puppet@production] C:swift: add swift disks fact

https://gerrit.wikimedia.org/r/848451

Change 848419 merged by Jbond:

[operations/puppet@production] P:swift::storage: add new resource to format via pci path

https://gerrit.wikimedia.org/r/848419

Change 848420 merged by Jbond:

[operations/puppet@production] ms-be2050: enable disks by path configuerations

https://gerrit.wikimedia.org/r/848420

Change 859581 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] swift::mount_filesystem: allow overriding the mount point

https://gerrit.wikimedia.org/r/859581

Change 859581 merged by Jbond:

[operations/puppet@production] swift::mount_filesystem: allow overriding the mount point

https://gerrit.wikimedia.org/r/859581

Change 859584 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] swift: Allow for mounting using the device directly

https://gerrit.wikimedia.org/r/859584

Change 859584 merged by Jbond:

[operations/puppet@production] swift: Allow for mounting using the device directly

https://gerrit.wikimedia.org/r/859584

Change 859592 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] swift: move ms-be2050 to new naming schema

https://gerrit.wikimedia.org/r/859592

Change 859607 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] install_server: Add dynamic raid configuration

https://gerrit.wikimedia.org/r/859607

Change 859992 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] swift: base the object number on the scsi path

https://gerrit.wikimedia.org/r/859992

Change 859992 merged by Jbond:

[operations/puppet@production] swift: base the object number on the scsi path

https://gerrit.wikimedia.org/r/859992

The underlying issue is that these classes try to make the disk mounting stable by adding a label, but the label is based on the thing that is unstable to begin with, i.e. the device name. We need to mount these disks with something that doesn't change on every boot.

I'm probably missing something here, but why don't we simply mount the disks based on UUIDs? blkid will print that and it's stable across reboots, e.g. for the disk I'm using as my root partition:

$ sudo blkid /dev/nvme0n1p2
/dev/nvme0n1p2: UUID="1586da60-3648-4ecf-ae57-1acdbde90080" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="6924d437-fff4-47b9-bc94-696c35a7843d"

And then in /etc/fstab:

UUID=1586da60-3648-4ecf-ae57-1acdbde90080 /               ext4    errors=remount-ro 0       1
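As a side note, generating such an entry non-interactively is straightforward (a sketch; the `fstab_line` helper name, and the ext4/options defaults, are illustrative, not anything in puppet):

```shell
#!/bin/sh
# Sketch: build an fstab line from a device's filesystem UUID via blkid.
# Filesystem type and mount options here are illustrative defaults.
fstab_line() {
    dev=$1; mnt=$2
    uuid=$(blkid -s UUID -o value "$dev")
    echo "UUID=$uuid $mnt ext4 errors=remount-ro 0 1"
}

# e.g.: fstab_line /dev/nvme0n1p2 /
```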

Change 859607 merged by Jbond:

[operations/puppet@production] install_server: Add dynamic raid configuration

https://gerrit.wikimedia.org/r/859607

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details


The underlying issue is that these classes try to make the disk mounting stable by adding a label, but the label is based on the thing that is unstable to begin with, i.e. the device name. We need to mount these disks with something that doesn't change on every boot.

I'm probably missing something here, but why don't we simply mount the disks based on UUIDs? blkid will print that and it's stable across reboots, e.g. for the disk I'm using as my root partition:

We could use UUIDs; however, the problem is how we manage that, i.e. we would need a way to first profile the server so that we know which UUIDs relate to the rust disks and which to the SSDs, and then plug them into hiera somewhere. Further, when replacing a disk we would also need to update the hiera data. Using the scsi by-path value makes it a little easier to:

  • handle replaced disks, as they will have the same path
  • map the disk to a specific mount point, e.g. container1, as we can use the scsi path value as the mount point, e.g. 0:0:1:0 -> /mnt/object1, 0:0:7:0 -> /mnt/object7, etc.
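That scsi-path-to-mount-point mapping can be sketched as follows (a hypothetical helper: the function name and /mnt/objectN prefix are made up for illustration, and the merged puppet code may map differently):

```shell
#!/bin/sh
# Sketch: derive a stable mount point from a /dev/disk/by-path name by
# pulling out the scsi target number, per the "0:0:7:0 -> /mnt/object7"
# idea above. Function name and /mnt prefix are illustrative.
object_mount_for() {
    name=${1##*/}             # e.g. pci-0000:3b:00.0-scsi-0:0:7:0-part1
    scsi=${name#*-scsi-}      # 0:0:7:0-part1
    scsi=${scsi%%-part*}      # 0:0:7:0
    target=$(printf '%s' "$scsi" | cut -d: -f3)
    echo "/mnt/object$target"
}

# e.g.: object_mount_for /dev/disk/by-path/pci-0000:3b:00.0-scsi-0:0:7:0-part1
```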

I have now updated ms-be2050 to use the newer mounting scheme based on the by-path id, and it seems to be able to reboot and apply its puppet policy without issue.

I have also made some progress on the re-imaging process, in that I can use a script to discover the SSD disks and plug them into partman. This mostly seems to work: d-i sees the configured partition table, formats the xfs partitions 3 and 4, and attempts to create the software raid device, all using the correct disks. However, it seems to hit an issue with the last step, and d-i stops with the following error:

	Nov 24 11:39:01 partman-auto-raid: mdadm: cannot open /dev/sdl: Device or resource busy
	Nov 24 11:39:01 partman-auto-raid: Error creating array /dev/md1

This seems to be a more generic issue with partman creating the software raid device than something ms-be specific, and thus it seems it is my time at the WMF where I need to dig into partman internals :/. But if anyone else has hit a similar issue or has some pointers, please let me know.

This seems to be a more generic issue with partman creating the software raid device than something ms-be specific, and thus it seems it is my time at the WMF where I need to dig into partman internals :/. But if anyone else has hit a similar issue or has some pointers, please let me know.

I just realised my partman-auto-raid/recipe uses /dev/sda#/dev/sdb vs /dev/sda1#/dev/sdb1; I'm going to try this change and see if it works.
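For reference, the difference in recipe syntax looks like this (a sketch of the suspected fix, not the actual recipe in operations/puppet; partman-auto-raid expects partition devices, not whole disks, in its device field):

```
# Wrong: whole-disk devices
d-i partman-auto-raid/recipe string \
    1 2 0 ext4 / /dev/sda#/dev/sdb .

# Right: partition devices
d-i partman-auto-raid/recipe string \
    1 2 0 ext4 / /dev/sda1#/dev/sdb1 .
```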

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye completed:

  • ms-be2050 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211241242_jbond_583391_ms-be2050.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Skipping waiting for Icinga optimal status and not removing the downtime, --no-check-icinga was set
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye completed:

  • ms-be2050 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211241413_jbond_600747_ms-be2050.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Skipping waiting for Icinga optimal status and not removing the downtime, --no-check-icinga was set
    • Updated Netbox data from PuppetDB

Change 860581 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] install_server: migrate ms-bs_simple top GPT

https://gerrit.wikimedia.org/r/860581

OK, with T308677#8419843 and T308677#8420119 I have now managed to successfully re-image ms-be2050 a couple of times with no problems. I think this resolves the main issue of the task, although I'll leave it to @MatthewVernon to confirm.

There were two issues with these hosts.

The first was that reimages were unstable because the partman recipe uses sda and sdb regardless of whether those disks are SSDs or spinning disks. To fix this I have created a partman early script to find the SSD disks and use them to construct the RAID devices.
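The core idea of such an early script can be sketched as follows (this is a hypothetical illustration, not the actual script from the puppet repo): the kernel exposes a `rotational` flag per block device in sysfs, with `0` meaning SSD and `1` meaning spinning disk. The `find_ssds` function name and the optional directory argument (used so the logic can be exercised against a fake sysfs tree) are my own invention.

```shell
#!/bin/sh
# Sketch: enumerate SSDs by reading /sys/block/*/queue/rotational.
# A value of 0 means non-rotational (SSD); 1 means a spinning disk.
find_ssds() {
    # Optional first argument lets you point at a fake sysfs tree for testing.
    sys_block="${1:-/sys/block}"
    for dev in "$sys_block"/sd*; do
        # Skip non-matching globs and devices without the attribute.
        [ -e "$dev/queue/rotational" ] || continue
        if [ "$(cat "$dev/queue/rotational")" = "0" ]; then
            echo "/dev/${dev##*/}"
        fi
    done
}

find_ssds
```

The real script would then feed the resulting device names into the partman/mdadm setup rather than hard-coding sda and sdb.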

The second issue is that all of the swift ring mount points were mounted under /srv/swift-storage/sd?; this causes issues when disks come up in the wrong order, specifically if one of the SSDs comes up with the wrong name. To fix this I have migrated away from using labels and the sd? device names to using the /dev/disk/by-path udev mappings. This provides a much more stable mounting profile and also has the benefit that a replaced disk will be mounted in the same place (as it will have the same SCSI path).
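To illustrate what by-path naming buys you (the PCI/SCSI path names below are made up for the example; real names depend on the controller): udev maintains a farm of symlinks under /dev/disk/by-path keyed on the physical bus location, and those symlinks follow the slot, not the kernel's sdX assignment. The snippet simulates that symlink farm across two "boots" where the same slot enumerates under different sdX names.

```shell
# Simulate the /dev/disk/by-path symlink farm udev creates.
tmp=$(mktemp -d)
mkdir -p "$tmp/by-path"

# "Boot 1": the disk in SCSI slot 0:0:2:0 enumerated as sdc.
ln -s ../../sdc "$tmp/by-path/pci-0000:18:00.0-scsi-0:0:2:0-part1"
readlink "$tmp/by-path/pci-0000:18:00.0-scsi-0:0:2:0-part1"

# "Boot 2": same physical slot, but the kernel handed out sdd this time.
# The by-path name is unchanged; only the symlink target differs.
rm "$tmp/by-path/pci-0000:18:00.0-scsi-0:0:2:0-part1"
ln -s ../../sdd "$tmp/by-path/pci-0000:18:00.0-scsi-0:0:2:0-part1"
readlink "$tmp/by-path/pci-0000:18:00.0-scsi-0:0:2:0-part1"

# An fstab entry keyed on the by-path name therefore survives reordering,
# e.g. (hypothetical mount point):
# /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:2:0-part1 /srv/swift-storage/objects0 xfs noatime 0 2
```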

Fixing these two issues should provide for a much smoother experience; however, all servers will need to be migrated to the new schema. From a data PoV it would be possible to mount each disk in both locations while we update the various swift configurations. However, I suspect that things will be a bit trickier than that, as we will somehow have to tell swift that ms-be2050:/srv/swift-storage/sdc1 is now ms-be2050:/srv/swift-storage/objects1. This is made more complicated by the fact that the mappings won't be the same across servers, i.e. ms-be2049 may have /srv/swift-storage/sdc1 => /srv/swift-storage/objects8. As such I suspect migrating to this scheme will require a full rebuild of all ms-be servers (but again I'll wait for @MatthewVernon to confirm).

If a rebuild of all back-ends is required then I think we should also consider converting the partitioning to use GPT, and move to having two swap partitions instead of one mirrored swap partition. I attempted this in T308677#8419864 and T308677#8419970 (hence the two failures), using a patch similar to 860581, but was unsuccessful. The main issue is that we need to add an additional boot partition, which means that the xfs partitions need to become extended instead of primary, and I'm not sure I got the syntax correct. Moving to this would also need a minor change to how we identify the account and container disks, but that is not concerning.
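For context, the general shape of such a layout in debian-installer preseed terms might look like the rough sketch below. This is a hypothetical fragment, not the contents of change 860581; all sizes, labels, and the RAID device list are invented for illustration. (Note also that GPT itself has no primary/extended distinction, so the extended-partition question only arises with MBR-style labels.)

```
# Hypothetical partman-auto sketch: GPT with a BIOS boot partition,
# a mirrored root, and an un-mirrored swap partition per SSD.
d-i partman-auto/method string raid
d-i partman-auto/expert_recipe string            \
    ms-be-gpt ::                                 \
        1 1 1 free                               \
            $bios_boot{ } method{ biosgrub } .   \
        57000 57000 57000 raid                   \
            method{ raid } .                     \
        8000 8000 8000 linux-swap                \
            method{ swap } format{ } .           \
        1000 1000 -1 xfs                         \
            method{ format } format{ }           \
            use_filesystem{ } filesystem{ xfs } .
# Assemble the two "raid" partitions into a RAID1 for /:
d-i partman-auto-raid/recipe string \
    1 2 0 ext4 / /dev/sda2#/dev/sdb2 .
```

The `biosgrub` partition is what GRUB needs to boot from GPT on BIOS machines; on a UEFI host it would be an ESP (`method{ efi }`) instead.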

The last point I wanted to raise relates to T309027, where disks were presenting as rotational and not SSD. During that task it was discussed to use the convert-ssds cookbook to migrate all disks away from the current schema, where each disk is configured as a single-disk RAID0 device, to a new schema where we just configure all disks as non-RAID disks. However, looking at the current servers I noticed that most (possibly all) are still configured in the old manner. As things now work I guess this is not such a big issue; however, I think we should agree on a format, as this discussion impacts the by-path value that the disk gets. My vote would be to convert everything to non-RAID disks (while we re-image), as it removes the Dell virtual drive and simplifies things slightly.

ms-be2050 looks good to me now, thank you :)

I think any approach other than re-imaging is going to be fiddly at best; you'd have to reconstruct the sdX -> objectY mapping at boot each time before the filesystems were mounted, I think. And I don't think there's any "rename" operation in swift-ring-builder, so I think you're still going to have to drain (or just drop) the old names and then add the new ones (maybe with the immediate setting).

I think what we did was convert all the SSDs to non-RAID but left the spinning rust alone. I don't know if there's any performance gain to having them be single RAID0 devices (there's certainly admin pain of having to map between the RAID device and physical one).

ms-be2050 looks good to me now, thank you :)

Great

I think any approach other than re-imaging is going to be fiddly at best; you'd have to reconstruct the sdX -> objectY mapping at boot each time before the filesystems were mounted, I think. And I don't think there's any "rename" operation in swift-ring-builder, so I think you're still going to have to drain (or just drop) the old names and then add the new ones (maybe with the immediate setting).

That's what I assumed.

I think what we did was convert all the SSDs to non-RAID but left the spinning rust alone.

Ah, that explains things.

I don't know if there's any performance gain to having them be single RAID0 devices (there's certainly admin pain of having to map between the RAID device and physical one).

Tbh I doubt there would be a performance gain; if anything I would expect some tiny overhead (we have already seen some issues with the SSDs), but who knows, perhaps you can tweak different cache and seek settings if it's virtual. However I'd say the admin side could end up being a bit painful. For instance, right now the SCSI 0:0:0:0 physical disk shows up as 0:2:2:0; I'm guessing it is starting at index 2 instead of 0 because originally the SSDs were mapped to virtual disks 0 and 1 (probably to try and win the PCI scan race). This shows that it's a little bit less deterministic: for instance, if two disks died they might get configured with different virtual disk numbers; it's possible the next virtual disk would be assigned 0:2:0:0, say. So I'd say if we are going to re-image anyway it's probably worth fixing the other cookbook* to make all disks non-RAID and then call the reimage cookbook?

*As this is a one-off task we don't need to be as gentle as Riccardo was trying to be :). We can then add something to the provisioning cookbook to make sure new hosts are configured in the same way.

What do you think about the GPT/swap recommendations?

Yes, I think I agree; although I'm not sure if Riccardo managed to iron out all the issues with the convert-ssd cookbook in the end (I did a bunch of them manually whilst reimaging anyway).

I think not RAIDing the swap makes a good deal of sense, and I am happy with GPT partition tables.

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye completed:

  • ms-be2050 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211281327_jbond_1524484_ms-be2050.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Skipping waiting for Icinga optimal status and not removing the downtime, --no-check-icinga was set
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed, asking the operator what to do
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Change 860581 merged by Jbond:

[operations/puppet@production] install_server: migrate ms-bs_simple top GPT

https://gerrit.wikimedia.org/r/860581

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2050 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed, asking the operator what to do
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host ms-be2050.codfw.wmnet with OS bullseye completed:

  • ms-be2050 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211281721_jbond_1563631_ms-be2050.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Skipping waiting for Icinga optimal status and not removing the downtime, --no-check-icinga was set
    • Updated Netbox data from PuppetDB

Change 859470 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/cookbooks@master] convrt-ssds: update cookbook to reimage ms-be with new partition schema

https://gerrit.wikimedia.org/r/859470

MatthewVernon mentioned this in Unknown Object (Task). Dec 22 2022, 12:25 PM
MatthewVernon mentioned this in Unknown Object (Task). Dec 22 2022, 1:12 PM

Change 875811 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: Remove ms-be2050 from the rings

https://gerrit.wikimedia.org/r/875811

Change 875811 merged by MVernon:

[operations/puppet@production] swift: Remove ms-be2050 from the rings

https://gerrit.wikimedia.org/r/875811

Change 859592 merged by MVernon:

[operations/puppet@production] swift: move ms-be2050 to new naming schema

https://gerrit.wikimedia.org/r/859592

Change 877101 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: remove ms-be2050 from servers_per_port 0 setting

https://gerrit.wikimedia.org/r/877101

Change 877101 merged by MVernon:

[operations/puppet@production] hiera: remove ms-be2050 from servers_per_port 0 setting

https://gerrit.wikimedia.org/r/877101

Change 894009 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] install_server: use newer partman setup for new ms backends

https://gerrit.wikimedia.org/r/894009

Change 894009 merged by MVernon:

[operations/puppet@production] install_server: use newer partman setup for new ms backends

https://gerrit.wikimedia.org/r/894009

Change 895141 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: use a regex to specify new-style storage hosts

https://gerrit.wikimedia.org/r/895141

Change 895141 merged by MVernon:

[operations/puppet@production] hiera: use a regex to specify new-style storage hosts

https://gerrit.wikimedia.org/r/895141

Mentioned in SAL (#wikimedia-operations) [2024-02-06T14:32:31Z] <Emperor> debug convert-disks cookbook against out-of-use ms-be2044 T308677

Change 859470 merged by jenkins-bot:

[operations/cookbooks@master] convert-disks: update cookbook to reimage ms-be with new partition schema

https://gerrit.wikimedia.org/r/859470