I can't reimage cloudcephosd1012 because of partman changes that are no longer compatible with this host.
The first big batch of cloudceph nodes (cloudcephosd10[04-15]) has RAID controllers, ordered by mistake. There should still be two small OS drives and a bunch of unpartitioned big drives, but the partman process fails, saying that there are 10 bootable drives.
That's caused by this snippet:
+ # Double checking that we have exactly two SCSI devices
+ num_devices=$(echo ${devices} | egrep -o '\/dev\/sd[a-z]'|wc -l)
+ if [ ${num_devices} -ne 2 ]
+ then
+ echo "We expected to find two boot devices, but instead found ${num_devices}".
+ exit 1
+ fi

Although by my reading it should complain about 0 bootable drives rather than 10.
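For what it's worth, the count itself is easy to reproduce outside the installer. A minimal sketch of the 10 case; the ${devices} value below is an assumption, since I haven't checked what the real script populates it with on these hosts:

#!/bin/sh
# Hypothetical reproduction of the check above. If the RAID controller
# presents each disk as a plain /dev/sd* block device, all ten of them
# match the pattern and the count comes out as 10.
devices="/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj"
num_devices=$(echo ${devices} | egrep -o '/dev/sd[a-z]' | wc -l)
echo "${num_devices}"    # prints 10

Conversely, if the controller exposed its volumes under some non-sd vendor-specific path, the pattern would match nothing and the script would report 0, which is the reading I had in mind. The actual 10 suggests the controller is presenting plain sd devices after all.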
Topics for discussion:
- is the new partman recipe likely to work for all the rest of our OSD nodes? Do they all conform to the pattern assumed here of two SCSI boot drives plus non-SCSI data drives?
- assuming 'yes', what's a good, clean way to opt these few servers out of that check? (see the sketch after this list)
- dcaro seems to have already reimaged cloudcephosd1006-1011. How?
- should we just scrap cloudcephosd10[04-15] rather than deal with this exception? They're due for refresh in 2025.
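On the opt-out question, the simplest thing I can think of is to gate the check on the hostname, along these lines. This is only a sketch: the glob is an assumption based on the cloudcephosd10[04-15] range, and I haven't verified that hostname returns something useful at that point in the install:

# Hypothetical opt-out for the RAID-controller batch.
case $(hostname) in
    cloudcephosd100[4-9]*|cloudcephosd101[0-5]*)
        # Known RAID-controller hosts: skip the two-device sanity check.
        ;;
    *)
        num_devices=$(echo ${devices} | egrep -o '/dev/sd[a-z]' | wc -l)
        if [ ${num_devices} -ne 2 ]
        then
            echo "We expected to find two boot devices, but instead found ${num_devices}."
            exit 1
        fi
        ;;
esac

The cleaner alternative is probably pointing these hosts at a separate recipe rather than special-casing them inside the shared one, but that's more config to carry around for hardware that's due out in 2025.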