In T382874 we had to replace a broken disk in ms-be1090, one of the new Supermicro Config J nodes. We were unable to hot-swap the disk, and after a chat between DP/Infra-Foundations/Dcops the following issues came up:
- The servers are high density and host two rows of hot-swappable disks (12 each). To extract a disk from one of the two rows, the operator has to slide the server forward, which we cannot easily do at the moment because the cables (fiber, power, etc.) are too short. This means that a shutdown is needed whenever the broken disk sits in the inner/hidden row, because dcops has to disconnect the cables, slide the server forward, replace the disk, and then do everything in reverse. This is not ideal, so we should try to figure out whether there are compromises/solutions. I had a chat with Valerie, who mentioned an extra structure to fold the additional cable length required for sliding (so that the cables don't hang and risk being cut or hit by other servers). For the moment it does not seem to be a viable solution, since it would require longer/different rails for the server to slide on, and those won't fit in our racks.
- Assuming that we can hot-swap in the DC, another issue arises: is the new disk recognized straight away by the OS, without extra intervention? From T382874 it seems that the new disk was visible after the power-up, but we still need to test this for a real hot-swap. From various chats I gathered that the disk may end up in a "Foreign" state (similar to what happens in regular RAID scenarios), and that something like megactl would be needed to set it to JBOD properly. I tested megactl and storecli in T377853#10457893, but neither of them worked, so the only alternative seems to be setting the JBOD value via the BIOS (which would require a reboot). We could test extracting/re-adding a disk on either ms-be2088 or ms-be1091, since we left them out of Swift production, to better understand the procedure. There may also be the option to replace the controllers with something that supports JBOD better (I wrote "may" since this is my assumption and I need to verify it), but this is something that Data Persistence needs to decide.
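
For the extract/re-add test on ms-be2088 or ms-be1091, one simple way to check whether the OS sees the new disk without intervention is to snapshot the kernel's block device list before and after the swap and compare the two. The following is only a minimal sketch: the helper names `list_block_devices` and `diff_block_devices` are hypothetical, and it assumes a Linux host where block devices are exposed under `/sys/block`. It does not talk to the controller, so a disk stuck in a "Foreign" state would presumably just show up as missing from the "added" list.

```python
from pathlib import Path


def list_block_devices(sys_block="/sys/block"):
    """Return the set of block device names the kernel currently exposes.

    Assumption: Linux sysfs layout, one directory entry per block device.
    Returns an empty set if the directory does not exist.
    """
    root = Path(sys_block)
    if not root.is_dir():
        return set()
    return {entry.name for entry in root.iterdir()}


def diff_block_devices(before, after):
    """Compare two snapshots and report devices that disappeared or appeared."""
    before, after = set(before), set(after)
    return {
        "removed": sorted(before - after),
        "added": sorted(after - before),
    }
```

A possible way to use it: call `list_block_devices()` once before pulling the disk, once after inserting the replacement, and pass both results to `diff_block_devices`. An empty "added" list after the swap would suggest that the disk needs extra intervention (for example being set to JBOD) before the OS can see it.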