Platform: Wikimedia Cloud Services
Project: mwoffliner
Impact: production was down and is now restored; seeking guidance to avoid a recurrence
Issue:
mwoffliner1, mwoffliner2 and mwoffliner3 rebooted automatically (we don't know why) on 20 June 2024.
Since then, it looks like their filesystems have been completely mixed up.
It looks like something swapped the sda and sdb devices, and Puppet failed to notice this while generating the /etc/fstab file.
The consequence is that the same device, sdb (which was sda a few days ago), is mounted at both "/" and "/data". Judging from its content and size, however, the mounted volume appears to be the proper one.
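For anyone wanting to check for the same symptom on another VM, here is a minimal Python sketch that reports block devices mounted at more than one mount point. It is only an illustration: the function name `duplicated_devices` is made up for this ticket, and it assumes the usual `/proc/mounts` layout (device first, mount point second). Bind mounts would also show up as duplicates, so the output needs a human look.

```python
from collections import defaultdict


def duplicated_devices(mounts_text):
    """Return {device: [mount points]} for /dev/* devices mounted more than once."""
    by_device = defaultdict(list)
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 2 and fields[0].startswith("/dev/"):
            by_device[fields[0]].append(fields[1])
    # Keep only devices that appear at two or more mount points
    return {dev: points for dev, points in by_device.items() if len(points) > 1}
```

On an affected machine it could be fed the contents of `/proc/mounts`, e.g. `duplicated_devices(open("/proc/mounts").read())`; an empty result would mean no device is mounted twice.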
I tried rebooting the virtual machines and that more or less solved the issue: the proper devices are back at sda and sdb, they are no longer mixed up on any of the machines, and production is back up.
Is there, however, something we can do to prevent this incident from happening again? Would it be possible for Puppet to generate the fstab with UUIDs only, instead of device names, as is recommended nowadays (that way the ordering of sda and sdb would not matter at all)? Is this a known incident that was announced somewhere and that we missed?
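To make the UUID suggestion concrete, here is a rough Python sketch of the transformation I have in mind. It is not how Puppet generates the file today (I don't know the current implementation); the function name `fstab_with_uuids` and the UUID value in the usage note are invented for illustration. It assumes the device-to-UUID mapping has already been collected, for example from `blkid` or from the symlinks under `/dev/disk/by-uuid`.

```python
def fstab_with_uuids(fstab_text, uuid_by_device):
    """Rewrite fstab entries so known devices are referenced as UUID=<uuid>.

    uuid_by_device maps a device path (e.g. "/dev/sdb") to its filesystem UUID.
    Comments, blank lines and entries for unknown devices are left untouched.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        if fields and not fields[0].startswith("#") and fields[0] in uuid_by_device:
            # Replace only the first field (the device spec); keep the rest as-is
            fields[0] = "UUID=" + uuid_by_device[fields[0]]
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, with a made-up UUID, `fstab_with_uuids("/dev/sdb /data ext4 defaults 0 2\n", {"/dev/sdb": "1234-abcd"})` turns the entry into one starting with `UUID=1234-abcd`, which stays valid whichever of sda or sdb the kernel assigns to the volume after a reboot.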