Disk volumes of cloud instances are completely mixed-up
Closed, ResolvedPublic

Assigned To

Authored By

	Benoit74
	Mon, Jun 24, 12:56 PM

Description

Platform: Wikimedia Cloud Services
Project: mwoffliner
Impact: prod was down, now restored, seeking guidance to avoid new incident

Issue:

mwoffliner1, mwoffliner2 and mwoffliner3 have rebooted automatically (don't know why) on 20th June 2024.

Since then, it looks like their filesystems have been completely mixed up.

It looks like something has inverted sda and sdb devices, and puppet failed to realize this while generating the /etc/fstab file.

The consequence is that the same device is mounted at both "/" and "/data", namely sdb (which was sda a few days ago). From its content and size, however, it looks like the mounted volume is the proper one.
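A symptom like this one (one device backing both "/" and "/data") can be spotted from the kernel's mount table. The following is only a minimal sketch, assuming the text passed in has the layout of /proc/mounts; the function name is hypothetical, and bind mounts will legitimately show up as duplicates too.

```python
from collections import defaultdict


def devices_mounted_more_than_once(mounts_text):
    """Return {device: [mount points]} for /dev/* devices mounted at 2+ places."""
    by_device = defaultdict(list)
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line looks like: device mountpoint fstype options dump pass
        if len(fields) < 2 or not fields[0].startswith("/dev/"):
            continue  # skip blank lines and pseudo filesystems (proc, sysfs, ...)
        by_device[fields[0]].append(fields[1])
    return {dev: points for dev, points in by_device.items() if len(points) > 1}
```

On a live instance one would pass it the contents of /proc/mounts, e.g. `devices_mounted_more_than_once(open("/proc/mounts").read())`.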

I tried rebooting the virtual machines and it kinda solved the issue, i.e. the proper devices are back at sda and sdb, they are no longer mixed up on any of the machines, and production is back up.

Is there, however, something we can do to prevent this incident from happening again? Would it be possible for Puppet to generate the fstab with UUIDs only, instead of device names, as is recommended nowadays (so that the ordering of sda and sdb no longer matters)? Is this a known incident that was announced somewhere and that we missed?

Event Timeline

Benoit74 created this task. Mon, Jun 24, 12:56 PM

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

Benoit74 edited projects, added Data-Services; removed Cloud-Services. Mon, Jun 24, 12:58 PM

Benoit74 edited projects, added Cloud-VPS; removed Data-Services.

Hi, sorry for that. The servers were rebooted to pick up updated network settings: https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/message/IYVYMGLPNOU6JON52PV6R6NKX2XHMK6R/

As far as I can tell, the wmcs-prepare-cinder-volume script is already using UUIDs in fstab, and has been since the very commit it was introduced in. Do you happen to remember how you configured the mount in the first place?

/cc @Andrew

@Rgaudin probably has higher chances to remember, since I wasn't there at that time.

We did not touch the mount points.

I can't guess at the historical reason why fstab doesn't have UUIDs, but adding them there is the right solution for this. The assignment of names like sda and sdb is entirely indeterminate in newer Linux versions, much to my constant surprise and dismay.

I can make the UUID changes if you'd like me to. I believe that /etc/fstab is not managed by Puppet, so hand edits should persist.

We do have a header which indicates that /etc/fstab is managed by Puppet:

# HEADER: This file was autogenerated at 2022-05-08 14:55:14 +0000
# HEADER: by puppet.  While it can still be managed manually, it
# HEADER: is definitely not recommended.
# /etc/fstab: static file system information
UUID=ff1c4e0f-a73d-4a27-b991-ac8ac212c065	/	ext4	rw,discard,errors=remount-ro,x-systemd.growfs	0	1
UUID=860E-392C	/boot/efi	vfat	defaults	0	0
scratch.svc.cloudinfra-nfs.eqiad1.wikimedia.cloud:/srv/scratch	/mnt/nfs/secondary-scratch	nfs	vers=4,bg,intr,sec=sys,proto=tcp,noatime,lookupcache=all,nofsc,rw,soft,timeo=300,retrans=3	0	0
/dev/sdb1	/data	ext4	defaults	0	0

But I would be very happy anyway if you manage to find the proper way to move to UUIDs.
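The change being asked for amounts to replacing the device path in the first fstab field with a `UUID=` reference. This is a minimal sketch, not the procedure actually used on these instances; `use_uuids` is a hypothetical helper and the UUID in the usage note is a placeholder, not the real UUID of any volume here.

```python
def use_uuids(fstab_text, uuid_by_device):
    """Replace device paths in the first fstab field with UUID=... where known."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if fields and not is_comment and fields[0] in uuid_by_device:
            # Replace only the first occurrence (the device field),
            # leaving the original tabs and the other fields untouched.
            line = line.replace(fields[0], "UUID=" + uuid_by_device[fields[0]], 1)
        out.append(line)
    return "\n".join(out) + ("\n" if fstab_text.endswith("\n") else "")
```

For example, `use_uuids(text, {"/dev/sdb1": "00000000-0000-0000-0000-000000000000"})` would turn the `/dev/sdb1	/data	ext4	defaults	0	0` line into one starting with `UUID=00000000-0000-0000-0000-000000000000`. Writing the result to a separate file and reviewing it before replacing /etc/fstab is the safer route.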

fnegri moved this task from Unsorted to Storage on the Cloud-VPS board. Mon, Jun 24, 3:08 PM

In T368265#9917991, @Benoit74 wrote:

We do have a header which indicates that /etc/fstab is managed by Puppet:

Ah, you're right, because of the NFS mount there's some Puppet involvement. Nevertheless, experiments show that Puppet doesn't touch the line for sdb1, so I've substituted in the UUID.

Puppet seems to leave it alone, but I'll leave the test reboots to you.

RhinosF1 subscribed. Mon, Jun 24, 4:02 PM

I rebooted all four instances (I modified mwoffliner4 myself) and everything is ok on mwoffliner2, mwoffliner3 and mwoffliner4.

mwoffliner1 seems to be up but ssh is probably not starting; at least I cannot SSH into the instance. I tried to put the instance in Rescue mode without success: I still couldn't SSH in. I unrescued the instance. Could you please have a look?

@Audiodude do you wanna do the same on the WP1 instance mwcurator? I just checked and it has the same problem in /etc/fstab of using /dev/sdb1 instead of a UUID.

I can try, but I'm not sure I know what I'm doing. Where do I get the UUIDs from?
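On most Linux systems the filesystem UUIDs can be read with `blkid` or `lsblk -f` (run as root), or from the symlinks under /dev/disk/by-uuid. The sketch below takes the symlink route; the function name `read_uuid_map` is hypothetical, and it assumes udev has populated that directory.

```python
import os


def read_uuid_map(by_uuid_dir="/dev/disk/by-uuid"):
    """Map each device path (e.g. /dev/sdb1) to its filesystem UUID."""
    mapping = {}
    for uuid in os.listdir(by_uuid_dir):
        link = os.path.join(by_uuid_dir, uuid)
        # Each entry is a symlink named after the UUID that points at the device node.
        mapping[os.path.realpath(link)] = uuid
    return mapping
```

Calling `read_uuid_map().get("/dev/sdb1")` on the instance would then give the value to put after `UUID=` in /etc/fstab. Since the sda/sdb naming can change between boots, the lookup should be done while the device is known to be the right volume.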

If you have root on mwoffliner you should have it on mwcurator

mwoffliner1 seems to be up but ssh is probably not starting; at least I cannot SSH into the instance. I tried to put the instance in Rescue mode without success: I still couldn't SSH in. I unrescued the instance. Could you please have a look?

This was due to me making a typo in fstab, which is now fixed. (I have access to a raw console which is annoying but which can bypass ssh failure.) Sorry for the commotion!
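Since a typo in fstab can leave an instance unreachable after reboot, a quick sanity check before rebooting can help. The thread does not say what the typo was, so this is only an illustrative sketch of a field-count and mount-point check; `fstab_problems` is a hypothetical name. util-linux also provides `findmnt --verify`, which performs a more thorough check.

```python
def fstab_problems(fstab_text):
    """Return a list of (line_number, message) for lines that look malformed."""
    problems = []
    for number, line in enumerate(fstab_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        fields = stripped.split()
        # Fields: device, mount point, type, options, dump, pass.
        # The last two are optional and default to 0.
        if not 4 <= len(fields) <= 6:
            problems.append((number, f"expected 4 to 6 fields, found {len(fields)}"))
        elif not (fields[1].startswith("/") or fields[1] in ("none", "swap")):
            problems.append((number, f"mount point {fields[1]!r} is not an absolute path"))
    return problems
```

An empty list means only that no line tripped these two checks; it does not prove that the referenced UUIDs exist or that the options are valid.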

Just to be completely clear, I don't really feel comfortable editing /etc/fstab on mwcurator. If one of you could do it that would be great! Thanks!

This was due to me making a typo in fstab, which is now fixed. (I have access to a raw console which is annoying but which can bypass ssh failure.) Sorry for the commotion!

No worries, we've all been there ^^

Just to be completely clear, I don't really feel comfortable editing /etc/fstab on mwcurator. If one of you could do it that would be great! Thanks!

Update done, and the machine was rebooted in coordination with @Audiodude.

We can now close this ticket.

Thank you all for your prompt help on this ticket!

I don't know what "claiming" a task actually means; feel free to claim this if it is important to you, I honestly absolutely don't care ^^
