
Decommissioning two hosts ends with: Failed to wipe swraid
Closed, Resolved · Public

Description

While working on T311591 (decommission db2075) and T311589 (decommission db2071), I ended up with the following error on both hosts:

----- OUTPUT of 'lsblk --all --ou...-all --force %*'' -----
/dev/sda: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/sda: 8 bytes were erased at offset 0x3a2c7fffe00 (gpt): 45 46 49 20 50 41 52 54
/dev/sda: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/sda1: 2 bytes were erased at offset 0x00000438 (ext4): 53 ef
wipefs: /dev/sda2: failed to erase swap magic string at offset 0x00000ff6: Text file busy
================
PASS |                                                                                                 |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.55hosts/s]
100.0% (1/1) of nodes failed to execute command 'lsblk --all --ou...-all --force %*'': db2075.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'lsblk --all --ou...-all --force %*''. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
**Failed to wipe swraid, partition-table and filesystem signatures, manual intervention required to make it unbootable**: Cumin execution failed (exit_code=2)

The rest of the decom process went fine.
Both hosts were up and running normally before the decommissioning.
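
For anyone reading the output above: the hex bytes wipefs prints are the on-disk signatures it erased, and the one it could not erase is the swap signature on /dev/sda2. A small sketch that maps them back to their meaning; the `decode` helper name is only illustrative, and the swap offset check assumes a 4 KiB page size with the usual `SWAPSPACE2` magic string:

```python
# Hex bytes copied from the wipefs output in the description.
ERASED = {
    "gpt": "45 46 49 20 50 41 52 54",
    "PMBR": "55 aa",
    "ext4": "53 ef",
}


def decode(hex_bytes: str) -> bytes:
    """Turn wipefs-style 'aa bb cc' hex output into raw bytes."""
    return bytes.fromhex(hex_bytes)


# GPT header signature, stored as plain ASCII.
gpt_magic = decode(ERASED["gpt"])

# ext4 superblock magic, stored little-endian on disk.
ext4_magic = int.from_bytes(decode(ERASED["ext4"]), "little")

# Assumption: 4 KiB pages, so the 10-byte swap magic sits at the end of page 0.
SWAP_MAGIC = b"SWAPSPACE2"
PAGE_SIZE = 4096
swap_magic_offset = PAGE_SIZE - len(SWAP_MAGIC)

print(gpt_magic, hex(ext4_magic), hex(swap_magic_offset))
```

Under those assumptions the computed swap offset matches the 0x00000ff6 in the failing wipefs line, which is consistent with the failure being specific to the swap partition.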

Event Timeline

Restricted Application added a subscriber: Aklapper.

Maybe we need to run "swapoff -a" prior to the wipefs call?

I have a few more hosts to decommission. I can try to do so, but we'd not know whether it helped or it would have just worked without it too :)
Up to you!

> I have a few more hosts to decommission. I can try to do so, but we'd not know whether it helped or it would have just worked without it too :)

It's more of a shot in the dark/hunch based on the fact that the device which fails is a swap partition. It could still be a useful hint, so if you could manually run "swapoff -a" before the next decom, that would be great.
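
One way to test the hunch on a host before decommissioning it is to check whether the partition that failed is currently in use as swap. A rough sketch, assuming the usual `/proc/swaps` layout (one header line, then one line per active swap area with the device path in the first column); the function names and the sample text are made up for illustration:

```python
from pathlib import Path
from typing import List


def active_swap_devices(proc_swaps_text: str) -> List[str]:
    """Return the device paths listed in /proc/swaps-style text."""
    lines = proc_swaps_text.splitlines()
    # Skip the header line, ignore blank lines, keep the first column.
    return [line.split()[0] for line in lines[1:] if line.strip()]


def read_proc_swaps() -> str:
    """Read the live swap table (Linux only)."""
    return Path("/proc/swaps").read_text()


# Illustrative sample, not captured from db2075 or db2071.
SAMPLE = (
    "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"
    "/dev/sda2                               partition\t976892\t\t0\t\t-2\n"
)

print(active_swap_devices(SAMPLE))
```

On a real host, `active_swap_devices(read_proc_swaps())` would list the active swap areas; if /dev/sda2 shows up there, a busy error from wipefs on that partition would not be surprising.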

I ran "swapoff -a" on db2081 before the decom and it went fine. It could be a coincidence or it could have been the fix; hard to know.
However, I guess it doesn't hurt to include it in the normal decommissioning process anyway, if that doesn't require much work. Up to you!
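
As a sketch of what "include it as part of the normal process" could look like: build the command list so that swap is always disabled before any wipefs call. This is not the cookbook's actual code. `build_wipe_commands` is a hypothetical helper, and the real wipe command in the cookbook is truncated in the output above, so a plain `wipefs --all --force <device>` stands in for it here:

```python
import shlex
from typing import List


def build_wipe_commands(devices: List[str]) -> List[str]:
    """Return shell commands in the order they should be run on the host."""
    # Disable all swap first so no swap partition is busy when wipefs runs.
    commands = ["swapoff -a"]
    # Stand-in for the real (truncated) wipe command; one call per device.
    commands += [f"wipefs --all --force {shlex.quote(dev)}" for dev in devices]
    return commands


for command in build_wipe_commands(["/dev/sda2", "/dev/sda"]):
    print(command)
```

Putting `swapoff -a` first is harmless on hosts with no active swap and makes the ordering explicit for anyone running the steps by hand.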

Change 809599 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] Disable swap before running wipefs

https://gerrit.wikimedia.org/r/809599

Change 809599 merged by Muehlenhoff:

[operations/cookbooks@master] Disable swap before running wipefs

https://gerrit.wikimedia.org/r/809599

@Marostegui Did this happen again for any reimage after I merged my patch above?

Nope, it all went fine! Good to close. I need to decom a lot more in the upcoming days, will reopen if needed.
Thanks for fixing it!

MoritzMuehlenhoff claimed this task.

Ack, closing then :-)