
sre.hosts.reimage: wait reboot time timeout on aqs nodes
Closed, ResolvedPublic

Description

On the aqs nodes, the server takes a long time to finish the disk partitioning.

┌────────────────────────┤ Partitions formatting ├────────────────────────┐
│                                                                         │
│                                   33%                                   │
│                                                                         │
│ Creating ext4 file system for /srv/cassandra-a in partition #1 of       │
│ RAID10 device #1...                                                     │
└─────────────────────────────────────────────────────────────────────────┘

By the time disk partitioning finishes and the installer proceeds to the next step, "Installing the base system", "spicerack.remote.RemoteHosts.wait_reboot_since" is already at [107/120].

[107/120, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Uptime for aqs2002.codfw.wmnet higher than threshold: 1388.26 > 1293.56.

Before the OS install completes and the server reboots into the OS, "spicerack.remote.RemoteHosts.wait_reboot_since" reaches [120/120], which makes the reimage fail.
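To make the failure mode concrete, here is a minimal sketch of a "wait for reboot" loop of the kind the log line above suggests. It is not spicerack's actual implementation: the function names, the exception class and the injected `get_uptime`, `sleep` and `now` callables are all hypothetical. The only things taken from the log are the 120 attempts, the 10 second retry delay and the comparison of the host uptime against the time elapsed since a reference timestamp.

```python
import time
from datetime import datetime, timedelta


class RebootNotDetected(Exception):
    """Hypothetical error: the host uptime shows no reboot since the reference time."""


def check_reboot_since(uptime_seconds, since, now):
    """Raise if the host has been up for longer than the time elapsed since `since`."""
    threshold = (now - since).total_seconds()
    if uptime_seconds > threshold:
        raise RebootNotDetected(
            f"Uptime higher than threshold: {uptime_seconds:.2f} > {threshold:.2f}"
        )


def wait_reboot_since(get_uptime, since, tries=120, delay=10.0,
                      sleep=time.sleep, now=datetime.now):
    """Poll until a reboot is detected; return the attempt number that succeeded.

    `get_uptime` is a caller-supplied callable (assumption) returning the host
    uptime in seconds. After `tries` failed attempts the last error is re-raised.
    """
    for attempt in range(1, tries + 1):
        try:
            check_reboot_since(get_uptime(), since, now())
            return attempt
        except RebootNotDetected:
            if attempt == tries:
                raise
            sleep(delay)
```

With a fixed number of attempts and a fixed delay, the loop gives up after a bounded amount of time no matter what the installer is doing, so a long partitioning phase can use up the whole budget before the host ever reboots.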

You then need to run sre.hosts.reimage again with the --no-pxe option for the install to finish.

Would it be possible, when running the reimage cookbook, to have an option to set the 120 to a different value, or to increase that value?

Thanks.

Event Timeline

Volans triaged this task as Medium priority. Apr 30 2022, 9:50 AM

@Papaul is it normal that it's so slow to just create an empty partition?
We can surely increase the number or add some tweak in spicerack to make it a bit more dynamic. I'll take a look at it next week.

@Volans the only reason I see is the size and number of the disks. We are using software RAID on 8x ~2TB disks.

We looked at the logs with John and Papaul during our last meeting and agreed that it took a long time for mdadm+mkfs to create the software RAID partition and format it. We therefore decided to just increase the current timeout in spicerack. I'll make the patch.
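As a rough aid for sizing such a timeout, the sketch below estimates the effective time per attempt from the one log line quoted in the description (attempt 107, threshold 1293.56 s) and derives how many attempts a given install duration would need. It assumes the threshold is measured from roughly the start of the wait loop, which the task does not confirm, and the helper names are hypothetical. It says nothing about the value actually chosen in the patch.

```python
import math


def seconds_per_attempt(elapsed_seconds, attempts_done):
    """Average wall-clock time per attempt (retry delay plus the check itself)."""
    return elapsed_seconds / attempts_done


def tries_needed(expected_wait_seconds, per_attempt_seconds):
    """Smallest number of attempts whose total duration covers the expected wait."""
    return math.ceil(expected_wait_seconds / per_attempt_seconds)


# Figures from the log line: attempt 107 was reached after about 1293.56 s
# (assumption: the reference timestamp is close to the start of the loop).
observed = seconds_per_attempt(1293.56, 107)

# Approximate total budget of the original 120 attempts, in seconds.
budget = 120 * observed

print(f"about {observed:.1f} s per attempt, about {budget / 60:.0f} min in total")
```

Under that assumption each attempt costs a little more than the 10 second delay alone, so the 120 attempts cover somewhat more than 20 minutes; an install that spends longer than that before rebooting needs a proportionally larger number of attempts.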

Change 791335 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] remote: increase reboot wait time

https://gerrit.wikimedia.org/r/791335

Change 791335 merged by Volans:

[operations/software/spicerack@master] remote: increase reboot wait time

https://gerrit.wikimedia.org/r/791335

This should be resolved. Feel free to reopen it in case it's not.