
WDQS servers taking up to 30 minutes to reboot
Open, High · Public · 8 Estimated Story Points

Description

As an administrator of WDQS, I want reboots to be fast so that I can run cluster-wide operations in a reasonable amount of time.

While doing a full cluster restart of WDQS for a kernel upgrade, multiple servers took at least 30 minutes to reboot. Looking at the console, the shutdown appears to be waiting to unmount disks. Stopping blazegraph (both wdqs and categories) and the wdqs-updater before the reboot does not have a significant impact on shutdown time.

Possibly related logs (wdqs1007:/var/log/syslog):

Feb  9 16:21:50 wdqs1007 blkdeactivate[16486]:   [SKIP]: unmount of vg0-swap (dm-1) mounted on [SWAP]
Feb  9 16:21:51 wdqs1007 blkdeactivate[16486]:   [UMOUNT]: unmounting vg0-srv (dm-2) mounted on /srv... skipping
Feb  9 16:21:51 wdqs1007 blkdeactivate[16486]:   [SKIP]: unmount of vg0-root (dm-0) mounted on /
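When comparing hosts, it may help to pull the device and mount point out of these blkdeactivate lines automatically. The following is a minimal sketch, not a tool used on the cluster: the function name `parse_blkdeactivate` and the `not_unmounted` flag are hypothetical, the pattern is derived only from the three lines quoted above, and treating both "[SKIP]" and a trailing "... skipping" as "left mounted" is an assumption about how to read this output.

```python
import re

# The three syslog lines quoted in the task description.
SAMPLE = """\
Feb  9 16:21:50 wdqs1007 blkdeactivate[16486]:   [SKIP]: unmount of vg0-swap (dm-1) mounted on [SWAP]
Feb  9 16:21:51 wdqs1007 blkdeactivate[16486]:   [UMOUNT]: unmounting vg0-srv (dm-2) mounted on /srv... skipping
Feb  9 16:21:51 wdqs1007 blkdeactivate[16486]:   [SKIP]: unmount of vg0-root (dm-0) mounted on /
"""

# Pattern inferred from the sample lines only; other blkdeactivate
# message shapes are not covered.
PATTERN = re.compile(
    r"blkdeactivate\[\d+\]:\s+\[(?P<action>\w+)\]: "
    r"(?:unmount of|unmounting) (?P<device>\S+) \((?P<dm>dm-\d+)\) "
    r"mounted on (?P<mount>\S+?)(?P<skipping>\.\.\. skipping)?$"
)


def parse_blkdeactivate(text):
    """Return one dict per blkdeactivate unmount line found in `text`."""
    entries = []
    for line in text.splitlines():
        match = PATTERN.search(line.rstrip())
        if match is None:
            continue  # unrelated syslog line
        entries.append({
            "action": match["action"],
            "device": match["device"],
            "dm": match["dm"],
            "mount": match["mount"],
            # Assumption: "[SKIP]" or a trailing "... skipping" both mean
            # the device was left mounted at this point in the shutdown.
            "not_unmounted": match["action"] == "SKIP"
            or match["skipping"] is not None,
        })
    return entries


entries = parse_blkdeactivate(SAMPLE)
for entry in entries:
    print(entry["device"], entry["mount"], entry["not_unmounted"])
```

Running the same function over /var/log/syslog from several hosts would make it easier to see whether the same devices are skipped on the slow and the fast ones.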

AC:

  • wdqs servers can be rebooted in < 5 minutes

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. Feb 15 2021, 4:05 PM
Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.
MPhamWMF set the point value for this task to 8. Mar 1 2021, 4:49 PM

Actions tried so far: disabling swap via systemd before rebooting. This worked on wdqs2007 but not on wdqs2002. Note that wdqs2007 had already been rebooted within the previous 30 minutes, so a minor kernel update (from 4.19.0-16-amd64 to 4.19.0-20-amd64) or any other reboot-required update could have fixed the issue there. It is also possible that the system had not been up long enough for the problem to develop. By comparison, wdqs2002 has been running a production workload and has not been rebooted recently.

We will pick this up again tomorrow, attempting the systemd workaround linked in my last comment.

Another piece of the puzzle: some wdqs hosts use MDRAID for their /srv partition, while others use LVM. The working assumption is that only the LVM hosts take a very long time to reboot.

Correction: both MDRAID and LVM servers have this problem. Both services' systemd unit files have the same "Conflicts=shutdown.target" directive. We still haven't tried the systemd workaround; we will test that today.
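To confirm which unit files carry that directive without reading each one by hand, a small check along the following lines could be used. This is only a sketch: the helper name `directive_values` and the sample unit text are hypothetical (the real unit files are not reproduced in this task), and the parser is deliberately simplified. It ignores drop-in overrides and systemd's rule that an empty assignment resets a list.

```python
# Hypothetical unit file content, for illustration only.
SAMPLE_UNIT = """\
[Unit]
Description=Example service
Conflicts=shutdown.target
Before=shutdown.target

[Service]
ExecStart=/bin/true
"""


def directive_values(unit_text, section, key):
    """Collect every value assigned to `key` inside `[section]`."""
    values = []
    current = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current == section and "=" in line:
            name, _, value = line.partition("=")
            if name.strip() == key:
                # Keys such as Conflicts= take a space-separated list
                # and may appear more than once in a section.
                values.extend(value.split())
    return values


conflicts = directive_values(SAMPLE_UNIT, "Unit", "Conflicts")
print("shutdown.target" in conflicts)
```

On a host, the text passed in would come from the actual unit file for each service, so the result reflects what is installed rather than the sample above.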

Unfortunately, the systemd workaround listed above did not work. We will try adjusting some other unit file values when time permits.