WDQS servers taking up to 30 minutes to reboot
Open, High | Public | 8 Estimated Story Points

Assigned To

None

Authored By

	Gehel
	Feb 9 2021, 4:45 PM

Description

As an administrator of WDQS, I want reboots to be fast so that I can run cluster wide operations in a reasonable amount of time.

While doing a full cluster restart of WDQS for a kernel upgrade, multiple servers took at least 30 minutes to reboot. From the console, the shutdown appears to be waiting to unmount disks. Stopping Blazegraph (both wdqs and categories) and the wdqs-updater before the reboot does not have a significant impact on shutdown time.

Possibly related logs (wdqs1007:/var/log/syslog):

Feb  9 16:21:50 wdqs1007 blkdeactivate[16486]:   [SKIP]: unmount of vg0-swap (dm-1) mounted on [SWAP]
Feb  9 16:21:51 wdqs1007 blkdeactivate[16486]:   [UMOUNT]: unmounting vg0-srv (dm-2) mounted on /srv... skipping
Feb  9 16:21:51 wdqs1007 blkdeactivate[16486]:   [SKIP]: unmount of vg0-root (dm-0) mounted on /
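To compare hosts, it may help to summarize these blkdeactivate lines mechanically. Below is a minimal Python sketch that extracts the device, mount point and whether the unmount was skipped. The regular expression is inferred only from the three lines quoted above, and treating both `[SKIP]` and a trailing `... skipping` as "not unmounted" is an assumption, not something taken from the blkdeactivate documentation. The function name `parse_blkdeactivate` is made up for this example.

```python
import re

# Line format inferred from the three syslog samples in this task only.
LINE_RE = re.compile(
    r"blkdeactivate\[\d+\]:\s+\[(?P<action>[A-Z]+)\]:\s+"
    r"(?:unmount of|unmounting)\s+(?P<name>\S+)\s+\((?P<dm>dm-\d+)\)\s+"
    r"mounted on\s+(?P<mountpoint>\S+?)(?P<skipped>\.\.\. skipping)?$"
)

# The log lines quoted in the task description.
SAMPLE = [
    "Feb  9 16:21:50 wdqs1007 blkdeactivate[16486]:   [SKIP]: unmount of vg0-swap (dm-1) mounted on [SWAP]",
    "Feb  9 16:21:51 wdqs1007 blkdeactivate[16486]:   [UMOUNT]: unmounting vg0-srv (dm-2) mounted on /srv... skipping",
    "Feb  9 16:21:51 wdqs1007 blkdeactivate[16486]:   [SKIP]: unmount of vg0-root (dm-0) mounted on /",
]


def parse_blkdeactivate(lines):
    """Return one dict per recognized blkdeactivate unmount line."""
    results = []
    for line in lines:
        match = LINE_RE.search(line.rstrip())
        if match is None:
            continue  # not a line in the assumed format
        # Assumption: "[SKIP]" or a trailing "... skipping" means no unmount happened.
        skipped = match["action"] == "SKIP" or match["skipped"] is not None
        results.append(
            {
                "device": match["name"],
                "dm": match["dm"],
                "mountpoint": match["mountpoint"],
                "unmounted": not skipped,
            }
        )
    return results


if __name__ == "__main__":
    for entry in parse_blkdeactivate(SAMPLE):
        print(entry)
```

Run against the sample above, every entry comes back with `unmounted` set to false, which is consistent with the console impression that the shutdown is stuck around unmounting.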

AC:

wdqs servers can be rebooted in < 5 minutes

Related Objects

Mentioned In: T305012: Review management console access for search-owned physical hosts
T274213: Reboot wdqs hosts

Event Timeline

Gehel created this task. Feb 9 2021, 4:45 PM

Restricted Application added a project: Wikidata. Feb 9 2021, 4:45 PM

Restricted Application added a subscriber: Aklapper.

Gehel updated the task description. Feb 9 2021, 4:49 PM

RKemper subscribed. Feb 9 2021, 6:30 PM

RKemper mentioned this in T274213: Reboot wdqs hosts. Feb 9 2021, 7:09 PM

Gehel triaged this task as High priority. Feb 15 2021, 4:05 PM

Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.

Gehel moved this task from Operations/SRE to Current work on the Wikidata-Query-Service board. Mar 1 2021, 4:27 PM

Gehel added a project: Discovery-Search (Current work).

MPhamWMF set the point value for this task to 8. Mar 1 2021, 4:49 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Gehel edited projects, added Discovery-Search; removed Discovery-Search (Current work). Feb 16 2022, 8:48 PM

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.

bking subscribed. Mar 29 2022, 9:15 PM

This is still happening. @RKemper found some interesting links that could explain this behavior:

https://wiki.freedesktop.org/www/Software/systemd/Debugging/#diagnosingshutdownproblems

https://old.reddit.com/r/archlinux/comments/ba3zec/very_slow_shutdownreboot_fixed/

https://github.com/systemd/systemd/issues/11821#issuecomment-477545885

I think it's worth trying the systemd workaround mentioned in the last GitHub thread.

Actions tried so far: disabling swap via systemd before rebooting. This worked on wdqs2007 but did not work on wdqs2002. It is worth noting that we had already rebooted wdqs2007 within the previous 30 minutes, so a minor kernel update (from 4.19.0-16-amd64 to 4.19.0-20-amd64) or any other reboot-required update could have fixed the issue. It's also possible the system hadn't been up long enough for the problem to appear. By contrast, wdqs2002 has been running a production workload and has not been rebooted recently.

We will pick this up again tomorrow, attempting the systemd workaround linked in my last comment.

Another piece of the puzzle: some wdqs hosts use MDRAID for their /srv partition, while others use LVM. The working assumption is that only the LVM hosts take a very long time to reboot.

Correction: both MDRAID and LVM servers have this problem. Both services' systemd unit files have the same "Conflicts=shutdown.target" directive. We still haven't tried the systemd workaround, but will test it today.
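For checking which units carry that directive, here is a rough Python sketch that pulls the values of one key out of one section of a unit file's text (for instance text obtained with `systemctl cat`). The helper name `unit_directive_values` and the example unit below are hypothetical and are not the actual MDRAID or LVM unit files. The parser is deliberately naive: it ignores line continuations, drop-in overrides, and systemd's rule that an empty assignment resets a list.

```python
def unit_directive_values(unit_text, section, key):
    """Collect all values assigned to `key` inside `[section]` of unit file text."""
    values = []
    current_section = None
    for raw_line in unit_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        # Track which [Section] we are in.
        if line.startswith("[") and line.endswith("]"):
            current_section = line[1:-1]
            continue
        if current_section == section and "=" in line:
            name, _, value = line.partition("=")
            if name.strip() == key:
                # Directives like Conflicts= take a space-separated list.
                values.extend(value.split())
    return values


# Hypothetical unit content, for illustration only.
EXAMPLE_UNIT = """\
[Unit]
Description=Hypothetical example unit
DefaultDependencies=no
Conflicts=shutdown.target

[Service]
ExecStart=/bin/true
"""

if __name__ == "__main__":
    conflicts = unit_directive_values(EXAMPLE_UNIT, "Unit", "Conflicts")
    print("shutdown.target" in conflicts)
```

Running the same helper over the unit text from each host type would make it easy to confirm the directive is identical on both.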

Unfortunately, the systemd workaround listed above did not work. We will try adjusting some other unit file values when time permits.

Gehel removed a project: Discovery-Search. Jul 25 2022, 3:43 PM

Gehel moved this task from Current work to Operations/SRE on the Wikidata-Query-Service board.

Gehel added a project: Discovery-Search (Current work). Dec 20 2022, 10:01 AM

Gehel moved this task from Ready for Dev -- SWE to Incoming on the Discovery-Search (Current work) board.

Gehel removed a project: Discovery-Search (Current work). Jan 23 2023, 4:34 PM

Gehel edited projects, added Discovery-Search (Current work); removed Wikidata-Query-Service. Mar 16 2023, 2:03 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SRE/Ops on the Discovery-Search (Current work) board. Apr 10 2023, 3:46 PM

bking edited projects, added Discovery-Search; removed Discovery-Search (Current work). May 15 2023, 7:37 PM
