
[EXPEDITED] Investigate Jupyter failing to spawn new environments.
Closed, Resolved (Public)

Assigned To
Authored By
EChetty
Oct 4 2022, 8:26 PM
Referenced Files
F35546980: image.png
Oct 4 2022, 9:49 PM
F35546965: image.png
Oct 4 2022, 9:14 PM
F35546924: image.png
Oct 4 2022, 8:37 PM
F35546922: image.png
Oct 4 2022, 8:37 PM
F35546920: image.png
Oct 4 2022, 8:37 PM

Description

Problem Description:
Product Analytics have noticed that Jupyterhub is failing to spawn new conda environments on some of the stat-machines.

Bug Thread: https://wikimedia.slack.com/archives/CSV483812/p1664847261071309

Examples from Product Analytics:

image.png (472×1 px, 43 KB)

image.png (354×1 px, 32 KB)

image.png (350×1 px, 29 KB)

Impact:
Critical. Prevents the analytics teams from using JupyterHub and Jupyter notebooks to perform analysis.

Event Timeline

EChetty renamed this task from "Investigate Jupyter failing to spawn new environments." to "[EXPEDITED] Investigate Jupyter failing to spawn new environments." (Oct 4 2022, 8:26 PM)
EChetty triaged this task as Unbreak Now! priority.
EChetty moved this task from Backlog to Sprint 02 on the Shared-Data-Infrastructure board.
EChetty updated the task description.

I have opened an Incident Report about this as well, due to the gravity of the outage.
https://docs.google.com/document/d/1KUfVX9-tymmhbWGj0f7_rUTX3ym6RJ7FQ61SOezQcNk/edit

When I restart the jupyterhub-conda service, the resulting process seems to be stuck in the D state (uninterruptible sleep), which usually means it is blocked waiting for I/O.

image.png (306×887 px, 73 KB)
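For anyone wanting to spot such processes from a shell, here is a minimal sketch that filters ps output down to rows whose state column starts with D. The helper name filter_d_state is made up for illustration, and the ps column selection in the usage comment assumes the procps ps found on typical Linux hosts.

```shell
# filter_d_state (hypothetical helper): reads ps-style rows on stdin and
# keeps the header line plus any row whose second column (STAT) starts with D.
filter_d_state() {
  awk 'NR == 1 || $2 ~ /^D/'
}

# Example on a live host, assuming procps ps:
#   ps -eo pid,stat,wchan:32,comm | filter_d_state
```

The wchan column names the kernel function the process is sleeping in, which may hint at whether the wait is on NFS or on something else.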

This appears to be resolved now. The cause seems to have been stale NFS mount points that referred to servers which were no longer serving the files. As a result, processes were stalling on NFS access, waiting for the NFS servers to come back. Below is an attempt to list the path of one of these directories from stat1006.

image.png (293×1 px, 86 KB)
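A stalled listing like this can be probed for without hanging the shell. The following is only a sketch under assumptions: it reads NFS entries from /proc/mounts and runs stat on each mount point under a time limit, using GNU coreutils timeout and stat. The helper names and the 5 second default are illustrative, and on a badly wedged mount the probe itself may not be killable.

```shell
# list_nfs_mounts (hypothetical helper): print the mount point of every
# NFS entry in a mounts table. $1 defaults to /proc/mounts, whose columns
# are: device, mount point, filesystem type, options, dump, pass.
list_nfs_mounts() {
  awk '$3 ~ /^nfs/ { print $2 }' "${1:-/proc/mounts}"
}

# check_mount (hypothetical helper): succeed only if stat on the mount
# point returns within the time limit ($2, default 5 seconds).
check_mount() {
  timeout -s KILL "${2:-5}" stat -t -- "$1" >/dev/null 2>&1
}

# report_stale_mounts (hypothetical helper): print each NFS mount point
# that fails the probe, whether through a timeout or a stat error.
report_stale_mounts() {
  list_nfs_mounts "$@" | while IFS= read -r mnt; do
    check_mount "$mnt" || echo "possibly stale: $mnt"
  done
}

# Example on a live host:
#   report_stale_mounts
```

Mount points containing spaces appear escaped in /proc/mounts and are not handled by this sketch.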

I manually ran a forced, lazy unmount of both of the labstore100[6-7] mount points.

btullis@stat1006:~$ sudo umount -f -l /mnt/nfs/dumps-labstore1006.wikimedia.org
btullis@stat1006:~$ sudo umount -f -l /mnt/nfs/dumps-labstore1007.wikimedia.org

I then edited the /etc/fstab file on each of the boxes and manually removed the labstore100[6-7] entries. Afterwards I ran run-puppet-agent to make sure that the entries weren't re-added.
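One way to confirm after the puppet run that the entries stayed out of /etc/fstab is a small pattern check. This is a sketch only: fstab_has_entry is a made-up helper name, and it simply looks for a regular expression on non-comment lines.

```shell
# fstab_has_entry (hypothetical helper): succeed if any non-comment line
# of the fstab file matches the given regular expression.
#   $1: regular expression, $2: fstab path (default /etc/fstab)
fstab_has_entry() {
  awk -v pat="$1" '
    !/^[ \t]*#/ && $0 ~ pat { found = 1 }
    END { exit !found }
  ' "${2:-/etc/fstab}"
}

# Example on a live host:
#   if fstab_has_entry 'labstore100[67]'; then
#     echo "labstore entries are still present in /etc/fstab"
#   fi
```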

BTullis lowered the priority of this task from Unbreak Now! to High. (Oct 4 2022, 9:49 PM)

I'm still getting spawn failures on stat1004, although all the other servers are fine now. Is it possible the fix didn't get applied to stat1004?

Yes, you're quite right @nshahquinn-wmf - It seems that I had accidentally skipped the unmount on stat1004. That's now corrected.

btullis@stat1004:~$ sudo umount -f -l /mnt/nfs/dumps-labstore1006.wikimedia.org
btullis@stat1004:~$ sudo umount -f -l /mnt/nfs/dumps-labstore1007.wikimedia.org
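Since one host was missed the first time, a per-host check for leftover mounts might help. The sketch below is an assumption-laden illustration: has_mount is a made-up helper that scans a mounts table on stdin, and the host list in the usage comment contains only the two hosts named in this task and would need extending.

```shell
# has_mount (hypothetical helper): read a mounts table on stdin and succeed
# if any mount point (second column) matches the regular expression in $1.
has_mount() {
  awk -v pat="$1" '$2 ~ pat { found = 1 } END { exit !found }'
}

# Example, assuming ssh access to each host:
#   for h in stat1004 stat1006; do
#     if ssh "$h" cat /proc/mounts | has_mount 'dumps-labstore100[67]'; then
#       echo "$h: labstore mount still present"
#     fi
#   done
```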