Maniphest T319346

[EXPEDITED] Investigate Jupyter failing to spawn new environments.
Closed, ResolvedPublic
Assigned To

Authored By

	• EChetty
	Oct 4 2022, 8:26 PM

Description

Problem Description:
Product Analytics have noticed that Jupyterhub is failing to spawn new conda environments on some of the stat-machines.

Bug Thread: https://wikimedia.slack.com/archives/CSV483812/p1664847261071309

Examples from Product Analytics:

Impact:
Critical. Prevents the analytics teams from using JupyterHub and Jupyter notebooks to perform analysis.

Related Objects

Mentioned In: T319217: decommission labstore100[67].wikimedia.org
T319360: [EXPEDITED] Cannot query string data from MariaDB using Wmfdata-Python

Event Timeline

• EChetty created this task. Oct 4 2022, 8:26 PM

Restricted Application added a subscriber: Aklapper. Oct 4 2022, 8:26 PM

• EChetty renamed this task from "Investigate Jupyter failing to spawn new environments." to "[EXPEDITED] Investigate Jupyter failing to spawn new environments.". Oct 4 2022, 8:26 PM

• EChetty triaged this task as Unbreak Now! priority.

• EChetty moved this task from Backlog to Sprint 02 on the Shared-Data-Infrastructure board.

• EChetty edited projects, added Shared-Data-Infrastructure (Sprint 02); removed Shared-Data-Infrastructure.

• EChetty moved this task from Next Up to In Progress on the Shared-Data-Infrastructure (Sprint 02) board.

• EChetty updated the task description.

• EChetty updated the task description. Oct 4 2022, 8:37 PM

I have opened an Incident Report about this as well, due to the gravity of the outage.
https://docs.google.com/document/d/1KUfVX9-tymmhbWGj0f7_rUTX3ym6RJ7FQ61SOezQcNk/edit

When I restart the jupyterhub-conda service, the resulting process appears to be stuck in the D state (uninterruptible sleep), which usually means it is blocked waiting on I/O.
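A minimal sketch of how D-state processes could be listed, assuming a procps-style `ps`. The helper name `filter_d_state` is hypothetical and was not part of the original investigation; it reads `ps`-style rows on stdin so that it can be checked against canned input.

```shell
# Print rows whose STAT column starts with D (uninterruptible sleep).
# Expects "PID STAT WCHAN COMMAND" rows on stdin, with a header on line 1.
filter_d_state() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Live usage on a host (not run here):
#   ps -eo pid,stat,wchan:32,args | filter_d_state
```

The WCHAN column names the kernel function the process is sleeping in, which can hint at whether the wait is NFS related.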

This appears to be resolved now. The cause seems to have been stale NFS mount points referring to servers that were no longer serving the files. As a result, processes were stalling on NFS access, waiting for the NFS servers to come back. This was an attempt to list the path of one of these directories from stat1006.
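One way to probe for a stale mount without hanging the shell is to wrap the access in a timeout. This is only a sketch: the helper name `probe_mount` and the 5-second default are assumptions, and it relies on GNU coreutils `timeout` and `stat`. A process blocked on a hard NFS mount may not always respond to signals, so the `-k` fallback is a best effort.

```shell
# Report whether a path answers a stat call within a time limit.
# Prints "ok", "timeout", or "error" followed by the path.
probe_mount() {
    path=$1
    secs=${2:-5}
    # Send TERM after $secs seconds, then KILL 2 seconds later if still alive.
    timeout -k 2 "$secs" stat -t -- "$path" >/dev/null 2>&1
    case $? in
        0)       echo "ok $path" ;;
        124|137) echo "timeout $path" ;;   # timed out (TERM) or had to be killed
        *)       echo "error $path" ;;
    esac
}

# Live usage on a host (not run here), probing every NFS mount point:
#   findmnt -rn -t nfs,nfs4 -o TARGET | while read -r m; do probe_mount "$m"; done
```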

I manually ran a forced, lazy unmount of both of the labstore100[6-7] mount points.

btullis@stat1006:~$ sudo umount -f -l /mnt/nfs/dumps-labstore1006.wikimedia.org
btullis@stat1006:~$ sudo umount -f -l /mnt/nfs/dumps-labstore1007.wikimedia.org

I then edited the /etc/fstab file on each of the boxes and manually removed the labstore100[6-7] entries. I did a run-puppet-agent afterwards to make sure that the entries weren't re-added.
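The fstab edit above was done by hand. As a rough sketch, the same filtering could be expressed as below; the helper name `strip_labstore_entries` is hypothetical, and the exact fstab lines on the stat hosts are not shown in this task, so the pattern simply matches any line mentioning either hostname.

```shell
# Drop every line that mentions labstore1006 or labstore1007; pass the rest through.
strip_labstore_entries() {
    grep -Ev 'labstore100[67]\.wikimedia\.org'
}

# Example review workflow (not run here): write a candidate file and diff it
# before replacing anything.
#   strip_labstore_entries < /etc/fstab > /tmp/fstab.new
#   diff -u /etc/fstab /tmp/fstab.new
```

Reviewing the diff before installing the new file keeps the change as deliberate as the manual edit was.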

BTullis lowered the priority of this task from Unbreak Now! to High. Oct 4 2022, 9:49 PM

I'm still getting spawn failures on stat1004, although all the other servers are fine now. Is it possible the fix didn't get applied to stat1004?

In T319346#8285365, @nshahquinn-wmf wrote:

I'm still getting spawn failures on stat1004, although all the other servers are fine now. Is it possible the fix didn't get applied to stat1004?

Yes, you're quite right @nshahquinn-wmf. It seems that I had accidentally skipped the unmount on stat1004. That's now corrected.

btullis@stat1004:~$ sudo umount -f -l /mnt/nfs/dumps-labstore1006.wikimedia.org
btullis@stat1004:~$ sudo umount -f -l /mnt/nfs/dumps-labstore1007.wikimedia.org
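To avoid missing a host, a quick check for leftover mounts could look like the sketch below. The helper name `remaining_labstore_mounts` is hypothetical; it reads mount-table rows in `/proc/mounts` format on stdin and prints any mount point that still refers to the two labstore servers.

```shell
# Print mount points (field 2) that still end in dumps-labstore1006/1007.
remaining_labstore_mounts() {
    awk '$2 ~ /dumps-labstore100[67]\.wikimedia\.org$/ { print $2 }'
}

# Live usage on a host (not run here); empty output means nothing is left:
#   remaining_labstore_mounts < /proc/mounts
```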

nshahquinn-wmf mentioned this in T319360: [EXPEDITED] Cannot query string data from MariaDB using Wmfdata-Python. Oct 6 2022, 12:35 AM

BTullis mentioned this in T319217: decommission labstore100[67].wikimedia.org. Oct 6 2022, 9:48 AM

BTullis moved this task from In Progress to Done on the Shared-Data-Infrastructure (Sprint 02) board. Oct 6 2022, 10:49 AM

BTullis closed this task as Resolved. Oct 7 2022, 3:39 PM

	F35546965: image.png
	Oct 4 2022, 9:14 PM
