
[infra] NFS hangs in some workers until the worker is rebooted
Closed, Resolved (Public)

Description

Currently tools-k8s-worker-nfs-1 has a stuck NFS mount; running ls -a on it leaves the process in D state.

Remounting (mount -o remount ...) does not help, and neither does mounting the same export with the same options on another mountpoint (mount -o ...copied options... /root/test).

The pods could still be stopped and moved to other workers even though they had processes stuck in D state.
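
If this shows up again, a quick way to see which processes are wedged and where they are blocked in the kernel is something like the following (a generic sketch; the wchan names depend on the kernel version, but I/O hung on NFS usually shows up in the rpc/nfs layer, e.g. rpc_wait_bit_killable):

# list processes in uninterruptible sleep (D state) together with the
# kernel function they are blocked in (wchan) and their command line
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'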

Event Timeline

dcaro triaged this task as High priority. Apr 16 2024, 4:00 PM
dcaro updated the task description.
dcaro changed the task status from Open to In Progress. Apr 17 2024, 8:05 AM
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 09) board.

Things I've tried:

Yesterday

Remount the NFS volume:

root@tools-k8s-worker-nfs-1:~# mount -o remount /mnt/nfs/labstore-secondary-tools-home

root@tools-k8s-worker-nfs-1:~# ls -la /mnt/nfs/labstore-secondary-tools-home  # stuck in D state

Mount the NFS volume at a different path with the same options:

root@tools-k8s-worker-nfs-1:~# mkdir test
root@tools-k8s-worker-nfs-1:~# mount -o rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.0.87,local_lock=none,addr=172.16.7.14 tools-nfs.svc.tools.eqiad1.wikimedia.cloud:/srv/tools/home test

root@tools-k8s-worker-nfs-1:~# ls -la test  # stuck in D state
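
Something worth capturing next time before rebooting: the client-side RPC statistics for the mount, to see whether requests are being retransmitted or simply never answered (a sketch; nfsstat ships with nfs-common, and the mountstats output format varies a bit between kernels):

# effective NFS mount options as the kernel sees them
nfsstat -m
# per-mount, per-operation RPC counters (retransmissions, queueing, RTT)
grep -A 30 'labstore-secondary-tools-home' /proc/self/mountstats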

Today

For some reason NFS is responding today, though slowly:

root@tools-k8s-worker-nfs-1:~# ls -l /mnt/nfs/labstore-secondary-tools-home
... lists the contents and returns to the prompt

But the processes that got stuck yesterday are still there:

root@tools-k8s-worker-nfs-1:~# ps aux | grep labstore-secondary
root     2309951  0.0  0.0  20712  2448 pts/4    D+   Apr16   0:00 ls --color=auto -l /mnt/nfs/labstore-secondary-tools-home/dcaro

I can't seem to get a process stuck in D state again, so I'm not able to reproduce the issue :/
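
If it does get stuck again, the kernel can tell us exactly where the task is blocked before we reboot (a sketch, needs root; sysrq has to be enabled, and its output lands in dmesg/the journal):

# kernel stack of yesterday's stuck ls, should point into the nfs/sunrpc layer
cat /proc/2309951/stack
# dump the stacks of all blocked (D state) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200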

More tests:

root@tools-k8s-worker-nfs-1:~# ls -l /mnt/nfs/labstore-secondary-tools-home/ &  -> works (takes ~25s)

root@tools-k8s-worker-nfs-1:~# ls --color=auto -l /mnt/nfs/labstore-secondary-tools-home/dcaro &  -> gets stuck

On a new mount with the intr option added, I get the same behavior (intr does not show up later when listing the mounts though, so I'm not sure it's being applied):

root@tools-k8s-worker-nfs-1:~# mount -o rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.0.87,local_lock=none,addr=172.16.7.14 tools-nfs.svc.tools.eqiad1.wikimedia.cloud:/srv/tools/home test
root@tools-k8s-worker-nfs-1:~# mount
...
tools-nfs.svc.tools.eqiad1.wikimedia.cloud:/srv/tools/home on /root/test type nfs4 (rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.0.87,local_lock=none,addr=172.16.7.14)

root@tools-k8s-worker-nfs-1:~# ls -l test/dcaro &
[2] 2704279

root@tools-k8s-worker-nfs-1:~# ps w 2704279
    PID TTY      STAT   TIME COMMAND
2704279 pts/8    D      0:00 ls --color=auto -l test/dcaro  --> does not recover

Trying to kill the process does nothing either (with the default signal or -9).

I wonder if intr is even supported by our stack; from the NetApp help page:

The intr option allows NFS processes to be interrupted when a mount is specified as a hard mount. This policy is deprecated in new clients such as RHEL 6.4 and is hardcoded to nointr. Kill -9 is the only way to interrupt a process in newer kernels.
Note: For business-critical NFS exports, NetApp recommends using intr with hard mounts with NFS clients that support it.
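
Indeed, nfs(5) documents intr/nointr as being ignored since kernel 2.6.25, which would explain why the option never shows up in the mount list. To double-check which options the kernel actually applied (a sketch):

# the options reported here are what the kernel is really using
findmnt -o TARGET,OPTIONS /root/test
grep ' /root/test ' /proc/mounts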

Mounting it as soft:

root@tools-k8s-worker-nfs-1:~# mount -o rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.0.87,local_lock=none,addr=172.16.7.14 tools-nfs.svc.tools.eqiad1.wikimedia.cloud:/srv/tools/home test

root@tools-k8s-worker-nfs-1:~# mount
...
tools-nfs.svc.tools.eqiad1.wikimedia.cloud:/srv/tools/home on /root/test type nfs4 (rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.0.87,local_lock=none,addr=172.16.7.14)

root@tools-k8s-worker-nfs-1:~# ls -l test/dcaro &
[3] 2705535
root@tools-k8s-worker-nfs-1:~# total 32996
-rw-r--r-- 1 root    25603      236 May 17  2023 1
-rw-r--r-- 1 dcaro   25603     4707 Dec 11  2020 apt_upgrade_report.20201211
....

It works right away :/

root@tools-k8s-worker-nfs-1:~# time ls -l test/dcaro/*
...
real    0m1.778s
user    0m0.012s
sys     0m1.017s

Yep, it did not change anything on the hard mount:

root@tools-k8s-worker-nfs-1:~# time ls -l test/dcaro/* &
[3] 2706229

root@tools-k8s-worker-nfs-1:~# ps w 2706229
    PID TTY      STAT   TIME COMMAND
2706229 pts/8    D      0:00 -bash  --> still gets stuck

I rebooted the node; we'll have to do some extra testing :/
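
For reference, the generic drain/reboot/uncordon sequence for a worker looks roughly like this (a sketch; the actual Toolforge automation/cookbooks may wrap it differently, and the node name is whatever kubectl get nodes reports):

# evict the pods (this worked even with processes stuck in D state, see above)
kubectl drain tools-k8s-worker-nfs-1 --ignore-daemonsets --delete-emptydir-data
# reboot the VM, then once it is back up:
kubectl uncordon tools-k8s-worker-nfs-1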

@aborrero pointed out that this might be a consequence of the OOM killer killing a process at a bad moment.

The OOM killer is the standard mechanism cgroups use to enforce process memory limits, so it's something we can't really avoid, unfortunately :/
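
To see how often the kernel is OOM-killing inside pod cgroups on a given worker, the per-cgroup counters can be read directly (a sketch, assuming cgroup v2 and the systemd cgroup driver; on cgroup v1 the counter lives in memory.oom_control and the kubepods path is different):

# non-zero oom_kill counters under the kubepods hierarchy
find /sys/fs/cgroup/kubepods.slice -name memory.events \
  -exec grep -H 'oom_kill ' {} + | grep -v ' 0$'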

It also seems to be pretty common:

root@cloudcumin1001:~# cumin 'O{project:tools name:tools-k8s-worker}' 'journalctl | grep OOM | wc'
59 hosts will be targeted:
tools-k8s-worker-[102-104].tools.eqiad1.wikimedia.cloud,tools-k8s-worker-nfs-[1-56].tools.eqiad1.wikimedia.cloud
OK to proceed on 59 hosts? Enter the number of affected hosts to confirm or "q" to quit: 59
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-55.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
23 350 4680
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-38.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
205 2898 43721
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-47.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
25 394 4773
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-53.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
49 746 9727
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-36.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
40 650 9297
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-41.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
48 702 9782
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-37.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
2435 34116 524300
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-15.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
14 278 3713
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-20.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
16599 232422 3576619
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-4.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
36 618 8032
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-32.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
5557 77924 1197765
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-34.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
517 7272 110812
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-50.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
206 3012 42933
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-14.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
14 220 2690
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-48.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
48 710 9883
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-5.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
252 3566 53413
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-25.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
87 1330 18912
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-56.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
583 8230 124955
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-35.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
3080 43286 663225
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-7.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
15682 219670 3364060
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-51.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
46 692 9259
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-9.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
1685 23666 360412
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-39.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
484 6830 103403
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-11.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
113 1754 23732
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-30.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
175 2600 37607
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-12.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
32 470 6473
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-54.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
54 772 11479
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-33.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
78 1132 16144
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-2.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
24 376 4603
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-31.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
5886 82444 1267858
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-49.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
702 9940 149854
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-13.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
175 2534 36573
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-3.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
1502 21068 321514
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-21.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
689 9890 152472
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-52.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
1384 19400 297865
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-44.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
256 3624 54583
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-29.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
107 1542 22444
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-10.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
16941 237237 3650025
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-45.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
213 2994 45693
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-46.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
20 312 3904
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-6.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
51 1084 17517
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-40.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
59 894 11893
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-28.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
12 192 2159
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-23.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
8 120 1645
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-8.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
30 468 5781
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-42.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
9998 140004 2154264
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-24.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
56 832 11504
===== NODE GROUP =====
(1) tools-k8s-worker-102.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
213 3588 49452
===== NODE GROUP =====
(1) tools-k8s-worker-104.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
166 2588 33878
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-22.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
39 632 9121
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-43.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
5 74 1038
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-1.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
46107 645766 9893517
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-17.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
585 8274 126586
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-27.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
143719 2012046 30971537
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-19.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
2235 31952 484880
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-18.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
214 3518 50917
===== NODE GROUP =====
(1) tools-k8s-worker-103.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
42 898 15918
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-26.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
706 10298 158172
===== NODE GROUP =====
(1) tools-k8s-worker-nfs-16.tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'journalctl | grep OOM | wc' -----
9 456 9078
================
PASS |██████████████████████████████████████████████████| 100% (59/59) [01:16<00:00, 1.30s/hosts]
FAIL | | 0% (0/59) [01:16<?, ?hosts/s]
100.0% (59/59) success ratio (>= 100.0% threshold) for command: 'journalctl | grep OOM | wc'.
100.0% (59/59) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

It's interesting, though, that some nodes (like worker-nfs-1) had far higher counts than most, while there are nodes with comparable (or even higher) counts, like worker-nfs-27, that did not have NFS issues.

I think the high counts might be due to jobs getting restarted indefinitely on the same worker, dying over and over and inflating the counts on whichever workers they run on (just a guess though).
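
That guess could be checked by grouping the kernel OOM messages by the name of the killed process, for example (a sketch; the exact message format varies slightly across kernel versions):

# count OOM kills per command name on this worker
journalctl -k | grep -oP 'Killed process \d+ \(\K[^)]+' | sort | uniq -c | sort -rn | head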

It did not happen again; I'll open a new task and continue debugging if the issue reappears.

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 09) board.