[toolschecker] jobs mtime check is flapping
Closed, Resolved (Public)

Description

From emails (we got ~100 since yesterday):

From:	nagios@alert1001.wikimedia.org
Reply-To:	cloud-admin-feed@lists.wikimedia.org
To:	cloud-admin-feed@lists.wikimedia.org
Subject:	[Cloud-admin-feed] ** PROBLEM alert - checker.tools.wmflabs.org/toolschecker: check mtime mod from tools cron job is CRITICAL **
Date:	Tue, 25 Apr 2023 02:46:22 +0000 (04/25/2023 04:46:22 AM)
	
Notification Type: PROBLEM

Service: toolschecker: check mtime mod from tools cron job
Host: checker.tools.wmflabs.org
Address: checker.tools.wmflabs.org
State: CRITICAL

Date/Time: Tue Apr 25 02:46:22 UTC 2023

Notes URLs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

Acknowledged by : 

Additional Info:

HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.006 second response time
_______________________________________________
Cloud-admin-feed mailing list -- cloud-admin-feed@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud-admin-feed.lists.wikimedia.org/
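
For context, the alert above fires when the `/cron` endpoint does not return a body containing the string `OK`. The following is a minimal sketch of that kind of check, not the actual Nagios plugin or the toolschecker code; the function names `evaluate_check` and `probe` are hypothetical and the state strings only approximate the alert text.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def evaluate_check(status: int, body: str, expected: str = "OK") -> str:
    """Map an HTTP status and body to a Nagios-style state string."""
    if status != 200:
        return f"CRITICAL: HTTP {status}"
    if expected not in body:
        return f"CRITICAL: string {expected} not found"
    return "OK"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL and evaluate it; network errors count as CRITICAL."""
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_check(response.status, body)
    except HTTPError as error:
        # urlopen raises on 4xx/5xx responses; the status is on the exception.
        return evaluate_check(error.code, "")
    except URLError as error:
        return f"CRITICAL: {error.reason}"
```

Calling `probe("http://checker.tools.wmflabs.org:80/cron")` would exercise the same URL as the alert; a 503 response, as in the email, ends up in the first `CRITICAL` branch.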

The queued jobs jumped again:

image.png (1×2 px, 199 KB)

Event Timeline

dcaro changed the task status from Open to In Progress. Apr 25 2023, 8:55 AM
dcaro triaged this task as High priority.
dcaro created this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.

It seems there are a few queues in an error state; looking into it:

###### Nodes
tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud: !!python/object:wmcs_libs.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '10.39'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.3G
  name: tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    continuous: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: continuous
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '0'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'CHECKPOINTING'
    task: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: task
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '5'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'INTERACTIVE'
  swap_total: 24.0M
  swap_used: 24.0M
tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud: !!python/object:wmcs_libs.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '13.22'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.1G
  name: tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    continuous: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: continuous
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '10'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'CHECKPOINTING'
    task: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: task
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '0'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'INTERACTIVE'
  swap_total: 24.0M
  swap_used: 23.7M
tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud: !!python/object:wmcs_libs.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '12.05'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 2.7G
  name: tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    continuous: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: continuous
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '7'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'CHECKPOINTING'
    task: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: task
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '4'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'INTERACTIVE'
  swap_total: 8.0G
  swap_used: 68.0M
tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud: !!python/object:wmcs_libs.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '13.26'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.2G
  name: tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    continuous: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: continuous
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '7'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'CHECKPOINTING'
    task: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: task
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '5'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'INTERACTIVE'
  swap_total: 8.0G
  swap_used: 137.8M
tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud: !!python/object:wmcs_libs.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '10.63'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.3G
  name: tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    continuous: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: continuous
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '10'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'CHECKPOINTING'
    task: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: task
      slots: '50'
      slots_resv: '0'
      slots_total: null
      slots_used: '4'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
      - !GridQueueType 'INTERACTIVE'
  swap_total: 24.0M
  swap_used: 24.0M
tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud: !!python/object:wmcs_libs.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '15.32'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 2.2G
  name: tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    webgrid-generic: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: webgrid-generic
      slots: '256'
      slots_resv: '0'
      slots_total: null
      slots_used: '13'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
  swap_total: 24.0M
  swap_used: '0.0'
tools-sgeweblight-10-22.tools.eqiad1.wikimedia.cloud: !!python/object:wmcs_libs.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '22.03'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.9G
  name: tools-sgeweblight-10-22.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    webgrid-lighttpd: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: webgrid-lighttpd
      slots: '256'
      slots_resv: '0'
      slots_total: null
      slots_used: '19'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
  swap_total: 8.0G
  swap_used: 15.0M
tools-sgeweblight-10-24.tools.eqiad1.wikimedia.cloud: !!python/object:wmcs_libs.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '12.03'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.8G
  name: tools-sgeweblight-10-24.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    webgrid-lighttpd: !!python/object:wmcs_libs.grid.GridQueueInfo
      arch: null
      load_avg: null
      messages: null
      name: webgrid-lighttpd
      slots: '256'
      slots_resv: '0'
      slots_total: null
      slots_used: '17'
      states: !GridQueueStatesSet
      - !GridQueueState 'ALARM1'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
  swap_total: 8.0G
  swap_used: '0.0'

###### Failed queues extended info
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '10.39000'
  messages: null
  name: task@tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '5'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'INTERACTIVE'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '13.22000'
  messages: null
  name: task@tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '0'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'INTERACTIVE'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '12.05000'
  messages: null
  name: task@tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '4'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'INTERACTIVE'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '13.26000'
  messages: null
  name: task@tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '5'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'INTERACTIVE'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '10.63000'
  messages: null
  name: task@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '4'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'INTERACTIVE'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '10.39000'
  messages: null
  name: continuous@tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '0'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'CHECKPOINTING'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '13.22000'
  messages: null
  name: continuous@tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '10'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'CHECKPOINTING'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '12.05000'
  messages: null
  name: continuous@tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '7'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'CHECKPOINTING'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '13.26000'
  messages: null
  name: continuous@tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '7'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'CHECKPOINTING'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '10.63000'
  messages: null
  name: continuous@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '50'
  slots_used: '10'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
  - !GridQueueType 'CHECKPOINTING'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '15.32000'
  messages: null
  name: webgrid-generic@tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '256'
  slots_used: '13'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '22.02000'
  messages: null
  name: webgrid-lighttpd@tools-sgeweblight-10-22.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '256'
  slots_used: '19'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'
- !!python/object:wmcs_libs.grid.GridQueueInfo
  arch: lx-amd64
  load_avg: '12.03000'
  messages: null
  name: webgrid-lighttpd@tools-sgeweblight-10-24.tools.eqiad1.wikimedia.cloud
  slots: null
  slots_resv: '0'
  slots_total: '256'
  slots_used: '17'
  states: !GridQueueStatesSet
  - !GridQueueState 'ALARM1'
  types: !GridQueueTypesSet
  - !GridQueueType 'BATCH'

###### Failed jobs logs
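
To turn a dump like the one above into a list of hosts to look at, one option is to pull the `queue@host` pairs out of the `name:` lines of the "Failed queues extended info" section. This is a rough sketch that works on the plain text rather than loading the YAML (the `!!python/object` tags would need the original classes); the helper name `hosts_with_failed_queues` is made up for illustration.

```python
import re

# Matches lines such as "  name: task@tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud"
QUEUE_NAME = re.compile(r"^\s*name:\s+(?P<queue>[\w-]+)@(?P<host>\S+)\s*$")


def hosts_with_failed_queues(report: str) -> dict[str, list[str]]:
    """Group the failed queue names by the host they run on."""
    hosts: dict[str, list[str]] = {}
    for line in report.splitlines():
        match = QUEUE_NAME.match(line)
        if match:
            hosts.setdefault(match["host"], []).append(match["queue"])
    return hosts
```

Lines without an `@` in the name (the per-node `name: task` entries in the "Nodes" section) are ignored, so the whole report can be passed in as is.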

tools-sgewebgen-10-2

It seems it had issues with NFS at some point:

[Fri Apr 21 01:05:01 2023] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
[Fri Apr 21 01:07:59 2023] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Fri Apr 21 01:08:00 2023] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Fri Apr 21 01:08:06 2023] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying

I'll reboot this and any others in a similar situation to force a new NFS session bootstrap.
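
To find the other hosts in a similar situation, the kernel log can be scanned for the same messages. A small sketch, assuming the log text has already been collected (for example from `dmesg -T`); the function name is hypothetical and only the two message forms shown above are handled.

```python
import re

# Kernel messages look like:
#   nfs: server <name> OK
#   nfs: server <name> not responding, still trying
NFS_EVENT = re.compile(r"nfs: server (?P<server>\S+) (?P<state>OK|not responding)")


def nfs_servers_not_responding(kernel_log: str) -> set[str]:
    """Return the servers whose most recent logged state is 'not responding'."""
    last_state: dict[str, str] = {}
    for line in kernel_log.splitlines():
        match = NFS_EVENT.search(line)
        if match:
            last_state[match["server"]] = match["state"]
    return {server for server, state in last_state.items() if state != "OK"}
```

A server that logged "not responding" and later "OK" is not reported, so only hosts that never recovered show up.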

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T11:44:11Z] <dcaro> rebooting tools-sgewebgen-10-2 (T335336)

It seems that they are not mounting the swap properly, which might be one of the reasons for the performance issues:

root@tools-sgewebgen-10-2:~# run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
Info: Applying configuration version '(3ba84e2203) Vgutierrez - varnish: Allow disabling port 80'
Notice: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Mount[mount-swap-big]/ensure: defined 'ensure' as 'defined' (corrective)
Info: Computing checksum on file /etc/fstab
Info: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Mount[mount-swap-big]: Scheduling refresh of Exec[swapon-big]
Info: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Mount[mount-swap-big]: Scheduling refresh of Mount[mount-swap-big]
Notice: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Mount[mount-swap-big]: Triggered 'refresh' from 1 event
Info: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Mount[mount-swap-big]: Scheduling refresh of Exec[swapon-big]
Info: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Mount[mount-swap-big]: Scheduling refresh of Mount[mount-swap-big]
Notice: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Exec[swapon-big]/returns: swapon: /dev/sdb: swapon failed: Device or resource busy
Error: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Exec[swapon-big]: Failed to call refresh: '/sbin/swapon /dev/sdb' returned 255 instead of one of [0]
Error: /Stage[main]/Profile::Toolforge::Grid::Node::All/Cinderutils::Swap[big]/Exec[swapon-big]: '/sbin/swapon /dev/sdb' returned 255 instead of one of [0]
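
One way to see what the kernel currently has enabled as swap, independently of puppet, is to read `/proc/swaps`. The sketch below parses that file's table (header row, then one row per device with the size in KiB in the third column); the function name is made up, and whether `/dev/sdb` is expected to appear there is an assumption based on the puppet output above.

```python
from pathlib import Path


def active_swap_devices(proc_swaps: str) -> dict[str, int]:
    """Map each active swap device or file to its size in KiB."""
    devices: dict[str, int] = {}
    for line in proc_swaps.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 3:
            devices[fields[0]] = int(fields[2])
    return devices


def read_active_swap(path: str = "/proc/swaps") -> dict[str, int]:
    """Convenience wrapper that reads the live table on a Linux host."""
    return active_swap_devices(Path(path).read_text())
```

On a node where only the small 24.0M swap shows up in the grid report, `/dev/sdb` being absent from this table would be consistent with the `swapon` failure above.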

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T12:20:31Z] <dcaro> rebooting tools-sgeweblight-10-24 (T335336)

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T12:22:13Z] <dcaro> rebooting tools-sgeweblight-10-22 (T335336)

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T12:31:42Z] <dcaro> rebooting tools-sgeexec-10-17 (T335336)

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T12:32:07Z] <dcaro> rebooting tools-sgeexec-10-16 (T335336)

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T12:33:10Z] <dcaro> rebooting tools-sgeexec-10-15 (T335336)

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T12:34:06Z] <dcaro> rebooting tools-sgeexec-10-13 (T335336)

All the hosts that had queues in alarm state have been rebooted (they also had "nfs server not responding" logs since their last reboot); let's see if the load improves.

I've rebooted all the hosts that had queues in alarm status. Now I'm looking at the ones with high load; on the one I checked, there are a bunch of processes in "uninterruptible sleep" (D) state:

root@tools-sgeweblight-10-20:~# ps aux | grep ' D '
root 1526 0.0 0.0 6140 884 pts/0 S+ 12:53 0:00 grep --color=auto D
tools.o+ 7366 0.0 0.0 11296 6084 ? D Apr03 0:54 /usr/sbin/lighttpd -f /var/run/lighttpd/ocr4wikisource -D
tools.l+ 7643 0.0 0.0 11296 6160 ? D Apr03 0:54 /usr/sbin/lighttpd -f /var/run/lighttpd/labelimgohs -D
tools.z+ 7909 0.0 0.0 11424 6276 ? D Apr03 0:56 /usr/sbin/lighttpd -f /var/run/lighttpd/zygserv -D
53033 8169 0.0 0.0 11420 6264 ? D Apr03 1:00 /usr/sbin/lighttpd -f /var/run/lighttpd/articles-by-lat-lon-without-images -D
tools.n+ 8508 0.0 0.0 11456 6448 ? D Apr03 0:55 /usr/sbin/lighttpd -f /var/run/lighttpd/nccroptool -D
tools.t+ 9041 0.0 0.0 11424 6036 ? D Apr03 0:53 /usr/sbin/lighttpd -f /var/run/lighttpd/test-vvv -D
tools.r+ 12013 0.0 0.0 11756 6700 ? D Apr03 8:42 /usr/sbin/lighttpd -f /var/run/lighttpd/render-tests -D
tools.r+ 31985 0.0 0.3 48028 32088 ? D Apr06 0:00 /usr/bin/python /data/project/render-tests/public_html/tlgbe/tlgwsgi.py
tools.r+ 31989 0.0 0.3 48032 32132 ? D Apr06 0:00 /usr/bin/python /data/project/render-tests/public_html/tlgbe/tlgwsgi.py
tools.r+ 31992 0.0 0.3 48024 32124 ? D Apr06 0:00 /usr/bin/python /data/project/render-tests/public_html/tlgbe/tlgwsgi.py

That points to NFS; I'll reboot those too.
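
The `grep ' D '` above also matches the grep command itself. A slightly stricter variant, sketched here, is to look only at the STAT column of the `ps aux` output. It assumes the standard eleven-column `ps aux` layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND); the function name is hypothetical.

```python
def uninterruptible_processes(ps_aux_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for rows whose STAT column starts with 'D'."""
    stuck: list[tuple[int, str]] = []
    for line in ps_aux_output.splitlines():
        # Split into at most 11 fields so the command keeps its own spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or a line that is not a process entry
        if fields[7].startswith("D"):
            stuck.append((int(fields[1]), fields[10]))
    return stuck
```

Many long-lived processes stuck in this state on a host that mounts NFS is the pattern that pointed to NFS here.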

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T12:57:23Z] <dcaro> rebooting tools-sgeweblight-10-20 (T335336)

Mentioned in SAL (#wikimedia-cloud) [2023-04-25T12:58:49Z] <dcaro> rebooting tools-sgeweblight-10-21 (T335336)

Rebooted most, if not all, of the machines that were stuck; the loads are going back to normal:

image.png (350×465 px, 80 KB)

Note that the sge-cron had >1200 load xd

root@tools-sgecron-2:~# uptime
 14:25:13 up 22 days, 16 min,  2 users,  load average: 1328.81, 1325.47, 1322.92

I did not notice before because it's not really running a queue, so it does not show on qhost
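
Since hosts like this one do not show up in `qhost`, a check that does not depend on the grid would have to look at the load directly. Below is a minimal sketch that parses `uptime` output and compares the 1-minute load against the CPU count; the function names and the `factor` threshold are made up for illustration. On Linux the load average also counts tasks in uninterruptible sleep, which is how a host blocked on NFS can reach four-digit values.

```python
import re

LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def load_averages(uptime_output: str) -> tuple[float, float, float]:
    """Extract the 1, 5 and 15 minute load averages from `uptime` output."""
    match = LOAD_RE.search(uptime_output)
    if match is None:
        raise ValueError("no load average found in uptime output")
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


def is_overloaded(uptime_output: str, cpus: int, factor: float = 2.0) -> bool:
    """True when the 1-minute load exceeds `factor` times the CPU count."""
    return load_averages(uptime_output)[0] > cpus * factor
```

On the host itself, `os.getloadavg()` and `os.cpu_count()` give the same numbers without parsing text.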

Things seem more stable now:

image.png (1×2 px, 171 KB)

There's still one alert that got triggered overnight, will investigate:
[Cloud-admin-feed] [FIRING:2] ProbeDown tools (ip4 probes/custom http_this_tool_does_not_exist_toolforge_org_ip4 warning prometheus wmcs)

There's one host tools-sgeweblight-10-28 that seems to have gotten stuck since it was rebooted two days ago:

image.png (530×2 px, 374 KB)

(you can also see the sge shadow there, which was not rebooted until earlier today)

In the logs from the last boot, which start at Apr 25 13:03:53, the first entry about NFS being unreachable is at Apr 26 09:29:45:

kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding

That matches the point in the graph above where the load started increasing.

So this is an instance where NFS went away after the last round of reboots :/

@Andrew we should investigate why that happened while it's still more or less recent

Mentioned in SAL (#wikimedia-cloud) [2023-04-28T08:27:06Z] <dcaro> rebooting tools-sgeweblight-10-28 (T335336)

nfs-exportd logs show some dns failures... probably unrelated but I'm going to start with understanding that.

Change 913200 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nfs-exportd: Don't crash out if a dns lookup fails

https://gerrit.wikimedia.org/r/913200

Change 913200 merged by Andrew Bogott:

[operations/puppet@production] nfs-exportd: Don't crash out if a dns lookup fails

https://gerrit.wikimedia.org/r/913200
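
The general idea behind a change like this can be sketched as follows. This is not the code from the patch; it is a generic illustration of skipping hostnames that fail to resolve instead of letting the exception end the process. `resolve_all` and its `resolver` parameter are hypothetical names.

```python
import logging
import socket
from typing import Callable, Iterable


def resolve_all(
    hostnames: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> dict[str, str]:
    """Resolve each hostname, logging and skipping the ones that fail."""
    resolved: dict[str, str] = {}
    for hostname in hostnames:
        try:
            resolved[hostname] = resolver(hostname)
        except OSError as error:  # socket.gaierror is a subclass of OSError
            logging.warning("skipping %s: DNS lookup failed (%s)", hostname, error)
    return resolved
```

The `resolver` argument exists only so the behaviour can be exercised without real DNS; a daemon would use the default.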