Description
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | dcaro | T335009 Toolforge grid seems overloaded
Resolved | | dcaro | T335336 [toolschecker] jobs mtime check is flapping
Declined | | None | T336034 Toolforge grid automation: consider creating a cookbook to heal the grid from D state procs
Event Timeline
Example from toolschecker, which has been alerting since 12:15 AM UTC / 20:15 ET:
tools.toolschecker@tools-sgebastion-10:~$ qstat
job-ID  prior    name        user          state  submit/start at      queue                            slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 586699 0.35210  test-long-  tools.toolsc  Rr     04/03/2023 13:57:40  continuous@tools-sgeexec-10-13       1
9999836 0.27349  test-long-  tools.toolsc  r      04/03/2023 14:32:16  continuous@tools-sgeexec-10-8.       1
  42207 0.25001  toolscheck  tools.toolsc  qw     04/19/2023 04:10:06                                       1

tools.toolschecker@tools-sgebastion-10:~$ qstat -j 42207
==============================================================
job_number:          42207
exec_file:           job_scripts/42207
submission_time:     Wed Apr 19 04:10:06 2023
owner:               tools.toolschecker
uid:                 52524
group:               tools.toolschecker
gid:                 52524
sge_o_home:          /data/project/toolschecker
sge_o_log_name:      tools.toolschecker
sge_o_path:          /usr/bin:/bin
sge_o_shell:         /bin/sh
sge_o_workdir:       /mnt/nfs/labstore-secondary-tools-project/toolschecker
sge_o_host:          tools-sgecron-2
account:             sge
stderr_path_list:    NONE:NONE:/data/project/toolschecker/logs/crontest.log
merge:               y
hard resource_list:  h_vmem=524288k,release=buster
mail_list:           tools.toolschecker@toolforge.org
notify:              FALSE
job_name:            toolschecker.crontest
stdout_path_list:    NONE:NONE:/data/project/toolschecker/logs/crontest.log
jobshare:            0
hard_queue_list:     task
env_list:            TERM=NONE
job_args:            /data/project/toolschecker/crontest.txt
script_file:         /usr/bin/touch
binding:             NONE
job_type:            binary
scheduling info:     queue instance "continuous@tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.282500 (= 3.282500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                     queue instance "continuous@tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.920000 (= 2.920000 + 0.50 * 0.000000 with nproc=4) >= 1.75
                     queue instance "continuous@tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.567500 (= 3.567500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                     queue instance "continuous@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.797500 (= 2.797500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                     queue instance "task@tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.282500 (= 3.282500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                     queue instance "task@tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.920000 (= 2.920000 + 0.50 * 0.000000 with nproc=4) >= 1.75
                     queue instance "task@tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.567500 (= 3.567500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                     queue instance "task@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.797500 (= 2.797500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                     queue instance "webgrid-lighttpd@tools-sgeweblight-10-24.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.770000 (= 2.770000 + 0.50 * 0.000000 with nproc=4) >= 2.75
                     queue instance "webgrid-lighttpd@tools-sgeweblight-10-22.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.787500 (= 2.787500 + 0.50 * 0.000000 with nproc=4) >= 2.75
                     cannot run in queue "webgrid-lighttpd" because it is not contained in its hard queue list (-q)
                     cannot run in queue "webgrid-generic" because it is not contained in its hard queue list (-q)
                     (-l h_vmem=524288k,release=buster) cannot run at host "tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=146.312M
                     (-l h_vmem=524288k,release=buster) cannot run at host "tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=122.875M
                     (-l h_vmem=524288k,release=buster) cannot run at host "tools-sgeexec-10-22.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=171.703M
                     (-l h_vmem=524288k,release=buster) cannot run at host "tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=13.500M
The grid seems to have stabilized, so I will close this one, but there's a child task to continue investigating some of the NFS issues.
Waiting over 10 minutes for this job to start today:
Noticed that toolschecker also started alerting around the same time (timestamps EDT):
02:13:36 <+icinga-wm> PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
02:15:08 <+icinga-wm> RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
02:27:34 <+icinga-wm> PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
02:28:20 <+icinga-wm> RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 4.691 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
02:32:20 <+icinga-wm> RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
A quick look at the hosts from the master seems to point to tools-sgeexec-10-16:
# qhost
...
tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud lx-amd64  4  4  4  4  11.35  7.8G  2.4G  8.0G  102.2M
...
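For reference, a quick way to rank exec nodes by load from the master could look like the sketch below; this assumes the LOAD column is the 7th field of qhost output and that the first three lines are headers, which may differ between gridengine versions:

# Rank hosts by load average, highest first (column position is an assumption).
qhost | tail -n +4 | sort -rn -k7 | head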
looking
Capturing here the kernel stack trace from tools-sgeexec-10-16:
[Sat Apr 29 06:32:34 2023] INFO: task python:23958 blocked for more than 120 seconds.
[Sat Apr 29 06:32:34 2023]       Not tainted 4.19.0-23-cloud-amd64 #1 Debian 4.19.269-1
[Sat Apr 29 06:32:34 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Apr 29 06:32:34 2023] python          D    0 23958  23857 0x00000000
[Sat Apr 29 06:32:34 2023] Call Trace:
[Sat Apr 29 06:32:34 2023]  __schedule+0x29f/0x840
[Sat Apr 29 06:32:34 2023]  ? bit_wait_timeout+0x90/0x90
[Sat Apr 29 06:32:34 2023]  schedule+0x28/0x80
[Sat Apr 29 06:32:34 2023]  io_schedule+0x12/0x40
[Sat Apr 29 06:32:34 2023]  bit_wait_io+0xd/0x50
[Sat Apr 29 06:32:34 2023]  __wait_on_bit+0x73/0x90
[Sat Apr 29 06:32:34 2023]  out_of_line_wait_on_bit+0x91/0xb0
[Sat Apr 29 06:32:34 2023]  ? init_wait_var_entry+0x40/0x40
[Sat Apr 29 06:32:34 2023]  nfs_lock_and_join_requests+0x329/0x4e0 [nfs]
[Sat Apr 29 06:32:34 2023]  nfs_updatepage+0x302/0x940 [nfs]
[Sat Apr 29 06:32:34 2023]  ? nfs_flush_incompatible+0x157/0x160 [nfs]
[Sat Apr 29 06:32:34 2023]  nfs_write_end+0x63/0x4c0 [nfs]
[Sat Apr 29 06:32:34 2023]  ? iov_iter_copy_from_user_atomic+0xc3/0x300
[Sat Apr 29 06:32:34 2023]  generic_perform_write+0x141/0x1c0
[Sat Apr 29 06:32:34 2023]  nfs_file_write+0xdc/0x210 [nfs]
[Sat Apr 29 06:32:34 2023]  new_sync_write+0xfb/0x160
[Sat Apr 29 06:32:34 2023]  vfs_write+0xa5/0x1a0
[Sat Apr 29 06:32:34 2023]  ksys_write+0x57/0xd0
[Sat Apr 29 06:32:34 2023]  do_syscall_64+0x50/0xf0
[Sat Apr 29 06:32:34 2023]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[Sat Apr 29 06:32:34 2023] RIP: 0033:0x14d8047303b4
[Sat Apr 29 06:32:34 2023] Code: Bad RIP value.
[Sat Apr 29 06:32:34 2023] RSP: 002b:00007ffec80a0448 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[Sat Apr 29 06:32:34 2023] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 000014d8047303b4
[Sat Apr 29 06:32:34 2023] RDX: 0000000000000006 RSI: 000014d7e1ed5a14 RDI: 0000000000000002
[Sat Apr 29 06:32:34 2023] RBP: 000014d7e1ed5a14 R08: 000014d8048028b0 R09: 0000000000000000
[Sat Apr 29 06:32:34 2023] R10: 0000000000000018 R11: 0000000000000246 R12: 000014d804801680
[Sat Apr 29 06:32:34 2023] R13: 0000000000000006 R14: 000014d8047fc760 R15: 0000000000000006
[Sat Apr 29 06:32:54 2023] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
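For the record, these traces come from the kernel's hung-task detector, which complains about tasks stuck in uninterruptible (D) sleep for more than 120 seconds. A minimal sketch of how to pull them on a node (the sysrq variant assumes sysrq is enabled and you are root):

# Recent hung-task reports from the kernel ring buffer:
dmesg -T | grep -A 5 'blocked for more than'
# Or dump the stacks of all currently blocked (D state) tasks via sysrq, then read dmesg:
echo w > /proc/sysrq-trigger
dmesg -T | tail -n 200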
So yep, NFS seems stuck, and it seems it has been happening for a few days; there's this message from the last boot:
[Tue Apr 25 12:39:49 2023] FS-Cache: Duplicate cookie detected
[Tue Apr 25 12:39:49 2023] FS-Cache: O-cookie c=0000000064e0ee72 [p=00000000fe0209fa fl=222 nc=0 na=1]
[Tue Apr 25 12:39:49 2023] FS-Cache: O-cookie d=0000000069901e35 n=00000000029ec6dc
[Tue Apr 25 12:39:49 2023] FS-Cache: O-key=[16] '040000000000000002000801ac10070e'
[Tue Apr 25 12:39:49 2023] FS-Cache: N-cookie c=0000000096ac7cac [p=00000000fe0209fa fl=2 nc=0 na=1]
[Tue Apr 25 12:39:49 2023] FS-Cache: N-cookie d=0000000069901e35 n=000000004df46263
[Tue Apr 25 12:39:49 2023] FS-Cache: N-key=[16] '040000000000000002000801ac10070e'
And a few days later, the stack trace from the previous comment starts appearing.
On the graphs it seems that around the 29th of April it started increasing, and between the 1st and 2nd of May it jumped again:
{F36976612}
Probably processes getting stuck on IO
Mentioned in SAL (#wikimedia-cloud) [2023-05-05T09:13:19Z] <dcaro> rebooted tools-sgeexec-10-16 as it was stuck (T335009)
@dcaro found https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html, which explains why a D-state process results in increased loadavg on Linux servers.
If some NFS hiccup (otherwise harmless) results in D-state processes on the exec nodes, the load avg goes up as a result, and if the grid schedules jobs based on that load avg (just a theory at this point), then the failure mode is clear:
Any NFS hiccup (otherwise harmless) can result in the Grid becoming unavailable and/or unreliable.
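A minimal sketch of how such a hunt could look on an exec node (not an existing cookbook): list processes in uninterruptible sleep together with the kernel wait channel, which for this failure mode typically points at NFS:

# Processes in uninterruptible sleep (state D) and the kernel function they wait in;
# NFS-related wait channels (nfs_*, rpc_*) are the usual suspects here.
ps -eo pid,stat,wchan:40,comm | awk 'NR == 1 || $2 ~ /^D/'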
And it goes to confirm what we already knew: Grid and NFS are unsustainable old technologies that should be deprecated from the WMCS offerings.
Yep, the first time I had this issue was in my first job >15 years ago xd, we extensively used NFS too.
If some NFS hiccup (otherwise harmless) results in D-state processes on the exec nodes, the load avg goes up as a result, and if the grid schedules jobs based on that load avg (just a theory at this point)
I think the log in the first comment confirms this:
scheduling info: queue instance "continuous@tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.282500 (= 3.282500 + 0.50 * 0.000000 with nproc=4) >= 1.75
And it goes to confirm what we already knew: Grid and NFS are unsustainable old technologies that should be deprecated from the WMCS offerings.
👍
This was due to NFS hiccups getting processes stuck. We know about this issue and now have a dashboard to help hunt down those processes when it happens:
https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools
New improvements will be in a new task.