
Toolforge grid seems overloaded
Closed, Resolved (Public)

Description

Anecdotally, it seems like the grid is overloaded and jobs are sitting in the queue for a while. For example, P47169 was waiting for over an hour when it would normally execute instantly. Here's another one that's still waiting: P47179.

Event Timeline

Example from toolschecker, which has been alerting since 12:15 AM UTC / 20:15 ET:

tools.toolschecker@tools-sgebastion-10:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 586699 0.35210 test-long- tools.toolsc Rr    04/03/2023 13:57:40 continuous@tools-sgeexec-10-13     1        
9999836 0.27349 test-long- tools.toolsc r     04/03/2023 14:32:16 continuous@tools-sgeexec-10-8.     1        
  42207 0.25001 toolscheck tools.toolsc qw    04/19/2023 04:10:06                                    1        
tools.toolschecker@tools-sgebastion-10:~$ qstat -j 42207
==============================================================
job_number:                 42207
exec_file:                  job_scripts/42207
submission_time:            Wed Apr 19 04:10:06 2023
owner:                      tools.toolschecker
uid:                        52524
group:                      tools.toolschecker
gid:                        52524
sge_o_home:                 /data/project/toolschecker
sge_o_log_name:             tools.toolschecker
sge_o_path:                 /usr/bin:/bin
sge_o_shell:                /bin/sh
sge_o_workdir:              /mnt/nfs/labstore-secondary-tools-project/toolschecker
sge_o_host:                 tools-sgecron-2
account:                    sge
stderr_path_list:           NONE:NONE:/data/project/toolschecker/logs/crontest.log
merge:                      y
hard resource_list:         h_vmem=524288k,release=buster
mail_list:                  tools.toolschecker@toolforge.org
notify:                     FALSE
job_name:                   toolschecker.crontest
stdout_path_list:           NONE:NONE:/data/project/toolschecker/logs/crontest.log
jobshare:                   0
hard_queue_list:            task
env_list:                   TERM=NONE
job_args:                   /data/project/toolschecker/crontest.txt
script_file:                /usr/bin/touch
binding:                    NONE
job_type:                   binary
scheduling info:            queue instance "continuous@tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.282500 (= 3.282500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "continuous@tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.920000 (= 2.920000 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "continuous@tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.567500 (= 3.567500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "continuous@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.797500 (= 2.797500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "task@tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.282500 (= 3.282500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "task@tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.920000 (= 2.920000 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "task@tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.567500 (= 3.567500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "task@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.797500 (= 2.797500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "webgrid-lighttpd@tools-sgeweblight-10-24.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.770000 (= 2.770000 + 0.50 * 0.000000 with nproc=4) >= 2.75
                            queue instance "webgrid-lighttpd@tools-sgeweblight-10-22.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.787500 (= 2.787500 + 0.50 * 0.000000 with nproc=4) >= 2.75
                            cannot run in queue "webgrid-lighttpd" because it is not contained in its hard queue list (-q)
                            cannot run in queue "webgrid-generic" because it is not contained in its hard queue list (-q)
                            (-l h_vmem=524288k,release=buster) cannot run at host "tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=146.312M
                            (-l h_vmem=524288k,release=buster) cannot run at host "tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=122.875M
                            (-l h_vmem=524288k,release=buster) cannot run at host "tools-sgeexec-10-22.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=171.703M
                            (-l h_vmem=524288k,release=buster) cannot run at host "tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=13.500M
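For reference on the "dropped because it is overloaded" lines above: np_load_avg is the host's load average divided by its core count (plus the "+ 0.50 * ..." job-load adjustment term), and the scheduler stops considering a queue instance once that value reaches the configured threshold, which going by this output is 1.75 for the task/continuous queues and 2.75 for webgrid. A rough sketch of the check, with names and thresholds taken from the output above rather than from the actual grid configuration:

# Rough sketch of the overload check shown in the scheduling info above.
# Assumptions: np_load_avg = load_avg / nproc + 0.50 * job_adjustment (the
# "(= x + 0.50 * y with nproc=4)" term), thresholds 1.75 / 2.75 as printed.

def np_load_avg(load_avg: float, nproc: int, job_adjustment: float = 0.0) -> float:
    """Per-core ('normalized') load average as reported by qstat."""
    return load_avg / nproc + 0.50 * job_adjustment

def is_overloaded(load_avg: float, nproc: int, threshold: float) -> bool:
    """True if the scheduler would drop this queue instance."""
    return np_load_avg(load_avg, nproc) >= threshold

# tools-sgeexec-10-17: np_load_avg 3.2825 on 4 cores implies a raw load of ~13.1
print(is_overloaded(13.13, 4, threshold=1.75))  # True -> queue instance dropped
# webgrid nodes use a higher threshold but are still over it here
print(is_overloaded(11.08, 4, threshold=2.75))  # True

In other words, a 4-core exec node whose load average climbs past about 7 becomes invisible to the scheduler even if its CPUs are mostly idle.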
dcaro changed the task status from Open to In Progress. Apr 25 2023, 8:56 AM
dcaro claimed this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.
dcaro added a project: User-dcaro.

The grid seems to have stabilized, so I will close this one, but there's a child task to continue investigating some of the NFS issues.

Waiting over 10 minutes for this job to start today:

tools.poty-stuff@tools-sgebastion-10:~$ qstat -j 493364
==============================================================
job_number:                 493364
exec_file:                  job_scripts/493364
submission_time:            Fri May 5 06:24:47 2023
owner:                      tools.poty-stuff
uid:                        55366
group:                      tools.poty-stuff
gid:                        55366
sge_o_home:                 /data/project/poty-stuff
sge_o_log_name:             tools.poty-stuff
sge_o_path:                 /data/project/rustup/rustup/.cargo/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /mnt/nfs/labstore-secondary-tools-project/poty-stuff/www/rust
sge_o_host:                 tools-sgebastion-10
account:                    sge
cwd:                        /data/project/poty-stuff/www/rust
stderr_path_list:           NONE:NONE:/data/project/poty-stuff/build.err
hard resource_list:         h_vmem=2097152k,release=buster
mail_list:                  tools.poty-stuff@toolforge.org
notify:                     FALSE
job_name:                   build
stdout_path_list:           NONE:NONE:/data/project/poty-stuff/build.out
jobshare:                   0
hard_queue_list:            task
env_list:                   TERM=NONE
job_args:                   build,--release
script_file:                /data/project/rustup/rustup/.cargo/bin/cargo
binding:                    NONE
job_type:                   binary
scheduling info:            queue instance "continuous@tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.842500 (= 2.842500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "task@tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=2.842500 (= 2.842500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "continuous@tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud" dropped because it is disabled
                            queue instance "task@tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud" dropped because it is disabled
                            cannot run in queue "webgrid-lighttpd" because it is not contained in its hard queue list (-q)
                            cannot run in queue "webgrid-generic" because it is not contained in its hard queue list (-q)
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=13.500M
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=423.656M
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=681.469M
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-22.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=615.062M
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=99.438M
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-15.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=545.727M
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-18.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=134.594M
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=462.719M
                            (-l h_vmem=2097152k,release=buster) cannot run at host "tools-sgeexec-10-19.tools.eqiad1.wikimedia.cloud" because it offers only hc:h_vmem=400.219M

Noticed that toolschecker also started alerting around the same time (timestamps EDT):

02:13:36 <+icinga-wm> PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
02:15:08 <+icinga-wm> RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
02:27:34 <+icinga-wm> PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
02:28:20 <+icinga-wm> RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 4.691 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
02:32:20 <+icinga-wm> RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
dcaro reopened this task as Open. Edited May 5 2023, 8:00 AM

Yep, noticed too, reopening, thanks!

A quick look at the hosts from the grid master seems to point to tools-sgeexec-10-16:

# qhost
...
tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4 11.35    7.8G    2.4G    8.0G  102.2M
...

looking

Capturing here the kernel stack trace from tools-sgeexec-10-16:

[Sat Apr 29 06:32:34 2023] INFO: task python:23958 blocked for more than 120 seconds.
[Sat Apr 29 06:32:34 2023]       Not tainted 4.19.0-23-cloud-amd64 #1 Debian 4.19.269-1
[Sat Apr 29 06:32:34 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Apr 29 06:32:34 2023] python          D    0 23958  23857 0x00000000
[Sat Apr 29 06:32:34 2023] Call Trace:
[Sat Apr 29 06:32:34 2023]  __schedule+0x29f/0x840
[Sat Apr 29 06:32:34 2023]  ? bit_wait_timeout+0x90/0x90
[Sat Apr 29 06:32:34 2023]  schedule+0x28/0x80
[Sat Apr 29 06:32:34 2023]  io_schedule+0x12/0x40
[Sat Apr 29 06:32:34 2023]  bit_wait_io+0xd/0x50
[Sat Apr 29 06:32:34 2023]  __wait_on_bit+0x73/0x90
[Sat Apr 29 06:32:34 2023]  out_of_line_wait_on_bit+0x91/0xb0
[Sat Apr 29 06:32:34 2023]  ? init_wait_var_entry+0x40/0x40
[Sat Apr 29 06:32:34 2023]  nfs_lock_and_join_requests+0x329/0x4e0 [nfs]
[Sat Apr 29 06:32:34 2023]  nfs_updatepage+0x302/0x940 [nfs]
[Sat Apr 29 06:32:34 2023]  ? nfs_flush_incompatible+0x157/0x160 [nfs]
[Sat Apr 29 06:32:34 2023]  nfs_write_end+0x63/0x4c0 [nfs]
[Sat Apr 29 06:32:34 2023]  ? iov_iter_copy_from_user_atomic+0xc3/0x300
[Sat Apr 29 06:32:34 2023]  generic_perform_write+0x141/0x1c0
[Sat Apr 29 06:32:34 2023]  nfs_file_write+0xdc/0x210 [nfs]
[Sat Apr 29 06:32:34 2023]  new_sync_write+0xfb/0x160
[Sat Apr 29 06:32:34 2023]  vfs_write+0xa5/0x1a0
[Sat Apr 29 06:32:34 2023]  ksys_write+0x57/0xd0
[Sat Apr 29 06:32:34 2023]  do_syscall_64+0x50/0xf0
[Sat Apr 29 06:32:34 2023]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[Sat Apr 29 06:32:34 2023] RIP: 0033:0x14d8047303b4
[Sat Apr 29 06:32:34 2023] Code: Bad RIP value.
[Sat Apr 29 06:32:34 2023] RSP: 002b:00007ffec80a0448 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[Sat Apr 29 06:32:34 2023] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 000014d8047303b4
[Sat Apr 29 06:32:34 2023] RDX: 0000000000000006 RSI: 000014d7e1ed5a14 RDI: 0000000000000002
[Sat Apr 29 06:32:34 2023] RBP: 000014d7e1ed5a14 R08: 000014d8048028b0 R09: 0000000000000000
[Sat Apr 29 06:32:34 2023] R10: 0000000000000018 R11: 0000000000000246 R12: 000014d804801680
[Sat Apr 29 06:32:34 2023] R13: 0000000000000006 R14: 000014d8047fc760 R15: 0000000000000006
[Sat Apr 29 06:32:54 2023] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying

So yep, NFS seems stuck. It looks like this has been happening for a few days; there's this message from the last boot:

[Tue Apr 25 12:39:49 2023] FS-Cache: Duplicate cookie detected
[Tue Apr 25 12:39:49 2023] FS-Cache: O-cookie c=0000000064e0ee72 [p=00000000fe0209fa fl=222 nc=0 na=1]
[Tue Apr 25 12:39:49 2023] FS-Cache: O-cookie d=0000000069901e35 n=00000000029ec6dc
[Tue Apr 25 12:39:49 2023] FS-Cache: O-key=[16] '040000000000000002000801ac10070e'
[Tue Apr 25 12:39:49 2023] FS-Cache: N-cookie c=0000000096ac7cac [p=00000000fe0209fa fl=2 nc=0 na=1]
[Tue Apr 25 12:39:49 2023] FS-Cache: N-cookie d=0000000069901e35 n=000000004df46263
[Tue Apr 25 12:39:49 2023] FS-Cache: N-key=[16] '040000000000000002000801ac10070e'

And a few days later the stack trace from the previous comment starts appearing.

On the graphs it seems that it started increasing around the 29th of April, and between the 1st and 2nd of May it jumped again:

{F36976612}

Probably processes getting stuck on IO.
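One quick way to check that (a sketch of the idea, not necessarily what was run on the host): list the processes sitting in uninterruptible sleep ('D' state) on the exec node, since those are the ones blocked on the unresponsive NFS server and the ones pushing the load average up:

# Sketch: find processes in uninterruptible sleep ('D'), e.g. stuck on NFS I/O.
# Intended to be run on the affected exec node (here tools-sgeexec-10-16).
import glob

def pid_comm_state(stat_path):
    """Parse pid, command name and state from /proc/<pid>/stat."""
    with open(stat_path) as f:
        data = f.read()
    pid = data.split(" ", 1)[0]
    comm = data[data.index("(") + 1:data.rindex(")")]
    state = data[data.rindex(")") + 2]    # single state character after ") "
    return pid, comm, state

for stat_path in glob.glob("/proc/[0-9]*/stat"):
    try:
        pid, comm, state = pid_comm_state(stat_path)
    except (OSError, ValueError):
        continue                          # process exited while we were scanning
    if state == "D":                      # uninterruptible sleep -> counts toward loadavg
        print(f"{pid}\t{comm}")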

Mentioned in SAL (#wikimedia-cloud) [2023-05-05T09:13:19Z] <dcaro> rebooted tools-sgeexec-10-16 as it was stuck (T335009)

@dcaro found https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html that explains why a D-state process results in increased loadavg on Linux servers.

If some NFS hiccup (otherwise harmless) results in D-state processes on the exec nodes, the load average goes up as a result, and the grid schedules jobs based on that load average (just a theory at this point), then the failure mode is clear:

Any NFS hiccup (otherwise harmless) can result in the Grid becoming unavailable and/or unreliable.
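To put rough numbers on that (back-of-the-envelope, assuming the 4-core exec nodes and the 1.75 threshold from the scheduling output above, and that each continuously blocked D-state task contributes about 1 to the load average):

# With 4 cores and a 1.75 np_load_avg threshold, ~7 tasks stuck in D state are
# already enough to get the host dropped, even with the CPUs completely idle.
NPROC = 4
THRESHOLD = 1.75          # from the "dropped because it is overloaded" messages

stuck_tasks = 7           # hypothetical count of processes blocked on NFS
np_load_avg = stuck_tasks / NPROC
print(np_load_avg, np_load_avg >= THRESHOLD)   # 1.75 True -> exec node dropped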

And it goes to confirm what we already knew: the grid and NFS are unsustainable old technologies that should be deprecated from the WMCS offerings.

@dcaro found https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html that explains why a D-state process results in increased loadavg on Linux servers.

Yep, the first time I hit this issue was at my first job >15 years ago xd, we used NFS extensively there too.

If some NFS hiccup (otherwise harmless) results in D-state processes on the exec nodes, the load average goes up as a result, and the grid schedules jobs based on that load average (just a theory at this point)

I think the log in the first comment confirms this:

scheduling info:            queue instance "continuous@tools-sgeexec-10-17.tools.eqiad1.wikimedia.cloud" dropped because it is overloaded: np_load_avg=3.282500 (= 3.282500 + 0.50 * 0.000000 with nproc=4) >= 1.75

And it goes to confirm what we already knew: the grid and NFS are unsustainable old technologies that should be deprecated from the WMCS offerings.

👍

This was due to NFS hiccups getting processes stuck. We know about this issue and now have a dashboard to help hunt down those processes when it happens.

https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools

Further improvements will be tracked in a new task.