
Grid jobs don't run as of Tue, March 19, 2019
Open, High, Public

Description

My jobs don't work.
Tool username: jarbot-iii

Description imported from the dupe task

Neither in wikihistory nor in persondata. Nothing was started there; no *.err and no *.out files have a modification date of today.

Does the cron daemon run?

This "cannot fork" might be related:

tools.persondata@tools-sgebastion-07:~$ cat dewiki_watchdog.err  
/bin/sh: 1: Cannot fork
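For reference, a "Cannot fork" error usually means a process/PID limit was exhausted somewhere. A minimal sketch of where to look, assuming typical Debian/systemd defaults (the cgroup path is illustrative, not confirmed from this host):

```shell
# "Cannot fork" usually means a process/PID limit was hit.
ulimit -u                      # per-user process limit in this shell
cat /proc/sys/kernel/pid_max   # system-wide PID ceiling
# A per-service cap from the cgroup pids controller would look like:
#   cat /sys/fs/cgroup/pids/system.slice/cron.service/pids.max
```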

Event Timeline

Jar created this task. Mar 19 2019, 11:16 AM
Restricted Application added subscribers: alanajjar, Aklapper. Mar 19 2019, 11:16 AM

Same here! Last action at 00:00 UTC.

MarcoAurelio renamed this task from Grid jobs don't run to Grid jobs don't run as of Tue, March 19, 2019.
MarcoAurelio updated the task description. (Show Details)
MarcoAurelio added a subscriber: Wurgl.
alanajjar triaged this task as High priority. Mar 19 2019, 12:13 PM
alanajjar raised the priority of this task from High to Needs Triage.
aborrero triaged this task as High priority.
aborrero added a subscriber: aborrero.

I took a look at the cron servers and found this:

[...]
[Fri Mar 15 10:50:38 2019] INFO: task python:9760 blocked for more than 120 seconds.
[Fri Mar 15 10:50:38 2019]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[Fri Mar 15 10:50:38 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 15 10:50:38 2019] python          D    0  9760   9591 0x00000000
[Fri Mar 15 10:50:38 2019]  ffff9a01c1b64000 ffff9a0105284d80 ffff9a01c8d49100 ffff9a01ffc98980
[Fri Mar 15 10:50:38 2019]  ffff9a00f06170c0 ffffac2a02577b40 ffffffff872144b9 0000000000000000
[Fri Mar 15 10:50:38 2019]  0000000000000000 ffff9a01ffc98980 0007ffffffffffff ffff9a01c8d49100
[Fri Mar 15 10:50:38 2019] Call Trace:
[Fri Mar 15 10:50:38 2019]  [<ffffffff872144b9>] ? __schedule+0x239/0x6f0
[Fri Mar 15 10:50:38 2019]  [<ffffffff87215170>] ? bit_wait+0x50/0x50
[Fri Mar 15 10:50:38 2019]  [<ffffffff872149a2>] ? schedule+0x32/0x80
[Fri Mar 15 10:50:38 2019]  [<ffffffff87217d4d>] ? schedule_timeout+0x1dd/0x380
[Fri Mar 15 10:50:38 2019]  [<ffffffff86d910fe>] ? write_cache_pages+0xfe/0x470
[Fri Mar 15 10:50:38 2019]  [<ffffffffc0679490>] ? nfs_do_writepage+0x260/0x260 [nfs]
[Fri Mar 15 10:50:38 2019]  [<ffffffff86c5af1a>] ? kvm_clock_get_cycles+0x1a/0x20
[Fri Mar 15 10:50:38 2019]  [<ffffffff87215170>] ? bit_wait+0x50/0x50
[Fri Mar 15 10:50:38 2019]  [<ffffffff8721421d>] ? io_schedule_timeout+0x9d/0x100
[Fri Mar 15 10:50:38 2019]  [<ffffffff86cbd047>] ? prepare_to_wait_exclusive+0x57/0x80
[Fri Mar 15 10:50:38 2019]  [<ffffffff87215187>] ? bit_wait_io+0x17/0x60
[Fri Mar 15 10:50:38 2019]  [<ffffffff87214dc6>] ? __wait_on_bit_lock+0x56/0xc0
[Fri Mar 15 10:50:38 2019]  [<ffffffff86d81e62>] ? __lock_page+0x82/0xa0
[Fri Mar 15 10:50:38 2019]  [<ffffffff86cbd3f0>] ? wake_atomic_t_function+0x60/0x60
[Fri Mar 15 10:50:38 2019]  [<ffffffff86d958d7>] ? invalidate_inode_pages2_range+0x287/0x400
[Fri Mar 15 10:50:38 2019]  [<ffffffff86d9308d>] ? pagevec_lookup_tag+0x1d/0x30
[Fri Mar 15 10:50:38 2019]  [<ffffffffc066e6b9>] ? nfs_revalidate_mapping+0x159/0x2e0 [nfs]
[Fri Mar 15 10:50:38 2019]  [<ffffffffc0669c81>] ? nfs_file_read+0x41/0xa0 [nfs]
[Fri Mar 15 10:50:38 2019]  [<ffffffff86e0aacd>] ? new_sync_read+0xdd/0x130
[Fri Mar 15 10:50:38 2019]  [<ffffffff86e0b261>] ? vfs_read+0x91/0x130
[Fri Mar 15 10:50:38 2019]  [<ffffffff86e0c732>] ? SyS_read+0x52/0xc0
[Fri Mar 15 10:50:38 2019]  [<ffffffff86c03b7d>] ? do_syscall_64+0x8d/0xf0
[Fri Mar 15 10:50:38 2019]  [<ffffffff8721924e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[Sat Mar 16 01:02:15 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet OK
[Sat Mar 16 01:02:16 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet OK
[Sat Mar 16 06:25:07 2019] Process accounting resumed
[Sun Mar 17 06:25:05 2019] Process accounting resumed
[Mon Mar 18 06:25:04 2019] Process accounting resumed
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 15:56:59 2019] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Mon Mar 18 23:00:03 2019] qstat[30207]: segfault at 0 ip 00007f98f56cfa8e sp 00007ffcd3e213b0 error 4 in libjemalloc.so.1[7f98f56cb000+34000]
[Mon Mar 18 23:00:04 2019] qsub[30211]: segfault at 0 ip 00007f0222d82a8e sp 00007ffdf75e70f0 error 4 in libjemalloc.so.1[7f0222d7e000+34000]
[Mon Mar 18 23:59:56 2019] cgroup: fork rejected by pids controller in /system.slice/cron.service

My interpretation of this is that NFS is unable to handle the IO load in Toolforge right now.

Also, qstat and qsub segfaulted while allocating memory? The load numbers on the server are really high:

aborrero@tools-sgecron-01:~$ w
 13:38:49 up 5 days, 20:21,  2 users,  load average: 1640.17, 1640.25, 1640.21

Which may mean that the kernel scheduler is full of unhandled NFS IO requests.

I would reboot the cron server and see what happens, while taking a look at NFS to see if we can do something about it being overloaded.
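Processes blocked on a hung NFS mount sit in uninterruptible (D) sleep, and each of them adds 1.0 to the load average without using any CPU, which would explain a loadavg of 1640. A rough sketch for counting them by reading /proc directly:

```shell
# Count processes in uninterruptible (D) sleep; each one inflates the
# load average by 1.0 even though it uses no CPU (typical of hung NFS IO).
n=0
for stat in /proc/[0-9]*/stat; do
  # The state is the field right after the ")" closing the comm field
  # (comm may contain spaces, so don't count fields from the start).
  state=$(sed 's/^.*) //' "$stat" 2>/dev/null | cut -d' ' -f1)
  [ "$state" = "D" ] && n=$((n + 1))
done
echo "$n processes in D (uninterruptible) state"
```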

Jar added a comment. Mar 19 2019, 1:47 PM

Does the cron daemon run?

No, the cron daemon does not work either.

Mentioned in SAL (#wikimedia-cloud) [2019-03-19T13:56:20Z] <arturo> T218649 rebooting tools-sgecron-01

Jar added a comment. Mar 19 2019, 2:16 PM

Now it's working, thank you

TheDJ added subscribers: MarcoAurelio, TheDJ. Edited Mar 19 2019, 2:25 PM

Fully disabled tile rendering for the maps volunteer project for now. Since this project only recently got write access back for many of its files (T218145: maps: take back root owned files/dirs from root_squash protected nfs), and since preliminary investigation with stopping the maps render server and Apache seemed to indicate a possible correlation, it seemed safer to keep rendering off (serving stale tiles from NFS via HTTP is still enabled, though).

Graph of labstore1004 loadavg.

Timeline:
12:55 @MarcoAurelio reports problem
13:00 I note that maps has been pretty busy and is a new NFS variable in the system recently introduced
13:08 I stop maps render_old, renderd and apache server upon request of @aborrero just to check
13:29 I restart maps apache and rendering (outdated requested tiles only)
13:36 I restart the render_old (outdated layer rendering) which feeds renderd
13:46 I stop the render_old
13:56 I restart render_old with ionice -c 3 nice -19 (as it contains a directory walker to check timestamps, which might add to io load)
14:00 All rendering stopped again
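The 13:56 step -- wrapping the walker in ionice -c 3 nice -19 -- puts it in the idle IO class at the lowest CPU priority. A sketch of the same invocation with a placeholder standing in for render_old:

```shell
# Idle IO class (-c 3) plus lowest CPU priority (niceness 19) keeps the
# directory walker from competing with interactive NFS traffic.
# "render_old" itself is replaced by a placeholder command here.
ionice -c 3 nice -n 19 sh -c 'echo "walker running at niceness $(nice)"'
```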

We observe some correlations, but are not entirely sure
13:07 starts a rapid drop of 1min loadavg on 1004 after stopping all maps services
13:33 rise in 1min loadavg, after restarting renderer
13:40 resuming all rendering does seem to make all numbers rise again to previous highs
13:49 sees a spike, but that falls within a timeframe when the maps render_old was actually halted, which should have caused LESS pressure from renderd

We decide to defer the problem for now

Relevant dashboards:

Bstorm added a subscriber: Bstorm. Mar 19 2019, 3:18 PM

This happened last week, but I noticed it then because of the load numbers on tools-sgecron-01 alone (few crons were impacted). Almost all NFS was working well except for a couple of directories (which is bizarre). It required a reboot. If an affected directory is one of the grid directories used by gridengine, the grid engine client dies--which has mostly affected exec nodes rather than submit nodes.

I could find no resolution other than a reboot at that time as well. This may also be related to the LDAP problem, as NFS became less stable overall when those troubles started. While the number of NFS clients has climbed, actual load on NFS is not significantly different (since the number of grid and k8s jobs hasn't changed much). There's a clear load problem on LDAP starting in February; there isn't one on NFS. In fact, NFS load drops a bit on average in March (possibly because of the errors).

I am hoping that yesterday's fixes help, but given the way the errors worked, it will take at least another day or so for locked-up servers to clear. It is just as likely that LDAP is related, as well as the new Debian Stretch client libraries. We already know that the more recent kernels handle mounts poorly.

Also important: the Trusty grid has had problems, but those problems are entirely different. The mounts are just not handled the same way by that old kernel.

Checked the NFS server; the old error I cleared up yesterday didn't recur at the time this server lost its mount.

The only thing that happened there was a puppet run and nfs-exportd running at the same time. nfs-exportd uses LDAP and has had issues periodically. I just had the thought that it could occasionally be blocking tools from mounting if it doesn't get LDAP right. I'm going to read through the script today, I think, to make sure that is not true.

A puppet run would increase load (but this doesn't happen on every puppet run).

The bastion did see the problem at the same time:
Mar 18 15:56:59 tools-sgebastion-07 kernel: [444961.916277] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
However, it didn't have as many problems and didn't segfault gridengine, probably because it wasn't trying to run anything at the time. Checking Trusty hosts.
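A quick way to tally such stalls from a saved kernel log; the first sample line below copies the bastion's message, while the "OK" recovery line is invented for illustration:

```shell
# Count NFS "not responding" stalls in a saved kernel log excerpt.
cat > /tmp/kern_sample.log <<'EOF'
[444961.916277] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[444990.000000] nfs: server nfs-tools-project.svc.eqiad.wmnet OK
EOF
echo "stalls: $(grep -c 'not responding' /tmp/kern_sample.log)"
```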

I see no NFS errors on Trusty, and LDAP is proceeding more easily there.

In all fairness, not all stretch hosts show the NFS issue. It's just some of them.

Here, we see the load rise from the puppet run, but it never really gets bad (this is a 64-core system; over 65 = bad). I don't think load--at least not load average; actual network congestion and such could be a problem--is the issue there. Network was lower than it often is, of course. There was a jump in /dev/sdb's wait time, which is the tools volume (/dev/sdb tools lvm2 a-- 9.09t 1.09t). However, the problem was still localized to some servers and only measurable in milliseconds. That shouldn't explain the problem unless the stretch kernel is insanely sensitive--and only selectively so, since not all stretch clients experienced the brief outage.
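The "over 65 = bad on a 64-core system" rule of thumb is just the load average normalized per core; a minimal check one could run on any of these hosts:

```shell
# Divide the 1-minute load average by the CPU count; sustained values well
# above 1.0 per core indicate a run-queue or IO-wait backlog.
cores=$(getconf _NPROCESSORS_ONLN)
load=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f load per core (%d cores)\n", l / c, c }'
```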

https://grafana.wikimedia.org/d/000000568/labstore1004-1005?orgId=1&from=1552921200000&to=1552924800000

I will say that I fixed the filesystem issue after this event. I am still uncomfortable calling that the final fix just now, though.

Meno25 added a subscriber: Meno25. Mar 23 2019, 8:27 PM

Over the weekend, the NFS issue did not resurface after several attempts at a fix. We may actually have squashed the main problem there. The LDAP issue is still knocking queues offline, but it is far easier to clear.

GTirloni removed a subscriber: GTirloni. Apr 3 2019, 10:20 AM