
Heavy load on all machines of toolforge -- jobs do not start
Closed, Resolved · Public

Description

When I look at:

https://tools.wmflabs.org/admin/oge/status

I see:

tools-exec-1401 		11 	1669% 	21% 	8.5G
tools-exec-1402 		7 	1642% 	28% 	1.6G
tools-exec-1403 		9 	1532% 	14% 	5.5G
tools-exec-1404 		6 	1584% 	17% 	19.3G
tools-exec-1405 		8 	1668% 	10% 	-
tools-exec-1406 		3 	1702% 	15% 	7.5G
tools-exec-1407 		8 	1647% 	30% 	-
tools-exec-1408 		4 	1636% 	12% 	17.3G
tools-exec-1409 		3 	1689% 	16% 	4.7G
tools-exec-1410 		1 	1530% 	4% 	29G
tools-exec-1411 		5 	1550% 	8% 	10.4G
tools-exec-1412 		2 	1600% 	23% 	27.1G
tools-exec-1413 		3 	1567% 	5% 	27G
tools-exec-1414 		2 	1492% 	14% 	13.3G
tools-exec-1415 		7 	1634% 	14% 	16.3G
tools-exec-1416 		3 	1465% 	5% 	24.4G
tools-exec-1417 		5 	1563% 	10% 	11.2G
tools-exec-1418 		2 	1499% 	4% 	27G
tools-exec-1419 		1 	1462% 	7% 	29G
tools-exec-1420 		2 	1557% 	7% 	28G
tools-exec-1421 		6 	1662% 	11% 	18.1G
tools-exec-1422 		8 	1638% 	10% 	9.8G
tools-exec-1423 		3 	1593% 	12% 	15.3G
tools-exec-1424 		2 	1622% 	7% 	28G
tools-exec-1425 		6 	1678% 	9% 	21G
tools-exec-1426 		10 	1617% 	15% 	-
tools-exec-1427 		4 	1493% 	8% 	16.3G
tools-exec-1428 		6 	1604% 	8% 	24.5G
tools-exec-1429 		4 	1641% 	8% 	24.1G
tools-exec-1430 		10 	1669% 	21% 	1.5G
tools-exec-1431 		4 	1614% 	8% 	25.1G
tools-exec-1432 		5 	1577% 	26% 	7.5G
tools-exec-1433 		3 	1522% 	16% 	9.4G
tools-exec-1434 		3 	1530% 	11% 	18.2G
tools-exec-1435 		3 	1533% 	12% 	18.3G
tools-exec-1436 		0 	1585% 	5% 	30G
tools-exec-1437 		1 	1596% 	12% 	20.2G
tools-exec-1438 		1 	1590% 	38% 	20.2G
tools-exec-1439 		5 	1555% 	21% 	7.5G
tools-exec-1440 		4 	1430% 	5% 	26G
tools-exec-1441 		2 	1668% 	9% 	21.2G
tools-exec-1442 		0 	1523% 	16% 	30G
tools-webgrid-generic-1401 	3 	1582% 	11% 	-
tools-webgrid-generic-1402 	14 	1562% 	25% 	-
tools-webgrid-generic-1403 	12 	1594% 	19% 	-
tools-webgrid-generic-1404 	15 	1511% 	28% 	-
tools-webgrid-lighttpd-1402 	22 	1546% 	20% 	-
tools-webgrid-lighttpd-1403 	12 	1602% 	15% 	4G
tools-webgrid-lighttpd-1404 	3 	1494% 	8% 	76G
tools-webgrid-lighttpd-1405 	21 	1520% 	24% 	-
tools-webgrid-lighttpd-1406 	2 	1583% 	5% 	84G
tools-webgrid-lighttpd-1407 	22 	1397% 	21% 	-
tools-webgrid-lighttpd-1408 	17 	1505% 	11% 	-
tools-webgrid-lighttpd-1409 	19 	1514% 	14% 	-
tools-webgrid-lighttpd-1410 	3 	1530% 	5% 	76G
tools-webgrid-lighttpd-1411 	27 	1620% 	11% 	-
tools-webgrid-lighttpd-1412 	30 	1528% 	14% 	-
tools-webgrid-lighttpd-1414 	26 	1554% 	12% 	-
tools-webgrid-lighttpd-1415 	16 	1463% 	14% 	-
tools-webgrid-lighttpd-1416 	13 	1522% 	14% 	-
tools-webgrid-lighttpd-1417 	20 	1529% 	16% 	-
tools-webgrid-lighttpd-1418 	27 	1641% 	16% 	-
tools-webgrid-lighttpd-1420 	26 	1517% 	16% 	-
tools-webgrid-lighttpd-1421 	21 	1540% 	15% 	-
tools-webgrid-lighttpd-1422 	24 	1602% 	17% 	-
tools-webgrid-lighttpd-1425 	24 	1596% 	19% 	-

I checked because a small job had not started for an hour.

Event Timeline

Wurgl triaged this task as Unbreak Now! priority. Mar 2 2019, 10:45 AM

Something strange is going on here. Taking tools-exec-1401 as an example:

valhallasw@tools-exec-1401:~$ uptime
 11:07:56 up 294 days, 20:36,  1 user,  load average: 148.57, 143.21, 128.77

Normally these kinds of load averages indicate an NFS issue, but NFS in the home directory feels fast. Looking at the NFS mounts:

labstore1003.eqiad.wmnet:/scratch on /mnt/nfs/labstore1003-scratch type nfs (rw,noexec,nosuid,nodev,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.17.202)

nfs-tools-project.svc.eqiad.wmnet:/project/tools/project on /mnt/nfs/labstore-secondary-tools-project type nfs (rw,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,hard,addr=10.64.37.18,clientaddr=10.68.17.202)

nfs-tools-project.svc.eqiad.wmnet:/project/tools/home on /mnt/nfs/labstore-secondary-tools-home type nfs (rw,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,hard,addr=10.64.37.18,clientaddr=10.68.17.202)

labstore1006.wikimedia.org:/dumps on /mnt/nfs/dumps-labstore1006.wikimedia.org type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,soft,timeo=300,retrans=3,addr=208.80.154.7,clientaddr=10.68.17.202)

labstore1007.wikimedia.org:/dumps on /mnt/nfs/dumps-labstore1007.wikimedia.org type nfs (ro,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,soft,timeo=300,retrans=3,addr=208.80.155.106,clientaddr=10.68.17.202)
  • ls /mnt/nfs/labstore1003-scratch hangs
  • ls /mnt/nfs/labstore-secondary-tools-project is OK
  • ls /mnt/nfs/labstore-secondary-tools-home is OK
  • ls /mnt/nfs/dumps-labstore1006.wikimedia.org hangs
  • ls /mnt/nfs/dumps-labstore1007.wikimedia.org is OK
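
(A sketch for future checks of this kind: probe each mount without the probing shell itself getting stuck, using coreutils timeout. SIGTERM usually cannot interrupt a process blocked on a dead NFS mount, so SIGKILL is sent instead; the 10-second limit is an arbitrary choice.)

# Probe the mounts listed above and report which ones hang.
for m in /mnt/nfs/labstore1003-scratch \
         /mnt/nfs/labstore-secondary-tools-project \
         /mnt/nfs/labstore-secondary-tools-home \
         /mnt/nfs/dumps-labstore1006.wikimedia.org \
         /mnt/nfs/dumps-labstore1007.wikimedia.org; do
  if timeout -s KILL 10 ls "$m" > /dev/null 2>&1; then
    echo "$m: OK"
  else
    echo "$m: hangs or errors"
  fi
done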

The high load averages seem to be caused by prometheus, which has a large number of threads in D state (uninterruptible sleep -- typically 'waiting for NFS'):

valhallasw@tools-exec-1401:~$ ps -T aux | grep " D" | wc -l
163
valhallasw@tools-exec-1401:~$ ps -T aux | grep " D" | grep "prometheus" | wc -l
161
valhallasw@tools-exec-1401:~$ uptime
 11:17:32 up 294 days, 20:45,  2 users,  load average: 163.55, 158.31, 143.06

This also makes sense -- it's probably trying to get statistics on the mounted NFS shares, and Linux counts tasks in uninterruptible sleep towards the load average, so D-state threads push the load up even though CPU usage stays low.
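
(For reference, a sketch of how to see the same thing with the kernel wait channel included -- threads blocked on NFS typically show rpc_* or nfs_* symbols in the WCHAN column:)

# List all threads in uninterruptible sleep plus the kernel function they block in.
ps -eLo state,pid,tid,comm,wchan:32 | awk '$1 == "D"'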

The two problematic mounts are set to soft,timeo=300, which means any activity should time out after 30 seconds. But this is not consistent with what I see: ls does not time out, and there are many prometheus processes that have been stuck for much longer than 30 seconds.
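
(Side note, based on my reading of nfs(5): timeo is in tenths of a second and a soft mount only returns an error after retrans retransmissions, so the effective deadline is roughly

  timeo × (retrans + 1) = 30 s × 4 = 120 s

i.e. around two minutes rather than 30 seconds -- but the hangs observed here are far longer than that either way.)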

sudo service prometheus-node-exporter restart seems to at least temporarily resolve the issue, but it seems likely prometheus subprocesses will hang again sooner or later.
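
(A possible longer-term mitigation, sketched under the assumption that the hung threads come from the node exporter's filesystem collector stat()ing the dead mounts: exclude those mount points from the collector. The flag name depends on the node_exporter version -- older 0.x releases use --collector.filesystem.ignored-mount-points, 1.x uses --collector.filesystem.mount-points-exclude -- and the regex below is only illustrative.)

# Hypothetical invocation; on Debian the flag would typically go into
# ARGS= in /etc/default/prometheus-node-exporter instead.
prometheus-node-exporter \
  --collector.filesystem.ignored-mount-points="^/mnt/nfs/(labstore1003-scratch|dumps-.*)"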

sudo mount -o remount /mnt/nfs/labstore1003-scratch does not bring the scratch mount back online. Note that the mount parameters changed from

labstore1003.eqiad.wmnet:/scratch on /mnt/nfs/labstore1003-scratch type nfs (rw,noexec,nosuid,nodev,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,soft,timeo=300,retrans=3,addr=10.64.4.10,clientaddr=10.68.17.202)

to

labstore1003.eqiad.wmnet:/scratch on /mnt/nfs/labstore1003-scratch type nfs (rw,noexec,nosuid,nodev,noatime,vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,soft,timeo=300,retrans=3)

likely because no connection could be made.
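
(Sketch of a heavier-handed recovery to try next, with the caveat that it can still block if the server is completely unreachable: force/lazy-unmount the dead mount so it is detached even while processes hold it open, then mount it fresh from fstab.)

sudo umount -f -l /mnt/nfs/labstore1003-scratch
sudo mount /mnt/nfs/labstore1003-scratch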

I cannot ping labstore1003:

valhallasw@tools-exec-1401:~$ ping labstore1003.eqiad.wmnet
PING labstore1003.eqiad.wmnet (10.64.4.10) 56(84) bytes of data.
From ae2-1118.cr2-eqiad.wikimedia.org (10.64.20.3) icmp_seq=1 Packet filtered

but it does respond on port 111 (rpcbind/portmapper):

valhallasw@tools-exec-1401:~$ telnet labstore1003.eqiad.wmnet 111
Trying 10.64.4.10...
Connected to labstore1003.eqiad.wmnet.
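
(Since these are NFSv4 mounts, a more direct check -- just a sketch -- is TCP port 2049, plus asking rpcbind which RPC services the server advertises:)

nc -vz labstore1003.eqiad.wmnet 2049
rpcinfo -p labstore1003.eqiad.wmnet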

https://grafana.wikimedia.org/d/000000032/labs-monitoring?refresh=5m&orgId=1&from=1550921996943&to=1551526796943&var-LabstoreServers=All does not show anything particularly different on the labstore servers.

I'm not sure whether the Puppet errors are related -- the puppetmaster does not (always?) start correctly when it needs to respond to requests:

Mar  2 11:45:28 tools-exec-1421 puppet-agent[25767]: (/Stage[main]/Apt/File[/usr/local/bin/apt2xml]) Could not evaluate: Could not retrieve file metadata for puppet:///modules/apt/apt2xml.py: Error 500 on SERVER: <!DOCTYPE HTML PUBLIC "-
//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
[...]
Mar  2 11:45:28 tools-exec-1421 puppet-agent[25767]: (/Stage[main]/Apt/File[/usr/local/bin/apt2xml]) #011#011#011#011An error occurred while starting up the preloader: it did not write a startup response in time. Please read <a href="https://github.com/phusion/passenger/wiki/Debugging-application-startup-problems">this article</a> for more information about this problem.<br>

Does the puppetmaster use NFS for some storage?

labstore1006 seems to have recovered:

valhallasw@tools-exec-1401:~$ ls  /mnt/nfs/dumps-labstore1006.wikimedia.org
lost+found  xmldatadumps

There were some alerts in #wikimedia-operations:

[09:25:21] <icinga-wm>	 PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:29:13] <icinga-wm>	 RECOVERY - Host labstore1006 is UP: PING WARNING - Packet loss = 28%, RTA = 36.35 ms
[09:31:51] <icinga-wm>	 PROBLEM - NFS on labstore1006 is CRITICAL: connect to address 208.80.154.7 and port 2049: Connection refused
[09:43:55] <icinga-wm>	 PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.086 second response time
[11:56:47] <icinga-wm>	 RECOVERY - NFS on labstore1006 is OK: TCP OK - 0.036 second response time on 208.80.154.7 port 2049

Not clear why it recovered, though.

valhallasw lowered the priority of this task from Unbreak Now! to Needs Triage. Mar 2 2019, 12:00 PM

/mnt/nfs/labstore1003-scratch also seems accessible again, so I think the problem should resolve itself now, but it still needs some investigation into why this happened.

It seems SGE has a hard dependency on the NFS dumps share.

Related T217473

@valhallasw pointed out that jobs not getting scheduled could be caused by the high load average and SGE deciding the servers aren't fit for purpose (rather than by the NFS dumps share being down and causing the SGE scheduler itself to lock up).
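
(A sketch of how that theory could be confirmed on the grid master: gridengine puts a queue instance into the "a" (load alarm) state once np_load_avg crosses the queue's load_thresholds and stops scheduling jobs there. The queue name below is illustrative.)

qhost                                    # per-host load and memory as SGE sees it
qstat -f -explain a                      # queue instances in alarm state and the reason
qconf -sq task | grep load_thresholds    # configured threshold, by default np_load_avg=1.75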

The soft-mounted dumps shares are known to bring Toolforge to its knees (Kubernetes as well); T203469 mentions some of that. If one of the NFS shares actually has to be offline, anything connected to it needs to be rebooted to preserve functionality. The interesting thing is that when this happens to a Kubernetes server, the effect is generally far worse than on gridengine. Failing over will not help. I'll take a quick look at the servers. There was no maintenance going on, and I believe labstore1006 is not even the active partner for Toolforge (though it is already known that even the inactive one will cause this, because it is monitored and connected to containers).

Finally read back and got to T217473. That really should have paged. I may copy the hiera values I added to make them page over to host records instead, I think 😭

labstore1003 is weirder: up 905 days (presuming @GTirloni already checked that, but I want it noted here that the reboots weren't contagious).
I guess that one was hanging just because it's slow and unreliable? Not sure.

Luckily, there is a fix already in store for labstore1003 in general: T209527

@valhallasw pointed out that this was not really happening on the new grid! Those nodes just gained some load and kept running. Upgrading may have actually fixed something!

loadaverages.png (308×586 px, 51 KB)

Bstorm claimed this task.

Since there are many reasons load could become high, and high load means a resource is (at least in theory) being starved, we don't want to change grid engine to allow jobs onto nodes with excessive load. The load spike itself has passed (so I'm closing the ticket). Instead, I've created a subtask to address the load problem in the future.

In addition, it appears that the Trusty deprecation (T204530) incidentally fixes part of the problem, and that is already underway.