labstore1007: high load avg issue
Closed, Declined (Public)

Description

This server (dumps) seems to be suffering from a high load average, related to rsync jobs or NFS.

On 2018-10-02 we got paged by icinga about the load issue:

imagen.png (456×1 px, 77 KB)

https://graphite.wikimedia.org/S/P

I logged into the machine, and rsyncd was the most CPU-intensive process (about 100%) according to both htop and iotop.

@ArielGlenn created this patch: https://gerrit.wikimedia.org/r/463925

But the load is probably also high because of NFS.

Apparently labstore1006 is not suffering the same issues.

Event Timeline

labstore1007 is the Cloud VPS NFS partner. Basically, it gets high load whenever someone runs cp on a very large dump over NFS. labstore1006 will do the same thing if exposed to VPS (currently it serves NFS for the stats servers, but not Cloud VPS).

This is essentially totally expected. Adding the monitoring task as a related task.

aborrero moved this task from Inbox to Graveyard on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-operations) [2018-12-05T10:17:52Z] <arturo> T205969 icinga downtime the load avg check in labstore1007 for 1 week

This week we got several pages related to this. I just downtimed the server for 1 week.

image.png (456×1 px, 124 KB)

Mentioned in SAL (#wikimedia-operations) [2018-12-12T12:35:27Z] <arturo> T205969 icinga downtime load-avg check for labstore1007 until January (1 month)

High load is now fairly normal for NFS since the upgrades for Spectre and Meltdown (and others). It is an email-only alert at this point because it is very hard to distinguish a real problem using just a load number like we used to. If load is extremely high for a long time, it might suggest a problem, but otherwise, I think it is not worth it to try to stop the load from being over 20 on a regular basis when it doesn't even slow down the system's web server.

Not saying this is fixed. I'm just thinking it's a "can't-fix" and we should get used to high load as more of a possible warning than a real problem.
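
To illustrate the "extremely high for a long time" idea above, here is a minimal sketch of a check that only fires when every sample in a sliding window exceeds a threshold. It is not the actual icinga check. The threshold of 20 comes from the comment above; the window size, the class name `SustainedLoadCheck`, and the helper `parse_loadavg` are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque


def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from /proc/loadavg-style text."""
    # The first whitespace-separated field on Linux is the 1-minute average.
    return float(text.split()[0])


class SustainedLoadCheck:
    """Flag load only when it stays above a threshold for a whole window.

    Hypothetical helper: `threshold` and `window` are illustrative values,
    not settings taken from the real monitoring configuration.
    """

    def __init__(self, threshold: float, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1 sample")
        self.threshold = threshold
        # Keep only the most recent `window` samples.
        self.samples: Deque[float] = deque(maxlen=window)

    def observe(self, load: float) -> bool:
        """Record one sample; return True if the whole window is over threshold."""
        self.samples.append(load)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and min(self.samples) > self.threshold


if __name__ == "__main__":
    # Assumed usage: sample once per polling interval on a Linux host.
    check = SustainedLoadCheck(threshold=20.0, window=3)
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    print("sustained high load" if check.observe(load) else "ok")
```

With a window of 3, a short spike such as 25, 30, 10 never triggers, while 40, 41, 42 in a row does. The larger the window, the more a single cp over NFS gets filtered out, at the cost of a slower reaction to a real problem.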