
labstore1007: high load avg issue
Closed, Declined (Public)

Description

This server (dumps) seems to be suffering high load average issues related to rsync jobs or nfs.

On 2018-10-02 we got paged by icinga about the load issue:


https://graphite.wikimedia.org/S/P

I logged in to the machine, and rsyncd was the most CPU-intensive process (about 100%) in both htop and iotop.

@ArielGlenn created this patch: https://gerrit.wikimedia.org/r/463925

But the load is probably also high because of NFS.

Apparently labstore1006 is not suffering the same issues.


Event Timeline

aborrero created this task. Oct 2 2018, 12:16 PM
Restricted Application added a subscriber: Aklapper. Oct 2 2018, 12:16 PM
Bstorm added a comment. Edited Oct 3 2018, 5:56 PM

labstore1007 is the Cloud VPS NFS partner. Its load gets high essentially whenever someone runs a cp on a very large dump over NFS. labstore1006 would do the same thing if exposed to Cloud VPS (currently, it serves NFS to the stats servers, but not to Cloud VPS).

This is essentially expected. Adding the monitoring task as a related task.

aborrero triaged this task as Low priority. Nov 12 2018, 5:06 PM
aborrero moved this task from Inbox to Graveyard on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-operations) [2018-12-05T10:17:52Z] <arturo> T205969 icinga downtime the load avg check in labstore1007 for 1 week

This week we got several pages related to this. I just downtimed the server for 1 week.

Mentioned in SAL (#wikimedia-operations) [2018-12-12T12:35:27Z] <arturo> T205969 icinga downtime load-avg check for labstore1007 until January (1 month)

GTirloni removed a subscriber: GTirloni. Dec 20 2018, 6:43 PM
bd808 moved this task from Backlog to Dumps on the Data-Services board. May 30 2019, 7:04 PM
Bstorm closed this task as Declined. Jun 11 2020, 11:09 PM

High load is now fairly normal for NFS since the upgrades for Spectre and Meltdown (and others). It is an email-only alert at this point because a load number alone no longer distinguishes a real problem the way it used to. If load is extremely high for a long time, it might suggest a problem, but otherwise, I think it is not worth trying to keep the load from going over 20 on a regular basis when it doesn't even slow down the system's web server.

Not saying this is fixed. I'm just thinking it's a "can't-fix" and we should get used to high load as more of a possible warning than a real problem.
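
For illustration only, the "extremely high for a long time" idea from the closing comment could be sketched roughly as below. This is not the actual icinga check used on labstore1007. The function names, the one-hour duration, and the sample format are all assumptions made up for this sketch; only the threshold of 20 echoes the figure mentioned above.

```python
import os
import time
from typing import Iterable, Tuple


def sustained_high_load(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 20.0,        # "over 20" comes from the comment above
    min_duration_s: float = 3600.0  # hypothetical: one hour counts as "a long time"
) -> bool:
    """Return True if load stayed above `threshold` for at least
    `min_duration_s` seconds without dipping back down.

    `samples` is an iterable of (unix_timestamp, load_average) pairs,
    assumed to be sorted by timestamp.
    """
    run_start = None  # timestamp where the current high-load run began
    for ts, load in samples:
        if load > threshold:
            if run_start is None:
                run_start = ts
            if ts - run_start >= min_duration_s:
                return True
        else:
            # Load dropped back under the threshold, so the run is broken.
            run_start = None
    return False


def current_sample() -> Tuple[float, float]:
    """Take one (timestamp, 1-minute load average) sample on a Unix host."""
    return (time.time(), os.getloadavg()[0])
```

A short burst such as a single large cp over NFS would reset the run as soon as load falls back under the threshold, so only prolonged load would be flagged. Whether one hour is the right cut-off is a judgment call this sketch does not settle.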