labstore1007: high load avg issue
Open, Low, Public

Description

This server (dumps) seems to be suffering from a high load average, apparently related to rsync jobs or NFS.

On 2018-10-02 we got paged by icinga about the load issue:


https://graphite.wikimedia.org/S/P

I logged into the machine and rsyncd was the most CPU-intensive process (about 100%), showing at the top of both htop and iotop.

@ArielGlenn created this patch: https://gerrit.wikimedia.org/r/463925

But the load is probably also high because of NFS.

Apparently labstore1006 is not suffering from the same issue.
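
For comparing the two hosts by hand, a minimal sketch that normalizes the load average by CPU count could look like the following. On Linux the load average also counts tasks in uninterruptible sleep (for example, waiting on disk or NFS I/O), so a high value does not necessarily mean the CPUs are saturated. The function names and the 1.0 per-CPU threshold are illustrative assumptions, not the thresholds used by the actual icinga check.

```python
import os


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def is_overloaded(load_avg: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """True if the per-CPU load exceeds the (hypothetical) threshold."""
    return load_per_cpu(load_avg, cpu_count) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"15 min load per CPU: {load_per_cpu(fifteen_min, cpus):.2f}")
    print(f"overloaded: {is_overloaded(fifteen_min, cpus)}")
```

Running this on both labstore1006 and labstore1007 would give a rough per-CPU figure to compare, under the assumption that both hosts have similar hardware.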

aborrero created this task. Oct 2 2018, 12:16 PM
Restricted Application added a subscriber: Aklapper. Oct 2 2018, 12:16 PM
Bstorm added a comment. Edited Oct 3 2018, 5:56 PM

labstore1007 is the Cloud VPS NFS partner. It gets high load whenever someone runs cp on a very large dump over NFS. labstore1006 would do the same if exposed to Cloud VPS (currently it serves NFS to the stats servers, but not to Cloud VPS).

This is essentially expected behavior. Adding the monitoring task as a related task.

aborrero moved this task from Inbox to Graveyard on the cloud-services-team (Kanban) board.
aborrero triaged this task as Low priority.

Mentioned in SAL (#wikimedia-operations) [2018-12-05T10:17:52Z] <arturo> T205969 icinga downtime the load avg check in labstore1007 for 1 week

This week we got several pages related to this. I just downtimed the server for 1 week.

Mentioned in SAL (#wikimedia-operations) [2018-12-12T12:35:27Z] <arturo> T205969 icinga downtime load-avg check for labstore1007 until January (1 month)