
labstore1007: high load avg issue
Open, Low, Public


This server (dumps) seems to be suffering from high load average, related to rsync jobs or NFS.

On 2018-10-02 we got paged by Icinga about the load issue:

I logged into the machine, and rsyncd was the most CPU-intensive process (about 100%) in both htop and iotop.

@ArielGlenn created this patch:

But the load is probably also high because of NFS.

Apparently labstore1006 is not suffering from the same issue.

Event Timeline

aborrero created this task. Oct 2 2018, 12:16 PM
Restricted Application added a subscriber: Aklapper. Oct 2 2018, 12:16 PM
Bstorm added a comment. Edited Oct 3 2018, 5:56 PM

labstore1007 is the partner that serves NFS to Cloud VPS. Its load gets high whenever someone runs cp on a very large dump over NFS. labstore1006 would do the same thing if it were exposed to Cloud VPS (currently it serves NFS to the stats servers, but not to Cloud VPS).

This is essentially expected. Adding the monitoring task as a related task.
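
For context: on Linux, the load average counts tasks waiting in uninterruptible I/O sleep as well as runnable ones, so a large copy over NFS can raise it even when the CPUs are not saturated. Below is a minimal sketch of comparing a load reading against the core count. The function names, the sample values, and the threshold of 1.0 per CPU are assumptions for illustration; they are not the thresholds of the actual Icinga check.

```python
def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg-style text."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def is_overloaded(load, cpu_count, threshold=1.0):
    """True if the load per CPU exceeds the (hypothetical) threshold."""
    return load / cpu_count > threshold


if __name__ == "__main__":
    # Hypothetical sample in the /proc/loadavg format, not real data from labstore1007.
    sample = "12.50 8.10 4.02 3/512 12345"
    one_min, five_min, fifteen_min = parse_loadavg(sample)
    # Assume an 8-core machine for the example.
    print(is_overloaded(one_min, cpu_count=8))
```

On a real host, the sample string would come from reading /proc/loadavg and the core count from os.cpu_count().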

aborrero triaged this task as Low priority. Nov 12 2018, 5:06 PM
aborrero moved this task from Inbox to Graveyard on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-operations) [2018-12-05T10:17:52Z] <arturo> T205969 icinga downtime the load avg check in labstore1007 for 1 week

This week we got several pages related to this. I just downtimed the server for 1 week.

Mentioned in SAL (#wikimedia-operations) [2018-12-12T12:35:27Z] <arturo> T205969 icinga downtime load-avg check for labstore1007 until January (1 month)

GTirloni removed a subscriber: GTirloni. Dec 20 2018, 6:43 PM
bd808 moved this task from Backlog to Dumps on the Data-Services board. May 30 2019, 7:04 PM