
labstore1003 load spikes
Closed, Declined (Public)

Description

We have been paged 3 times for this and it resolves before I can see what's up.

At the moment I see the following serious users:

root@tools-bastion-03:~# host 10.68.20.221
221.20.68.10.in-addr.arpa domain name pointer mwoffliner4.mwoffliner.eqiad.wmflabs.
root@tools-bastion-03:~# host 10.68.16.103
103.16.68.10.in-addr.arpa domain name pointer maps-tiles3.maps.eqiad.wmflabs.
root@tools-bastion-03:~# host 10.68.16.70
70.16.68.10.in-addr.arpa domain name pointer maps-wma1.maps.eqiad.wmflabs.

But none of them are doing anything abusive afaict.
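For reference, the reverse lookups above can be reproduced in one loop. This is only a sketch: `ptr_name` is a hypothetical helper (not part of any tool mentioned here) that builds the `in-addr.arpa` name `host` queries for, and the actual `host` call is left commented out because it needs a resolver that knows the `eqiad.wmflabs` zone.

```shell
#!/bin/sh
# ptr_name (hypothetical helper): build the reverse-DNS (PTR) query name
# for an IPv4 address, e.g. 10.68.20.221 -> 221.20.68.10.in-addr.arpa
ptr_name() {
    # Split the dotted quad on "." and emit the octets in reverse order.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    printf '%s.%s.%s.%s.in-addr.arpa\n' "$4" "$3" "$2" "$1"
}

# The three client addresses from the description.
for ip in 10.68.20.221 10.68.16.103 10.68.16.70; do
    printf '%s -> %s\n' "$ip" "$(ptr_name "$ip")"
    # host "$ip"    # uncomment on a machine that can resolve these names
done
```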

Event Timeline

I jumped in on mwoffliner4.mwoffliner.eqiad.wmflabs and hot patched it to:

modules='act_mirr ifb'
nfs_write='6000kbps'
nfs_read='700kbps'
nfs_dumps_read='800kbps'
egress='30000kbps'
iface='eth0'

I then ran /usr/local/sbin/tc-setup (which is idempotent), since that was the only instance I could see thrashing IO at the time. We didn't get paged again after this, beyond the initial 3 pages.
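A note on units, in case it helps whoever reads these limits later. Assuming tc-setup hands the rate strings to `tc` unchanged (the thread does not confirm this), `tc` reads `kbps` as kilobytes per second, not kilobits. The sketch below uses a hypothetical helper, `kbps_to_kbit`, to show the equivalent kilobit values under that assumption.

```shell
#!/bin/sh
# kbps_to_kbit (hypothetical helper): convert a tc-style "<N>kbps" rate
# (kilobytes per second) to the equivalent rate in kilobits per second.
# Assumes the values are passed to tc as-is, which is not confirmed here.
kbps_to_kbit() {
    n=${1%kbps}          # strip the unit suffix, leaving the number
    echo "$((n * 8))kbit"
}

# The four limits from the hot patch: nfs_write, nfs_read, nfs_dumps_read, egress.
for rate in 6000kbps 700kbps 800kbps 30000kbps; do
    printf '%s = %s\n' "$rate" "$(kbps_to_kbit "$rate")"
done
```

Under that reading, the 700kbps NFS read cap corresponds to 5600kbit, and the 30000kbps egress cap to 240000kbit.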

For the record, this paged again today, 2018-05-04 (flapping).

Some graph data:

eth0 RX/TX bytes: https://graphite.wikimedia.org/S/B

load avg: https://graphite.wikimedia.org/S/C

I get an error trying to run nethogs:

# nethogs eth0
creating socket failed while establishing local IP - are you root?

labstore1003 has now been replaced by cloudstore1008/9 (T187456).