Check bandwidth limitation on integration-castor03.integration.eqiad.wmflabs / cloudvirt1002
Closed, Resolved (Public)

Description

The CI jobs rsync some cached material to a central instance, integration-castor03.integration.eqiad.wmflabs: typically the local Maven cache and the npm/composer install caches. This speeds up builds and saves bandwidth.

T188375: castor rsync's taking 3-5 minutes for mwgate-npm jobs is about speeding it up. As part of that work, rsync was changed to no longer compress the data: when several jobs fetched caches at once the CPU was busy, and skipping compression saves CPU cycles. However, that means the amount of data to transfer is much larger.

The Grafana view shows the same network cap at roughly 20 MBps.

The purpose of this task is to find out whether #WMCS does any rate limiting of instances and, if so, whether the cap can be raised a bit.

The instance hosting the cache is integration-castor03.integration.eqiad.wmflabs (172.16.5.161). It is running on the server cloudvirt1002.

Note that the central caching system has reached its limits and is now a bottleneck, so it would eventually have to be replaced with something new and more distributed. But that is an entirely separate topic.


Possible fix:

https://horizon.wikimedia.org/project/instances/8911a89e-4247-48d0-9cb4-7377244b37bb/
labstore::traffic_shaping::egress: 50000kbps
Run tc-setup on the instance

Verify by checking integration.integration-castor03.network.eth0.tx_bit: https://graphite-labs.wikimedia.org/render/?width=1280&height=720&target=integration.integration-castor03.network.eth0.tx_bit
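
The verification step can also be scripted. Graphite's render endpoint accepts format=json, which returns a list of series, each with a "datapoints" list of [value, timestamp] pairs. The following is only a rough sketch: it assumes the tx_bit metric is reported in bits per second, the function name peak_mbit is made up for illustration, and the sample payload is hypothetical, not real measurements.

```python
import json


def peak_mbit(render_json: str) -> float:
    """Return the highest non-null datapoint across all series, in Mbit/s.

    Assumes the values are bits per second (an assumption about tx_bit).
    """
    series = json.loads(render_json)
    values = [
        value
        for entry in series
        for value, _timestamp in entry["datapoints"]
        if value is not None  # Graphite emits null for missing samples
    ]
    if not values:
        raise ValueError("no datapoints in response")
    return max(values) / 1_000_000


# Hypothetical payload in the shape returned by render/?format=json.
sample = json.dumps([
    {
        "target": "integration.integration-castor03.network.eth0.tx_bit",
        "datapoints": [
            [120_000_000, 1568729400],
            [None, 1568729460],
            [238_000_000, 1568729520],
        ],
    }
])

print(peak_mbit(sample))  # prints 238.0
```

A peak sitting just under the configured egress rate would suggest the shaping cap is still the limiting factor.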

Event Timeline

Aklapper renamed this task from Check bandwith limitation on integration-castor03.integration.eqiad.wmflabs / cloudvirt1002 to Check bandwidth limitation on integration-castor03.integration.eqiad.wmflabs / cloudvirt1002. Sep 11 2019, 6:47 PM
hashar triaged this task as High priority.

Eventually I dug into the puppet log. I found out that all WMCS instances have an nfsclient puppet class applied, which ends up invoking labstore::traffic_shaping. That class creates a file, /usr/local/sbin/tc-setup, which holds various shaping parameters.

The instance thus has:

$ sudo tc class show dev eth0
class htb 1:100 root prio 0 rate 240Mbit ceil 240Mbit burst 1560b cburst 1560b 
...

In Horizon I have set a hieradata at https://horizon.wikimedia.org/project/instances/8911a89e-4247-48d0-9cb4-7377244b37bb/ :

labstore::traffic_shaping::egress: 50000kbps

Ran puppet:

Notice: /Stage[main]/Labstore::Traffic_shaping/File[/usr/local/sbin/tc-setup]/content: 
--- /usr/local/sbin/tc-setup	2019-09-17 14:16:05.267105129 +0000
+++ /tmp/puppet-file20190917-25073-1vfgqfu	2019-09-17 14:20:06.980876742 +0000
@@ -14,7 +14,7 @@
 nfs_write='8500kbps'
 nfs_read='1000kbps'
 nfs_dumps_read='5000kbps'
-egress='30000kbps'
+egress='50000kbps'
 iface='eth0'
 
 function clean_ingress {

Manually ran tc-setup and:

$ sudo tc class show dev eth0
class htb 1:100 root prio 0 rate 400Mbit ceil 400Mbit burst 1600b cburst 1600b 

So the egress filter went from 240Mbit up to 400Mbit. I guess that will soon be reflected in the stats for the instance :]
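
The numbers line up once you account for tc's unit suffixes: as documented in tc(8), "kbps" and "mbps" mean kilobytes and megabytes per second, while "kbit" and "mbit" mean bits. A minimal sketch of the conversion, with a hypothetical helper name (tc_rate_to_mbit) that is not part of tc or of the puppet class:

```python
import re

# Multipliers to bits per second for rate suffixes listed in tc(8).
# In tc, "...bps" suffixes are BYTES per second; "...bit" suffixes are bits.
_TC_UNITS = {
    "bit": 1,
    "kbit": 1_000,
    "mbit": 1_000_000,
    "gbit": 1_000_000_000,
    "bps": 8,
    "kbps": 8_000,
    "mbps": 8_000_000,
    "gbps": 8_000_000_000,
}


def tc_rate_to_mbit(rate: str) -> float:
    """Convert a tc rate string such as '30000kbps' to megabits per second."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([a-z]+)", rate.strip().lower())
    if match is None or match.group(2) not in _TC_UNITS:
        raise ValueError(f"unsupported tc rate: {rate!r}")
    value, unit = match.groups()
    return float(value) * _TC_UNITS[unit] / 1_000_000


for setting in ("30000kbps", "50000kbps"):
    print(setting, "->", tc_rate_to_mbit(setting), "Mbit")
# 30000kbps -> 240.0 Mbit
# 50000kbps -> 400.0 Mbit
```

So the old egress value of 30000kbps corresponds to the 240Mbit class rate, and 50000kbps to the 400Mbit one.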

Mentioned in SAL (#wikimedia-releng) [2019-09-17T14:31:24Z] <hashar> Raise egress filter trafic shaping on integration-castor03 from 240Mbit to 400Mbit ( T232644 / T188375 )

I'm sorry I didn't get to this! It sounds like you are (probably) all set.

Mentioned in SAL (#wikimedia-releng) [2019-09-18T07:36:19Z] <hashar> Raised integration-castor03 egress trafic shaping from 50mbps to 100mbps # T232644

At least you hinted at traffic shaping being in place to protect NFS. That gave me the clue I was missing and eventually led me to labstore::traffic_shaping :-]

I further raised the shaping:

- labstore::traffic_shaping::egress: 50000kbps  # 50 mbps
+ labstore::traffic_shaping::egress: 100mbps
tc class show dev eth0|head -n1
class htb 1:100 root prio 0 rate 800Mbit ceil 800Mbit burst 1600b cburst 1600b

Solved by applying labstore::traffic_shaping::egress: 100mbps to the instance hiera configuration.

It cannot be applied at the role level since there is no role-based lookup on labs (T120165).

Good enough for now, can bump as needed later on.

Mentioned in SAL (#wikimedia-releng) [2020-06-18T18:28:04Z] <hashar> integration-castor03: remove labstore::traffic_shaping::egress: 100mbps in horizon. It is now applied project wide via puppet.git # T232644 T255371