Check bandwidth limitation on integration-castor03.integration.eqiad.wmflabs / cloudvirt1002
Closed, Resolved · Public

Assigned To

Authored By

	hashar
	Sep 11 2019, 6:00 PM

Description

The CI jobs rsync cached material, typically the local Maven cache and the npm/composer install caches, to a central instance, integration-castor03.integration.eqiad.wmflabs. This speeds up builds and saves bandwidth.

T188375: castor rsync's taking 3-5 minutes for mwgate-npm jobs is about speeding that up. As part of it, rsync was changed to no longer compress the data in order to save on CPU usage: when several jobs fetched caches at once the CPU was busy, and skipping compression saves CPU cycles. However, that means the amount of data to transfer is much larger.

The Grafana view shows the same network cap at roughly 20 MBps.

The purpose of this task is to find out whether #WMCS does any rate limiting of instances and, if so, whether the cap can be raised a bit.

The instance hosting the cache is integration-castor03.integration.eqiad.wmflabs (172.16.5.161). It is running on the server cloudvirt1002.

Note that the central caching system has reached its limits and is now a bottleneck, so it will eventually have to be replaced with something new and more distributed. But that is a separate topic.

Possible fix:

https://horizon.wikimedia.org/project/instances/8911a89e-4247-48d0-9cb4-7377244b37bb/
labstore::traffic_shaping::egress: 50000kbps
Run tc-setup on the instance

Verify by checking integration.integration-castor03.network.eth0.tx_bit: https://graphite-labs.wikimedia.org/render/?width=1280&height=720&target=integration.integration-castor03.network.eth0.tx_bit

Related Objects

Status	Assigned	Task
Open	None	T225730 Reduce runtime of MW shared gate Jenkins jobs to 5 min
Resolved	hashar	T188375 castor rsync's taking 3-5 minutes for mwgate-npm jobs
Resolved	hashar	T232644 Check bandwidth limitation on integration-castor03.integration.eqiad.wmflabs / cloudvirt1002

Event Timeline

hashar created this task. Sep 11 2019, 6:00 PM

Restricted Application added a subscriber: Aklapper. Sep 11 2019, 6:00 PM

hashar mentioned this in T188375: castor rsync's taking 3-5 minutes for mwgate-npm jobs. Sep 11 2019, 6:20 PM

Aklapper renamed this task from Check bandwith limitation on integration-castor03.integration.eqiad.wmflabs / cloudvirt1002 to Check bandwidth limitation on integration-castor03.integration.eqiad.wmflabs / cloudvirt1002. Sep 11 2019, 6:47 PM

Eventually I dug into the puppet log. I found out that all WMCS instances have an nfsclient puppet class applied, which ends up invoking labstore::traffic_shaping. That class creates a file, /usr/local/sbin/tc-setup, which holds various shaping parameters.

As a result, the instance has:

$ sudo tc class show dev eth0
class htb 1:100 root prio 0 rate 240Mbit ceil 240Mbit burst 1560b cburst 1560b 
...

In Horizon I set the following hiera data at https://horizon.wikimedia.org/project/instances/8911a89e-4247-48d0-9cb4-7377244b37bb/ :

labstore::traffic_shaping::egress: 50000kbps

Ran puppet:

Notice: /Stage[main]/Labstore::Traffic_shaping/File[/usr/local/sbin/tc-setup]/content: 
--- /usr/local/sbin/tc-setup	2019-09-17 14:16:05.267105129 +0000
+++ /tmp/puppet-file20190917-25073-1vfgqfu	2019-09-17 14:20:06.980876742 +0000
@@ -14,7 +14,7 @@
 nfs_write='8500kbps'
 nfs_read='1000kbps'
 nfs_dumps_read='5000kbps'
-egress='30000kbps'
+egress='50000kbps'
 iface='eth0'
 
 function clean_ingress {

Manually ran tc-setup and:

$ sudo tc class show dev eth0
class htb 1:100 root prio 0 rate 400Mbit ceil 400Mbit burst 1600b cburst 1600b

So the egress filter went from 240Mbit up to 400Mbit. I guess that will soon be reflected in the stats for the instance :]
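The hiera values and the tc output use different units, which is easy to misread: in tc, the `kbps` and `mbps` suffixes mean kilobytes and megabytes per second, while `tc class show` reports bits. A minimal Python sketch of the conversion, covering the values used in this task (the helper name `tc_rate_to_bits` is made up for illustration and only handles the plain decimal suffixes):

```python
import re

# tc(8) rate suffixes: the "bps" family is BYTES per second,
# the "bit" family is bits per second. Multipliers give bits per second.
_UNITS = {
    "bit": 1, "kbit": 1_000, "mbit": 1_000_000, "gbit": 1_000_000_000,
    "bps": 8, "kbps": 8_000, "mbps": 8_000_000, "gbps": 8_000_000_000,
}

def tc_rate_to_bits(rate):
    """Convert a tc-style rate such as '50000kbps' to bits per second."""
    match = re.fullmatch(r"(\d+)([a-z]+)", rate.strip().lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError("unsupported tc rate: %r" % rate)
    return int(match.group(1)) * _UNITS[match.group(2)]

for value in ("30000kbps", "50000kbps", "100mbps"):
    print(value, "->", tc_rate_to_bits(value) // 1_000_000, "Mbit")
```

This matches what tc reports on the instance: 30000kbps is 240Mbit, 50000kbps is 400Mbit, and 100mbps is 800Mbit.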

We will see the effect on https://graphite-labs.wikimedia.org/render/?width=1280&height=720&target=integration.integration-castor03.network.eth0.tx_bit
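Instead of eyeballing the rendered graph, the same metric can probably be pulled as JSON and reduced to a peak value. The sketch below assumes the Graphite render API's `format=json` output (a list of series, each with `datapoints` as `[value, timestamp]` pairs) and assumes `tx_bit` is in bits per second, which the metric name suggests but I have not verified. The helper names and the sample payload are made up for illustration.

```python
import json
from urllib.parse import urlencode

GRAPHITE = "https://graphite-labs.wikimedia.org/render/"
TARGET = "integration.integration-castor03.network.eth0.tx_bit"

def render_url(target, since="-24h"):
    """Build a render URL that returns JSON instead of a PNG."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return GRAPHITE + "?" + query

def peak_mbit(payload):
    """Return the highest datapoint across all series, in Mbit/s."""
    series = json.loads(payload)
    values = [
        value
        for entry in series
        for value, _timestamp in entry["datapoints"]
        if value is not None  # Graphite emits null for missing samples
    ]
    return max(values, default=0.0) / 1_000_000

# Hypothetical payload, shaped like a Graphite JSON response.
sample = json.dumps([
    {"target": TARGET,
     "datapoints": [[None, 1568730000], [160000000.0, 1568730060],
                    [390000000.0, 1568730120]]}
])
print(render_url(TARGET))
print(peak_mbit(sample), "Mbit/s peak")
```

If the peak sits just under the configured egress rate for long stretches, the shaping is likely still the limiting factor.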

Mentioned in SAL (#wikimedia-releng) [2019-09-17T14:31:24Z] <hashar> Raise egress filter trafic shaping on integration-castor03 from 240Mbit to 400Mbit ( T232644 / T188375 )

I'm sorry I didn't get to this! It sounds like you are (probably) all set.

Mentioned in SAL (#wikimedia-releng) [2019-09-18T07:36:19Z] <hashar> Raised integration-castor03 egress trafic shaping from 50mbps to 100mbps # T232644

In T232644#5501870, @Andrew wrote:

I'm sorry I didn't get to this! It sounds like you are (probably) all set.

At least you hinted at traffic shaping being in place to protect NFS. That gave me the clue I was missing and eventually led me to labstore::traffic_shaping :-]

I further raised the shaping:

- labstore::traffic_shaping::egress: 50000kbps  # 50 mbps
+ labstore::traffic_shaping::egress: 100mbps

tc class show dev eth0|head -n1
class htb 1:100 root prio 0 rate 800Mbit ceil 800Mbit burst 1600b cburst 1600b
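To check that tc-setup really picked up the new value, the `tc class show` output can be parsed and compared with the expected rate. A rough Python sketch, assuming the output looks like the lines quoted in this task (whole-number rates with a `Kbit`/`Mbit`/`Gbit` suffix); `htb_rates` is a made-up helper, not part of any existing tooling:

```python
import re

# Multipliers for the bit-based suffixes seen in tc output.
_SCALE = {"bit": 1, "Kbit": 1_000, "Mbit": 1_000_000, "Gbit": 1_000_000_000}

_CLASS_LINE = re.compile(
    r"^class htb (\S+) .*?\brate (\d+)([KMG]?bit)\b", re.MULTILINE
)

def htb_rates(tc_output):
    """Map each HTB class id to its configured rate in bits per second."""
    rates = {}
    for class_id, number, unit in _CLASS_LINE.findall(tc_output):
        rates[class_id] = int(number) * _SCALE[unit]
    return rates

sample = (
    "class htb 1:100 root prio 0 rate 800Mbit ceil 800Mbit "
    "burst 1600b cburst 1600b\n"
)
expected = 100 * 8_000_000  # 100mbps in tc units is 100 megabytes per second
print(htb_rates(sample)["1:100"] == expected)
```

In practice the `sample` string would come from running `sudo tc class show dev eth0` on the instance.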

Solved by applying labstore::traffic_shaping::egress: 100mbps to the instance hiera configuration.

It cannot be applied at the role level, since there is no role-based lookup on labs (T120165).

Good enough for now, can bump as needed later on.

hashar added a parent task: T188375: castor rsync's taking 3-5 minutes for mwgate-npm jobs. Sep 19 2019, 6:18 PM

Andrew awarded a token. Sep 19 2019, 8:54 PM

Jdforrester-WMF moved this task from INBOX to Completed on the Release-Engineering-Team-TODO (201909) board. Sep 20 2019, 8:43 PM

hashar mentioned this in T255371: publish-to-doc job is close to its timeout, with some builds lost. Jun 15 2020, 9:57 AM

Mentioned in SAL (#wikimedia-releng) [2020-06-18T18:28:04Z] <hashar> integration-castor03: remove labstore::traffic_shaping::egress: 100mbps in horizon. It is now applied project wide via puppet.git # T232644 T255371
