
labstore: Re-evaluate traffic shaping settings
Open, Medium, Public

Description

A quick test from a Cloud VPS instance (Toolforge k8s worker) while labstore1004 isn't particularly busy:

# dd if=/dev/zero of=test.dat bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 83.7286 s, 7.8 MB/s


# dd if=test.dat of=/dev/null bs=64k 
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 686.471 s, 955 kB/s

Allowing for protocol overheads, this seems in line with the traffic shaping settings in modules/labstore/manifests/traffic_shaping.pp:

class labstore::traffic_shaping(
    $nfs_write = '8500kbps',
    $nfs_read = '1000kbps',
    $nfs_dumps_read = '5000kbps',
    $egress = '30000kbps',
    $interface = $facts['interface_primary'],
) {

However, these limits might be too low to sustain decent performance on exec/worker nodes that are running dozens of apps simultaneously.
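
For reference, tc reads "kbps" as kilobytes per second (and "kbit" as kilobits), so assuming these parameter values are handed to tc unchanged, the defaults above work out to roughly:

nfs_write        8500kbps  ~  8.5 MB/s   (dd write above: 7.8 MB/s)
nfs_read         1000kbps  ~  1.0 MB/s   (dd read above: 955 kB/s)
nfs_dumps_read   5000kbps  ~  5.0 MB/s
egress          30000kbps  ~ 30 MB/s total for the interface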

Event Timeline

GTirloni created this task.

I don't know if I'm understanding the TC settings correctly but a few things caught my attention:

  1. Why do we have a higher limit for writes (non-cacheable) than for reads?
  2. The read limit at 1 MB/s seems extremely low (if it's per IP).

I think we could increase both limits to 10MB/s to give the instances at least a 100Mbps-like pipe to use. Or possibly just increase reads which have a better chance of being cached and not hitting the spinning disks.

Also, I don't understand what the $egress setting is doing; it seems to be an overall limit on the interface (in which case, why ~30 MB/s on a 1 Gb/s interface?).

Again, I could be misinterpreting this all so forgive my ignorance.
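
If a bump toward 10 MB/s is worth testing before changing the Puppet defaults, one low-risk option is to adjust the limit live on a single instance with tc and let Puppet revert it later. This is only a sketch, assuming the shaping is implemented with tc classes; the interface name and class IDs below are placeholders, not the ones traffic_shaping.pp actually installs:

# list the classes and rates actually installed first
tc class show dev eth0

# bump the (hypothetical) NFS read class to ~10 megabytes/s on this instance
# (in tc, "mbps" likewise means megabytes per second)
tc class change dev eth0 parent 1:1 classid 1:10 htb rate 10mbps ceil 10mbps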

I may have some of this in the dusty halls of my brain:

I don't know if I'm understanding the TC settings correctly but a few things caught my attention:

  1. Why we have a higher limit for writes (non-cacheable) than reads

Historically it was read traffic that was smothering the labstores with elevated load, so it was set very conservatively and no one complained! Well, it was raised and lowered a few times, but it settled pretty conservatively for sure. That's the short of it.

  2. The read limit at 1 MB/s seems extremely low (if it's per IP).

It's basically per instance, yes, but the dogpile effect of a handful of instances reading too much at one time is a worry. Take the dumps server, for example: sometimes a single client just reading dumps inflates load more than seems sane. It's worth thinking about in at least two ways: it's better to have more instances with lower limits to reduce clustering of CPU/memory contention (at least in grid terms), and the cumulative effect of whatever limit is set per instance can get messy. In general, these numbers have been tweaked and revisited over time with the idea that very little traffic is needed and stability of the system is paramount. Further back we had situations where a single tool would overwhelm NFS and spike load, usually without serious effect until it reached some critical mass of scheduling insanity, and then the servers would go into a downward spiral. *hand waves* seems sane, but be careful :)

I think we could increase both limits to 10MB/s to give the instances at least a 100Mbps-like pipe to use. Or possibly just increase reads which have a better chance of being cached and not hitting the spinning disks.

I would be super practical about it: if you can get away with higher and it's needed, then cool; if it's not needed, then I wouldn't risk it. If someone says it's needed, then I would think about why. One approach that was intended originally but never happened was to separate the /home and /data/project limits, so that users tinkering about in /home with ad hoc things would get a much lower limit. That is still possible, but once this reached some minimum level of sanity there were bigger fires.

Also, I don't understand what the $egress setting is doing; it seems to be an overall limit on the interface (in which case, why ~30 MB/s on a 1 Gb/s interface?).

Yes, that's exactly it. It's a cumulative thing mostly: there are 1G interfaces on most of the hosts, but every host usually has 30-90 instances. We were seeing some instances drown out everything else on a particular hypervisor. I believe at the time we had something like 15 hypervisors at 1G for instances, but the gateway at that time was only 1G itself. So it was basically a ratchet down until things stopped exploding, iirc. I don't think there is a right or wrong approach here, but the more room there is for consumption spikes, the less predictable life will be.
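
As a rough illustration of that shape (not the actual rules traffic_shaping.pp generates; the interface name and class IDs are made up), an HTB root class caps the instance's total egress while a child class carrying NFS-bound traffic gets a lower ceiling of its own:

tc qdisc add dev eth0 root handle 1: htb default 20
tc class add dev eth0 parent 1: classid 1:1 htb rate 30000kbps                  # overall egress cap
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1000kbps ceil 8500kbps   # NFS-bound traffic
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1000kbps ceil 30000kbps  # everything else
# steer NFS traffic (port 2049) into the capped class
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 2049 0xffff flowid 1:10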

Again, I could be misinterpreting this all so forgive my ignorance.

I've repeatedly found cases where the traffic shaping settings cause problems on VMs because they limit hosts that often contain shared services. This makes their performance bad regardless of NFS. Also, the egress settings never applied to hosts that didn't have NFS mounted in the first place.

At this point most hosts are on 10Gb Ethernet, so the thinking that created these numbers is dated. Dumps is on 10Gb. The secondary cluster has no traffic shaping except the egress effects.

The degree to which the traffic shaping harms performance is directly proportional to how shared the VM is. Bastions need a bump in performance so that they stop being crippled by cp commands (with a little help from things like https://gerrit.wikimedia.org/r/c/operations/puppet/+/635888 to discourage misuse). The dumps read limit should really be raised in general, since it wasn't changed when the servers became much more network-capable. Finally, I think upping NFS read across the board a little would be a good idea, but that might be best kept to a small increase until after we get the primary cluster on 10Gb (T266198: Move labstore1004 and labstore1005 to 10G Ethernet).

The write throttle is unchanged partly because we haven't upgraded the DRBD network yet. At the very least, NFS reads should no longer feel like you are mounting it over a cell phone network.
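
Whenever a limit is raised, the effective ceiling can be re-checked from a client instance with the same dd test as in the description plus tc's own counters (eth0 below stands in for whatever interface_primary resolves to on the instance):

# make sure the read really hits NFS rather than the local page cache
echo 3 > /proc/sys/vm/drop_caches
dd if=test.dat of=/dev/null bs=64k

# show the installed qdiscs/classes and their byte/packet/drop counters
tc -s qdisc show dev eth0
tc -s class show dev eth0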

Change 656269 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs: set default monitors for 10Gb Ethernet

https://gerrit.wikimedia.org/r/656269

Change 656269 merged by Bstorm:
[operations/puppet@production] nfs: set default monitors for 10Gb Ethernet

https://gerrit.wikimedia.org/r/656269

Now that the DRBD link is fully 10G on the tools and misc NFS servers, we can remove more of the tight caps on writes for that cluster.

Change 691267 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud nfs: Change primary cluster rate limits dramatically

https://gerrit.wikimedia.org/r/691267

Mentioned in SAL (#wikimedia-cloud) [2021-05-14T19:18:54Z] <bstorm> adjusting the rate limits for bastions nfs_write upward a lot to make NFS writes faster now that the cluster is finally using 10Gb on the backend and frontend T218338

Change 691267 merged by Bstorm:

[operations/puppet@production] cloud nfs: Change primary cluster rate limits dramatically

https://gerrit.wikimedia.org/r/691267