Investigate ms-be hosts performance during rebalances
Closed, Resolved · Public

Description

During the latest eqiad swift / ms-be rebalances I've noticed that the new hosts experience higher latency than the rest, especially around PUT/DELETE requests.

Some things off the top of my head that are worth investigating:

  1. perf top shows native_queued_spin_lock_slowpath, which made me realize we're not load-balancing IRQs across CPUs, though we should be (similar to cp / lvs hosts); see the sketch after this list
  2. test rebalances with less weight (i.e. moving fewer partitions around)
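
For illustration, here is a minimal sketch (not the production interface-rps.py; the interface name and CPU list are made up) of what the RPS side of "load-balancing across CPUs" boils down to: writing a one-CPU bitmask into each RX queue's rps_cpus file so that receive processing is spread over the chosen CPUs.

import glob
import os

def spread_rps(iface: str, cpus: list[int]) -> None:
    """Round-robin the interface's RX queues over the given CPUs via RPS."""
    rx_queues = sorted(
        glob.glob(f"/sys/class/net/{iface}/queues/rx-*"),
        key=lambda path: int(path.rsplit("-", 1)[1]),
    )
    for i, queue in enumerate(rx_queues):
        mask = 1 << cpus[i % len(cpus)]  # single-CPU bitmask for this queue
        with open(os.path.join(queue, "rps_cpus"), "w") as f:
            f.write(f"{mask:x}\n")

# Hypothetical example: spread eno5's queues over CPUs 0-9
# spread_rps("eno5", cpus=list(range(10)))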

Event Timeline

Change 655636 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP: add interface::rps to swift::storage

https://gerrit.wikimedia.org/r/655636

Change 655636 merged by Filippo Giunchedi:
[operations/puppet@production] role: add interface::rps to swift::storage

https://gerrit.wikimedia.org/r/655636

Change 655902 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: add interface::rps to swift::storage

https://gerrit.wikimedia.org/r/655902

Change 655902 merged by Filippo Giunchedi:
[operations/puppet@production] role: add interface::rps to swift::storage

https://gerrit.wikimedia.org/r/655902

Change 656132 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: apply interface::rps to bnx2x as well

https://gerrit.wikimedia.org/r/656132

Change 656132 merged by Filippo Giunchedi:
[operations/puppet@production] swift: apply interface::rps to bnx2x as well

https://gerrit.wikimedia.org/r/656132

The interface::rps define is active only for Broadcom NICs at the moment; I noticed that some HP hosts use the i40e driver instead, and AFAICT we don't already have interface::rps applied to any hosts using i40e. I tried testing interface-rps.py on ms-be2056.

The script worked in the sense that there were no errors, but I'd like confirmation from e.g. @BBlack or @faidon that things look as they should on ms-be2056, and/or whether the script's logic needs adjusting for i40e NICs. Thanks!

root@ms-be2056:/home/filippo# ./interface-rps.py eno5
/sys/class/net/eno5/queues/rx-0/rps_cpus = 1
/sys/class/net/eno5/queues/tx-0/xps_cpus = 1
/sys/class/net/eno5/queues/rx-1/rps_cpus = 2
/sys/class/net/eno5/queues/tx-1/xps_cpus = 2
/sys/class/net/eno5/queues/rx-2/rps_cpus = 4
/sys/class/net/eno5/queues/tx-2/xps_cpus = 4
/sys/class/net/eno5/queues/rx-3/rps_cpus = 8
/sys/class/net/eno5/queues/tx-3/xps_cpus = 8
/sys/class/net/eno5/queues/rx-4/rps_cpus = 10
/sys/class/net/eno5/queues/tx-4/xps_cpus = 10
/sys/class/net/eno5/queues/rx-5/rps_cpus = 20
/sys/class/net/eno5/queues/tx-5/xps_cpus = 20
/sys/class/net/eno5/queues/rx-6/rps_cpus = 40
/sys/class/net/eno5/queues/tx-6/xps_cpus = 40
/sys/class/net/eno5/queues/rx-7/rps_cpus = 80
/sys/class/net/eno5/queues/tx-7/xps_cpus = 80
/sys/class/net/eno5/queues/rx-8/rps_cpus = 100
/sys/class/net/eno5/queues/tx-8/xps_cpus = 100
/sys/class/net/eno5/queues/rx-9/rps_cpus = 200
/sys/class/net/eno5/queues/tx-9/xps_cpus = 200
/sys/class/net/eno5/queues/rx-10/rps_cpus = 1
/sys/class/net/eno5/queues/tx-10/xps_cpus = 1
/sys/class/net/eno5/queues/rx-11/rps_cpus = 2
/sys/class/net/eno5/queues/tx-11/xps_cpus = 2
/sys/class/net/eno5/queues/rx-12/rps_cpus = 4
/sys/class/net/eno5/queues/tx-12/xps_cpus = 4
/sys/class/net/eno5/queues/rx-13/rps_cpus = 8
/sys/class/net/eno5/queues/tx-13/xps_cpus = 8
/sys/class/net/eno5/queues/rx-14/rps_cpus = 10
/sys/class/net/eno5/queues/tx-14/xps_cpus = 10
/sys/class/net/eno5/queues/rx-15/rps_cpus = 20
/sys/class/net/eno5/queues/tx-15/xps_cpus = 20
/sys/class/net/eno5/queues/rx-16/rps_cpus = 40
/sys/class/net/eno5/queues/tx-16/xps_cpus = 40
/sys/class/net/eno5/queues/rx-17/rps_cpus = 80
/sys/class/net/eno5/queues/tx-17/xps_cpus = 80
/sys/class/net/eno5/queues/rx-18/rps_cpus = 100
/sys/class/net/eno5/queues/tx-18/xps_cpus = 100
/sys/class/net/eno5/queues/rx-19/rps_cpus = 200
/sys/class/net/eno5/queues/tx-19/xps_cpus = 200
/sys/class/net/eno5/queues/rx-20/rps_cpus = 1
/sys/class/net/eno5/queues/tx-20/xps_cpus = 1
/sys/class/net/eno5/queues/rx-21/rps_cpus = 2
/sys/class/net/eno5/queues/tx-21/xps_cpus = 2
/sys/class/net/eno5/queues/rx-22/rps_cpus = 4
/sys/class/net/eno5/queues/tx-22/xps_cpus = 4
/sys/class/net/eno5/queues/rx-23/rps_cpus = 8
/sys/class/net/eno5/queues/tx-23/xps_cpus = 8
/sys/class/net/eno5/queues/rx-24/rps_cpus = 10
/sys/class/net/eno5/queues/tx-24/xps_cpus = 10
/sys/class/net/eno5/queues/rx-25/rps_cpus = 20
/sys/class/net/eno5/queues/tx-25/xps_cpus = 20
/sys/class/net/eno5/queues/rx-26/rps_cpus = 40
/sys/class/net/eno5/queues/tx-26/xps_cpus = 40
/sys/class/net/eno5/queues/rx-27/rps_cpus = 80
/sys/class/net/eno5/queues/tx-27/xps_cpus = 80
/sys/class/net/eno5/queues/rx-28/rps_cpus = 100
/sys/class/net/eno5/queues/tx-28/xps_cpus = 100
/sys/class/net/eno5/queues/rx-29/rps_cpus = 200
/sys/class/net/eno5/queues/tx-29/xps_cpus = 200
/sys/class/net/eno5/queues/rx-30/rps_cpus = 1
/sys/class/net/eno5/queues/tx-30/xps_cpus = 1
/sys/class/net/eno5/queues/rx-31/rps_cpus = 2
/sys/class/net/eno5/queues/tx-31/xps_cpus = 2
/sys/class/net/eno5/queues/rx-32/rps_cpus = 4
/sys/class/net/eno5/queues/tx-32/xps_cpus = 4
/sys/class/net/eno5/queues/rx-33/rps_cpus = 8
/sys/class/net/eno5/queues/tx-33/xps_cpus = 8
/sys/class/net/eno5/queues/rx-34/rps_cpus = 10
/sys/class/net/eno5/queues/tx-34/xps_cpus = 10
/sys/class/net/eno5/queues/rx-35/rps_cpus = 20
/sys/class/net/eno5/queues/tx-35/xps_cpus = 20
/sys/class/net/eno5/queues/rx-36/rps_cpus = 40
/sys/class/net/eno5/queues/tx-36/xps_cpus = 40
/sys/class/net/eno5/queues/rx-37/rps_cpus = 80
/sys/class/net/eno5/queues/tx-37/xps_cpus = 80
/sys/class/net/eno5/queues/rx-38/rps_cpus = 100
/sys/class/net/eno5/queues/tx-38/xps_cpus = 100
/sys/class/net/eno5/queues/rx-39/rps_cpus = 200
/sys/class/net/eno5/queues/tx-39/xps_cpus = 200
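
For readers less familiar with these sysfs files: each value above is a hexadecimal CPU bitmask, so 1 selects CPU 0 and 200 selects CPU 9, and the pattern repeating every 10 queues shows the 40 queues being spread 4:1 over CPUs 0-9. A tiny decoding sketch (illustrative only, not part of interface-rps.py):

def decode_mask(hex_mask: str) -> list[int]:
    """Return the CPU indices selected by an rps_cpus/xps_cpus hex mask."""
    value = int(hex_mask.replace(",", ""), 16)  # sysfs comma-separates 32-bit words
    return [cpu for cpu in range(value.bit_length()) if value & (1 << cpu)]

for queue, mask in [("rx-0", "1"), ("rx-9", "200"), ("rx-10", "1"), ("rx-39", "200")]:
    print(queue, "->", decode_mask(mask))  # rx-0 -> [0], rx-9 -> [9], rx-10 -> [0], ...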

Change 656837 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: decrease object replicator concurrency

https://gerrit.wikimedia.org/r/656837

Change 657372 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Update HDFS folder creation for analytics refinery

https://gerrit.wikimedia.org/r/657372

Change 657372 merged by Elukey:
[operations/puppet@production] Update HDFS folder creation for analytics refinery

https://gerrit.wikimedia.org/r/657372

Change 656837 merged by Filippo Giunchedi:
[operations/puppet@production] swift: decrease object replicator concurrency

https://gerrit.wikimedia.org/r/656837

Catching up on this from my backlog: the basics all seem to miraculously work well enough by default for this case. The NUMA filtering works, the tx/rx queue mapping works, the IRQ counts indicate that it's doing what it claims to do, etc. The only minor issue is that the i40e driver by default configures 40 queues to match the 40 CPUs it counts on the host, while interface-rps knows there are only 10 real (as opposed to Hyperthread-sibling) CPUs attached to the NUMA domain closest to the card. interface-rps.py handles this reasonably well and simply maps 4 queues to each of its real target CPUs, but it would probably be better to trim the card down to 10 queues and let everything map 1:1.

interface::rps already has support for doing this, and it looks like it would also work fine as-is on this card: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/interface/manifests/rps.pp#45 . However, because we were afraid of fallout on $random_cards, there's an if-guard that only lets it run on the Broadcom drivers we know well; if you add a match for i40e to that regex, it should work. Keep in mind the comments there: deploying this ethtool Exec to a host for the first time will likely blip the interface's link status while the queues are reconfigured and cause a tiny outage.
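
For reference, a rough sketch of what that queue trimming amounts to (an illustration, not the Puppet code; the interface name and queue count are examples): set the NIC's combined channels to the number of real CPUs on its local NUMA node so queues map 1:1. As noted above, the first run reconfigures the queues and will likely blip the link.

import subprocess

def set_combined_queues(iface: str, n_queues: int) -> None:
    """Set the NIC's combined channel (queue) count with `ethtool -L`."""
    # Equivalent to running: ethtool -L <iface> combined <n_queues>
    subprocess.run(["ethtool", "-L", iface, "combined", str(n_queues)], check=True)

# set_combined_queues("eno5", 10)  # e.g. 10 real CPUs on the card's NUMA node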

Change 661053 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] interfaces: allow setting queues on i40e NICs

https://gerrit.wikimedia.org/r/661053

Change 661054 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: apply interface::rps to i40e NICs

https://gerrit.wikimedia.org/r/661054

Thank you for checking things out! I'm glad interface-rps.py does the right thing by default in the i40e case as well. I've added i40e to the driver list for review here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/661053 and indeed the ethtool Exec gets run (see https://puppet-compiler.wmflabs.org/compiler1003/27800/ms-be2056.codfw.wmnet/index.html) when it's used from swift::performance (https://gerrit.wikimedia.org/r/c/operations/puppet/+/661054).

Change 661054 merged by Filippo Giunchedi:
[operations/puppet@production] swift: apply interface::rps to i40e NICs

https://gerrit.wikimedia.org/r/661054

Change 661053 abandoned by Filippo Giunchedi:
[operations/puppet@production] interfaces: allow setting queues on i40e NICs

Reason:
Superseded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/662688/5/modules/interface/files/interface-rps.py#162

https://gerrit.wikimedia.org/r/661053

fgiunchedi claimed this task.

We're balancing IRQs amongst CPUs now, and will be setting the NIC queue count for i40e NICs via https://gerrit.wikimedia.org/r/c/operations/puppet/+/662688/5/modules/interface/files/interface-rps.py#162