Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh')
Closed, Resolved · Public

Description

Maglev hashing ('mh'), a scheduler recently added to LVS, spreads connections evenly across backends, minimizes disruption to existing connections when the set of backends changes, and builds its lookup table quickly.
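
For readers unfamiliar with the algorithm, these properties come from how the lookup table is populated. The following is a minimal Python sketch of the table-population step as described in the Maglev paper, not the kernel's `ip_vs_mh` code. The salted SHA-256 hashing, the table size of 251 and the backend names are assumptions chosen for illustration.

```python
import hashlib


def _hash(value: str, salt: str) -> int:
    """Illustrative hash (salted SHA-256); an assumption, not what the kernel uses."""
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def _is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def build_table(backends, size=251):
    """Populate a Maglev-style lookup table.

    `size` must be prime so that every backend's preference list
    visits every slot exactly once.
    """
    if not backends:
        raise ValueError("need at least one backend")
    if not _is_prime(size) or size <= len(backends):
        raise ValueError("size must be a prime larger than the backend count")

    # Each backend gets its own permutation of the slots: offset + j * skip.
    offsets = [_hash(b, "offset") % size for b in backends]
    skips = [_hash(b, "skip") % (size - 1) + 1 for b in backends]
    next_pref = [0] * len(backends)

    table = [None] * size
    filled = 0
    while True:
        # Backends take turns claiming their next preferred free slot.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_pref[i] * skips[i]) % size
            while table[slot] is not None:
                next_pref[i] += 1
                slot = (offsets[i] + next_pref[i] * skips[i]) % size
            table[slot] = backend
            next_pref[i] += 1
            filled += 1
            if filled == size:
                return table


def lookup(table, source_ip: str) -> str:
    """Pick the backend for a client by hashing its source address."""
    return table[_hash(source_ip, "lookup") % len(table)]


# Hypothetical backend names and client address, for illustration only.
table = build_table(["backend-a", "backend-b", "backend-c"])
print(lookup(table, "192.0.2.10"))
```

Because backends claim slots in turn, their slot counts differ by at most one, which is where the even spread comes from. When a backend is removed and the table is rebuilt, most slots keep their previous owner, which limits disruption.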

  • Merge pybal patch to allow the new 'mh' scheduler and release a new pybal version
  • Test the switch from 'sh' to 'mh' one datacenter at a time to make sure there are no adverse effects before rolling out to all LVS hosts
    • ulsfo
    • eqsin
    • drmrs
    • esams
    • codfw
    • eqiad
  • Merge the Puppet patch switching all DC 'sh' users to 'mh'

Event Timeline

Restricted Application added a subscriber: Aklapper.
ArielGlenn triaged this task as Medium priority. Sep 28 2020, 9:47 AM
BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

BCornwall changed the task status from Open to In Progress. Apr 24 2023, 6:41 PM

Change 911349 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/software/spicerack@master] service: Set LVS default scheduler to Maglev (mh)

https://gerrit.wikimedia.org/r/911349

Change 911350 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] lvs: Switch scheduler from wrr to mh (Maglev)

https://gerrit.wikimedia.org/r/911350

BCornwall updated the task description.

@BBlack, do you have suggestions on how best to test this change for performance improvements/degradations?

I think we need to rewind a step here. We do want mh, but we want it for the current public sh cases (basically: text and upload ports 80+443), and maybe the other three sh cases (kibana + thanos), although we can start with text+upload first and then talk about those others with the respective teams. The current ticket description and patches seem to be going after the opposite: switching the current wrr services to mh via hieradata and spicerack changes. I think this would be actively harmful. sh and mh choose the destination based on hashes of the source address, which is great for public-facing traffic, but for the wrr services they would be hashing on our very limited set of internal cache exit IPs (or other internal service clusters for internal LVS'd traffic), and so it wouldn't balance very well at all. One could potentially address that by including the source port in the hash, but it still seems like it would be more complicated and less optimal than just sticking with wrr for these cases.
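
The balancing concern can be illustrated with a small simulation. This is a rough Python sketch, not LVS code: it stands in for the real sh/mh schedulers with a plain hash-modulo pick, and the address ranges, the 16 backends and the 8 internal source IPs are made-up numbers chosen only for illustration.

```python
import hashlib
from collections import Counter

N_BACKENDS = 16  # hypothetical backend count


def pick_backend(key: str, n_backends: int = N_BACKENDS) -> int:
    """Simplified stand-in for a source-hash scheduler: hash the key, take modulo."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_backends


def spread(keys, n_backends: int = N_BACKENDS):
    """Return how many keys land on each backend."""
    counts = Counter(pick_backend(k, n_backends) for k in keys)
    return [counts.get(i, 0) for i in range(n_backends)]


# Many distinct public client addresses (made-up range).
public_sources = [
    f"100.{i // 65536}.{(i // 256) % 256}.{i % 256}" for i in range(50_000)
]
# A handful of internal exit IPs (made-up).
internal_sources = [f"10.0.0.{i}" for i in range(1, 9)]
# The same internal IPs, but hashing on source IP and source port.
internal_with_ports = [
    f"{ip}:{port}" for ip in internal_sources for port in range(32768, 33768)
]

public_spread = spread(public_sources)
internal_spread = spread(internal_sources)
port_spread = spread(internal_with_ports)

print("public, busiest/idlest backend:", max(public_spread), min(public_spread))
print("internal IPs only, backends used:", sum(1 for c in internal_spread if c))
print("internal IP+port, backends used:", sum(1 for c in port_spread if c))
```

With only 8 distinct source addresses, at most 8 of the 16 backends can receive traffic no matter how good the hash is. Adding the source port multiplies the number of distinct keys, which is the mitigation mentioned above, at the cost of extra complexity compared with wrr.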

I would suggest we re-target this maglev ticket to doing s/sh/mh/ for the public upload+text services first, possibly stretching to the kibana and thanos sh cases later, and drop the idea of switching any of the current wrr services.

@BBlack: The ticket was literally just the title "switch to maglev hashing (mh) on LVS hosts" and I went with it based on that :) Thanks for the clarification. I'll update the ticket/patches to more appropriately address this!

Change 911349 abandoned by BCornwall:

[operations/software/spicerack@master] service: Set LVS default scheduler to Maglev (mh)

Reason:

We only wanted to change 'sh' to 'mh'; We should leave 'wrr' alone

https://gerrit.wikimedia.org/r/911349

BCornwall renamed this task from Switch to Maglev hashing ('mh') on LVS hosts to Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh'). Apr 25 2023, 4:40 PM
BCornwall updated the task description.

Change 912354 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] pybal: Switch esams LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/912354

Mentioned in SAL (#wikimedia-operations) [2023-04-26T19:47:37Z] <brett> Disable Puppet on LVS[4008-4010] for rollout of LVS maglev hashing scheduler - T263797

Change 912354 merged by BCornwall:

[operations/puppet@production] pybal: Switch ulsfo LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/912354

Change 912365 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] wmflib: Add Maglev Hashing (mh) to supported types

https://gerrit.wikimedia.org/r/912365

Change 912367 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/debs/pybal@master] ipvs: Add Maglev Hashing (mh) scheduler type

https://gerrit.wikimedia.org/r/912367

Change 912369 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/debs/pybal@1.15-stretch] ipvs: Add Maglev Hashing (mh) scheduler type

https://gerrit.wikimedia.org/r/912369

Change 912367 abandoned by BCornwall:

[operations/debs/pybal@master] ipvs: Add Maglev Hashing (mh) scheduler type

Reason:

<bblack> I wouldn't touch master, it hasn't been updated recently and might confuse history or future diggers

https://gerrit.wikimedia.org/r/912367

Change 912372 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/debs/pybal@1.15-stretch] Release 1.15.11

https://gerrit.wikimedia.org/r/912372

Change 912369 merged by BCornwall:

[operations/debs/pybal@1.15-stretch] ipvs: Add Maglev Hashing (mh) scheduler type

https://gerrit.wikimedia.org/r/912369

Change 912372 merged by BCornwall:

[operations/debs/pybal@1.15-stretch] Release 1.15.11

https://gerrit.wikimedia.org/r/912372

Mentioned in SAL (#wikimedia-operations) [2023-04-26T21:39:10Z] <brett> Re-enable Puppet on LVS[4008-4010] - T263797

Change 912365 merged by BCornwall:

[operations/puppet@production] wmflib: Add Maglev Hashing (mh) to supported types

https://gerrit.wikimedia.org/r/912365

The changes are ready for deployment, but I've been unable to build pybal and am waiting for help to return from vacation. In the meantime, the deployment patches have been reverted.

Mentioned in SAL (#wikimedia-operations) [2023-05-03T21:55:21Z] <brett> Disable puppet on lvs4008 for new pybal deployment (just in case immediate config rollback is required) - T263797

Change 917399 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] pybal: Switch esams LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/917399

Change 919861 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] pybal: Switch eqsin LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/919861

Change 919861 merged by BCornwall:

[operations/puppet@production] pybal: Switch eqsin LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/919861

Mentioned in SAL (#wikimedia-operations) [2023-05-15T18:33:01Z] <brett> Rolling out maglev LVS scheduler in eqsin - T263797

Change 920353 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] pybal: Switch drmrs LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/920353

Mentioned in SAL (#wikimedia-operations) [2023-05-16T17:27:18Z] <brett> Rolling out maglev LVS scheduler in drmrs - T263797

Change 920353 merged by BCornwall:

[operations/puppet@production] pybal: Switch drmrs LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/920353

Mentioned in SAL (#wikimedia-operations) [2023-05-16T20:30:15Z] <brett> Rolling out maglev LVS scheduler in drmrs (for real this time) - T263797

Mentioned in SAL (#wikimedia-operations) [2023-05-17T15:52:14Z] <brett> Rolling out maglev LVS scheduler in esams - T263797

Change 917399 merged by BCornwall:

[operations/puppet@production] pybal: Switch esams LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/917399

Mentioned in SAL (#wikimedia-operations) [2023-05-17T17:19:00Z] <brett> Maglev LVS scheduler rollout finished in esams - T263797

Change 924559 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] pybal: Switch codfw LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/924559

Change 924561 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool codfw (emergency patch, do not merge)

https://gerrit.wikimedia.org/r/924561

Mentioned in SAL (#wikimedia-operations) [2023-05-31T15:42:02Z] <brett> Maglev LVS scheduler rollout finished in codfw - T263797

Mentioned in SAL (#wikimedia-operations) [2023-05-31T15:42:26Z] <brett> Maglev LVS scheduler rollout began IN PROGRESS, not finished - T263797

Change 924559 merged by BCornwall:

[operations/puppet@production] pybal: Switch codfw LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/924559

Mentioned in SAL (#wikimedia-operations) [2023-05-31T17:10:30Z] <brett> Maglev LVS scheduler rollout in codfw finished - T263797

Change 927247 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] pybal: Switch eqiad LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/927247

Mentioned in SAL (#wikimedia-operations) [2023-06-05T18:28:12Z] <brett> Maglev LVS scheduler rollout in eqiad (puppet disabled) - T263797

Change 927247 merged by BCornwall:

[operations/puppet@production] pybal: Switch eqiad LVS to use Maglev scheduler

https://gerrit.wikimedia.org/r/927247

Mentioned in SAL (#wikimedia-operations) [2023-06-05T19:32:29Z] <brett> Maglev LVS scheduler rollout in eqiad finished (puppet re-enabled) - T263797

Change 924561 abandoned by BCornwall:

[operations/dns@master] depool codfw (emergency patch, do not merge)

Reason:

Changes have been rolled out successfully

https://gerrit.wikimedia.org/r/924561

Change 911350 merged by BCornwall:

[operations/puppet@production] lvs: Switch text/upload 'sh' schedulers to 'mh'

https://gerrit.wikimedia.org/r/911350

BCornwall updated the task description.