
Cutover wdqs-internal to new split endpoints
Closed, Resolved, Public

Description

Order of operations:

  • Precondition: T376150 is complete, so we have a pool of codfw hosts that can serve the existing wdqs-internal traffic. Depool eqiad at the DNS level, and depool every codfw host that isn't one of the 5 chosen in T376150.
  • Take the old eqiad and codfw wdqs-internal hosts offline and reconfigure them as necessary for the new graph split.
  • Bring the hosts back online.
  • Return the 5 codfw hosts from T376150 to wdqs full graph service.
  • Tear down the old wdqs-internal service completely and set up eqiad LVS for wdqs-internal-[main,scholarly].

Event Timeline

Gehel triaged this task as High priority.Nov 8 2024, 2:22 PM

Change #1094074 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs-internal: bring graph split into production

https://gerrit.wikimedia.org/r/1094074

Change #1094074 merged by Ryan Kemper:

[operations/puppet@production] wdqs-internal: bring graph split into production

https://gerrit.wikimedia.org/r/1094074

WDQS Internal Cutover Plan

Step 0: Run curl tests from the mediawiki hosts to confirm they can reach the wdqs-internal graph split endpoints through envoy
Step 1: Flip wikibaseconstraints in mediawiki-config (https://phabricator.wikimedia.org/T374021)

  • Q: How does the change go live? I think mediawiki-config changes propagate without any restarts, but we should verify.

Step 2: Start switching hosts over from wdqs-internal (non-graph-split) to wdqs-internal graph split
Step 3: Tear down the old pybal/lvs/etc. config
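
Step 0 could be scripted along these lines. This is only a sketch, not the check that was actually run: the listener URL (including its port) is a hypothetical placeholder, since the task does not state the real envoy address, and the trivial ASK query is an assumed probe.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: the real envoy listener address/port for the
# graph split endpoints is not given in this task.
ASSUMED_LISTENER = "http://localhost:6041/sparql"


def build_probe(base_url: str = ASSUMED_LISTENER) -> urllib.request.Request:
    """Build a GET request carrying a trivial SPARQL ASK query."""
    query = urllib.parse.urlencode({"query": "ASK { ?s ?p ?o }"})
    return urllib.request.Request(
        f"{base_url}?{query}",
        headers={"Accept": "application/sparql-results+json"},
    )


def endpoint_answers(base_url: str = ASSUMED_LISTENER, timeout: float = 5.0) -> bool:
    """Return True if the endpoint replies with a well-formed ASK result."""
    try:
        with urllib.request.urlopen(build_probe(base_url), timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):
        # Connection errors, HTTP errors, timeouts, or a non-JSON body.
        return False
    return isinstance(body, dict) and isinstance(body.get("boolean"), bool)
```

Running endpoint_answers() once per new endpoint from a mediawiki host gives a yes/no answer before Step 1 is attempted.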

Remaining work:

  • Tear down LVS components for wdqs-internal
  • Convert old wdqs-internal hosts into wdqs-internal graph split and/or public wdqs graph split hosts

Change #1136740 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/dns@master] wdqs-internal: remove disc records

https://gerrit.wikimedia.org/r/1136740

Change #1136744 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs-internal: move back to lvs_setup

https://gerrit.wikimedia.org/r/1136744

Change #1136747 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs-internal: remove from LBs and backend servers

https://gerrit.wikimedia.org/r/1136747

Change #1136756 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs-internal: remove service catalog entry

https://gerrit.wikimedia.org/r/1136756

Change #1136757 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs-internal: rip out remaining logic/config

https://gerrit.wikimedia.org/r/1136757

Change #1139936 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/dns@master] wdqs-internal: remove lvs VIP

https://gerrit.wikimedia.org/r/1139936

Change #1136740 merged by Ryan Kemper:

[operations/dns@master] wdqs-internal: remove disc records

https://gerrit.wikimedia.org/r/1136740

Change #1136744 merged by Ryan Kemper:

[operations/puppet@production] wdqs-internal: move back to lvs_setup

https://gerrit.wikimedia.org/r/1136744

Mentioned in SAL (#wikimedia-operations) [2025-05-01T18:26:48Z] <ryankemper> T376151 (wdqs-internal lvs teardown) Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136744 to flip wdqs-internal service state to lvs_setup and running puppet across A:dnsbox

Change #1136747 merged by Ryan Kemper:

[operations/puppet@production] wdqs-internal: remove from LBs and backend servers

https://gerrit.wikimedia.org/r/1136747

Mentioned in SAL (#wikimedia-operations) [2025-05-01T18:44:54Z] <ryankemper> T376151 [wdqs-internal lvs teardown -> pybal rolling restart] ran puppet on O:Lvs::balancer after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136747

Mentioned in SAL (#wikimedia-operations) [2025-05-01T18:48:38Z] <ryankemper> T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on A:lvs-secondary-eqiad, it only restarted on lvs1020 but for some reason lvs1013 doesn't have a pybal service running

Mentioned in SAL (#wikimedia-operations) [2025-05-01T18:55:30Z] <ryankemper> T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on A:lvs-low-traffic-eqiad (lvs1019), waiting few mins before proceeding

Mentioned in SAL (#wikimedia-operations) [2025-05-01T18:58:18Z] <ryankemper> T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on A:lvs-secondary-codfw (lvs2014), waiting 2 mins before proceeding

Mentioned in SAL (#wikimedia-operations) [2025-05-01T18:59:44Z] <ryankemper> T376151 [wdqs-internal lvs teardown -> pybal rolling restart] Restarted pybal on A:lvs-low-traffic-codfw (lvs2013)

Mentioned in SAL (#wikimedia-operations) [2025-05-01T19:03:05Z] <ryankemper> T376151 [wdqs-internal lvs teardown -> pybal rolling restart] ipvsadm --delete-service --tcp-service 10.2.1.41:80 on A:lvs-secondary-codfw OR A:lvs-low-traffic-codfw(lvs2013, lvs2014)

Mentioned in SAL (#wikimedia-operations) [2025-05-01T19:04:21Z] <ryankemper> T376151 [wdqs-internal lvs teardown -> pybal rolling restart] ipvsadm --delete-service --tcp-service 10.2.2.41:80 on lvs1019 and lvs1020

Mentioned in SAL (#wikimedia-operations) [2025-05-01T19:09:41Z] <ryankemper> T376151 [wdqs-internal lvs teardown -> pybal rolling restart] all IPVS diff check alerts have recovered, rolling restart complete
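
A small check like the following can confirm that the manual ipvsadm deletions above left no stale virtual service behind. It is a sketch that assumes the usual `ipvsadm -L -n` text layout; the sample listing is made up for illustration (the real-server address in it is not from this task), and the helper name is hypothetical.

```python
def ipvs_has_service(listing: str, vip: str, port: int, proto: str = "TCP") -> bool:
    """Return True if `listing` (text from `ipvsadm -L -n`) defines proto vip:port."""
    target = f"{vip}:{port}"
    for line in listing.splitlines():
        fields = line.split()
        # Virtual-service lines look like "TCP  10.2.2.41:80 wrr";
        # real-server lines start with "->" and are skipped by the proto test.
        if len(fields) >= 2 and fields[0] == proto and fields[1] == target:
            return True
    return False


# Illustrative listing only; the real-server address is invented.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.2.2.41:80 wrr
  -> 10.64.0.10:80                Route   10     0          0
"""

if __name__ == "__main__":
    print(ipvs_has_service(SAMPLE, "10.2.2.41", 80))
    print(ipvs_has_service(SAMPLE, "10.2.1.41", 80))
```

On a load balancer, the listing would come from the command's real output rather than SAMPLE, and a True result for either wdqs-internal VIP would mean the service still needs deleting.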

Mentioned in SAL (#wikimedia-operations) [2025-05-01T19:09:57Z] <ryankemper> T376151 [wdqs-internal lvs teardown] running puppet across A:wdqs-internal now that pybal has been restarted

Change #1140531 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: remove realserver includes

https://gerrit.wikimedia.org/r/1140531

Change #1140531 merged by Ryan Kemper:

[operations/puppet@production] wdqs: remove realserver includes

https://gerrit.wikimedia.org/r/1140531

Change #1136756 merged by Ryan Kemper:

[operations/puppet@production] wdqs-internal: remove service catalog entry

https://gerrit.wikimedia.org/r/1136756

Change #1136757 merged by Ryan Kemper:

[operations/puppet@production] wdqs-internal: rip out remaining logic/config

https://gerrit.wikimedia.org/r/1136757

Change #1140535 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs-internal: remove old alias

https://gerrit.wikimedia.org/r/1140535

Change #1140535 merged by Bking:

[operations/puppet@production] wdqs-internal: remove old alias

https://gerrit.wikimedia.org/r/1140535

Change #1139936 merged by Ryan Kemper:

[operations/dns@master] wdqs-internal: remove lvs VIP

https://gerrit.wikimedia.org/r/1139936

Mentioned in SAL (#wikimedia-operations) [2025-05-01T20:50:23Z] <ryankemper> T376151 [wdqs-internal lvs teardown] Surrendered 10.2.2.41/32 (eqiad wdqs-internal vip) and 10.2.1.41/32 (codfw wdqs-internal vip) from netbox interface

Mentioned in SAL (#wikimedia-operations) [2025-05-01T20:54:53Z] <ryankemper> T376151 [wdqs-internal lvs teardown] sudo rm -fv /srv/config-master/pybal/eqiad/wdqs-internal && sudo rm -fv /srv/config-master/pybal/codfw/wdqs-internal on config-master[1,2]001

Mentioned in SAL (#wikimedia-operations) [2025-05-01T21:01:00Z] <ryankemper> T376151 [wdqs-internal lvs teardown] sudo etcdctl -C https://conf1007.eqiad.wmnet:4001 --username root rmdir /conftool/v1/pools/eqiad/wdqs-internal/wdqs && sudo etcdctl -C https://conf1007.eqiad.wmnet:4001 --username root rmdir /conftool/v1/pools/eqiad/wdqs-internal/

Mentioned in SAL (#wikimedia-operations) [2025-05-01T21:01:28Z] <ryankemper> T376151 [wdqs-internal lvs teardown] sudo etcdctl -C https://conf1007.eqiad.wmnet:4001 --username root rmdir /conftool/v1/pools/codfw/wdqs-internal/wdqs && sudo etcdctl -C https://conf1007.eqiad.wmnet:4001 --username root rmdir /conftool/v1/pools/codfw/wdqs-internal/
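
The two etcdctl entries above follow one pattern per datacenter: remove the service directory first, then its parent. The sketch below only generates those command strings for review; it does not run them. The endpoint and paths are copied from the log entries, the function name is hypothetical, and the child-before-parent order reflects the assumption that rmdir only removes empty directories.

```python
import shlex

ETCD_ENDPOINT = "https://conf1007.eqiad.wmnet:4001"  # taken from the log entries


def rmdir_commands(datacenter: str, cluster: str = "wdqs-internal",
                   service: str = "wdqs") -> list[str]:
    """Return the etcdctl rmdir commands for one datacenter, child directory first."""
    base = f"/conftool/v1/pools/{datacenter}/{cluster}"
    paths = [f"{base}/{service}", f"{base}/"]  # child first, then parent
    return [
        shlex.join(["sudo", "etcdctl", "-C", ETCD_ENDPOINT,
                    "--username", "root", "rmdir", path])
        for path in paths
    ]


if __name__ == "__main__":
    for dc in ("eqiad", "codfw"):
        for command in rmdir_commands(dc):
            print(command)
```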

Mentioned in SAL (#wikimedia-operations) [2025-05-01T21:03:01Z] <ryankemper> T376151 [wdqs-internal lvs teardown] Declaring this officially done. No more irc log spam from me today :)

Change #1140547 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Remove references to wdqs-internal listenter in fixtures

https://gerrit.wikimedia.org/r/1140547

Change #1140547 merged by jenkins-bot:

[operations/deployment-charts@master] Remove references to wdqs-internal listenter in fixtures

https://gerrit.wikimedia.org/r/1140547

Icinga downtime and Alertmanager silence (ID=67961f02-3a38-486d-8f1d-5d6bb2760fe4) set by bking@cumin2002 for 6 days, 0:00:00 on 1 host(s) and their services with reason: bringing host online after reimage

wdqs1017.eqiad.wmnet

Update on the wdqs1017 reimage that @RKemper started last night:

  • Reimage worked, but Puppet failed. I had to sign the cert manually and run Puppet by hand with puppet agent -tv from install-console.
  • Puppet "works", but cannot compile. Per our docs, I believe it needs a scap deploy. Ryan and/or I will investigate further as time permits.
bking updated Other Assignee, added: bking.