Teardown lvs for wdqs public pool
Closed, Resolved; Public

Description

Now that the graph split cutover is complete, the next step is to tear down the old wdqs public LVS pool and free up the remaining hosts.

Coordinate with the Traffic team on the LVS teardown.

Event Timeline

RKemper renamed this task from Teardown lvs for wdqs public to Teardown lvs for wdqs public pool. Jun 2 2025, 7:06 AM

Mentioned in SAL (#wikimedia-operations) [2025-06-18T18:56:18Z] <ryankemper@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on 6 hosts with reason: T395772 hosts not serving production traffic

Gehel triaged this task as Medium priority. Jun 20 2025, 8:06 AM
Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.

Change #1182975 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: (step 3) shift service state to lvs_setup

https://gerrit.wikimedia.org/r/1182975

Change #1182976 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/dns@master] wdqs: (step 2) remove wdqs discovery dns records

https://gerrit.wikimedia.org/r/1182976

Change #1182977 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: (step 4) remove from LBs and wdqs backends

https://gerrit.wikimedia.org/r/1182977

Change #1182978 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: (steps 5,6) => final removal

https://gerrit.wikimedia.org/r/1182978

Hi, @RKemper! Would you like to schedule some time to go through this process?

Change #1182976 merged by BCornwall:

[operations/dns@master] wdqs: (step 2) remove wdqs discovery dns records

https://gerrit.wikimedia.org/r/1182976

Change #1182975 merged by BCornwall:

[operations/puppet@production] wdqs: (step 3) shift service state to lvs_setup

https://gerrit.wikimedia.org/r/1182975

Change #1182977 merged by BCornwall:

[operations/puppet@production] wdqs: (step 4) remove from LBs and wdqs backends

https://gerrit.wikimedia.org/r/1182977

Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:18:34Z] <brett> Restarting pybal on secondary eqiad/codfw lvs servers - T395772

Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:25:37Z] <brett> Restarting pybal on low-traffic eqiad/codfw lvs servers - T395772

Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:30:45Z] <brett> Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-secondary-codfw - T395772

Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:32:07Z] <brett> Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-secondary-eqiad - T395772

Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:33:39Z] <brett> Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-low-traffic-codfw - T395772

Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:34:21Z] <brett> Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-low-traffic-eqiad - T395772

Change #1182978 merged by BCornwall:

[operations/puppet@production] wdqs: (steps 5,6) => final removal

https://gerrit.wikimedia.org/r/1182978

Change #1189595 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: remove ipip

https://gerrit.wikimedia.org/r/1189595

Change #1189595 merged by Ryan Kemper:

[operations/puppet@production] wdqs: remove ipip

https://gerrit.wikimedia.org/r/1189595

Change #1189600 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: shift old wdqs-public hosts to test

https://gerrit.wikimedia.org/r/1189600

Change #1189600 merged by Ryan Kemper:

[operations/puppet@production] wdqs: shift old wdqs-public hosts to test

https://gerrit.wikimedia.org/r/1189600

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2016.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2017 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs2017.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2025-09-21T18:15:27Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772
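
For readers unfamiliar with the bracketed host notation in the downtime entry above, the following is a simplified sketch of how such an expression expands into individual hostnames. It is not the parser that cumin itself uses; `expand_hosts` is a hypothetical helper that assumes at most one bracket group per name and only comma-separated items or `start-end` ranges inside the brackets.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand a simple bracketed host expression into individual hostnames.

    Illustrative only: supports one [..] group per name, containing
    comma-separated numbers and/or start-end ranges.
    """
    hosts = []
    # Split on commas that are not inside square brackets.
    for part in re.split(r",(?![^\[]*\])", expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\](.*)", part)
        if match is None:
            # No bracket group: the part is already a plain hostname.
            hosts.append(part)
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                start, end = item.split("-")
                width = len(start)  # keep any zero padding
                for n in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_hosts(
        "wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet"
    ):
        print(host)
```

Under these assumptions, the expression from the log entry expands to five hosts: wdqs2009 and wdqs2016 in codfw, and wdqs1018 through wdqs1020 in eqiad.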

Change #1189979 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: point to wdqs-main svc for maxlag

https://gerrit.wikimedia.org/r/1189979

Change #1189979 merged by Ryan Kemper:

[operations/puppet@production] wdqs: point to wdqs-main svc for maxlag

https://gerrit.wikimedia.org/r/1189979

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2017 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs2017.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Removed the wdqs pybal files from the config-master hosts via sudo -E cumin 'A:config-master' 'rm -fv /srv/config-master/pybal/*/wdqs':

ryankemper@cumin2002:~$ sudo -E cumin 'A:config-master' 'rm -fv /srv/config-master/pybal/*/wdqs'
2 hosts will be targeted:
config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet
OK to proceed on 2 hosts? Enter the number of affected hosts to confirm or "q" to quit: 2
===== NODE GROUP =====
(2) config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet
----- OUTPUT of 'rm -fv /srv/conf...ter/pybal/*/wdqs' -----
removed '/srv/config-master/pybal/codfw/wdqs'
removed '/srv/config-master/pybal/eqiad/wdqs'
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (2/2) [00:00<00:00,  2.83hosts/s]
FAIL |                                                                                                                                               |   0% (0/2) [00:00<?, ?hosts/s]
100.0% (2/2) success ratio (>= 100.0% threshold) for command: 'rm -fv /srv/conf...ter/pybal/*/wdqs'.
100.0% (2/2) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
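
The glob in the command above matches only files named exactly wdqs, one directory level below the pybal directory. As a hedged illustration of that matching behaviour, here is a minimal dry-run sketch that lists what such a pattern would select before anything is deleted. `matching_pool_files` is a hypothetical helper, not part of cumin or pybal, and the sibling file name `wdqs-main` in the demo is used purely as an example of a name that must not match.

```python
import tempfile
from pathlib import Path


def matching_pool_files(root: Path, pool: str) -> list[Path]:
    """Return files named exactly `pool` one directory below `root`.

    Mirrors a shell glob of the form <root>/*/<pool>; nothing is deleted.
    """
    return sorted(p for p in root.glob(f"*/{pool}") if p.is_file())


if __name__ == "__main__":
    # Build a throwaway directory tree instead of touching a real host.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for dc in ("codfw", "eqiad"):
            (root / dc).mkdir()
            (root / dc / "wdqs").write_text("")
            (root / dc / "wdqs-main").write_text("")
        for path in matching_pool_files(root, "wdqs"):
            print(path.relative_to(root))
```

Listing the matches first makes it easy to confirm that similarly named files are left untouched before running the actual removal.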

Mentioned in SAL (#wikimedia-operations) [2025-09-25T21:27:51Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772

Change #1191513 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: these hosts no longer in wdqs-public

https://gerrit.wikimedia.org/r/1191513

Change #1191513 merged by Ryan Kemper:

[operations/puppet@production] wdqs: these hosts no longer in wdqs-public

https://gerrit.wikimedia.org/r/1191513

Mentioned in SAL (#wikimedia-operations) [2025-09-25T21:43:16Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772

Change #1191525 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: shift old full graph hosts to wdqs-main

https://gerrit.wikimedia.org/r/1191525

Change #1191525 merged by Ryan Kemper:

[operations/puppet@production] wdqs: shift old full graph hosts to new roles

https://gerrit.wikimedia.org/r/1191525