Now that the graph split cutover is complete, the next step is to tear down the old wdqs public LVS pool and free up the remaining hosts.
Coordinate with the Traffic team on the LVS teardown.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T335067 Epic: Wikidata Query Service stabilization |
| Resolved | | BTracy-WMF | T337013 [Epic] Splitting the graph in WDQS |
| Resolved | | RKemper | T395772 Teardown lvs for wdqs public pool |
| Resolved | | RKemper | T405978 Re-image remaining full graph hosts to post-graph-split roles |
| Invalid | | RKemper | T406587 Repeated reimage failures on WDQS hosts |
| Resolved | | bking | T406609 wdqs2017: Apparent hardware issue, rack C2 |
| Invalid | | bking | T406617 wdqs1020: Hanging during partitioning step of installation, rack E2 |
Mentioned in SAL (#wikimedia-operations) [2025-06-18T18:56:18Z] <ryankemper@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on 6 hosts with reason: T395772 hosts not serving production traffic
Change #1182975 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: (step 3) shift service state to lvs_setup
Change #1182976 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/dns@master] wdqs: (step 2) remove wdqs discovery dns records
Change #1182977 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: (step 4) remove from LBs and wdqs backends
Change #1182978 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: (steps 5,6) => final removal
Change #1182976 merged by BCornwall:
[operations/dns@master] wdqs: (step 2) remove wdqs discovery dns records
Change #1182975 merged by BCornwall:
[operations/puppet@production] wdqs: (step 3) shift service state to lvs_setup
Change #1182977 merged by BCornwall:
[operations/puppet@production] wdqs: (step 4) remove from LBs and wdqs backends
Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:18:34Z] <brett> Restarting pybal on secondary eqiad/codfw lvs servers - T395772
Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:25:37Z] <brett> Restarting pybal on low-traffic eqiad/codfw lvs servers - T395772
Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:30:45Z] <brett> Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-secondary-codfw - T395772
Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:32:07Z] <brett> Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-secondary-eqiad - T395772
Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:33:39Z] <brett> Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-low-traffic-codfw - T395772
Mentioned in SAL (#wikimedia-operations) [2025-09-18T21:34:21Z] <brett> Deleting wdqs, wdqs-heavy-queries, and wdqs-ssl ipvs services from A:lvs-low-traffic-eqiad - T395772
Change #1182978 merged by BCornwall:
[operations/puppet@production] wdqs: (steps 5,6) => final removal
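The four teardown patches were uploaded out of order but merged in the order their titles name (step 2, step 3, step 4, then steps 5,6). As a minimal sketch, assuming only the change numbers and titles shown in this log, the ordering can be recovered from the step markers. `CHANGES`, `step_key`, and `merge_order` are hypothetical names used for illustration and are not part of any Gerrit or Puppet tooling.

```python
import re

# Change numbers and titles copied from the log above.
CHANGES = {
    1182975: "wdqs: (step 3) shift service state to lvs_setup",
    1182976: "wdqs: (step 2) remove wdqs discovery dns records",
    1182977: "wdqs: (step 4) remove from LBs and wdqs backends",
    1182978: "wdqs: (steps 5,6) => final removal",
}


def step_key(title):
    """Return the first step number in a '(step N)' or '(steps N,M)' marker."""
    match = re.search(r"\(steps? (\d+)", title)
    if match is None:
        raise ValueError(f"no step marker in {title!r}")
    return int(match.group(1))


def merge_order(changes):
    """Order change numbers by the step their title names."""
    return sorted(changes, key=lambda number: step_key(changes[number]))


if __name__ == "__main__":
    for number in merge_order(CHANGES):
        print(number, CHANGES[number])
```

The resulting order puts the DNS change (#1182976) ahead of the three Puppet changes, which matches the merge sequence recorded above.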
Change #1189595 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: remove ipip
Change #1189595 merged by Ryan Kemper:
[operations/puppet@production] wdqs: remove ipip
Change #1189600 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: shift old wdqs-public hosts to test
Change #1189600 merged by Ryan Kemper:
[operations/puppet@production] wdqs: shift old wdqs-public hosts to test
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2016.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye executed with errors:
Mentioned in SAL (#wikimedia-operations) [2025-09-21T18:15:27Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772
Change #1189979 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: point to wdqs-main svc for maxlag
Change #1189979 merged by Ryan Kemper:
[operations/puppet@production] wdqs: point to wdqs-main svc for maxlag
Mentioned in SAL (#wikimedia-operations) [2025-09-21T18:40:08Z] <ryankemper> T395772 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1189979 to fix puppet failures on deploy servers
Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2017.codfw.wmnet with OS bullseye executed with errors:
Removed via sudo -E cumin 'A:config-master' 'rm -fv /srv/config-master/pybal/*/wdqs':
ryankemper@cumin2002:~$ sudo -E cumin 'A:config-master' 'rm -fv /srv/config-master/pybal/*/wdqs'
2 hosts will be targeted:
config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet
OK to proceed on 2 hosts? Enter the number of affected hosts to confirm or "q" to quit: 2
===== NODE GROUP =====
(2) config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet
----- OUTPUT of 'rm -fv /srv/conf...ter/pybal/*/wdqs' -----
removed '/srv/config-master/pybal/codfw/wdqs'
removed '/srv/config-master/pybal/eqiad/wdqs'
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (2/2) [00:00<00:00, 2.83hosts/s]
FAIL | | 0% (0/2) [00:00<?, ?hosts/s]
100.0% (2/2) success ratio (>= 100.0% threshold) for command: 'rm -fv /srv/conf...ter/pybal/*/wdqs'.
100.0% (2/2) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
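The glob in the command above matches only entries named exactly `wdqs`, one directory below the pybal root, so files for other pools in the same per-datacenter directories are left alone. A minimal Python sketch of the same matching behavior follows, assuming a local directory tree; `remove_pool_files` is a hypothetical helper for illustration and is not how cumin executes the command.

```python
from pathlib import Path


def remove_pool_files(root, name):
    """Delete root/*/name, roughly like `rm -fv root/*/name`.

    Returns the list of removed paths. Directories are skipped,
    since plain `rm -f` does not remove them either.
    """
    removed = []
    for path in sorted(Path(root).glob(f"*/{name}")):
        if path.is_file() or path.is_symlink():
            path.unlink()
            print(f"removed '{path}'")
            removed.append(path)
    return removed
```

Calling `remove_pool_files("/srv/config-master/pybal", "wdqs")` on a host with that layout would target the same two paths listed in the output above. A sibling file such as a hypothetical `eqiad/wdqs-main` would not match the pattern.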
Mentioned in SAL (#wikimedia-operations) [2025-09-25T21:27:51Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772
Change #1191513 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: these hosts no longer in wdqs-public
Change #1191513 merged by Ryan Kemper:
[operations/puppet@production] wdqs: these hosts no longer in wdqs-public
Mentioned in SAL (#wikimedia-operations) [2025-09-25T21:43:16Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wdqs[2009,2016].codfw.wmnet,wdqs[1018-1020].eqiad.wmnet with reason: T395772
Change #1191525 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: shift old full graph hosts to wdqs-main
Change #1191525 merged by Ryan Kemper:
[operations/puppet@production] wdqs: shift old full graph hosts to new roles