Create separate pybal pools for wdqs graph split (main vs scholarly)
Closed, Resolved, Public

Description

Context

Currently we have one pybal pool per DC for public wdqs (https://config-master.wikimedia.org/pybal/eqiad/wdqs), corresponding to query.wikidata.org, and a separate one for wdqs-internal.

We'll ultimately want to split the public wdqs into two pybal pools: wdqs-main and wdqs-scholarly. Among other things, the separate pools will allow us to shift hosts over from one type of graph split host to the other in response to evolving usage.

AC
  • Pools exist for wdqs-main and wdqs-scholarly, instead of a single monolithic public wdqs pool
  • Tacked on: Decide what to do about wdqs-internal, i.e. can we just have wdqs-internal hosts contain the journal for the wdqs-main graph but not the wdqs-scholarly one?

Event Timeline

@Stevemunene and I paired on deciding some initial host/pybal allocations. The allocation below assumes we need 2 hosts per pool to keep pybal happy, but it can be adjusted if we're fine starting with just 1 host for scholarly.

EQIAD
    main
        1021 (current wdqs-public)
        1022 (current test host)
    scholarly
        1023 (current test host)
        1024 (current test host)

CODFW
    main
        2021 (current wdqs-public)
        2022 (current wdqs-public)
    scholarly
        2023 (current test host)
        2024 (current wdqs-public)
    test
        2025 (current wdqs-public)

That would leave us with the following numbers for public wdqs:

eqiad-public: 7 hosts
codfw-public: 11 hosts
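
As a rough sanity check of the allocation above, the sketch below counts hosts per pool and flags any serving pool that falls under the assumed two-host minimum. The table format and the `pool_size` helper are hypothetical, written only for illustration; the host numbers are copied from the plan.

```shell
# Hypothetical allocation table, one "<dc> <pool> <host-number>" entry per line,
# copied from the proposed plan.
allocation='eqiad main 1021
eqiad main 1022
eqiad scholarly 1023
eqiad scholarly 1024
codfw main 2021
codfw main 2022
codfw scholarly 2023
codfw scholarly 2024
codfw test 2025'

# Count the hosts assigned to a given datacenter and pool.
pool_size() {
    printf '%s\n' "$allocation" |
        awk -v dc="$1" -v pool="$2" '$1 == dc && $2 == pool { n++ } END { print n + 0 }'
}

# Report whether each serving pool meets the assumed two-host minimum.
for dc in eqiad codfw; do
    for pool in main scholarly; do
        n=$(pool_size "$dc" "$pool")
        if [ "$n" -ge 2 ]; then status=ok; else status=too-small; fi
        printf '%s/%s: %s hosts (%s)\n' "$dc" "$pool" "$n" "$status"
    done
done
```

With the plan as written, all four serving pools report 2 hosts. If scholarly were started with a single host, the check would flag it as too-small.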

Change #1054342 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] wdqs: add main and scholarly role assignments

https://gerrit.wikimedia.org/r/1054342

Change #1054520 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] [WIP] wdqs: create wdqs split pybal pools

https://gerrit.wikimedia.org/r/1054520

Change #1046123 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add main and scholarly puppet config

https://gerrit.wikimedia.org/r/1046123

Change #1056230 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs graph split: fix tab alignment

https://gerrit.wikimedia.org/r/1056230

Change #1056230 merged by Ryan Kemper:

[operations/puppet@production] wdqs graph split: fix tab alignment

https://gerrit.wikimedia.org/r/1056230

Change #1046120 abandoned by Stevemunene:

[operations/puppet@production] [WIP] wdqs: create wdqs split pybal pools

Reason:

Duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054520

https://gerrit.wikimedia.org/r/1046120

Depooled the relevant hosts that will no longer be in wdqs-public:

sudo -E cumin 'wdqs1021*,wdqs2021*,wdqs2022*,wdqs2024*,wdqs2025*' 'depool'
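
A minimal sketch of how the host query above could be assembled and reviewed before running it. `cumin_query` is a hypothetical helper, not part of cumin, and the snippet only prints the command rather than executing it.

```shell
# Build a comma-separated cumin host query such as 'wdqs1021*,wdqs2021*'
# from a list of host numbers (hypothetical helper).
cumin_query() {
    query=''
    for num in "$@"; do
        # Prepend a comma separator only when the query is already non-empty.
        query="${query:+$query,}wdqs${num}*"
    done
    printf '%s\n' "$query"
}

# Dry run: print the depool command for review instead of executing it.
printf "sudo -E cumin '%s' 'depool'\n" "$(cumin_query 1021 2021 2022 2024 2025)"
```

Printing the command first makes it easy to confirm that only hosts leaving wdqs-public are targeted.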

Change #1054342 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add main and scholarly role assignments

https://gerrit.wikimedia.org/r/1054342

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1021.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2024-08-03T00:54:18Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on wdqs[2021-2022,2024-2025].codfw.wmnet with reason: T364368 rejiggering hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-03T00:54:38Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on wdqs[2021-2022,2024-2025].codfw.wmnet with reason: T364368 rejiggering hosts
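
For readers unfamiliar with the bracketed host-range notation in the SAL entries above, here is a rough sketch of how such a range expands to individual hosts. `expand_range` is a hypothetical helper written for illustration; it handles only the simple `start-end` and single-number forms seen in this task, not the full syntax cumin accepts.

```shell
# Expand a range list like "2021-2022,2024-2025" into full hostnames
# (hypothetical helper; supports only "start-end" and single numbers).
expand_range() {
    prefix="$1"; ranges="$2"; suffix="$3"
    old_ifs="$IFS"; IFS=','
    for r in $ranges; do
        start="${r%-*}"   # text before the dash (whole item if no dash)
        end="${r#*-}"     # text after the dash (whole item if no dash)
        i="$start"
        while [ "$i" -le "$end" ]; do
            printf '%s%s%s\n' "$prefix" "$i" "$suffix"
            i=$((i + 1))
        done
    done
    IFS="$old_ifs"
}

# The range from the downtime entry above: wdqs[2021-2022,2024-2025].codfw.wmnet
expand_range wdqs 2021-2022,2024-2025 .codfw.wmnet
```

This lists wdqs2021, wdqs2022, wdqs2024 and wdqs2025 in codfw, which matches the codfw hosts in the depool command.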

Mentioned in SAL (#wikimedia-operations) [2024-08-03T01:15:18Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on 9 hosts with reason: T364368 rejiggering hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-03T01:15:33Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on 9 hosts with reason: T364368 rejiggering hosts

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye

Change #1059441 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs graph-split: temp remove main/scholarly pools

https://gerrit.wikimedia.org/r/1059441

Change #1059441 merged by Ryan Kemper:

[operations/puppet@production] wdqs graph-split: temp remove main/scholarly pools

https://gerrit.wikimedia.org/r/1059441

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1023 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1023.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030114_ryankemper_950847_wdqs1021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1021.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1022.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1022 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030153_ryankemper_994613_wdqs1022.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1022.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1023 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408051728_ryankemper_680045_wdqs1023.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1023.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1023 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1023.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye

Change #1060902 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: update scap wdqs hostlist

https://gerrit.wikimedia.org/r/1060902

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1024.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1024 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1024.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2021.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Change #1060902 merged by Ryan Kemper:

[operations/puppet@production] wdqs: update scap wdqs hostlist

https://gerrit.wikimedia.org/r/1060902

Mentioned in SAL (#wikimedia-operations) [2024-08-09T04:40:09Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 15:00:00 on 9 hosts with reason: T364368 non-prod hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-09T04:40:35Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on 9 hosts with reason: T364368 non-prod hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-15T07:31:01Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 10:00:00 on 9 hosts with reason: T364368 non-prod hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-15T07:31:16Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 10:00:00 on 9 hosts with reason: T364368 non-prod hosts

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2024 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2024.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2022 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2022.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2025 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2025.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2023 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2023.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2024 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2024.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2024-08-20T06:43:40Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 18:00:00 on wdqs[2021-2023,2025].codfw.wmnet with reason: T364368 non-prod hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-20T06:43:43Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 18:00:00 on wdqs[2021-2023,2025].codfw.wmnet with reason: T364368 non-prod hosts

Considering T371833, should we remove wdqs2025 as a test host and reassign it, or do we plan on retaining one test instance/endpoint, say query-full-experimental?

Change #1064473 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: add graph split hosts to conftool_data

https://gerrit.wikimedia.org/r/1064473

Change #1064473 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add graph split hosts to conftool_data

https://gerrit.wikimedia.org/r/1064473

Change #1064479 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: new -main, -scholarly services

https://gerrit.wikimedia.org/r/1064479

Change #1064829 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: add wdqs2024 to scholarly pool

https://gerrit.wikimedia.org/r/1064829

Mentioned in SAL (#wikimedia-operations) [2024-08-22T19:01:50Z] <ryankemper> T364368 Pooled all wdqs main/scholarly hosts except wdqs2024, which won't be ready for another hour

Change #1064829 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add wdqs2024 to scholarly pool

https://gerrit.wikimedia.org/r/1064829

Mentioned in SAL (#wikimedia-operations) [2024-08-22T19:31:13Z] <ryankemper> T364368 Pooled wdqs2024 (its data transfer has completed successfully)

Change #1064840 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: -main and -scholarly are different services

https://gerrit.wikimedia.org/r/1064840

Change #1064843 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: Prepare to configure the load balancers

https://gerrit.wikimedia.org/r/1064843

Change #1064848 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: move -main and -scholarly to production

https://gerrit.wikimedia.org/r/1064848

Change #1054520 abandoned by Ryan Kemper:

[operations/puppet@production] wdqs: create wdqs split pybal pools

Reason:

duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064479

https://gerrit.wikimedia.org/r/1054520

Change #1064840 merged by Ryan Kemper:

[operations/puppet@production] wdqs: -main and -scholarly are different services

https://gerrit.wikimedia.org/r/1064840

Change #1064843 merged by Ryan Kemper:

[operations/puppet@production] wdqs: Prepare to configure the load balancers

https://gerrit.wikimedia.org/r/1064843

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:23:43Z] <ryankemper> T364368 [codfw] sudo ipvsadm -L -n on lvs secondary looks good, proceeding

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:24:05Z] <ryankemper> T364368 [codfw] Restarted lvs primary: sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service'

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:25:24Z] <ryankemper> T364368 [codfw] sudo ipvsadm -L -n on lvs primary looks good, all done with lvs restarts

LVS restarts have completed successfully, following the steps below:

sudo cumin 'A:lvs and (A:eqiad or A:codfw)' 'disable-puppet "adding new services wdqs-main & wdqs-scholarly"'
!log T364368 Disabled puppet on all lvs hosts in preparation for rolling restart
(merge patch)

[EQIAD]

sudo cumin 'A:lvs and A:eqiad' 'run-puppet-agent --enable "adding new services wdqs-main & wdqs-scholarly"'
!log T364368 [eqiad] enabled puppet on eqiad lvs hosts, expecting alerts soon

ack alerts

sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service'
!log T364368 [eqiad] Restarted lvs secondary: `sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service'`


sudo cumin 'A:lvs-secondary-eqiad' 'ipvsadm -L -n'
# wait 120s while looking at https://icinga.wikimedia.org/alerts
!log T364368 [eqiad] `sudo ipvsadm -L -n` on lvs secondary looks good, proceeding


sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service'
!log T364368 [eqiad] Restarted lvs primary: `sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service'`


sudo cumin 'A:lvs-low-traffic-eqiad' 'ipvsadm -L -n'
# wait 120s while looking at https://icinga.wikimedia.org/alerts
!log T364368 [eqiad] `sudo ipvsadm -L -n` on lvs primary looks good, proceeding

curl -v -k http://wdqs-main.svc.eqiad.wmnet:80/
curl -v -k http://wdqs-scholarly.svc.eqiad.wmnet:80/

[CODFW]

sudo cumin 'A:lvs and A:codfw' 'run-puppet-agent --enable "adding new services wdqs-main & wdqs-scholarly"'
!log T364368 [codfw] ran puppet on codfw lvs hosts, expecting alerts soon

ack alerts

sudo cumin 'A:lvs-secondary-codfw' 'systemctl restart pybal.service'
!log T364368 [codfw] Restarted lvs secondary: `sudo cumin 'A:lvs-secondary-codfw' 'systemctl restart pybal.service'`


sudo cumin 'A:lvs-secondary-codfw' 'ipvsadm -L -n'
# wait 120s while looking at https://icinga.wikimedia.org/alerts
!log T364368 [codfw] `sudo ipvsadm -L -n` on lvs secondary looks good, proceeding


sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service'
!log T364368 [codfw] Restarted lvs primary: `sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service'`

sudo cumin 'A:lvs-low-traffic-codfw' 'ipvsadm -L -n'
# wait 120s while looking at https://icinga.wikimedia.org/alerts
!log T364368 [codfw] `sudo ipvsadm -L -n` on lvs primary looks good, all done with lvs restarts

curl -v -k http://wdqs-main.svc.codfw.wmnet:80/
curl -v -k http://wdqs-scholarly.svc.codfw.wmnet:80/
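
The per-datacenter part of the steps above follows the same pattern for eqiad and codfw: restart pybal on the secondary, verify, then do the same on the primary, and finally probe both new services. A small sketch that prints that sequence for a given datacenter is shown below. `lvs_restart_plan` is a hypothetical helper; it only prints commands for review, and it leaves out the puppet enable, alert acknowledgement, wait, and `!log` steps.

```shell
# Print (not run) the pybal restart and verification commands for one datacenter,
# secondary first, then primary (hypothetical helper).
lvs_restart_plan() {
    dc="$1"
    for target in "lvs-secondary-$dc" "lvs-low-traffic-$dc"; do
        printf "sudo cumin 'A:%s' 'systemctl restart pybal.service'\n" "$target"
        printf "sudo cumin 'A:%s' 'ipvsadm -L -n'\n" "$target"
    done
    # Probe both new service endpoints once the restarts are done.
    for svc in wdqs-main wdqs-scholarly; do
        printf 'curl -v -k http://%s.svc.%s.wmnet:80/\n' "$svc" "$dc"
    done
}

lvs_restart_plan eqiad
```

Restarting the secondary first means a bad config shows up on a host that is not carrying traffic before the primary is touched.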

Change #1064848 merged by Ryan Kemper:

[operations/puppet@production] wdqs: move -main and -scholarly to production

https://gerrit.wikimedia.org/r/1064848

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:42:27Z] <ryankemper> T364368 [codfw] sudo ipvsadm -L -n on lvs primary looks good, all done with lvs restarts

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:43:15Z] <ryankemper> T364368 Merged patch to move lvs state to production for wdqs-main and wdqs-scholarly (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064848) and ran puppet on all LVS hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:42:27Z] <ryankemper> T364368 [codfw] sudo ipvsadm -L -n on lvs primary looks good, all done with lvs restarts

Copy-paste error: this log message can be ignored, as this step was already performed earlier.

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:45:09Z] <ryankemper> T364368 Merged patch to add dns discovery resources for wdqs-main and wdqs-scholarly (https://gerrit.wikimedia.org/r/c/operations/dns/+/1064831), and ran puppet on all DNS hosts

Change #1064479 merged by Ryan Kemper:

[operations/puppet@production] wdqs: new -main, -scholarly services

https://gerrit.wikimedia.org/r/1064479

Change #1067383 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs-main, wdqs-scholarly: use TLS for pybal pools

https://gerrit.wikimedia.org/r/1067383

Change #1067383 merged by Bking:

[operations/puppet@production] wdqs-main, wdqs-scholarly: use TLS for pybal pools

https://gerrit.wikimedia.org/r/1067383

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:08:46Z] <ryankemper> T364368 Disabled puppet on all lvs hosts in preparation for rolling restart

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:13:50Z] <ryankemper> T364368 Ran puppet on A:lvs-secondary-eqiad and restarted pybal.service

Change #1067388 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs-main, wdqs-scholarly: use HTTPS for health check

https://gerrit.wikimedia.org/r/1067388

Change #1067388 merged by Bking:

[operations/puppet@production] wdqs-main, wdqs-scholarly: use HTTPS for health check

https://gerrit.wikimedia.org/r/1067388

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:24:54Z] <ryankemper> T364368 ryankemper@cumin2002:~$ sudo cumin 'A:lvs-secondary-eqiad' 'systemctl status pybal.service'

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:37:18Z] <ryankemper> T364368 Ran puppet on A:lvs-low-traffic-eqiad and restarted pybal.service

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:40:26Z] <ryankemper> T364368 Cleared away old ipvs entries for 10.2.2.33:80 and 10.2.2.36:80

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:47:45Z] <ryankemper> T364368 Ran puppet on A:lvs-secondary-codfw, restarted pybal.service, and cleared away old ipvs entries for 10.2.1.33:80 and 10.2.1.36:80

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:50:55Z] <ryankemper> T364368 Ran puppet on A:lvs-low-traffic-codfw, restarted pybal.service, and cleared away old ipvs entries for 10.2.1.33:80 and 10.2.1.36:80
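
The SAL entries above do not record the exact command used to clear the old port-80 entries. One plausible form, assuming the stale entries were TCP virtual services, is `ipvsadm -D -t <address:port>`. The sketch below only prints the commands for review, and `stale_service_cmd` is a hypothetical helper.

```shell
# Print the ipvsadm command that would delete one stale TCP virtual service
# (assumption: the old entries are TCP services; hypothetical helper).
stale_service_cmd() {
    printf 'sudo ipvsadm -D -t %s\n' "$1"
}

# Addresses taken from the SAL entries: 10.2.2.x was cleared in the eqiad steps,
# 10.2.1.x in the codfw steps.
for addr in 10.2.2.33:80 10.2.2.36:80 10.2.1.33:80 10.2.1.36:80; do
    stale_service_cmd "$addr"
done
```

Running `sudo ipvsadm -L -n` before and after shows whether the entries are gone.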

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:54:10Z] <ryankemper> T364368 Our LVS operation is done; I've enabled/ran puppet on the remaining lvs hosts

Stevemunene updated the task description.