Put the alert1002 and alert2002 hosts in production
Closed, Resolved, Public

Description

Overview

This task tracks the hardware migration of the alert1002 and alert2002 hosts to replace the alert1001 and alert2001 hosts.

All four hosts run Debian Bookworm, so the Puppet role is expected to work as is.

Proposed solution:

Stage 1: Prepare hosts

  1. Add the alert1002 and alert2002 hosts to the Acme Chief list of authorized domains.
    1. Merge Gerrit patch #1064107 - alert: Add the alertx002 hosts to acme chief
  2. Apply the alerting_host role to the alert1002 and alert2002 hosts.
    1. Merge Gerrit patch #1062444 - alert: Ensure the alert*002 hosts use the alerting_host role
    2. Run Puppet on the alert[12]002 hosts: sudo cumin 'alert*[02]*' 'run-puppet-agent'
  3. Arm the keyholder agent on the alert1002 and alert2002 hosts.
    1. Use the metamonitor passphrase pwstore/pw.git/metamonitor-key-passphrase
    2. Ensure the keyholder agent is armed: sudo cumin 'alert*002.wikimedia.org' 'keyholder status'
  4. Verify the IP addresses are propagated across the infrastructure
  5. Add the new hosts to the Prometheus Blackbox exporter list
    1. Merge Gerrit patch #1064097 - alert: Add the alertx002 hosts to Prometheus blackbox exporter
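
Before moving on to Stage 2, it can help to confirm that the new hostnames were added everywhere the old ones appear. The following is a minimal sketch, assuming a local checkout of a configuration repository such as puppet.git; the function name files_missing_new_host is hypothetical and not part of any existing tooling. It lists files that mention the old host but never mention its replacement.

```shell
# Hypothetical helper: list files under DIR that mention OLD but never NEW.
# Assumes a local checkout of a configuration repository.
files_missing_new_host() {
  dir=$1
  old=$2
  new=$3
  # -r: recurse, -l: print matching file names only, -F: fixed-string match
  grep -rlF -- "$old" "$dir" | while IFS= read -r file; do
    # Print the file only when the replacement hostname is absent from it
    grep -qF -- "$new" "$file" || printf '%s\n' "$file"
  done | sort
}

# Example (path is an assumption):
#   files_missing_new_host ~/puppet alert1001 alert1002
```

A file in the output is only a candidate for review; some references to the old hosts may be intentional until the decommission.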

Stage 2: Enable hosts as passive alertmanagers

  1. Allow connections from the alert[12]002 addresses.
    1. Merge Gerrit patch #1064818 - alert: Allow connections from the alertx002 addresses
    2. Merge Gerrit patch #1064821 - alert: Allow Apache2 connections for the alertx002 hosts
    3. Run Puppet on the alert hosts: sudo cumin 'alert*' 'run-puppet-agent'
  2. Add the alert1002 and alert2002 hosts as Icinga and Alertmanager partners so they work as passive hosts.
    1. Merge Gerrit patch #1064820 - alert: Add the alertx002 hosts as Icinga and AM partners
    2. Run Puppet on the alert hosts: sudo cumin 'alert*' 'run-puppet-agent'
  3. Enable the alert[12]002 hosts as alertmanagers
    1. Merge Gerrit patch #1072318 - alert: Enable the alertx002 hosts as alertmanagers
  4. Verify the hosts are working as intended as standby hosts (e.g. no puppet or unit failures)
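
The check for unit failures on the standby hosts could be sketched as below. This is an illustration only: the function name failed_units is hypothetical, and it assumes the input is the output of systemctl --failed --no-legend --plain, where each line starts with the unit name.

```shell
# Hypothetical helper: print the name of each failed unit, one per line.
# Reads the output of `systemctl --failed --no-legend --plain` on stdin.
failed_units() {
  # NF skips empty lines; the first field is the unit name
  awk 'NF { print $1 }'
}

# Example, run on a standby host:
#   systemctl --failed --no-legend --plain | failed_units
```

Empty output means no failed units were reported at the time of the check.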

Stage 3: Make alert2002 the active alertmanager host

  1. Disable meta-monitoring for the alert hosts.
    1. SSH as root into wikitech-static.wikimedia.org with the metamonitor-key-passphrase.
    2. Comment out the following crontab entries to stop meta-monitoring against both alert hosts:
      • */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org
      • */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2001.wikimedia.org
  2. Stop services in the alert1001 host.
  3. Make alert2002 the active host.
    1. Merge Gerrit patch #1071700 - alert: Failover from alert1001 to alert2002
    2. Run Puppet on the alert hosts: sudo cumin 'alert*' 'run-puppet-agent'
    3. Merge Gerrit patch #1072326 - alert: Resolve alerts DNS queries to alert2002
    4. Update DNS records: $ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
  4. Ensure services work as expected.
  5. Enable meta-monitoring for the alert1001 and alert2002 hosts.
    1. SSH as root into wikitech-static.wikimedia.org with the metamonitor-key-passphrase.
    2. Uncomment the following crontab entries to enable meta-monitoring for the alert1001 host and add meta-monitoring for the alert2002 host:
      • # */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org
      • # */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2002.wikimedia.org
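
The comment and uncomment steps above were presumably done by hand in an editor. As a rough sketch only, the same edits can be expressed as stream filters over the crontab text; the function names disable_check_icinga and enable_check_icinga are hypothetical, and piping the result back through crontab - is an assumption about how the crontab on wikitech-static is managed.

```shell
# Hypothetical helpers operating on crontab text read from stdin.

# Comment out every active line that invokes check_icinga.
disable_check_icinga() {
  sed -E '/^[^#].*check_icinga/ s/^/# /'
}

# Remove the leading "# " from commented-out check_icinga lines.
enable_check_icinga() {
  sed -E 's/^#[[:space:]]*(.*check_icinga.*)$/\1/'
}

# Example (assumes the entries live in the current user's crontab):
#   crontab -l | disable_check_icinga | crontab -
#   crontab -l | enable_check_icinga  | crontab -
```

Lines that do not mention check_icinga pass through unchanged, so other crontab entries are left as they are.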

Stage 4: Make alert1002 the active alertmanager host

  1. Disable meta-monitoring for the alert hosts.
    1. SSH as root into wikitech-static.wikimedia.org with the metamonitor-key-passphrase.
    2. Comment out the following crontab entries to stop meta-monitoring against both alert hosts:
      • */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org
      • */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2002.wikimedia.org
  2. Stop services in the alert2002 host.
  3. Make alert1002 the active host.
    1. Merge Gerrit patch #1071701 - alert: Failover from alert2002 to alert1002
    2. Run Puppet on the alert hosts: sudo cumin 'alert*' 'run-puppet-agent'
    3. Merge Gerrit patch #1063078 - alert: Resolve alerts DNS queries to alert1002
    4. Update DNS records: $ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
  4. Ensure services work as expected.
  5. Enable meta-monitoring for the alert1002 and alert2002 hosts.
    1. SSH as root into wikitech-static.wikimedia.org with the metamonitor-key-passphrase.
    2. Uncomment the following crontab entries to enable meta-monitoring for the alert2002 host and add meta-monitoring for the alert1002 host:
      • # */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1002.wikimedia.org
      • # */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2002.wikimedia.org
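
Adding meta-monitoring for a new host amounts to pointing an existing check_icinga entry at the new hostname. A minimal sketch of that substitution follows; the function name retarget_check_icinga is hypothetical, and it assumes the hostname is the last word on the crontab line, as in the entries listed above.

```shell
# Hypothetical helper: in check_icinga crontab lines read from stdin,
# replace a trailing OLD hostname with NEW.
retarget_check_icinga() {
  # Escape dots so the hostname is matched literally by sed
  old_re=$(printf '%s' "$1" | sed 's/\./\\./g')
  new=$2
  sed "/check_icinga/ s/ ${old_re}\$/ ${new}/"
}

# Example (assumes the entries live in the current user's crontab):
#   crontab -l \
#     | retarget_check_icinga alert1001.wikimedia.org alert1002.wikimedia.org \
#     | crontab -
```

Only lines that both mention check_icinga and end with the old hostname are changed.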

Stage 5: Cleanup

  1. Update hostnames for alertmanager tests
    1. Merge Gerrit patch #1063235 - alert: Update alertmanager tests hostnames

Details

Related Changes in Gerrit:
Repo               Branch      Lines +/-
operations/puppet  production  +4 -4
operations/dns     master      +5 -5
operations/puppet  production  +5 -5
operations/dns     master      +5 -5
operations/puppet  production  +5 -5
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +4 -4
operations/dns     master      +5 -5
operations/puppet  production  +19 -19
operations/puppet  production  +4 -0
operations/puppet  production  +4 -0
operations/puppet  production  +4 -0
operations/puppet  production  +5 -5
operations/puppet  production  +23 -2
operations/puppet  production  +8 -0
operations/puppet  production  +1 -1
operations/puppet  production  +3 -2
operations/puppet  production  +9 -5
operations/puppet  production  +14 -0

Event Timeline

Thank you @andrea.denisse for getting the ball rolling on this! The solution as-is works well for failing over hosts that are already running the alerting_host role. In other words, we should apply the alerting_host role first, including adding the new hostnames and IP addresses wherever they appear in puppet.git, then apply the procedure above to put the new hosts in service in turn, and finally decommission the old hosts. What do you think?

Thanks for your suggestions, the plan you propose sounds good to me, I'll update the task description.

Thank you, the procedure is definitely coming together! I've re-read the task description and I think we should split the procedure into at least these parts:

  • We stage the new hosts with alerting_host role, add to puppet, arm keyholder, etc. The hosts are not active in icinga/alertmanager yet (i.e. alertmanagers and profile::icinga::active_host and profile::alertmanager::active_host variables)
  • Verify the ip addresses are propagated across the infrastructure
    • The ip addresses for alert hosts are deployed by homer to network devices in production
    • Ditto for fundraising firewalls, they do have the alert hosts IP addresses in them, though I'm not sure how to add/remove them? cc @ayounsi for this and the homer point above
  • Verify the hosts are working as intended as standby hosts (e.g. no puppet or unit failures)

This might take more than one day, and it is the "stage" phase. After this, we can proceed with making the hosts active. I think we should test making alert2002 active too, to avoid surprises later, therefore:

  • Make alert2002 active, verify functionality (as per your procedure, there will be more steps)
  • Make alert1002 active, verify functionality (ditto)

I'd also like to know more about the "migrate data" step: what data, commands, etc. are involved?

Ditto for fundraising firewalls, they do have the alert hosts IP addresses in them, though I'm not sure how to add/remove them? cc @ayounsi for this and the homer point above

The Homer data is generated automatically from Netbox data by the capirca script (https://netbox.wikimedia.org/extras/scripts/1/jobs/); a Homer run is then needed to actually update the network devices with that data.
For the fundraising hosts/network you need to ping @Jgreen and @Dwisehaupt

I have added them to our PFW config and created T372520 for the deployment.

I have put in the iptables and nagios config changes for the new hosts into the fundraising puppet repos. The iptables changes will be rolling out today as we do updates. We can then pull alert1001 and alert2001 when you all are ready.

I have added them to our PFW config and created T372520 for the deployment.

Deployed.

Change #1064097 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Add the alert[12]002 hosts to Prometheus blackbox exporter

https://gerrit.wikimedia.org/r/1064097

Change #1064107 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Add the alert[12]002 hosts to acme chief

https://gerrit.wikimedia.org/r/1064107

andrea.denisse set the point value for this task to 1.

Thank you so much for your quick action here! Appreciate it

Change #1064107 merged by Andrea Denisse:

[operations/puppet@production] alert: Add the alert[12]002 hosts to acme chief

https://gerrit.wikimedia.org/r/1064107

Change #1062444 merged by Andrea Denisse:

[operations/puppet@production] alert: Ensure the alert[12]002 hosts use the alerting_host role

https://gerrit.wikimedia.org/r/1062444

Change #1064097 merged by Andrea Denisse:

[operations/puppet@production] alert: Add the alert[12]002 hosts to Prometheus blackbox exporter

https://gerrit.wikimedia.org/r/1064097

Change #1064806 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Enable the alert[12]002 hosts as alertmanagers

https://gerrit.wikimedia.org/r/1064806

Change #1064818 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Allow connections from the alert[12]002 addresses

https://gerrit.wikimedia.org/r/1064818

andrea.denisse removed the point value 1 for this task.

Change #1064820 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Add the alert[12]002 hosts as Icinga and AM partners

https://gerrit.wikimedia.org/r/1064820

Change #1064821 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Allow Apache2 connections for the alert[12]002 hosts

https://gerrit.wikimedia.org/r/1064821

Change #1064826 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Failover from alert1001 to alert2002

https://gerrit.wikimedia.org/r/1064826

Change #1063063 abandoned by Andrea Denisse:

[operations/puppet@production] alert: Add the alert[12]002 hosts to puppet realm

Reason:

Abandoning in favor of smaller patches.

https://gerrit.wikimedia.org/r/1063063

Change #1063075 abandoned by Andrea Denisse:

[operations/puppet@production] alert: Ensure alert1002 is the active alert host

Reason:

Abandoning in favor of smaller patches.

https://gerrit.wikimedia.org/r/1063075

Change #1064828 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Failover from alert2002 to alert1002

https://gerrit.wikimedia.org/r/1064828

Change #1065258 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] alert: Resolve alerts DNS queries to alert2002

https://gerrit.wikimedia.org/r/1065258

Change #1064821 merged by Andrea Denisse:

[operations/puppet@production] alert: Allow Apache2 connections for the alert[12]002 hosts

https://gerrit.wikimedia.org/r/1064821

Change #1064818 merged by Andrea Denisse:

[operations/puppet@production] alert: Allow connections from the alert[12]002 addresses

https://gerrit.wikimedia.org/r/1064818

Change #1064820 merged by Andrea Denisse:

[operations/puppet@production] alert: Add the alert[12]002 hosts as Icinga and AM partners

https://gerrit.wikimedia.org/r/1064820

Change #1063235 merged by Andrea Denisse:

[operations/puppet@production] alert: Update alertmanager tests hostnames

https://gerrit.wikimedia.org/r/1063235

Change #1064806 merged by Andrea Denisse:

[operations/puppet@production] alert: Enable the alert[12]002 hosts as alertmanagers

https://gerrit.wikimedia.org/r/1064806

Mentioned in SAL (#wikimedia-operations) [2024-09-03T14:00:50Z] <denisse> Disabling meta-monitoring for the alert hosts - T372418

Mentioned in SAL (#wikimedia-operations) [2024-09-03T14:03:43Z] <denisse> Stopping services in the alert1001 host - T372418

Change #1064826 merged by Andrea Denisse:

[operations/puppet@production] alert: Failover from alert1001 to alert2002

https://gerrit.wikimedia.org/r/1064826

Mentioned in SAL (#wikimedia-operations) [2024-09-03T14:06:42Z] <denisse> Failing over to alert2002 - T372418

Mentioned in SAL (#wikimedia-operations) [2024-09-03T14:10:35Z] <denisse> Resolve DNS queries to alert2002 - T372418

Change #1065258 merged by Andrea Denisse:

[operations/dns@master] alert: Resolve alerts DNS queries to alert2002

https://gerrit.wikimedia.org/r/1065258

Mentioned in SAL (#wikimedia-operations) [2024-09-03T15:58:47Z] <denisse> Reverting back to alert1001 - T372418

Mentioned in SAL (#wikimedia-operations) [2024-09-03T15:59:06Z] <denisse> Enabling meta monitoring for alert[12]001 - T372418

Fundraising servers use nsca exclusively, so we've configured them to send to all four alert[12]00[12] hosts while things are in flux. I'll try to keep an eye on this task, but please let us know when we should remove extraneous hosts from the list.

Thank you @Jgreen, we'll keep you posted once we're done with the old hosts and have fully switched to new hosts.

@andrea.denisse FYI: now that https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060516 is merged, corto's profile::corto::active_host hiera setting needs to be changed on failover.

Also please note that we're blocked on T374340 before the failover can happen

@fgiunchedi Thanks for the heads-up regarding corto, I'm sending a patch for that.

T374340 is resolved now, kudos to Sukhbir for his help with it!

Change #1071700 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Make alert2002 the active host for corto

https://gerrit.wikimedia.org/r/1071700

Change #1071701 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Make alert1002 the active host for corto

https://gerrit.wikimedia.org/r/1071701

Change #1072318 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Enable the alert[12]002 hosts as alertmanagers

https://gerrit.wikimedia.org/r/1072318

Change #1072326 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] alert: Resolve alerts DNS queries to alert2002

https://gerrit.wikimedia.org/r/1072326

Mentioned in SAL (#wikimedia-operations) [2024-09-12T14:01:45Z] <denisse> Enable the alert[12]002 hosts as alertmanagers - T372418

Change #1072318 merged by Andrea Denisse:

[operations/puppet@production] alert: Enable the alert[12]002 hosts as alertmanagers

https://gerrit.wikimedia.org/r/1072318

Mentioned in SAL (#wikimedia-operations) [2024-09-12T14:02:44Z] <denisse> Disable meta-monitoring for the alert hosts - T372418

Mentioned in SAL (#wikimedia-operations) [2024-09-12T14:04:25Z] <denisse> Make alert2002 the active host - T372418

Change #1071700 merged by Andrea Denisse:

[operations/puppet@production] alert: Failover from alert1001 to alert2002

https://gerrit.wikimedia.org/r/1071700

Change #1072326 merged by Andrea Denisse:

[operations/dns@master] alert: Resolve alerts DNS queries to alert2002

https://gerrit.wikimedia.org/r/1072326

Mentioned in SAL (#wikimedia-operations) [2024-09-18T15:00:50Z] <denisse> Disable meta-monitoring for the alert hosts - T372418

Mentioned in SAL (#wikimedia-operations) [2024-09-18T15:01:44Z] <denisse> Make alert1002 the active host - T372418

Change #1071701 merged by Andrea Denisse:

[operations/puppet@production] alert: Failover from alert2002 to alert1002

https://gerrit.wikimedia.org/r/1071701

Mentioned in SAL (#wikimedia-operations) [2024-09-18T15:08:44Z] <denisse> Resolve alerts DNS queries to alert1002 - T372418

Change #1063078 merged by Andrea Denisse:

[operations/dns@master] alert: Resolve alerts DNS queries to alert1002

https://gerrit.wikimedia.org/r/1063078

Mentioned in SAL (#wikimedia-operations) [2024-09-18T15:21:08Z] <denisse> Enable metamonitoring for the alert1002, and alert2002 hosts - T372418

Hi @Jgreen and @Dwisehaupt, we have finished the failover and plan to decommission the alert1001 and alert2001 hosts next week; progress is being tracked in T372607.

Once the decommission is done I'll ping you so the IP addresses of the old hosts can be removed from the PFW configuration, thank you!

Hi @Jgreen and @Dwisehaupt, we finished decommissioning the alert1001 and alert2001 hosts, so the IP addresses of the old hosts can now be removed from the PFW configuration. Thank you!

Done! Thank you.

Change #1064828 abandoned by Andrea Denisse:

[operations/puppet@production] alert: Failover from alert2002 to alert1002

https://gerrit.wikimedia.org/r/1064828