
Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cp403[3-6].ulsfo.wmnet

Hostname / Racking / Installation Details

Hostnames: cp403[3-6].ulsfo.wmnet
Racking Proposal: Odd-numbered hostnames in rack 22, even-numbered in rack 23
Networking/Subnet/VLAN/IP: single 10g connection to private1-vlan
Partitioning/Raid: sw standard 2dev (the NVMe card can remain unused by installer)
OS Distro: Buster (default unless otherwise specified)
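
The hostname range and the odd/even racking proposal above can be sketched as a small script. This is only an illustration: the helper names `expand_hosts` and `rack_for` are hypothetical and are not part of any tooling referenced in this task.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single-digit bracket range such as 'cp403[3-6].ulsfo.wmnet'."""
    match = re.fullmatch(r"(.*)\[(\d)-(\d)\](.*)", pattern)
    if match is None:
        # No bracket range present: treat the pattern as a single hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    return [f"{prefix}{n}{suffix}" for n in range(int(start), int(end) + 1)]


def rack_for(hostname: str) -> int:
    """Apply the proposal: odd-numbered hosts in rack 22, even-numbered in rack 23."""
    number = int(re.search(r"(\d+)", hostname).group(1))
    return 22 if number % 2 else 23


plan = {host: rack_for(host) for host in expand_hosts("cp403[3-6].ulsfo.wmnet")}
for host, rack in plan.items():
    print(f"{host}: rack {rack}")
```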

Per host setup checklist

cp4033:

  • - receive in system on procurement task T286590 & in coupa - systems arrived but rails haven't yet.
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - https://gerrit.wikimedia.org/r/735461
  • - OS installation & initial puppet run via cookbook

cp4034:

  • - receive in system on procurement task T286590 & in coupa - systems arrived but rails haven't yet.
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - https://gerrit.wikimedia.org/r/735461
  • - OS installation & initial puppet run via cookbook

cp4035:

  • - receive in system on procurement task T286590 & in coupa - systems arrived but rails haven't yet.
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - https://gerrit.wikimedia.org/r/735461
  • - OS installation & initial puppet run via cookbook

cp4036:

  • - receive in system on procurement task T286590 & in coupa - systems arrived but rails haven't yet.
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - https://gerrit.wikimedia.org/r/735461
  • - OS installation & initial puppet run via cookbook

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task). Sep 9 2021, 9:16 PM
RobH mentioned this in Unknown Object (Task).
RobH updated the task description.

Change 735461 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] new cp40[3-6]

https://gerrit.wikimedia.org/r/735461

Change 735461 merged by RobH:

[operations/puppet@production] new cp40[3-6]

https://gerrit.wikimedia.org/r/735461

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4034.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4034.ulsfo.wmnet with OS buster executed with errors:

  • cp4034 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster executed with errors:

  • cp4033 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Not sure why these are failing, but I'm out of mental bandwidth for them today.

They are remotely accessible via idrac and will accept script commands. Something is causing them to hang on install, and that needs investigation.

Please note the librenms alerts didn't clear for these until they were powered down. I need to check their power redundancy settings to ensure they are pulling evenly from both supplies rather than using one and keeping the other for failover.

Both runs of the cookbook were interrupted by the operator:

2021-10-28 23:50:05,984 robh 31550 [ERROR] Ctrl+c pressed
...
2021-10-28 23:50:18,273 robh 31327 [ERROR] Ctrl+c pressed

In both cases the cookbook was polling the host to check whether it had rebooted into the Debian installer environment.
For cp4034 it was interrupted after just 90s of polling; for cp4033, after 180s.
Reboot times vary considerably with the hardware, and these polling times don't seem alarming yet. It's possible the reimage would have simply continued without any problem.
When in doubt, you can always follow the reimage by connecting to the remote console.
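
For illustration, the kind of poll-with-timeout loop described above could look roughly like the sketch below. It is not the reimage cookbook's actual implementation: `wait_until`, its parameters, and the 600s/30s values are all assumptions chosen for the example. The demo uses a fake clock so it finishes instantly.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout: float, interval: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Call probe() every `interval` seconds until it succeeds or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demo with a fake clock: the hypothetical host only answers on the 4th probe.
fake_now = [0.0]
attempts = []


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


def fake_probe() -> bool:
    attempts.append(fake_now[0])
    return len(attempts) >= 4


ok = wait_until(fake_probe, timeout=600, interval=30,
                sleep=fake_sleep, clock=lambda: fake_now[0])
print(ok, attempts)
```

With a generous timeout, a host that needs a few minutes to come back is still picked up; interrupting after 90s or 180s cuts the loop short before it can tell a slow reboot from a real failure.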

Oh, I was watching ON the hardware and it gave an error screen, so I knew it was a BIOS setting. Sorry if it seemed like I was blaming the script, I was not =]

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster

RobH changed the task status from Open to In Progress. Oct 29 2021, 7:41 PM

UEFI boot mode was enabled, which is why it was failing rather than actually attempting to hit our PXE server. Changed to BIOS mode and it immediately worked. I've corrected the BIOS settings on cp4033, and once it's fully done with a successful install, the rest will follow suit.

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster executed with errors:

  • cp4033 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

hung on loading ramdisk post install reboot... not sure why

Change 735687 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] ferm not firm

https://gerrit.wikimedia.org/r/735687

Change 735687 merged by RobH:

[operations/puppet@production] ferm not firm

https://gerrit.wikimedia.org/r/735687

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster

Change 735692 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] insetup_noferm

https://gerrit.wikimedia.org/r/735692

Change 735692 merged by RobH:

[operations/puppet@production] insetup_noferm

https://gerrit.wikimedia.org/r/735692

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4033.ulsfo.wmnet with OS buster completed:

  • cp4033 (WARN)
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110291954_robh_12618_cp4033.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4034.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4035.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4036.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4036.ulsfo.wmnet with OS buster executed with errors:

  • cp4036 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4035.ulsfo.wmnet with OS buster completed:

  • cp4035 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110292043_robh_22622_cp4035.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4034.ulsfo.wmnet with OS buster completed:

  • cp4034 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110292042_robh_22567_cp4034.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cp4036.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cp4036.ulsfo.wmnet with OS buster completed:

  • cp4036 (WARN)
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110292132_robh_31974_cp4036.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
RobH updated the task description.

Change 735994 had a related patch set uploaded (by MMandere; author: MMandere):

[operations/puppet@production] site: Add new cache instances in ulsfo DC

https://gerrit.wikimedia.org/r/735994

Change 735994 merged by MMandere:

[operations/puppet@production] site: Add new cache instances in ulsfo DC

https://gerrit.wikimedia.org/r/735994

Icinga alerts added by Puppet started firing and raised some questions, but we confirmed they only concern these new hosts, which had just switched from PENDING to active in Icinga. This was also discussed on IRC:

19:26 <+icinga-wm> ACKNOWLEDGEMENT - Check systemd state on cp4033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service 
                   daniel_zahn https://phabricator.wikimedia.org/T290694 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
19:26 <+icinga-wm> ACKNOWLEDGEMENT - Check systemd state on cp4034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service 
                   daniel_zahn https://phabricator.wikimedia.org/T290694 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
19:26 <+icinga-wm> ACKNOWLEDGEMENT - Check systemd state on cp4035 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service 
                   daniel_zahn https://phabricator.wikimedia.org/T290694 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
19:26 <+icinga-wm> ACKNOWLEDGEMENT - traffic-pool service on cp4035 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive daniel_zahn 
                   https://phabricator.wikimedia.org/T290694 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
19:26 <+icinga-wm> ACKNOWLEDGEMENT - Check systemd state on cp4036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service 
                   daniel_zahn https://phabricator.wikimedia.org/T290694 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
19:26 <+icinga-wm> ACKNOWLEDGEMENT - traffic-pool service on cp4036 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive daniel_zahn 
                   https://phabricator.wikimedia.org/T290694 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

Yeah, sorry for the noise. We weren't anticipating the hosts re-puppeting themselves into the production roles (incorrectly!) and should've just puppet-disabled them until we were ready to reimage them. Either way, definitely not an actual production concern.

Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster completed:

  • cp4033 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4033.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=varnish-fe"}
{"cp4033.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-tls"}
{"cp4033.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-be"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111021119_mmandere_3810_cp4033.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:
  • Updated Netbox data from PuppetDB
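
The confctl state lines in the cookbook output above are one JSON object per line, with the hostname as a key and the selector packed into a `tags` string. A minimal sketch of flattening them for inspection follows; `parse_confctl_line` is a hypothetical helper, and the format is assumed from the three lines shown here rather than from confctl documentation.

```python
import json

# The three state lines reported for cp4033 in the cookbook output above.
LINES = [
    '{"cp4033.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=varnish-fe"}',
    '{"cp4033.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-tls"}',
    '{"cp4033.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-be"}',
]


def parse_confctl_line(line: str) -> dict:
    """Flatten one state line into host, weight, pooled, dc, cluster and service."""
    obj = json.loads(line)
    # "tags" is a comma-separated list of key=value pairs.
    tags = dict(pair.split("=", 1) for pair in obj.pop("tags").split(","))
    # After removing "tags", the only remaining key is the hostname.
    (host, state), = obj.items()
    return {"host": host, **state, **tags}


records = [parse_confctl_line(line) for line in LINES]
for rec in records:
    print(f'{rec["host"]} {rec["service"]}: pooled={rec["pooled"]} weight={rec["weight"]}')
```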

Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster

Mentioned in SAL (#wikimedia-operations) [2021-11-02T13:45:21Z] <mmandere> pool cp4033.ulsfo.wmnet - T290694

Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster completed:

  • cp4035 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4035.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-be"}
{"cp4035.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-tls"}
{"cp4035.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_text,service=varnish-fe"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111021331_mmandere_15322_cp4035.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:
  • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster

Mentioned in SAL (#wikimedia-operations) [2021-11-02T14:34:44Z] <mmandere> pool cp4035.ulsfo.wmnet - T290694

Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster completed:

  • cp4034 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4034.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=varnish-fe"}
{"cp4034.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-be"}
{"cp4034.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-tls"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111021430_mmandere_25896_cp4034.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:
  • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster

Mentioned in SAL (#wikimedia-operations) [2021-11-02T15:41:46Z] <mmandere> pool cp4034.ulsfo.wmnet - T290694

Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster completed:

  • cp4036 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4036.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-be"}
{"cp4036.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-tls"}
{"cp4036.ulsfo.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=ulsfo,cluster=cache_text,service=varnish-fe"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111021538_mmandere_6478_cp4036.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:
  • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-11-02T16:38:43Z] <mmandere> pool cp4036.ulsfo.wmnet - T290694

Change 736475 had a related patch set uploaded (by MMandere; author: MMandere):

[operations/puppet@production] install_server: Update instance hardware category

https://gerrit.wikimedia.org/r/736475

Change 736475 merged by MMandere:

[operations/puppet@production] install_server: Update instance hardware category

https://gerrit.wikimedia.org/r/736475

Mentioned in SAL (#wikimedia-operations) [2021-11-03T15:58:40Z] <mmandere> depool cp4033.ulsfo.wmnet - T290694

Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4033.ulsfo.wmnet with OS buster completed:

  • cp4033 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4033.ulsfo.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-tls"}
{"cp4033.ulsfo.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_upload,service=varnish-fe"}
{"cp4033.ulsfo.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-be"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111031602_mmandere_22907_cp4033.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=ats-tls' set/pooled=no
sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=varnish-fe' set/pooled=no
sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=ats-be' set/pooled=no

  • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-11-03T17:06:15Z] <mmandere> pool cp4033.ulsfo.wmnet - T290694

Mentioned in SAL (#wikimedia-operations) [2021-11-03T17:48:18Z] <mmandere> depool cp4035.ulsfo.wmnet - T290694

Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4035.ulsfo.wmnet with OS buster completed:

  • cp4035 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4035.ulsfo.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-be"}
{"cp4035.ulsfo.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_text,service=varnish-fe"}
{"cp4035.ulsfo.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-tls"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111031750_mmandere_25895_cp4035.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=ulsfo,cluster=cache_text,service=ats-be' set/pooled=no
sudo confctl select 'dc=ulsfo,cluster=cache_text,service=varnish-fe' set/pooled=no
sudo confctl select 'dc=ulsfo,cluster=cache_text,service=ats-tls' set/pooled=no

  • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-11-03T19:26:14Z] <mmandere> pool cp4035.ulsfo.wmnet - T290694

Mentioned in SAL (#wikimedia-operations) [2021-11-04T09:09:28Z] <mmandere> depool cp4034.ulsfo.wmnet - T290694

Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster completed:

  • cp4034 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4034.ulsfo.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-be"}
{"cp4034.ulsfo.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_upload,service=ats-tls"}
{"cp4034.ulsfo.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_upload,service=varnish-fe"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111040911_mmandere_26533_cp4034.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=ats-be' set/pooled=no
sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=ats-tls' set/pooled=no
sudo confctl select 'dc=ulsfo,cluster=cache_upload,service=varnish-fe' set/pooled=no

  • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-11-04T10:21:16Z] <mmandere> pool cp4034.ulsfo.wmnet - T290694

Mentioned in SAL (#wikimedia-operations) [2021-11-04T10:27:10Z] <mmandere> depool cp4036.ulsfo.wmnet - T290694

Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster completed:

  • cp4036 (WARN)
    • Downtimed on Icinga
    • Set pooled=inactive for the following services on confctl:

{"cp4036.ulsfo.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-tls"}
{"cp4036.ulsfo.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_text,service=ats-be"}
{"cp4036.ulsfo.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=ulsfo,cluster=cache_text,service=varnish-fe"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present
  • Deleted any existing Puppet certificate
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Host up (new fresh buster OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111041028_mmandere_6637_cp4036.out
  • Checked BIOS boot parameters are back to normal
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=ulsfo,cluster=cache_text,service=ats-tls' set/pooled=no
sudo confctl select 'dc=ulsfo,cluster=cache_text,service=ats-be' set/pooled=no
sudo confctl select 'dc=ulsfo,cluster=cache_text,service=varnish-fe' set/pooled=no

  • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-11-04T11:53:50Z] <mmandere> pool cp4036.ulsfo.wmnet - T290694
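
As a rough summary of the second reimage round, the depool/pool SAL entries from Nov 3 and 4 can be paired up to see how long each host was out of rotation. The sketch below assumes the SAL line format seen in this task; `depool_windows` is a hypothetical helper, not an existing tool.

```python
import re
from datetime import datetime

# SAL entries copied from the timeline above (Nov 3 and 4 only).
SAL_LINES = [
    "Mentioned in SAL (#wikimedia-operations) [2021-11-03T15:58:40Z] <mmandere> depool cp4033.ulsfo.wmnet - T290694",
    "Mentioned in SAL (#wikimedia-operations) [2021-11-03T17:06:15Z] <mmandere> pool cp4033.ulsfo.wmnet - T290694",
    "Mentioned in SAL (#wikimedia-operations) [2021-11-03T17:48:18Z] <mmandere> depool cp4035.ulsfo.wmnet - T290694",
    "Mentioned in SAL (#wikimedia-operations) [2021-11-03T19:26:14Z] <mmandere> pool cp4035.ulsfo.wmnet - T290694",
    "Mentioned in SAL (#wikimedia-operations) [2021-11-04T09:09:28Z] <mmandere> depool cp4034.ulsfo.wmnet - T290694",
    "Mentioned in SAL (#wikimedia-operations) [2021-11-04T10:21:16Z] <mmandere> pool cp4034.ulsfo.wmnet - T290694",
    "Mentioned in SAL (#wikimedia-operations) [2021-11-04T10:27:10Z] <mmandere> depool cp4036.ulsfo.wmnet - T290694",
    "Mentioned in SAL (#wikimedia-operations) [2021-11-04T11:53:50Z] <mmandere> pool cp4036.ulsfo.wmnet - T290694",
]

SAL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] <(?P<user>[^>]+)> (?P<action>pool|depool) (?P<host>\S+)"
)


def depool_windows(lines):
    """Pair each depool entry with the next pool entry for the same host."""
    started = {}
    windows = {}
    for line in lines:
        m = SAL_RE.search(line)
        if m is None:
            continue
        when = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
        if m["action"] == "depool":
            started[m["host"]] = when
        elif m["host"] in started:
            windows[m["host"]] = when - started.pop(m["host"])
    return windows


windows = depool_windows(SAL_LINES)
for host, duration in windows.items():
    print(f"{host}: depooled for {duration}")
```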