Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of an-worker11[42-48].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: an-worker11[42-48].eqiad.wmnet
Racking Proposal: Distributed across racks and rows in eqiad please, for resilience.

I've included the current Hadoop worker distribution below, in case it helps to avoid clustering the nodes.

Networking/Subnet/VLAN/IP: A single 10 Gbps connection per host please. Analytics VLAN. No special IP requirements.
Partitioning/Raid: We will use an existing partman recipe: partman/custom/analytics-flex.cfg
OS Distro: Buster
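
As an illustration of the bracket notation used for the hostnames above, here is a minimal Python sketch that expands a single numeric range into individual host names. The function name `expand_host_range` is hypothetical and not part of any tooling mentioned in this task; it only handles one `[start-end]` range per pattern.

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'name[42-48].domain'.

    Hypothetical helper for illustration only. Patterns without a
    bracketed range are returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep any zero padding used in the range
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_host_range("an-worker11[42-48].eqiad.wmnet")
for host in hosts:
    print(host)
```

This yields the seven hosts an-worker1142 through an-worker1148 covered by the task.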

Additional Information to Support Racking Configuration

Currently the row distribution for hadoop workers is as follows:

23 eqiad A
19 eqiad B
22 eqiad C
20 eqiad D

Taking into account rows and racks, the current distribution is as follows:

1 eqiad A 1
6 eqiad A 2
2 eqiad A 3
6 eqiad A 4
2 eqiad A 5
6 eqiad A 7
5 eqiad B 2
1 eqiad B 3
5 eqiad B 4
5 eqiad B 7
3 eqiad B 8
5 eqiad C 2
4 eqiad C 3
7 eqiad C 4
5 eqiad C 7
1 eqiad C 8
6 eqiad D 2
5 eqiad D 4
2 eqiad D 5
6 eqiad D 7
1 eqiad D 8
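
The per-rack figures above can be cross-checked against the per-row totals with a short Python sketch. The listing is copied from this task; the function name `row_totals` is just an illustrative choice.

```python
from collections import Counter

# Per-rack listing copied from the task: "<count> <site> <row> <rack>"
RACK_LISTING = """\
1 eqiad A 1
6 eqiad A 2
2 eqiad A 3
6 eqiad A 4
2 eqiad A 5
6 eqiad A 7
5 eqiad B 2
1 eqiad B 3
5 eqiad B 4
5 eqiad B 7
3 eqiad B 8
5 eqiad C 2
4 eqiad C 3
7 eqiad C 4
5 eqiad C 7
1 eqiad C 8
6 eqiad D 2
5 eqiad D 4
2 eqiad D 5
6 eqiad D 7
1 eqiad D 8
"""


def row_totals(listing: str) -> dict[str, int]:
    """Sum the per-rack worker counts into per-row totals."""
    totals: Counter[str] = Counter()
    for line in listing.splitlines():
        if not line.strip():
            continue
        count, _site, row, _rack = line.split()
        totals[row] += int(count)
    return dict(totals)


totals = row_totals(RACK_LISTING)
print(totals)  # {'A': 23, 'B': 19, 'C': 22, 'D': 20}
```

The result matches the row distribution given earlier (23, 19, 22 and 20 workers in rows A to D).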

Also, in case it's helpful to know, the 12 nodes being refreshed (and therefore to be decommissioned) are in the following racks:

1 eqiad A 1
2 eqiad A 3
3 eqiad B 8
3 eqiad C 3
2 eqiad D 2
1 eqiad D 8
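
To see what the row distribution would look like once those 12 nodes are gone, the counts can be subtracted from the current row totals. This is only a rough sketch using the numbers from this task; it does not account for the new hosts or for any other changes to the cluster.

```python
# Current per-row totals and the nodes to be decommissioned, from the task.
current_rows = {"A": 23, "B": 19, "C": 22, "D": 20}
decommission = [
    (1, "A", "1"),
    (2, "A", "3"),
    (3, "B", "8"),
    (3, "C", "3"),
    (2, "D", "2"),
    (1, "D", "8"),
]

# Subtract each decommissioned group from its row.
remaining_rows = dict(current_rows)
for count, row, _rack in decommission:
    remaining_rows[row] -= count

print(remaining_rows)  # {'A': 20, 'B': 16, 'C': 19, 'D': 17}
```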

Please let me know if you'd like any further information.

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-worker1142:

  • - receive in system on procurement task T292002 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1143:

  • - receive in system on procurement task T292002 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1144:

  • - receive in system on procurement task T292002 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1145:

  • - receive in system on procurement task T292002 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1146:

  • - receive in system on procurement task T292002 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1147:

  • - receive in system on procurement task T292002 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1148:

  • - receive in system on procurement task T292002 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

Change 773613 merged by Cmjohnson:

[operations/puppet@production] Updating site.pp for an-worker hosts

https://gerrit.wikimedia.org/r/773613

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster executed with errors:

  • an-worker1142 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster executed with errors:

  • an-worker1145 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster executed with errors:

  • an-worker1144 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster executed with errors:

  • an-worker1146 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster executed with errors:

  • an-worker1142 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster executed with errors:

  • an-worker1147 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster executed with errors:

  • an-worker1148 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster executed with errors:

  • an-worker1142 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Cmjohnson there is an issue with the port assigned for an-worker1143 on lsw1-e2-eqiad, an-worker1145 on lsw1-f2-eqiad, and an-worker1146 on lsw1-f3-eqiad.

These switches have ml-serve devices connected at 1G on port 25. Due to a technical limitation on these switches (see T303529), 4 ports, 24-27, all need to be set to 1G speed.

The an-worker hosts are connected to port 27 with a 10G DAC, but aren't coming up as that port is part of the block of 4 set to 1G. Can you move these connections to another port on each device? Anything outside 24-27 should be good.
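
The constraint described above can be sketched as a small check. Note the assumption: the comment only confirms that a 1G device on port 25 forces ports 24-27 to 1G, so the idea that ports are always grouped in aligned blocks of four (0-3, 4-7, ...) is a guess made for illustration, and the function names are hypothetical.

```python
BLOCK_SIZE = 4  # assumption: speed is shared by aligned groups of 4 ports


def speed_block(port: int, block_size: int = BLOCK_SIZE) -> range:
    """Return the group of ports assumed to share a speed with `port`."""
    start = (port // block_size) * block_size
    return range(start, start + block_size)


def conflicts_with_1g(one_gig_ports: list[int], ten_gig_port: int) -> bool:
    """True if a 10G port falls in a block already forced to 1G."""
    return any(ten_gig_port in speed_block(port) for port in one_gig_ports)


print(list(speed_block(25)))        # [24, 25, 26, 27]
print(conflicts_with_1g([25], 27))  # True: port 27 shares the 1G block
print(conflicts_with_1g([25], 28))  # False: port 28 is outside the block
```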

We'll continue to work on making this process easier and preventing similar mix-ups in future. Apologies, we only became aware of this constraint recently.

an-worker1142, an-worker1144, an-worker1147 and an-worker1148 should be good to go.

I'm not sure why the re-image failed on those tbh, I guess we should retry that on them and dig into the logs if any fail.

cmooney@cumin1001:~$ ping -4 -c 1 an-worker1142.eqiad.wmnet
PING  (10.64.130.6) 56(84) bytes of data.
64 bytes from an-worker1142.eqiad.wmnet (10.64.130.6): icmp_seq=1 ttl=62 time=0.337 ms
cmooney@cumin1001:~$ ping -4 -c 1 an-worker1144.eqiad.wmnet
PING  (10.64.134.5) 56(84) bytes of data.
64 bytes from an-worker1144.eqiad.wmnet (10.64.134.5): icmp_seq=1 ttl=62 time=0.286 ms
cmooney@cumin1001:~$ ping -4 -c 1 an-worker1147.eqiad.wmnet
PING  (10.64.138.2) 56(84) bytes of data.
64 bytes from an-worker1147.eqiad.wmnet (10.64.138.2): icmp_seq=1 ttl=62 time=0.273 ms
cmooney@cumin1001:~$ ping -4 -c 1 an-worker1148.eqiad.wmnet
PING  (10.64.142.2) 56(84) bytes of data.
64 bytes from an-worker1148.eqiad.wmnet (10.64.142.2): icmp_seq=1 ttl=62 time=0.394 ms

Moved all 3 servers to xe-0/0/28 on their respective switches, and committed the change on homer.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster executed with errors:

  • an-worker1146 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

^^ Above reimage seemed to fail due to some disk problem, I suspect maybe the raid config needs to be done in the BIOS (I was running the reimage to validate the network was not causing issues, there were none, at least before it failed). Not unlikely the same thing happened with an-worker114[2-5].
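
When comparing failed runs, it can help to pull out the last step each reimage reached before failing. Below is a minimal Python sketch; the step text is copied from the an-worker1146 report above, while the function name and the idea of matching on the "The reimage failed" prefix are assumptions for illustration, not part of the cookbook itself.

```python
from typing import Optional

FAILURE_PREFIX = "The reimage failed"

# Steps copied from the an-worker1146 report in this task.
steps_1146 = [
    "Removed from Puppet and PuppetDB if present",
    "Deleted any existing Puppet certificate",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "The reimage failed, see the cookbook logs for the details",
]


def last_completed_step(steps: list[str]) -> Optional[str]:
    """Return the last reported step that is not the failure notice."""
    completed = [step for step in steps if not step.startswith(FAILURE_PREFIX)]
    return completed[-1] if completed else None


last_step = last_completed_step(steps_1146)
print(last_step)  # Host up (Debian installer)
```

Here the host got as far as booting the Debian installer, so the failure happened during installation rather than at PXE or DHCP.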

no_disk.png (437×721 px, 39 KB)

@Cmjohnson I backed out of this in terms of testing. I'm not sure what needs to be done here, so I'll leave it to you guys.

@cmooney Can you confirm the RAID setup please? Is analytics-flex the first 2 SSDs as RAID 1 and the rest JBOD?

@Cmjohnson no idea unfortunately, it should match the partman config so my guess is you are right, but I can't really confirm. Perhaps @RobH might know? Or someone on the analytics team.

One other thing: hosts an-worker114[2-6] were added to the "private" vlan rather than the "analytics" vlan that was requested:

Host                 Interface       Far Side                       Vlans
an-worker1142        ##PRIMARY##     lsw1-e1-eqiad:xe-0/0/27        private1-e1-eqiad (1031)
an-worker1143        ##PRIMARY##     lsw1-e2-eqiad:xe-0/0/28        private1-e2-eqiad (1032)
an-worker1144        ##PRIMARY##     lsw1-f1-eqiad:xe-0/0/27        private1-f1-eqiad (1035)
an-worker1145        ##PRIMARY##     lsw1-f2-eqiad:xe-0/0/28        private1-f2-eqiad (1036)
an-worker1146        ##PRIMARY##     lsw1-f3-eqiad:xe-0/0/28        private1-f3-eqiad (1037)

I'm not sure what the easiest way to fix that up is. The vlan needs to change, host IPs deleted and new ones from the relevant analytics subnet assigned. I'm guessing this has probably happened before so perhaps you know the simplest way to get there?
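
A quick way to spot hosts in the wrong vlan from a listing like the one above is to check the vlan name prefix. This is a rough sketch that parses the pasted table as whitespace-separated text; the function name and the prefix-based check are assumptions for illustration, not an existing tool.

```python
# Rows copied from the table in this comment.
REPORT = """\
an-worker1142        ##PRIMARY##     lsw1-e1-eqiad:xe-0/0/27        private1-e1-eqiad (1031)
an-worker1143        ##PRIMARY##     lsw1-e2-eqiad:xe-0/0/28        private1-e2-eqiad (1032)
an-worker1144        ##PRIMARY##     lsw1-f1-eqiad:xe-0/0/27        private1-f1-eqiad (1035)
an-worker1145        ##PRIMARY##     lsw1-f2-eqiad:xe-0/0/28        private1-f2-eqiad (1036)
an-worker1146        ##PRIMARY##     lsw1-f3-eqiad:xe-0/0/28        private1-f3-eqiad (1037)
"""


def hosts_in_wrong_vlan(report: str, expected_prefix: str = "analytics") -> list[str]:
    """List hosts whose vlan name does not start with the expected prefix."""
    wrong = []
    for line in report.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        host, _interface, _far_side, vlan = fields[:4]
        if not vlan.startswith(expected_prefix):
            wrong.append(host)
    return wrong


misplaced = hosts_in_wrong_vlan(REPORT)
print(misplaced)
```

With the rows above, all five hosts are reported, since each one sits in a private1-* vlan.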

Thanks @Cmjohnson

Note the IP addresses assigned to the servers need to be updated to match those vlans.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster executed with errors:

  • an-worker1146 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster executed with errors:

  • an-worker1145 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster executed with errors:

  • an-worker1144 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster executed with errors:

  • an-worker1144 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster executed with errors:

  • an-worker1145 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster executed with errors:

  • an-worker1146 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster executed with errors:

  • an-worker1142 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster executed with errors:

  • an-worker1142 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

I tested the PXE boot on an-worker1142 and the server was not getting anything from the DHCP server. After debugging, I found that the server is not in the right vlan.
The server is in the private1-e1 vlan and not analytics1-e1-eqiad. The same is also true for 1143. Let me know if you have any other issues.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster executed with errors:

  • an-worker1142 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster executed with errors:

  • an-worker1144 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster executed with errors:

  • an-worker1146 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster executed with errors:

  • an-worker1145 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster executed with errors:

  • an-worker1143 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster completed:

  • an-worker1142 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206092044_cmjohnson_655255_an-worker1142.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster executed with errors:

  • an-worker1145 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster executed with errors:

  • an-worker1146 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster completed:

  • an-worker1144 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131637_cmjohnson_1346419_an-worker1144.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster completed:

  • an-worker1147 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131658_cmjohnson_1348309_an-worker1147.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster completed:

  • an-worker1148 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131704_cmjohnson_1350697_an-worker1148.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster completed:

  • an-worker1143 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131718_cmjohnson_1354612_an-worker1143.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster completed:

  • an-worker1145 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131722_cmjohnson_1355003_an-worker1145.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster completed:

  • an-worker1146 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131729_cmjohnson_1357113_an-worker1146.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Finally resolved this; there were some issues with the network ports not being configured correctly.
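Since several hosts were reimaged more than once, it can help to reduce the cookbook summaries above to the latest result per host. The following is a minimal sketch, assuming the summaries have been copied into a string in the same bullet format as they appear in this task. The function name `latest_status` and the regular expression are illustrative and are not part of spicerack or the reimage cookbook.

```python
import re

# Matches a top-level result bullet such as "  • an-worker1146 (PASS)".
# Sub-step bullets (e.g. "    • Host up (Debian installer)") do not match,
# because they have more than one word before the parenthesis.
RESULT_RE = re.compile(r"^\s*•\s+(?P<host>[\w.-]+)\s+\((?P<status>PASS|FAIL)\)\s*$")


def latest_status(log_text: str) -> dict[str, str]:
    """Return the most recent PASS/FAIL result seen for each host."""
    results: dict[str, str] = {}
    for line in log_text.splitlines():
        match = RESULT_RE.match(line)
        if match:
            # Later entries overwrite earlier ones, so a successful retry wins.
            results[match.group("host")] = match.group("status")
    return results


# Hypothetical excerpt in the same format as the summaries in this task.
sample = """\
  • an-worker1146 (FAIL)
    • The reimage failed, see the cookbook logs for the details
  • an-worker1144 (PASS)
    • Host up (new fresh buster OS)
  • an-worker1146 (PASS)
    • Updated Netbox status planned -> staged
"""

if __name__ == "__main__":
    for host, status in sorted(latest_status(sample).items()):
        print(f"{host}: {status}")
```

Keeping only the last result per host reflects how the summaries read in this thread: an earlier FAIL for an-worker1146 is superseded by its later PASS.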