
(Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of cloudcephosd10[04-15].wikimedia.org.

Hostname / Racking / Installation Details

Hostnames: cloudcephosd10[04-15].wikimedia.org
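For clarity, the bracket shorthand `cloudcephosd10[04-15]` covers twelve hostnames, cloudcephosd1004 through cloudcephosd1015. A quick Python expansion (illustrative only):

```python
# Expand the cloudcephosd10[04-15] shorthand into the twelve FQDNs.
hosts = [f"cloudcephosd{n}.wikimedia.org" for n in range(1004, 1016)]

print(len(hosts))   # 12
print(hosts[0])     # cloudcephosd1004.wikimedia.org
print(hosts[-1])    # cloudcephosd1015.wikimedia.org
```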

Racking Proposal: All hosts in Row B, all in 10G racks, ideally 4 per rack to spread the cluster across PDU & switch failure domains

Networking/Subnet/VLAN/IP: 2 x 10G ports per server (12 x 2 = 24 ports). Each host should have one 10G Ethernet connection to the public subnet (wikimedia.org) and one to the private, internal subnet (eqiad.wmnet).

Partitioning/Raid: RAID 10 on OS drive pair, no RAID (JBOD Only) for data drives.
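A minimal sketch of what this layout implies for usable capacity, with made-up drive counts and sizes (the task does not state them): mirroring the OS pair yields the capacity of one member, while JBOD data drives each stand alone as Ceph OSDs and contribute their full size.

```python
# Hypothetical drive inventory; sizes in GB are illustrative, not from the task.
os_pair = [480, 480]      # two OS drives, mirrored (RAID 10 on a pair behaves like RAID 1)
data_drives = [1920] * 8  # JBOD: each data drive is exposed individually to Ceph

usable_os = min(os_pair)        # a mirror's usable size is one member's size
usable_data = sum(data_drives)  # JBOD: no redundancy at the RAID layer

print(usable_os, usable_data)   # 480 15360
```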

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcephosd1004:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1005:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1006:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1007:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1008:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1009:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1010:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1011:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1012:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1013:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1014:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudcephosd1015:

  • - receive in system on procurement task T242133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Related Objects

Event Timeline

Jclark-ctr updated the task description. May 28 2020, 1:58 PM
wiki_willy renamed this task from (Need By: TBD) rack/setup/install cloudcephosd10[04-15].wikimedia.org to (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org. Jun 8 2020, 8:23 PM
Jclark-ctr updated the task description. Jun 29 2020, 4:40 PM

host              rack  switch port  asset tag
cloudcephosd1004  C8    22           WMF5103
cloudcephosd1005  C8    23           WMF5104
cloudcephosd1006  C8    24           WMF4831
cloudcephosd1007  C8    25           WMF4830
cloudcephosd1008  C8    26           WMF4829
cloudcephosd1009  C8    27           WMF4828
cloudcephosd1010  D5    6            WMF4827
cloudcephosd1011  D5    7            WMF4826
cloudcephosd1012  D5    8            WMF4825
cloudcephosd1013  D5    9            WMF4824
cloudcephosd1014  D5    10           WMF4823
cloudcephosd1015  D5    11           WMF4822
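The placement table above can be captured as data to sanity-check the rack spread (a convenience transcription of this comment, not an authoritative source):

```python
from collections import Counter

# Transcribed from the rack/switch-port/asset-tag table in this task.
placement = {
    "cloudcephosd1004": ("C8", 22, "WMF5103"),
    "cloudcephosd1005": ("C8", 23, "WMF5104"),
    "cloudcephosd1006": ("C8", 24, "WMF4831"),
    "cloudcephosd1007": ("C8", 25, "WMF4830"),
    "cloudcephosd1008": ("C8", 26, "WMF4829"),
    "cloudcephosd1009": ("C8", 27, "WMF4828"),
    "cloudcephosd1010": ("D5", 6, "WMF4827"),
    "cloudcephosd1011": ("D5", 7, "WMF4826"),
    "cloudcephosd1012": ("D5", 8, "WMF4825"),
    "cloudcephosd1013": ("D5", 9, "WMF4824"),
    "cloudcephosd1014": ("D5", 10, "WMF4823"),
    "cloudcephosd1015": ("D5", 11, "WMF4822"),
}

# Count hosts per rack.
racks = Counter(rack for rack, _port, _tag in placement.values())
print(dict(racks))  # {'C8': 6, 'D5': 6}
```

Note that the as-built spread is six hosts in each of two racks, rather than the four-per-rack across three racks suggested in the racking proposal.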

Jclark-ctr added a subscriber: Jclark-ctr.
Cmjohnson updated the task description. Jul 6 2020, 8:29 PM

The network switches still need to be connected to the network; in the meantime, everything else will be completed so the hosts can be imaged.

I have updated the switch port descriptions but have not set any vlans.

Change 613333 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for cloudcephosd1004-1015

https://gerrit.wikimedia.org/r/613333

Change 613333 merged by Cmjohnson:
[operations/dns@master] Adding production dns for cloudcephosd1004-1015

https://gerrit.wikimedia.org/r/613333

Change 615182 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding new cloudceph servers to site.pp

https://gerrit.wikimedia.org/r/615182

Change 615182 merged by Cmjohnson:
[operations/puppet@production] Adding new cloudceph servers to site.pp

https://gerrit.wikimedia.org/r/615182

Change 615183 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add cloudcephosd mac addressess to dhcpd file

https://gerrit.wikimedia.org/r/615183

Change 615183 merged by Cmjohnson:
[operations/puppet@production] Add cloudcephosd mac addressess to dhcpd file

https://gerrit.wikimedia.org/r/615183

There has been some confusion and some informal IRC discussion about how best to cable and vlan those hosts.

The initial PoC ceph hosts had their interfaces in public-b and private-b, but those are not ideal:

  • public-b was chosen to cross row boundaries, which is not needed, as both ceph and virt hosts will live in the same vlan (cloud-hosts) and do not need to be publicly reachable
  • private-b was chosen as it is a vlan different from public-b, and not for the benefits of the private vlan itself (routable and secure)

That's why I suggested (in T251632#6292589, though it would have been better here) to configure them this way:

  • eth0: cloud-hosts1-eqiad (main vlan)
  • eth1: cloud-storage1-eqiad (unrouted private vlan)

Both present in row B, cloudsw-c8 and cloudsw-d5.

This also means that the PoC hosts will need to be renumbered to those final vlans.
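As a sketch only, the proposed per-host layout could look like this in Debian ifupdown terms. Interface names, addresses, and the gateway below are made-up placeholders, not values from this task; the one substantive point is that the storage vlan deliberately gets no gateway, since it is unrouted.

```text
# /etc/network/interfaces fragment -- illustrative only
auto eno1
iface eno1 inet static
    # cloud-hosts1-eqiad: main, routed vlan
    address 10.64.20.10/24      # placeholder address
    gateway 10.64.20.1          # placeholder gateway

auto eno2
iface eno2 inet static
    # cloud-storage1-eqiad: unrouted private vlan (no gateway on purpose)
    address 192.168.4.10/24     # placeholder address
```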

Change 615513 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding cloudcephosd servers to private vlan

https://gerrit.wikimedia.org/r/615513

Change 615513 merged by Cmjohnson:
[operations/dns@master] Adding cloudcephosd servers to private vlan

https://gerrit.wikimedia.org/r/615513

Andrew added a subscriber: Andrew. (Edited) Jul 22 2020, 4:28 PM

@ayounsi, the POC hosts are currently hosting a small amount of user workload. Will renumbering them cause a service interruption?

Edited to add: nevermind, Brooke is stating this more clearly

Bstorm added a subscriber: Bstorm. (Edited) Jul 22 2020, 4:41 PM

@ayounsi I agree about the choice of public-b, I think. We don't need to cross rows, as long as we can still reach the initial PoC cluster, because it is part of this cluster. These hosts need to talk to the cephmon hosts for that cluster and to all three existing cephosd hosts; it is a hard requirement that they be able to reach the initial three cephosd hosts. Those are on the private-b network because it is secure, and that traffic should be reasonably secure because Ceph trusts it. I'm not sure about the idea that it doesn't need to be routed? It depends on what that implies here. That network (in Ceph it is called the private network, which is confusing for these discussions) carries normal traffic between hosts, not crossover or anything like that. It is the busier of the two networks for an OSD server and, as currently set up, has to be able to see all other OSDs in the cluster.

I guess my strongest concern here is that the PoC cluster is actually the seed for the rest of the build-out, not a separate cluster. These hosts need to be in touch with those hosts in the end. I worry that the requirement to renumber them could block this, given @Andrew's comment; it would help if you could walk us through the implications, etc. :)

NOTE: For any spectators, we are setting up a meeting to make sure we are all synced up on this ASAP (likely tomorrow).

Change 615765 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Addig cloudcephosd to cloud-host vlan

https://gerrit.wikimedia.org/r/615765

Change 615765 merged by Cmjohnson:
[operations/dns@master] Addig cloudcephosd to cloud-host vlan

https://gerrit.wikimedia.org/r/615765

Current status on the switch side is that the vlans (cloud-hosts + cloud-storage) are configured, but the Ceph hosts are powered off or not cabled (ports admin-up, link down):

xe-0/0/22       up    down cloudcephosd1004:en0
xe-0/0/23       up    down cloudcephosd1005:en0
xe-0/0/24       up    down cloudcephosd1006:en0
xe-0/0/25       up    down cloudcephosd1007:en0
xe-0/0/26       up    down cloudcephosd1008:en0
xe-0/0/27       up    down cloudcephosd1009:en0
xe-0/0/42       up    down cloudcephosd1004:en1
xe-0/0/43       up    down cloudcephosd1005:en1
xe-0/0/44       up    down cloudcephosd1006:en1
xe-0/0/45       up    down cloudcephosd1007:en1
xe-0/0/46       up    down cloudcephosd1008:en1
xe-0/0/47       up    down cloudcephosd1009:en1

I first thought they were connected, as some servers show up in LLDP:

xe-0/0/37          -                   bc:97:e1:4a:12:d2   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/8           -                   bc:97:e1:4a:12:d3   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/38          -                   bc:97:e1:4a:37:2c   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/9           -                   bc:97:e1:4a:37:2d   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/40          -                   bc:97:e1:4a:68:52   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/11          -                   bc:97:e1:4a:68:53   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/39          -                   bc:97:e1:4a:6d:50   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/10          -                   bc:97:e1:4a:6d:51   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/36          -                   bc:97:e1:4a:7b:ba   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0
xe-0/0/7           -                   bc:97:e1:4a:7b:bb   NIC 1/10Gb Unknown Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.4.6.0

Not sure what they are, maybe hypervisors?

Change 615790 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding the mgmt dns entries created by netbox to dns file (not yet automated)

https://gerrit.wikimedia.org/r/615790

Change 615790 merged by Cmjohnson:
[operations/dns@master] Adding the mgmt dns entries created by netbox to dns file (not yet automated)

https://gerrit.wikimedia.org/r/615790

Change 615828 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Rename cloudcephosd1004 through 1015.

https://gerrit.wikimedia.org/r/615828

Change 615828 merged by Andrew Bogott:
[operations/puppet@production] Rename cloudcephosd1004 through 1015.

https://gerrit.wikimedia.org/r/615828

Change 615832 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] WMCS Ceph: add address entries for new OSD nodes

https://gerrit.wikimedia.org/r/615832

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1005.eqiad.wmnet', 'cloudcephosd1006.eqiad.wmnet', 'cloudcephosd1007.eqiad.wmnet', 'cloudcephosd1008.eqiad.wmnet', 'cloudcephosd1009.eqiad.wmnet', 'cloudcephosd1010.eqiad.wmnet', 'cloudcephosd1011.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007232043_andrew_3929.log.

Completed auto-reimage of hosts:

['cloudcephosd1008.eqiad.wmnet', 'cloudcephosd1009.eqiad.wmnet', 'cloudcephosd1010.eqiad.wmnet', 'cloudcephosd1006.eqiad.wmnet', 'cloudcephosd1005.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1011.eqiad.wmnet', 'cloudcephosd1007.eqiad.wmnet']

Change 615838 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudcephosd nodes: Experiment with using a hw raid for the / volume

https://gerrit.wikimedia.org/r/615838

Change 615838 merged by Andrew Bogott:
[operations/puppet@production] cloudcephosd nodes: Experiment with using a hw raid for the / volume

https://gerrit.wikimedia.org/r/615838

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007232112_andrew_15882.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007232130_andrew_19718.log.

Completed auto-reimage of hosts:

['cloudcephosd1004.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007232204_andrew_27467.log.

Completed auto-reimage of hosts:

['cloudcephosd1004.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1011.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007232236_andrew_4224.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1005.eqiad.wmnet', 'cloudcephosd1006.eqiad.wmnet', 'cloudcephosd1007.eqiad.wmnet', 'cloudcephosd1008.eqiad.wmnet', 'cloudcephosd1009.eqiad.wmnet', 'cloudcephosd1010.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007232236_andrew_4118.log.

Completed auto-reimage of hosts:

['cloudcephosd1011.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1011.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudcephosd1005.eqiad.wmnet', 'cloudcephosd1009.eqiad.wmnet', 'cloudcephosd1007.eqiad.wmnet', 'cloudcephosd1010.eqiad.wmnet', 'cloudcephosd1006.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1008.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1011.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007232316_andrew_17911_cloudcephosd1011_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1005.eqiad.wmnet', 'cloudcephosd1006.eqiad.wmnet', 'cloudcephosd1007.eqiad.wmnet', 'cloudcephosd1008.eqiad.wmnet', 'cloudcephosd1009.eqiad.wmnet', 'cloudcephosd1010.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007232315_andrew_17849.log.

Completed auto-reimage of hosts:

['cloudcephosd1011.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1011.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudcephosd1007.eqiad.wmnet', 'cloudcephosd1008.eqiad.wmnet', 'cloudcephosd1009.eqiad.wmnet', 'cloudcephosd1006.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1005.eqiad.wmnet', 'cloudcephosd1010.eqiad.wmnet']

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1010.eqiad.wmnet

  • cloudcephosd1010.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007240016_andrew_1222_cloudcephosd1010_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1010.eqiad.wmnet']

and were ALL successful.

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1009.eqiad.wmnet

  • cloudcephosd1009.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1008.eqiad.wmnet

  • cloudcephosd1008.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1006.eqiad.wmnet

  • cloudcephosd1006.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1007.eqiad.wmnet

  • cloudcephosd1007.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1004.eqiad.wmnet

  • cloudcephosd1004.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudcephosd1005.eqiad.wmnet

  • cloudcephosd1005.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1005.eqiad.wmnet', 'cloudcephosd1006.eqiad.wmnet', 'cloudcephosd1007.eqiad.wmnet', 'cloudcephosd1008.eqiad.wmnet', 'cloudcephosd1009.eqiad.wmnet', 'cloudcephosd1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007240048_andrew_8107.log.

Completed auto-reimage of hosts:

['cloudcephosd1005.eqiad.wmnet', 'cloudcephosd1008.eqiad.wmnet', 'cloudcephosd1006.eqiad.wmnet', 'cloudcephosd1007.eqiad.wmnet', 'cloudcephosd1004.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1009.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1012.eqiad.wmnet', 'cloudcephosd1013.eqiad.wmnet', 'cloudcephosd1014.eqiad.wmnet', 'cloudcephosd1015.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007241524_andrew_30738.log.

Change 615832 merged by Andrew Bogott:
[operations/puppet@production] WMCS Ceph: add address entries for new OSD nodes

https://gerrit.wikimedia.org/r/615832

Change 616115 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Make cloudcephosd1005.eqiad.wmnet a ceph node

https://gerrit.wikimedia.org/r/616115

Change 616115 merged by Andrew Bogott:
[operations/puppet@production] Make cloudcephosd1005.eqiad.wmnet a ceph node

https://gerrit.wikimedia.org/r/616115

Change 616119 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Fix naming of cloudcephosd1004 in site.pp

https://gerrit.wikimedia.org/r/616119

Change 616119 merged by Andrew Bogott:
[operations/puppet@production] Fix naming of cloudcephosd1004 in site.pp

https://gerrit.wikimedia.org/r/616119

Completed auto-reimage of hosts:

['cloudcephosd1013.eqiad.wmnet', 'cloudcephosd1014.eqiad.wmnet', 'cloudcephosd1015.eqiad.wmnet', 'cloudcephosd1012.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1013.eqiad.wmnet', 'cloudcephosd1014.eqiad.wmnet', 'cloudcephosd1015.eqiad.wmnet', 'cloudcephosd1012.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1012.eqiad.wmnet', 'cloudcephosd1013.eqiad.wmnet', 'cloudcephosd1014.eqiad.wmnet', 'cloudcephosd1015.eqiad.wmnet', 'cloudcephosd1011.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007241952_andrew_5592.log.

Completed auto-reimage of hosts:

['cloudcephosd1011.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1012.eqiad.wmnet', 'cloudcephosd1013.eqiad.wmnet', 'cloudcephosd1014.eqiad.wmnet', 'cloudcephosd1015.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1012.eqiad.wmnet', 'cloudcephosd1013.eqiad.wmnet', 'cloudcephosd1014.eqiad.wmnet', 'cloudcephosd1015.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007242013_andrew_26387.log.

Completed auto-reimage of hosts:

['cloudcephosd1012.eqiad.wmnet', 'cloudcephosd1014.eqiad.wmnet', 'cloudcephosd1013.eqiad.wmnet', 'cloudcephosd1015.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1012.eqiad.wmnet', 'cloudcephosd1014.eqiad.wmnet', 'cloudcephosd1013.eqiad.wmnet', 'cloudcephosd1015.eqiad.wmnet']

Change 616847 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Make the rest of the cloudcephosd hosts into osd nodes

https://gerrit.wikimedia.org/r/616847

Change 616847 merged by Andrew Bogott:
[operations/puppet@production] Make the rest of the cloudcephosd hosts into osd nodes

https://gerrit.wikimedia.org/r/616847

Change 616855 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Ceph osd nodes: install bootstrap keyring

https://gerrit.wikimedia.org/r/616855

Change 616855 merged by Andrew Bogott:
[operations/puppet@production] Ceph osd nodes: install bootstrap keyring

https://gerrit.wikimedia.org/r/616855

These hosts are in service now. @Cmjohnson, can this be closed?

Cmjohnson closed this task as Resolved. Jul 29 2020, 3:25 PM

Thanks @Andrew for the assist with these! Resolved

Change 619503 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] updating mgmt ip to reflect correct asset tag cloudcephosd host

https://gerrit.wikimedia.org/r/619503

Change 619503 abandoned by Cmjohnson:
[operations/dns@master] updating mgmt ip to reflect correct asset tag cloudcephosd host

Reason:

https://gerrit.wikimedia.org/r/619503

ayounsi removed a subscriber: ayounsi. Oct 30 2020, 12:46 PM