
(Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of cloudcephosd10[16-20].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: cloudcephosd10[16-20].eqiad.wmnet
Racking Proposal: must go into cloudracks (c8, d5)
Networking/Subnet/VLAN/IP: plug into the cloudsw in the rack; 2 x 10G ports per server (5 x 2 = 10 ports). Each host should have its 1:10G on cloud-hosts1-eqiad and its 2:10G on cloud-storage1-eqiad.
Partitioning/Raid: all disks in non-raid mode on hw controller then sw RAID 10 on OS drive pair, no RAID (JBOD Only) for data drives.
OS Distro: Buster
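
The bracketed range in the hostname is shorthand for five hosts. A minimal Python sketch of that expansion and of the port arithmetic above (5 x 2 = 10) could look like the following. The `expand_range` helper is hypothetical and handles only a single numeric `[a-b]` range; it is not the syntax parser used by any WMF tooling.

```python
import re


def expand_range(pattern: str) -> list:
    """Expand one numeric [start-end] range in a hostname pattern (hypothetical helper)."""
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        # No range present: the pattern is already a single hostname.
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve zero padding, if any
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(int(start), int(end) + 1)]


hosts = expand_range("cloudcephosd10[16-20].eqiad.wmnet")
ports_needed = len(hosts) * 2  # 2 x 10G ports per server, as stated above

print(hosts)
print(ports_needed)
```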

Per host setup checklist

cloudcephosd1016:

  • - receive in system on procurement task T271239 & in coupa
  • - move system from existing rack to WMCS rack C8.
  • - list all of the new network port DAC cable IDs and port information.
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname), run cookbook sre.dns.netbox.
  • - add production dns entries in netbox via script for primary interface, then manually for secondary interface listed on T274945#7024798, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudcephosd1017:

  • - receive in system on procurement task T271239 & in coupa
  • - move system from existing rack to WMCS rack C8.
  • - list all of the new network port DAC cable IDs and port information.
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname), run cookbook sre.dns.netbox.
  • - add production dns entries in netbox via script for primary interface, then manually for secondary interface listed on T274945#7024798, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudcephosd1018:

  • - receive in system on procurement task T271239 & in coupa
  • - move system from existing rack to WMCS rack C8.
  • - list all of the new network port DAC cable IDs and port information.
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname), run cookbook sre.dns.netbox.
  • - add production dns entries in netbox via script for primary interface, then manually for secondary interface listed on T274945#7024798, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudcephosd1019:

  • - receive in system on procurement task T271239 & in coupa
  • - move system from rack D4 to WMCS rack D5.
  • - list all of the new network port DAC cable IDs and port information.
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname), run cookbook sre.dns.netbox.
  • - add production dns entries in netbox via script for primary interface, then manually for secondary interface listed on T274945#7024798, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudcephosd1020:

  • - receive in system on procurement task T271239 & in coupa
  • - move system from rack D4 to WMCS rack D5.
  • - list all of the new network port DAC cable IDs and port information.
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname), run cookbook sre.dns.netbox.
  • - add production dns entries in netbox via script for primary interface, then manually for secondary interface listed on T274945#7024798, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline


These need to be re-imaged with internal IPs and names in .eqiad.wmnet. Sorry for the confusion!

Reopening (already synced with Andrew via irc). In the future, please reopen tasks that need action, otherwise it's easy to miss in an avalanche of phab alerts! ; D

I'll update dns info and reimage later today. I need to check the docs on how to do so properly; iirc it's decom and then rerun the interface setup scripts.

@Andrew,

The info for networking is as follows:

Networking/Subnet/VLAN/IP: 2 x 10G ports per server (12 x 2 = 24 ports). One 10G ethernet network connection to the .wmnet subnet and one to the private, internal (eqiad.wmnet) should be on each host.

This means dual non bonded connections to the same vlan, which isn't supported for use as far as I know. So I'm guessing that second connection needs to go to a different vlan? Please advise and assign back to me for completion.

At this time, I've decommissioned cloudcephosd1016 and was about to reprovision and reimage, but the above networking question has blocked this.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: cloudcephosd1016.wikimedia.org

  • cloudcephosd1016.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Failed to power off, manual intervention required: Remote IPMI for cloudcephosd1016.mgmt.eqiad.wmnet failed (exit=1): b''
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

I have updated the networking request yet again -- with luck I got it right this time. For reference it's probably worth comparing with existing cloudcephosd1001-1015 on netbox; these are intended to be part of the same cluster.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: cloudcephosd[1017-1019].wikimedia.org

  • cloudcephosd1017.wikimedia.org (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cloudcephosd1018.wikimedia.org (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cloudcephosd1019.wikimedia.org (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 681476 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] updating cloudcephosd fqdn

https://gerrit.wikimedia.org/r/681476

Change 681476 merged by RobH:

[operations/puppet@production] updating cloudcephosd fqdn

https://gerrit.wikimedia.org/r/681476

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: cloudcephosd1020.wikimedia.org

  • cloudcephosd1020.wikimedia.org (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1016.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104202054_robh_8804_cloudcephosd1016_eqiad_wmnet.log.

All hosts have had the decom cookbook run for them, and then set back to planned.

Network port and cable mapping per host:

cloudcephosd1016
5344 	asw2-b4-eqiad xe-4/0/11
5345 	asw2-b4-eqiad xe-4/0/13

cloudcephosd1017
5346 	asw2-b4-eqiad xe-4/0/22
5347 	asw2-b4-eqiad xe-4/0/23

cloudcephosd1018
5349 	asw2-b4-eqiad xe-4/0/17
5348 	asw2-b4-eqiad xe-4/0/29

cloudcephosd1019
5350 	asw2-d4-eqiad xe-4/0/14
5351 	asw2-d4-eqiad xe-4/0/15

cloudcephosd1020
0012 	asw2-d4-eqiad xe-4/0/8
5353 	asw2-d4-eqiad xe-4/0/9

Completed auto-reimage of hosts:

['cloudcephosd1016.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1016.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1016.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104202128_robh_14752_cloudcephosd1016_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1016.eqiad.wmnet']

and were ALL successful.

Summary of work so far:

All hosts have their interfaces removed and added back in the proper vlan. The primary interface has an IP address. I am not certain how to go about adding the IP address in the correct subnet, as clicking 'assign IP' gives me EVERY IP in use not just available ones.

DNS cookbook ran for these changes, as well as homer run against asw2-b-eqiad* and asw2-d-eqiad*

So each of these has the insetup role, and can work without the secondary interface having an IP address. However, I'm guessing they'll need one before actual use, so I'll need to coordinate with Chris or Papaul to figure out how I'm supposed to do this (I don't want someone else to just do it for me, I want to get how it should be done in Netbox properly.)

cloudcephosd1016 reimaged, but still needs secondary interface to have an IP assigned.

RobH added a subscriber: ayounsi.
Configuration diff for asw2-d-eqiad.mgmt.eqiad.wmnet:

[edit interfaces interface-range disabled]
-    member xe-4/0/8;
-    member xe-4/0/14;
[edit interfaces]
    interface-range vlan-analytics1-d-eqiad { ... }
+   interface-range vlan-cloud-hosts1-eqiad {
+       member xe-4/0/8;
+       member xe-4/0/14;
+       mtu 9192;
+       unit 0 {
+           family ethernet-switching {
+               interface-mode access;
+               vlan {
+                   members cloud-hosts1-eqiad;
+               }
+           }
+       }
+   }
    interface-range vlan-private1-d-eqiad { ... }
[edit interfaces]
+   xe-4/0/8 {
+       description "cloudcephosd1020 {#0012}";
+   }
+   xe-4/0/14 {
+       description "cloudcephosd1019 {#5350}";
+   }
[edit vlans]
+   cloud-hosts1-eqiad {
+       description "on asw2-b-eqiad and cloudsw (c8/d5) switches";
+       vlan-id 1118;
+       forwarding-options {
+           dhcp-security {
+               option-82 {
+                   circuit-id {
+                       prefix {
+                           host-name;
+                       }
+                       use-vlan-id;
+                   }
+               }
+           }
+       }
+   }

Type "yes" to commit, "no" to abort.

@ayounsi:
I now regret committing that, as I think these being in row D switch stack cannot have cloud host vlan for their primary interfaces, correct?

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104202225_robh_25636_cloudcephosd1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1017.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104202251_robh_29999_cloudcephosd1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1017.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1017.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudcephosd1017.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1017.eqiad.wmnet']

Not sure why the puppet run failed, but that was the ONLY thing that failed in the reimage, so I ran it manually and it's all good (it even had its key already signed; perhaps it somehow hit a race condition in the initial puppet run?)

cloudcephosd101[67] reimaged, the rest still to be reimaged. The secondary storage interface lacks an IP on all of them, and cloudcephosd10[19|20] are in row D. If they need to be in cloud hosts, then they have to move to row B or into C8 or D5. I'll need to coordinate with @ayounsi on how to best undo my poor commit via T274945#7022479

The secondary IPs are set by puppet and are on a subnet that's separate from anything managed by netbox so I think we're all good... I'll see about getting a node online.

Change 681681 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Make cloudcephosd1016 an osd node

https://gerrit.wikimedia.org/r/681681

Change 681681 merged by Andrew Bogott:

[operations/puppet@production] Make cloudcephosd1016 an osd node

https://gerrit.wikimedia.org/r/681681

The secondary IPs are set by puppet and are on a subnet that's separate from anything managed by netbox so I think we're all good... I'll see about getting a node online.

I'm no longer entirely sure this is right... for example on cloudcephosd1014 I see the second IP in netbox. Since there are clearly other things happening here I'm going to leave these in rob's hands for now. The secondary addresses should/will be:

cloudcephosd1016: 192.168.4.16
cloudcephosd1017: 192.168.4.17
cloudcephosd1018: 192.168.4.18
cloudcephosd1019: 192.168.4.19
cloudcephosd1020: 192.168.4.20
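
The five addresses above follow a visible pattern: the last two digits of the host number become the last octet. A hedged Python sketch of that mapping is below. The /24 prefix length is an assumption (the thread lists only the addresses), and `secondary_ip` is a hypothetical helper, not how puppet actually assigns them.

```python
import ipaddress
import re

# Assumption: the storage subnet is a /24; only the five addresses are given above.
STORAGE_NET = ipaddress.ip_network("192.168.4.0/24")


def secondary_ip(hostname: str) -> ipaddress.IPv4Address:
    """Map cloudcephosd10NN to 192.168.4.NN (pattern inferred from the list above)."""
    match = re.match(r"cloudcephosd10(\d{2})\b", hostname)
    if match is None:
        raise ValueError(f"unexpected hostname: {hostname}")
    # IPv4Address supports integer addition, giving the NN-th address in the subnet.
    return STORAGE_NET.network_address + int(match.group(1))


for n in range(16, 21):
    name = f"cloudcephosd10{n}"
    print(name, secondary_ip(name))
```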

Shouldn't they have been racked in the cloud racks (C8 and D5) to be connected to cloudsw switches?

@RobH to your specific questions:
cloudcephosd1019 and cloudcephosd1020 are in rack D4, which is incorrect.
Regardless, to keep the switch configuration clean, asw2-d4-eqiad xe-4/0/8 and xe-4/0/14 need to be disabled in Netbox (+ vlan removed + mtu set to null). Then running homer will clean the switch config.
Then the hosts will have to move to the proper racks, and current cables deleted as well.

Understood, and yeah, they need to move; the networking requirements for this racking task have shifted since racking took place. I'll remove all the bad items and rerun homer to fix it later today! Thanks for confirming how to fix!

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: cloudcephosd1016.eqiad.wmnet

  • cloudcephosd1016.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: cloudcephosd1017.eqiad.wmnet

  • cloudcephosd1017.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

All new 10G cloud hosts need to be racked into C8 or D5 ONLY, so all of these hosts must be moved. 3 were racked into row B, and 2 into row D but not D5.

The checklists have been updated to reflect this. I've also run the decom script for the hosts I had previously imaged, so they can simply be moved to their new racks and the task commented on with their new network ports so I can run the update scripts.

@Jclark-ctr,

Please review the updated checklists above, complete the required steps, and reassign to me for followup. Thanks. If you aren't able to do this week, please assign this over to Chris for completion for the first half of next week. If Chris cannot complete before Thursday of next week, he'll need to reassign this back to John.

Chatted with John, who pointed out D5's switch is nearly full. I logged in, and indeed it only has 4 ports left, so when 2 of these hosts move in, that's it!

I neglected to reassign this last week. These can be moved according to the updated per host checklist in the task description.

Finished moving the hosts. Ports and cable IDs are listed below:

cloudcephosd1016: rack C8, U30, ports 12,15, cable IDs 5348,5349
cloudcephosd1017: rack C8, U31, ports 14,17, cable IDs 5347,5346
cloudcephosd1018: rack C8, U32, ports 16,33, cable IDs 5396,5397
cloudcephosd1019: rack D5, U29, ports 2,3, cable IDs 5350,5351
cloudcephosd1020: rack D5, U30, ports 5,47, cable IDs 5353,5391
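
As a quick sanity check of the new racking against the "C8 or D5 ONLY" rule, and of the D5 port budget mentioned earlier (4 free ports, 2 ports per host), here is a small Python sketch. The rack data is copied from the comment above; the `misplaced` helper and the variable names are hypothetical.

```python
# Racking data copied from the comment above.
RACKING = {
    "cloudcephosd1016": "C8",
    "cloudcephosd1017": "C8",
    "cloudcephosd1018": "C8",
    "cloudcephosd1019": "D5",
    "cloudcephosd1020": "D5",
}
ALLOWED_RACKS = {"C8", "D5"}  # cloud racks named in the task description


def misplaced(racking: dict, allowed: set) -> list:
    """Return the hosts whose rack is not one of the allowed cloud racks."""
    return sorted(host for host, rack in racking.items() if rack not in allowed)


# D5 port budget: the thread reports 4 free ports before the move.
d5_free_before = 4
d5_hosts = [host for host, rack in RACKING.items() if rack == "D5"]
d5_free_after = d5_free_before - 2 * len(d5_hosts)  # 2 x 10G ports per host

print(misplaced(RACKING, ALLOWED_RACKS))
print(d5_free_after)
```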

Hey, any updates on this? Can we help with anything, for example the switch config?

ayounsi raised the priority of this task from Medium to High.Jun 11 2021, 7:55 AM

There is some kind of misconfiguration here:
asw2-b-eqiad has these pending changes:

[edit interfaces interface-range disabled]
-    member xe-4/0/23;
[edit interfaces interface-range vlan-public1-b-eqiad]
     member xe-2/0/36 { ... }
+    member xe-4/0/23;
     member xe-4/0/41 { ... }
[edit interfaces]
+   xe-4/0/23 {
+       description "cloudcephosd1017 {#5347}";
+   }

cloudcephosd1017 is in C8 and shouldn't be on the public vlan, so cabling info needs to be corrected in Netbox.

I'm not sure why I'm not seeing a DHCP request for cloudcephosd1016:

cloudcephosd1016. rack C8. U30 port 12,15 ID 5348,5349
https://netbox.wikimedia.org/dcim/devices/3137/interfaces/

I requested that Chris take a look at these and see what I'm missing.

Updated all the netbox port information, added the 2nd interface, and connected it to the cloud-storage vlan. Named the interfaces ens3f0np0 and ens3f1np1 respectively.

I am still unable to PXE boot; I do not see the server hitting install1003. I will need to verify that the cables are connected to the correct ports on the server, and after that I am not sure. @andrewbogott can you check the interface names?

ens3f0np0 and ens3f1np1 look right to me, although I won't know for sure until we see what debian calls them.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1016.eqiad.wmnet', 'cloudcephosd1017.eqiad.wmnet', 'cloudcephosd1018.eqiad.wmnet', 'cloudcephosd1019.eqiad.wmnet', 'cloudcephosd1020.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202106291754_robh_3121.log.

irc update: chris said all the cables were actually swapped, so he had to change them around (port 1/2 confusion on the server side of the NIC). Now they should be fine. I one-off booted 1020 via PXE, and have now fired them all via script.

Change 702199 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] cloudcephosd1016 insetup role

https://gerrit.wikimedia.org/r/702199

Change 702199 merged by RobH:

[operations/puppet@production] cloudcephosd1016 insetup role

https://gerrit.wikimedia.org/r/702199

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1016.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106291840_robh_13398_cloudcephosd1016_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106291859_robh_28973_cloudcephosd1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1017.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1017.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106291900_robh_29355_cloudcephosd1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1016.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1018.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106291911_robh_9646_cloudcephosd1018_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1018.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1018.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1018.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106291911_robh_9846_cloudcephosd1018_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1018.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106291917_robh_14610_cloudcephosd1018_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1018.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1018.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1019.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106291918_robh_15512_cloudcephosd1019_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1017.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1020.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106291926_robh_23021_cloudcephosd1020_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1018.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudcephosd1019.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudcephosd1020.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

@Andrew these are now ready for your use

Installed and put in active on netbox the servers 16, 17, 19 and 20. Waiting for 18 to be fixed 👍