Q4:rack/setup/install new cloudcephmon hosts
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of <hostnames to be determined>

Hostname / Racking / Installation Details

Hostnames:
Racking Proposal: Where should these systems be racked? Can they share with any existing systems or should they avoid any other systems sharing their rack or row? (Note EQIAD now has rows A-F.)
  Spread across racks E4 (or F4), C8 and D5.
Networking Setup: # of Connections:1/2 - Speed:1G/10G. - VLAN:Private/Public/Other(Specify) : AAAA records:Y/N, Additional IP records (Cassandra)? Yes/No
Partitioning/Raid: HW Raid: Y/N, Partman recipe and/or desired Raid Level:
OS Distro: Bullseye (default unless otherwise specified)
Sub-team Technical Contact: @Andrew

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcephmon1004:
  • Receive in system on procurement task T361363 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephmon1005:
  • Receive in system on procurement task T361363 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephmon1006:
  • Receive in system on procurement task T361363 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
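
The checklist is identical for all three hosts and its order matters (the DNS and provision cookbooks follow the Netbox script immediately). A minimal sketch of that ordering as data, assuming nothing beyond the step text above; `next_step` is a hypothetical helper that only reports the next item and does not call Netbox or any cookbook.

```python
from typing import List, Optional

# Ordered steps copied from the per-host checklist in this task.
CHECKLIST: List[str] = [
    "Receive in system on procurement task T361363 & in Coupa",
    "Rack system with proposed racking plan & update Netbox",
    "Run the Provision a server's network attributes Netbox script",
    "Immediately run the sre.dns.netbox cookbook",
    "Immediately run the sre.hosts.provision cookbook",
    "Run the sre.hardware.upgrade-firmware cookbook",
    "Update the operations/puppet repo (preseed.yaml and site.pp)",
    "Run the sre.hosts.reimage cookbook",
]

HOSTS = ["cloudcephmon1004", "cloudcephmon1005", "cloudcephmon1006"]


def next_step(completed: int) -> Optional[str]:
    """Return the next checklist step, or None once every step is done.

    `completed` is the number of steps already finished (hypothetical helper).
    """
    if completed < 0:
        raise ValueError("completed must be >= 0")
    return CHECKLIST[completed] if completed < len(CHECKLIST) else None


if __name__ == "__main__":
    # Nothing is executed here; this only prints where each host would start.
    for host in HOSTS:
        print(f"{host}: next -> {next_step(0)}")
```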

Event Timeline

RobH triaged this task as High priority.
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).

@Andrew @dcaro these have arrived. Once we have the racking information, we will be able to rack and image them fairly quickly.

Hi @dcaro - just following up on this. Can you provide the racking information for us, to start this install?

Thanks,
Willy

For now, these servers need two network connections as listed on the Wikitech here:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network#cephmon

Aside from that, they can go into racks C8, D5, E4 or F4, wherever they will fit and there are sufficient free switch ports. Probably best to put them all in different racks.

Yep, they should be in different racks please :)
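
The racking constraint above (candidate racks C8, D5, E4 or F4, one host per rack) is easy to check mechanically before updating Netbox. A minimal sketch; the function name `check_rack_spread` and the example plan are hypothetical and are not the final racking.

```python
from collections import Counter
from typing import Dict, List

# Candidate racks named in the comments above.
ALLOWED_RACKS = {"C8", "D5", "E4", "F4"}


def check_rack_spread(assignment: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the plan is acceptable."""
    problems = []
    # Every host must sit in one of the candidate racks.
    for host, rack in sorted(assignment.items()):
        if rack not in ALLOWED_RACKS:
            problems.append(f"{host}: rack {rack} is not a candidate rack")
    # No rack may hold more than one of these hosts.
    for rack, count in sorted(Counter(assignment.values()).items()):
        if count > 1:
            problems.append(f"rack {rack} holds {count} hosts; use one host per rack")
    return problems


# Hypothetical plan, for illustration only.
plan = {
    "cloudcephmon1004": "C8",
    "cloudcephmon1005": "E4",
    "cloudcephmon1006": "F4",
}
print(check_rack_spread(plan))
```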

@Andrew @dcaro thank you for providing the update. Do you have hostnames for these? Please also update preseed.yaml and site.pp.

Sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053317, @Andrew can you review please?

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1006.eqiad.wmnet with OS bullseye

@Jclark-ctr please update task with the error you are getting and what is on the console.

@Papaul they fail to start PXE. I have downgraded the firmware on the NIC and set the correct ports for PXE, but they still fail to even start imaging. I have tried changing VLANs in Netbox with no luck.

@Jclark-ctr that is helpful information. I will take a look at it once on site.

@Jclark-ctr I checked 1004: PXE boot was set on both the 1G and the 10G. I disabled it on the 1G, so you should be good now. You can check the other hosts and do the same.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye

@Papaul 1004 will still not start to image. I checked the VLANs and compared them to 1003, also with no luck.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephmon1004.eqiad.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephmon1005.eqiad.wmnet to get a root shell but depending on the failure this may not work.

There is an outstanding diff on the switch for cloudcephmon1006. It looks correct, but could DCops double check it and make sure the switch config is in sync with Netbox?

Configuration diff for cloudsw1-f4-eqiad.mgmt.eqiad.wmnet
[edit interfaces xe-0/0/47]
+   native-vlan-id 1118;
[edit interfaces xe-0/0/47 unit 0 family ethernet-switching]
-      interface-mode access;
+      interface-mode trunk;
[edit interfaces xe-0/0/47 unit 0 family ethernet-switching vlan]
-       members cloud-hosts1-f4-eqiad;
+       members [ cloud-hosts1-eqiad cloud-private-f4-eqiad ];
[edit vlans]
+   cloud-hosts1-eqiad {
+       description "Legacy cloud-hosts stretched vlan";
+       vlan-id 1118;
+   }
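
When reviewing a diff like the one above, it can help to list which VLANs it would remove from and add to the port's `members` statement. A minimal sketch that parses the diff text as pasted in this task; `vlan_member_changes` is a hypothetical helper, not part of any Wikimedia tooling, and it only looks at `members` lines.

```python
import re
from typing import Dict, Set

# The switch diff exactly as quoted in this task.
DIFF = """\
[edit interfaces xe-0/0/47]
+   native-vlan-id 1118;
[edit interfaces xe-0/0/47 unit 0 family ethernet-switching]
-      interface-mode access;
+      interface-mode trunk;
[edit interfaces xe-0/0/47 unit 0 family ethernet-switching vlan]
-       members cloud-hosts1-f4-eqiad;
+       members [ cloud-hosts1-eqiad cloud-private-f4-eqiad ];
[edit vlans]
+   cloud-hosts1-eqiad {
+       description "Legacy cloud-hosts stretched vlan";
+       vlan-id 1118;
+   }
"""


def vlan_member_changes(diff: str) -> Dict[str, Set[str]]:
    """Collect VLAN names removed from or added to `members` statements."""
    changes: Dict[str, Set[str]] = {"removed": set(), "added": set()}
    for line in diff.splitlines():
        match = re.match(r"^([+-])\s+members\s+(.*);$", line)
        if not match:
            continue
        sign, value = match.groups()
        # A value is either a single name or a bracketed, space-separated list.
        names = value.strip("[] ").split()
        changes["added" if sign == "+" else "removed"].update(names)
    return changes


changes = vlan_member_changes(DIFF)
print("removed:", sorted(changes["removed"]))
print("added:", sorted(changes["added"]))
```

Here the per-rack VLAN cloud-hosts1-f4-eqiad is in the removed set and cloud-hosts1-eqiad (described in the diff as the legacy stretched VLAN) is in the added set, which is the kind of change worth a second look.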

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

I fixed this and pushed the diff to the switch. The host should have gone onto cloud-hosts1-f4-eqiad, not the legacy VLAN stretched between d8/d5.

@Jclark-ctr I fixed up some issues in Netbox for cloudcephmon1006 (it was on the wrong primary VLAN and had an IP from cloud-private-f4, not cloud-hosts1-f4) and for cloudcephmon1005 (IP from cloud-private-e4, not cloud-hosts1-e4). Other than that they look OK; cloudcephmon1004 had no problems in Netbox.
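
The Netbox problem described above (a primary IP allocated from the cloud-private prefix instead of the cloud-hosts1 prefix) can be caught with a simple containment check. A minimal sketch using Python's `ipaddress` module; the prefixes below are RFC 5737 documentation ranges standing in for the real ones, and `vlan_of` and `primary_ip_ok` are hypothetical helpers.

```python
import ipaddress
from typing import Optional

# Hypothetical prefixes (RFC 5737 documentation ranges), NOT the real allocations.
PREFIXES = {
    "cloud-hosts1-f4-eqiad": ipaddress.ip_network("192.0.2.0/24"),
    "cloud-private-f4-eqiad": ipaddress.ip_network("198.51.100.0/24"),
}


def vlan_of(ip: str) -> Optional[str]:
    """Return the name of the VLAN whose prefix contains `ip`, if any."""
    addr = ipaddress.ip_address(ip)
    for name, network in PREFIXES.items():
        if addr in network:
            return name
    return None


def primary_ip_ok(ip: str, expected_vlan: str) -> bool:
    """True when the primary IP comes from the expected VLAN's prefix."""
    return vlan_of(ip) == expected_vlan


# An address from the private prefix fails the check for the hosts VLAN.
print(primary_ip_ok("198.51.100.10", "cloud-hosts1-f4-eqiad"))
print(primary_ip_ok("192.0.2.10", "cloud-hosts1-f4-eqiad"))
```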

I tried to kick off a reimage of cloudcephmon1004 but it's not completing. I can see DHCP works fine, and Debian begins to load, but it seems to stall after that.

On the virtual VGA I can see it get this far:

image.png (303×1 px, 35 KB)

It's normal for the output to stop there, however what's unusual here is I see nothing on the virtual serial console, either before or after this stage. If I had to guess what's happening I'd say we've two problems at least:

  1. Virtual serial boot is not working / set up correctly
  2. The Debian installer is failing at some point and prompting for user input (hence it never completes), but we can't see what's happening.

Anyway from a network perspective all looks ok here, DHCP is working for this one and should also for the others.

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephmon1004.eqiad.wmnet to get a root shell but depending on the failure this may not work.
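
Comparing the step lists of the failed runs shows how far each one got: the earliest runs stop after the IPMI reboot, while this one reaches the Debian installer. A minimal sketch that extracts the host, status and last completed step from one report, assuming the report layout seen in this task; `parse_report` is a hypothetical helper and the final message in the sample is abbreviated.

```python
import re
from typing import List, Tuple

# One report as posted in this task (failure message abbreviated).
REPORT = """\
  • cloudcephmon1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details.
"""


def parse_report(text: str) -> Tuple[str, str, List[str]]:
    """Return (host, status, completed steps) for a single cookbook report."""
    # Drop blank lines, indentation and the leading bullet from every line.
    items = [ln.strip().lstrip("•").strip() for ln in text.splitlines() if ln.strip()]
    header = re.fullmatch(r"(\S+) \((PASS|FAIL)\)", items[0])
    if header is None:
        raise ValueError("unexpected report header: " + items[0])
    # The trailing failure notice is a message, not a completed step.
    steps = [s for s in items[1:] if not s.startswith("The reimage failed")]
    return header.group(1), header.group(2), steps


host, status, steps = parse_report(REPORT)
print(host, status, "last step:", steps[-1])
```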

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephmon1004.eqiad.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephmon1005.eqiad.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephmon1004.eqiad.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephmon1004.eqiad.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephmon1004.eqiad.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye completed:

  • cloudcephmon1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407251810_pt1979_2491206_cloudcephmon1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

@Jclark-ctr
Looking at 1004, I realized that COM2 was not set:

System BIOS

 System BIOS Settings > Serial Communication

 Serial Communication                                  <On without Console Redirection>
 Serial Port Address                                   <COM1>
 External Serial Connector                             <Serial Device 1>
 Failsafe Baud Rate                                    <115200>
 Remote Terminal Type                                  <VT100/VT220>
 Redirection After Boot                                <Enabled>

Running the reimage, it gets stuck at 33% for a long time until the cookbook times out and fails. I let the reimage finish on the server until it rebooted into the OS, then re-ran the cookbook with the --no-pxe flag.
Please check the console settings on the other servers and follow the steps above to do the install on them.

sudo cookbook sre.hosts.reimage  -t T364870  --os bullseye cloudcephmon1004 --new --no-pxe

Thanks
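
For the remaining hosts, the second (--no-pxe) run described above uses the same command with only the hostname changed. A minimal sketch that builds and prints those command lines without executing them; the flags and task ID are copied from the command quoted above, and `no_pxe_reimage_cmd` is a hypothetical helper.

```python
import shlex
from typing import List

TASK = "T364870"      # task ID from the command quoted above
OS_NAME = "bullseye"  # OS used throughout this task


def no_pxe_reimage_cmd(host: str) -> List[str]:
    """Build the argv for the follow-up --no-pxe run; nothing is executed."""
    return [
        "sudo", "cookbook", "sre.hosts.reimage",
        "-t", TASK,
        "--os", OS_NAME,
        host,
        "--new", "--no-pxe",
    ]


# Print the commands for review; run them by hand on a cumin host if appropriate.
for host in ("cloudcephmon1005", "cloudcephmon1006"):
    print(shlex.join(no_pxe_reimage_cmd(host)))
```

As described in the comment, this run only makes sense after the installer has finished on the server and it has rebooted into the new OS.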

Jclark-ctr updated the task description.

Change #1097390 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudcephmon1004: provision as mon

https://gerrit.wikimedia.org/r/1097390

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephmon1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye completed:

  • cloudcephmon1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261257_dcaro_560976_cloudcephmon1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1097390 merged by David Caro:

[operations/puppet@production] cloudcephmon1004: provision as mon

https://gerrit.wikimedia.org/r/1097390

Node up and running

Change #1098988 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[labs/private@master] Update cloudcephmon secrets

https://gerrit.wikimedia.org/r/1098988

Change #1098988 merged by Muehlenhoff:

[labs/private@master] Update cloudcephmon secrets

https://gerrit.wikimedia.org/r/1098988