
Q3:rack/setup/install rdb101[56]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of rdb101[56].

Hostname / Racking / Installation Details

Hostnames: rdb101[56]
Racking Proposal: Two different rows excluding B and D
Networking Setup: # of Connections: 1 - Speed: 10G - VLAN: Private
OS Distro: Trixie
Boot Method: UEFI.
Sub-team Technical Contact: @Clement_Goubert
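
The bracket shorthand in the hostname field can be expanded mechanically. The following is a minimal sketch, not part of any Wikimedia tooling: `expand_hosts` is a hypothetical helper, and it assumes that a bracket group without a dash lists single-character alternatives (`[56]` means 5 or 6) while a group with a dash is an inclusive numeric range.

```python
import re
from itertools import product


def expand_hosts(pattern: str) -> list[str]:
    """Expand bracket groups such as 'rdb101[56]' into individual hostnames.

    Assumption (not confirmed by the task): a group containing a dash is an
    inclusive numeric range, otherwise each character is one alternative.
    """
    # re.split with a capturing group alternates literal text and group bodies.
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            choices.append([part])  # literal text outside the brackets
        elif "-" in part:
            low, high = part.split("-")
            width = len(low)  # keep any zero padding of the lower bound
            choices.append(
                [str(n).zfill(width) for n in range(int(low), int(high) + 1)]
            )
        else:
            choices.append(list(part))  # one alternative per character
    return ["".join(combo) for combo in product(*choices)]


print(expand_hosts("rdb101[56]"))
```

For this task the expansion yields the two hosts tracked in the checklist below, rdb1015 and rdb1016.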

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

rdb1015:
  • Receive in system on procurement task T418292 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
rdb1016:
  • Receive in system on procurement task T418292 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

@Clement_Goubert,

I have made some assumptions in the racking details. The parent order is already in approvals; this task isn't blocking it. I've filed this to ensure it isn't forgotten. Please double-check my hostname and other assumptions above and clarify racking restrictions (if any), in addition to the boilerplate:

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself from this task (leaving no assignee); once the hardware arrives, on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

Change #1247977 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Add new rdb101[56] hosts

https://gerrit.wikimedia.org/r/1247977

Updated task description with racking details and OS. Waiting for review on the puppet patch.

Change #1247989 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Add new wikikube-worker23[57-74]

https://gerrit.wikimedia.org/r/1247989

Change #1247977 merged by Clément Goubert:

[operations/puppet@production] Add new rdb101[56] hosts

https://gerrit.wikimedia.org/r/1247977

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host rdb1015.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host rdb1016.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host rdb1016.eqiad.wmnet with OS trixie executed with errors:

  • rdb1016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console rdb1016.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
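
When several reimage reports pile up on a task, the per-host outcome and the last step reached can be pulled out with a short script. This is a minimal sketch, assuming the report keeps the bullet layout pasted above (a `host (PASS|FAIL)` line followed by step lines). `summarize` is a hypothetical helper, not part of the cookbook tooling, and the sample text is an excerpt of the report above.

```python
import re

# Excerpt of the report above, used only as sample input.
REPORT = """\
  • rdb1016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
"""

# A host line is a bullet, one token, then the status in parentheses.
HOST_LINE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL)\)\s*$")
# Any other bulleted line is treated as a step of the current host.
STEP_LINE = re.compile(r"^\s*•\s+(.*\S)\s*$")


def summarize(report: str) -> dict[str, dict]:
    """Map each host to its status and the ordered list of reported steps."""
    summary: dict[str, dict] = {}
    current = None
    for line in report.splitlines():
        host = HOST_LINE.match(line)
        if host:
            current = {"status": host.group(2), "steps": []}
            summary[host.group(1)] = current
            continue
        step = STEP_LINE.match(line)
        if step and current is not None:
            current["steps"].append(step.group(1))
    return summary


result = summarize(REPORT)
print(result["rdb1016"]["status"], "- last step:", result["rdb1016"]["steps"][-1])
```

Here the last step reached is "Host up (Debian installer)", which points at the installer stage rather than at the network boot.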

@Clement_Goubert I am having issues with these hosts failing to image; this is the error on the console. They might be missing a partman configuration.

Error while setting up RAID
An unexpected error occurred while setting up a preseeded RAID configuration.
Check /var/log/syslog or see virtual console 4 for the details.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host rdb1015.eqiad.wmnet with OS trixie executed with errors:

  • rdb1015 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console rdb1015.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

@jijiki could you please help on this while Clement is out?

I can take care of it next week and update. @Jclark-ctr thank you!

@Jclark-ctr I am afraid this is still in the queue. I will ping you when I have something, thanks!

Jclark-ctr reassigned this task from Effib to jijiki.
Jclark-ctr reassigned this task from jijiki to Clement_Goubert.
Jclark-ctr added a subscriber: Effib.
Jclark-ctr subscribed.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host rdb1015.eqiad.wmnet with OS trixie

Change #1289448 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Fix partman to use standard-efi for new rdb nodes

https://gerrit.wikimedia.org/r/1289448

Change #1289448 merged by Papaul:

[operations/puppet@production] Fix partman to use standard-efi for new rdb nodes

https://gerrit.wikimedia.org/r/1289448

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host rdb1015.eqiad.wmnet with OS trixie completed:

  • rdb1015 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605200054_pt1979_1743822_rdb1015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

@Jclark-ctr The partman config is fixed and rdb1015 is done; you can install rdb1016. Thanks.

@MLechvien-WMF I believe so. It looks like @Papaul noticed the missing part in Puppet and updated both.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host rdb1016.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host rdb1016.eqiad.wmnet with OS trixie completed:

  • rdb1016 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605201123_jclark_1972713_rdb1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Jclark-ctr updated the task description.