
Q3:rack/setup/install dbprov100[56]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of dbprov100[56]
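The bracket shorthand in the title (dbprov100[56]) stands for two hosts. As a minimal sketch, assuming the brackets are read as a set of single-character alternatives (this is how the task uses them, not the range syntax of any particular tool), and with expand_hosts as a hypothetical helper name, the expansion can be written as:

```python
import itertools
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand single-character classes such as 'dbprov100[56]' into hostnames."""
    # re.split with a capture group alternates literal text and bracket contents.
    parts = re.split(r"\[([^\]]+)\]", pattern)
    # Even indexes are literals; odd indexes are the characters inside [...].
    choices = [list(part) if i % 2 else [part] for i, part in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_hosts("dbprov100[56]"))
print(expand_hosts("dbprov[12]00[56]"))
```

The second call uses the pattern from the puppet change title later in this task, which covers both the eqiad (1xxx) and codfw (2xxx) hosts.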

Hostname / Racking / Installation Details

Hostnames: dbprov1005 / dbprov1006
Racking Proposal: anywhere, as redundant as possible with respect to dbprov1003 & dbprov1004 (not in the same row, or at least not in the same rack, as these 2, so not C or D). They will replace dbprov1001 and dbprov1002.
Networking Setup: # of Connections:1/2 - Speed:1G/10G. - VLAN:Private/Public/Other(Specify) : AAAA records:Y/N, Additional IP records (Cassandra)? Yes/No
Partitioning/Raid: HW Raid: Y/N Create 2 logical disks: the first with the HDs in RAID 6, where the OS will be installed; the second with the SSDs in RAID 0. Partman recipe and/or desired Raid Level: db.cfg (this should just set up the HDs; the SSDs will be set up after puppet has run, don't worry about that).
OS Distro: Bullseye (sadly we cannot yet install bookworm until the dbs are upgraded)
Sub-team Technical Contact: Jaime first, otherwise anyone in data persistence (e.g. ask Kwaku)

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

dbprov1005:
  • Receive in system on procurement task T354213 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
dbprov1006:
  • Receive in system on procurement task T354213 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
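
The checklist is strictly ordered: the DNS and provision cookbooks must follow the Netbox script, and the reimage comes last. A small sketch of tracking that order per host follows; the STEPS wording is abbreviated from the checklist above, and next_step and the progress numbers are hypothetical, for illustration only.

```python
from typing import Dict, Optional

# Steps in the order given by the per-host checklist (wording abbreviated).
STEPS = [
    "Receive in system on procurement task and in Coupa",
    "Rack system and update Netbox",
    "Run the Provision a server's network attributes Netbox script",
    "Run the sre.dns.netbox cookbook",
    "Run the sre.hosts.provision cookbook",
    "Run the sre.hardware.upgrade-firmware cookbook",
    "Update the operations/puppet repo",
    "Run the sre.hosts.reimage cookbook",
]


def next_step(progress: Dict[str, int], host: str) -> Optional[str]:
    """Return the next unchecked step for a host, or None when all are done."""
    done = progress.get(host, 0)
    return STEPS[done] if done < len(STEPS) else None


# Hypothetical progress: number of steps already ticked off per host.
progress = {"dbprov1005": 3, "dbprov1006": 8}
for host in ("dbprov1005", "dbprov1006"):
    print(host, "->", next_step(progress, host) or "checklist complete")
```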

Details

Other Assignee
Jclark-ctr

Related Objects

Status: Resolved
Assigned: Jclark-ctr

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.
RobH renamed this task from Q#:rack/setup/install dbprov100[56] to Q3:rack/setup/install dbprov100[56]. Jan 18 2024, 7:23 PM
VRiley-WMF updated Other Assignee, added: Jclark-ctr.

dbprov1005
Rack A2
U 25
CableID 4905
Port 8

dbprov1006
Rack B2
U 24
CableID 4903
Port 17
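
The racking proposal asked for the new hosts to stay out of rows C and D. A quick sketch of that check against the placement above, assuming the first letter of the rack name is the row (A2 is row A); the function name is hypothetical:

```python
from typing import Dict, List, Set

# Rows to avoid, taken from the racking proposal ("so not C or D").
EXCLUDED_ROWS = {"C", "D"}

# Rack placement as reported in this task.
PLACEMENT = {"dbprov1005": "A2", "dbprov1006": "B2"}


def placement_problems(placement: Dict[str, str], excluded: Set[str]) -> List[str]:
    """List hosts whose rack row is one of the excluded rows."""
    problems = []
    for host, rack in placement.items():
        row = rack[0]  # assumption: the rack name starts with the row letter
        if row in excluded:
            problems.append(f"{host} is in excluded row {row} (rack {rack})")
    return problems


print(placement_problems(PLACEMENT, EXCLUDED_ROWS) or "placement satisfies the proposal")
```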

@jcrespo We are at the point to image. Would you be able to assist for updating Puppet?

@jcrespo are the RAID instructions backwards? The OS is usually on the SSDs in RAID 0.

HW Raid: Y/N Create 2 logical disks: the first with the HDs in RAID 6, where the OS will be installed; the second with the SSDs in RAID 0. Partman recipe and/or desired Raid Level: db.cfg (this should just set up the HDs; the SSDs will be set up after puppet has run, don't worry about that).

Sure, let me know what I can do for you, and I will get it done next week. The only custom part compared to other hosts is the HW RAID partitioning; the rest is a rather standard db-like install. If it is only adding it to netconf and the default role (staging), it shouldn't take me long, but I will get it done on Monday, as I have already finished my work week.

The OS instructions are correct: we need the redundancy of RAID 6 for the important data (and also the OS), but we need the SSDs kept aside, in a non-redundant RAID level, for a high volume of temporary writes. It doesn't matter that the OS takes more time to boot; performance isn't required there. We need 1+ TB of SSD space so the database backups fit before compression, so the OS "doesn't fit". Buying more SSDs wouldn't work, because we also need the large space of the HDs for the important data. If we installed the OS on the SSDs, we wouldn't get any redundancy if one disk fails.

Other than that, it would be the same config as a db, but the root partition will be on the HDs/RAID 6 (I will set up the SSDs on my own later after a puppet run).

Edit: If there is any doubt, all the other dbprov hosts have been set up like this: RAID 6 of HDs with / (OS), swap and /srv (important backup data); and RAID 0 of SSDs with /srv/backups/dumps/ongoing (losable data, lots of writes, high speed). You don't have to worry about setting up the SSD partition on install; I will do it, I just need the HW set up.
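
To make the trade-off concrete, here is a rough sketch of the two logical disks. The mount points come from the comment above; the disk counts and sizes are hypothetical placeholders (the task does not state them), and raid_usable is an illustrative helper, not part of any tool used here. RAID 6 gives up two disks' worth of capacity to survive two disk failures, while RAID 0 uses every disk but survives none.

```python
from typing import Tuple


def raid_usable(level: int, disks: int, disk_tb: float) -> Tuple[float, int]:
    """Return (usable capacity in TB, number of disk failures survived)."""
    if level == 0:
        # Striping only: full capacity, no redundancy.
        return disks * disk_tb, 0
    if level == 6:
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        # Two disks' worth of space is used for parity.
        return (disks - 2) * disk_tb, 2
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical hardware: disk counts and sizes are NOT taken from this task.
LAYOUT = [
    {"name": "HDs", "level": 6, "disks": 12, "disk_tb": 8.0,
     "mounts": ["/", "swap", "/srv"]},
    {"name": "SSDs", "level": 0, "disks": 2, "disk_tb": 1.0,
     "mounts": ["/srv/backups/dumps/ongoing"]},
]

for vd in LAYOUT:
    usable, survives = raid_usable(vd["level"], vd["disks"], vd["disk_tb"])
    print(f"{vd['name']}: RAID {vd['level']}, {usable} TB usable, "
          f"survives {survives} disk failure(s), mounts: {', '.join(vd['mounts'])}")
```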

Change 1012684 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] dbprov updates

https://gerrit.wikimedia.org/r/1012684

Change 1012684 abandoned by RobH:

[operations/puppet@production] dbprov updates

Reason:

Jaime is working on this, abandoning my patchset

https://gerrit.wikimedia.org/r/1012684

My patchset had mistakes, and @jcrespo has advised he is working on these patchsets. As such, I've abandoned my patchset.

RobH added a subscriber: VRiley-WMF.

This installation is blocked until the patchsets to allow installation are complete. I've moved the assignment from @VRiley-WMF to @jcrespo, but once that is complete please assign it back to them.

Change 1012684 restored by Jcrespo:

[operations/puppet@production] dbprov updates

https://gerrit.wikimedia.org/r/1012684

I will take care of it, as I discussed previously with John, but to avoid future mistakes, @RobH, is there a way to communicate the desired recipe more clearly? I wrote: "Partman recipe and/or desired Raid Level: db.cfg", and I can see that being confusing because of the hostname. I'm trying to make your team's workflow easier in any way I can. Any tips?

I think this is due to a miscommunication on my part, which followed with more confusion in the racking task. I should have communicated these new steps in the ordering task.

As many of the DC ops members do not have the ability to merge puppet repo changes, we're now asking all sub-teams to update the puppet repo for the installation steps of the hosts in advance of the host arrivals, which is mainly partition/preseed/role/site.pp. This was new this quarter, and these kinds of miscommunication were to be expected. Apologies to all involved. Since this is a newly introduced shuffling of steps, it wasn't explained clearly enough in my ordering task to DB team and resulted in it not getting done in advance of the landing of the hosts. (That is why I just stepped in when Jenn pinged me and started to fix the site.pp and such, but had the partitioning incorrect.)

No, RobH, this is not your fault. I delayed doing it because of emergencies, then the offsite, and then rest hours/vacations. Now that I know the expectation better, I could have prepared this weeks in advance. About to merge it.

Change 1012684 merged by Jcrespo:

[operations/puppet@production] installserver: Update dbprov for reimage of dbprov[12]00[56]

https://gerrit.wikimedia.org/r/1012684

This should unblock both the eqiad and the codfw tasks, unless there is an unexpected bug, but the overall idea should be there.

I hope this is the right assignment, but I'm not sure.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change #1015082 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Configure dbprov1005/1006 for Puppet 7 like for dbprov2005/2006

https://gerrit.wikimedia.org/r/1015082

Change #1015082 merged by Papaul:

[operations/puppet@production] Configure dbprov1005/1006 for Puppet 7 like for dbprov2005/2006

https://gerrit.wikimedia.org/r/1015082

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1005.eqiad.wmnet with OS bullseye completed:

  • dbprov1005 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404011514_jclark_83078_dbprov1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

hi, we cannot ssh into dbprov1006.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed with errors:

  • dbprov1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbprov1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye completed:

  • dbprov1006 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404151815_jclark_2807217_dbprov1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Replaced the DAC cable and reimaged. @jcrespo, it looks like that resolved the issue.
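
For anyone comparing the failed runs above, the point at which each one stopped is the useful signal. A small sketch, assuming the step lists are copied by hand from the cookbook summaries in this task (last_completed_step is a hypothetical helper, not part of the cookbook):

```python
from typing import List, Optional

FAILURE_PREFIX = "The reimage failed"


def last_completed_step(steps: List[str]) -> Optional[str]:
    """Return the last step reported before the failure line, if any."""
    completed = [s for s in steps if not s.startswith(FAILURE_PREFIX)]
    return completed[-1] if completed else None


# Two dbprov1006 runs, abbreviated from the summaries in this task.
early_run = [
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Add puppet_version metadata to Debian installer",
    "Checked BIOS boot parameters are back to normal",
    "The reimage failed, see the cookbook logs for the details.",
]
later_run = [
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "The reimage failed, see the cookbook logs for the details.",
]

print("early run stopped after:", last_completed_step(early_run))
print("later run stopped after:", last_completed_step(later_run))
```

The summaries alone do not say why a run stopped; the cookbook logs remain the place to look for the cause.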

Jclark-ctr claimed this task.

Thank you a lot, to everybody!