Q#:rack/setup/install an-redacteddb1001
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of an-redacteddb1001

Hostname / Racking / Installation Details

Hostnames: an-redacteddb1001 (That's the best I could think of)
Racking Proposal: Anywhere within eqiad A-F
Networking Setup: # of Connections:1/2 - Speed:1G/10G. - VLAN:Private/Public/Other(Specify) : AAAA records:Y/N, Additional IP records (Cassandra)? Yes/No
Partitioning/Raid: HW Raid: Y/N, Partman recipe and/or desired Raid Level: Hardware RAID 10 with partman/custom/db.cfg
OS Distro: Bookworm (the template default is Bullseye unless otherwise specified)
Sub-team Technical Contact: Who should our on-sites contact with any questions involving system racking and setup? @BTullis

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-redacteddb1001:
  • Receive in system on procurement task T354452 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

an-redacteddb1001
Rack D2
U25
Port 28
CableID 5365

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bullseye

Jclark-ctr subscribed.

The firmware cookbook seems to be broken. Manually downgraded the NIC firmware on the server.

This comment was removed by Jclark-ctr.

Change 1005560 had a related patch set uploaded (by Jclark-ctr; author: jclark):

[operations/puppet@production] add an-redacteddb1001 to site.pp

https://gerrit.wikimedia.org/r/1005560

Change 1005560 merged by Jclark-ctr:

[operations/puppet@production] add an-redacteddb1001 to site.pp

https://gerrit.wikimedia.org/r/1005560

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bullseye

Jclark-ctr added a subscriber: VRiley-WMF.

@BTullis this is a custom configuration and I am not having any luck imaging the 20-disk RAID 10. If you can assist, thank you.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bullseye executed with errors:

  • an-redacteddb1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" an-redacteddb1001.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change 1006052 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a partition recipe for an-redacteddb

https://gerrit.wikimedia.org/r/1006052

Change 1006052 merged by Btullis:

[operations/puppet@production] Add a partition recipe for an-redacteddb

https://gerrit.wikimedia.org/r/1006052

Change 1006486 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Install an-redacteddb1001 with puppet 7

https://gerrit.wikimedia.org/r/1006486

Change 1006486 abandoned by Btullis:

[operations/puppet@production] Install an-redacteddb1001 with puppet 7

Reason:

Unnecessary after all.

https://gerrit.wikimedia.org/r/1006486

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bookworm executed with errors:

  • an-redacteddb1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" an-redacteddb1001.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
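
The report above lists the completed steps in order, so the last step before the failure notice shows how far the run got. A minimal Python sketch of reading that off follows. The function name `summarize_report` is hypothetical and the parsing only assumes the bullet layout seen in this task; it is not part of the cookbook or of Spicerack.

```python
def summarize_report(lines):
    """Return (host, status, last completed step) from a pasted reimage report.

    Assumes the layout seen in this task: a "host (PASS|FAIL)" bullet
    followed by one bullet per completed step.
    """
    host, status, steps = None, None, []
    for raw in lines:
        # Drop indentation and the leading bullet character.
        text = raw.strip().lstrip("•").strip()
        if not text:
            continue
        if text.endswith("(FAIL)") or text.endswith("(PASS)"):
            host, _, tag = text.rpartition(" ")
            status = tag.strip("()")
        elif text.startswith("The reimage failed"):
            # The failure notice is not a completed step.
            continue
        else:
            steps.append(text)
    return host, status, (steps[-1] if steps else None)


report = [
    "  • an-redacteddb1001 (FAIL)",
    "    • Removed from Puppet and PuppetDB if present and deleted any certificates",
    "    • Removed from Debmonitor if present",
    "    • Forced PXE for next reboot",
    "    • Host rebooted via IPMI",
    "    • Host up (Debian installer)",
    "    • Add puppet_version metadata to Debian installer",
    "    • Checked BIOS boot parameters are back to normal",
    "    • The reimage failed, see the cookbook logs for the details",
]

print(summarize_report(report))
```

For this report the last completed step is "Checked BIOS boot parameters are back to normal", which suggests the run stopped before the freshly installed OS came up. The cookbook logs remain the authoritative source for the cause.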

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bookworm executed with errors:

  • an-redacteddb1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402261207_btullis_2185392_an-redacteddb1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" an-redacteddb1001.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bookworm completed:

  • an-redacteddb1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402262257_btullis_2300335_an-redacteddb1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bookworm executed with errors:

  • an-redacteddb1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402262257_btullis_2300335_an-redacteddb1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" an-redacteddb1001.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change 1006600 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the contacts for an-redacteddb1001

https://gerrit.wikimedia.org/r/1006600

Change 1006600 merged by Btullis:

[operations/puppet@production] Update the contacts for an-redacteddb1001

https://gerrit.wikimedia.org/r/1006600

@Jclark-ctr - This is all done now, I believe. I had to change one BIOS setting to boot from the hard drive, but nothing else. OK for me to resolve?

@BTullis what is the status of this?
I can see the host is up, but not yet provisioned?

root@an-redacteddb1001:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   9.1T   65G  9.1T   1% /srv
root@an-redacteddb1001:~# pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda3  tank lvm2 a--  <17.42t 8.32t
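
The numbers in the output above can be cross-checked with a little arithmetic. The sketch below is only a sanity check, under three assumptions: that `pvs` and `df -h` both report binary units (TiB), that the RAID 10 array exposes half of its raw capacity, and that the 20 disks mentioned earlier in this task all sit in that one array. The per-disk size is an estimate, not a figure stated anywhere in the task.

```python
pv_size_tib = 17.42  # PSize from pvs (shown as "<17.42t", so slightly under)
pv_free_tib = 8.32   # PFree from pvs
disks = 20           # from the "20 disk" RAID 10 comment in this task

# Space already handed out to logical volumes in the "tank" volume group.
allocated_tib = pv_size_tib - pv_free_tib

# RAID 10 mirrors every stripe member, so raw capacity is about twice usable.
raw_tib = pv_size_tib * 2
per_disk_tib = raw_tib / disks
per_disk_tb = per_disk_tib * 2**40 / 1e12  # convert TiB to decimal TB

print(f"allocated to LVs: {allocated_tib:.2f} TiB")
print(f"implied per-disk size: {per_disk_tib:.2f} TiB (about {per_disk_tb:.2f} TB)")
```

The allocated figure of roughly 9.1 TiB matches the size of /srv reported by `df`, and about 8.3 TiB of the volume group is still unallocated. The per-disk estimate is slightly low because /dev/sda3 is only one partition of the array.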

clouddb1021 is now at 81%, so maybe we need to start getting this host provisioned. Considering there are 12TB to transfer, it might take a while :-)
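
For a rough sense of "a while", here is a back-of-the-envelope estimate for moving 12TB. The link speed is an assumption: the networking line in the description was never filled in, so both 1 Gbit/s and 10 Gbit/s are shown. The `efficiency` parameter is a hypothetical knob for protocol and disk overhead. It defaults to 1.0, which gives a lower bound on the time.

```python
def transfer_hours(size_tb, link_gbps, efficiency=1.0):
    """Hours needed to move size_tb decimal terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for gbps in (1, 10):
    print(f"{gbps:>2} Gbit/s: {transfer_hours(12, gbps):.1f} h")
```

At full line rate that is a little over a day at 1 Gbit/s and under three hours at 10 Gbit/s. Real transfers will be slower once compression, disk throughput, and any throttling are taken into account.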