Q1:rack/setup/install gerrit2003
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of gerrit2003.

Hostname / Racking / Installation Details

Hostname: gerrit2003.codfw.wmnet (the next gerrit name, but with a private IP instead of a public one, please. No update to the naming conventions is needed for now; we might rename it later and update them then.)

Racking proposal: D - D8, where gerrit2002 is currently
Networking Setup: speed: 10G default, 1G would be acceptable, VLAN: private (this is different from host it replaces), AAAA records: yes, Additional IPs: N
Partitioning/RAID: raid1-2dev (just needs to be added to the existing gerrit partman regex)
OS: bookworm
Sub-team: collaboration services, @Dzahn

Per host setup checklist

gerrit2003
  • Receive in system on procurement task T368917 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
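
The checklist above is strictly ordered: the DNS and provision cookbooks follow the Netbox script immediately, and the reimage comes last. As a minimal sketch, the ordering could be tracked as below. The cookbook names are taken from the checklist; the other step labels and the `next_step` helper are hypothetical and only for illustration, not part of any Wikimedia tooling.

```python
# Ordered setup steps for a new host. Cookbook names come from the
# checklist above; the remaining labels are hypothetical shorthand.
CHECKLIST = [
    "receive",
    "rack-and-netbox",
    "netbox-provision-network-script",
    "sre.dns.netbox",
    "sre.hosts.provision",
    "sre.hardware.upgrade-firmware",
    "puppet-repo-update",
    "sre.hosts.reimage",
]


def next_step(completed):
    """Return the first checklist step not yet completed, or None when all are done."""
    for step in CHECKLIST:
        if step not in completed:
            return step
    return None
```

For example, once the Netbox script has run, `next_step(CHECKLIST[:3])` returns `"sre.dns.netbox"`, matching the "immediately run" note in the checklist.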

Event Timeline

RobH added a parent task: Unknown Object (Task). Jul 9 2024, 10:19 PM
RobH mentioned this in Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH unsubscribed.
Jhancock.wm subscribed.

@Dzahn heads up, rack D8 is currently a 1G rack, but we are in the process of upgrading that whole row to 10G. So it would be temporary.

@Jhancock.wm is it possible to get this on a 10G rack? If not, it's OK. Thanks

@Papaul now that the C/D lsw's are live-ish we can move it over to 10G. I can swap the cable if you can update the link.

@Jhancock.wm yes i can take care of the link. Thanks

@Dzahn could you update the puppet repo for us when you have a moment? thanks in advance!

@Jhancock.wm I set up the node to use xe-0/0/39

papaul@lsw1-d8-codfw# run show interfaces xe-0/0/39 descriptions 
Interface       Admin Link Description
xe-0/0/39       up    up   gerrit2003

Change #1057253 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add new hardware gerrit2003 with insetup role

https://gerrit.wikimedia.org/r/1057253

Change #1057253 merged by Dzahn:

[operations/puppet@production] site: add new hardware gerrit2003 with insetup role

https://gerrit.wikimedia.org/r/1057253

Hi @Jhancock.wm sorry for the delay. Done!

I added it to site.pp in puppet.

A partman change was not needed, since the existing regex already covers this new hostname and the host is supposed to be set up just like the existing servers with 2 drives.
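
To illustrate why no partman change was needed, here is a minimal sketch of how a hostname pattern can already cover a newly added host. The pattern below is an assumption made for illustration and is not copied from preseed.yaml in the puppet repo; only the recipe name raid1-2dev comes from the task description.

```python
import re
from typing import Optional

# Hypothetical pattern, NOT the real preseed.yaml entry: any gerrit host
# whose number starts with 1 or 2 and has four digits in total.
GERRIT_PATTERN = re.compile(r"gerrit[12]\d{3}")

# Recipe name taken from the task description.
RECIPE = "raid1-2dev"


def partman_recipe(hostname: str) -> Optional[str]:
    """Return the partman recipe for a host, or None if the pattern does not match."""
    short_name = hostname.split(".", 1)[0]  # drop the domain part, if any
    if GERRIT_PATTERN.fullmatch(short_name):
        return RECIPE
    return None
```

With a pattern of this shape, `partman_recipe("gerrit2003.codfw.wmnet")` returns the same recipe as for gerrit2002, so adding the new host requires no edit.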

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm executed with errors:

  • gerrit2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" gerrit2003.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm executed with errors:

  • gerrit2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" gerrit2003.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.wikimedia.org with OS bookworm completed:

  • gerrit2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291618_pt1979_2903405_gerrit2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Papaul updated the task description.

@Dzahn all yours

Thanks @Papaul!

note from today's collab team meeting:

We defined codfw as the home for gerrit and eqiad as the home for phab/phorge. So that means this host stays gerrit2003 but we have to change gerrit1004 from T369671 to be a phab host with private IP.