
Q3: rack/setup/install dumpsdata100[67]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of dumpsdata100[67].

Please note this is also the first order with the PERC H750 controller, so support testing of this will be required/checked and updated back to T297913 during installation.

Hostname / Racking / Installation Details

Hostnames: dumpsdata100[67]
Racking Proposal: Prefer E & F, but if those racks aren't available by the time the servers have arrived, try to avoid more than 2 dumpsdata hosts (new and current) in a rack.
Networking/Subnet/VLAN/IP: Internal vlan, 10G ports if possible, otherwise split one 10G port between them as was done for dumpsdata100[45].
Partitioning/Raid: Let's see which configuration we order first before I can say which partman recipe we need.
OS Distro: Buster

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

dumpsdata1006
  • - receive in system on procurement task T297151 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
  • - update T297913 with results of pass/fail of PERC H750 controller
dumpsdata1007
  • - receive in system on procurement task T297151 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook. - failing due to puppet failure on megacli monitoring commands, needs update to monitoring
  • - update T297913 with results of pass/fail of PERC H750 controller

    == post install puppet run failure due to raid monitoring ==

T297913#8038261

In T297913#8038261, RobH wrote:
In T297913#8038091, MoritzMuehlenhoff wrote:
In T297913#8038074, RobH wrote:

So post dumpsdata1007 install it fails puppet due to megaraid monitoring items it seems?

That's expected, we still need to adapt the "raid" fact in Puppet so that it installs perccli (but for that we needed a running system with Perc controller, so that we can figure out the device names which allow Puppet to detect the controller). Just leave the system in that state and we'll use dumpsdata1007 for that?

Works for me, I'll put this comment reference on the setup task there. Thanks!

Event Timeline


Change 767602 merged by RobH:

[operations/puppet@production] dumpsdata1007 raid testing

https://gerrit.wikimedia.org/r/767602

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

cookbook sre.hosts.provision fails for dumpsdata1006. Please check its mgmt cable and attempt to rerun.

dumpsdata1006 is now ready for install for partman testing, but it's failing DHCP.

I see it hit DHCP on install1003 and get an offer back, but the host times out and doesn't PXE boot.

Mar 17 16:52:23 install1003 dhcpd[12541]: DHCPDISCOVER from e4:3d:1a:ae:59:c8 via 10.64.130.1
Mar 17 16:52:23 install1003 dhcpd[12541]: DHCPOFFER on 10.64.130.3 to e4:3d:1a:ae:59:c8 via 10.64.130.1
PXE-E51: No DHCP or proxyDHCP offers were received.

I'm likely going to need someone in netops to assist in tracking why dhcp requests aren't being passed back to the host.
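For anyone retracing this later, a minimal sketch of how the dhcpd log could be checked for hosts that are offered a lease but never follow up with a DHCPREQUEST, which is the pattern seen above. The function names are hypothetical and the regex only assumes the ISC dhcpd line shapes shown in the two log lines above; it is not part of any existing tooling.

```python
import re

# Matches a DHCP message type followed (anywhere later on the line) by a MAC.
# Based only on the dhcpd line format shown in the log excerpt above.
LINE_RE = re.compile(r"(DHCP[A-Z]+)\b.*?\b((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\b")


def dhcp_messages_by_mac(lines):
    """Group the DHCP message types seen in the log by client MAC address."""
    seen = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            seen.setdefault(match.group(2), []).append(match.group(1))
    return seen


def stalled_after_offer(lines):
    """Return MACs that were sent a DHCPOFFER but never sent a DHCPREQUEST."""
    return sorted(
        mac
        for mac, msgs in dhcp_messages_by_mac(lines).items()
        if "DHCPOFFER" in msgs and "DHCPREQUEST" not in msgs
    )


sample = [
    "Mar 17 16:52:23 install1003 dhcpd[12541]: DHCPDISCOVER from e4:3d:1a:ae:59:c8 via 10.64.130.1",
    "Mar 17 16:52:23 install1003 dhcpd[12541]: DHCPOFFER on 10.64.130.3 to e4:3d:1a:ae:59:c8 via 10.64.130.1",
]
print(stalled_after_offer(sample))  # ['e4:3d:1a:ae:59:c8']
```

A MAC showing up in that list only says the offer left the server without a reply; it does not say where on the return path the offer was lost.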

@Jclark-ctr moved the DAC cable to the correct port, these should work now. I will image shortly

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS buster

Please note the partman part will fail because the raid controller reorders the disk array numbers and puts the SSDs as sdb. This was failing PXE for me, so I couldn't start partman testing.

Once this passes PXE it can come back to me for partman testing!

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS buster executed with errors:

  • dumpsdata1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS buster executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

FYI I believe PXE is failing for dumpsdata1006 as the DAC cable is plugged into the second NIC port on the server side.

RobH raised the priority of this task from High to Unbreak Now!. Apr 11 2022, 5:23 PM
RobH updated Other Assignee, added: Jclark-ctr; removed: RobH.

FYI I believe PXE is failing for dumpsdata1006 as the DAC cable is plugged into the second NIC port on the server side.

It wasn't but they moved it and now both dumpsdata100[67] have the second port plugged in, not the first.

It was correct before the move. Both systems need to have the DAC plugged back into port 1. See attached screen shots:

Screen Shot 2022-04-11 at 10.22.15 AM.png (1×2 px, 217 KB)

Screen Shot 2022-04-11 at 10.22.22 AM.png (1×2 px, 200 KB)

This has now broken my testing of raid on dumpsdata1007, as it's no longer online on the network. Please move both of these hosts back to port 1 on the NIC and ping me when dumpsdata1007 is done.

RobH lowered the priority of this task from Unbreak Now! to Medium. Edited Apr 11 2022, 5:34 PM

I worked around the issue via idrac, piping the output to a text file to make up for the idrac serial screen not showing the full output without the ability to page up.

Unfortunately, the potential solution Dell sent for me to test did not fix the issue with setting a disk to missing, so the workaround above didn't result in a fix for our base issue of perccli command use on dumpsdata1007 in the OS.

Setting back to medium priority and these are the next onsite steps:

  • move dumpsdata1007's dac cable back to port 1 on the 10G nic as it was working
  • move dumpsdata1006's dac cable back to port 1, as it's now on port 2.
    • continue investigating why dumpsdata1006 is not able to PXE boot.

@RobH Both host dac cables have been corrected

moved dumpsdata1007's dac cable back to port 1
moved dumpsdata1006's dac cable back to port 1

@RobH where are you with testing these? Just wondering if we can try and get these imaged and off the workboard.

Cmjohnson added a subscriber: Cmjohnson.

@RobH assigning to you until ready to pass back

Sorry about this!

I'll take back over dumpsdata1006, but we'll need to modify partman recipes for this to work so this particular task will stay open a bit longer yet.

RobH renamed this task from Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] to Q3: rack/setup/install dumpsdata100[67]. May 5 2022, 6:10 PM
RobH updated Other Assignee, removed: Jclark-ctr.
RobH updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Something is very wrong with dumpsdata1006: when I go to set it up, it doesn't see a 10G NIC, only the 1G. Rather than pollute this setup task, if I cannot solve it quickly I'll create a high priority hw error task for investigation and link to this.

Ok, I updated the bios and then foolishly updated idrac, and now https implementation is broken for idrac.

Screen Shot 2022-06-29 at 12.25.15 PM.png (2×3 px, 866 KB)

@Jclark-ctr (or @Cmjohnson) can one of you crash cart this into the lifecycle controller (not accessible via serial redirection and https is down so I cannot do it) and rollback the last idrac update I just rolled in to the older version?

Summarizing yesterday's work:

  • Rob updated the BIOS (to latest) and the idrac (one step below latest, latest breaks https idrac interface) - NIC still doesn't detect properly, showing onboard 1GB only and a 10G port, so something is wonky.
  • John confirmed both dumpsdata100[67] have identical NICs
    • So dumpsdata1006 showing the wrong nic in bios is an issue, as we cannot set NIC flags and it's overall non-ideal
  • Since John popped the case, there is now a cable error on the system. I'm guessing the case top removal unseated a cable that wasn't quite seated properly in factory.
    • Error: The storage BP1 SAS A2 cable is not connected, or is improperly connected.

@Jclark-ctr: Please fix the cable issue on dumpsdata1006 so I can open a support case for the NIC issue and not have the error report include a cable issue. If we include multiple issues, we know support will totally delay addressing the NIC issue while they demand we fix the cable issue first.

John fixed it, just pinged me in IRC. So I'll steal this back and open a case for the NIC issue.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye completed:

  • dumpsdata1007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208181900_robh_1293524_dumpsdata1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
RobH updated Other Assignee, added: Jclark-ctr.

When attempting to install dumpsdata1006, the NIC is not detecting correctly. In the bios options, it won't show it under additional devices, when the exact same configuration on dumpsdata1007 shows the NIC ports under other options in bios (under the raid controller)

Can someone onsite unseat and reseat the nic to see if that fixes the issue? This has a 10G NIC.

If that doesn't have it showing up properly in bios, we'll need to open a support case.

@Jclark-ctr Can you try reseating the nic if that is possible

@RobH this is okay now after @Jclark-ctr reseated the NIC

Hello, just checking in to find out what is going on with the OS installation on dumpsdata1006, and please, when will it be ready for use?

Hey @RobH I am preemptively assuming that dumpsdata1007 is good to go for us to use, since it's got the role ("insetup::core_platform") and everything in the checklist is marked off. Basically we need these boxes, at least the one, tell me if it's not ready to be ours yet and I'll stop mucking about with it.

Also, what's happening with 1006? We can wait a bit for that one but it would be nice to have an ETA.

Thanks in advance etc...

RobH changed the task status from Open to In Progress. Mon, Feb 27, 2:17 PM
RobH claimed this task.

If I have an overwhelming number of notifications in a short period (seems I did around January 18th) I may miss direct pings where the task isn't assigned to me. Accidental of course!

I noticed today's ping however, so I'll steal this task back since John reseated the NIC and work on reimage.

Awesome, I would have looked for you on irc in a few days if I hadn't heard anything, no worries. Happy to see this moving along!

LVM data still exists on the disks from a previous failed install attempt, and the dd method didn't seem to remove it. I suspended installation on dumpsdata1006 and set it to fully initialize all the disks, so I'll set a timer and come back to this in about 2 hours.

Virtual Disk 238: RAID10, 43.661TB, Ready, Initialization 6%                  
Virtual Disk 239: RAID1, 446.625GB, Ready

This is going to take a while for the larger volume, but when it's done the reimage script should work, as there won't be any LVM data left on either array.

Virtual Disk 238: RAID10, 43.661TB, Ready, Initialization 42%
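A minimal sketch of how status lines like the ones pasted above could be parsed to see which virtual disks are still initializing before kicking off another reimage. The function names are hypothetical, and the regex assumes exactly the line format shown in this comment; other controllers or firmware versions may print something different.

```python
import re

# Pattern built from the two status lines pasted above; nothing else is assumed.
VD_RE = re.compile(
    r"Virtual Disk (?P<id>\d+): (?P<level>RAID\d+), "
    r"(?P<size>[\d.]+)(?P<unit>TB|GB), (?P<state>\w+)"
    r"(?:, Initialization (?P<pct>\d+)%)?"
)


def parse_virtual_disks(text):
    """Return one dict per virtual disk status line found in text."""
    disks = []
    for match in VD_RE.finditer(text):
        pct = match.group("pct")
        disks.append(
            {
                "id": int(match.group("id")),
                "level": match.group("level"),
                "size": float(match.group("size")),
                "unit": match.group("unit"),
                "state": match.group("state"),
                # None means the line carried no initialization progress.
                "init_pct": int(pct) if pct is not None else None,
            }
        )
    return disks


def still_initializing(disks):
    """IDs of virtual disks that report initialization progress below 100%."""
    return [d["id"] for d in disks if d["init_pct"] is not None and d["init_pct"] < 100]


status = """Virtual Disk 238: RAID10, 43.661TB, Ready, Initialization 6%
Virtual Disk 239: RAID1, 446.625GB, Ready
"""
print(still_initializing(parse_virtual_disks(status)))  # [238]
```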

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

  • dumpsdata1007 has ssds raid1 virtual drive id 238, hdds raid10 virtual drive id 239, installs fine.
  • dumpsdata1006 issues:
  • clear config, reboot, setup raid1 ssd, reboot, setup raid10 hdd, reboot, still sets ssd to 239 and hdds to 238, hdds detect first in os
  • clear config, reboot, setup raid1 ssd in gui, reboot, setup raid10 in gui, reboot, and they show up same with SSDs having ID 239.
    • manually rename them so it shows 238, but the raid controller ID remains the same when cosmetic name is changed.
  • both hosts have identical raid firmware, but 1006 has higher incremented bios and idrac (shouldn't matter for raid but listing it as difference)

I'm not sure how we got dumpsdata1007 set up like we did, and I cannot recreate it for dumpsdata1006.

I'm a bit frustrated and have spent about two dozen reboot cycles trying to set up the above iterations and test them in both the CLI and HTTPS iDrac interfaces.

@Papaul, do you perhaps recall how we fix this?

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Whatever array you create first on the new 15th gen raid controller gets the higher ID, which is the opposite of 14th generation raid controller.
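As a small illustration of that rule, here is a sketch that turns a desired ID ordering into the order the arrays need to be created in. The function name and the array labels are hypothetical; the only facts it encodes are the ones stated above (first-created array gets the higher ID on 15th gen, the opposite on 14th gen) and the dumpsdata1007 layout where the SSD RAID1 has the lower ID.

```python
def creation_order(desired_low_to_high, first_created_gets_higher_id):
    """Return the order in which to create arrays so their IDs end up as desired.

    desired_low_to_high: array labels listed from lowest to highest wanted ID.
    first_created_gets_higher_id: True for the 15th gen behaviour described
    above, False for the 14th gen behaviour.
    """
    order = list(desired_low_to_high)
    if first_created_gets_higher_id:
        # The array created first gets the higher ID, so start with the array
        # that should end up with the highest ID.
        order.reverse()
    return order


# Labels are illustrative; the wanted layout mirrors dumpsdata1007,
# where the SSD RAID1 has the lower ID and the HDD RAID10 the higher one.
desired = ["ssd-raid1", "hdd-raid10"]
print(creation_order(desired, first_created_gets_higher_id=True))
print(creation_order(desired, first_created_gets_higher_id=False))
```

Under those assumptions, on the 15th gen controller the HDD RAID10 would be created first and the SSD RAID1 second.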

I recalled it was something really easy, so I just had to sit and have a good think on it.

Updated platform specific docs on wikitech so I never waste time re-tracing my own steps:

https://wikitech.wikimedia.org/w/index.php?title=SRE%2FDc-operations%2FPlatform-specific_documentation%2FDell_Documentation&diff=2057257&oldid=2050341

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye completed:

  • dumpsdata1006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302281615_robh_579008_dumpsdata1006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
RobH reassigned this task from RobH to ArielGlenn.
RobH updated the task description.
RobH updated Other Assignee, removed: Jclark-ctr.

both hosts online and ready for your use!

Wonderful, we have claimed them already :-) Thank you!