
eqiad: request for a decom'ed R440 - Config C
Closed, ResolvedPublic

Description

We would like to request a machine in eqiad that is decom'ed but not deracked yet.

type: R440 - Config C (if available)

it should be named contint1003.wikimedia.org

resembling existing contint1002.wikimedia.org (https://netbox.wikimedia.org/search/?q=contint1002)

This would unblock us at T418109 and would be temporary. Thank you very much!

Event Timeline

Change #1244743 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add contint1003/2003 with insetup collab role

https://gerrit.wikimedia.org/r/1244743

Hey @Dzahn we have decommed Dell R440s, but none that were "Config C". Would it be possible to try to match a decommed R440 to the contint1003 hardware spec?

Dzahn added a subscriber: VRiley-WMF.

Hi @VRiley-WMF I tried to filter netbox for "eqiad" + "R440" + "status: decom" like this:

https://netbox.wikimedia.org/dcim/devices/?site_id=6&status=decommissioning&device_type_id=42

but I did not get results for that.

What is a good way to find those decommed R440s to match? Or, what's the closest thing you have?
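The filtering above can be sketched programmatically. This is a minimal illustration that just builds the Netbox UI filter URL; the numeric IDs (site_id=6 for eqiad, device_type_id=42 for the R440 type) are taken from the URLs in this ticket and are deployment-specific assumptions.

```python
from urllib.parse import urlencode

# Base of the Netbox device list (from the URLs in this ticket).
NETBOX = "https://netbox.wikimedia.org/dcim/devices/"

def filter_url(status, site_id=6, device_type_id=42):
    """Return a Netbox UI URL listing devices with the given status.

    The IDs default to the values seen in this ticket (eqiad, R440);
    they are placeholders for other Netbox installs.
    """
    query = urlencode({
        "site_id": site_id,
        "status": status,
        "device_type_id": device_type_id,
    })
    return f"{NETBOX}?{query}"

print(filter_url("decommissioning"))
print(filter_url("offline"))
```

The same filters are available through the Netbox REST API (or the pynetbox client) for scripted searches.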

Change #1244743 merged by Dzahn:

[operations/puppet@production] site: add contint1003/2003 with insetup collab role

https://gerrit.wikimedia.org/r/1244743

Hey @Dzahn Here is a list of what is decommed and offline that we could use. https://netbox.wikimedia.org/dcim/devices/?site_id=6&status=offline&device_type_id=42&device_type_id=138

The "decommissioning" filter will match servers that are marked as such but may still be running, so I wouldn't recommend that.

Thank you @VRiley-WMF

Gotcha about the filter, makes sense.

I took a look and started sorting by purchase_date.

I see that your URL includes the device_type filter for exactly an "R440 with Config C".

But it seems the servers from around 2020 that we are looking at predate the standardized config types and can differ. So yeah, I had to dig a bit deeper and look at procurement tickets.
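That "sort by purchase_date, then inspect" workflow can be sketched like this. Only WMF5104 (ex-cloudcephosd1005) comes from this ticket; the other asset tags and all dates are made-up placeholders.

```python
from datetime import date

# Hypothetical candidate list; fields mirror what Netbox shows.
# Only WMF5104 is real (from this ticket); the rest are illustrative.
candidates = [
    {"asset_tag": "WMF5104", "model": "PowerEdge R440", "purchase_date": date(2020, 1, 15)},
    {"asset_tag": "WMF9001", "model": "PowerEdge R440", "purchase_date": date(2019, 6, 3)},
    {"asset_tag": "WMF9002", "model": "PowerEdge R440", "purchase_date": date(2021, 2, 20)},
]

# Sort oldest-first: pre-standardization hardware surfaces at the top,
# which is what needs manual checking against procurement tickets.
by_age = sorted(candidates, key=lambda d: d["purchase_date"])
for dev in by_age:
    print(dev["asset_tag"], dev["purchase_date"].isoformat())
```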

Well, how about this one? Device 2684 - WMF5104, formerly known as cloudcephosd1005.

It is an R440, it has the same amount of RAM and enough disk space. That seems close enough.

"Intel 4214 (2.2GHz/12C) 128GB RAM (2) 1.92TB SSD" (per approved order on T242036)

So, after looking for that specific server, it doesn't seem to be here, and Netbox may not reflect that at the moment. I do apologize about that. However, I was looking at this device as a substitute:

WMF5524

This seems to be the closest we have that's onsite. We can bump the RAM to 128 GB and put in the two 1.92TB SSDs. Let us know if this works?

@VRiley-WMF Thank you! Yea, that works too, provided you can bump RAM and disk. Sounds good.

Re-labeled moss-fe1002 to contint1003
Racked it in B1 U36
CableID: 3720
Port: 28

This unit is powered on and has an IP assigned to its iDRAC. You should be able to find more information here: https://netbox.wikimedia.org/dcim/devices/3135/

Since this is a bit different from an install ticket, is there anything else we may be able to change or update? Let us know, thank you!

@VRiley-WMF I tried to run the reimage cookbook with --new to install an OS. But I got these errors:

spicerack.redfish.RedfishError: Failed to perform GET request to https://10.65.0.145/redfish/v1/Systems/System.Embedded.1/Bios

I can't connect to the DRAC (contint1003.mgmt.eqiad.wmnet exists but is not listening on ssh).

Could you check the DRAC and/or install an OS (version: trixie or any) like over at T418545?

The machine exists in site.pp with insetup role and in partman; like what is needed for an install task.

Thanks a lot

dzahn@cumin2002:~$ ping contint1003.mgmt.eqiad.wmnet
PING contint1003.mgmt.eqiad.wmnet (10.65.0.145) 56(84) bytes of data.
^C
--- contint1003.mgmt.eqiad.wmnet ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3080ms
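The symptom above (the mgmt name resolves but nothing answers on ssh) can be distinguished from a DNS problem with a quick TCP probe. A minimal sketch, not part of any WMF tooling:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Distinguishes "host reachable but service down/not listening"
    (fast refusal or timeout) from a successful connect.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("contint1003.mgmt.eqiad.wmnet", 22)
```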

Hey @Dzahn for some reason the server had reverted its IP address. However, I have set it to the correct settings. I was able to ping it and remote into it. Let me know if there are any other issues.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint1003.wikimedia.org with OS trixie

Mentioned in SAL (#wikimedia-operations) [2026-03-09T17:34:29Z] <mutante> contint1003.mgmt - racadm serveraction powercycle T418544 - not reacting

Mentioned in SAL (#wikimedia-operations) [2026-03-09T17:38:11Z] <mutante> contint1003 - unable to get uptime Caused by: Cumin execution failed (exit_code=2) [101/240] - attempted manual powercycle - Initializing Firmware Interfaces... blank screen T418544

Hi @VRiley-WMF Thanks! Now I can reach the DRAC mgmt console.

I started the reimage cookbook again to get an OS on it.

Unfortunately this gets stuck when trying to reboot the host. I tried twice to manually powercycle it from the console to get it unstuck.

Both times I see a few boot messages but then the console just stays blank and the server does not come up:

Lifecycle Controller: Collecting System Inventory...

iDRAC IPV4:  10.65.0.145

Lifecycle Controller: Done
Booting...

(console output ends here; the screen then stays blank)

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint1003.wikimedia.org with OS trixie executed with errors:

  • contint1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console contint1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Okay, I ran the provisioning for this server and it seems to have passed. I will try to install the OS onto it.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host contint1003.wikimedia.org with OS trixie

@Dzahn It seems like it's getting stuck. Does this need a specific raid setup?

@VRiley-WMF It is configured for the standard raid1-dev recipe, just like contint2003, which had no problem with that partman recipe. Are there more than 2 disks in it or something unusual?

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host contint1003.wikimedia.org with OS trixie executed with errors:

  • contint1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console contint1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

@Dzahn I just tried to look at contint2003; however, when I'm in iDRAC it's not showing me physical disks. I know for this unit, contint1003, we only had 1.92TB SSDs available. contint1002 has 960GB drives in it. So that may be the issue?

As far as I can tell, the size of the disks has never mattered for the partman recipe; it just requires 2 identical disks.

But the part where you can't see any physical disks in the DRAC... that seems more like the issue to me.
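The "2 identical disks" precondition can be sanity-checked from the installer shell before the recipe runs. This is an illustrative sketch that parses `lsblk -b -d -n -o NAME,SIZE,TYPE`-style output; the sample data is made up (two 1.92TB disks), not taken from contint1003.

```python
def raid1_candidates(lsblk_output):
    """Parse `lsblk -b -d -n -o NAME,SIZE,TYPE` output (bytes, no
    partitions, no header) and return (name, size) pairs for disks."""
    disks = []
    for line in lsblk_output.strip().splitlines():
        name, size, dtype = line.split()
        if dtype == "disk":
            disks.append((name, int(size)))
    return disks

def ok_for_raid1(disks):
    """A mirrored-pair recipe needs exactly two identically sized disks."""
    return len(disks) == 2 and disks[0][1] == disks[1][1]

# Hypothetical output for a host with two 1.92TB SSDs and an optical drive.
sample = """\
sda 1920383410176 disk
sdb 1920383410176 disk
sr0 1073741312 rom
"""
print(ok_for_raid1(raid1_candidates(sample)))
```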

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host contint1003.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host contint1003.wikimedia.org with OS trixie executed with errors:

  • contint1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console contint1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host contint1003.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host contint1003.wikimedia.org with OS trixie executed with errors:

  • contint1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console contint1003.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host contint1003.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host contint1003.wikimedia.org with OS trixie completed:

  • contint1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603102344_vriley_2761040_contint1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Was able to complete this after speaking with @Jhancock.wm. Thank you!

@Dzahn It should be completed now.

@Dzahn Let me know if you're able to access this, and if so, I will close it out. Thanks!

@VRiley-WMF Thank you very much! It works and I can connect :))