Reclaim recently-decommed CP host for WDQS (see T352253)
Closed, Resolved (Public)

Description

Hello DC Ops,

Per T352253, @dr0ptp4kt requested one of the recently-decommed CP hosts for the WDQS graph split experiment (specifically, we are curious whether NVMe drives can speed up our data reload process).

Creating this ticket based on @wiki_willy's request in the earlier ticket:

  • hostname: wdqs1025.eqiad.wmnet
  • racking info: any 10G rack
  • network setup: private VLAN
  • raid/partitioning: modules/install_server/files/autoinstall/partman/custom/cacheproxy.cfg from puppet repo
  • OS: Debian Bullseye
  • main poc: @bking (IRC: inflatador)
  • needed for: 6 weeks (decom by mid-April)

Thanks for taking a look! Please let me know if you need more info.

Event Timeline

++ @VRiley-WMF and @Jclark-ctr - can one of you pick up this request? We'll be repurposing one of the previously decommissioned cp servers to set up a temp server for Adam to use. Thanks, Willy

Hi @dr0ptp4kt, I have racked and stacked cp1086 in the following location:

Rack B 7
U 20
CableID 1966
Port 20

Please let us know if there is anything else you may need for this. Thank you!

Thanks @VRiley-WMF! @bking is up next for imaging, I think.

@VRiley-WMF or @Jclark-ctr are there any other lifecycle steps I need to take to get this host back into production as wdqs1025? This host was already decommissioned, so I'm not sure what to do to get it back into production.

The Server Lifecycle Page shows how to do this for hosts that still have a management interface, but cp1086.mgmt.eqiad.wmnet is not reachable.

@VRiley-WMF any pointers on how to reach the iDRAC / iLO on this node and set it up with a hostname of wdqs1025.eqiad.wmnet? I'm wondering if maybe there's a direct IP or IPs, given that there don't seem to be DNS records for cp1086.eqiad.wmnet or cp1086.mgmt.eqiad.wmnet.

I've made the changes in Netbox needed for this. Effectively, we need to follow this process:

https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

The only trick was that, after running the Netbox ProvisionServerNetwork script, I manually changed the newly assigned mgmt IP back to the old one the host is already using. This step can be avoided in future by using the --keep-mgmt-dns flag when running the sre.hosts.decommission cookbook.

I updated the DNS and it's pingable again:

cmooney@cumin1002:~$ ping wdqs1025.mgmt.eqiad.wmnet 
PING wdqs1025.mgmt.eqiad.wmnet (10.65.5.91) 56(84) bytes of data.
64 bytes from wdqs1025.mgmt.eqiad.wmnet (10.65.5.91): icmp_seq=1 ttl=62 time=0.682 ms
64 bytes from wmf7415.mgmt.eqiad.wmnet (10.65.5.91): icmp_seq=2 ttl=62 time=0.815 ms

I also pushed the required network config to asw2-b7-eqiad, so you should be able to follow the above steps from the "Patch Puppet" stage onward, I think.
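For anyone repeating this rename, the DNS check above can be scripted. The following is only a minimal sketch, not part of any WMF tooling: the helper names (`mgmt_fqdn`, `lookup`, `check_host`) are hypothetical, and it assumes the mgmt name is formed by inserting a `mgmt` label after the short hostname, as with cp1086.eqiad.wmnet and cp1086.mgmt.eqiad.wmnet earlier in this task.

```python
import socket
from typing import Callable, Dict, Optional


def mgmt_fqdn(host_fqdn: str) -> str:
    """Build the mgmt name by inserting a 'mgmt' label after the short hostname.

    Assumption: names follow the <host>.<site>.wmnet / <host>.mgmt.<site>.wmnet
    pattern seen in this task.
    """
    short, _, domain = host_fqdn.partition(".")
    if not domain:
        raise ValueError(f"expected a fully qualified name, got {host_fqdn!r}")
    return f"{short}.mgmt.{domain}"


def lookup(name: str,
           resolver: Callable[[str], str] = socket.gethostbyname) -> Optional[str]:
    """Return the IPv4 address for name, or None if it does not resolve."""
    try:
        return resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_host(host_fqdn: str,
               resolver: Callable[[str], str] = socket.gethostbyname
               ) -> Dict[str, Optional[str]]:
    """Look up both the production name and the mgmt name of a host."""
    names = (host_fqdn, mgmt_fqdn(host_fqdn))
    return {name: lookup(name, resolver) for name in names}


if __name__ == "__main__":
    # Hypothetical usage; only meaningful from a machine that can resolve
    # the internal wmnet zones.
    for name, addr in check_host("wdqs1025.eqiad.wmnet").items():
        print(f"{name}: {addr or 'no DNS record'}")
```

The resolver is passed in as a parameter so the logic can be exercised without network access. A missing production record before the reimage would not necessarily be a problem, whereas a missing mgmt record is what blocked the earlier steps here.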

Thank you @cmooney! I have also relabeled this unit to match the name. Closing this ticket as per our discussion since it's completed from a DC Ops perspective.

@VRiley-WMF wdqs1025 is failing to reimage. I can't see any disks in the DRAC interface. Are you able to check the disks and see if they're properly seated?

@dr0ptp4kt would you be able to try to reimage this unit again? I have run it through a power cycle, which can help with this process. Let us know, thanks!

@VRiley-WMF

Unfortunately, I'm still getting errors (screenshot) when I try to boot up the host. Are you able to reseat the cables and disks?

Swapped the cable with a new one (same port), shut down the unit and reseated the drives as well. Powered the unit back on.

@bking were puppet and site.pp updated? Unfortunately, Valerie and I do not have access to push updates; this has become a step for the SRE owner to handle along with the procurement ticket.

Change 1009361 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: Add wdqs1025 host

https://gerrit.wikimedia.org/r/1009361

Change 1009361 merged by Bking:

[operations/puppet@production] site.pp: Add wdqs1025 host

https://gerrit.wikimedia.org/r/1009361

@Jclark-ctr Thanks for the tip, I've added a patch and will try the reimage again.

Change 1009371 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: Move wdqs1025 into test role

https://gerrit.wikimedia.org/r/1009371

Mentioned in SAL (#wikimedia-operations) [2024-03-07T14:13:34Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync hiera as instructed by failed reimage cookbook - bking@cumin2002 - T358727"

Mentioned in SAL (#wikimedia-operations) [2024-03-07T14:14:26Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync hiera as instructed by failed reimage cookbook - bking@cumin2002 - T358727"

Change 1009371 merged by Bking:

[operations/puppet@production] site.pp: Move wdqs1025 into test role

https://gerrit.wikimedia.org/r/1009371

bking moved this task from Backlog to Done on the Data-Platform-SRE (2024.03.04 - 2024.03.24) board.

Apologies for not posting this sooner. wdqs1025 has been ready for use since the above request was merged. Closing this for now, but work continues in T359062.