Reclaim recently-decommed CP host for WDQS (see T352253)
Closed, ResolvedPublic

Description

Hello DC Ops,

Per T352253, @dr0ptp4kt requested one of the recently decommissioned CP hosts for the WDQS graph split experiment. Specifically, we want to find out whether NVMe drives can speed up our data reload process.

Creating this ticket based on @wiki_willy's request in the earlier ticket:

hostname: wdqs1025.eqiad.wmnet
racking info: any 10G rack
network setup: private VLAN
raid/partitioning: modules/install_server/files/autoinstall/partman/custom/cacheproxy.cfg from puppet repo
OS: Debian Bullseye
main PoC: @bking (IRC: inflatador)
needed for: 6 weeks (decom by mid-April)

Thanks for taking a look! Please let me know if you need more info.

Details

	Subject	Repo	Branch	Lines +/-
	site.pp: Move wdqs1025 into test role	operations/puppet	production	+5 -2
	site.pp: Add wdqs1025 host	operations/puppet	production	+4 -0


Related Objects

Status	Assigned	Task
Open	None	T291916 Tracking task for Bullseye migrations in production
Resolved	Gehel	T323921 [Epic] Migrate all Search Platform servers to Debian Bullseye
Resolved	bking	T332314 Service implementation for wdqs20[13-22]
Resolved	bking	T331300 Ensure WDQS stack works on Bullseye
Resolved	bking	T336443 Investigate performance differences between wdqs2022 and older hosts
Resolved	Gehel	T358314 Hardware requests for Data Platform Engineering - FY2024-2025
Resolved	bking	T358533 Hardware requests for Search Platform FY2024-2025
In Progress	dr0ptp4kt	T359062 Assess Wikidata dump import hardware
Resolved	VRiley-WMF	T358727 Reclaim recently-decommed CP host for WDQS (see T352253)

Event Timeline

bking created this task. · Feb 28 2024, 9:58 PM

bking updated the task description. · Feb 28 2024, 10:00 PM

++ @VRiley-WMF and @Jclark-ctr - can one of you pick up this request? We'll be repurposing one of the previously decommissioned cp servers to set up a temp server for Adam to use. Thanks, Willy

Maintenance_bot added a project: SRE. · Feb 28 2024, 10:29 PM

Maintenance_bot added a project: Wikidata.

Restricted Application added a project: wmde-wikidata-tech. · Feb 28 2024, 10:29 PM

Gehel triaged this task as High priority. · Feb 29 2024, 9:30 AM

Gehel moved this task from Incoming to 2024.02.12 - 2024.03.03 on the Data-Platform-SRE board.

Gehel edited projects, added Data-Platform-SRE (2024.02.12 - 2024.03.03); removed Data-Platform-SRE.

Jclark-ctr moved this task from Backlog to Racking Tasks on the ops-eqiad board. · Feb 29 2024, 5:12 PM

dr0ptp4kt added a parent task: T336443: Investigate performance differences between wdqs2022 and older hosts. · Feb 29 2024, 9:26 PM

dr0ptp4kt added a parent task: T358533: Hardware requests for Search Platform FY2024-2025.

bking mentioned this in T358533: Hardware requests for Search Platform FY2024-2025. · Feb 29 2024, 9:33 PM

VRiley-WMF claimed this task. · Mar 1 2024, 4:43 PM

Hi @dr0ptp4kt I have racked and stacked cp1086 in the following location

Rack B 7
U 20
CableID 1966
Port 20

Please let us know if there is anything else you may need for this. Thank you!

Thanks @VRiley-WMF ! @bking is up next for imaging, I think.

@VRiley-WMF or @Jclark-ctr are there any other lifecycle steps I need to take to get this host back into production as wdqs1025? This host was already decommissioned, so I'm not sure what to do to get it back into production.

The Server Lifecycle Page shows how to do this for hosts that still have a management interface, but cp1086.mgmt.eqiad.wmnet is not reachable.

Gehel edited projects, added Data-Platform-SRE (2024.03.04 - 2024.03.24); removed Data-Platform-SRE (2024.02.12 - 2024.03.03). · Mar 4 2024, 1:34 PM

Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board. · Mar 4 2024, 4:29 PM

Gehel edited projects, added Discovery-Search (Current work); removed Wikidata-Query-Service.

Gehel moved this task from Incoming to DPE-SRE on the Discovery-Search (Current work) board. · Mar 4 2024, 4:44 PM

@VRiley-WMF any pointers on how to iDRAC / iLO to this node and establish with a hostname of wdqs1025.eqiad.wmnet? I'm wondering if maybe there's a direct IP or IPs given that there don't seem to be DNS records for cp1086.eqiad.wmnet or cp1086.mgmt.eqiad.wmnet?

In T358727#9600557, @dr0ptp4kt wrote:

@VRiley-WMF any pointers on how to iDRAC / iLO to this node and establish with a hostname of wdqs1025.eqiad.wmnet? I'm wondering if maybe there's a direct IP or IPs given that there don't seem to be DNS records for cp1086.eqiad.wmnet or cp1086.mgmt.eqiad.wmnet?

I've made the changes in Netbox needed for this. Effectively, we need to follow this process:

https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

The only trick was that, after running the Netbox ProvisionServerNetwork script, I manually changed the newly assigned mgmt IP back to the old one the host is already using. This step can be avoided in future by using the --keep-mgmt-dns flag when running the sre.hosts.decommission cookbook.

I updated the DNS and it's pingable again:

cmooney@cumin1002:~$ ping wdqs1025.mgmt.eqiad.wmnet 
PING wdqs1025.mgmt.eqiad.wmnet (10.65.5.91) 56(84) bytes of data.
64 bytes from wdqs1025.mgmt.eqiad.wmnet (10.65.5.91): icmp_seq=1 ttl=62 time=0.682 ms
64 bytes from wmf7415.mgmt.eqiad.wmnet (10.65.5.91): icmp_seq=2 ttl=62 time=0.815 ms

I also pushed the required network config to asw2-b7-eqiad, so you should be able to follow the above steps from the "Patch Puppet" stage on I think.
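
The DNS check above can also be scripted. The following is only a minimal Python sketch, not part of any existing tooling: the function names (mgmt_fqdn, resolve, rename_status) are hypothetical, and the <host>.mgmt.<site>.wmnet naming pattern is inferred from the names used in this thread. It reports whether the old management name still resolves and which address the new one points to.

```python
import socket
from typing import Callable, Dict, Optional


def mgmt_fqdn(hostname: str, site: str) -> str:
    """Build a management FQDN such as wdqs1025.mgmt.eqiad.wmnet (assumed pattern)."""
    return f"{hostname}.mgmt.{site}.wmnet"


def resolve(name: str, resolver: Callable[[str], str] = socket.gethostbyname) -> Optional[str]:
    """Return the IPv4 address for name, or None if the lookup fails."""
    try:
        return resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def rename_status(
    old_host: str,
    new_host: str,
    site: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, object]:
    """Summarise mgmt DNS state after a rename: old name gone, new name present."""
    old_ip = resolve(mgmt_fqdn(old_host, site), resolver)
    new_ip = resolve(mgmt_fqdn(new_host, site), resolver)
    return {"old_resolves": old_ip is not None, "new_ip": new_ip}
```

Calling rename_status("cp1086", "wdqs1025", "eqiad") from a host that can resolve wmnet names would perform real lookups; the resolver parameter exists so the logic can be exercised without network access.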

Thank you @cmooney ! I have also relabeled this unit to match the name. Closing this ticket as per our discussion since it's completed from a DC Ops perspective.

VRiley-WMF closed this task as Resolved. · Mar 5 2024, 3:51 PM

bking added a parent task: T359062: Assess Wikidata dump import hardware. · Mar 5 2024, 6:19 PM

@VRiley-WMF wdqs1025 is failing to reimage. I can't see any disks in the DRAC interface. Are you able to check the disks and see if they're properly seated?

@dr0ptp4kt would you be able to try to reimage this unit again? I have run it through a power cycle, which can help with this process. Let us know, thanks!

@VRiley-WMF

Unfortunately, I'm still getting errors (screenshot) when I try to boot up the host. Are you able to reseat the cables and disks?

Swapped the cable with a new one (same port), shut down the unit, and reseated the drives as well. Powered the unit back on.

@bking were puppet and site.pp updated? Unfortunately, Valerie and I do not have access to push updates, so this has become a step for the SRE owner to handle as part of the procurement ticket.

Change 1009361 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: Add wdqs1025 host

https://gerrit.wikimedia.org/r/1009361

gerritbot added a project: Patch-For-Review. · Mar 6 2024, 10:59 PM

Change 1009361 merged by Bking:

[operations/puppet@production] site.pp: Add wdqs1025 host

https://gerrit.wikimedia.org/r/1009361

@Jclark-ctr Thanks for the tip, I've added a patch and will try the reimage again.
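
Since the reimage failed here because the host had no site.pp entry, a pre-flight check along the following lines could catch this earlier. This is only a rough Python sketch under stated assumptions: host_in_site_pp is a hypothetical helper, not part of the reimage cookbook, and it only handles single-name node definitions (quoted literals or /regex/ patterns), treating Puppet regexes as Python-compatible. Comma-separated node lists and "node default" are not handled.

```python
import re

# Matches lines such as:
#   node 'wdqs1025.eqiad.wmnet' {
#   node /^wdqs10(1[1-9]|2[0-4])\.eqiad\.wmnet$/ {
NODE_RE = re.compile(
    r"""^\s*node\s+(?:'([^']+)'|"([^"]+)"|/(.+)/)\s*\{""",
    re.MULTILINE,
)


def host_in_site_pp(fqdn: str, site_pp_text: str) -> bool:
    """Return True if any node definition in the given site.pp text covers fqdn."""
    for match in NODE_RE.finditer(site_pp_text):
        single, double, pattern = match.groups()
        literal = single or double
        if literal is not None:
            # Quoted node names must match the FQDN exactly.
            if literal == fqdn:
                return True
        elif re.search(pattern, fqdn):
            # Regex node definitions: assume Python's re accepts the pattern.
            return True
    return False
```

A caller would read the site.pp file from a local checkout of the puppet repo and warn before starting the reimage if host_in_site_pp returns False for the target FQDN.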

bking mentioned this in T359495: Reimage cookbook: warn when host is not in site.pp. · Mar 6 2024, 11:19 PM

Maintenance_bot removed a project: Patch-For-Review. · Mar 6 2024, 11:30 PM

Change 1009371 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: Move wdqs1025 into test role

https://gerrit.wikimedia.org/r/1009371

gerritbot added a project: Patch-For-Review. · Mar 7 2024, 1:46 AM

Lucas_Werkmeister_WMDE moved this task from Incoming to Other Projects on the wmde-wikidata-tech board. · Mar 7 2024, 1:49 PM

Mentioned in SAL (#wikimedia-operations) [2024-03-07T14:13:34Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync hiera as instructed by failed reimage cookbook - bking@cumin2002 - T358727"

Mentioned in SAL (#wikimedia-operations) [2024-03-07T14:14:26Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync hiera as instructed by failed reimage cookbook - bking@cumin2002 - T358727"

Change 1009371 merged by Bking:

[operations/puppet@production] site.pp: Move wdqs1025 into test role

https://gerrit.wikimedia.org/r/1009371

Maintenance_bot removed a project: Patch-For-Review. · Mar 7 2024, 5:30 PM

Apologies for not posting this sooner. wdqs1025 has been ready for use since the above change was merged. Closing this for now, but work continues in T359062.
