hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet
Closed, Resolved (Public)

Description

This task will track the hardware troubleshooting and repair of server elastic1029.eqiad.wmnet

The first half of the steps should be completed by the person filing the hardware repair task. Some of these steps require access to Icinga to put a host into maintenance mode.

All Other Failures (memory, battery, controller, cpu, etc.)
  • Follow directions on https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#All_Other_Failures
  • Place system in fully offline state for hardware troubleshooting by the on-site engineer. If it cannot be placed offline, coordinate with the on-site engineer to schedule a maintenance window.
  • Set system and mgmt interface to maint mode (no checks/alarms on services) for 5 business days (excluding weekends).
  • Attach a detailed hardware failure log to this task via comment; see the hardware troubleshooting runbook for how to accomplish this.
  • Assign the task to the proper assignee & onsite project, and place it in the 'hardware troubleshooting' column on the workboard.

I don't currently have management interface access to run most of the commands.
I would like to coordinate with a DCops person to schedule downtime for this node.
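
Without management interface access, the correctable-error counters behind an EDAC alert can usually still be read from the host itself and attached to the task. The following is a minimal sketch, not the runbook's procedure: it assumes the Linux kernel EDAC driver is loaded and exposes per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`. The function name `read_edac_counts` and its `root` parameter are hypothetical, introduced here for illustration only.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} for each mc* dir.

    "ce" is the correctable error count, "ue" the uncorrectable error count.
    A value of None means the counter file was missing or unreadable.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # EDAC driver not loaded, or the sysfs layout differs on this host.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    # Print one line per memory controller, suitable for pasting into a comment.
    for name, c in read_edac_counts().items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

Taking the sysfs root as a parameter keeps the sketch testable against a temporary directory; on a real host the default path would be used as-is.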

Event Timeline

Mathew.onipe triaged this task as Medium priority.

@Mathew.onipe elastic1029 is over 2 years out of warranty. The DIMM can be reseated; does this server need scheduled downtime, or can it be taken down at any time?

@Gehel This server is 2 years out of warranty. The memory can be reseated, but it is doubtful that this will correct the issue. Are there plans to replace this server?

@wiki_willy This server is in netbox as decommissioning. I didn't do that. Can you ask around and figure out the status, please? I am going to disable the network port for now.

Network port is removed from private vlan and disabled for now.

[edit interfaces interface-range disabled]
     member ge-6/0/35 { ... }
+    member ge-4/0/34;

Hi @Gehel - just wanted to follow up on this one, to hopefully wrap up the task. I couldn't find much on the current status of elastic1029 - does your team have this slated to be decommissioned? It's been in production for about 6 years, so I'm hoping there's a refresh in the works. Thanks, Willy

@wiki_willy / @Cmjohnson : this server has been decommed as part of T239821, so yep, nothing to do here except get rid of it. Thanks! And sorry for the delay...

Thanks for the confirmation, @Gehel. We'll resolve this task then. Thanks, Willy