
maps2009 is unreachable
Closed, ResolvedPublic

Description

Common information

  • alertname: ManagementSSHDown
  • instance: maps2009.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: B6
  • severity: task
  • site: codfw
  • source: prometheus
  • team: dcops

Firing alerts


  • dashboard: TODO
  • description: The management interface at maps2009.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for maps2009.mgmt:22
  • alertname: ManagementSSHDown
  • instance: maps2009.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: B6
  • severity: task
  • site: codfw
  • source: prometheus
  • team: dcops
  • Source
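
For anyone who wants to check the management interface by hand, the following is a minimal sketch of what an `ssh_banner`-style probe presumably does: open a TCP connection to port 22 and check that the peer sends an SSH identification string. The function names, the default timeout, and the simplification of only inspecting the first bytes received are assumptions for illustration, not the actual Prometheus probe configuration.

```python
import socket


def is_ssh_banner(data: bytes) -> bool:
    """Return True if the bytes start with an SSH identification string.

    RFC 4253 identification strings begin with "SSH-". Servers may send
    other lines first; this sketch ignores that case for simplicity.
    """
    return data.startswith(b"SSH-")


def probe_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Hypothetical helper: True if host:port answers with an SSH banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
    return is_ssh_banner(data)


if __name__ == "__main__":
    # Hostname taken from the alert above; only resolvable inside the
    # management network.
    print(probe_ssh_banner("maps2009.mgmt", 22))
```

A `False` result only says the banner was not received within the timeout; it does not distinguish a powered-off host from a network or DNS problem.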

Impact

Stale maps (codfw) data: This is the primary postgres server of our codfw maps infrastructure. Currently, the side effect is that our maps data on codfw is not being updated. In other words, the service is functional but its data is stale.

Event Timeline

Joe triaged this task as High priority. Aug 14 2023, 7:14 AM
Joe added a project: serviceops-radar.
Joe subscribed.

The console is still unreachable; this server is part of a cluster actively servicing traffic

I randomly encountered this while manually checking maps2009. Just a heads up: this node is a master node, which means that while its services are down, OSM data syncing/invalidation and postgres replication are down and the cluster will be out of sync.

@Joe @Jgiannelos maps2009 doesn't look to be up at all. The network link and mgmt link are down and the front panel indicates that the server is off. Any way you can confirm?

After attempting to boot the server, I believe it is a bad motherboard. The server will not power up even with minimum configuration. PSUs/PDU are working/have green lights. No network lights, no iDRAC indicators, no front panel activity, etc. No console activity.

I have created a dispatch with Dell. The server is out of warranty on the 27th of this month. So good thing it went out now and not in September.

Service Request 173930280

Dell sent me a list of checks to determine if it's the motherboard or the backplane. I followed the directions and replied. My guess is the MB will need to be replaced. Will update when the part arrives.

jijiki renamed this task from ManagementSSHDown to maps2009 is unreachable. Aug 16 2023, 10:45 AM
jijiki updated the task description.

Replaced the system board and the controller. The system still did not POST. Pulled out everything except 1 RAM module, 1 CPU, and a PSU. Booted and started adding components back. Found that the backplane is what prevents the system from booting. I have contacted Dell and let them know. Expecting another part soon. Will update as events unfold.

@Jhancock.wm thanks for working on this. Since we have some R440s in storage, can you pull the backplane out of one of those servers and try it while we wait for Dell to send us one?

Thanks.

I can give it a shot. Dell confirmed they will be sending a new part this morning.

@Jhancock.wm in that case just wait for the new part. thanks

The backplane has been replaced and the settings in idrac have been updated. I think it's good to go back. I will leave this ticket up for a day to observe. Please let me know if you come across any issues that I need to check!

On the OSM sync side of things, it might be worth checking whether the system catches up with the diffs (~1 week's worth of diffs could be manageable). The idea is to avoid a full planet import from scratch for codfw.

Turns out there is one more thing I need to do. I missed a firmware update. Is it safe for me to reboot at this time?

@Jhancock.wm, I dropped the ball on this one, I will ping you and give you a window when it will be safe to reboot. Thank you very much for your effort on this!

@jijiki all good. I knew it needed time to do its thing. I'll be on site for the next 3 hours today and roughly the same time for the rest of the week. thanks!

@jijiki is there a day this week that I can update the firmware on this server?

Hey @Jhancock.wm, we need a bit of coordination that has not been possible yet. Sorry this is taking longer; I will update you as soon as we can.

@jijiki I haven't seen the iDRAC/SSH go down on this server in a while. Is it OK if I close this ticket? If there's still work that needs to be done, I can hold onto it. =)