asw2-c5-eqiad crash
Closed, Resolved · Public

Description

Placeholder task for follow up investigation and actions.

It's not clear how stable this switch is, so it's better to move any critical services out of that rack.

DB related critical hosts: T313382#8090176

Event Timeline

ayounsi triaged this task as High priority. Jul 20 2022, 7:17 AM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Change 815680 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Failover m3-master

https://gerrit.wikimedia.org/r/815680

Critical DB infra there:

  • dbproxy1020 (current m3 proxy): needs failover.
  • pc1013 (active pc3 master): needs failover (T313401).
  • db1181 (s7 master): needs failover (T313383).
  • db1120 (x1 master): needs failover (T313398).
  • dbproxy1018 and dbproxy1019 are active WMCS proxies, need to be handled by them cc @nskaggs (they should also be moved into different racks) T313445

This didn't get caught by monitoring. We have a LibreNMS alert that triggers when any "emergency" log is sent by a device, but it looks like this event wasn't critical enough to be tagged as such.

I added a specific LibreNMS alert that triggers if any syslog message contains the string "Member change: vc delete of member"; this will help us pinpoint the root cause faster.
Matching runbook: https://wikitech.wikimedia.org/wiki/Network_monitoring#virtual-chassis_crash
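
For illustration only, the two alert conditions described above (an "emergency" severity log, or a message containing the virtual-chassis string) can be sketched in Python. This is not the LibreNMS rule itself: the function names and sample lines are made up, and the sketch assumes each message starts with a standard syslog <PRI> header, which may not match how the devices or LibreNMS actually present logs.

```python
import re

# The substring quoted in the comment above; everything else here is hypothetical.
VC_MEMBER_DELETE = "Member change: vc delete of member"

# Standard syslog PRI header: "<N>", where N = facility * 8 + severity.
PRI_RE = re.compile(r"^<(\d{1,3})>")

EMERGENCY = 0  # syslog severity 0 is "Emergency"


def severity(line):
    """Return the syslog severity (0-7) from a leading <PRI> header, or None."""
    match = PRI_RE.match(line)
    if match is None:
        return None
    return int(match.group(1)) % 8


def should_alert(line):
    """True if the line is emergency severity or mentions a VC member deletion."""
    return severity(line) == EMERGENCY or VC_MEMBER_DELETE in line


# Made-up sample lines for illustration; not real device output.
samples = [
    "<8>example-switch: some emergency message",
    "<29>example-switch: Member change: vc delete of member 1",
    "<30>example-switch: routine informational message",
]

for sample in samples:
    print(should_alert(sample), sample)
```

The second sample shows the gap noted above: a message at a lower severity would not trip an emergency-only check, but the string match still catches it.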

Opened high severity JTAC case 2022-0720-513915.
In the meantime we need to discuss if we want to preemptively replace FPC5 with a spare, or focus on the re-cabling.

Change 815680 merged by Marostegui:

[operations/dns@master] wmnet: Failover m3-master

https://gerrit.wikimedia.org/r/815680

m3-master dbproxy has been failed over.

  • dbproxy1018 and dbproxy1019 are active WMCS proxies, need to be handled by them cc @nskaggs (they should also be moved into different racks)

Thanks for noting this @Marostegui! I filed T313445 to move dbproxy1019.

Both masters, s7 and x1, have been switched over and no longer live in this rack.

pc1013 is no longer a master in pc3.
All the tasks owned by the DBA team have been completed. However, we'd appreciate a heads-up before the maintenance so we can depool the replicas that are in production.
Please note that dbproxy1018 and dbproxy1019 aren't owned by us. Both of them live in this rack, so their owners need to move one of them out first; see T313445.

ayounsi claimed this task.

Sub-task completed successfully; nothing more to do here.