Placeholder task for follow up investigation and actions.
It's not clear how stable this switch is, so better to move any critical services out of that rack.
DB related critical hosts: T313382#8090176
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| wmnet: Failover m3-master | operations/dns | master | +1 -1 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | ayounsi | T313382 asw2-c5-eqiad crash |
| Resolved | Marostegui | T313383 Switchover s7 master db1181 -> db1136 |
| Resolved | Jclark-ctr | T313384 eqiad row C switch fabric recabling |
| | | Unknown Object (Task) |
| Resolved | Marostegui | T313398 Failover x1 master db1120 -> db1103 |
| Resolved | dcaro | T313400 2022-07-20 CloudVPS unstability after network outage |
| Resolved | dcaro | T313402 NovafullstackSustainedFailures cloudcontrol1003:9100 The automated tests were unable to create, provision and decommission a VM in the last 5h |
| Resolved | dcaro | T313407 JobUnavailable The Prometheus job rabbitmq running on cloud@ has been unable to scrape 20% of its targets. Check if the targets are reachable and exporting metrics. |
| Resolved | Marostegui | T313401 Move pc1014 from pc2 to pc3 |
Change 815680 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/dns@master] wmnet: Failover m3-master
Critical DB infra there:
This didn't get caught by monitoring. We have a LibreNMS alert that triggers when a device sends any "emergency" log, but it looks like this event wasn't critical enough to be tagged as such.
I added a specific LibreNMS alert that triggers if any syslog message contains the string "Member change: vc delete of member". This will help us pinpoint the root cause faster.
And matching runbook: https://wikitech.wikimedia.org/wiki/Network_monitoring#virtual-chassis_crash
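For illustration, the matching logic behind such an alert amounts to a substring check on each syslog message. The Python sketch below shows only that check; it is not the LibreNMS rule itself. The function names and the sample messages are hypothetical, and only the matched string is taken from the comment above.

```python
from typing import Iterable, List

# String quoted in the comment above; the alert condition is "message contains this".
VC_MEMBER_DELETE = "Member change: vc delete of member"


def is_vc_member_delete(message: str) -> bool:
    """Return True if a syslog message contains the virtual-chassis member-delete string."""
    return VC_MEMBER_DELETE in message


def matching_messages(messages: Iterable[str]) -> List[str]:
    """Return the messages that would satisfy the alert condition, in input order."""
    return [m for m in messages if is_vc_member_delete(m)]


if __name__ == "__main__":
    # Made-up placeholder messages, not real device output.
    sample = [
        "placeholder-host: Member change: vc delete of member 5",
        "placeholder-host: interface state changed",
    ]
    for line in matching_messages(sample):
        print(line)
```

A plain substring match is used here because the comment describes the condition as "contains the string"; a real alert rule would be written in the monitoring tool's own rule syntax.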
Opened high severity JTAC case 2022-0720-513915.
In the meantime we need to discuss if we want to preemptively replace FPC5 with a spare, or focus on the re-cabling.
Change 815680 merged by Marostegui:
[operations/dns@master] wmnet: Failover m3-master
pc1013 is no longer a master in pc3.
All the tasks owned by the DBA team have been completed. However, we'd appreciate a heads-up before the maintenance so that we can depool the replicas that are in production.
Please note that dbproxy1018 and dbproxy1019 aren't owned by us. Both live in the same rack, so their owners need to move one of them out first; see T313445.