
Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad
Closed, Resolved · Public

Description

2024-07-09, 15:00 UTC
Hosts on this rack: https://netbox.wikimedia.org/dcim/racks/83/

  • backup1009
  • db1192 - old s8 master
  • db1198 - s3
  • db1199 - s4
  • db1204 - backup
  • ms-be1074
  • an-presto1010
  • an-worker1154
  • cephosd1003
  • druid1010
  • kafka-jumbo1012
  • kafka-stretch1001
  • elastic1093
  • elastic1094
  • elastic1095
  • wdqs1015
  • dse-k8s-worker1006
  • ml-serve1006
  • kubernetes1061
  • kubernetes1048
  • kubernetes1047
  • kubernetes1049
  • kubernetes1050
  • kubernetes1051
  • mw1491
  • mw1492
  • mw1493

Teams Involved: Data Persistence, Data Platform, Search, Machine Learning, Service Ops

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
MatthewVernon

Event Timeline

[From a Swift POV, we just need to check that the cluster is OK afterwards.]
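
For the Swift check, a minimal sketch using the stock swift-recon / swift-dispersion-report tools (assuming they are available on a frontend host; the exact WMF invocation may differ):

```
# Verify all object servers agree on the ring (md5) and replication is running.
swift-recon --md5 -r

# Confirm dispersion objects are still reachable; ~100% means the
# cluster recovered cleanly from the brief network outage.
swift-dispersion-report
```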

cmooney triaged this task as Medium priority.
cmooney updated the task description.
cmooney updated the task description.

backup1009 is the main backup node for Bacula in eqiad. Most backups happen during the night, so just monitoring that it comes back and that new backups run normally would be enough.

db1204 is the main mediabackups metadata db; ideally, media backups are stopped while the maintenance happens and resumed afterwards to avoid errors.
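
For the backup1009 monitoring part, a minimal sketch using the standard Bacula console (assuming bconsole is configured on the host; this is not a documented WMF procedure):

```
# On backup1009: confirm the director came back up after the outage.
sudo bconsole
* status director

# After the nightly run: list recent jobs and check they terminated
# with status "T" (completed OK) rather than "E"/"f" (errors/failures).
* list jobs
```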

Is there a procedure for that so we know how to do so?

Sadly, there is not. The code changes to implement it "the proper way" are trivial, but I didn't want to implement the simpler current approach, and I am wary of changing any backup code logic just days before I leave (traditionally, that has always introduced some regression). Instead, I encourage you to join the session next week about debugging mediabackups that I had planned with Arnaud, where I was going to go over typical issues and maintenance.

I will try, but just in case, @ABran-WMF please take some notes!

Mentioned in SAL (#wikimedia-operations) [2024-07-09T10:29:47Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1192 db1198 db1199 T365995', diff saved to https://phabricator.wikimedia.org/P66039 and previous config saved to /var/cache/conftool/dbconfig/20240709-102947-root.json
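
For reference, the depool above corresponds roughly to the following dbctl workflow (a sketch of the usual WMF cumin-host procedure; exact flags may differ):

```
# Depool the three replicas racked behind lsw1-e3-eqiad.
sudo dbctl instance db1192 depool
sudo dbctl instance db1198 depool
sudo dbctl instance db1199 depool

# Commit the change to etcd so MediaWiki picks it up.
sudo dbctl config commit -m "Depool db1192 db1198 db1199 T365995"
```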

Mentioned in SAL (#wikimedia-operations) [2024-07-09T11:15:48Z] <btullis> drained dse-k8s-worker1006.eqiad.wmnet ready for T365995
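
The drain is the standard Kubernetes node drain (a sketch; the --ignore-daemonsets and --delete-emptydir-data flags are the usual ones for production workers, not taken from the SAL entry):

```
# Cordon the node and evict its pods ahead of the switch reboot.
kubectl drain dse-k8s-worker1006.eqiad.wmnet \
  --ignore-daemonsets \
  --delete-emptydir-data
```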

Mentioned in SAL (#wikimedia-operations) [2024-07-09T11:17:19Z] <btullis> set cephosd cluster into noout mode to prevent rebalancing for T365995
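
For context, noout prevents Ceph from marking the briefly-unreachable OSDs out and starting a rebalance during the window. A sketch of the standard command pair (run on any mon/admin host):

```
# Before the switch upgrade: don't mark unreachable OSDs "out".
sudo ceph osd set noout

# After the host is back and its OSDs have rejoined:
sudo ceph osd unset noout
sudo ceph health   # expect HEALTH_OK once peering settles
```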

Mentioned in SAL (#wikimedia-operations) [2024-07-09T14:26:14Z] <hnowlan> kubectl drain kubernetes1061.eqiad.wmnet mw1492.eqiad.wmnet (T365995)

Switch upgrade completed without issue. All connected hosts are back online and responding to ping now. Thanks all for the help.
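
The corresponding revert steps after the upgrade would look roughly like this (a sketch mirroring the preparation above; the host list is abbreviated and not taken from the SAL):

```
# Kubernetes: allow pods to schedule on the drained workers again.
kubectl uncordon dse-k8s-worker1006.eqiad.wmnet

# Ceph: re-enable rebalancing now that the OSD host is reachable.
sudo ceph osd unset noout

# MariaDB: repool the replicas (typically done gradually) and commit.
sudo dbctl instance db1192 pool
sudo dbctl config commit -m "Repool db1192 db1198 db1199 T365995"
```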

@cmooney good to be closed?

Yep, just making sure there was no fallout. Thanks!