Similar to T167274, T168462, T169345, T170380 and especially T148506.
This is the last switch that needs to be upgraded to fix T133387.
As eqiad is the active datacenter and row D contains multiple systems, the first step is to evaluate and discuss the difficulty/impact of the upgrade. Note that all services are supposed to have at least row redundancy.
Thanks to the preparation work done for T148506, this should be easier than the previous upgrades.
I'm planning to do the upgrade on Wednesday Feb. 14th at 15:00 UTC (07:00 PST / 10:00 EST).
That date is mainly a placeholder; I can reschedule as needed, so let me know if it's an issue for anyone.
2h total maintenance time.
As the previous standard upgrades went smoothly, we're following the same procedure. This means the whole row will go down for between 10 and 20 minutes (previous rows took ~10 minutes), barring complications.
The full list of servers is available at: https://racktables.wikimedia.org/index.php?row_id=2102&location_id=2006&page=row&tab=default
To summarize, here are the types of hosts in that row:
analytics*, aqs*, auth*, cp*, db*, dbstore*, druid*, dumpsdata*, einsteinium (Icinga), elastic*, es*, kafka*, kubernetes*, labpuppetmaster*, labservices*, labstore*, labweb*, logstash*, maps*, mc*, ms-be*, mw*, ocg*, ores*, pc*, puppetmaster*, rdb*, restbase*, restbase-dev*, scb*, snapshot*, stat*, thorium, thumbor*, wdqs*, wtp*
I subscribed users on that task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or remove yourself from the task if you're not involved.
Timeline (to be completed): please edit and add anyone who needs to be notified or any extra steps that need to be done.
Days before:
- Depool puppetmaster1002 ( T148506#3196641 )
- Switch over from einsteinium to tegmen ( T163324 )
- Switchover from oresrdb1001 to oresrdb1002 ( T163326 )
- Ban all elasticsearch nodes of row D ( @Gehel ); see the allocation-exclusion sketch after this list
- Fail etcd over to codfw
- Ensure we can survive a loss of labservices1001 ( T163402 )
- Fail over the affected DB masters ( T186188 )
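For the elasticsearch ban above, here is a minimal sketch using the standard cluster allocation-exclusion API. The endpoint port and the node names are placeholders, not the actual production values; the real list of row D elastic* hosts comes from racktables.

```
import json
import urllib.request

# Assumptions: the endpoint port and the node names below are placeholders,
# not the actual production values.
CLUSTER = "http://search.svc.eqiad.wmnet:9200"
ROW_D_NODES = ["elastic1031", "elastic1032"]  # hypothetical row D hosts

def set_allocation_exclusion(names):
    """Exclude the given nodes from shard allocation; an empty list clears the ban."""
    body = json.dumps({
        "transient": {
            "cluster.routing.allocation.exclude._name": ",".join(names)
        }
    }).encode()
    req = urllib.request.Request(
        CLUSTER + "/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())

# Ban the row D nodes before the window; calling it again with [] after the
# upgrade covers the "Unban" step in the post-upgrade list.
set_allocation_exclusion(ROW_D_NODES)
```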
1h before the window:
- Warn people of the upcoming maintenance
- Ping @elukey to disable kafka
- Ping @elukey to drain traffic from hadoop nodes
- Ping @Eevans to mute restbase* alerts ( T148506#3202477 )
- Ping @Gehel for elasticsearch and logstash coordination
- Disable elasticsearch check https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+health+check+for+shards
- Downtime the switch in Icinga/LibreNMS; see the downtime sketch after this list
- ...
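For the Icinga downtime step, a minimal sketch of scheduling a fixed host downtime through Icinga's external command file; the command-file path, switch hostname and author below are assumptions, not confirmed production values.

```
import time

# Assumptions: the command-file path, host name and author are placeholders.
ICINGA_CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"
SWITCH_HOST = "asw-d-eqiad"
WINDOW = 2 * 3600  # the 2h maintenance window

def schedule_host_downtime(host, duration, author, comment):
    """Write a SCHEDULE_HOST_DOWNTIME external command to the Icinga command pipe."""
    now = int(time.time())
    cmd = (f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{now};{now + duration};"
           f"1;0;{duration};{author};{comment}\n")
    with open(ICINGA_CMD_FILE, "w") as f:
        f.write(cmd)

schedule_host_downtime(SWITCH_HOST, WINDOW, "ops", "eqiad row D switch upgrade")
```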
After the upgrade:
- Confirm switches are in a healthy state; see the health-check sketch at the end of this list
- Re-enable igmp-snooping
- Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
- Run a LibreNMS discovery/poll
- Ask the users listed above to confirm that everything looks good
- Repool puppetmaster1002 (@akosiaris)
- Unban elasticsearch nodes of row D (@Gehel )
- Reindex the affected time period in Elasticsearch if we see any lost writes (@Gehel )
- Re-enable all Icinga checks
- Remove monitoring downtime
- ...
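For the "confirm switches are in a healthy state" step, a minimal post-upgrade check sketch using Juniper's PyEZ library (junos-eznc); the management hostname and user are assumptions, and on-box checks via the CLI work just as well.

```
from jnpr.junos import Device  # junos-eznc (PyEZ)

SWITCH = "asw-d-eqiad.mgmt.eqiad.wmnet"  # hypothetical management FQDN
USER = "ops"                             # hypothetical user, auth via SSH key

with Device(host=SWITCH, user=USER) as dev:
    # Facts are gathered on connect; the version should match the target JunOS release.
    print("JunOS version:", dev.facts.get("version"))
    # RPC equivalent of "show chassis alarms"; no alarm-detail elements means no active alarms.
    alarms = dev.rpc.get_alarm_information()
    print("Active chassis alarms:", len(alarms.findall(".//alarm-detail")))
```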