**Planned for Tuesday, July 27th at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)**
Netops plan to adjust the buffer memory configuration on all switches in Eqiad Row B, to address tail drops observed on some of the devices, which are causing throughput issues.
This is an intrusive change: the Juniper documentation says it will bring all traffic on the row to a complete stop while the switches reconfigure themselves. Experience from the equivalent change to Row D indicates that this interruption is very brief (no ping loss was observed), so the expected interruption is less than 1 second, and no interface state changes or similar should occur. As always, there is a small chance that something goes wrong (e.g. we hit a bug) and networking on the row is disrupted for longer. TL;DR: the interruption should be short enough that nobody notices, but the row should be considered "at risk".
Service owners may want to take proactive steps in advance to fail over / depool systems, to mitigate this higher risk (see the command sketch after the physical servers table below).
The complete list of servers in this row can be found here:
https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=6&status=active&role=server
Summarizing the physical servers / types:
|Server Name / Prefix |Count|Relevant Team |Action Required |Action Status|
|----------------------|-----|----------------------------------------------------------------|----------------|-------------|
|mw |39 |Service Operations |N|N/A|
|db                    |21   |Data Persistence                                                |N; consider failing over to db1117 if the maintenance takes longer than expected|N/A |
|an-worker |14 |Analytics SREs |N|N/A|
|cloudvirt |12 |WMCS |N|N/A|
|elastic |9 |Search Platform SREs |N|N/A|
|ms-be |9 |Data Persistence (Media Storage) |N|N/A|
|ganeti |6 |Infrastructure Foundations |N|N/A|
|analytics |5 |Analytics SREs |N|N/A|
|wtp |5 |Service Operations |N|N/A|
|cp                    |4    |Traffic                                                         |Depool the individual hosts with the `depool` command (see sketch below)|Complete|
|es |4 |Data Persistence |N|N/A|
|restbase |4 |Service Operations |N|N/A|
|cloudcephmon |3 |WMCS |N|N/A |
|cloudcephosd |3 |WMCS |N|N/A |
|cloudvirt-wdqs |3 |WMCS |N|N/A |
|kubernetes |3 |Service Operations |N|N/A|
|wdqs |3 |Search Platform SREs |N|N/A|
|clouddb |2 |WMCS, with support from DBAs |N|N/A|
|cloudelastic |2 |WMCS/Search Platform SREs |N|N/A|
|cloudnet |2 |WMCS |N|N/A|
|dbproxy               |2    |Data Persistence                                                |dbproxy1014 requires action; dbproxy1015 doesn't| Done |
|druid |2 |Analytics |N|N/A|
|mc |2 |Service Operations |N|N/A|
|ores |2 |Machine Learning SREs |N|N/A|
|snapshot |2 |Service Operations & Platform Engineering |N|N/A|
|thumbor |2 |Service Operations (& Performance) |N|N/A|
|an-conf1001 |1 |Analytics |N|N/A|
|an-coord1001 |1 |Analytics SREs |N|N/A|
|an-launcher1002 |1 |Analytics SREs |N|N/A|
|an-master1002 |1 |Analytics SREs |N|N/A|
|an-presto1004 |1 |Analytics SREs |N|N/A|
|aqs1008 |1 |Analytics SREs |N|N/A|
|atlas-eqiad |1 |Infrastructure Foundations |N|N/A|
|authdns1001 |1 |Traffic |[[https://wikitech.wikimedia.org/wiki/Anycast#How_to_temporarily_depool_a_server|Manual depool]] needed|Complete|
|backup1003 |1 |Data Persistence |N|N/A|
|cloudcontrol1004 |1 |WMCS |N|N/A|
|cloudservices1003 |1 |WMCS |N|N/A|
|conf1005 |1 |Service Operations |N|N/A|
|dbprov1002 |1 |Data Persistence |N|N/A|
|dumpsdata1001 |1 |Service Operations & Platform Engineering |N|N/A|
|gerrit1001 |1 |Service Operations (Supportive Services) & Release Engineering |N|N/A|
|graphite1004 |1 |Observability |N|N/A|
|kafka-jumbo1003 |1 |Analytics SREs |N|N/A|
|kafka-main1002 |1 |Analytics || |
|kubestage1002 |1 |Service Operations |N|N/A|
|labweb1001 |1 |WMCS |N|N/A|
|logstash1011 |1 |Observability |N| |
|lvs1014               |1    |Traffic                                                         |Fail over to the secondary (lvs1016 in row D) by stopping PyBal with puppet disabled (see sketch below)||
|maps1002 |1 | ?? |N|N/A|
|mwmaint1002 |1 |Service Operations |N|N/A|
|pc1008 |1 |SRE Data Persistence (DBAs), with support from Platform and Performance|N|N/A|
|prometheus1004 |1 |Observability |N|N/A|
|puppetmaster1001      |1    |Infrastructure Foundations                                      |Disable puppet fleet-wide (see sketch below)||
|rdb1009 |1 |Service Operations |N|N/A|
|relforge1004 |1 |Search Platform SREs |N|N/A|
|restbase-dev1005 |1 |Platform Engineering ?? |N|N/A|
|stat1007 |1 |Analytics SREs |N|N/A|
|thanos-be1002 |1 |Observability |N|N/A|
|thanos-fe1002 |1 |Observability |N|N/A|
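For convenience, a minimal sketch of the host-side commands for the actions called out in the table above. This is illustrative only: check the relevant runbooks before running anything, the task number in the puppet disable message is a placeholder, and the authdns1001 depool should follow the Anycast page linked in its row.
```
# cp hosts (Traffic): depool/repool via the standard conftool wrapper, on each host
sudo depool                    # before the window
sudo pool                      # once the switches are confirmed healthy

# lvs1014 (Traffic): disable puppet first so PyBal is not restarted mid-window
sudo disable-puppet "eqiad row B switch maintenance - TXXXXXX"
sudo systemctl stop pybal.service
# after the window:
sudo systemctl start pybal.service
sudo enable-puppet "eqiad row B switch maintenance - TXXXXXX"

# puppetmaster1001 (Infrastructure Foundations): fleet-wide disable from a cumin host
sudo cumin 'A:all' 'disable-puppet "eqiad row B switch maintenance - TXXXXXX"'
```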
VMs in this row are as follows:
|VM Name |Ganeti Host |Team |Action Required |Action Status|
|--------------------|----------------|--------------------------------------------------------------|-----------------|-------------|
|an-airflow1002 |ganeti1018 |Analytics SREs |N|N/A|
|an-test-ui1001 |ganeti1015 |Analytics |N|N/A|
|an-tool1008 |ganeti1015 |Analytics SREs |N|N/A|
|an-tool1009 |ganeti1016 |Analytics SREs |N|N/A|
|debmonitor1002 |ganeti1014 |Infrastructure Foundations |N|N/A|
|dragonfly-supernode1001|ganeti1017 |Service Operations |N|N/A|
|failoid1002 |ganeti1018 |Infrastructure Foundations |N|N/A|
|kafka-test1006 |ganeti1014 |Analytics ? |N|N/A|
|kafka-test1007 |ganeti1013 |Analytics ? |N|N/A|
|kafka-test1008 |ganeti1016 |Analytics ? |N|N/A|
|kafka-test1009 |ganeti1013 |Analytics ? |N|N/A |
|kafka-test1010 |ganeti1015 |Analytics ? |N|N/A |
|kubernetes1015 |ganeti1015 |Service Operations |N|N/A|
|kubestagemaster1001 |ganeti1014 |Service Operations |N|N/A|
|kubestagetcd1005 |ganeti1013 |Service Operations |N|N/A|
|kubetcd1006 |ganeti1013 |Service Operations |N|N/A|
|ldap-replica1003 |ganeti1017 |Infrastructure Foundations |Y|Complete|
|logstash1032 |ganeti1013 |Observability |N|N/A|
|ml-etcd1001 |ganeti1014 |ML team |N|N/A|
|ml-serve-ctrl1001 |ganeti1017 |ML team ?? |N|N/A|
|otrs1001 |ganeti1014 |Service Operations ?? |N|N/A |
|schema1003 |ganeti1015 |Analytics SREs & Service Operations |N|N/A|
|zookeeper-test1002 |ganeti1013 |Analytics ?? |N|N/A|
I have listed the teams and subscribed relevant individuals to this task based on the server names and the info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I may have missed, or to remove yourself from the task if you do not need to be involved.
(WARNING) **Please update the tables if action needs to be taken for any servers/VMs. Also record the current status of any required action, and set it to 'Complete' once the work has been done.**
###### Days Before:
- Prepare config changes (netops; see the sketch below)
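For context, a hypothetical sketch of what such a change could look like, assuming QFX-series switches: the real configuration is whatever netops has prepared, and the percentages below are placeholders, not the planned values.
```
# Junos shared-buffer tuning lives under the 'class-of-service shared-buffer' hierarchy
set class-of-service shared-buffer ingress percent 100
set class-of-service shared-buffer egress percent 100
# commit with automatic rollback in case the row becomes unreachable
commit confirmed 5
```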
###### 1h Before Window:
- Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage; see the sketch below) (netops)
- Warn people of the upcoming maintenance (netops)
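One way to capture that snapshot from the Junos CLI, saving each output to a file on the switch for later comparison (file names are placeholders):
```
show ethernet-switching table | save /var/tmp/preB-mac.txt
show arp no-resolve | save /var/tmp/preB-arp.txt
show interfaces terse | save /var/tmp/preB-ports.txt
show interfaces queue | save /var/tmp/preB-buffers.txt
```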
###### After The Change:
- Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage; validate against the prior values, as sketched below) (netops)
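To validate against the prior values, the same outputs can be saved to `postB-*` files and compared, e.g. after copying both snapshots off the switch:
```
# compare pre/post snapshots taken with the commands sketched above
for f in mac arp ports buffers; do
    diff "preB-${f}.txt" "postB-${f}.txt"
done
```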