Change Details

**Planned for Thurs July 22nd at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)**

Netops plan to adjust the buffer memory configuration for all switches in Eqiad Row C, to address tail drops observed on some of the devices, which are causing throughput issues.

This is an intrusive change, and will bring **all traffic on the row to a complete stop** for a short time while the switches reconfigure themselves. All services should have row redundancy, but we may want to take some proactive steps in advance to de-pool servers / make things go smoothly.

The exact duration of the impact is unknown at this time - we will have a much better sense after performing the same action on other rows prior to this change, and will update this task with those results. Best estimate is that it will be on the order of seconds, certainly no longer than a minute, but we should plan for up to a 5-minute interruption, and be aware as always that there is a small chance something will go wrong and cause a longer disturbance.
The complete list of servers in this row can be found here: https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=7&status=active&role=server

Summary of hosts by type here:

|Server Name / Prefix |Count |Relevant Team |Action Required |Action Status|
|---------------------|------|--------------|----------------|-------------|
|mw |39 |Service Operations |N|N/A|
|db |23 |Data Persistence |N|N/A|
|an-worker |16 |Analytics SREs |N|N/A|
|elastic |9 |Search Platform SREs |N|N/A|
|ms-be |7 |Data Persistence (Media Storage) |N|N/A|
|wtp |6 |Service Operations |N|N/A|
|analytics |5 |Analytics SREs |N|N/A|
|mc |5 |Service Operations |N|N/A|
|cp |4 |Traffic |depool the individual hosts with the `depool` command||
|dbproxy |4 |Data Persistence |dbproxy1018 and dbproxy1019 owned by #cloud-services-team, dbproxy1020 requires action after row D is done, dbproxy1021 doesn't |dbproxy1020 has been depooled by #DBA|
|ganeti |4 |Infrastructure Foundations |N|N/A|
|es |3 |Data Persistence |N|N/A|
|kafka-jumbo |3 |Analytics SREs & Infrastructure Foundations |N|N/A|
|kubernetes |3 |Service Operations |N|N/A|
|clouddb |2 |WMCS, with support from DBAs |Y||
|labstore |2 |WMCS |Y||
|ms-fe |2 |Data Persistence (Media Storage) |N|N/A|
|ores |2 |Machine Learning SREs |N|N/A|
|wdqs |2 |Search Platform SREs |N|N/A|
|alert1001 |1 |Observability |Y - To be switched over ahead of maintenance||
|an-conf1002 |1 |Analytics |N|N/A|
|an-druid1002 |1 |Analytics |N|N/A|
|an-test-master1002 |1 |Analytics |N|N/A|
|an-test-worker1002 |1 |Analytics |N|N/A|
|aqs1005 |1 |Analytics SREs |N|N/A|
|backup1002 |1 |Data Persistence |Heads up to Jaime before|N/A|
|cloudcontrol1005 |1 |WMCS || |
|cloudelastic1003 |1 |WMCS || |
|cloudmetrics1001 |1 |WMCS |N|N/A|
|cumin1001 |1 |Infrastructure Foundations |Y|Tell other SREs to use cumin2002 instead that day. Announce it a few days earlier, as some cookbooks take days to run.|
|dbprov1003 |1 |Data Persistence |N|N/A|
|dbstore1005 |1 |Analytics SREs & Data Persistence |N|N/A|
|druid1002 |1 |Analytics |N|N/A|
|dumpsdata1003 |1 |Service Operations & Platform Engineering |Y||
|kafka-main1003 |1 |SRE |Y - Keith to depool in advance||
|lvs1015 |1 |Traffic |Failover to secondary (lvs1016 in row D) by stopping pybal with puppet disabled||
|maps1003 |1 | |N|N/A|
|mc-gp1002 |1 |Service Operations |N|N/A|
|ms-backup1002 |1 |Data Persistence ?? |N|N/A|
|mwlog1002 |1 |Service Operations |N|N/A|
|pc1009 |1 |SRE Data Persistence (DBAs), with support from Platform and Performance |N|N/A|
|sessionstore1002 |1 |Service Operations |N|N/A|
|thanos-be1003 |1 |Observability |N|N/A|
|thanos-fe1003 |1 |Observability |N|N/A|

VMs on this row are as follows:

|VM Name |Ganeti Host |Team |Action Required |Action Status|
|--------|------------|-----|----------------|-------------|
|acmechief-test1001 |ganeti1009 |Traffic |N|N/A|
|acmechief1001 |ganeti1009 |Traffic |Y - disable puppet on acme_chief clients||
|an-airflow1001 |ganeti1010 |Analytics SREs |N|N/A|
|an-tool1005 |ganeti1009 |Analytics SREs |N|N/A|
|an-tool1007 |ganeti1010 |Analytics SREs |N|N/A|
|doc1002 |ganeti1009 | |N|N/A|
|doh1001 |ganeti1011 |Traffic |N|N/A|
|etherpad1002 |ganeti1009 |Service Operations |N|N/A|
|flowspec1001 |ganeti1009 |Infrastructure Foundations |N|N/A|
|idp-test1001 |ganeti1010 |Infrastructure Foundations |N|N/A|
|kubemaster1002 |ganeti1010 |Service Operations |N|N/A|
|kubernetes1006 |ganeti1009 |Service Operations |N|N/A|
|kubestagetcd1006 |ganeti1012 |Service Operations |N|N/A|
|kubetcd1004 |ganeti1010 |Service Operations |N|N/A|
|logstash1009 |ganeti1010 |Observability |N|N/A|
|logstash1025 |ganeti1009 |Observability |N|N/A|
|matomo1002 |ganeti1009 |Analytics |N|N/A|
|miscweb1002 |ganeti1009 |Service Operations |N|N/A|
|ml-etcd1002 |ganeti1012 |ML team |N|N/A|
|mwdebug1001 |ganeti1010 |Service Operations |N|N/A|
|mx1001 |ganeti1009 |Infrastructure Foundations |N|N/A|
|ncredir1001 |ganeti1009 |Traffic |N|N/A|
|netflow1001 |ganeti1012 |Infrastructure Foundations |N|N/A|
|orespoolcounter1004 |ganeti1010 |Machine Learning SREs |N|N/A|
|ping1001 |ganeti1009 |Infrastructure Foundations |Y||
|poolcounter1005 |ganeti1010 |Service Operations |Y||
|puppetboard1001 |ganeti1010 |Infrastructure Foundations |N|N/A|
|puppetdb1002 |ganeti1012 |Infrastructure Foundations |Y (disable Puppet fleet-wide during maintenance)||
|registry1004 |ganeti1009 |Service Operations |N|N/A|
|rpki1001 |ganeti1009 |Infrastructure Foundations |N|N/A|
|seaborgium |ganeti1010 |Infrastructure Foundations |N|N/A|
|urldownloader1002 |ganeti1010 |Infrastructure Foundations |Y||

I have listed the teams, and subscribed relevant individuals to this task, based mostly on the server names and info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I could have missed, or remove yourself from the task if you do not need to be involved.

(WARNING) **Kindly update the tables if action needs to be taken for any servers/VMs. Please also list the current status of action if required, and set status to 'Complete' once work has been done.**

###### Days Before:

- Prepare config changes (netops)

###### 1h Before Window:

- Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage) (netops)
- Warn people of the upcoming maintenance (netops)
- Depool ping1001 ([[ https://wikitech.wikimedia.org/wiki/Ping_offload#Temporarily_stop_the_ICMP_echo_redirect | doc ]]) (netops)

###### After The Change

- Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values). (netops)
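The "validate against prior values" step could be approached along the following lines. This is only a minimal sketch, not the tooling netops actually use: it assumes the before/after snapshots have already been collected and parsed into plain dictionaries, and the table names, MAC addresses and interface names below are hypothetical placeholders.

```python
from typing import Dict, List

# A snapshot maps a table name (e.g. "mac", "arp", "ports") to its entries.
# This layout is an assumption made for illustration only.
Snapshot = Dict[str, Dict[str, str]]


def diff_snapshots(before: Snapshot, after: Snapshot) -> List[str]:
    """Return human-readable differences between two switch snapshots."""
    changes: List[str] = []
    for table in sorted(set(before) | set(after)):
        old = before.get(table, {})
        new = after.get(table, {})
        for key in sorted(set(old) | set(new)):
            if key not in new:
                changes.append(f"{table}: {key} missing after change (was {old[key]})")
            elif key not in old:
                changes.append(f"{table}: {key} new after change ({new[key]})")
            elif old[key] != new[key]:
                changes.append(f"{table}: {key} changed {old[key]} -> {new[key]}")
    return changes


# Hypothetical sample data: one MAC entry disappears and one port goes down.
before = {
    "mac": {"aa:bb:cc:00:00:01": "xe-1/0/1", "aa:bb:cc:00:00:02": "xe-1/0/2"},
    "ports": {"xe-1/0/1": "up", "xe-1/0/2": "up"},
}
after = {
    "mac": {"aa:bb:cc:00:00:01": "xe-1/0/1"},
    "ports": {"xe-1/0/1": "up", "xe-1/0/2": "down"},
}

for line in diff_snapshots(before, after):
    print(line)
```

An empty result would suggest the row came back in the same state as before the change. Some entries (for example MAC or ARP entries that simply aged out) may differ for benign reasons, so any reported difference is a prompt for a closer look rather than proof of a fault.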

**Planned for Thurs July 22nd at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST)**

Netops plan to adjust the buffer memory configuration for all switches in Eqiad Row C, to address tail drops observed on some of the devices, which are causing throughput issues.

This is an intrusive change, and will bring **all traffic on the row to a complete stop** for a short time while the switches reconfigure themselves. All services should have row redundancy, but we may want to take some proactive steps in advance to de-pool servers / make things go smoothly.

The exact duration of the impact is unknown at this time - we will have a much better sense after performing the same action on other rows prior to this change, and will update this task with those results. Best estimate is that it will be on the order of seconds, certainly no longer than a minute, but we should plan for up to a 5-minute interruption, and be aware as always that there is a small chance something will go wrong and cause a longer disturbance.
The complete list of servers in this row can be found here: https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=7&status=active&role=server

Summary of hosts by type here:

|Server Name / Prefix |Count |Relevant Team |Action Required |Action Status|
|---------------------|------|--------------|----------------|-------------|
|mw |39 |Service Operations |N|N/A|
|db |23 |Data Persistence |N|N/A|
|an-worker |16 |Analytics SREs |N|N/A|
|elastic |9 |Search Platform SREs |N|N/A|
|ms-be |7 |Data Persistence (Media Storage) |N|N/A|
|wtp |6 |Service Operations |N|N/A|
|analytics |5 |Analytics SREs |N|N/A|
|mc |5 |Service Operations |N|N/A|
|cp |4 |Traffic |depool the individual hosts with the `depool` command||
|dbproxy |4 |Data Persistence |dbproxy1018 and dbproxy1019 owned by #cloud-services-team, dbproxy1020 requires action after row D is done, dbproxy1021 doesn't |dbproxy1020 has been depooled by #DBA|
|ganeti |4 |Infrastructure Foundations |N|N/A|
|es |3 |Data Persistence |N|N/A|
|kafka-jumbo |3 |Analytics SREs & Infrastructure Foundations |N|N/A|
|kubernetes |3 |Service Operations |N|N/A|
|clouddb |2 |WMCS, with support from DBAs |Y||
|labstore |2 |WMCS |Y||
|ms-fe |2 |Data Persistence (Media Storage) |N|N/A|
|ores |2 |Machine Learning SREs |N|N/A|
|wdqs |2 |Search Platform SREs |N|N/A|
|alert1001 |1 |Observability |Y - To be switched over ahead of maintenance||
|an-conf1002 |1 |Analytics |N|N/A|
|an-druid1002 |1 |Analytics |N|N/A|
|an-test-master1002 |1 |Analytics |N|N/A|
|an-test-worker1002 |1 |Analytics |N|N/A|
|aqs1005 |1 |Analytics SREs |N|N/A|
|backup1002 |1 |Data Persistence |Heads up to Jaime before|N/A|
|cloudcontrol1005 |1 |WMCS || |
|cloudelastic1003 |1 |WMCS || |
|cloudmetrics1001 |1 |WMCS |N|N/A|
|cumin1001 |1 |Infrastructure Foundations |Y|Tell other SREs to use cumin2002 instead that day. Announce it a few days earlier, as some cookbooks take days to run.|
|dbprov1003 |1 |Data Persistence |N|N/A|
|dbstore1005 |1 |Analytics SREs & Data Persistence |N|N/A|
|druid1002 |1 |Analytics |N|N/A|
|dumpsdata1003 |1 |Service Operations & Platform Engineering |Y||
|kafka-main1003 |1 |SRE |Y - Keith to depool in advance||
|lvs1015 |1 |Traffic |Failover to secondary (lvs1016 in row D) by stopping pybal with puppet disabled||
|maps1003 |1 | |N|N/A|
|mc-gp1002 |1 |Service Operations |N|N/A|
|ms-backup1002 |1 |Data Persistence ?? |N|N/A|
|mwlog1002 |1 |Service Operations |N|N/A|
|pc1009 |1 |SRE Data Persistence (DBAs), with support from Platform and Performance |N|N/A|
|sessionstore1002 |1 |Service Operations |N|N/A|
|thanos-be1003 |1 |Observability |N|N/A|
|thanos-fe1003 |1 |Observability |N|N/A|

VMs on this row are as follows:

|VM Name |Ganeti Host |Team |Action Required |Action Status|
|--------|------------|-----|----------------|-------------|
|acmechief-test1001 |ganeti1009 |Traffic |N|N/A|
|acmechief1001 |ganeti1009 |Traffic |Y - disable puppet on acme_chief clients||
|an-airflow1001 |ganeti1010 |Analytics SREs |N|N/A|
|an-tool1005 |ganeti1009 |Analytics SREs |N|N/A|
|an-tool1007 |ganeti1010 |Analytics SREs |N|N/A|
|doc1002 |ganeti1009 | |N|N/A|
|doh1001 |ganeti1011 |Traffic |N|N/A|
|etherpad1002 |ganeti1009 |Service Operations |N|N/A|
|flowspec1001 |ganeti1009 |Infrastructure Foundations |N|N/A|
|idp-test1001 |ganeti1010 |Infrastructure Foundations |N|N/A|
|kubemaster1002 |ganeti1010 |Service Operations |N|N/A|
|kubernetes1006 |ganeti1009 |Service Operations |N|N/A|
|kubestagetcd1006 |ganeti1012 |Service Operations |N|N/A|
|kubetcd1004 |ganeti1010 |Service Operations |N|N/A|
|logstash1009 |ganeti1010 |Observability |N|N/A|
|logstash1025 |ganeti1009 |Observability |N|N/A|
|matomo1002 |ganeti1009 |Analytics |N|N/A|
|miscweb1002 |ganeti1009 |Service Operations |N|N/A|
|ml-etcd1002 |ganeti1012 |ML team |N|N/A|
|mwdebug1001 |ganeti1010 |Service Operations |N|N/A|
|mx1001 |ganeti1009 |Infrastructure Foundations |N|N/A|
|ncredir1001 |ganeti1009 |Traffic |N|N/A|
|netflow1001 |ganeti1012 |Infrastructure Foundations |N|N/A|
|orespoolcounter1004 |ganeti1010 |Machine Learning SREs |N|N/A|
|ping1001 |ganeti1009 |Infrastructure Foundations |Y||
|poolcounter1005 |ganeti1010 |Service Operations |N|N/A|
|puppetboard1001 |ganeti1010 |Infrastructure Foundations |N|N/A|
|puppetdb1002 |ganeti1012 |Infrastructure Foundations |Y (disable Puppet fleet-wide during maintenance)||
|registry1004 |ganeti1009 |Service Operations |N|N/A|
|rpki1001 |ganeti1009 |Infrastructure Foundations |N|N/A|
|seaborgium |ganeti1010 |Infrastructure Foundations |N|N/A|
|urldownloader1002 |ganeti1010 |Infrastructure Foundations |N|N/A|

I have listed the teams, and subscribed relevant individuals to this task, based mostly on the server names and info here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions. Don't hesitate to add people I could have missed, or remove yourself from the task if you do not need to be involved.

(WARNING) **Kindly update the tables if action needs to be taken for any servers/VMs. Please also list the current status of action if required, and set status to 'Complete' once work has been done.**

###### Days Before:

- Prepare config changes (netops)

###### 1h Before Window:

- Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage) (netops)
- Warn people of the upcoming maintenance (netops)
- Depool ping1001 ([[ https://wikitech.wikimedia.org/wiki/Ping_offload#Temporarily_stop_the_ICMP_echo_redirect | doc ]]) (netops)

###### After The Change

- Confirm switches are in a healthy state (snapshot MAC and ARP tables, port status, buffer usage, validate against prior values). (netops)
