= codfw row B switches upgrade =
For reasons detailed in {T327248} we're going to upgrade codfw row B switches.
This is scheduled for **Feb 21st - 14:00-16:00 UTC**, please let us know if there is any issue with the scheduled time.
If everything goes well, it means a !!30min hard downtime!! for the whole row. It is also a good opportunity to test the hosts' depool mechanisms and service redundancy.
The impacted servers and teams for this row are listed below.
The "action needed" fields are quite free-form; for example:
* write `NONE` if no action is needed,
* the cookbook/command to run if it can be done by a third party,
* who will be around to take care of the depool,
* a link to the relevant doc,
* etc.
The two main types of actions needed are depooling and monitoring downtime.
NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark `NONE` in the table) so there are fewer moving parts closer to the maintenance window.
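As a rough sketch of the two action types (host list, duration, and reason below are illustrative, not the actual plan for any specific team):

```shell
# Hypothetical example: downtime and depool a set of hosts ahead of the window.
# Schedule an Icinga downtime via the cookbook (run from a cumin host):
sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row B switch upgrade T327248" \
    'restbase[2013-2014,2019,2021,2024].codfw.wmnet'

# On each affected host, remove it from its LVS pools:
sudo depool

# After the maintenance window, re-add it:
sudo pool
```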
== Core Platform ==
#core-platform-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|maps[2006,2009]| | | |
|restbase[2013-2014,2019,2021,2024]| depool | repool | |
|sessionstore2001| `confctl --object-type discovery select 'dnsdisc=sessionstore,name=codfw' set/pooled=false` | `confctl --object-type discovery select 'dnsdisc=sessionstore,name=codfw' set/pooled=true` | Out of an abundance of caution, we should depool the datacenter prior to maintenance (see: https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues) |
|thumbor[2003-2004]| | | |
== Search Platform ==
#discovery-search
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080]| | | |
|wcqs2001| | | |
|wdqs[2005,2007,2010]| | | |
== ServiceOps-Collab ==
#serviceops-collab
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|gitlab2003| `NONE` | `NONE` | insetup|
|gitlab-runner2002| pause gitlab-runner in [admin interface](https://gitlab.wikimedia.org/admin/runners/311#/) | unpause gitlab-runner in [admin interface](https://gitlab.wikimedia.org/admin/runners/311#/) | paused/depooled by @Jelto |
|miscweb2002| NONE | NONE | |
|releases2002| NONE | NONE | |
|contint2002| NONE | NONE | |
== Infrastructure Foundations ==
#infrastructure-foundations
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|apt2001|none | | |
|bast2002|none | | |
|ganeti[2019-2022,2031-2032]|none | | |
|install2003|none | | |
|mx2001|none | | |
|pki2002|none |none |none |
|puppetmaster2003|[[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/890792 | offline puppet master ]] |[[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/890792 | revert offline change ]] | server offline |
|serpens|none | | |
|urldownloader2002|failed over to 2001 | not needed | done |
== Observability ==
#sre_observability
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|centrallog2002| set downtimes | sync rotated logs for day of maint from eqiad (on the following day) | |
|graphite2004| none | none | n/a |
|kafka-logging[2002,2004]| set downtimes, stop kafka service | start kafka service, ensure dashboard returns to green | |
|logstash[2024-2025,2027,2034,2036]| conftool 202[45] disable shard allocation 2027,203[46] | conftool 202[45] allow shard allocation 2027,203[46] | draining shards, hosts depooled, downtime scheduled |
|prometheus2005| `depool` on the host | `pool` | repooled |
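The shard-allocation toggle mentioned for the logstash hosts corresponds to the standard Elasticsearch/OpenSearch cluster setting; a sketch of the underlying API calls (endpoint and port are assumptions, not taken from this page):

```shell
# Before the window: stop shard allocation so the cluster does not try to
# rebalance while the row is down (only primaries stay allocatable).
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.enable": "primaries"}}'

# After the window: re-enable full shard allocation.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'
```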
== Observability and Data Persistence ==
#sre_observability #data-persistence
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|thanos-fe2002| conftool depool, while making sure another thanos-fe host is pooled for service thanos-web | conftool pool, while making sure another thanos-fe host is pooled for service thanos-web | repooled |
== Traffic ==
#traffic
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|cp[2031-2034]| | | |
|doh2002|disable puppet and stop bird.service | | depooled|
|lvs2008| | | |
|ncredir2002| | | |
|pybal-test[2001-2003]| | | |
== Data Engineering ==
#data-engineering
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|aqs[2005-2008]| None | None | |
|furud| | | |
== Machine Learning ==
#machine-learning-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|ml-cache2002|-|- | |
|ml-etcd2001|-|-| |
|ml-serve[2002,2006]|-|-| |
|ml-staging-ctrl2001|-|-| |
|ml-staging-etcd2002|-|-| |
|ores[2003-2004]|`sudo -i depool` | `sudo -i pool` | |
|orespoolcounter2004|-|-| |
== WMCS ==
#cloud-services-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|cloudcephmon[2004-2006]-dev| | | |
|cloudcephosd[2001-2003]-dev| | | |
|cloudcontrol[2001,2005]-dev| | | |
|clouddb[2001-2002]-dev| | | |
|cloudgw[2001-2003]-dev| | | |
|cloudnet[2005-2006]-dev| | | |
|cloudservices[2004-2005]-dev| | | |
|cloudvirt[2001-2003]-dev| | | |
|cloudweb2002-dev| | | |
== Data Persistence ==
#data-persistence
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|backup[2005,2008]| They are not a service, but storage. Jaime will make sure earlier in the week they are not active at the time of the maintenance. | Jaime will restart some delayed backups, if any. | |
|cassandra-dev2001| None | None | |
|db[2096,2098,2107-2111,2123-2124,2134,2137,2143,2147-2148,2159-2164,2177-2178]|Nothing, as codfw will be depooled | |Nothing needed, as codfw will be depooled. db2134 (m3 master) can be ignored |
|dbprov2002| They are not a service, but storage. Jaime will make sure earlier in the week they are not active at the time of the maintenance. | None | |
|dbproxy2002|None |Reload haproxy | |
|es[2021,2025,2029-2030]|Nothing as codfw will be depooled | | |
|moss-be2002|N/A |N/A |Not in production service |
|ms-be[2041,2046-2047,2053,2057,2063,2067]|None |None | |
|ms-fe2010|`sudo depool` |`sudo pool` |@MatthewVernon is away, so @fgiunchedi will handle|
|pc2012|Nothing as codfw will be depooled | | |
|thanos-be2002|None |None | |
== ServiceOps ==
#serviceops
Due to the large number of services potentially affected (multiple mw appservers, kubernetes workers), a global depool of active/active (a/a) services will be done:
`cookbook sre.discovery.datacenter depool --reason T327991 codfw`
After the maintenance, check the state of {T329664} and repool:
`cookbook sre.discovery.datacenter pool --reason T327991 codfw`
Depool restbase-async from eqiad:
`cookbook sre.discovery.service-route --reason T327991 depool --wipe-cache eqiad restbase-async`
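To verify the discovery state before and after the window, a read-only confctl query can be used (a sketch; the confctl syntax matches the sessionstore entry above):

```shell
# Hypothetical verification step, not part of the documented plan:
# list the pooled/depooled state of all codfw discovery records.
confctl --object-type discovery select 'name=codfw' get
```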
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|conf2004| | | |
|contint2002| | | |
|kafka-main2002| | | |
|kubemaster2002| | | |
|kubernetes[2006,2009-2010,2020,2023]| | | |
|kubestage2002| | | |
|kubestagetcd2001| | | |
|kubetcd2006| | | |
|mc[2042-2046]| | | |
|mc-gp2002| | | |
|mw[2259-2270,2310-2334]| | | |
|mwdebug2002| | | |
|parse[2006-2010]| | | |
|poolcounter2004| | | |
|rdb2008| | | |
|registry2004| | | |