The network stacks in eqiad rows C and D are being upgraded to all 10G-capable switches. As part of this migration, every system on the old switches must be moved to the new switch stack.
In previous migrations, we've stepped through the racks one by one, requiring each sub-team to be present for all affected hosts on the day of the migration. To scale better with the needs and schedules of multiple teams, we're planning to handle this migration slightly differently: rather than setting a single date for each rack, we're providing each sub-team with a list of all of its affected hosts, and that sub-team can then provide feedback on the priority and scheduling of the migration of those hosts.
Scheduling Options and Considerations:
* Provide priority groups for the hosts below, and we can move group 1, then group 2, etc.
* Provide specific dates and times, and we can coordinate the migration of the required host(s)
* A mix of the above: easier hosts can go into priority groups, while high-priority or critical hosts get specific dates/times
The checklist of migration steps for each host is still being developed, so it won't be pasted into each host's task in advance of the move (any adjustment to the checklist would mean a lot of tasks to update).
The host list is also available on the [[ https://docs.google.com/spreadsheets/d/13ow4JxrsQdz8KSsdBBNwvlrAuGKo8OHWcnR4RhXTYc0/edit?usp=sharing | Google Sheet listing of all affected hosts ]].
Host(s) List:
|**Host**|**Rack**|**Notes regarding role or host**|**Action required**|
|~~an-backup-datanode1001~~ |`D2`| decommissioned| None
|~~an-backup-datanode1003~~ |`D7`| decommissioned| None
|~~an-backup-datanode1034~~ |`C2`| decommissioned| None
|~~an-backup-datanode1035~~ |`C4`| decommissioned| None
|~~an-backup-datanode1036~~ |`C4`| decommissioned| None
|~~an-backup-datanode1037~~ |`C4`| decommissioned| None
|~~an-backup-datanode1038~~ |`C4`| decommissioned| None
|~~an-backup-datanode1039~~ |`C7`| decommissioned| None
|~~an-backup-datanode1040~~ |`C7`| decommissioned| None
|~~an-backup-datanode1041~~ |`C2`| decommissioned| None
|~~an-backup-datanode1042~~ |`D2`| decommissioned| None
|~~an-backup-datanode1043~~ |`D4`| decommissioned| None
|~~an-backup-datanode1044~~ |`D4`| decommissioned| None
|~~an-backup-datanode1045~~ |`D7`| decommissioned| None
|~~an-backup-datanode1046~~ |`D7`| decommissioned| None
|~~an-backup-namenode1001~~ |`C4`| decommissioned| None
|~~an-backup-namenode1002~~ |`D7`| decommissioned| None
|an-conf1006 |`C3`| 1/3 nodes - should be fine to take down a single node at a time| None
|an-druid1005 |`D3`| 1/5 nodes - should be fine to take down a single node at a time| None
|an-master1003 |`C2`| Primary HDFS namenode of the production Hadoop cluster|Will probably need a [[https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Manual_Failover|manual failover]] of the HDFS namenode service to `an-master1004` (see the failover sketch below the table)
|an-master1004 |`D7`| Standby HDFS namenode of the production Hadoop cluster|Will need to ensure that `an-master1003` is the active namenode, or carry out a [[https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Manual_Failover|manual failover]] if not.
|an-presto1019 |`C4`| 1/15 nodes - should be fine to take down a single node at a time| None
|an-presto1020 |`D2`| 1/15 nodes - should be fine to take down a single node at a time| None
|an-redacteddb1001 |`D2`| SPOF. Analytics dedicated copy of the wikireplicas|Avoid first few days of the month, if possible
|an-test-coord1001 |`D3`| SPOF, but only on the test Hadoop cluster - not critical| None
|an-test-master1002 |`C3`| Standby HDFS namenode of the test Hadoop cluster| Will need to ensure that `an-test-master1001` is the active namenode, or carry out a [[https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Manual_Failover|failover]] if not.
|an-test-worker1002 |`C5`| 1/3 nodes - should be fine to take down a single node at a time| None
|an-test-worker1003 |`D6`| 1/3 nodes - should be fine to take down a single node at a time| None
|~~an-worker1088~~ |`C2`| decommissioned| None
|~~an-worker1089~~ |`C4`| decommissioned| None
|~~an-worker1090~~ |`C4`| decommissioned| None
|~~an-worker1091~~ |`C7`| decommissioned| None
|~~an-worker1092~~ |`D2`| decommissioned| None
|~~an-worker1093~~ |`D2`| decommissioned| None
|~~an-worker1094~~ |`D7`| decommissioned| None
|~~an-worker1095~~ |`D7`| decommissioned| None
|an-worker1131 |`C2`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1132 |`C2`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1133 |`C7`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1134 |`D4`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1135 |`D4`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1136 |`D4`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1137 |`D4`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1138 |`D4`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1151 |`C7`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1152 |`D7`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1175 |`D2`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1180 |`C7`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1209 |`D2`| 1/~120 nodes - should be fine to take down a single node at a time| None
|an-worker1234 |`C2`| 1/~120 nodes - should be fine to take down a single node at a time| None
|~~druid1008~~ |`D6`| decommissioned| None
|druid1012 |`C2`| 1/5 nodes - should be fine to take down a single node at a time|Ideally, `depool` before and `pool` afterwards (see sketch below the table)
|druid1013 |`D2`| 1/5 nodes - should be fine to take down a single node at a time|Ideally, `depool` before and `pool` afterwards
|stat1011 |`C7`| SPOF - one of the analytics clients| Will require notifying users of downtime ahead of time
|dse-k8s-worker1003 |`C4`| 1/19 k8s worker hosts| Will require draining the host with the `sre.k8s.pool-depool-node` cookbook (see sketch below the table)
|dse-k8s-worker1004 |`D4`| 1/19 k8s worker hosts| Will require draining the host with the `sre.k8s.pool-depool-node` cookbook
|dse-k8s-worker1010 |`D6`| 1/19 k8s worker hosts| Will require draining the host with the `sre.k8s.pool-depool-node` cookbook
|dse-k8s-worker1011 |`C5`| 1/19 k8s worker hosts| Will require draining the host with the `sre.k8s.pool-depool-node` cookbook
|dse-k8s-worker1013 |`C5`| 1/19 k8s worker hosts| Will require draining the host with the `sre.k8s.pool-depool-node` cookbook
|dse-k8s-worker1018 |`D3`| 1/19 k8s worker hosts| Will require draining the host with the `sre.k8s.pool-depool-node` cookbook
|dse-k8s-worker1019 |`C6`| 1/19 k8s worker hosts| Will require draining the host with the `sre.k8s.pool-depool-node` cookbook
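For the namenode hosts above (`an-master100x`, `an-test-master100x`), a rough sketch of the active/standby check and failover follows. The authoritative procedure is the linked [[https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Manual_Failover|Manual Failover]] wiki page; the HA service IDs used here are illustrative assumptions, not confirmed values.

```
# Check which namenode is currently active. The service IDs below are
# illustrative; the real ones are defined in hdfs-site.xml (dfs.ha.namenodes.*).
# On a kerberized cluster these commands may need to be wrapped with
# kerberos-run-command.
sudo -u hdfs hdfs haadmin -getServiceState an-master1003-eqiad
sudo -u hdfs hdfs haadmin -getServiceState an-master1004-eqiad

# If the host being moved is currently the active namenode, fail over to the
# standby first (failover FROM the first service ID TO the second).
sudo -u hdfs hdfs haadmin -failover an-master1003-eqiad an-master1004-eqiad
```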
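For the druid hosts, a minimal sketch of the depool/pool step, assuming the standard WMF `pool`/`depool` conftool wrappers are installed on these hosts and run locally:

```
# Before the switch move: remove the host from the load-balancer pools
sudo depool

# ... host is migrated to the new switch ...

# Once services are confirmed healthy on the new switch
sudo pool
```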
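For the dse-k8s workers, a sketch of draining and repooling with the cookbook named above. The argument order here is an assumption; verify the real syntax with the cookbook's `--help` output before running.

```
# From a cluster-management (cumin) host. Argument order is illustrative;
# verify with: sudo cookbook sre.k8s.pool-depool-node --help
sudo cookbook sre.k8s.pool-depool-node depool dse-k8s-worker1003.eqiad.wmnet

# ... switch migration ...

sudo cookbook sre.k8s.pool-depool-node pool dse-k8s-worker1003.eqiad.wmnet
```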