
codfw: Dedicate Rack B1 for cloudX-dev servers
Closed, Resolved · Public

Description

We are planning to dedicate a rack to WMCS nodes only; for this reason we need to relocate some nodes from Rack B1 into other racks within the same row. No IP address changes are needed. Please see below for the list of nodes that we will be relocating. Thanks

Service owners: please update the "Ready for relocation?" column with YES once the server is depooled.

| Node | Current location | New location | Date/time | Switch port | Ready for relocation? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| ganeti2019 | B1/U1 | B8/U20 | April 11th 9:30am CT | ge-8/0/19 | yes | |
| ganeti2020 | B1/U2 | B8/U21 | April 14th 9:30am | ge-8/0/20 | yes | |
| maps2006 | B1/U7 | B5/U36 | April 14th 9:30am CT | ge-5/0/35 | yes | |
| db2076 | B1/U13 | B5/U30 | April 11th 9:30am CT | ge-5/0/29 | yes | |
| db2086 | B1/U14 | B5/U31 | April 11th 9:30am CT | ge-5/0/30 | yes | |
| db2107 | B1/U36 | B5/U32 | April 11th 9:30am CT | ge-5/0/31 | yes | |
| db2137 | B1/U4 | B5/U33 | April 11th 9:30am CT | ge-5/0/32 | yes | |
| db2143 | B1/U12 | B5/U34 | April 11th 9:30am CT | ge-5/0/33 | yes | |
| db2147 | B1/U18 | B5/U35 | April 11th 9:30am CT | ge-5/0/34 | yes | |
| mc2023 | B1/U31 | B6/U34 | April 14th 9:30am CT | ge-6/0/33 | yes | |
| es2029 | B1/U8 | B8/U31 | April 11th 9:30am CT | ge-8/0/30 | yes | |
| es2030 | B1/U10 | B8/U33 | April 11th 9:30am CT | ge-8/0/32 | yes | |
| restbase2021 | B1/U19 | B3/U21 | April 14th 9:30am CT | ge-3/0/21 | | |
| kubestage2002 | B1/U5 | B8/U27 | April 14th 9:30pm CT | ge-8/0/26 | yes | |
| rdb2008 | B1/U6 | B6/U32 | April 11th 12:30pm CT | ge-6/0/31 | yes | |
| wcqs2001 | B1/U21 | B6/U33 | April 11th 9:30am CT | ge-6/0/32 | yes | |

Relocate all cloud nodes in other racks in rows B, C, and D to rack B1

| Node | Current location | New location | Date/time | Switch port | Ready for relocation? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| cloudvirt2001-dev | B3/U29 | B1/U4 | April 18th 9:30am | ge-1/0/6 | yes | |
| cloudvirt2002-dev | B5/U27 | B1/U2 | April 18th 9:30am | ge-1/0/[2-3] | yes | |
| cloudnet2002-dev | B5/U14 | B1/U16 | April 18th 9:30am | ge-1/0/[16-17] | yes | |
| cloudgw2002-dev | B5/U12 | B1/U7 | April 18th 9:30am | ge-1/0/[19-20] | yes | |
| cloudcephmon2002-dev | B5/U11 | B1/U6 | April 18th 9:30am | ge-1/0/10 | yes | |
| cloudcephosd2002-dev | B5/U9 | B1/U1 | April 18th 9:30am | ge-1/0/[0-1] | yes | eno2 |
| cloudvirt2003-dev | B8/U3 | B1/U6 | May 12th 9:30am | ge-1/0/[10/13] | yes | |
| clouddb2001-dev | B8/U1 | | | | | will be replaced (T306854) |
| cloudcephosd2003-dev | B8/U7 | B1/U5 | April 18th 9:30am | ge-1/0/[7-8] | yes | |
| cloudcephmon2003-dev | B8/U8 | | | | | will be replaced (T304881, decom) |
| cloudcontrol2003-dev | C1/U14 | | | | | on public VLAN, no move needed |
| cloudcontrol2004-dev | D1/U17 | | | | | on public VLAN, no move needed |
| cloudservices2002-dev | C1/U15 | | | | | will be replaced (T304881, decom) |
| cloudservices2003-dev | D1/U16 | | | | | will be replaced (T304881, decom) |

Event Timeline

Papaul updated the task description.

@Papaul the databases and es2029/es2030 are ready for relocation. Please turn them ON once you are done
For what it's worth, es2029 and es2030 are scheduled for the 14th, which is a bank holiday for me, so someone else would need to bring MySQL back up once that is done.

I can take care of that.

Papaul updated the task description.
Papaul updated the task description.

Changing the tag as our DBA part here is done. If there's anything else required, I am still subscribed to the task.

Icinga downtime and Alertmanager silence (ID=dc2c981d-aef2-4a2b-9d24-2e3ca912b985) set by bking@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: physically moving host

wcqs2001.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=9620461c-f770-40dd-99d6-2b4f895a2549) set by akosiaris@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: moving to a different rack

rdb2008.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=6e1e84ea-fac8-4dde-be55-1bf6ea935f75) set by akosiaris@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: moving to a different rack

kubestage2002.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=5293ea70-e1a3-4862-ae77-82e8abf9cdd4) set by akosiaris@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: moving to a different rack

mc2023.codfw.wmnet
akosiaris subscribed.

I marked rdb2008, kubestage2002, and mc2023 as YES in the table. rdb2008 is the secondary, not the primary; kubestage2002 is for the staging cluster anyway; and mc2023 will be handled by mcrouter's configuration, with shard05 moving to the mc-gp* hosts (gutter pool).
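For context, the gutter-pool behavior described above relies on mcrouter's failover routing: when the normal shard is unreachable, requests are redirected to the gutter hosts with a short TTL. A minimal sketch of that kind of configuration, using mcrouter's standard `FailoverWithExptimeRoute` (pool names and addresses here are hypothetical placeholders, not the actual production config):

```json
{
  "pools": {
    "shard05": { "servers": [ "10.0.0.1:11211" ] },
    "gutter":  { "servers": [ "10.0.0.2:11211" ] }
  },
  "route": {
    "type": "FailoverWithExptimeRoute",
    "normal": "PoolRoute|shard05",
    "failover": "PoolRoute|gutter",
    "failover_exptime": 600
  }
}
```

With a config along these lines, taking mc2023 (shard05) offline makes mcrouter transparently serve that shard from the gutter pool, with failover entries expiring after `failover_exptime` seconds so stale data does not linger once the host returns.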

Papaul updated the task description.

I cannot power down kubestage2002:

W: aborting poweroff due to 30-query-hostname exiting with code 1.
Papaul updated the task description.
Papaul updated the task description.

@hnowlan will it be possible to get me restbase2021 offline on April 14th at 9:30am CT?

thanks.

Yep, that should be fine - it ideally shouldn't be down for too long but please ping me when ready and I can take it down.

Icinga downtime and Alertmanager silence (ID=575a5fd0-668b-41f6-8ab3-5ff749f54ac7) set by akosiaris@cumin1001 for 2 days, 0:00:00 on 1 host(s) and their services with reason: moving to a different rack

mc2023.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=60f8ccbd-38ba-4b65-aadf-f44a7fc83c9e) set by akosiaris@cumin1001 for 2 days, 0:00:00 on 1 host(s) and their services with reason: moving to a different rack

kubestage2002.codfw.wmnet

@Papaul: mc2023 and kubestage2002 have been downtimed again (for 2 days) and I've just powered them off. They should be ready to be moved.

@Andrew @aborrero I have listed 14 servers that we will have to move into rack B1. Four of those are not in row B and use public IPs; I think it will be better to move those after moving the ones that are already in row B.
I am thinking of starting the move next Monday, April 18th. Which of the servers that are in row B but not in rack B1 can we move next Monday, and how many can we do that same day? (For my part, if we can do all 10 that Monday, I have no issue with that.) If Monday is not a good day, please let me know. Please add a comment in the "Notes" column if we can move that server next Monday. Thanks

@hnowlan I am ready for restbase2021

Go ahead!

Papaul updated the task description.

All the non-cloud nodes are now out of Rack B1. Thanks to everyone who helped me depool the servers and power them off.

I have a dentist appointment at 2PM CDT on Monday the 18th; otherwise I'm available to help with this.

Please be aware that I'm largely ignorant of network topology vs. racks, so I will be relying on Papaul or netops staff to ensure that this move keeps everything still able to talk to everything else.

Mentioned in SAL (#wikimedia-cloud) [2022-04-18T13:40:05Z] <andrewbogott> shutting down many codfw1dev servers (including network infra!) for T305469

Papaul updated the task description.

Backup monitoring is complaining about the lack of recent backups of cloudcontrol2003-dev, as it is down. I will ignore those for a while; we must re-enable monitoring once maintenance completes.

Change 784621 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backup: Ignore cloudcontrol2003-dev backup monitoring

https://gerrit.wikimedia.org/r/784621

Change 784621 merged by Jcrespo:

[operations/puppet@production] backup: Ignore cloudcontrol2003-dev backup monitoring

https://gerrit.wikimedia.org/r/784621

Papaul updated the task description.

This is complete. @Andrew thanks for all your help