
Relocating servers out of A1 in codfw
Closed, Resolved (Public)

Description

We want to move the servers racked in A1, a router rack, to different positions in row A. This will eliminate any service interruptions caused by regular service maintenance. These are the proposed relocations, with some issues addressed.

Proposal date and time: Wednesday January 24th 16:00 UTC

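| Done | Server | Rack | U space | Switch port | Size | Service owner |
| --- | --- | --- | --- | --- | --- | --- |
|  | ml-serve2005 | a4 | 43 | ge-0/0/40 | 2U | @machine_learning |
|  | db2158 | a3 | 22 | ge-0/0/21 | 1U | @Data Persistence |
|  | db2157 | a5 | 30 | ge-0/0/29 | 1U | @Data Persistence |
|  | es2026 | a4 | 41 | ge-0/0/42 | 2U | @Data Persistence |
|  | gitlab2002 | a6 | 22 | ge-0/0/21 | 1U | @collaboration services |
|  | kubestage2001 | a6 | 23 | ge-0/0/22 | 1U | @serviceops |
|  | db2136 | a6 | 24 | ge-0/0/23 | 1U | @Data Persistence |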
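A plan like the one in the table can be sanity-checked for double-booked rack units or switch ports before anything is moved. The following is a minimal sketch only, not a tool used in this task. The `Move` tuple and `find_conflicts` helper are hypothetical names, and the check rests on two assumptions that the table does not state: the "U space" value is the lowest unit a server occupies (a 2U server extends one unit upward), and each rack has a single access switch, so a port is unique per rack.

```python
from collections import namedtuple

# Hypothetical record type for one row of the relocation table.
Move = namedtuple("Move", "server rack u port size")

# Rows copied from the relocation table; size is the height in rack units.
PLAN = [
    Move("ml-serve2005", "a4", 43, "ge-0/0/40", 2),
    Move("db2158", "a3", 22, "ge-0/0/21", 1),
    Move("db2157", "a5", 30, "ge-0/0/29", 1),
    Move("es2026", "a4", 41, "ge-0/0/42", 2),
    Move("gitlab2002", "a6", 22, "ge-0/0/21", 1),
    Move("kubestage2001", "a6", 23, "ge-0/0/22", 1),
    Move("db2136", "a6", 24, "ge-0/0/23", 1),
]


def find_conflicts(plan):
    """Return a list of messages for overlapping rack units or reused ports."""
    conflicts = []
    used_units = {}  # (rack, unit) -> server
    used_ports = {}  # (rack, port) -> server

    for move in plan:
        # Assumption: the server occupies units u .. u + size - 1.
        for unit in range(move.u, move.u + move.size):
            key = (move.rack, unit)
            if key in used_units:
                conflicts.append(
                    f"{move.server} overlaps {used_units[key]} at {move.rack} U{unit}"
                )
            else:
                used_units[key] = move.server

        # Assumption: one switch per rack, so (rack, port) identifies a port.
        port_key = (move.rack, move.port)
        if port_key in used_ports:
            conflicts.append(
                f"{move.server} reuses {move.port} in {move.rack} "
                f"(already planned for {used_ports[port_key]})"
            )
        else:
            used_ports[port_key] = move.server

    return conflicts


if __name__ == "__main__":
    for message in find_conflicts(PLAN):
        print(message)
```

This only checks the plan against itself. It knows nothing about hosts already racked or ports already in use, which is where the issues raised in the comments below came from.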

This is what I think will work, with two issues.

Both of the 2U servers (which are 1G) will be going in 10G racks. I positioned them at the top so that we can cable them to the last quartet on each lsw. That way they take as little as possible from the 10G racks. Otherwise they would need to go to a different row, because there isn't enough room to rehome two 2U servers in the 1G racks in row A.

Also, the three servers going in A6 will use space freed up by moving the mgmt switch to the top. There are enough power ports and available power in this rack because of the already racked 2U servers. I will have to break the power cabling conventions in this rack to fit them.

Otherwise there is adequate space and power to move these servers out of the router rack.

Event Timeline

Papaul updated the task description.
Papaul moved this task from Backlog to Racking Tasks on the ops-codfw board.

@Papaul the hosts belonging to Data Persistence will be off and ready to be moved.

Mentioned in SAL (#wikimedia-operations) [2024-01-22T17:17:12Z] <akosiaris> draining kubestage2001, uncordoning kubestage2002 to allow it to receive the pods. T355437

@Papaul we have lvs2011 in U43 in A2, so we can't put ml-serve2005 there.

Also es2026 can't be connected at 1G on lsw1-a2-codfw port 41, as port 42 is connected to lvs2011 at 10G.

ge-0/0/44 and ge-0/0/45 are maybe options, as that block is already set to 1G. Or let me know what's best.
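The port-speed constraint described in this comment can be sketched as a small check. This is an illustration only, not how the switches are actually configured or queried. It assumes ports are grouped in blocks of four aligned on multiples of four (inferred from the "quartet" wording in the description and from ports 41 and 42 sharing a block while 44 and 45 sit in another), and that every port in a block must run at the same speed. The function names and the `in_use` mapping are hypothetical.

```python
def quartet(port):
    """Return the four port numbers assumed to share a speed block with `port`."""
    start = (port // 4) * 4
    return list(range(start, start + 4))


def can_use_at(port, speed, in_use):
    """Check whether `port` is free and compatible with `speed`.

    `in_use` maps a port number to the speed of the link already on it,
    for example {42: "10G"}. Unused ports are simply absent from the map.
    """
    if port in in_use:
        return False  # port is already taken
    # Every occupied neighbour in the same block must match the wanted speed.
    return all(
        in_use[neighbour] == speed
        for neighbour in quartet(port)
        if neighbour in in_use
    )


if __name__ == "__main__":
    # Only the link named in the comment: lvs2011 on port 42 at 10G.
    lsw1_a2 = {42: "10G"}
    for candidate in (41, 44, 45):
        print(candidate, can_use_at(candidate, "1G", lsw1_a2))
```

With only the lvs2011 link recorded, port 41 is rejected for a 1G host while 44 and 45 pass, which matches the reasoning in the comment.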

@Marostegui thank you. @cmooney I will take another look at it, thanks.

Moved db2158 port because the port was already taken up. Used first available.
Same for the three servers in A6. Moved up to the next available port.
These cabling convention issues will be fixed when the rack is migrated to the leaf.

Change 992555 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Disable notifications on A1 hosts

https://gerrit.wikimedia.org/r/992555

Mentioned in SAL (#wikimedia-operations) [2024-01-24T05:51:44Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2158 db2157 es2026 db2136 T355437', diff saved to https://phabricator.wikimedia.org/P55452 and previous config saved to /var/cache/conftool/dbconfig/20240124-055143-marostegui.json

Change 992555 merged by Marostegui:

[operations/puppet@production] mariadb: Disable notifications on A1 hosts

https://gerrit.wikimedia.org/r/992555

Mentioned in SAL (#wikimedia-operations) [2024-01-24T09:28:15Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: A1 codfw maintenance T355437

Mentioned in SAL (#wikimedia-operations) [2024-01-24T09:28:40Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: A1 codfw maintenance T355437

Mentioned in SAL (#wikimedia-operations) [2024-01-24T09:29:18Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: A1 codfw maintenance T355437

Mentioned in SAL (#wikimedia-operations) [2024-01-24T09:29:32Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: A1 codfw maintenance T355437

Mentioned in SAL (#wikimedia-operations) [2024-01-24T09:29:36Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: A1 codfw maintenance T355437

Mentioned in SAL (#wikimedia-operations) [2024-01-24T09:29:59Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: A1 codfw maintenance T355437

Mentioned in SAL (#wikimedia-operations) [2024-01-24T09:30:10Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: A1 codfw maintenance T355437

Mentioned in SAL (#wikimedia-operations) [2024-01-24T09:30:19Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: A1 codfw maintenance T355437

@Papaul @Jhancock.wm db2158 db2157 db2136 es2026 are now off and ready to be moved anytime

Mentioned in SAL (#wikimedia-operations) [2024-01-24T14:03:51Z] <klausman@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ml-serve2005.codfw.wmnet with reason: Machine move (T355437)

Icinga downtime and Alertmanager silence (ID=f37d946c-6c32-4271-92ba-bc66a002809d) set by klausman@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Machine move (T355437)

ml-serve2005.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-01-24T14:04:10Z] <klausman@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-serve2005.codfw.wmnet with reason: Machine move (T355437)

klausman subscribed.

ml-serve2005 is off and ready

ml-serve2005 is back up and working fine

Today's work is complete. The only node left to relocate is gitlab2002. Service Ops will get back to us with a day sometime next week. All old ports in Netbox and on asw-a1-codfw have been removed.

Database related hosts are being repooled

> The only node left to relocate is gitlab2002.

Downtime of GitLab announced for tomorrow, Jan 30, 8:30 to 8:40 PST, and banner added, for moving gitlab2002.

The banner says 8:30 to 8:40 UTC. I was confused that it was still there.

We did the last server move today. Thanks to all.

Papaul claimed this task.

> The banner says 8:30 to 8:40 UTC. I was confused that it was still there.

Yeah, the start got delayed a little bit by things like having to re-auth before logging in to stop the runners.

It was 8:41 to 8:48, so overall downtime was 7 minutes.