eqiad: Server moves to free up space on 10g racks
Open, Medium · Public · Request

Description

Hi John, it looks like we're short 7x 2u positions on 10g racks to complete T260445. This task is to identify 1g servers that can be moved out of their existing 10g rack locations to make this happen. Please provide the old --> new rack locations and switch port info, along with proposed dates for the server moves. A summary of the proposed rack moves is in:
https://docs.google.com/spreadsheets/d/1om4K2iy2yx6dfQ6DGufMSDf2luoyn2zxQHyZTXJEo-4/edit#gid=1862899982

In addition, here's a quick summary of upcoming hardware installs that require 10g over the next couple of quarters:

Q2:

  • logstash[1020-1022] refresh - 3x 10g ports across 3u

Q3:

  • Backup storage expansion for Swift objects - 4x 10g ports across 18u
  • special slaves, vslow, test host (9 servers) - 9x 10g ports across 9u
  • Bacula expansion - 1x 10g port across 3u
  • kafka-logging100[123] - 3x 10g ports across 3u
  • cloudvirt expansion (3+1 nodes) - 8x 10g ports across 4u (wmcs rack)
  • ceph expansion (6+3 nodes) - 18x 10g ports across 9u (wmcs rack)

Q4:

  • eqiad: mc[1019-1036] refresh - 18x 10g ports across 18u
  • eqiad: SDC/SDAW? - 7x 10g ports across 7u
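
To make the per-quarter demand easier to compare against free rack space, the lists above can be totalled with a short script. This is only a sketch: the `PLANNED` structure and the `totals_by_quarter` helper are names made up for illustration, and the figures are copied from the bullets above.

```python
# Planned 10g installs copied from the task description.
# Each entry is (label, number of 10g ports, rack units).
# PLANNED and totals_by_quarter are illustrative names, not an existing tool.
PLANNED = {
    "Q2": [("logstash[1020-1022] refresh", 3, 3)],
    "Q3": [
        ("Backup storage expansion for Swift objects", 4, 18),
        ("special slaves, vslow, test host (9 servers)", 9, 9),
        ("Bacula expansion", 1, 3),
        ("kafka-logging100[123]", 3, 3),
        ("cloudvirt expansion (wmcs rack)", 8, 4),
        ("ceph expansion (wmcs rack)", 18, 9),
    ],
    "Q4": [
        ("mc[1019-1036] refresh", 18, 18),
        ("SDC/SDAW?", 7, 7),
    ],
}


def totals_by_quarter(planned):
    """Return {quarter: (total_ports, total_rack_units)}."""
    return {
        quarter: (
            sum(ports for _, ports, _ in items),
            sum(units for _, _, units in items),
        )
        for quarter, items in planned.items()
    }


totals = totals_by_quarter(PLANNED)
for quarter, (ports, units) in totals.items():
    print(f"{quarter}: {ports}x 10g ports across {units}u")
```

Note that the cloudvirt and ceph entries are marked for the wmcs rack, so they may not compete for the same space as the other installs.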

Thanks,
Willy

Event Timeline

Restricted Application added a project: SRE. · Nov 2 2020, 8:17 PM
wiki_willy renamed this task from eqiad: Server Moves to Free up 7x 2u Spaces on 10g Racks to eqiad: Server Moves to Free up Space on 10g Racks. · Nov 2 2020, 8:43 PM
wiki_willy updated the task description.
wiki_willy updated the task description. · Nov 2 2020, 8:48 PM
Krinkle renamed this task from eqiad: Server Moves to Free up Space on 10g Racks to eqiad: Server moves to free up space on 10g racks. · Nov 3 2020, 4:17 AM

These are all 1G servers in 10G racks for row A

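| A2 | A4 | A7 |
| db1074 | stat1004 | mw1269 |
| db1075 | logstash1020 | mw1270 |
| db1079 | wdqs1003 | mw1271 |
| db1080 | ganeti1005 | mw1272 |
| db1081 | contint1001 | mw1273 |
| db1082 | db1111 | mw1274 |
| es1011 (2U) | maps1001 | mw1275 |
| es1012 (2U) | analytics1070 (2U) | mw1276 |
| | snapshot1005 | mw1277 |
| | kubestage1001 | mw1278 |
| | scb1001 | mw1279 |
| | aqs1004 | mw1281 |
| | druid1001 | mw1282 |
| | | mw1283 |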

These are all 1G servers in 10G racks for row B

| B2 | B4 | B7 |
| db1099 | elastic1050 | wtp1031 |
| analytics1072 (2U) | elastic1049 | wtp1032 |
| | conf1005 | wtp1033 |
| | Kublog1002 | druid1003 |
| | maps1002 | ores1003 |
cloudcontrol1004
mw1313
mw1314
mw1315
mw1316

Row C

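| C2 | C4 | C7 |
| es1016 (2U) | ores1006 | francium |
| db1100 | mwlog1001 | polonium |
| db1101 | snapshot1006 | scb1003 |
| analytics1064 (2U) | deploy1001 | elastic1051 |
| analytics1065 (2U) | labsdb1010 | elastic1052 |
| analytics1066 (2U) | wtp1040 |
| db1087 | wtp1041 |
| db1088 | wtp1042 |
| labnet1004 (2U) | | |
| es1015 (2U) | | |
| analytics1074 (2U) | | |
| db1108 | | |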

Racks D2 and D7 are 100% 10G but they were initially built that way. D4 was just converted to 10G

D4
db1114
ores1008
mc1033
mc1034
mc1035
mc1036
aqs1006
druid1003
labweb1002
wtp1046
wtp1047
wtp1048
snapshot1007
restbase1030
elastic1064
conf1006
puppetmaster1002

Cmjohnson claimed this task. · Nov 3 2020, 7:37 PM
Cmjohnson added a subscriber: Jclark-ctr.

@wiki_willy I had time to do this today while the Dell tech worked on an-presto1004. I am going to be utilizing a 2U space in A2 and B2 for the kafka-jumbo 10G updates leaving only 15 2U spaces. We will have less than I previously reported. I am also pasting what I put in the an-worker ticket here for better tracking.

@wiki_willy I do not have enough 10G rack space to fit 24 2U servers. Currently, I have 17 2U spaces in 10G racks. This is all I have left for servers this size.

A2 - 1
B2 - 4
B4 - 4
C2 - 2
D4 - 6
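
As a quick sanity check on the numbers, the per-rack counts above can be added up and compared with the 24 2U servers needed. This is a rough sketch: the names `FREE_2U`, `KAFKA_JUMBO_RACKS` and `shortfall` are invented for illustration, and it assumes kafka-jumbo takes exactly one 2U space in each of A2 and B2, as described earlier in this comment.

```python
# Free 2U spaces per 10G rack, copied from the list above.
FREE_2U = {"A2": 1, "B2": 4, "B4": 4, "C2": 2, "D4": 6}

# 2U servers that need a home (24 worker nodes).
NEEDED = 24

# Assumption: the kafka-jumbo 10G updates take one 2U space in each of these racks.
KAFKA_JUMBO_RACKS = ["A2", "B2"]


def shortfall(free, needed):
    """Number of 2U spaces still missing (0 if there is enough room)."""
    return max(0, needed - sum(free.values()))


# Free space once the kafka-jumbo slots are taken out.
remaining = dict(FREE_2U)
for rack in KAFKA_JUMBO_RACKS:
    remaining[rack] -= 1

print("free now:", sum(FREE_2U.values()), "short by:", shortfall(FREE_2U, NEEDED))
print("after kafka-jumbo:", sum(remaining.values()), "short by:", shortfall(remaining, NEEDED))
```

The first figure matches the 7x 2u gap in the task description; the second is what is left to find once the two kafka-jumbo slots are used.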

Consolidated all the info @Cmjohnson provided in a Google doc, so we can add the service owners of the hosts and track future rack location, etc. below:

https://docs.google.com/spreadsheets/d/1om4K2iy2yx6dfQ6DGufMSDf2luoyn2zxQHyZTXJEo-4/edit#gid=1862899982

wiki_willy reassigned this task from Cmjohnson to Jclark-ctr. · Nov 5 2020, 7:37 PM

@elukey Hey, when you get a chance, can you let me know the best day I can schedule some moves with you next week?

elukey added a comment. · Edited · Nov 9 2020, 8:08 AM

I had a chat with John over IRC last week and we decided to meet tomorrow to move some servers. I am going to list the constraints I have for racking:

ROW A:

  • stat1004 - fine to move it in any row A rack, except (if possible) A6 where stat1008 is. - This one needs ~48h of user notification before being able to move it.
  • aqs1004 - fine to move it in any row A rack, except (if possible) A6 where aqs1007 is.
  • druid1001 - fine to move it in any row A rack, except (if possible) A5 where an-druid1001 is.

ROW B:

  • druid1005 - fine to move it anywhere in row B except B3.

ROW C:

  • db1108 - fine to move it in any row C rack, except (if possible) C7 where an-coord1002 is.

ROW D:

  • aqs1006 - fine to move it in any row D rack, except (if possible) D3 where aqs1009 is.

Going to make a separate list for the hadoop worker nodes:

  • analytics1070 - A4
  • analytics1072 - B2
  • analytics1064 - C2
  • analytics1065 - C2
  • analytics1066 - C2
  • analytics1074 - C2

Can I have a list of proposed rack moves before we proceed, so I can check that they are OK first? We have precise racking settings for hadoop to ensure data is spread around in a good way, and I'd like to keep it balanced if possible. I'd also need to change some settings before starting :)

@Jclark-ctr if you can give me the network ports you intend to use I will have them pre-configured as well.
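
The per-host constraints listed in this comment could be checked mechanically before a move is scheduled. The following is a minimal sketch, not an existing tool: `AVOID_RACK` and `check_move` are hypothetical names, the rack to avoid is treated as a hard failure even though the constraints are stated as "if possible", and it assumes a rack name starts with its row letter.

```python
# Rack to avoid for each host, copied from the constraints above.
# AVOID_RACK and check_move are hypothetical names used for illustration.
AVOID_RACK = {
    "stat1004": "A6",   # stat1008 is there
    "aqs1004": "A6",    # aqs1007 is there
    "druid1001": "A5",  # an-druid1001 is there
    "druid1005": "B3",
    "db1108": "C7",     # an-coord1002 is there
    "aqs1006": "D3",    # aqs1009 is there
}


def check_move(host, old_rack, new_rack, avoid=AVOID_RACK):
    """Return a list of problems with a proposed move; empty means it looks fine."""
    problems = []
    # Every constraint above keeps the host inside its current row.
    if old_rack[0] != new_rack[0]:
        problems.append(f"{host}: {new_rack} is not in row {old_rack[0]}")
    # The "except (if possible)" rack is treated as a failure in this sketch.
    if avoid.get(host) == new_rack:
        problems.append(f"{host}: {new_rack} is the rack to avoid")
    return problems


print(check_move("stat1004", "A4", "A5"))  # stays in row A, not the rack to avoid
print(check_move("stat1004", "A4", "A6"))  # A6 is where stat1008 is
```

Hosts that are not in the lookup only get the same-row check.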

Jclark-ctr added a comment. · Edited · Nov 10 2020, 3:28 PM

@elukey Please review the racks I have recommended and let me know if anything needs to change.
@Cmjohnson Will wait until after Luca gives the OK to configure ports.

Old rack, Host, New rack, Unit, Switch port

Row A
A4 stat1004 A5 U32 port32
A4 analytics1070 A5 U37 port37
A4 aqs1004 A3 U13 port12
A4 druid1001

Row B
B2 analytics1072 B3 U40 port15
B4 conf1005 B3 U36 port14
B7 druid1005 B6 U26 port25

Row C
C2 analytics1064 C3 U35 port29
C2 analytics1065 C3 U37 port30
C2 analytics1066 C3 U39 port31
C2 analytics1074 C3 U6 port14
C2 db1108 C3 U34 port34

Row D
D4 aqs1006 D6 U34 port34
D4 druid1003 D6 U33 port33
D4 conf1006 D6 U34 port34
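
A plan in this shape is easy to check for double-booked rack units or switch ports before anything is configured. The sketch below is only an illustration under stated assumptions: `PLAN` is the list above with row headings dropped and hostname spelling and port formatting normalised, and `parse_plan` and `collisions` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Proposed moves copied from the plan above (row headings dropped,
# hostname spelling and "port" formatting normalised).
PLAN = """\
A4 stat1004 A5 U32 port32
A4 analytics1070 A5 U37 port37
A4 aqs1004 A3 U13 port12
A4 druid1001
B2 analytics1072 B3 U40 port15
B4 conf1005 B3 U36 port14
B7 druid1005 B6 U26 port25
C2 analytics1064 C3 U35 port29
C2 analytics1065 C3 U37 port30
C2 analytics1066 C3 U39 port31
C2 analytics1074 C3 U6 port14
C2 db1108 C3 U34 port34
D4 aqs1006 D6 U34 port34
D4 druid1003 D6 U33 port33
D4 conf1006 D6 U34 port34
"""

# old rack, host, new rack, unit, switch port
LINE = re.compile(
    r"^(?P<old>[A-D]\d)\s+(?P<host>\S+)\s+(?P<new>[A-D]\d)\.?"
    r"\s+U(?P<unit>\d+)\s+port\s*(?P<port>\d+)$",
    re.IGNORECASE,
)


def parse_plan(text):
    """Split the plan into complete moves and lines that lack a destination."""
    moves, incomplete = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE.match(line)
        if match is None:
            incomplete.append(line)
            continue
        moves.append({
            "host": match["host"],
            "old": match["old"].upper(),
            "new": match["new"].upper(),
            "unit": int(match["unit"]),
            "port": int(match["port"]),
        })
    return moves, incomplete


def collisions(moves, key):
    """Group hosts by (new rack, unit or port) and keep slots used more than once."""
    seen = defaultdict(list)
    for move in moves:
        seen[(move["new"], move[key])].append(move["host"])
    return {slot: hosts for slot, hosts in seen.items() if len(hosts) > 1}


moves, incomplete = parse_plan(PLAN)
print("no destination yet:", incomplete)
print("unit clashes:", collisions(moves, "unit"))
print("port clashes:", collisions(moves, "port"))
```

Any slot reported by `collisions` would need a second look before the switch ports are configured.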

elukey added a subscriber: Joe. · Nov 10 2020, 4:01 PM

@Jclark-ctr I checked and I have only a couple of comments:

  1. B7 druid1003 B6 U26 Port25 - this is druid1005 right?
  2. could we place druid1001 in a rack that is not A6? There is another druid node in there, so I'd prefer to keep them in separate racks if possible.

Also I need to coordinate with @Joe for the conf100x hosts, but the racking looks fine.

elukey updated the task description. · Nov 10 2020, 4:05 PM

Change 640448 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Update network topology for Hadoop worker nodes

https://gerrit.wikimedia.org/r/640448

@wiki_willy Hi! Do we have a timeline for moving the hosts to free up space for the new hadoop worker nodes? I am asking since I'd need them racked this month if possible (I can help with the bootstrap, OS install, etc., of course), otherwise I'll make other plans :) Thanks!

Jclark-ctr added a comment. · Edited · Nov 19 2020, 12:04 AM

@Cmjohnson would you be able to do switch ports?

Row A
A4 stat1004 A5 U32 port32
A4 analytics1070 A5 U37 port37
A4 aqs1004 A3 U13 port12
A4 druid1001

Row B
B2 analytics1072 B3 U40 port15
B4 conf1005 B3 U36 port14
B7 druid1005 B6 U26 port25

Row C
C2 analytics1064 C3 U35 port29
C2 analytics1065 C3 U37 port30
C2 analytics1066 C3 U39 port31
C2 analytics1074 C3 U6 port14
C2 db1108 C3 U34 port34

Row D
D4 aqs1006 D6 U34 port34
D4 druid1003 D6 U33 port33
D4 conf1006 D6 U34 port34

Hi @elukey - it's pretty close. Once @Cmjohnson and @Jclark-ctr work out the configuration on the switch ports, you should be good to go. Thanks, Willy

Change 640448 merged by Elukey:
[operations/puppet@production] Update network topology for Hadoop worker nodes

https://gerrit.wikimedia.org/r/640448

For what it's worth:

  • conf*, kubestage*, mw*, scb*, wtp*, ores*, restbase* can all be taken offline for extended periods of time.
  • ganeti1005 will need to be emptied of VMs (I'll need 24h advance notice).
  • maps* might be problematic given the current state of the infrastructure, and could lead to an outage that is difficult to recover from. Adding @hnowlan for input/advice.

Mentioned in SAL (#wikimedia-operations) [2020-11-23T16:37:30Z] <elukey> move analytics1070 from rack A7 to rack A5 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-23T17:12:19Z] <elukey> move aqs1004 from rack A4 to A3 - T267065

maps* will be a slight issue - this cluster is underprovisioned at the moment and removing them will cause instability. However, neither is a master, so moving them will not cause data loss. Depending on when things are happening, I could have more capacity in place beforehand to head this off. Do you have an estimate for how long the move will take?

Mentioned in SAL (#wikimedia-operations) [2020-11-24T14:58:48Z] <elukey> move analytics1072 from rack B2 to B3 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-24T15:38:11Z] <elukey> move druid1005 from rack B7 to B6 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-24T16:29:15Z] <elukey> move analytics1064 from C2 to C3 eqiad - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-25T15:38:14Z] <elukey> move stat1004 to A5 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-25T16:11:34Z] <elukey> move analytics1065 to C3 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-25T16:46:09Z] <elukey> move analytics1066 to C3 - T267065

wiki_willy updated the task description. · Nov 25 2020, 10:12 PM
wiki_willy updated the task description.

@wiki_willy re: upcoming 10g rack space needed, there is also T260445 (24 Hadoop worker nodes) :)

Marostegui added a subscriber: Marostegui. · Edited · Nov 26 2020, 9:27 AM

All db hosts with hostnames below db1095 will be replaced by new ones, so those will go away (T258361).
es1011 (2U) has been decommissioned (T268100)
es1012 (2U) has been decommissioned (T268101)
es1015 (2U) will be decommissioned in a few days (T268810)
es1016 (2U) will be decommissioned in a few days (T268812)

Probably all the other databases (or most of them, if they are not masters) with hostnames higher than db1095 can be moved to 1G racks if needed. We'd need to schedule it with DC-Ops, but it can be done.
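
Applied to the db hosts that appear in the 10G rack lists earlier in this task, the db1095 rule of thumb can be sketched like this. The `DB_HOSTS` list is collected from those rack lists and `split_by_cutoff` is a made-up helper; whether a host is a master is not checked here and would still need to be confirmed by hand.

```python
import re

# db hosts named in the row A-D rack lists earlier in this task.
DB_HOSTS = [
    "db1074", "db1075", "db1079", "db1080", "db1081", "db1082",  # A2
    "db1111",                                                    # A4
    "db1099",                                                    # B2
    "db1100", "db1101", "db1087", "db1088", "db1108",            # C2
    "db1114",                                                    # D4
]

CUTOFF = 1095  # from the comment above: below db1095 gets replaced


def split_by_cutoff(hosts, cutoff=CUTOFF):
    """Split db hosts into those due for replacement and candidates for a 1G rack.

    A host numbered exactly at the cutoff is left out of both lists, because
    the comment above only covers hostnames below and above db1095.
    Mastership is NOT checked here.
    """
    to_be_replaced, move_candidates = [], []
    for host in hosts:
        match = re.fullmatch(r"db(\d+)", host)
        if match is None:
            continue  # not a db host, ignore
        number = int(match.group(1))
        if number < cutoff:
            to_be_replaced.append(host)
        elif number > cutoff:
            move_candidates.append(host)
    return to_be_replaced, move_candidates


to_be_replaced, move_candidates = split_by_cutoff(DB_HOSTS)
print("going away:", to_be_replaced)
print("could move to 1G racks (if not masters):", move_candidates)
```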

Hi @elukey, I have the 24x worker nodes covered in the first part of the task description. We're looking pretty good after the recent moves, so @Cmjohnson should be able to start getting those set up next week. Thanks, Willy

Awesome news thanks a lot!

Mentioned in SAL (#wikimedia-operations) [2020-12-03T11:46:06Z] <elukey> move druid1001 to rack A1 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-12-03T12:17:19Z] <elukey> move aqs1006 to rack D6 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-12-03T13:00:21Z] <elukey> move db1108 to C3 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-12-03T15:45:59Z] <elukey> moved conf1005 to rack B3 - T267065

elukey added a comment. · Dec 3 2020, 5:35 PM

Note for conf1006 - this node is set as the target for pybal in ulsfo, and after a chat with @akosiaris it is not clear what happens if pybal gets restarted while conf1006 is down. If possible, let's schedule this move last; if we want to proceed with it, we'd need to do some puppet changes / pybal restarts to free conf1006 and allow a safer rack move.