
Try to move some new analytics worker nodes to different racks
Closed, Resolved (Public)

Description

First of all, apologies from my side for some new work on the an-worker nodes :(

In the parent task we realized that rack A4 now has 8 Hadoop worker nodes. I didn't notice that when planning the rack locations as described in T260445#6863444.

I'd ask if you could help and try to balance some worker nodes again, since some of them are not yet in service and it would be easier for us to schedule downtime.

If possible, I'd ask if any of these nodes could be racked in a different place: an-worker[1129,1139-1141].eqiad.wmnet

The current distribution of hadoop worker nodes (without the aforementioned ones) is:

19 A
19 B
21 C
19 D

Meanwhile the distribution between the racks of each row (row/rack) is:

1 A/1
5 A/2
2 A/3
4 A/4
2 A/5
5 A/7

5 B/2
1 B/3
5 B/4
5 B/7
3 B/8

5 C/2
4 C/3
7 C/4
4 C/7
1 C/8

6 D/2
4 D/4
2 D/5
6 D/7
1 D/8

What we are trying to do is to avoid having more than 5-6 nodes in each rack. Adding an-worker[1129,1139-1141] to A4 means getting to 8, which is too many for us (for resiliency if a rack goes down, etc.).
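As a rough cross-check of the numbers above, here is a minimal Python sketch that tallies the per-rack counts into row totals and flags any rack above the limit. The rack counts are copied from the listing; `MAX_PER_RACK = 6` is an assumption (the upper end of the stated 5/6 target), and the function names are made up for illustration.

```python
from collections import Counter

# Rack counts copied from the listing above (the four pending nodes excluded).
RACKS = {
    "A/1": 1, "A/2": 5, "A/3": 2, "A/4": 4, "A/5": 2, "A/7": 5,
    "B/2": 5, "B/3": 1, "B/4": 5, "B/7": 5, "B/8": 3,
    "C/2": 5, "C/3": 4, "C/4": 7, "C/7": 4, "C/8": 1,
    "D/2": 6, "D/4": 4, "D/5": 2, "D/7": 6, "D/8": 1,
}

# Assumption: treat 6 as the hard ceiling of the "5/6 nodes" target.
MAX_PER_RACK = 6


def row_totals(racks):
    """Sum rack counts per row; the row is the part before the slash."""
    totals = Counter()
    for rack, count in racks.items():
        totals[rack.split("/")[0]] += count
    return dict(totals)


def over_limit(racks, limit=MAX_PER_RACK):
    """Return the racks whose node count exceeds the limit."""
    return {rack: n for rack, n in racks.items() if n > limit}


# an-worker[1129,1139-1141] are the four nodes currently placed in A4.
with_pending = dict(RACKS)
with_pending["A/4"] += 4

print(row_totals(RACKS))
print(over_limit(with_pending))
```

With these inputs the row totals match the 19/19/21/19 figures, and A/4 is flagged at 8. C/4, already listed at 7, also exceeds the assumed ceiling of 6.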

@wiki_willy proposed a change in T260445#6865096 to avoid IP changes (so keeping the nodes within row A), but if there are free spots elsewhere (in other rows) it will be fine as well.

No worries @elukey, it looks like I missed the double count in rack A4 as well. If these hosts need to stay in row A though, the only other 10g options would be in racks A2 or A7. Both are pretty full, but I do see room to fit one server in each rack, near the very top. We typically don't use shelf 42, but it could be possible - @Jclark-ctr will probably need to confirm how tight the space on shelf 42 is in A2 and A7. Also, ms-be1019 in A2 is EOL, so hopefully the SREs will have a decom task submitted for that soon, which would also free up another spot in the future. Would this work for you?

  • Move an-worker1129 to A2
  • Move an-worker1139 to A7

That would net you 6x servers in A2, 6x servers in A4, and 6x servers in A7. If it does, let's track this via a new task for the server moves.

Thanks,
Willy
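The arithmetic behind the proposed moves can be sketched in a few lines of Python. This assumes the starting counts given in the description (5 in A2, 8 in A4 once the four new nodes are included, 5 in A7); `apply_moves` is a hypothetical helper, not part of any existing tooling.

```python
def apply_moves(racks, moves):
    """Return a new rack -> count mapping after moving one node per (src, dst) pair."""
    result = dict(racks)
    for src, dst in moves:
        if result.get(src, 0) < 1:
            raise ValueError(f"no node to move out of {src}")
        result[src] -= 1
        result[dst] = result.get(dst, 0) + 1
    return result


# Row A racks involved in the proposal, with A/4 counted at 8
# (4 existing nodes plus an-worker[1129,1139-1141]).
row_a = {"A/2": 5, "A/4": 8, "A/7": 5}

# an-worker1129: A4 -> A2, an-worker1139: A4 -> A7
moves = [("A/4", "A/2"), ("A/4", "A/7")]

after = apply_moves(row_a, moves)
print(after)  # {'A/2': 6, 'A/4': 6, 'A/7': 6}
```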

Thanks a lot for the patience!

Event Timeline

Some of the mw servers in rack A7 should be decom'd, after T273915 is installed for the refresh.
Since the power in A7 is maxing out, I think we should wait for a few of the mw servers to be decom'd first, then rack an-worker1139 in one of those shelf positions. In the meantime though, we should still be able to move an-worker1129 to rack A2. Thanks, Willy

@elukey can I move the 2 servers anytime or does this need to be scheduled?

Move an-worker1129 to A2
Move an-worker1139 to A7

@elukey I have not forgotten about this. A7 is a candidate rack for the move, but we are already maxing out our power utilization in that rack and adding another R740XD is probably not a good idea.

Hi @Cmjohnson - there should be some power freed up after some mw servers are decom'd for the T273915 refresh. There are going to be 7x servers coming out, so an-worker1139 can take one of those slots. Thanks, Willy

@Cmjohnson hi! Any news about the worker nodes?

Hi @elukey - the rack space in A7 is pending on T280203. @Cmjohnson - you should be able to complete the move to A2 though - you just need to decom T280121 to free up a 2u spot in that rack. Thanks, Willy

@Cmjohnson Hi! Do you have time next week for the A2 rack?

@elukey sure I can move one of them to A2. Rack A7 is still full

Yes perfect! Anytime is good, those nodes are not in service.

@elukey I have this on my plan for tomorrow morning. i'll update the task once the move is complete.

@elukey an-worker1129 has been moved to A2

Thanks! Remaining step is to move an-worker1139 to A7, pending https://phabricator.wikimedia.org/T280203

@elukey @Cmjohnson to plan our work for T275767, do we have an ETA for the move of an-worker1139 to A7?

@Ottomata it depends on T280203, but I think that we can move 5 out of 6 nodes right now and then wait for the last one until A7 is freed up by the old nodes :)

Right, but if that A7 move will happen in the next couple of weeks, we might just wait. If it will happen in many months from now, then I agree let's do the 5 we can now.

@Ottomata I cannot say for sure, I am getting new MW servers online. That will allow the current MW servers in A7 to be decomm'd and removed. I think this could be done in the next 2-3 weeks.

The MW servers are almost ready. Once we can get a batch of these online, the mw servers in A7 will be able to be decommissioned.

We are close to moving these to A7 now. Several MW's have been decom'd and John and I need to get them out of the rack. Looking to have this done next week. Thanks for waiting

@Ottomata an-worker1139 is officially in rack A7. All cabled up and ready for OS install

I am going to resolve this task because the relocation is complete.

Change 714331 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add rack locations of the six new datanode servers

https://gerrit.wikimedia.org/r/714331

Change 714331 merged by Btullis:

[operations/puppet@production] Add rack locations of the six new datanode servers

https://gerrit.wikimedia.org/r/714331