3 conf200x servers in codfw for zookeeper (and etcd?)
Closed, ResolvedPublic

Description

Feel free to close or mark this as a duplicate.

We recently provisioned 4 new Kafka brokers, 2 in eqiad and 2 in codfw. The 2 in eqiad are set up and running using the conf100x Zookeepers. However, there is no provisioned Zookeeper cluster in codfw.

I'm not sure what the plans for etcd in codfw are, but we'll need to set up a Zookeeper cluster there before we can use the new codfw Kafka nodes.

The same misc hardware we just ordered for the Kafka brokers would be fine for the Zookeepers too.

Event Timeline

Ottomata created this task.Dec 18 2015, 4:56 PM
Ottomata assigned this task to Joe.
Ottomata raised the priority of this task from to Needs Triage.
Ottomata updated the task description. (Show Details)
Ottomata added a subscriber: Ottomata.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptDec 18 2015, 4:56 PM
RobH added a subscriber: RobH.Dec 18 2015, 5:29 PM

We do indeed lack conf2XXX deployments in codfw. After an IRC chat with @Ottomata, we're going to assign this task to @Joe for his input on how he plans to deploy zookeeper and etcd in codfw. (They are served off the same systems in EQIAD, but as those particular services may be re-engineered for codfw, the time to check seems to be now.)

@Joe is likely out for the holidays (I'm told) so this will have to sit for feedback until post-holidays. As codfw isn't serving traffic for this at the moment, that seems acceptable.

RobH triaged this task as High priority.Dec 18 2015, 5:29 PM

Had a meeting with @Joe and @paravoid (and others) yesterday, and we decided to move forward with this and another Kafka-related procurement request.

The eqiad Zookeeper/etcd servers are Dell PowerEdge R310s. We can get something equivalent in codfw.

Joe added a comment.Feb 10 2016, 2:02 PM

I think we can get the same systems we have in eqiad, yes; for the moment, it will only serve zookeeper probably, while I figure out how to replicate/distribute etcd.

So we need 3 servers in codfw with the same features as the ones we have in eqiad.

I'm assigning this task back to @RobH.

Joe reassigned this task from Joe to RobH.Feb 10 2016, 2:03 PM
Joe set Security to None.
RobH added subscribers: mark, faidon, Joe.EditedFeb 16 2016, 5:30 PM

So we don't have any in-warranty systems that match these specifications, as they are quite low. codfw only has the new high performance misc systems spare, plus four identical systems that are slightly out of warranty (expired this past January), with the following specifications:

  • Dual Intel® Xeon® Processor E5-2440 (3GHz/6c)
  • 32GB RAM
  • Dual 500GB SATA

Otherwise the stats on these are identical to or better than the conf100[1-3]. If these aren't acceptable, I can get a quote for new systems instead. I'd like to get feedback from @Joe/@Ottomata/@faidon on whether these slightly out of warranty systems can be used.

If so, we can escalate this to @mark for approval to allocate systems to this.

Since conf100[1-3] are all in different rows, I've selected systems in different racks for this as well.

Systems proposed: WMF3560 (C4), WMF3565 (C7), WMF5849 (A5)
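
A minimal sketch of how the placement of the proposed systems could be checked. It assumes each location label is a row letter followed by a rack number (so "C4" means row C, rack 4); that reading, and the helper names `parse_location` and `spread`, are illustrative assumptions rather than anything stated in this task.

```python
import re

# Proposed systems and their location labels, as listed in this task.
proposed = {"WMF3560": "C4", "WMF3565": "C7", "WMF5849": "A5"}


def parse_location(label):
    """Split a label such as 'C4' into (row, rack). Assumed format: letter + digits."""
    match = re.fullmatch(r"([A-Z])(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognised location label: {label!r}")
    return match.group(1), int(match.group(2))


def spread(locations):
    """Report whether the systems sit in distinct racks and in distinct rows."""
    parsed = [parse_location(label) for label in locations.values()]
    rows = {row for row, _rack in parsed}
    return {
        "distinct_racks": len(set(parsed)) == len(parsed),
        "distinct_rows": len(rows) == len(parsed),
    }


print(spread(proposed))
```

Under that assumed label format, the three systems are in distinct racks, but two of them share a row.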

RobH reassigned this task from RobH to Joe.Feb 16 2016, 5:30 PM
RobH reassigned this task from Joe to Ottomata.
RobH moved this task from Backlog to In Discussion / Review on the hardware-requests board.
RobH moved this task from In Discussion / Review to Pending Approval on the hardware-requests board.

Actually, I should have assigned to @Ottomata as he was the initial requester.

Joe added a comment.Feb 16 2016, 5:57 PM

@RobH the specifications seem neat, we will probably need to refresh those hosts in two years right?

That seems reasonable anyways.

Those are beefier than the conf100xs, so they will certainly do just fine. However, since they are out of warranty, it might be better to just get new, less beefy nodes that match the use case more appropriately.

Joe added a comment.Feb 16 2016, 5:59 PM

@Ottomata given these are consistent distributed systems, it's ok to use servers that are out of warranty on the premise that we'll replace them in the future.
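
A minimal sketch of the arithmetic behind this reasoning, assuming a majority-quorum ensemble (which is how ZooKeeper decides whether it can keep serving). The function names are hypothetical and only illustrate the calculation.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of nodes that can be lost while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


for nodes in (1, 3, 5):
    print(nodes, quorum_size(nodes), tolerated_failures(nodes))
```

With three servers, the ensemble keeps a quorum of two after losing any single node, so one hardware failure on an out of warranty machine does not by itself take the service down.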

RobH claimed this task.Feb 16 2016, 6:07 PM
RobH reassigned this task from RobH to mark.Feb 16 2016, 6:16 PM

After the above discussion on task, and an IRC discussion with both @Ottomata and @Joe, we have the following summary:

The kafka allocation mentioned in the initial request was for kafka200[1-2]. Those machines came out of the recently ordered new misc systems batch (T120246), and are very over provisioned compared to conf100[1-3].

CODFW presently has no zookeeper(conf) nodes. EQIAD has conf100[1-3]. The specifications for those systems are quite low, so we don't have any in warranty systems that match these specifications exactly. We do have 4 systems that just had their warranties expire last month.

  • Dell PowerEdge
  • Dual Intel® Xeon® Processor E5-2440 (3GHz/6c)
  • 32GB RAM
  • Dual 500GB SATA

Otherwise the stats on these are identical to or better than the conf100[1-3]. If these aren't acceptable, I can get a quote for new systems instead. I'd like to get feedback from @Joe/@Ottomata/@faidon on whether these slightly out of warranty systems can be used.

Since conf100[1-3] are all in different rows, I've selected systems in different racks for this as well.

Systems proposed: WMF3560 (C4), WMF3565 (C7), WMF5849 (A5)

Both @Joe and @Ottomata have reviewed, and the specification will meet the requirements. There is some confusion about whether we can allocate these systems to analytics use, and we need that clarified.

If we prefer to order a new machine for these, rather than use the spares on site, I can request pricing for it. (This is a public task, so I've not placed pricing on it.) Prices for the most recent single CPU purchases can be found on T117240 or T118993.

This task is now assigned to @mark for his review and approval of the allocation of the three spares or his request that we instead generate new quotes for order of 3 new systems (pricing will be similar to T117240 or T118993.) Please attach needed review/approval/comments and assign back to me, thanks!

mark added a comment.Feb 18 2016, 3:41 PM

Let's go ahead, considering these only just expired and we don't have budget for them otherwise. Although we try to avoid that, considering the use, internal redundancy and ease of redeployment, it seems low risk in this case.

RobH closed this task as Resolved.Feb 18 2016, 6:09 PM

Task T127344 is for the setup/deployment of conf200[1-3]. This hardware request is fulfilled.

RobH reopened this task as Open.Feb 18 2016, 6:31 PM

So it turns out WMF3560 & WMF3565 were listed on our spares in codfw, but are actually in eqiad. I'm not sure how that happened, but I'm auditing the rest of the codfw spares and have not found the issue elsewhere.

So for the three systems allocated on this task, only one of them exists in codfw.

As such, I'm reopening this task to find the proper allocations.

The only systems that would match are the new dual CPU misc systems. The only issue with them is they are 4*4TB, which is overkill. Otherwise they are 3GHz @ 4 cores, versus the conf1001-1003, which are 3GHz at 6 cores. Is the CPU core count a major issue on this, or would these work? (Assigning back to @Ottomata for his input.)

Sorry for the confusion and turnaround on this. Steps have been taken and the spares sheets audited to prevent a recurrence.

RobH reassigned this task from RobH to Ottomata.Feb 18 2016, 6:31 PM

I think those would work fine. Are you sure conf100x have 6 cores? I see 4.

We def don't need 4*4TB, maybe you can swap out smaller HDDs and save the big ones for another use?

RobH reassigned this task from Ottomata to mark.Feb 18 2016, 9:08 PM

You are correct, they are 4 not 6 cores, which makes them even better allocations as it's not a core count reduction.

So I suppose the next step is to get @mark's approval to allocate 3 of the high performance misc systems of the 5 we have available.

@mark:

The initial 3 systems I suggested ended up having 2 of the 3 in eqiad, not codfw. The spares sheet was not correct, so I audited and fixed it. However, we then only have one of the just out of warranty systems, and having one of the three systems in a cluster differ from the others would not be ideal.

So I'd like your approval to allocate 3 of the 5 new high performance misc systems for this request as conf200[1-3]. The CPU core count matches and the RAM at 32GB is fine. The 4 * 4TB is overkill, but the price difference between those and the 500GB dual systems was close enough that we didn't order the dual 500GB systems.

Alternatively, I can get new system orders quoted for this (seemed easier to use spares and order more spares, but your call!)

This wasn't in the pending approval column, so I've moved it there and dropped @mark a note via IRC PM. (It may not have been part of his triage due to the column placement, but it was already assigned to him.)

mark added a comment.Mar 1 2016, 12:19 PM

I assume we have no weaker boxes for this purpose?

Looking at conf100x, they use no resources whatsoever...

Indeed, either we order new weaker ones, or we use the referred spares. Whatever is fine with me.

RobH claimed this task.Mar 14 2016, 8:59 PM

I have to steal this back for update, as other allocations (sca and scb clusters) used up all the codfw spares.

This request is now assigned to me. We'll have to order new weaker ones, or new ones as part of the spare pool order on T128910.

The spare pool order is for dual 2.6GHz 8 core each and 64GB of RAM. This seems overkill for this need, so I'll also create a request for three systems approximately half as powerful.

RobH mentioned this in Unknown Object (Task).Mar 14 2016, 9:03 PM
RobH changed the task status from Open to Stalled.Mar 15 2016, 10:41 PM

T130080 has been created to get quotes for this, and is a blocker to this task.

RobH added a subtask: Unknown Object (Task).Mar 15 2016, 10:41 PM
RobH added a comment.Mar 23 2016, 6:55 PM

Just to update the public task, we have quotes back from one of our two hardware vendors. Once we have the other back (expected today/tomorrow), they'll be escalated for review and purchase approvals.

RobH mentioned this in Unknown Object (Task).Mar 23 2016, 9:04 PM

This has been ordered, and now has a public blocking/racking task of T131959.

RobH edited subtasks, added: T131959: rack/setup/deploy conf200[123]; removed: Unknown Object (Task).Apr 6 2016, 5:56 PM
RobH closed this task as Resolved.Apr 19 2016, 7:03 PM

As this task has had systems allocated, and setup is via T131959, resolving this request.

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptApr 19 2016, 7:03 PM