
Refresh elastic10{01..16}.eqiad.wmnet servers
Closed, Resolved · Public

Description

These servers are due for a refresh this FY. They have fallen far enough behind that we have had to switch morelike traffic from the eqiad cluster to codfw to keep these older servers from maxing out CPU during the busiest parts of the week and affecting user latency.

The server spec we use these days for elasticsearch nodes is higher than that of these old nodes. Ideally we should match specs with elastic10{17..31}.eqiad.wmnet. Because of the higher spec, the refresh budget for these servers is not enough to cover all of them, so the Discovery capex budget will be used to cover any additional funds required.

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana moved this task from Needs triage to Ops on the Discovery-ARCHIVED board.

After poking around, I think this belongs in hardware-requests; a procurement ticket will be created later for quotes and such.

As I specified in my recent private e-mail, using the refresh budget we can cover ~10 servers of the latest spec. More could be purchased using the remaining capex allocation for Discovery.

I looked back at your email and you are totally correct. I'm double-checking with @Tfinc, but I'm pretty sure we will do the full 16-server refresh then.

Is there anything we need to do from this end to start things moving forward?

@EBernhardson:

I'm a bit confused by a section of the request:

Ideally we should match specs with elastic10{17..31}.eqiad.wmnet.

However, @mark mentions the newer codfw specification. Did you want the new systems to purposefully be slower and match eqiad (elastic1031, ordered on 2014-10-13), or to match the new codfw specification (elastic2024, ordered on 2015-08-28)? I would assume we want updated and faster systems; please advise.

We might as well match the newest spec in terms of CPU and memory. The annoyance there is that elasticsearch treats all the machines as homogeneous in terms of query and shard routing (as seen by the current maxing out of 10{01..16} while 10{17..31} are at 30% capacity). But it won't hurt to update anyway.
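
As an aside, a rough sketch of how to eyeball that imbalance from the _cat APIs (the endpoint URL is an assumption, and exact column names vary a bit between Elasticsearch versions):

```python
# Sketch: shard counts are spread roughly evenly across nodes, while load
# concentrates on the older hosts, since routing ignores hardware differences.
import requests
from collections import Counter

ES = "http://localhost:9200"  # assumed cluster endpoint

# Per-node load and heap usage; shard/query routing does not take this into account.
print(requests.get(f"{ES}/_cat/nodes?v&h=name,load,heap.percent").text)

# Shards per node: roughly equal counts regardless of CPU generation.
nodes = [l.strip() for l in requests.get(f"{ES}/_cat/shards?h=node").text.splitlines() if l.strip()]
print(Counter(nodes))
```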

For disks, we can stay with the same disks that are in elastic10{17..31}. codfw needed bigger disks because it has fewer servers (24 vs 31). Even with the smaller disks, the eqiad cluster is at 59% disk usage, which leaves plenty of room to grow without going with the 800GB disks.
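
To double-check that headroom figure, something like the sketch below (assumed endpoint) sums per-node disk usage from _cat/allocation into a cluster-wide percentage:

```python
# Sketch: compute cluster-wide disk usage (the "59%" style figure) from _cat/allocation.
import requests

ES = "http://localhost:9200"  # assumed cluster endpoint

rows = requests.get(f"{ES}/_cat/allocation?bytes=b&h=disk.used,disk.total").text
used = total = 0
for line in rows.splitlines():
    parts = line.split()
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        continue  # skip blank lines and the UNASSIGNED row, which has no disk stats
    u, t = map(int, parts)
    used += u
    total += t
print(f"cluster disk usage: {100.0 * used / total:.1f}%")
```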

To summarize:
  • Match CPU and memory with the newest spec (codfw)
    • CPU: 2x Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
    • Memory: 128GB
  • Match SSDs with the eqiad spec (2x 300GB disks, software RAID: 1G swap raid1, 28G root raid1, 500G data raid0)

The Intel(R) Xeon(R) CPU E5-2640 v3 is an 8-core CPU. You linked the v1 of the CPU, but we have the v3.

So the v3 has 8/16 and the v1 has 6/12 (cores/threads).

Do we have enough rack space to rack up all 16 new servers, before taking down the old 16? Just trying to plan out how we will do the switchover between the old and the new servers.

@EBernhardson space is at a premium in eqiad. I don't think we have enough space to evenly distribute all 16 new systems without removing some of the old systems. We have space in row D, but the rest of the rows are very full, and pushing all the new boxes into one row is typically a bad idea. The current elastic boxes seem intentionally distributed across racks/rows.

We will likely need to do a staged rollout, perhaps adding in some new servers to row D first, then decommissioning some out of the full rows A-C. How many we move at once will be largely dictated by discovery/search input.
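
For the staged rollout, one possible way to drain an old node before pulling it is the standard allocation-exclude dance. The sketch below assumes a local endpoint and uses elastic1001 as a hypothetical first node to retire:

```python
# Sketch: ban shards from a retiring node, wait for relocation, then power it down.
import requests
import time

ES = "http://localhost:9200"    # assumed cluster endpoint
RETIRING = "elastic1001"        # hypothetical first node to drain

# Exclude the node by name; Elasticsearch starts moving its shards elsewhere.
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.exclude._name": RETIRING}
})

# Poll until no shards remain on the excluded node.
while True:
    shards = requests.get(f"{ES}/_cat/shards?h=node").text.splitlines()
    remaining = sum(1 for n in shards if n.strip() == RETIRING)
    if remaining == 0:
        break
    print(f"{remaining} shards still on {RETIRING}, waiting...")
    time.sleep(60)
```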

Update from IRC chat:

  • Get quotes to match the CPU/RAM of the last codfw elastic purchase.
  • Get quotes to match the existing eqiad elastic spec & comparison quote to match codfw elastic spec.
    • The old eqiad spec used raid0 of SSDs; the new codfw spec uses raid10 of SSDs. We likely want to go with raid10.
RobH mentioned this in Unknown Object (Task). Mar 9 2016, 6:10 PM

I'm not convinced by raid10 for elasticsearch. Elasticsearch itself provides redundancy (shards replicated across multiple nodes, multiple masters, ...). I usually think that multiple levels of HA covering similar failures is unnecessary complexity.

I've seen discussion on the number of masters we can lose while keeping a quorum, but this should probably be addressed by increasing the number of masters, not by increasing individual node stability.

I'm the new guy here, so I am most probably missing context, but I'd love to understand what's wrong in the above lines...
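
For the quorum point above, the usual arithmetic is floor(N/2)+1 for N master-eligible nodes, and the setting can be pushed dynamically. A rough sketch, with an assumed endpoint:

```python
# Sketch: compute and apply the quorum setting for N master-eligible nodes.
import requests

ES = "http://localhost:9200"   # assumed cluster endpoint
master_eligible = 3            # e.g. three master-eligible nodes

min_masters = master_eligible // 2 + 1   # 2 of 3 needed for a quorum
requests.put(f"{ES}/_cluster/settings", json={
    "persistent": {"discovery.zen.minimum_master_nodes": min_masters}
})
print(f"minimum_master_nodes set to {min_masters}")
```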

@chasemp do you remember why we used a different RAID setup for codfw? If I remember correctly, it was mostly because we did not find the perfect SSD size, so we decided to use big 800GB drives (sorry, can't remember the details).
T105707 might contain more info.

RobH added a subtask: Unknown Object (Task).Mar 10 2016, 4:25 PM

@Gehel: SSDs and HDDs still have some of the highest[1] failure rates of any hardware in the datacenter. If we RAID the OS disks (these will be hot-swap disks), a single disk failure no longer causes downtime on that host. The majority of the cluster is built this way, with only a few exceptions that I'm aware of: the eqiad elastic cluster, the restbase nodes, and the mw cluster.

That being said, the end decision will have to be approved by both Discovery and Operations (not by me getting quotes). To that end, the quotes I'm obtaining on T129381 will include pricing for both raid1 and raid0 options. That pricing will also include per-SSD pricing, so we can scale up any single node if needed.

[1]: I have no actual historical data for this, just my overall impression over time. We could pull some, since there are tickets for every hardware failure, but it would require quite a bit of data crunching.

Please note that the order for 16 systems was placed today on blocking task T129381. Since the blocking task is private, I wanted to provide a public update.

The current lead time for the order is 2 weeks; we expect it the week of April 11th.

Note: we want 3 elasticsearch nodes to be master-eligible, and we'd like them to be in different rows (for obvious reasons). @Gehel will check with @Cmjohnson when the servers arrive...
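
Once the new boxes are racked, something like the sketch below could verify that exactly three nodes are master-eligible and that they don't share a row. The endpoint is assumed, and the per-node "row" attribute is hypothetical (it depends on how the nodes end up being tagged):

```python
# Sketch: list master-eligible nodes and check they are spread across rows.
import requests

ES = "http://localhost:9200"   # assumed cluster endpoint

# 'master' column: '*' = current master, 'm' = master-eligible, '-' = not eligible.
nodes = requests.get(f"{ES}/_cat/nodes?h=name,master").text
eligible = [l.split()[0] for l in nodes.splitlines()
            if l.split() and l.split()[-1] in ("*", "m")]
print("master-eligible:", eligible)

# Hypothetical per-node "row" attribute; real attribute names may differ.
attrs = requests.get(f"{ES}/_cat/nodeattrs?h=node,attr,value").text
rows = {p[0]: p[2] for p in (l.split() for l in attrs.splitlines())
        if len(p) == 3 and p[1] == "row"}
used_rows = {rows.get(n) for n in eligible}
assert len(used_rows) == len(eligible), "some master-eligible nodes share a row"
```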

RobH claimed this task.

Systems arrived and have been racked. Resolving this hardware request.

RobH closed subtask Unknown Object (Task) as Resolved. Jul 11 2016, 5:31 PM