
elasticsearch new servers (5x eqiad / 12x codfw)
Closed, Resolved · Public

Description

The goal is to increase the size of both the eqiad and codfw clusters to 36 nodes each. We want to keep the specs as close as possible to the current servers to ensure uniform load.

Specs:
CPU: Dual Intel(R) Xeon(R) CPU E5-2640 v3
Disk: 800GB raw raided space (2x 800GB SSD RAID1 or similar, software RAID is fine)
RAM: 128GB

Number of servers:
eqiad: 36-31 = 5
codfw: 36-24 = 12
numbers are: [desired size of cluster] - [current size of cluster] = [number of servers to add]
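
A small illustrative sketch of that arithmetic (values as quoted above, Python purely for clarity):

```
# Sketch of the server-count arithmetic above (values are the ones quoted in this task).
DESIRED_CLUSTER_SIZE = 36
CURRENT_CLUSTER_SIZE = {"eqiad": 31, "codfw": 24}

for dc, current in CURRENT_CLUSTER_SIZE.items():
    to_add = DESIRED_CLUSTER_SIZE - current
    print(f"{dc}: {DESIRED_CLUSTER_SIZE} - {current} = {to_add} servers to add")
# eqiad: 36 - 31 = 5 servers to add
# codfw: 36 - 24 = 12 servers to add
```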

Event Timeline

Gehel created this task. Oct 25 2016, 1:23 PM
Restricted Application added a subscriber: Aklapper.
debt triaged this task as High priority. Oct 25 2016, 5:22 PM

We're still sorting out the budget for this request and the slightly related T148747.

MaxSem moved this task from Needs triage to Ops on the Discovery board. Oct 26 2016, 9:22 PM
Gehel assigned this task to RobH. Nov 7 2016, 6:36 PM
Gehel added a project: hardware-requests.
RobH created subtask Unknown Object (Task). Nov 9 2016, 9:52 PM
RobH mentioned this in Unknown Object (Task).
RobH changed the status of subtask Unknown Object (Task) from Open to Stalled.
RobH reassigned this task from RobH to Gehel. Nov 9 2016, 10:03 PM
RobH added a subscriber: RobH.

Some questions regarding the Elasticsearch specifications:

It should be noted that the Intel(R) Xeon(R) CPU E5-2640 v3 is now a previous-generation CPU; the current model is the Intel(R) Xeon(R) CPU E5-2640 v4. The v3 is 2.6GHz with 8 cores, the v4 is 2.4GHz with 10 cores. This may require a memory upgrade to maintain the memory-to-core ratio currently in use on the Elasticsearch cluster.

If load cannot be weighted per system, other approaches to balancing may be needed, since the new systems won't be identical to the existing ones.

The Intel® Xeon® Processor E5-4655 v4 (30M Cache, 2.50 GHz) has 8 cores, but it is likely more expensive. I'll get it quoted, but expect that it won't be a viable option.

The current systems have 2 [cpu] * 8 [cores per cpu] * 2 [hyperthreading] = 32 presented cores

128 GB RAM / 32 = 4GB per presented core

To maintain this for the new revision 4 CPU, we'll need to bring the memory up accordingly.

rev4 is 2 [cpu] * 10 [cores per cpu] * 2 [hyperthreading] = 40 presented cores

4GB * 40 = 160GB

RAM would only balance out at 192GB (160GB is not a balanced memory configuration), which works out to 4.8GB per core.
If we want to keep these at 128GB, that lowers the allocation of memory per presented core to:

128 / 40 = 3.2GB
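
A minimal sketch of the ratio arithmetic above, assuming the quoted v3/v4 core counts and hyperthreading enabled (illustrative only):

```
# Sketch of the memory-per-presented-core comparison above.
# Core counts and RAM sizes are the ones quoted in this comment.

def presented_cores(sockets: int, cores_per_cpu: int, hyperthreading: int = 2) -> int:
    """Cores presented to the OS: sockets * physical cores * HT threads per core."""
    return sockets * cores_per_cpu * hyperthreading

v3 = presented_cores(2, 8)    # E5-2640 v3: 2 * 8 * 2 = 32
v4 = presented_cores(2, 10)   # E5-2640 v4: 2 * 10 * 2 = 40

print(f"current: 128 GB / {v3} cores = {128 / v3:.1f} GB per presented core")   # 4.0
for ram_gb in (128, 160, 192):
    print(f"rev4: {ram_gb} GB / {v4} cores = {ram_gb / v4:.1f} GB per presented core")
# rev4: 128 GB / 40 cores = 3.2 GB per presented core
# rev4: 160 GB / 40 cores = 4.0 GB per presented core
# rev4: 192 GB / 40 cores = 4.8 GB per presented core
```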

So the questions are:

  • Can we accept that these new systems will have a higher core count than the old ones?
  • Do we need to maintain 4 GB per core, or is 3.2GB acceptable?

Chatted with @Gehel about this, updating this task and assigning to him for feedback. Please assign back to me once feedback is added.

Thanks!

RobH added a comment. Nov 9 2016, 10:11 PM

I neglected to ask:

  • If the increased core count leads to higher utilization on the new systems, will they need any more SSD/storage capacity for their increased workloads?

Memory per core count isn't a big deal for our Elasticsearch cluster; for the most part we are using memory as a giant disk cache. There is a limit of 30.5G of Java heap for a single Elasticsearch process, so the additional cores will still share the same Java heap. @Gehel or @dcausse may be able to comment more, but typically we have 10-15G of static heap usage, and the rest of the heap is used to buffer how often we need to do a big GC. Under load we still only do an old GC every 30min to 1hr.

The short answer there is, I don't think more cores will require more memory.

The next question would be whether increasing memory on the machines would help keep more hot data in memory rather than going out to disk. After an expansion to 35 machines per cluster, the average machine will have ~266GB of data. It looks like our busiest servers are pulling about 10MB/s from disk to handle queries. If the price difference is fairly small, 192GB wouldn't hurt, but I wouldn't call it any kind of requirement.
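
As a rough illustration of that trade-off, a back-of-the-envelope sketch (the 30.5G heap ceiling and ~266GB/node figure are the ones quoted in this comment; OS and other overhead are ignored, so treat the percentages as indicative only):

```
# Extra RAM beyond the capped Java heap mostly buys OS page cache for the index
# files on disk. Figures are taken from this comment; overhead is ignored.

HEAP_GB = 30.5            # practical per-process heap ceiling mentioned above
DATA_PER_NODE_GB = 266    # rough average data per node after the expansion

for ram_gb in (128, 192):
    page_cache_gb = ram_gb - HEAP_GB
    cached_fraction = page_cache_gb / DATA_PER_NODE_GB
    print(f"{ram_gb} GB RAM -> ~{page_cache_gb:.1f} GB page cache "
          f"(~{cached_fraction:.0%} of per-node data)")
# 128 GB RAM -> ~97.5 GB page cache (~37% of per-node data)
# 192 GB RAM -> ~161.5 GB page cache (~61% of per-node data)
```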

There was also a question on IRC about whether more cores would end up needing more disk space. For reference, current cluster disk usage:

|               | current | upgrade | after upgrade |
| eqiad used    | 9358G   |         | 9358G         |
| eqiad avail   | 9317G   | 5*800G  | 13317G        |
| eqiad percent | 51%     |         | 41%           |
| codfw used    | 9319G   |         | 9319G         |
| codfw avail   | 7595G   | 12*800G | 17195G        |
| codfw percent | 55%     |         | 35%           |

If anything, the newer 800G disks give us far more space than we need (but if I remember correctly, last time the price difference between 800G and smaller disks was almost non-existent). No change will be necessary there.
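
A quick sketch that reproduces the "after upgrade" column of the table above (all figures taken from the table; percentages are used / (used + avail)):

```
# Reproduces the after-upgrade disk numbers from the table above.
clusters = {
    "eqiad": {"used_gb": 9358, "avail_gb": 9317, "new_nodes": 5},
    "codfw": {"used_gb": 9319, "avail_gb": 7595, "new_nodes": 12},
}
DISK_PER_NEW_NODE_GB = 800  # raw raided space per new server

for name, c in clusters.items():
    avail_after = c["avail_gb"] + c["new_nodes"] * DISK_PER_NEW_NODE_GB
    pct_after = c["used_gb"] / (c["used_gb"] + avail_after)
    print(f"{name}: {c['avail_gb']}G + {c['new_nodes']}*{DISK_PER_NEW_NODE_GB}G = "
          f"{avail_after}G avail, ~{pct_after:.0%} used after upgrade")
# eqiad: 9317G + 5*800G = 13317G avail, ~41% used after upgrade
# codfw: 7595G + 12*800G = 17195G avail, ~35% used after upgrade
```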

EBernhardson reassigned this task from Gehel to RobH. Nov 9 2016, 10:50 PM

Re-reading this after some sleep, it seems that @EBernhardson nailed all the answers (as always). We can always find ways to increase the memory allocated to Elasticsearch (for example by running multiple instances on the same server), but the increased complexity is probably not worth the trouble. Nothing to add...

Deskana changed the task status from Open to Stalled. Nov 15 2016, 6:11 PM
Deskana added a subscriber: Deskana.

Quotes are being obtained. Marking as stalled.

RobH changed the status of subtask Unknown Object (Task) from Stalled to Open. Nov 15 2016, 7:19 PM
RobH closed this task as Resolved. Jan 19 2017, 11:35 PM

All systems for this have been ordered. The codfw systems are in place, and the eqiad systems will be racked and installed via T154251.

RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 7:52 PM