
Estimate hardware requirements for Toolforge logging elastic cluster
Closed, Declined (Public)

Description

This task is to do some estimation for specs of elasticsearch servers needed for logging on toolforge.

If we estimate:

  1. 1000 active tools
  2. with logs being kept for 30 days
  3. Upper bounding to 64MB of logs per tool per day

That works out to roughly 1.8 TiB of primary log data (1,000 tools × 64 MB/day × 30 days ≈ 1.9 TB) that the elasticsearch cluster would need to hold on an ongoing basis.
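As a sanity check on that figure, a minimal sketch of the arithmetic (all inputs are the assumptions listed above, not measurements):

```python
# Back-of-the-envelope check of the estimate above.
ACTIVE_TOOLS = 1000        # assumption 1
RETENTION_DAYS = 30        # assumption 2
MB_PER_TOOL_PER_DAY = 64   # assumption 3 (upper bound)

total_mb = ACTIVE_TOOLS * RETENTION_DAYS * MB_PER_TOOL_PER_DAY
print(f"~{total_mb / 1e6:.1f} TB (~{total_mb / 1024**2:.1f} TiB) of primary log data")
# -> ~1.9 TB (~1.8 TiB)
```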

Event Timeline

yuvipanda renamed this task from Hardware for Tool Labs logging elastic cluster to Estimate hardware requirements for Tool Labs logging elastic cluster. Feb 18 2016, 9:28 PM
yuvipanda updated the task description.

Logs will eat all the disk we can give them. That's just the nature of logging. With ElasticSearch as a backing store we can spread things out over multiple physical machines so that each one only holds a portion of the total log volume locally: with one replica per shard, each host only needs to hold roughly 2M/N of the total, where M is the desired capacity and N is the number of nodes in the cluster.
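A sketch of that per-host math, assuming the factor of 2 comes from one replica per shard and that data balances evenly across nodes; the node count and headroom factor below are illustrative, not decided:

```python
def disk_per_node_gb(total_primary_gb: float, nodes: int,
                     replicas: int = 1, headroom: float = 1.5) -> float:
    """Rough per-node disk need: (1 + replicas) * M / N, padded with some
    headroom for segment merges and growth. All inputs are assumptions."""
    return (1 + replicas) * total_primary_gb / nodes * headroom

# e.g. the ~1.8 TiB estimate from the description spread over a hypothetical 6-node cluster
print(f"~{disk_per_node_gb(1875, 6):.0f} GB of disk per node")   # ~938 GB
```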

ElasticSearch is not going to be happy with 30K (1K tools * 30 days) distinct indices. If we lump all of the tools into a single index per-day (basically what we do with ELK in production) then we can keep a lid on that, but we would lose the ability to easily drop logs for a tool that was out of control.
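To make that trade-off concrete: with shared per-day indices, expiring old logs is a cheap index drop, but silencing one misbehaving tool becomes a delete-by-query. A rough sketch against the plain REST API, assuming a release that ships _delete_by_query and a hypothetical toolforge-logs-YYYY.MM.DD naming scheme with a `tool` field on each document:

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint

# Cheap path: per-day rotation lets us expire a whole day by dropping its index.
requests.delete(f"{ES}/toolforge-logs-2016.02.18")

# Expensive path: removing one tool's documents from a shared daily index only
# marks them deleted; the space comes back later via segment merges.
requests.post(
    f"{ES}/toolforge-logs-2016.02.19/_delete_by_query",
    json={"query": {"term": {"tool": "some-runaway-tool"}}},
)
```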

@bd808 can we instead have one index per tool and then just limit the size of that index? maybe a max of 1G per tool and then things start getting auto-dropped...
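A hedged sketch of what such a cap could look like: poll the per-tool index size via the stats API and tighten its retention window once it passes a threshold. The `tool-<name>` naming, `@timestamp` field, and 1 GB limit are all illustrative:

```python
import requests

ES = "http://localhost:9200"    # placeholder endpoint
MAX_BYTES = 1 * 1024**3         # hypothetical 1 GB cap per tool

def over_cap(tool: str) -> bool:
    """Check a tool's primary store size against the cap (sketch only)."""
    index = f"tool-{tool}"      # hypothetical one-index-per-tool naming
    stats = requests.get(f"{ES}/{index}/_stats/store").json()
    return stats["indices"][index]["primaries"]["store"]["size_in_bytes"] > MAX_BYTES

def shed_oldest(tool: str, keep_days: int) -> None:
    """Crudely shed size by deleting anything older than keep_days; note this
    only flags documents as deleted -- disk comes back after segment merges."""
    requests.post(
        f"{ES}/tool-{tool}/_delete_by_query",
        json={"query": {"range": {"@timestamp": {"lt": f"now-{keep_days}d/d"}}}},
    )
```

As the next comment points out, though, that delete pattern has a real merge/compaction cost, which is the main argument against it.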

Deleting old logs without doing index rotation is not the greatest thing for ElasticSearch performance. I wrote a lot about this on another task but I can't find it now. The TL;DR is that the Lucene data is append only and "deleting" means flagging a record as deleted and then at some point rebuilding the shard with the deleted records removed. ElasticSearch manages the mechanics of that for you but it requires disk space and iops to accomplish.
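For reference, that rebuild can also be triggered explicitly; a minimal sketch assuming a version that exposes the force-merge API (earlier releases call the same operation _optimize):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
# Ask Lucene to rewrite only the segments that carry deleted documents; this is
# exactly the disk-and-iops cost described above, just invoked by hand.
requests.post(f"{ES}/tool-somename/_forcemerge",   # hypothetical per-tool index name
              params={"only_expunge_deletes": "true"})
```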

Having something like ElasticSearch backing the log storage gives a lot of neat benefits, but ultimately we just need to separate log shipping from log storage (so we don't use NFS for both) and create a way for tool maintainers to get at their log streams. Having a horizontally sharded pool of servers that act like fluorine does in production with a nice shipping solution may really be all we need.

I imagine most of these will be relatively low volume, as in <5 GB per tool per month, so we might be best off using a single index per tool. 1k indices with 1 primary and 1 replica each is probably something elasticsearch can support fairly reasonably.
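A minimal sketch of that layout as an index template, using the legacy _template API of that era; the `tool-*` naming pattern is an assumption carried over from the sketches above:

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
# One primary shard plus one replica per tool index keeps the cluster at
# roughly 2k shards for ~1k tools, which a modest cluster can reasonably carry.
requests.put(f"{ES}/_template/tool-logs", json={
    "template": "tool-*",      # hypothetical one-index-per-tool naming
    "settings": {"number_of_shards": 1, "number_of_replicas": 1},
})
```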

1 index per day per tool would be a non-starter without having multiple independent elasticsearch clusters.

Related to bd808's comments, I'm not as worried about deleting 3% of an index per day. In the prod search cluster we run a deleted-documents rate of ~30% without major issues. It might require some testing, but my intuition is 3%/day is minimal.

A lot of this depends on how the segments in the shards are tuned too, so I may be overreacting (indices are composed of shards, shards are composed of Lucene segments, and the segments are the append-only bit). We can have fairly large shards and still manage to not eat up too much space during compaction if the segment size is kept reasonable (say <2G).
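If it helps make that concrete, the knob in question maps to the merge policy's maximum merged-segment size, which is a per-index setting; a hedged sketch of pinning it near the ~2G figure mentioned:

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
# Cap how large a merged segment may grow so compacting away deletes never has
# to rewrite enormous segments in one go (the default is larger than 2gb).
requests.put(f"{ES}/tool-somename/_settings", json={   # hypothetical index name
    "index.merge.policy.max_merged_segment": "2gb",
})
```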

I've got battle scars from $DAYJOB-1 and an initial design of our ElasticSearch cluster that melted at 40% disk utilization, so I'm probably erring on the side of extreme caution.

bd808 lowered the priority of this task from High to Medium. Nov 27 2018, 11:40 PM
Legoktm renamed this task from Estimate hardware requirements for Tool Labs logging elastic cluster to Estimate hardware requirements for Toolforge logging elastic cluster. Nov 28 2018, 12:22 AM
Legoktm removed a project: Cloud-Services.
Legoktm updated the task description.
Bstorm changed the task status from Open to Stalled. Oct 16 2020, 6:19 PM
Bstorm moved this task from Inbox to Graveyard on the cloud-services-team (Kanban) board.
Bstorm subscribed.

This solutioning task is old enough that its contents are no longer accurate. I'll leave it alive so it can perhaps be edited later for the historical record, but for now it's stalled and going into the graveyard, I think.

We should have closed this years ago, when we figured out that the FOSS ELK stack was not suitably multi-tenant (an open-core paywall blocks the needed authz features).