
Estimate hardware requirements for Toolforge logging elastic cluster
Status: Stalled · Priority: Medium · Visibility: Public

Description

This task tracks estimating the specs of the Elasticsearch servers needed for logging on Toolforge.

If we estimate:

  1. 1000 active tools
  2. with logs being kept for 30 days
  3. an upper bound of 64MB of logs per tool per day

That works out to roughly 1.8TiB of total space on an ongoing basis in the Elasticsearch cluster, excluding replicas.
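A quick back-of-the-envelope check of that figure (a sketch of the arithmetic only; assumes binary units and no replicas):

```python
# Back-of-the-envelope storage estimate for the Toolforge logging cluster,
# using the three assumptions from the task description.
ACTIVE_TOOLS = 1000          # assumption 1
RETENTION_DAYS = 30          # assumption 2
MIB_PER_TOOL_PER_DAY = 64    # assumption 3 (upper bound)

total_mib = ACTIVE_TOOLS * RETENTION_DAYS * MIB_PER_TOOL_PER_DAY
total_tib = total_mib / 1024**2  # MiB -> TiB

print(f"{total_tib:.2f} TiB")  # ~1.83 TiB, i.e. the ~1.8T in the description
```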

Event Timeline

yuvipanda renamed this task from Hardware for Tool Labs logging elastic cluster to Estimate hardware requirements for Tool Labs logging elastic cluster.Feb 18 2016, 9:28 PM
yuvipanda updated the task description.

Logs will eat all the disk we can give them; that's just the nature of logging. With ElasticSearch as a backing store we can spread things out over multiple physical machines so that each machine only holds a portion of the total log volume: with one replica per shard, each host only needs to hold 2M/N of the total, where M is the desired capacity and N is the number of nodes in the cluster.
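Concretely, the 2M/N rule of thumb works out like this (a sketch with hypothetical node counts, assuming one replica per shard, which is where the factor of 2 comes from):

```python
# Per-node disk needed to hold 2M/N, where M is the total log volume and
# N is the cluster size; the factor of 2 covers one replica per shard.
M_TIB = 1.8  # desired capacity from the task description

for n_nodes in (3, 6, 9):  # hypothetical cluster sizes
    per_node_tib = 2 * M_TIB / n_nodes
    print(f"{n_nodes} nodes -> {per_node_tib:.2f} TiB per node")
```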

ElasticSearch is not going to be happy with 30K (1K tools * 30 days) distinct indices. If we lump all of the tools into a single index per-day (basically what we do with ELK in production) then we can keep a lid on that, but we would lose the ability to easily drop logs for a tool that was out of control.
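For reference, the ELK-style one-index-per-day scheme amounts to something like this (a sketch; the index name prefix and helper names are hypothetical, and real deployments usually drive this with a tool like Curator):

```python
from datetime import date, timedelta

PREFIX = "toolforge-logs"   # hypothetical index name prefix
RETENTION_DAYS = 30

def index_for(day: date) -> str:
    """Daily index name in the usual Logstash style, e.g. toolforge-logs-2016.02.18."""
    return f"{PREFIX}-{day:%Y.%m.%d}"

def expired_indices(today: date, existing: list[str]) -> list[str]:
    """Indices older than the retention window. Dropping a whole index is cheap
    compared with deleting individual documents, but it drops every tool's logs
    for that day together -- hence the lost per-tool control mentioned above."""
    cutoff_name = index_for(today - timedelta(days=RETENTION_DAYS))
    # Zero-padded dates sort lexicographically, so a string compare suffices.
    return sorted(name for name in existing if name < cutoff_name)
```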

@bd808 can we instead have one index per tool and then just limit the size of that index? maybe a max of 1G per tool and then things start getting auto-dropped...


Deleting old logs without doing index rotation is not the greatest thing for ElasticSearch performance. I wrote a lot about this on another task but I can't find it now. The TL;DR is that the Lucene data is append only and "deleting" means flagging a record as deleted and then at some point rebuilding the shard with the deleted records removed. ElasticSearch manages the mechanics of that for you but it requires disk space and iops to accomplish.

Having something like ElasticSearch backing the log storage gives a lot of neat benefits, but ultimately we just need to separate log shipping from log storage (so we don't use NFS for both) and create a way for tool maintainers to get at their log streams. Having a horizontally sharded pool of servers that act like fluorine does in production with a nice shipping solution may really be all we need.

I imagine most of these will be relatively low volume, as in <5GB per tool per month, so we might be best off using a single index per tool. 1k indices with 1 primary and 1 replica each is probably something elasticsearch can support fairly reasonably.
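As a sanity check on the shard count that implies (a sketch; one primary and one replica per tool index, with hypothetical cluster sizes):

```python
# Total shard count for the one-index-per-tool layout proposed above.
TOOLS = 1000
PRIMARIES_PER_INDEX = 1
REPLICAS_PER_INDEX = 1

total_shards = TOOLS * PRIMARIES_PER_INDEX * (1 + REPLICAS_PER_INDEX)

for n_nodes in (3, 6):  # hypothetical cluster sizes
    print(f"{n_nodes} nodes -> ~{total_shards // n_nodes} shards per node")
```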

1 index per day per tool would be a non-starter without having multiple independent elasticsearch clusters.

Related to bd808's comments: I'm not as worried about deleting 3% of an index per day. In the prod search cluster we run a deleted-documents rate of ~30% without major issues. It might require some testing, but my intuition is that 3%/day is minimal.


A lot of this depends on how the segments in the shards are tuned too, so I may be overreacting (indices are composed of shards, shards are composed of segments, and the segments are the append-only bit). We can have fairly large shards and still manage not to eat up too much space during compaction if the segment size is reasonable (say <2G).

I've got battle scars from $DAYJOB-1 and an initial design of our ElasticSearch cluster that melted at 40% disk utilization, so I'm probably erring on the side of extreme caution.

bd808 lowered the priority of this task from High to Medium.Nov 27 2018, 11:40 PM
Legoktm renamed this task from Estimate hardware requirements for Tool Labs logging elastic cluster to Estimate hardware requirements for Toolforge logging elastic cluster.Nov 28 2018, 12:22 AM
Legoktm removed a project: Cloud-Services.
Legoktm updated the task description.
Bstorm changed the task status from Open to Stalled.Oct 16 2020, 6:19 PM
Bstorm moved this task from Inbox to Graveyard on the cloud-services-team (Kanban) board.
Bstorm added a subscriber: Bstorm.

This solutioning task is old enough that its contents are no longer accurate. I'll leave it alive, perhaps to be edited later for history, but for now it's stalled and going in the graveyard, I think.