The Logstash cluster currently consists of three 'misc'-class nodes, to which 2x3TB disks in RAID0 have been added. These machines were allocated while the service was in development, and disks were added as the service was adopted and volume increased. However, we are quickly outgrowing the current configuration as more event sources are added. RAM appears to be the current bottleneck.
Now that Logstash has proved its usefulness, let's purchase equipment explicitly spec'd for the task with sufficient capacity for future growth.
Factors:
* Elasticsearch replicates data between nodes, so redundant storage is not required on individual nodes. For comparison, elastic10xx nodes also use RAID0.
* Logstash nodes have 16GB RAM and have had trouble with Elasticsearch OOM failures. For comparison, elastic10xx nodes have 96GB RAM.
* The daily Logstash indices are currently ~30GB/day. Retained for 30 days, that comes to ~900GB of storage required per node.
* It is unclear how much our storage requirements will increase based on desired additional logging, or even whether all potential log sources have been enumerated.
* For scalability and clarity, we may wish to divide the service among nodes by task:
** Redis nodes for an input message queue
** Logstash nodes for ingest
** Elasticsearch master nodes for cluster management which store no data
** Elasticsearch data nodes for data storage
** Kibana web frontend service nodes
* Some roles might be combined onto a common set of hosts, such as Redis + Logstash
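
To make the storage factor above easier to revisit as sources are added, here is a rough sketch of the arithmetic. Only the 30GB/day and 30-day figures come from this task; `growth_factor` and `headroom` are hypothetical knobs added for illustration, and the example values passed to them are placeholders, not estimates.

```python
def required_storage_gb(daily_index_gb, retention_days,
                        growth_factor=1.0, headroom=0.0):
    """Estimate storage (GB) needed to retain daily indices.

    growth_factor and headroom are illustrative assumptions:
    growth_factor scales today's daily volume for new event sources,
    headroom adds a fractional safety margin on top.
    """
    base = daily_index_gb * retention_days * growth_factor
    return base * (1.0 + headroom)


# Figures from this task: ~30GB/day retained for 30 days.
current = required_storage_gb(30, 30)

# Placeholder scenario: volume doubles, plus a 25% margin.
projected = required_storage_gb(30, 30, growth_factor=2.0, headroom=0.25)

print(f"current:   {current:.0f} GB")
print(f"projected: {projected:.0f} GB")
```

Once the future sources are enumerated, replacing the placeholder `growth_factor` with a measured or estimated value should give a more defensible disk size to request.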
Current event sources:
* Apache2
* HHVM
* MediaWiki
* scap
* Job queue runner
* Hadoop
* OCG
* Parsoid
Future sources:
* Syslog from all nodes (includes Puppet)
* ZooKeeper (T84908)
* Kafka (T84907)
* Icinga
* Any local logs which may be tailed
* ?
I request help from others to identify all additional event sources we wish to store, and to determine appropriate host specs and the number of hosts.