
Increase logging pipeline ingestion capacity
Open, MediumPublic

Description

Yesterday during an incident the logging pipeline became severely backlogged: input on Kafka was on the order of ~25k logs/s, while ingestion maxed out at ~4k/s for Logstash 5 and ~7.5k/s for Logstash 7. Normal volume is ~1.5k logs/s; for capacity purposes I think we should have at least 10x normal volume available.
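
For context, a back-of-the-envelope comparison of the figures above (treating the two clusters' observed ceilings as independent measurements):

```python
# Rough capacity math, using the approximate rates quoted above.
normal_rate = 1_500        # logs/s, normal volume
incident_input = 25_000    # logs/s, observed on Kafka during the incident
ingest_ls5 = 4_000         # logs/s, observed Logstash 5 ceiling
ingest_ls7 = 7_500         # logs/s, observed Logstash 7 ceiling

target = 10 * normal_rate
print(f"target ingest capacity: {target} logs/s")                               # 15000
print(f"combined observed ceiling: {ingest_ls5 + ingest_ls7} logs/s")            # 11500
print(f"shortfall vs incident input: {incident_input - ingest_ls5 - ingest_ls7}")  # 13500
```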

Possible solutions and their tradeoffs (non-exhaustive list):

  1. Add more Logstash+ES (i.e. "frontend") instances to ingest data. Easy in the short term (more VMs, or bare metal).
  2. Investigate the general performance of Logstash ingestion (cfr T215904: Better understanding of Logstash performance), including the performance of consuming from Kafka and producing to ES. Unknown complexity, but something to be done eventually.
  3. Investigate the ingestion performance of ES in isolation; in other words, writing to ES could be the bottleneck too.
  4. Throttle messages first, then ingest. MW is the biggest producer of logs and the one most prone to spamming. Right now we're throttling within Logstash and that has helped, however I suspect it kicks in too late in cases of log spam; applying the throttling before ingestion would yield better results. There's a bunch of unknown unknowns (e.g. does Logstash with only a throttle filter and Kafka input/output have sufficient performance?) but it could be a reasonable stopgap and generally applicable to logspam situations (see the sketch after this list).
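
As a strawman for option 4, here is a minimal Python sketch of a throttle stage sitting between Kafka topics (the task would more likely do this inside Logstash itself); the topic names, the `channel` field and the rate numbers are made up for illustration, and it assumes the kafka-python package:

```python
import json
import time
from collections import defaultdict

from kafka import KafkaConsumer, KafkaProducer  # kafka-python

RATE_LIMIT = 2000   # hypothetical per-channel budget, logs/s
BUCKET_SIZE = 4000  # hypothetical burst allowance

consumer = KafkaConsumer("raw-logs",                     # hypothetical input topic
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

tokens = defaultdict(lambda: BUCKET_SIZE)
last_refill = time.monotonic()

for record in consumer:
    # Refill every channel's bucket proportionally to elapsed time.
    now = time.monotonic()
    elapsed, last_refill = now - last_refill, now
    for key in list(tokens):
        tokens[key] = min(BUCKET_SIZE, tokens[key] + RATE_LIMIT * elapsed)

    channel = record.value.get("channel", "unknown")  # hypothetical field to throttle on
    if tokens[channel] >= 1:
        tokens[channel] -= 1
        producer.send("throttled-logs", record.value)  # hypothetical output topic
    # else: drop (or count) the message; Logstash only consumes the throttled topic
```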

Event Timeline

Restricted Application added a subscriber: Aklapper. · Fri, Jun 12, 8:29 AM
jbond triaged this task as Medium priority. · Fri, Jun 12, 9:48 AM
herron added a subscriber: herron. · Fri, Jun 12, 2:02 PM
fgiunchedi renamed this task from Increase Logstash ingestion capacity to Increase logging pipeline ingestion capacity. · Fri, Jun 12, 2:47 PM
fgiunchedi updated the task description.
fgiunchedi added a comment. · Edited · Mon, Jun 15, 12:20 PM

I ran some benchmarking on logstash1023 (i.e. elk7) to address point 2. I grabbed the logs from the timeframe of the incident from Kafka and ran Logstash in isolation with stdin / stdout encoded as json_lines, measuring throughput in lines/s. Logstash with all production filters enabled is able to do ~14k lines/s over 15M lines. Without filters (i.e. in "cat emulation" mode) we're at ~18k lines/s. Also, Logstash runs with nice 19 in production and I ran the tests without nice; we should consider restoring the default nice value IMHO.
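
For reference, a rough harness along the lines of the test above: feed the captured lines to a Logstash config that uses json_lines stdin/stdout and report lines/s. The paths and config filename are placeholders:

```python
import subprocess
import time

INPUT = "incident-logs.jsonl"  # placeholder: log lines captured from Kafka
CONFIG = "bench.conf"          # placeholder: production filters, json_lines stdin/stdout

# Count lines up front so we can report a rate.
with open(INPUT, "rb") as f:
    total = sum(1 for _ in f)

start = time.monotonic()
with open(INPUT, "rb") as f:
    # Wall-clock time includes JVM startup; negligible over ~15M lines.
    subprocess.run(["/usr/share/logstash/bin/logstash", "-f", CONFIG],
                   stdin=f, stdout=subprocess.DEVNULL, check=True)
elapsed = time.monotonic() - start

print(f"{total} lines in {elapsed:.0f}s -> {total / elapsed:.0f} lines/s")
```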

With Logstash unlikely to be the bottleneck I looked at Elasticsearch, and during the incident not all SSD hosts (logstash10[6789]) seemed to receive equal CPU load.

Indeed we're defaulting to a single shard per index, as per the logstash template. I think we're maxing out a single shard's write speed (~7k documents indexed per second, for logstash-mediawiki since it was the most active index). For the biggest index prefixes (logstash-<date> and logstash-mediawiki-<date>) at least, we should switch to four shards, which in theory should yield ~28k documents indexed per second per index on the ES side. We could also switch all indices to default to four shards, although the other indices are tiny in comparison and shards are not "free" in the sense that they are stored in the cluster state.
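
For illustration, what the shard bump amounts to on the ES side, sketched against the legacy index template API (the real change goes through the puppet-managed logstash template; the template name, pattern and replica count here are placeholders):

```python
import json
from urllib.request import Request, urlopen

# Illustrative settings: new daily indices matching the pattern get 4 shards.
template = {
    "index_patterns": ["logstash-*"],   # placeholder pattern
    "order": 1,
    "settings": {
        "index.number_of_shards": 4,    # was 1
        "index.number_of_replicas": 2,
    },
}

req = Request(
    "http://localhost:9200/_template/logstash-shards",  # hypothetical template name
    data=json.dumps(template).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(urlopen(req).read().decode())
```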

There were also reports of Kibana being unusable / slow. I suspect this was due both to ES being busy writing and to Kibana using the same ES instance that Logstash writes to (to be confirmed with more load tests on the whole pipeline).

Change 605602 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: align number of shards with number of ES indexing hosts

https://gerrit.wikimedia.org/r/605602

Change 605602 merged by Herron:
[operations/puppet@production] logstash: align number of shards with number of ES indexing hosts

https://gerrit.wikimedia.org/r/605602

Mentioned in SAL (#wikimedia-operations) [2020-06-17T08:10:14Z] <godog> stop logstash temporarily on logstash7 hosts to test increased es shards - T255243

Mentioned in SAL (#wikimedia-operations) [2020-06-17T08:30:56Z] <godog> start logstash on logstash7 - T255243

I stopped Logstash for 20 minutes first and 60 minutes later to test backlog processing. The sharding definitely worked and now all SSD hosts are receiving equal load.

In terms of documents indexed per second, there were about 7k per SSD host. The next test worth trying IMHO is pointing Logstash directly at all SSD hosts, bypassing the local Elasticsearch instance.
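
As a way to double-check the equal-load observation from the indexing side, a small sketch that samples the per-node indexing counters twice via the Elasticsearch nodes stats API (host and sampling interval are placeholders):

```python
import json
import time
from urllib.request import urlopen

ES = "http://localhost:9200"   # placeholder: any node in the cluster
INTERVAL = 60                  # seconds between samples

def index_totals():
    stats = json.load(urlopen(f"{ES}/_nodes/stats/indices/indexing"))
    return {n["name"]: n["indices"]["indexing"]["index_total"]
            for n in stats["nodes"].values()}

before = index_totals()
time.sleep(INTERVAL)
after = index_totals()

for name in sorted(after):
    rate = (after[name] - before.get(name, 0)) / INTERVAL
    print(f"{name}: {rate:.0f} docs indexed/s")
```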

Turns out this didn't make a whole lot of difference. I also did an empirical test with kibana-next, and Kibana didn't appear significantly slower or faster (in case there was contention on the local Elasticsearch, used by both Logstash and Kibana).

fgiunchedi added a comment. · Edited · Wed, Jun 17, 2:12 PM

To recap, my understanding so far is the following:

  1. We're seeing a maximum of ~6-7k documents indexed/s per SSD host (4 hosts total), with 2 replicas and 4 shards. It is unclear to me whether this is the aggregate indexing limit per host or per shard, and whether increasing the shards to, say, 8 would increase indexing speed further.
  2. In practice we have two big prefixes (logstash- and logstash-mediawiki) that receive many writes and thus benefit from multiple shards; for the other indices (deploy, restbase, syslog) there would be little benefit I think.
  3. Regardless of the point above, having a lot of shards in the cluster generally isn't optimal. We're sharding for indexing speed, so after a few days we'd like to shrink shards back to 1 or 2. The operation is essentially a reindex (curator can take care of that) and thus potentially expensive on the cluster. "A lot" in this context means ~20 shards per GB of heap (per Elastic's recommendations), thus 24 GB * 20 * 4 nodes = 1920 shards across all SSD hosts (see the sketch after this list).
  4. We're currently using 2 replicas per site; since data is replicated to codfw as well, we could use 1 replica per site instead. If catastrophic disruption ensues at a site (i.e. two hosts down) we can fail over to the other site.
  5. There are likely other optimizations (which?) we can do to increase indexing speed, most likely by looking at the indexed data and deciding on better types and/or not indexing some of it at all.
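
Spelling out the shard budget math from point 3, with the retention-before-shrink and prefix count as illustrative assumptions:

```python
HEAP_GB = 24          # ES heap per SSD host
HOSTS = 4             # SSD hosts per site
SHARDS_PER_GB = 20    # Elastic's rule-of-thumb upper bound

budget = HEAP_GB * SHARDS_PER_GB * HOSTS
print(f"shard budget across SSD hosts: {budget}")   # 1920

# Rough usage by the write-heavy prefixes, before shrinking back down.
PREFIXES = 2            # assumption: only logstash- and logstash-mediawiki- get 4 shards
SHARDS = 4
COPIES = 1 + 2          # primary + 2 replicas
DAYS_BEFORE_SHRINK = 7  # assumption: shrink back after a week

used = PREFIXES * SHARDS * COPIES * DAYS_BEFORE_SHRINK
print(f"shards used by the big prefixes before shrinking: {used}")  # 168
```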

Mentioned in SAL (#wikimedia-operations) [2020-06-17T15:28:37Z] <godog> temp bump logstash7 workers to 8 and temp stop logstash - T255243

Mentioned in SAL (#wikimedia-operations) [2020-06-18T09:29:47Z] <godog> temp stop logstash on elk7 to test 8 pipeline workers - T255243

Yesterday I asked the folks in -discovery on IRC and David suggested increasing the number of concurrent writers to Elasticsearch. Indeed, after moving to 8 Logstash workers, the documents indexed per second in Elasticsearch peaked at ~10k per SSD host. Peak input from Kafka was ~14k logs/s (aggregate, over three hosts) and a 1h backlog (at the "normal" rate of 1k logs/s) was cleared in ~8 minutes. I'll test with a 3h backlog next.
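
A quick sanity check of the backlog numbers above, using only the figures quoted in this comment:

```python
normal_rate = 1_000      # logs/s, "normal" input rate during the test
backlog_seconds = 3600   # 1h of stopped ingestion
drain_seconds = 8 * 60   # observed time to clear the backlog

backlog = normal_rate * backlog_seconds
# While draining, new logs keep arriving at the normal rate.
processed = backlog + normal_rate * drain_seconds
print(f"implied sustained throughput: {processed / drain_seconds:.0f} logs/s")  # ~8500
```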

fgiunchedi added a comment. · Edited · Thu, Jun 18, 4:19 PM

Results have been encouraging with a bigger backlog, peaking briefly at 15k logs/s overall submitted to ES and then decreasing. I also noticed that as the mediawiki topics began to be consumed, the rate of the related statsd output increased as well (obviously). This gave me the idea that the statsd output could be slow and backlogging the pipeline.

I ran perf trace record on logstash1023 and noticed a lot of sendto syscalls. It turns out the statsd output doesn't batch metrics (upstream issue), and I suspect the per-metric syscall / network overhead (even on localhost) becomes relevant at high rates.
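
For illustration, statsd accepts several newline-separated metrics per datagram, so batching cuts the number of sendto() calls by roughly the batch factor (as long as the payload stays under the MTU). A minimal sketch with made-up metric names and batch size:

```python
import socket

STATSD = ("127.0.0.1", 8125)
BATCH = 20  # hypothetical number of metrics per datagram

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_unbatched(metrics):
    # One sendto() per metric: roughly what the current output does.
    for m in metrics:
        sock.sendto(m.encode(), STATSD)

def send_batched(metrics):
    # Newline-separated metrics in one datagram: ~BATCH times fewer syscalls.
    for i in range(0, len(metrics), BATCH):
        payload = "\n".join(metrics[i:i + BATCH]).encode()
        sock.sendto(payload, STATSD)

metrics = ["logstash.rate.mediawiki:1|c"] * 1000  # made-up counter name
send_batched(metrics)
```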

Since there was still a little backlog to be processed, I stopped Logstash and removed the statsd output; now the Kafka input and the Elasticsearch output are running at roughly the same rate, which is encouraging (right-hand side of the graph).

With this knowledge I think next actions should be:

  1. Add one or two more VMs to consume logs
  2. Bump pipeline workers to 8
  3. Switch away from the statsd output; since we're simply counting things, perhaps we can reuse the built-in Logstash plugin metrics (see the sketch after this list)
  4. Keep the number of shards at 1 for old indices (or revert the shards change and measure throughput again)
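
On point 3 above: Logstash already exposes per-pipeline and per-plugin event counters via its monitoring API on port 9600, so a sketch of pulling counts from there instead of emitting statsd per event could look like this (the endpoint exists; what to do with the numbers downstream is left open):

```python
import json
from urllib.request import urlopen

# Logstash monitoring API, enabled by default on port 9600.
stats = json.load(urlopen("http://localhost:9600/_node/stats/pipelines"))

for name, pipeline in stats["pipelines"].items():
    events = pipeline["events"]
    print(f"pipeline {name}: in={events['in']} "
          f"filtered={events['filtered']} out={events['out']}")
```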

Change 606647 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: bump pipeline workers

https://gerrit.wikimedia.org/r/606647

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. · Fri, Jun 19, 1:38 PM

Change 606647 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: bump pipeline workers

https://gerrit.wikimedia.org/r/606647

fgiunchedi moved this task from Inbox to In progress on the observability board. · Mon, Jun 22, 3:07 PM
Dzahn added a subscriber: Dzahn.

There are 4 new ganeti VMs now, 2 in eqiad and 2 in codfw, in row D each. They are ready to be taken into production in T256443.

Thank you very much @Dzahn ! Appreciate it

Did a few more tests on this while working on T257294: Alert on no (or "few") logs indexed (was: No logs ingested in logstash7 since 2020-07-06 19:23). Bumping pipeline workers (to 20) and increasing batch size (to 256) didn't seem to make a whole lot of difference (with the statsd output turned off) in the logstash7 cluster.