
"shards failed" error while loading the "varnish webrequest 50x" dashboard in logstash-next
Open, Needs Triage, Public

Description

Currently experiencing the error below while loading https://logstash-next.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X

86 of 520 shards failed

The data you are seeing might be incomplete or wrong.

The details for one of the failed shards are, for example:

Shard: 0
Index: logstash-syslog-2020.06.06
Node: nPHFBhPEQRuKR9OhEnY7Ew
Type: circuit_breaking_exception
Reason: [parent] Data too large, data for [indices:data/read/search[phase/query]] would be [25268422696/23.5gb], which is larger than the limit of [24481313587/22.7gb], real usage: [25268421360/23.5gb], new bytes reserved: [1336/1.3kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=2672/2.6kb, accounting=1051594784/1002.8mb]
Bytes wanted: 25268422696
Bytes limit: 24481313587
Durability: PERMANENT
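
For context, assuming the stock Elasticsearch 7 defaults (real-memory parent breaker with indices.breaker.total.limit at 95% of heap), the 24481313587-byte limit in the error is exactly 95% of a 24 GiB heap, which points at overall heap pressure rather than a single oversized request. A quick sketch of the arithmetic:

```python
# Sketch of the arithmetic above. Assumes the stock ES 7 defaults:
# indices.breaker.total.use_real_memory=true, indices.breaker.total.limit=95%.
ERROR_BYTES_LIMIT = 24481313587  # "Bytes limit" reported by the failed shard
GIB = 1024 ** 3

for heap_gib in (24, 26, 32):
    limit = int(heap_gib * GIB * 0.95)
    note = "  <- matches the reported limit" if limit == ERROR_BYTES_LIMIT else ""
    print(f"{heap_gib} GiB heap -> parent breaker limit {limit} bytes{note}")
```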

Event Timeline


Change 617526 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash7: increase SSD tier JVM heap to 32G

https://gerrit.wikimedia.org/r/617526

herron added a subscriber: herron. · Thu, Jul 30, 7:22 PM

I was able to reproduce these errors as well, although not 100% of the time. Reading up on this error suggests that increasing the JVM heap should help. We currently run a 24G heap, sized to fit the memory of the "HDD" ES nodes, but the "SSD" nodes have more memory available and could support a larger heap. Uploaded https://gerrit.wikimedia.org/r/617526 to raise the SSD tier heap to 32G.
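
If it helps, here is a rough sketch (host/port are placeholders, not the real cluster endpoints) of checking how close each node's parent breaker is to its limit via the standard _nodes/stats/breaker API, to compare before and after the heap bump:

```python
# Hedged sketch: query per-node circuit breaker stats from Elasticsearch.
# The ES URL below is a placeholder; point it at an actual logstash ES node.
import requests

ES = "http://localhost:9200"  # placeholder endpoint

stats = requests.get(f"{ES}/_nodes/stats/breaker", timeout=10).json()
for node_id, node in stats["nodes"].items():
    parent = node["breakers"]["parent"]
    print(f"{node.get('name', node_id)}: "
          f"estimated={parent['estimated_size']} "
          f"limit={parent['limit_size']} "
          f"tripped={parent['tripped']}")
```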

fgiunchedi moved this task from Inbox to In progress on the observability board. · Fri, Jul 31, 12:33 PM

Change 617526 merged by Herron:
[operations/puppet@production] logstash7: increase SSD tier JVM heap to 32G

https://gerrit.wikimedia.org/r/617526

Change 619032 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: increase 'hdd' hosts heap from 24G to 26G

https://gerrit.wikimedia.org/r/619032

Saw this same "Data too large, data for..." error also affecting shard allocation on the HDD hosts yesterday. Manually bumping the heap on the eqiad HDD hosts from 24G to 26G and issuing a /_cluster/reroute?retry_failed=true cleared it. Uploaded https://gerrit.wikimedia.org/r/619032 to persist the setting (and to deploy it to codfw).
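
For reference, a minimal sketch (endpoint placeholder) of that recovery step, using the documented retry_failed flag on the cluster reroute API to retry the shard allocations that previously failed:

```python
# Hedged sketch: retry previously failed shard allocations after the heap bump.
# The ES URL below is a placeholder; point it at an eqiad 'hdd' tier node.
import requests

ES = "http://localhost:9200"  # placeholder endpoint

resp = requests.post(f"{ES}/_cluster/reroute",
                     params={"retry_failed": "true"}, timeout=30)
resp.raise_for_status()
print("acknowledged:", resp.json()["acknowledged"])
```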

Changes LGTM so far, thanks! Is it known what's driving the huge responses / data load? I'm assuming it's one or a few Kibana queries that we should avoid or restrict; what data are they trying to load (particular fields, perhaps)?

Change 619032 merged by Herron:
[operations/puppet@production] logstash: increase 'hdd' hosts heap from 24G to 26G

https://gerrit.wikimedia.org/r/619032