Alert on logstash index failures on too many fields
Closed, Resolved (Public)

Description

This is a follow-up to T234564: Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array. Specifically, it is about getting alerted when we hit Elasticsearch's per-index field limits, which in turn usually indicates a "fields explosion" problem.
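For illustration, the kind of detection described above can be sketched as a small log scanner that counts field-limit rejections per index. This is only a sketch under assumptions: the regular expression, the sample log lines, and the function name `count_field_limit_failures` are hypothetical, and the exact wording of the Elasticsearch error varies between versions. The production implementation is the mtail program in operations/puppet, which is not reproduced here.

```python
import re
from collections import Counter

# Assumed message shape for a field-limit rejection. The wording differs
# across Elasticsearch versions, so treat this pattern as illustrative.
FIELD_LIMIT_RE = re.compile(
    r"Limit of total fields \[(?P<limit>\d+)\] "
    r"in index \[(?P<index>[^\]]+)\] has been exceeded"
)


def count_field_limit_failures(lines):
    """Return a Counter mapping index name -> number of field-limit rejections."""
    failures = Counter()
    for line in lines:
        match = FIELD_LIMIT_RE.search(line)
        if match:
            failures[match.group("index")] += 1
    return failures


# Hypothetical log lines, written for this example only.
sample_lines = [
    'Could not index event: "reason"=>"Limit of total fields [1000] in index [logstash-2019.11.05] has been exceeded"',
    'Could not index event: "reason"=>"Limit of total fields [1000] in index [logstash-2019.11.05] has been exceeded"',
    'Could not index event: "reason"=>"failed to parse field [response] of type [long]"',
]

failures_by_index = count_field_limit_failures(sample_lines)
```

A per-index count like this is what would be exported as a counter metric and alerted on; lines that fail for other reasons (such as the mapping conflict in the third sample line) are deliberately not counted here.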

Event Timeline

Change 548280 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: add logstash program

https://gerrit.wikimedia.org/r/548280

Change 548281 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: add mtail to logstash

https://gerrit.wikimedia.org/r/548281

Change 548280 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: add logstash program

https://gerrit.wikimedia.org/r/548280

Change 548281 merged by Filippo Giunchedi:
[operations/puppet@production] profile: add mtail to logstash

https://gerrit.wikimedia.org/r/548281

Change 548975 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: fix mtail::logs location for logstash role

https://gerrit.wikimedia.org/r/548975

Change 548975 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: fix mtail::logs location for logstash role

https://gerrit.wikimedia.org/r/548975

Change 550446 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: collect logstash mtail metrics

https://gerrit.wikimedia.org/r/550446

Change 550446 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: collect logstash mtail metrics

https://gerrit.wikimedia.org/r/550446

Change 550471 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: alert on indexing failures

https://gerrit.wikimedia.org/r/550471

Change 550640 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: export logstash ES index failure details

https://gerrit.wikimedia.org/r/550640

Change 550640 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: export logstash ES index failure details

https://gerrit.wikimedia.org/r/550640

Change 550471 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: alert on indexing failures

https://gerrit.wikimedia.org/r/550471

fgiunchedi claimed this task.

This is completed: surges of indexing errors will now result in an alert. Unfortunately, the thresholds are a little higher than I expected because of the background noise of errors/conflicts (tracked in T238196: Logging fields conflicts (tracking)).
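To make the threshold trade-off concrete, here is a minimal sketch of a rate-over-threshold check on a failure counter. All numbers, the function names `failure_rate` and `should_alert`, and the sample data are hypothetical; the actual alert is defined in operations/puppet on the Prometheus side and its thresholds are not quoted in this task. The sketch also ignores counter resets beyond clamping negative deltas to zero.

```python
def failure_rate(samples):
    """Per-second increase of a counter.

    `samples` is a list of (timestamp_seconds, counter_value) pairs,
    oldest first. Negative deltas (e.g. a counter reset) are clamped to 0.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return max(v1 - v0, 0) / (t1 - t0)


def should_alert(samples, threshold_per_second):
    """True when the observed failure rate exceeds the threshold."""
    return failure_rate(samples) > threshold_per_second


# Hypothetical data: a steady background of conflicts vs. a surge.
background = [(0, 100), (300, 400)]   # 300 failures in 300 s -> 1.0/s
surge = [(0, 100), (300, 3100)]       # 3000 failures in 300 s -> 10.0/s

# Hypothetical threshold, chosen above the background rate.
THRESHOLD = 2.0
```

With these made-up numbers, `should_alert(background, THRESHOLD)` is false and `should_alert(surge, THRESHOLD)` is true, which is why the threshold has to sit above the steady background rate until the underlying conflicts are reduced.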

Change 550678 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: move ingestion alerts to be site-local

https://gerrit.wikimedia.org/r/550678

Change 550678 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: move ingestion alerts to be site-local

https://gerrit.wikimedia.org/r/550678

Change 552492 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: lower threshold for logstash indexing failures

https://gerrit.wikimedia.org/r/552492

Change 552492 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: lower threshold for logstash indexing failures

https://gerrit.wikimedia.org/r/552492