
Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call
Closed, ResolvedPublic

Description

We are currently in the middle of an Elasticsearch to OpenSearch migration. As part of the migration, some alerts are going to be triggered (in particular, unassigned shards and settings checks). Those create noise for the on-call SREs, making their lives more complicated than they should be. We should make sure those alerts are routed to Data Platform SRE (and not the larger SRE teams) and make sure that problematic alerts are downtimed when they are expected.

Problematic alerts

  • ElasticSearch unassigned shard check
  • OpenSearch unassigned shard check
  • ElasticSearch settings check

AC

  • Elasticsearch and OpenSearch alerts that are relevant only to DPE SRE are routed to the DPE SRE team, not to the larger SRE group
  • During migration, noisy alerts are downtimed appropriately
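The routing change in the first AC could look something like this in the Alertmanager configuration. This is only a sketch: the receiver names (`sre-default`, `data-platform-sre`) and the `team` label value are assumptions for illustration, not the actual Wikimedia config.

```yaml
# Sketch of label-based routing in Alertmanager.
# Assumes Elasticsearch/OpenSearch alerts carry a "team" label and that a
# "data-platform-sre" receiver is defined elsewhere in the config.
route:
  receiver: sre-default            # hypothetical catch-all receiver for SRE
  routes:
    - match:
        team: data-platform        # alerts labelled for DPE SRE...
      receiver: data-platform-sre  # ...go to the DPE SRE receiver only
```

With a matching sub-route like this, labelled alerts stop falling through to the catch-all receiver, which is what keeps them off the wider SRE board.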

Event Timeline

Gehel triaged this task as High priority.May 19 2025, 9:10 AM

I've downtimed 30 or so alerts on @Gehel's suggestion for 48 hours (this only delays the noise; it will reappear on Wednesday).

image.png (477×1 px, 85 KB)

Icinga downtime and Alertmanager silence (ID=53680efe-51bd-4e69-a0cb-c69f7ffac3dd) set by bking@cumin2002 for 7 days, 0:00:00 on 65 host(s) and their services with reason: eqiad is depooled, noisy alerts

cirrussearch[1068-1079,1084,1111-1125].eqiad.wmnet,elastic[1054,1057-1067,1080-1083,1087-1103,1107-1110].eqiad.wmnet

Apologies for the noise and thank you for bringing this to our attention. Since EQIAD is depooled, I've downtimed all our EQIAD hosts for the next week (we hope we'll be done by then).

In the meantime, please feel free to ping me directly on Slack or IRC if you see any more noisy alerts.
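The host lists in the downtime messages above use cumin's ClusterShell-style node syntax (e.g. `elastic[1054,1057-1067].eqiad.wmnet`). As a rough illustration of how such an expression maps to individual hosts, here is a small Python sketch; the helper below is hypothetical and not part of cumin itself.

```python
import re

def expand_nodeset(expr: str) -> list[str]:
    """Expand a ClusterShell-style expression like
    'elastic[1054,1057-1060].eqiad.wmnet' into individual hostnames.
    Illustrative only; real cumin/ClusterShell parsing is richer."""
    m = re.fullmatch(r"([^\[]+)\[([0-9,\-]+)\](.*)", expr)
    if not m:
        return [expr]  # no bracket group: already a single host
    prefix, ranges, suffix = m.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            # a range like 1057-1060 expands inclusively
            hosts.extend(f"{prefix}{n}{suffix}" for n in range(int(lo), int(hi) + 1))
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts

hosts = expand_nodeset("elastic[1054,1057-1060].eqiad.wmnet")
```

This is why a single downtime command can cover 65 hosts: each bracketed range expands to one host per number.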

8 more alerts were received at the SRE board:

image.png (675×960 px, 128 KB)

I've downtimed them.

bking changed the task status from Open to In Progress.May 20 2025, 2:51 PM
bking claimed this task.

Icinga downtime and Alertmanager silence (ID=b139a4cb-57ae-4d79-8869-06e835f82525) set by bking@cumin2002 for 7 days, 0:00:00 on 65 host(s) and their services with reason: eqiad is depooled, noisy alerts

cirrussearch[1068-1087,1111-1125].eqiad.wmnet,elastic[1054,1057,1060-1067,1088-1103,1107-1110].eqiad.wmnet

Thanks @jcrespo. I think the reimage cookbook must be removing downtimes. The help says it doesn't, but I'm guessing that is incorrect:

--no-downtime         do not set the host in downtime on Icinga/Alertmanager before the reimage. Included if --new is set. The host will be downtimed after the reimage in any
                      case. (default: False)

Since downtiming isn't working, we'll need to detune the alerts so they don't fire immediately (delaying 10-15 minutes is probably long enough).
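For a Prometheus-style alert, "detuning" usually means raising the `for:` duration so the condition must hold continuously before the alert fires. A sketch, where the alert name, metric expression, and label value are illustrative assumptions rather than the real rule:

```yaml
groups:
  - name: opensearch
    rules:
      - alert: OpenSearchUnassignedShards       # illustrative name
        expr: opensearch_cluster_unassigned_shards > 0   # hypothetical metric
        for: 15m   # condition must persist 15 minutes before firing,
                   # absorbing short-lived churn during reimages
        labels:
          team: data-platform   # route to DPE SRE, not the wider SRE group
```

A 10-15 minute `for:` window would let shard reallocation during a reimage settle without paging anyone, while a genuinely stuck shard would still alert.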

In the meantime, I've added running the downtime cookbook and checking Icinga to our procedure. Apologies once again for the noise, and if there's anything else we can do, please let me know.

I think the reimage cookbook must be removing downtimes.

Indeed. My recommendation (or at least what I do) is to set a host as notif:disabled in Puppet while it is being set up, so that it doesn't create too much noise. This is not a rule, but anything that prevents unneeded alerts helps the people attending actual outages. Notifications that are systematically ignored only cause alert fatigue (and sometimes this option isn't even widely known). Thank you for your understanding.
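If that per-host approach is taken via Hiera, the change might look roughly like this. The exact key and file path are assumptions about the Wikimedia puppet tree; the intent is simply to disable notifications for a host while it is being set up.

```yaml
# hieradata/hosts/cirrussearch1088.yaml (illustrative path and key)
# Disable monitoring notifications for this host during setup;
# remove this line once the host is back in service.
profile::monitoring::notifications_enabled: false
```

Unlike a timed downtime, this setting survives a reimage until the Hiera change is reverted, which is what makes it useful when the reimage cookbook clears downtimes.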

Change #1148402 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic/cirrussearch: silence eqiad, improve alert routing

https://gerrit.wikimedia.org/r/1148402

Change #1148402 merged by Bking:

[operations/puppet@production] elastic/cirrussearch: silence eqiad, improve alert routing

https://gerrit.wikimedia.org/r/1148402

Icinga downtime and Alertmanager silence (ID=55d264b1-8db7-4900-adde-2a0735ae990d) set by bking@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: eqiad is depooled, noisy alerts

cirrussearch1087.eqiad.wmnet

@jcrespo Thanks for the suggestion! I merged the above CR and reimaged cirrussearch1088. The cookbook mentioned alerts such as cirrussearch1088:OpenSearch health check for shards on 9200, but I did not see the alerts make it into the icinga web UI or #wikimedia-operations IRC.

As such, I'm resolving the ticket and moving to 'Needs Review' on our board. Feel free to un-resolve if this did not fix the issue. Thanks again for bringing this to our attention!