
Migrate selected Search Platform alerts from icinga search-platform team to prometheus data-platform team
Closed, Resolved · Public

Description

Per T346438 and the Data Platform Alerts Review document, we're ready to start migrating alerts. The purpose of these migrations is twofold:

  • Better targeting of alerts. For example, we don't want to bother SWEs with operational alerts.
  • Moving to the new Prometheus-driven alerts system.
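To make the targeting point concrete, here is a minimal sketch of what a Prometheus alerting rule with team-based routing might look like. The rule name, expression, and label values are illustrative only, not the actual rules being migrated; the assumption is that Alertmanager routes notifications on a team label.

```
groups:
  - name: wdqs_availability             # illustrative group name
    rules:
      - alert: WdqsPublicEndpointDown   # hypothetical alert name
        # probe_success is exported by blackbox_exporter; the instance label is illustrative
        expr: probe_success{job="blackbox", instance="https://query.wikidata.org"} == 0
        for: 5m
        labels:
          team: data-platform   # routing label: operational alerts go to the ops-facing team, not SWEs
          severity: critical
        annotations:
          summary: "Blackbox probe against query.wikidata.org is failing"
```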

Event Timeline

I'm doing a first pass on alerts I already recorded in this Google Sheet.

My first choices for moving to Prometheus:

Gehel triaged this task as Medium priority. Feb 26 2024, 2:20 PM

Change 1006564 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: add blackbox check for query.wikidata.org

https://gerrit.wikimedia.org/r/1006564

Change 1006564 merged by Bking:

[operations/puppet@production] wdqs: add blackbox check for query.wikidata.org

https://gerrit.wikimedia.org/r/1006564
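The patch itself lives in operations/puppet, but conceptually it amounts to a blackbox_exporter HTTP module probing the public endpoint over TLS. A minimal sketch of such a module, with an illustrative name and timeout rather than the merged Puppet code:

```
modules:
  http_query_wikidata_org:     # illustrative module name
    prober: http
    timeout: 10s
    http:
      method: GET
      valid_status_codes: [200]
      fail_if_not_ssl: true    # the public endpoint must answer over TLS
      tls_config:
        server_name: query.wikidata.org
```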

Change 1006992 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Add blackbox http checks for SPARQL endpoint

https://gerrit.wikimedia.org/r/1006992

Change 1006992 merged by Bking:

[operations/puppet@production] wdqs: Add blackbox http checks for SPARQL endpoint

https://gerrit.wikimedia.org/r/1006992
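Conceptually, this extends the probing from the landing page to the SPARQL endpoint itself: the probe issues a trivial query and checks that a well-formed result comes back. A sketch under those assumptions (query string and regex are illustrative, not taken from the patch):

```
modules:
  http_wdqs_sparql:            # illustrative module name
    prober: http
    timeout: 15s
    http:
      method: GET
      valid_status_codes: [200]
      # The probe target would carry a trivial query, e.g.
      # https://query.wikidata.org/sparql?query=ASK%20%7B%7D (illustrative)
      fail_if_body_not_matches_regexp:
        - 'boolean'            # expect a boolean result in the response body
```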

Change 1007006 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: loosen up regex for blackbox check

https://gerrit.wikimedia.org/r/1007006

Change 1007006 merged by Bking:

[operations/puppet@production] wdqs: loosen up regex for blackbox check

https://gerrit.wikimedia.org/r/1007006
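The kind of change involved, sketched with made-up patterns rather than the ones from the patch: the original body regex was too strict about the exact serialization of the response, so it is relaxed to a less brittle match.

```
# Before (illustrative): exact match on one serialization of the result, brittle
fail_if_body_not_matches_regexp:
  - '^\{"head":\{\},"boolean":true\}$'

# After (illustrative): only require that a true boolean result appears in the body
fail_if_body_not_matches_regexp:
  - '"boolean"\s*:\s*true'
```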

Change 1007014 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: remove failing blackbox check

https://gerrit.wikimedia.org/r/1007014

Change 1007014 abandoned by Bking:

[operations/puppet@production] wdqs: remove failing blackbox check

Reason:

Not actually broken; just needs refactor for internal hosts

https://gerrit.wikimedia.org/r/1007014

We found a few interesting things trying to adapt the new blackbox checks for wdqs-internal today:

  • query.wikidata.org is not on wdqs-internal's TLS cert
  • It appears that the internal URL wdqs-internal.discovery.wmnet only works via cleartext http (at least from the prometheus and cumin hosts).
  • wdqs-internal does have envoy running per this PR and a valid TLS cert for wdqs-internal.discovery.wmnet, but it's unclear how it's being accessed outside of envoy (or whether it can be accessed that way at all).

Short-term, we might want to change the internal check to use cleartext http; long-term, we should probably migrate to HTTPS via envoy. But we need to figure out who the stakeholders for the internal service are before we talk about that.
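A sketch of how the public and internal probes could be told apart in the short term, given the above: the public module insists on TLS with the query.wikidata.org certificate, while the internal module probes cleartext http until the envoy/TLS question is settled. Module names and options are illustrative, not the merged change.

```
modules:
  http_wdqs_public:            # illustrative
    prober: http
    http:
      fail_if_not_ssl: true
      tls_config:
        server_name: query.wikidata.org   # name only on the public cert
  http_wdqs_internal:          # illustrative
    prober: http
    http:
      fail_if_ssl: true        # short-term: probe wdqs-internal.discovery.wmnet over cleartext http
```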

Change 1007653 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Distinguish between public and internal monitoring

https://gerrit.wikimedia.org/r/1007653

Change 1007653 merged by Bking:

[operations/puppet@production] wdqs: Distinguish between public and internal monitoring

https://gerrit.wikimedia.org/r/1007653

Mentioned in SAL (#wikimedia-operations) [2024-03-05T17:40:20Z] <inflatador> bking@prometheus1006 reload prometheus service as part of troubleshooting T358029

Hi, as some of those hosts had Puppet disabled for a long time (with this task as the disable message), they got removed from PuppetDB.
As hosts not in PuppetDB can be problematic (lack of security updates, for example), we have a check to catch them:
https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
Full list is currently:

an-worker1096 (WMF4839)
elastic2107 (WMF11895)
elastic2108 (WMF11896)
moss-be2001 (WMF5769)
moss-be2002 (WMF5772)
wdqs1022 (WMF11314)
wdqs1023 (WMF11315)
wdqs1024 (WMF11316)

The recommended action here is of course to re-enable Puppet if possible (ideally through a re-image to make sure the host didn't get compromised in the meantime). If not, set their Netbox status to "failed" until they're fully back in order.

@ayounsi ACK.

I can't speak for the other hosts as I don't own them, but:

  • I just re-enabled Puppet on wdqs1022-24 and I can confirm they're running OK. They were disabled due to Puppet errors (which have since been resolved by merging this CR).
  • elastic2107-2108 are unreachable and have DRAC problems. I'll try and take a look at them tomorrow.

Sorry for the trouble! If you need more info, feel free to reach out here or in IRC.

Thanks, and no problem!

elastic2107-2108 are unreachable and have DRAC problems. I'll try and take a look at them tomorrow.

Please set their Netbox status to Failed then :)

bking closed this task as Resolved. Edited Mar 21 2024, 4:23 PM

I successfully reimaged the Elastic hosts above, but elastic2088 is still fighting me. That being said, we're getting pretty far off-topic from this task.

This task's work is done, so I am closing it out. Lifecycle work for elastic2088 continues in T353878.