
Migrate selected Search Platform alerts from icinga search-platform team to prometheus data-platform team
Closed, Resolved · Public

Description

Per T346438 and the Data Platform Alerts Review document, we're ready to start migrating alerts. The purpose of these migrations is twofold:

  • Better targeting of alerts. For example, we don't want to bother SWEs with operational alerts.
  • Moving to the new Prometheus-driven alerts system.
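To make the targeting point concrete, here is a minimal sketch of what a Prometheus alerting rule with team-based routing might look like. The rule name, expression, and label values are illustrative only, not the actual rules being migrated; the assumption is that Alertmanager routes notifications on a team label.

```
groups:
  - name: wdqs_availability             # illustrative group name
    rules:
      - alert: WdqsPublicEndpointDown   # hypothetical alert name
        # probe_success is exported by blackbox_exporter; the instance label is illustrative
        expr: probe_success{job="blackbox", instance="https://query.wikidata.org"} == 0
        for: 5m
        labels:
          team: data-platform   # routing label: operational alerts go to the ops-facing team, not SWEs
          severity: critical
        annotations:
          summary: "Blackbox probe against query.wikidata.org is failing"
```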

Event Timeline

I'm doing a first pass on alerts I already recorded in this Google Sheet.

My first choices for moving to Prometheus:

Gehel triaged this task as Medium priority. Feb 26 2024, 2:20 PM

Change 1006564 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: add blackbox check for query.wikidata.org

https://gerrit.wikimedia.org/r/1006564

Change 1006564 merged by Bking:

[operations/puppet@production] wdqs: add blackbox check for query.wikidata.org

https://gerrit.wikimedia.org/r/1006564
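The patch itself lives in operations/puppet, but conceptually it amounts to a blackbox_exporter HTTP module probing the public endpoint over TLS. A minimal sketch of such a module, with an illustrative name and timeout rather than the merged Puppet code:

```
modules:
  http_query_wikidata_org:     # illustrative module name
    prober: http
    timeout: 10s
    http:
      method: GET
      valid_status_codes: [200]
      fail_if_not_ssl: true    # the public endpoint must answer over TLS
      tls_config:
        server_name: query.wikidata.org
```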

Change 1006992 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Add blackbox http checks for SPARQL endpoint

https://gerrit.wikimedia.org/r/1006992

Change 1006992 merged by Bking:

[operations/puppet@production] wdqs: Add blackbox http checks for SPARQL endpoint

https://gerrit.wikimedia.org/r/1006992
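Conceptually, this extends the probing from the landing page to the SPARQL endpoint itself: the probe issues a trivial query and checks that a well-formed result comes back. A sketch under those assumptions (query string and regex are illustrative, not taken from the patch):

```
modules:
  http_wdqs_sparql:            # illustrative module name
    prober: http
    timeout: 15s
    http:
      method: GET
      valid_status_codes: [200]
      # The probe target would carry a trivial query, e.g.
      # https://query.wikidata.org/sparql?query=ASK%20%7B%7D (illustrative)
      fail_if_body_not_matches_regexp:
        - 'boolean'            # expect a boolean result in the response body
```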

Change 1007006 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: loosen up regex for blackbox check

https://gerrit.wikimedia.org/r/1007006

Change 1007006 merged by Bking:

[operations/puppet@production] wdqs: loosen up regex for blackbox check

https://gerrit.wikimedia.org/r/1007006
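The kind of change involved, sketched with made-up patterns rather than the ones from the patch: the original body regex was too strict about the exact serialization of the response, so it is relaxed to a less brittle match.

```
# Before (illustrative): exact match on one serialization of the result, brittle
fail_if_body_not_matches_regexp:
  - '^\{"head":\{\},"boolean":true\}$'

# After (illustrative): only require that a true boolean result appears in the body
fail_if_body_not_matches_regexp:
  - '"boolean"\s*:\s*true'
```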

Change 1007014 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: remove failing blackbox check

https://gerrit.wikimedia.org/r/1007014

Change 1007014 abandoned by Bking:

[operations/puppet@production] wdqs: remove failing blackbox check

Reason:

Not actually broken; just needs refactor for internal hosts

https://gerrit.wikimedia.org/r/1007014

We found a few interesting things trying to adapt the new blackbox checks for wdqs-internal today:

  • query.wikidata.org is not on wdqs-internal's TLS cert
  • It appears that the internal URL wdqs-internal.discovery.wmnet only works via cleartext http (at least from the prometheus and cumin hosts).
  • wdqs-internal does have envoy running per this PR and a valid TLS cert for wdqs-internal.discovery.wmnet, but it's unclear how it's being accessed outside of envoy (or whether it can be accessed that way at all).

Short-term, we might want to change the internal check to use cleartext http; long-term, we should probably migrate to HTTPS via envoy. But we need to figure out who the stakeholders for the internal service are before we talk about that.
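A sketch of how the public and internal probes could be told apart in the short term, given the above: the public module insists on TLS with the query.wikidata.org certificate, while the internal module probes cleartext http until the envoy/TLS question is settled. Module names and options are illustrative, not the merged change.

```
modules:
  http_wdqs_public:            # illustrative
    prober: http
    http:
      fail_if_not_ssl: true
      tls_config:
        server_name: query.wikidata.org   # name only on the public cert
  http_wdqs_internal:          # illustrative
    prober: http
    http:
      fail_if_ssl: true        # short-term: probe wdqs-internal.discovery.wmnet over cleartext http
```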

Change 1007653 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Distinguish between public and internal monitoring

https://gerrit.wikimedia.org/r/1007653

Change 1007653 merged by Bking:

[operations/puppet@production] wdqs: Distinguish between public and internal monitoring

https://gerrit.wikimedia.org/r/1007653

Mentioned in SAL (#wikimedia-operations) [2024-03-05T17:40:20Z] <inflatador> bking@prometheus1006 reload prometheus service as part of troubleshooting T358029

Hi, as some of those hosts had Puppet disabled for a long time (with this task as the disable message), they got removed from PuppetDB.
As hosts not in PuppetDB can be problematic (lack of security updates, for example), we have a check to catch them:
https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
Full list is currently:

an-worker1096 (WMF4839)
elastic2107 (WMF11895)
elastic2108 (WMF11896)
moss-be2001 (WMF5769)
moss-be2002 (WMF5772)
wdqs1022 (WMF11314)
wdqs1023 (WMF11315)
wdqs1024 (WMF11316)

The recommended action here is of course to re-enable Puppet if possible (ideally through a re-image to make sure the host didn't get compromised in the meantime). If not, set their Netbox status to "failed" until they're fully back in order.

@ayounsi ACK.

I can't speak for the other hosts as I don't own them, but:

  • I just re-enabled Puppet on wdqs1022-24 and I can confirm they're running OK. They were disabled due to Puppet errors (which have since been resolved by merging this CR).
  • elastic2107-2108 are unreachable and have DRAC problems. I'll try and take a look at them tomorrow.

Sorry for the trouble! If you need more info, feel free to reach out here or in IRC.

Thanks, and no problem!

elastic2107-2108 are unreachable and have DRAC problems. I'll try and take a look at them tomorrow.

Please set their Netbox status to Failed then :)

bking closed this task as Resolved. Edited Mar 21 2024, 4:23 PM

I successfully reimaged the Elastic hosts above, but elastic2088 is still fighting me. That being said, we're getting pretty far off-topic from this task.

This task's work is done, so I am closing it out. Lifecycle work for elastic2088 continues in T353878.