beta / deployment-prep alerts show up in production alertmanager
Closed, Resolved (Public)

Description

While looking at alerts.w.o with @cluster=wikimedia.org (i.e. the production alertmanager) I noticed there are a few alerts for deployment-prep instances, which is unexpected to say the least:

summary: deployment-docker-wikifunctions01:9100 FS / at 0.00% avail
summary: deployment-mwlog01:9100 FS /srv at 0.00% avail
summary: Puppet stale on deployment-docker-wikifunctions01:9100 for 7d 0h 24m 59s

The expectation is for these alerts to show up with @cluster=wmcloud.org, not @cluster=wikimedia.org.

Event Timeline

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

Update on the investigation:

  • InstanceDiskFullCrit alert is defined in modules/role/files/prometheus/alerts_beta.yml
  • Said file is deployed by role::prometheus::beta
  • The host in deployment-prep with that role applied is deployment-prometheus05.deployment-prep.eqiad1.wikimedia.cloud
  • prometheus@beta on that host is indeed configured with the production alertmanagers alert1001 and alert2001 (something to be fixed)
  • The host above seemingly can't talk to alert[21]001:
# curl https://alert1001.wikimedia.org:9093 -v
*   Trying 208.80.154.88:9093...
*   Trying 2620:0:861:3:208:80:154:88:9093...
* Immediate connect fail for 2620:0:861:3:208:80:154:88: Network is unreachable
# curl https://alert2001.wikimedia.org:9093 -v
*   Trying 208.80.153.84:9093...
*   Trying 2620:0:860:3:208:80:153:84:9093...
* Immediate connect fail for 2620:0:860:3:208:80:153:84: Network is unreachable

Not directly related to this issue, but zooming out a little: now that cloudinfra provides Cloud VPS-wide Prometheus metrics and alerts, I'm questioning whether we even need a dedicated Prometheus in deployment-prep (cc @taavi)

According to its logs, Prometheus is failing to talk to alert2001 only, not alert1001:

root@deployment-prometheus05:~# journalctl -u prometheus@beta --since -2d | grep -ic alert2001
2877
root@deployment-prometheus05:~# journalctl -u prometheus@beta --since -2d | grep -ic alert1001
0
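
The two counts above can also be collected in a single pass over the journal. The following is a minimal sketch, not something that was run as part of this investigation; the function name count_mentions is hypothetical, and it counts matching lines case-insensitively, like grep -ic does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print "host count" for each
# host given as an argument. A line counts once per host if it contains the
# host name (case-insensitive substring match), mirroring `grep -ic`.
count_mentions() {
    awk -v hosts="$*" '
        BEGIN { n = split(hosts, h, " ") }
        {
            line = tolower($0)
            for (i = 1; i <= n; i++)
                if (index(line, tolower(h[i]))) c[i]++
        }
        END { for (i = 1; i <= n; i++) printf "%s %d\n", h[i], c[i] }
    '
}

# Usage, with the same unit and time window as the commands above:
#   journalctl -u prometheus@beta --since -2d | count_mentions alert1001 alert2001
```

A large count for one alertmanager next to zero for the other, as seen here, suggests the two targets are not failing in the same way.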

Which prompted me to check the Prometheus connections:

root@deployment-prometheus05:~# lsof -p $(pidof prometheus) | grep -i alert
prometheu 528 prometheus   20u     IPv4           58615134       0t0     TCP deployment-prometheus05.deployment-prep.eqiad1.wikimedia.cloud:54762->alert1001.wikimedia.org:9093 (ESTABLISHED)
prometheu 528 prometheus  168u     IPv4           32182748       0t0     TCP deployment-prometheus05.deployment-prep.eqiad1.wikimedia.cloud:46164->deployment-alert01.deployment-prep.eqiad1.wikimedia.cloud:9100 (ESTABLISHED)
prometheu 528 prometheus  195u     IPv4           67199208       0t0     TCP deployment-prometheus05.deployment-prep.eqiad1.wikimedia.cloud:40752->alert2001.wikimedia.org:9093 (SYN_SENT)
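
To pull just the alertmanager connections and their TCP states out of that listing, something like the sketch below could be used. It assumes the lsof layout shown above (a "local->remote:port" field followed by the state in parentheses) and that the alertmanager port is printed numerically; the function name alertmanager_conn_states is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lsof output on stdin and print "remote-host state"
# for every TCP connection whose remote port matches $1 (default 9093).
alertmanager_conn_states() {
    local port="${1:-9093}"
    awk -v port="$port" '
        {
            # Look for the "local:port->remote:port" field; the next field
            # holds the TCP state, e.g. "(ESTABLISHED)".
            for (i = 1; i < NF; i++) {
                if ($i ~ /->/) {
                    split($i, ends, "->")
                    remote = ends[2]
                    n = split(remote, parts, ":")
                    if (parts[n] == port) {
                        state = $(i + 1)
                        gsub(/[()]/, "", state)   # drop the parentheses
                        sub(/:[^:]*$/, "", remote) # drop the trailing ":port"
                        print remote, state
                    }
                }
            }
        }'
}

# Usage:
#   lsof -p "$(pidof prometheus)" | alertmanager_conn_states
```

With the listing above this would show alert1001 as ESTABLISHED and alert2001 as SYN_SENT, while the connection to port 9100 is skipped.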

I'm guessing that connections to alert1001 must have been possible at some point in the past, and Prometheus has been holding on to that connection ever since.

With that in mind, these are the action items:

  • Restart prometheus@beta so connections are reset and alerts disappear
  • Configure beta prometheus with alertmanagers from cloudinfra instead
  • Evaluate whether we need prometheus@beta at all

Thinking about this a little more, I don't think we need or want a beta-specific Prometheus in this day and age. It used to make sense when the Prometheus deployment was in its infancy. For generic Cloud VPS purposes we have cloudinfra and its alerts. Therefore I've gone ahead and shut off the instance, which I'll delete in 7-8 weeks unless complaints arise.

fgiunchedi claimed this task.

Tracked at T344974: De-provision beta-specific Prometheus

Resolving this task since the original issue is resolved