beta / deployment-prep alerts show up in production alertmanager
Closed, Resolved, Public

Assigned To

Authored By

	fgiunchedi
	Aug 21 2023, 7:45 AM

Description

While looking at alerts.w.o with @cluster=wikimedia.org (i.e. the production alertmanager) I noticed there are a few alerts for deployment-prep instances, which is unexpected to say the least:

summary: deployment-docker-wikifunctions01:9100 FS / at 0.00% avail
summary: deployment-mwlog01:9100 FS /srv at 0.00% avail
summary: Puppet stale on deployment-docker-wikifunctions01:9100 for 7d 0h 24m 59s

The expectation is for these alerts to show up with @cluster=wmcloud.org, not @cluster=wikimedia.org.

Related Objects

Mentioned In: T344974: De-provision beta-specific Prometheus
Mentioned Here: T344974: De-provision beta-specific Prometheus

Event Timeline

fgiunchedi created this task. Aug 21 2023, 7:45 AM

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

RhinosF1 edited projects, added Cloud-VPS; removed Cloud-Services. Aug 21 2023, 7:46 AM

Restricted Application added a subscriber: RhinosF1. Aug 21 2023, 7:46 AM

RhinosF1 added a project: cloud-services-team. Aug 21 2023, 7:47 AM

fgiunchedi updated the task description. Aug 21 2023, 7:47 AM

TheresNoTime added a project: Beta-Cluster-Infrastructure. Aug 21 2023, 7:52 AM

Update on the investigation:

InstanceDiskFullCrit alert is defined in modules/role/files/prometheus/alerts_beta.yml
Said file is deployed by role::prometheus::beta
The host in deployment-prep with that role applied is deployment-prometheus05.deployment-prep.eqiad1.wikimedia.cloud
prometheus@beta on that host is indeed configured with the production alertmanagers alert1001 and alert2001 (something to be fixed)
The host above seemingly can't talk to alert[21]001:

# curl https://alert1001.wikimedia.org:9093 -v
*   Trying 208.80.154.88:9093...
*   Trying 2620:0:861:3:208:80:154:88:9093...
* Immediate connect fail for 2620:0:861:3:208:80:154:88: Network is unreachable
# curl https://alert2001.wikimedia.org:9093 -v
*   Trying 208.80.153.84:9093...
*   Trying 2620:0:860:3:208:80:153:84:9093...
* Immediate connect fail for 2620:0:860:3:208:80:153:84: Network is unreachable

Not directly related to this issue, but zooming out a little: now that we have cloudinfra and Cloud VPS-wide Prometheus metrics and alerts, I'm questioning whether we even need a dedicated Prometheus in deployment-prep (cc @taavi)

According to its logs, Prometheus is failing to talk to alert2001 only, not alert1001:

root@deployment-prometheus05:~# journalctl -u prometheus@beta --since -2d | grep -ic alert2001
2877
root@deployment-prometheus05:~# journalctl -u prometheus@beta --since -2d | grep -ic alert1001
0
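
The two counts above can also be gathered in one pass. This is only a minimal sketch, assuming the journal output has first been saved to a file; the function name count_host_mentions and the path /tmp/prometheus-beta.log are made up for illustration:

```shell
# Print "<host> <count>" for each host, counting case-insensitive
# matching lines in the given log file (same semantics as grep -ic).
count_host_mentions() {
  logfile=$1
  shift
  for host in "$@"; do
    # grep -c exits non-zero when the count is 0, so tolerate that.
    count=$(grep -ic -- "$host" "$logfile" || true)
    printf '%s %s\n' "$host" "$count"
  done
}

# Usage (journalctl invocation as in the transcript above):
#   journalctl -u prometheus@beta --since -2d > /tmp/prometheus-beta.log
#   count_host_mentions /tmp/prometheus-beta.log alert1001 alert2001
```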

Which tipped me off to check the Prometheus connections:

root@deployment-prometheus05:~# lsof -p $(pidof prometheus) | grep -i alert
prometheu 528 prometheus   20u     IPv4           58615134       0t0     TCP deployment-prometheus05.deployment-prep.eqiad1.wikimedia.cloud:54762->alert1001.wikimedia.org:9093 (ESTABLISHED)
prometheu 528 prometheus  168u     IPv4           32182748       0t0     TCP deployment-prometheus05.deployment-prep.eqiad1.wikimedia.cloud:46164->deployment-alert01.deployment-prep.eqiad1.wikimedia.cloud:9100 (ESTABLISHED)
prometheu 528 prometheus  195u     IPv4           67199208       0t0     TCP deployment-prometheus05.deployment-prep.eqiad1.wikimedia.cloud:40752->alert2001.wikimedia.org:9093 (SYN_SENT)
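
To compare connection states at a glance, the lsof output can be reduced to remote endpoint and state. A rough sketch, assuming the lsof column layout shown above (connection in the second-to-last field, state in the last); summarise_tcp_states is a hypothetical helper name:

```shell
# Read lsof output on stdin and print "<remote endpoint> <state>"
# for every TCP line that describes a connection (contains "->").
summarise_tcp_states() {
  awk '/TCP/ && /->/ {
    split($(NF-1), ends, "->")   # ends[1] = local side, ends[2] = remote side
    state = $NF
    gsub(/[()]/, "", state)      # "(ESTABLISHED)" becomes "ESTABLISHED"
    print ends[2], state
  }'
}

# Usage (lsof invocation as in the transcript above):
#   lsof -p "$(pidof prometheus)" | grep -i alert | summarise_tcp_states
```

A connection stuck in SYN_SENT never completed its handshake, while ESTABLISHED means the peer was reachable when the connection was opened.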

I'm guessing that connections to alert1001 must have been possible for a certain time period in the past, and prometheus kept hanging on to that connection.

With that in mind, these are the action items:

Restart prometheus@beta so connections are reset and alerts disappear
Configure beta prometheus with alertmanagers from cloudinfra instead
Evaluate whether we need prometheus@beta at all
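
After a restart or reconfiguration, one way to verify the result could be to ask Prometheus which Alertmanagers it knows about through its HTTP API endpoint /api/v1/alertmanagers and flag anything still under wikimedia.org. This is a sketch under assumptions: PROM_URL is a placeholder for wherever prometheus@beta serves its API (not stated in this task), find_production_alertmanagers is a hypothetical helper, and it uses plain grep rather than a JSON parser, so it reports both active and dropped entries:

```shell
# Read a /api/v1/alertmanagers JSON response on stdin, print every
# Alertmanager URL whose host is under wikimedia.org, and return 1
# if any such URL was found (0 otherwise).
find_production_alertmanagers() {
  urls=$(grep -oE 'https?://[^"]+' \
    | grep -E '://[^/]*\.wikimedia\.org[:/]' || true)
  if [ -n "$urls" ]; then
    printf '%s\n' "$urls"
    return 1
  fi
  return 0
}

# Usage (PROM_URL is an assumption, set it to the instance's base URL):
#   curl -s "$PROM_URL/api/v1/alertmanagers" | find_production_alertmanagers
```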

Mentioned in SAL (#wikimedia-operations) [2023-08-21T08:27:16Z] <godog> restart prometheus@beta - T344582

In T344582#9105213, @fgiunchedi wrote:

With that in mind, these are the action items:

Restart prometheus@beta so connections are reset and alerts disappear

Configure beta prometheus with alertmanagers from cloudinfra instead

Evaluate whether we need prometheus@beta at all

Thinking about this a little more, I don't think we need or want a beta-specific Prometheus in this day and age. It used to make sense when the Prometheus deployment was in its infancy. For generic Cloud VPS purposes we have cloudinfra and its alerts. Therefore I've gone ahead and shut off the instance, which I'll delete in 7-8 weeks unless complaints arise.

Mentioned in SAL (#wikimedia-cloud) [2023-08-21T15:30:53Z] <godog> shut prometheus05 - T344582

In T344582#9106603, @fgiunchedi wrote:

Thinking about this a little more, I don't think we need or want a beta-specific Prometheus in this day and age. It used to make sense when the Prometheus deployment was in its infancy. For generic Cloud VPS purposes we have cloudinfra and its alerts. Therefore I've gone ahead and shut off the instance, which I'll delete in 7-8 weeks unless complaints arise.

Tracked at T344974: De-provision beta-specific Prometheus

Resolving this task since the original issue is resolved.
