ProbeDown
Closed, Resolved · Public

Assigned To

Authored By

	phaultfinder
	Feb 28 2023, 2:03 AM

Description

Common information

dashboard: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All
runbook: https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443

alertname: ProbeDown
instance: gitlab2002:443
job: probes/custom
prometheus: ops
severity: task
site: codfw
source: prometheus
team: serviceops-collab

Firing alerts

dashboard: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All
description: gitlab2002:443 failed when probed by http_gitlab_wikimedia_org_ip4 from codfw. Availability is 0%.
logs: https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=(filters:!((query:(match_phrase:(service.name:http_gitlab_wikimedia_org_ip4)))))
runbook: https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443
summary: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4)
address: 208.80.153.8
alertname: ProbeDown
family: ip4
instance: gitlab2002:443
job: probes/custom
module: http_gitlab_wikimedia_org_ip4
prometheus: ops
severity: task
site: codfw
source: prometheus
team: serviceops-collab

dashboard: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All
description: gitlab2002:443 failed when probed by http_gitlab_wikimedia_org_ip6 from codfw. Availability is 0%.
logs: https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=(filters:!((query:(match_phrase:(service.name:http_gitlab_wikimedia_org_ip6)))))
runbook: https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443
summary: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6)
address: 2620:0:860:1:208:80:153:8
alertname: ProbeDown
family: ip6
instance: gitlab2002:443
job: probes/custom
module: http_gitlab_wikimedia_org_ip6
prometheus: ops
severity: task
site: codfw
source: prometheus
team: serviceops-collab

Details

	Subject	Repo	Branch	Lines +/-
	gitlab: enable restore for replicas, disable on active_host	operations/puppet	production	+1 -15

Related Objects

Mentioned In: T329931: Switchover gitlab (gitlab1004 -> gitlab2002)
Mentioned Here: T329931: Switchover gitlab (gitlab1004 -> gitlab2002)

Event Timeline

phaultfinder created this task. Feb 28 2023, 2:03 AM

Restricted Application added a subscriber: Aklapper. Feb 28 2023, 2:03 AM

gitlab2002 was switched from a replica to the production instance yesterday in T329931.

It seems the restore timer was not removed from the instance, so a restore was triggered at 2:00 UTC. The restore used the backup taken at 0:04 UTC.

So we may have lost the data written during these two hours, between the 0:04 UTC backup and the 2:00 UTC restore.
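
For reference, the size of that window follows from the two timestamps given above. A minimal sketch in Python, assuming both times fall on Feb 28 2023 (the date of the alert) and that anything written after the backup was taken was overwritten by the restore; the variable names are illustrative only:

```python
from datetime import datetime, timezone

# Times taken from the comment above; the date is assumed to be the alert date.
backup_taken = datetime(2023, 2, 28, 0, 4, tzinfo=timezone.utc)
restore_started = datetime(2023, 2, 28, 2, 0, tzinfo=timezone.utc)

# Anything written after the backup and before the restore is potentially lost.
loss_window = restore_started - backup_taken
print(loss_window)  # 1:56:00
```

So the affected window is slightly under two hours, not counting the time the restore itself took.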

I'm digging into the Puppet code to see why the job remained on the active host. After that I'll prepare some communication about the data loss.

Change 892892 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable restore for replicas, disable on active_host

https://gerrit.wikimedia.org/r/892892

gerritbot added a project: Patch-For-Review. Feb 28 2023, 7:59 AM

So gitlab2002 was down for around 20 minutes while doing the restore. The above change should disable restore on the production host.
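
The intent of the change can be sketched as follows. This is an illustration in Python, not the actual Puppet code from change 892892, which is not reproduced here. The function name `restore_timer_ensure` and the way `active_host` is passed in are assumptions based only on the commit subject, and treating gitlab1004 as a replica after the switchover is likewise an assumption:

```python
def restore_timer_ensure(hostname: str, active_host: str) -> str:
    """Return the desired state of the restore timer on a host.

    Hypothetical helper: replicas restore from the latest backup, while the
    active (production) host must never run a restore.
    """
    return "absent" if hostname == active_host else "present"


# After the switchover in T329931, gitlab2002 is the active host.
for host in ("gitlab1004", "gitlab2002"):
    print(host, restore_timer_ensure(host, active_host="gitlab2002"))
```

The point of keying the decision on a single active-host value is that a switchover only has to change that one setting for the timer to move with it.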

Jelto added a project: GitLab (Infrastructure). Feb 28 2023, 8:27 AM

Jelto mentioned this in T329931: Switchover gitlab (gitlab1004 -> gitlab2002). Feb 28 2023, 8:32 AM

Change 892892 merged by Jelto:

[operations/puppet@production] gitlab: enable restore for replicas, disable on active_host

https://gerrit.wikimedia.org/r/892892

After merging the above change, the restore timer is gone from gitlab2002. So we should not see a ProbeDown alert again due to restores on the production instance.

jelto@gitlab2002 $ systemctl list-timers | grep restore
<empty>

I'm closing this task. Fallout from the data loss will be addressed in T329931.

Thanks @Dzahn for implementing the new alerts :)
