ProbeDown
Closed, Resolved (Public)

Description

Common information

  • alertname: ProbeDown
  • instance: gitlab2002:443
  • job: probes/custom
  • prometheus: ops
  • severity: task
  • site: codfw
  • source: prometheus
  • team: serviceops-collab

Firing alerts



Event Timeline

Jelto triaged this task as High priority.
Jelto added subscribers: eoghan, LSobanski, Jelto.

gitlab2002 was switched from replica to a production instance yesterday in T329931.

It seems the restore timer was not removed on the instance, so a restore was triggered at 2:00 UTC. It ran from the most recent backup, which was taken at 0:04 UTC.

So we may have lost the data written in those roughly two hours, between 0:04 and 2:00 UTC.
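
For reference, the size of the affected window can be worked out from the two timestamps above. This is only a rough sketch; the variable names are made up for illustration, and it assumes both times fall on the same UTC day.

```shell
# Times taken from the comment above, in minutes since midnight UTC.
backup_min=$(( 0 * 60 + 4 ))    # backup taken at 0:04 UTC
restore_min=$(( 2 * 60 + 0 ))   # restore triggered at 2:00 UTC

# Anything written after the backup and before the restore may be gone.
window_min=$(( restore_min - backup_min ))
echo "$window_min minutes"
```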

I'm digging into the Puppet code to see why the restore job remained on the active host. After that I'll prepare some communication about the data loss.

Change 892892 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable restore for replicas, disable on active_host

https://gerrit.wikimedia.org/r/892892

So gitlab2002 was down for around 20 minutes while the restore ran. The above change should disable the restore on the production host.

Change 892892 merged by Jelto:

[operations/puppet@production] gitlab: enable restore for replicas, disable on active_host

https://gerrit.wikimedia.org/r/892892

Jelto added a subscriber: Dzahn.

After merging the above change, the restore timer is gone from gitlab2002. So we should not see a ProbeDown alert again due to restores on the production instance.

jelto@gitlab2002 $ systemctl list-timers | grep restore
<empty>
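
The manual check above could also be wrapped in a small helper for repeated use. This is a minimal sketch, not something deployed on the host: `no_restore_timer` is a hypothetical name, and it assumes that any timer line containing the string "restore" is the one in question. It reads the timer listing from stdin so it can be tried without systemd.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 when no line on stdin mentions "restore",
# and 1 when at least one such line is present.
no_restore_timer() {
    if grep -q 'restore'; then
        return 1
    fi
    return 0
}

# Example use on a host with systemd (left commented out here):
# systemctl list-timers --all --no-pager | no_restore_timer \
#     && echo "no restore timer present"
```

Passing `--all` also lists inactive timers, which plain `systemctl list-timers` would not show.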

I'm closing this task. Fallout from the data loss will be addressed in T329931.

Thanks @Dzahn for implementing the new alerts :)