
ProbeDown
Closed, ResolvedPublic

Description

Common information

  • address: 2a02:ec80:a100:fe03::2
  • alertname: ProbeDown
  • family: ip6
  • job: probes/custom
  • prometheus: ops
  • severity: critical
  • source: prometheus
  • team: wmcs

Firing alerts





Event Timeline

tappof subscribed.

irc logs:

11:15:30    tappof │ arturo: dcaro Just a heads-up in case of any unwanted alerts: I'm merging this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100819
11:15:59    arturo │ tappof: thanks 🚢🇮🇹
11:15:59     dcaro │ tappof: ack!
11:52:06    arturo │ tappof: indeed we just had the alert firing
11:52:21    arturo │ https://usercontent.irccloud-cdn.com/file/SlS4y6la/image.png
11:52:34    arturo │ and T388379
11:52:34 +stashbot │ T388379: ProbeDown  - https://phabricator.wikimedia.org/T388379
11:52:46    tappof │ arturo: yes, I've seen
11:54:52    arturo │ tappof: are you able to investigate? I'm about to jump into a meeting
11:55:17    tappof │ arturo: yes, I'll check soon
11:55:22    arturo │ thanks
13:06:28    tappof │ arturo: We didn't have the IPv6 check before... https://w.wiki/DNG7 AFAICS, Prometheus in eqiad can reach cloudgw2002-dev on the IPv6 VIP, but the reply gets lost somewhere https://snipboard.io/XJ0gn9.jpg
13:06:57    arturo │ topranks: oh, ok!
13:07:40    arturo │ we may want to create a ticket to investigate why that happens, and remove the IPv6 check meanwhile
13:10:05    tappof │ arturo: I think we can use T388379 as the task for this one. I'll link it to the task related to monitoring to keep track of the relationship.
13:10:06 +stashbot │ T388379: ProbeDown  - https://phabricator.wikimedia.org/T388379
13:10:29    tappof │ is it ok for you arturo ?
13:11:26    arturo │ ok!

Change #1126023 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] cloudgw/icmp check/ip6: disabling

https://gerrit.wikimedia.org/r/1126023

Change #1126023 merged by Tiziano Fogli:

[operations/puppet@production] cloudgw/icmp check/ip6: disabling

https://gerrit.wikimedia.org/r/1126023

I referenced the wrong task on this patch, but it should allow the pings to work.

https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1126035

The alert is not firing anymore, can this be resolved?


The alert is not firing because the check was disabled. But the underlying problem is not fixed yet.

We can wait a few days until all patches are merged, in particular https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1126035 and a revert of https://gerrit.wikimedia.org/r/1126023

This should be ok now if you want to try re-enabling the check.

cmooney@prometheus2005:~$ ping -c 2 -6 wan.cloudgw.codfw1dev.wikimediacloud.org.
PING wan.cloudgw.codfw1dev.wikimediacloud.org.(wan.cloudgw.codfw1dev.wikimediacloud.org (2a02:ec80:a100:fe03::2)) 56 data bytes
64 bytes from wan.cloudgw.codfw1dev.wikimediacloud.org (2a02:ec80:a100:fe03::2): icmp_seq=1 ttl=62 time=0.964 ms
64 bytes from wan.cloudgw.codfw1dev.wikimediacloud.org (2a02:ec80:a100:fe03::2): icmp_seq=2 ttl=62 time=0.316 ms

--- wan.cloudgw.codfw1dev.wikimediacloud.org. ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.316/0.640/0.964/0.324 ms
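Before reverting the probe change, the manual check above can be scripted so that "reachable" means every echo request got a reply. This is a minimal sketch, assuming the Linux iputils `ping` output format shown above; `parse_ping_summary`, `reply_times`, `population_stddev`, and `is_reachable` are hypothetical helpers for illustration, not part of the Prometheus probe or its configuration. It also assumes that `mdev` is the population standard deviation of the reply times, which matches the numbers in this output: the two replies, 0.964 ms and 0.316 ms, give half their difference, 0.324 ms.

```python
import math
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Summary line in iputils ping's output format (assumed); the optional
# "+N errors," group covers the variant printed when ICMP errors come back.
_COUNTS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) received,(?: \+\d+ errors,)? ([\d.]+)% packet loss"
)
# The rtt line is only printed when at least one reply was received.
_RTT_RE = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")
# Per-reply round-trip time, e.g. "time=0.964 ms" (the summary's "time 1000ms" has no "=").
_REPLY_RE = re.compile(r"time=([\d.]+) ms")


@dataclass
class PingSummary:
    transmitted: int
    received: int
    loss_percent: float
    rtt: Optional[Tuple[float, ...]]  # (min, avg, max, mdev) in ms, or None


def parse_ping_summary(output: str) -> PingSummary:
    """Extract packet counts and the optional rtt line from ping's text output."""
    counts = _COUNTS_RE.search(output)
    if counts is None:
        raise ValueError("no ping statistics line found")
    sent, recv, loss = counts.groups()
    rtt_match = _RTT_RE.search(output)
    rtt = tuple(float(v) for v in rtt_match.groups()) if rtt_match else None
    return PingSummary(int(sent), int(recv), float(loss), rtt)


def reply_times(output: str) -> List[float]:
    """Round-trip times (ms) of the individual echo replies, in order."""
    return [float(t) for t in _REPLY_RE.findall(output)]


def population_stddev(values: List[float]) -> float:
    """sqrt(mean(x^2) - mean(x)^2); assumed to be how ping computes mdev."""
    n = len(values)
    mean = sum(values) / n
    mean_sq = sum(v * v for v in values) / n
    return math.sqrt(max(mean_sq - mean * mean, 0.0))


def is_reachable(summary: PingSummary) -> bool:
    """Treat the target as reachable only if every echo request got a reply."""
    return summary.transmitted > 0 and summary.received == summary.transmitted


# Output copied from the manual check in this task.
SAMPLE = """\
64 bytes from wan.cloudgw.codfw1dev.wikimediacloud.org (2a02:ec80:a100:fe03::2): icmp_seq=1 ttl=62 time=0.964 ms
64 bytes from wan.cloudgw.codfw1dev.wikimediacloud.org (2a02:ec80:a100:fe03::2): icmp_seq=2 ttl=62 time=0.316 ms

--- wan.cloudgw.codfw1dev.wikimediacloud.org. ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.316/0.640/0.964/0.324 ms
"""

if __name__ == "__main__":
    summary = parse_ping_summary(SAMPLE)
    print(is_reachable(summary), summary.loss_percent, summary.rtt)
```

Requiring zero loss rather than "at least one reply" is a deliberate choice here: the failure in this task was replies getting lost on the return path, so a single missing reply is worth flagging.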

Change #1129770 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Revert "cloudgw/icmp check/ip6: disabling"

https://gerrit.wikimedia.org/r/1129770

Change #1129770 merged by Majavah:

[operations/puppet@production] Revert "cloudgw/icmp check/ip6: disabling"

https://gerrit.wikimedia.org/r/1129770

taavi assigned this task to cmooney.
taavi added a project: Cloud-VPS.