
ProbeDown Service wan.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_wan_cloudgw_eqiad1_wikimediacloud_org_ip4)
Closed, Resolved (Public)

Description

Common information

  • address: 185.15.56.244
  • alertname: ProbeDown
  • family: ip4
  • instance: wan.cloudgw.eqiad1.wikimediacloud.org:0
  • job: probes/custom
  • module: icmp_wan_cloudgw_eqiad1_wikimediacloud_org_ip4
  • prometheus: ops
  • severity: critical
  • site: codfw
  • source: prometheus
  • team: wmcs


Event Timeline

The average success rate for that probe seems to be deteriorating:

[Screenshot 2024-12-16 at 19.35.31.png]

Here is the same graph over 9 months. The success rate is still decent at around 99%, but there is a downward trend.

[Screenshot 2024-12-16 at 19.47.14.png]

cc @Andrew @aborrero @cmooney

I don't think the failure rate here is significant enough to warrant concern. The average success rate is 99.8% over the past 3 months, and that includes the period when the cloudgw was under severe strain from all the DDoS traffic being sent to it.

https://w.wiki/CSNd
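
To put the 99.8% figure in perspective, the sketch below converts an average probe success rate into the equivalent amount of failed-probe time over a window. It is only rough arithmetic: it assumes a 90-day window for "the past 3 months" and evenly spaced probes, and the function name `failed_time` is made up for illustration.

```python
from datetime import timedelta


def failed_time(success_rate: float, window: timedelta) -> timedelta:
    """Return the portion of the window covered by failed probes.

    success_rate is a fraction between 0 and 1 (0.998 for 99.8%).
    Assumes probes are evenly spaced across the window.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return window * (1.0 - success_rate)


# Assumption: "the past 3 months" is treated as 90 days.
window = timedelta(days=90)
print(failed_time(0.998, window))
```

Under those assumptions, 99.8% over 90 days works out to a little over four hours of failed probes in total, spread across the whole period.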

Unless there are other reports of internet connectivity issues, I would not be overly concerned.

FWIW, I ran a little test from here with similar results: 0.1% loss to the cloudgw, and 0% loss to the cloudsw it's connected to. So overall I'm not sure there is any issue with internet connectivity here.

cathal@officepc:~$ mtr -z -b -w -c 25000 wan.cloudgw.eqiad1.wikimediacloud.org
Start: 2024-12-16T21:13:57+0000
HOST: officepc                                                                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS???    nbgw (192.168.240.1)                                             0.0% 25000    0.5   0.1   0.1   1.0   0.3
  2. AS6830   176.61.34.1                                                      0.0% 25000   16.0   9.7   4.0 139.3   5.6
  3. AS6830   109.255.255.254                                                  0.0% 25000   23.0   9.2   3.8 160.3   5.8
  4. AS6830   ie-dub01a-rc1-ae-31-0.aorta.net (84.116.238.42)                  0.0% 25000   11.2   9.9   4.5 135.9   7.1
  5. AS6830   ie-dub02a-ri1-ae-73-0.aorta.net (84.116.134.110)                 0.0% 25000   11.5   9.7   4.4 164.4   6.9
  6. AS2914   ae-7.a00.dublir01.ie.bb.gin.ntt.net (129.250.9.172)              0.0% 25000   13.0  10.3   5.1 154.8   8.6
  7. AS2914   ae-5.r22.parsfr04.fr.bb.gin.ntt.net (129.250.4.163)              0.0% 25000   28.2  25.3  20.6 147.7   5.6
  8. AS2914   ae-13.r23.parsfr04.fr.bb.gin.ntt.net (129.250.4.149)             0.0% 25000   29.6  25.5  20.4 224.4   6.0
  9. AS2914   ae-13.r26.asbnva02.us.bb.gin.ntt.net (129.250.6.6)               0.1% 25000  108.2 107.6 101.7 262.4   5.8
 10. AS2914   ae-6.a05.asbnva02.us.bb.gin.ntt.net (129.250.3.255)              0.2% 25000  112.6 107.7 102.1 245.2   5.6
 11. AS2914   xe-2-5-3-2.a05.asbnva02.us.ce.gin.ntt.net (192.80.17.186)        0.0% 25000  112.7 110.7 104.9 244.4   5.8
 12. AS14907  xe-0-0-0-1102.cloudsw1-c8-eqiad.wikimedia.org (208.80.154.211)   0.0% 25000  119.2 113.1 104.5 262.6  12.1
 13. AS14907  wan.cloudgw.eqiad1.wikimediacloud.org (185.15.56.244)            0.1% 25000  112.1 106.8 101.8 251.0   5.7
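
As a rough aid to reading the report above, the sketch below parses mtr report lines in the `-z -b -w` layout shown and converts each hop's Loss% into an approximate count of lost probes. The regular expression and the helper name `lost_probes` are assumptions made for illustration, not part of mtr; since the report shows Loss% to one decimal place, the counts are approximate.

```python
import re

# Matches hop lines such as:
#  13. AS14907  host.example (192.0.2.1)   0.1% 25000  112.1 106.8 ...
HOP_RE = re.compile(
    r"^\s*(?P<hop>\d+)\.\s+(?P<asn>AS\S+)\s+(?P<host>.+?)\s+"
    r"(?P<loss>\d+(?:\.\d+)?)%\s+(?P<sent>\d+)\s"
)


def lost_probes(report: str) -> list[tuple[int, str, float, int]]:
    """Return (hop, host, loss_percent, approx_lost) for each hop line."""
    rows = []
    for line in report.splitlines():
        m = HOP_RE.match(line)
        if m is None:
            continue  # skip the Start:/HOST: header lines and anything else
        loss = float(m["loss"])
        sent = int(m["sent"])
        rows.append((int(m["hop"]), m["host"], loss, round(sent * loss / 100)))
    return rows


# Three lines copied from the report above.
sample = """\
  9. AS2914   ae-13.r26.asbnva02.us.bb.gin.ntt.net (129.250.6.6)               0.1% 25000  108.2 107.6 101.7 262.4   5.8
 12. AS14907  xe-0-0-0-1102.cloudsw1-c8-eqiad.wikimedia.org (208.80.154.211)   0.0% 25000  119.2 113.1 104.5 262.6  12.1
 13. AS14907  wan.cloudgw.eqiad1.wikimediacloud.org (185.15.56.244)            0.1% 25000  112.1 106.8 101.8 251.0   5.7
"""
for hop, host, loss, lost in lost_probes(sample):
    print(f"hop {hop:>2}  {host}  {loss}%  ~{lost} lost")
```

At 25000 probes sent, a reported 0.1% loss corresponds to roughly 25 lost probes.
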
fnegri claimed this task.

My theory is that there are some events that tend to degrade the success rate, and these events are becoming (slightly) more frequent.

I agree we should not worry about the success rate for now, but I would like to find out more about the errors being logged, such as those in T382220: KernelError Server cloudgw1002 may have kernel errors.

I'll close this task but keep T382220 open.