
ProbeDown - contint2002
Closed, Resolved (Public)

Description

Common information

  • address: 208.80.153.39
  • alertname: ProbeDown
  • family: ip4
  • instance: contint2002:1443
  • job: probes/custom
  • module: http_integration_wikimedia_org_ip4
  • prometheus: ops
  • severity: task
  • site: codfw
  • source: prometheus
  • team: serviceops-collab

Firing alerts


Event Timeline

LSobanski renamed this task from ProbeDown to ProbeDown - contint2002. Nov 17 2023, 11:10 AM

The alert recovered after 5 minutes. Syslog on contint2002 shows DNS resolution failures around the time of the alert:

Nov 16 21:43:42 contint2002 ferm[24013]: DNS query for 'contint1002.wikimedia.org' failed: query timed out
Nov 16 21:43:42 contint2002 systemd[1]: ferm.service: Control process exited, code=exited, status=255/EXCEPTION
Nov 16 21:43:42 contint2002 systemd[1]: Reload failed for ferm firewall configuration.
...
Nov 16 21:46:09 contint2002 zuul-server[909]: URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
...
Nov 16 21:46:09 contint2002 zuul-merger[12859]:   stderr: 'fatal: Could not read from remote repository.
Nov 16 21:46:09 contint2002 zuul-merger[12859]: Please make sure you have the correct access rights
Nov 16 21:46:09 contint2002 zuul-merger[12859]: and the repository exists.'
...
Nov 16 21:50:38 contint2002 helm3[25690]: #011Get "https://helm-charts.wikimedia.org/stable/index.yaml": dial tcp: lookup helm-charts.wikimedia.org on 10.3.0.1:53: read udp 208.80.153.39:51480->10.3.0.1:53: i/o timeout
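
For future triage, a rough sketch of how the DNS-related lines in an excerpt like the one above could be pulled out of syslog. The patterns are assumptions based only on the messages quoted here, and the names `DNS_FAILURE`, `dns_failures` and `SAMPLE` are hypothetical; this is not an existing tool on the host.

```python
import re

# Assumed patterns, taken only from the messages quoted in this task.
DNS_FAILURE = re.compile(
    r"DNS query for .* failed"
    r"|Temporary failure in name resolution"
    r"|lookup \S+ on \S+: .*i/o timeout"
)

# Classic syslog prefix: "Mon DD HH:MM:SS host proc[pid]: message"
PREFIX = re.compile(r"(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ ([^\[:\s]+)")


def dns_failures(lines):
    """Return (timestamp, process) pairs for lines that look like DNS failures."""
    hits = []
    for line in lines:
        if not DNS_FAILURE.search(line):
            continue
        match = PREFIX.match(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


# Lines copied from the excerpt above (one non-DNS line included on purpose).
SAMPLE = [
    "Nov 16 21:43:42 contint2002 ferm[24013]: DNS query for 'contint1002.wikimedia.org' failed: query timed out",
    "Nov 16 21:43:42 contint2002 systemd[1]: Reload failed for ferm firewall configuration.",
    "Nov 16 21:46:09 contint2002 zuul-server[909]: URLError: <urlopen error [Errno -3] Temporary failure in name resolution>",
    'Nov 16 21:50:38 contint2002 helm3[25690]: #011Get "https://helm-charts.wikimedia.org/stable/index.yaml": dial tcp: lookup helm-charts.wikimedia.org on 10.3.0.1:53: read udp 208.80.153.39:51480->10.3.0.1:53: i/o timeout',
]

if __name__ == "__main__":
    for timestamp, process in dns_failures(SAMPLE):
        print(timestamp, process)
```

Listing the affected process next to each timestamp makes it easier to line the failures up against the IRC log times.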

Dzahn claimed this task.

20:59 <+icinga-wm> PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100%
..
21:00 <+icinga-wm> RECOVERY - Host dns2004 is UP: PING WARNING - Packet loss = 90%, RTA = 33.21 ms

..
21:42 < topranks> !log Removing VRRP config for for public1-b-codfw on codfw CRs (T347191)

...
21:44 <+icinga-wm> PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100%
...
21:46 <+icinga-wm> RECOVERY - Host dns2004 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
...
21:47 < topranks> that was me sry, "cleaning up" after previous work seems I'd left teh VIP on the CRs
21:47 < topranks> they'll clear shortly, reverted immediately