Wikimedia Cloud (labs) DNS is intermittently failing
Closed, Resolved (Public)

Description

Hi, today on IRC we got these Icinga warnings:

[13:57:01] <icinga-wm> PROBLEM - Check for gridmaster host resolution TCP on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:58:51] <icinga-wm> PROBLEM - Check for gridmaster host resolution TCP on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call

Soon after, I got these errors:

[14:12:23] <icinga2-wm> PROBLEM - Host ores-worker-06 is DOWN: check_ping: Invalid hostname/address - ores-worker-06.ores.eqiad.wmflabsUsage:check_ping -H <host_address> -w <wrta>,<wpl>% -c <crta>,<cpl>% [-p packets] [-t timeout] [-4
[14:12:34] <icinga2-wm> PROBLEM - puppet on ores-worker-10 is WARNING: Could not resolve hostname ores-worker-10.ores.eqiad.wmflabs: Name or service not known
[14:12:44] <icinga2-wm> PROBLEM - check users on ores-worker-10 is WARNING: Could not resolve hostname ores-worker-10.ores.eqiad.wmflabs: Name or service not known

Event Timeline

Andrew claimed this task.

This seems to have been caused by https://gerrit.wikimedia.org/r/#/c/382415/, which has now been reverted.

The labservices boxes were unable to reach the remote syslog box, which resulted in terrible hangs all over the place, disrupting DNS and other things.

We'll investigate before re-applying that change.
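
For anyone checking this kind of failure by hand, here is a minimal sketch of a reachability probe with a hard timeout, so that the caller gives up instead of hanging when the remote end is unreachable. The function name `can_reach`, the host name, and the port in the usage note are illustrative assumptions; nothing in this task states which transport or port the remote syslog box uses.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration only; it is not part of the change
    discussed in this task.
    """
    try:
        # create_connection resolves the name and connects, applying the
        # timeout to the connection attempt instead of blocking indefinitely.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_reach("syslog.example.org", 514)` (a placeholder host and port) returns `False` after at most roughly the timeout rather than stalling the caller.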