Webservicemonitor is down as it fails to resolve cloudmetrics1001.eqiad.wmnet:
Feb 12 21:21:33 tools-sgecron-2 systemd[1]: Started webservicemonitor service, to ensure web services are always running once started.
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]: Traceback (most recent call last):
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/bin/collector-runner", line 31, in <module>
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     sleep=args.sleep,
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 134, in __init__
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     self.stats = statsd.StatsClient(statsd_host, 8125, prefix=statsd_prefix)
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3/dist-packages/statsd/client.py", line 139, in __init__
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     host, port, fam, socket.SOCK_DGRAM)[0]
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]: socket.gaierror: [Errno -2] Name or service not known
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Failed with result 'exit-code'.
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Service RestartSec=100ms expired, scheduling restart.
That DNS name was removed back in November (610e603ccc3c).
Options include:
- update the name: let's not, given the lack of usage and statsd removal plans
- remove the statsd collection (a la T244809)
- remove webservicemonitor entirely
I think option 3 is the way to go here, given that no one noticed it being down for two months.
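
For reference, if a stopgap along the lines of option 2 were wanted instead, the crash in the traceback could be avoided by treating a failed statsd lookup as non-fatal. The following is only a rough sketch, not the actual webservicemonitor code: `NullStats` and `make_stats_client` are hypothetical names, and it assumes the monitor only calls `incr` and `gauge` on the client. The `factory` argument would be `statsd.StatsClient`, called the same way as in the traceback.

```python
import socket


class NullStats:
    """Hypothetical no-op stand-in for a statsd client (assumed method set)."""

    def incr(self, stat, count=1, rate=1):
        pass

    def gauge(self, stat, value, rate=1):
        pass


def make_stats_client(factory, host, port=8125, prefix=None):
    """Build a stats client, falling back to NullStats if `host` does not resolve.

    `factory` is expected to be statsd.StatsClient (or anything with the same
    call shape), which resolves the host name in its constructor.
    """
    try:
        return factory(host, port, prefix=prefix)
    except socket.gaierror:
        # DNS lookup failed: keep the service running without metrics.
        return NullStats()
```

With something like this in place, `self.stats = make_stats_client(statsd.StatsClient, statsd_host, prefix=statsd_prefix)` would let the service start even when the metrics host is gone, at the cost of silently dropping the metrics.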