remove webservicemonitor (down due to DNS errors)
Closed, Declined (Public)

Description

Webservicemonitor is down as it fails to resolve cloudmetrics1001.eqiad.wmnet:

Feb 12 21:21:33 tools-sgecron-2 systemd[1]: Started webservicemonitor service, to ensure web services are always running once started.
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]: Traceback (most recent call last):
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/bin/collector-runner", line 31, in <module>
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     sleep=args.sleep,
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 134, in __init__
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     self.stats = statsd.StatsClient(statsd_host, 8125, prefix=statsd_prefix)
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3/dist-packages/statsd/client.py", line 139, in __init__
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     host, port, fam, socket.SOCK_DGRAM)[0]
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]: socket.gaierror: [Errno -2] Name or service not known
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Failed with result 'exit-code'.
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Service RestartSec=100ms expired, scheduling restart.

That DNS name was removed back in November (610e603ccc3c).

Options include:

  1. Update the name
    • let's not, given the lack of usage and statsd removal plans
  2. Remove the statsd collection (a la T244809)
  3. Remove webservicemonitor entirely

I think option 3 is the way to go here if no-one noticed it being down for two months.
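
For reference, the traceback shows why a stale hostname takes the whole service down: the statsd client resolves the host in its constructor, so the socket.gaierror propagates out of webservicemonitor's __init__ and systemd restarts the unit in a loop. A minimal sketch of tolerating that failure, assuming the metrics are optional. The names NullStats and stats_or_noop are hypothetical and not part of tools-manifest, and this is not one of the options listed above:

```python
import logging
import socket

log = logging.getLogger("webservicemonitor")


class NullStats:
    """Hypothetical no-op stand-in used when statsd is unreachable."""

    def incr(self, stat, count=1):
        return None

    def gauge(self, stat, value):
        return None


def stats_or_noop(factory):
    """Call factory() to build a stats client; fall back to NullStats on DNS failure.

    factory is any zero-argument callable, for example
    lambda: statsd.StatsClient(statsd_host, 8125, prefix=statsd_prefix)
    """
    try:
        return factory()
    except socket.gaierror as exc:
        # Name resolution failed: log it and keep the service running.
        log.warning("statsd disabled, cannot resolve host: %s", exc)
        return NullStats()
```

With a wrapper like this the monitor would keep restarting web services even when the metrics host is gone, at the cost of losing the metrics silently unless someone reads the log.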

Event Timeline

Change 888347 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::grid: disable webservicemonitor

https://gerrit.wikimedia.org/r/888347

Change 889085 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/software/tools-manifest@master] tools-manifests: don't collect statsd metrics

https://gerrit.wikimedia.org/r/889085

Change 889085 merged by jenkins-bot:

[operations/software/tools-manifest@master] tools-manifests: don't collect statsd metrics

https://gerrit.wikimedia.org/r/889085

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T12:02:53Z] <arturo> included tools-manifests 0.25 in toolsbeta-buster aptly repo (T329611, T329467, T244809)

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T12:09:57Z] <arturo> included tools-manifests 0.25 in tools-buster aptly repo, deploying it now! (T329611, T329467, T244809)

aborrero triaged this task as Medium priority. Feb 14 2023, 12:20 PM
aborrero moved this task from Backlog to Ready to be worked on on the Toolforge board.
aborrero added a subscriber: aborrero.

Option #2 has been implemented (drop statsd support). The service is now up and running.

However, the underlying problem stated in this ticket remains: monitoring for this piece of code is missing.

I think a sensible and cost-effective approach could be to generate some basic Prometheus metrics (and export them via the filesystem) that we can use to build alerts on.
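
A rough sketch of what exporting via the filesystem could look like, assuming the node exporter textfile collector picks up *.prom files from some directory. The function name, the metric name, and the target path in the usage note are hypothetical, and nothing like this exists in tools-manifest as far as this task shows:

```python
import os
import tempfile


def write_textfile_metrics(path, metrics):
    """Write gauges in the Prometheus text exposition format to path.

    metrics maps a metric name to a (help_text, value) tuple.
    The file is written to a temporary name and renamed into place so
    a scraper never reads a half-written file.
    """
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    body = "\n".join(lines) + "\n"

    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        # mkstemp creates the file as 0600; make it readable by the exporter.
        os.chmod(tmp, 0o644)
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return body
```

The service could call this at the end of each loop with something like {"webservicemonitor_last_run_timestamp_seconds": ("Unix time of the last completed run.", int(time.time()))}, and an alert could then fire when that timestamp gets too old, which would have caught the two months of downtime described here.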

Let's just let this die when the grid dies.