
remove webservicemonitor (down due to DNS errors)
Open, Medium, Public

Description

Webservicemonitor is down as it fails to resolve cloudmetrics1001.eqiad.wmnet:

Feb 12 21:21:33 tools-sgecron-2 systemd[1]: Started webservicemonitor service, to ensure web services are always running once started.
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]: Traceback (most recent call last):
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/bin/collector-runner", line 31, in <module>
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     sleep=args.sleep,
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 134, in __init__
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     self.stats = statsd.StatsClient(statsd_host, 8125, prefix=statsd_prefix)
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3/dist-packages/statsd/client.py", line 139, in __init__
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     host, port, fam, socket.SOCK_DGRAM)[0]
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]: socket.gaierror: [Errno -2] Name or service not known
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Failed with result 'exit-code'.
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Service RestartSec=100ms expired, scheduling restart.

That DNS name was removed back in November (610e603ccc3c).
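For illustration, the failure in the traceback can be reproduced without statsd: the client resolves its host eagerly in its constructor via `socket.getaddrinfo`, so an unresolvable name raises `socket.gaierror` before the monitor ever starts its loop. The sketch below only mirrors that lookup; it is not the tools-manifest code, and `resolves` and its `resolver` parameter are hypothetical names.

```python
import socket


def resolves(host, port, resolver=socket.getaddrinfo):
    """Return True if `host` resolves for UDP, False on a DNS failure.

    Hypothetical helper: `resolver` is injectable only so the check
    can be exercised without real DNS.
    """
    try:
        # Same kind of lookup the statsd client performs at construction.
        resolver(host, port, socket.AF_INET, socket.SOCK_DGRAM)
    except socket.gaierror:
        # Same exception class as in the traceback above.
        return False
    return True
```

On the affected host, `resolves("cloudmetrics1001.eqiad.wmnet", 8125)` would presumably return `False`, since the name no longer exists.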

Options include:

  1. Update the name
    • let's not, given the lack of usage and the plans to remove statsd
  2. Remove the statsd collection (a la T244809)
  3. Remove webservicemonitor entirely

I think option 3 is the way to go here, given that no one noticed it being down for two months.

Event Timeline

Change 888347 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::grid: disable webservicemonitor

https://gerrit.wikimedia.org/r/888347

Change 889085 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/software/tools-manifest@master] tools-manifests: don't collect statsd metrics

https://gerrit.wikimedia.org/r/889085

Change 889085 merged by jenkins-bot:

[operations/software/tools-manifest@master] tools-manifests: don't collect statsd metrics

https://gerrit.wikimedia.org/r/889085

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T12:02:53Z] <arturo> included tools-manifests 0.25 in toolsbeta-buster aptly repo (T329611, T329467, T244809)

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T12:09:57Z] <arturo> included tools-manifests 0.25 in tools-buster aptly repo, deploying it now! (T329611, T329467, T244809)

aborrero triaged this task as Medium priority. Feb 14 2023, 12:20 PM
aborrero moved this task from Triage to Backlog on the Toolforge board.
aborrero added a subscriber: aborrero.

Option #2 has been implemented (drop statsd support). The service is now up and running.

However, the underlying problem stated in this ticket remains: monitoring for this piece of code is missing.

I think a sensible and cost-effective approach could be to generate some basic Prometheus metrics (and export them via the filesystem) that we can use to build alerts on.
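A minimal sketch of that idea, assuming the node exporter's textfile collector is what reads the files: render metrics in the Prometheus text exposition format and write them to a `.prom` file atomically, so a scrape never sees a half-written file. The metric names, the file path, and both function names are hypothetical; nothing here is taken from tools-manifest.

```python
import os
import tempfile


def render_metrics(restarts, last_run_ts):
    """Render two hypothetical metrics in Prometheus text format."""
    lines = [
        "# HELP webservicemonitor_restarts_total Webservices restarted by the monitor.",
        "# TYPE webservicemonitor_restarts_total counter",
        f"webservicemonitor_restarts_total {restarts}",
        "# HELP webservicemonitor_last_run_timestamp_seconds Unix time of the last completed run.",
        "# TYPE webservicemonitor_last_run_timestamp_seconds gauge",
        f"webservicemonitor_last_run_timestamp_seconds {last_run_ts}",
    ]
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write `content` to `path` atomically (temp file, then rename)."""
    directory = os.path.dirname(path) or "."
    # The temp file lives in the target directory so the rename stays
    # on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # mkstemp creates the file as 0600; the exporter usually runs
        # as a different user, so make it world-readable.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

With a last-run timestamp exported this way, an alert on something like `time() - webservicemonitor_last_run_timestamp_seconds` exceeding a threshold would catch the service being down, which is the gap this ticket exposed.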