remove webservicemonitor (down due to DNS errors)
Closed, Declined · Public

Assigned To

None

Authored By

	taavi
	Feb 12 2023, 9:35 PM

Description

Webservicemonitor is down as it fails to resolve cloudmetrics1001.eqiad.wmnet:

Feb 12 21:21:33 tools-sgecron-2 systemd[1]: Started webservicemonitor service, to ensure web services are always running once started.
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]: Traceback (most recent call last):
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/bin/collector-runner", line 31, in <module>
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     sleep=args.sleep,
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 134, in __init__
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     self.stats = statsd.StatsClient(statsd_host, 8125, prefix=statsd_prefix)
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3/dist-packages/statsd/client.py", line 139, in __init__
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     host, port, fam, socket.SOCK_DGRAM)[0]
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:   File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Feb 12 21:21:34 tools-sgecron-2 collector-runner[18600]: socket.gaierror: [Errno -2] Name or service not known
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Failed with result 'exit-code'.
Feb 12 21:21:34 tools-sgecron-2 systemd[1]: webservicemonitor.service: Service RestartSec=100ms expired, scheduling restart.

That DNS name was removed back in November (610e603ccc3c).
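The traceback shows the crash happening in the constructor: `statsd.StatsClient` resolves the host when it is created, so the `socket.gaierror` aborts the whole service before it can monitor anything. A minimal sketch of how such a failure could be made non-fatal, assuming metrics are optional. The helper names `can_resolve` and `make_stats_client` and the `factory` parameter are hypothetical and are not part of tools-manifest:

```python
import socket


def can_resolve(host, port):
    """Return True if host resolves for UDP, False on a DNS failure."""
    try:
        socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_DGRAM)
    except socket.gaierror:
        # Same exception as in the traceback above: the name is unknown.
        return False
    return True


def make_stats_client(host, port, factory):
    """Build a client via factory(host, port), or return None if the host
    does not resolve.

    factory is a stand-in for a constructor such as statsd.StatsClient;
    it is passed in so this sketch needs only the standard library.
    """
    if not can_resolve(host, port):
        return None
    return factory(host, port)
```

A caller would then have to check for `None` before emitting metrics, so a missing statsd host only disables statistics and does not take the monitor down.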

Options include:

1. ~~Update the name~~
   - let's not, given the lack of usage and statsd removal plans
2. remove the statsd collection (a la T244809)
3. remove webservicemonitor entirely

I think option 3 is the way to go here, given that no one noticed it being down for two months.

Details

	Subject	Repo	Branch	Lines +/-
	P:toolforge::grid: disable webservicemonitor	operations/puppet	production	+2 -14
	tools-manifests: don't collect statsd metrics	operations/software/tools-manifest	master	+1 -26


Related Objects

Mentioned In: T244809: Remove or fix stats collecting from tools-manifest (webservice-monitor)
T329611: Toolforge grid: start webservices after outage
Mentioned Here: T329611: Toolforge grid: start webservices after outage
T244809: Remove or fix stats collecting from tools-manifest (webservice-monitor)
rONED610e603ccc3c: andrew@cumin1001: cloudmetrics[1001-1002].eqiad.wmnet decommissioned, removing…

Event Timeline

taavi created this task. Feb 12 2023, 9:35 PM

Restricted Application added a subscriber: Aklapper. Feb 12 2023, 9:35 PM

taavi added a parent task: T314664: [infra] Decommission the Grid Engine infrastructure. Feb 12 2023, 9:36 PM

Change 888347 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::grid: disable webservicemonitor

https://gerrit.wikimedia.org/r/888347

gerritbot added a project: Patch-For-Review. Feb 12 2023, 9:40 PM

aborrero mentioned this in T329611: Toolforge grid: start webservices after outage. Feb 14 2023, 9:51 AM

Change 889085 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/software/tools-manifest@master] tools-manifests: don't collect statsd metrics

https://gerrit.wikimedia.org/r/889085

Change 889085 merged by jenkins-bot:

[operations/software/tools-manifest@master] tools-manifests: don't collect statsd metrics

https://gerrit.wikimedia.org/r/889085

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T12:02:53Z] <arturo> included tools-manifests 0.25 in toolsbeta-buster aptly repo (T329611, T329467, T244809)

Stashbot mentioned this in T244809: Remove or fix stats collecting from tools-manifest (webservice-monitor). Feb 14 2023, 12:02 PM

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T12:09:57Z] <arturo> included tools-manifests 0.25 in tools-buster aptly repo, deploying it now! (T329611, T329467, T244809)

Option #2 has been implemented (drop statsd support). The service is now up and running.

However the underlying problem stated in this ticket remains: monitoring for this piece of code is missing.

I think a sensible and cost-effective approach could be to generate some basic Prometheus metrics (and export them via the filesystem) that we can use to build alerts on.
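A rough sketch of what exporting metrics via the filesystem could look like, assuming the files are picked up by something like the node_exporter textfile collector, which reads `*.prom` files from a configured directory. The metric names, the file name, and both function names are hypothetical; nothing like this exists in tools-manifest:

```python
import contextlib
import os
import tempfile


def render_metrics(restarts_total, last_run_timestamp):
    """Render two example metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP webservicemonitor_restarts_total Webservices restarted by the monitor.",
        "# TYPE webservicemonitor_restarts_total counter",
        f"webservicemonitor_restarts_total {restarts_total}",
        "# HELP webservicemonitor_last_run_timestamp_seconds Unix time of the last completed run.",
        "# TYPE webservicemonitor_last_run_timestamp_seconds gauge",
        f"webservicemonitor_last_run_timestamp_seconds {last_run_timestamp}",
    ]
    return "\n".join(lines) + "\n"


def write_metrics(directory, text, filename="webservicemonitor.prom"):
    """Write text to directory/filename atomically.

    The data goes to a temporary file in the same directory first and is
    then renamed into place, so a reader never sees a half-written file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

An alert on the age of the last-run timestamp would then fire if the service stopped running, which is the failure mode that went unnoticed here.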

Let's just let this die when the grid dies.

taavi removed a parent task: T314664: [infra] Decommission the Grid Engine infrastructure. Jan 22 2024, 6:14 PM
