
No metrics from JS arriving in Prometheus/Graphite since around 11:48 UTC Wed. 2025-03-19
Closed, Resolved (Public)

Description

This seems to have started somewhat before the train:

image.png (735×310 px, 33 KB)

The drop is the same across all the metrics that we collect in JavaScript. Server-side metrics are not affected as far as I can tell, and the features themselves also seem to work correctly. I verified that the browser does make the request to statsv and gets a 204 response.

Event Timeline

colewhite triaged this task as High priority.
colewhite subscribed.

It seems the statsv process wedged itself. After restarting the process, metrics are now flowing again.

I threw together a dashboard for diagnostics.

We'll also want to wire up some alerting to this.

From around the same time:

Mar 19 11:45:21 webperf1003 python3[3604232]: Process Process-1:
Mar 19 11:45:21 webperf1003 python3[3604232]: Traceback (most recent call last):
Mar 19 11:45:21 webperf1003 python3[3604232]:   File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
Mar 19 11:45:21 webperf1003 python3[3604232]:     self.run()
Mar 19 11:45:21 webperf1003 python3[3604232]:   File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
Mar 19 11:45:21 webperf1003 python3[3604232]:     self._target(*self._args, **self._kwargs)
Mar 19 11:45:21 webperf1003 python3[3604232]:   File "/srv/deployment/statsv/statsv/statsv.py", line 268, in process_queue
Mar 19 11:45:21 webperf1003 python3[3604232]:     emit(sock, statsd_addr, statsd_message)
Mar 19 11:45:21 webperf1003 python3[3604232]:   File "/srv/deployment/statsv/statsv/statsv.py", line 195, in emit
Mar 19 11:45:21 webperf1003 python3[3604232]:     sock.sendto(payload.encode('utf-8'), addr)
Mar 19 11:45:21 webperf1003 python3[3604232]: socket.gaierror: [Errno -2] Name or service not known
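
The traceback ends in socket.gaierror raised from sock.sendto(), which suggests the statsd address was passed as a hostname and the name lookup failed at send time. A minimal sketch of that failure mode, assuming a plain UDP socket: send_probe and the statsd.invalid hostname are hypothetical and not part of statsv, and port 8125 is only the conventional statsd port.

```python
import socket


def send_probe(host, port=8125):
    """Send one UDP datagram to (host, port).

    Returns None on success, or the OSError that sendto() raised.
    Hypothetical helper for illustration; not part of statsv.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # With a hostname in the address tuple, Python resolves the name
        # as part of this call, so a DNS failure surfaces here.
        sock.sendto(b"test.metric:1|c", (host, port))
        return None
    except OSError as exc:
        # socket.gaierror is a subclass of OSError, so it is caught here.
        return exc
    finally:
        sock.close()
```

On a typical resolver, send_probe("statsd.invalid") would be expected to return a socket.gaierror, since the .invalid top-level domain is reserved and should never resolve. If that exception is left uncaught inside a multiprocessing worker, the worker exits with a traceback like the one above.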

Change #1129899 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/alerts@master] add statsv throughput alerts

https://gerrit.wikimedia.org/r/1129899

Change #1129899 merged by jenkins-bot:

[operations/alerts@master] add statsv throughput alerts

https://gerrit.wikimedia.org/r/1129899

We have alerting now and we know a simple restart of statsv brings it back. Optimistically closing.

Change #1204929 had a related patch set uploaded (by Cwhite; author: Cwhite):

[performance/statsv@master] add try-except around sock.sendto()

https://gerrit.wikimedia.org/r/1204929

Change #1204929 merged by jenkins-bot:

[performance/statsv@master] improve logging and set default socket timeout

https://gerrit.wikimedia.org/r/1204929
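
The patch titles mention a try-except around sock.sendto(), improved logging, and a default socket timeout. The actual change is not reproduced in this task, so the following is only a rough sketch of what such a guard could look like. The logger name, the 5-second timeout, and the True/False return value are assumptions made for illustration.

```python
import logging
import socket

# Hypothetical logger name, chosen for this sketch only.
log = logging.getLogger("statsv-sketch")

# Hypothetical value; the timeout used in the real patch is not shown here.
socket.setdefaulttimeout(5.0)


def emit(sock, addr, payload):
    """Send one statsd datagram; return True on success, False on failure."""
    try:
        sock.sendto(payload.encode("utf-8"), addr)
        return True
    except OSError as exc:
        # socket.gaierror (failed name lookup) and socket.timeout are both
        # OSError subclasses, so one handler covers them. The metric is
        # dropped and logged instead of killing the worker process.
        log.warning("dropping metric for %r: %s", addr, exc)
        return False
```

With a guard like this, a transient DNS failure would cost the affected datagrams but leave the queue-processing loop running, so a manual restart would not be needed to resume the flow of metrics.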

Mentioned in SAL (#wikimedia-operations) [2026-01-12T21:43:11Z] <cwhite@deploy2002> Started deploy [statsv/statsv@b935e2d]: T389469

Mentioned in SAL (#wikimedia-operations) [2026-01-12T21:43:21Z] <cwhite@deploy2002> Finished deploy [statsv/statsv@b935e2d]: T389469 (duration: 00m 09s)