Related Objects
Mentioned Here
- T241263: Service metrics starts crashing if non-resolvable logstash domain is provided
- rODNSb4463d8c6ba8: wmnet: drop cloudmetrics CNAMEs
- T326266: Remove the WMCS statsd/Graphite service
- T336854: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts
- T344974: De-provision beta-specific Prometheus
Event Timeline
Broke at around 2023-10-24T00:30.
Nothing in the mediawiki-config list or prod or releng SALs.
jforrester@deployment-docker-wikifunctions01:~$ date
Tue Oct 24 16:03:36 UTC 2023
jforrester@deployment-docker-wikifunctions01:~$ sudo docker ps
CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
147e0bb147be   docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-evaluator/omnibus:latest   "node server.js -c /…"   About a minute ago   Up About a minute   0.0.0.0:6927->6927/tcp   mediawiki-services-function-evaluator.service
c631b4425bff   docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-evaluator/python3-all:latest   "node server.js -c /…"   40 minutes ago   Up 40 minutes   0.0.0.0:6929->6929/tcp   function-evaluator-py.service
ac9c19153f70   docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-evaluator/javascript-all:latest   "node server.js -c /…"   47 minutes ago   Up 47 minutes   0.0.0.0:6928->6928/tcp   function-evaluator-js.service
7a1f39892c4f   docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator:latest   "node server.js -c /…"   6 days ago   Up 6 days   0.0.0.0:6254->6254/tcp   mediawiki-services-function-orchestrator.service
Error response is:
{
    "error": {
        "code": "wikilambda_function_call-not-connected",
        "info": "Could not resolve host 'deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud:6254', probably because the orchestrator is not running. Please consult the README to add the orchestrator to your docker-compose configuration.",
        "*": "See https://wikifunctions.beta.wmflabs.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
    },
    "servedby": "deployment-mediawiki11"
}
jforrester@deployment-deploy03:~$ ping deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud
PING deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud (172.16.1.154) 56(84) bytes of data.
64 bytes from deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud (172.16.1.154): icmp_seq=1 ttl=64 time=0.393 ms
64 bytes from deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud (172.16.1.154): icmp_seq=2 ttl=64 time=0.511 ms
… but:
jforrester@deployment-deploy03:~$ curl deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud:6254/_info
curl: (7) Failed to connect to deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud port 6254: Connection refused
Did the firewall/network permissions get changed?
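For anyone wanting to repeat the ping-versus-curl check from a script, here is a minimal sketch in Python using only the standard library. `probe_tcp` is a hypothetical helper name, not part of any Wikimedia tooling. As a rule of thumb, "refused" usually means the host answered but nothing was listening on that port (or something actively rejected the connection), whereas a firewall that silently drops packets tends to show up as a timeout instead.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a single TCP connection attempt to host:port."""
    try:
        # create_connection resolves the name and then attempts the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # The name could not be resolved at all.
        return "dns-failure"
    except ConnectionRefusedError:
        # The host replied, but nothing accepted the connection on this port.
        return "refused"
    except socket.timeout:
        # No reply within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # Anything else (unreachable network, etc.), reported by errno name.
        return f"error:{errno.errorcode.get(exc.errno, exc.errno)}"


if __name__ == "__main__":
    # Host and port taken from the curl attempt above.
    print(probe_tcp(
        "deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud",
        6254,
    ))
```

In the situation above this would presumably have reported "refused", which points at the service on the host rather than at the network path.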
samtar@deployment-docker-wikifunctions01:~$ sudo systemctl status mediawiki-services-function-orchestrator
● mediawiki-services-function-orchestrator.service - Systemd runner for mediawiki-services-function-orchestrator
     Loaded: loaded (/lib/systemd/system/mediawiki-services-function-orchestrator.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-10-24 16:10:17 UTC; 1min 58s ago
    Process: 2028762 ExecStartPre=/usr/bin/docker stop mediawiki-services-function-orchestrator.service (code=exited, status=1/FAILURE)
    Process: 2028775 ExecStartPre=/usr/bin/docker rm mediawiki-services-function-orchestrator.service (code=exited, status=1/FAILURE)
    Process: 2028785 ExecStartPre=/usr/bin/docker pull docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator:latest (code=exited, status=0/SUCCESS)
   Main PID: 2028800 (docker)
      Tasks: 9 (limit: 19182)
     Memory: 28.5M
        CPU: 345ms
     CGroup: /system.slice/mediawiki-services-function-orchestrator.service
             └─2028800 /usr/bin/docker run --rm=true --env-file /etc/mediawiki-services-function-orchestrator/env -p 6254:6254 -v /etc/mediawiki-services-function-orchestrator/:/etc/mediawiki-services-function-orchestrator --name mediawiki-services-function-orchestrator.>

Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":2815,"level":"INFO","levelPath":"info","msg":"Worker 2815 listening on 0.0.0.0:6254","time":"2023-1>
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":2815,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots mes>
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: (node:2808) Warning: Accessing non-existent property 'Implementation' of module exports inside circular dependency
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: (Use `node --trace-warnings ...` to show where the warning was created)
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":2808,"level":"INFO","levelPath":"info","msg":"Worker 2808 listening on 0.0.0.0:6254","time":"2023-1>
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":2808,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots mes>
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":2815,"exit_code":1,"levelPath":">
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":2808,"exit_code":1,"levelPath":">
Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet sounds suspiciously related to the recently completed T344974: De-provision beta-specific Prometheus
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":814,"level":"WARN","levelPath":"warn/spec","msg":"Could not load the spec: Error: ENOENT: no such file or directory, open '/srv/service/spec.yaml'","time":"2023-10-24T16:36:42.739Z","v":0}
maxRequestsPerId is 100
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":830,"level":"WARN","levelPath":"warn/spec","msg":"Could not load the spec: Error: ENOENT: no such file or directory, open '/srv/service/spec.yaml'","time":"2023-10-24T16:36:43.773Z","v":0}
maxRequestsPerId is 100
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":783,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:43.895Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":776,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:43.909Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":783,"level":"INFO","levelPath":"info","msg":"Worker 783 listening on 0.0.0.0:6927","time":"2023-10-24T16:36:43.951Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":776,"level":"INFO","levelPath":"info","msg":"Worker 776 listening on 0.0.0.0:6927","time":"2023-10-24T16:36:43.999Z","v":0}
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":798,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:44.100Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":798,"level":"INFO","levelPath":"info","msg":"Worker 798 listening on 0.0.0.0:6927","time":"2023-10-24T16:36:44.108Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":776,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2023-10-24T16:36:44.150Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":783,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2023-10-24T16:36:44.154Z","v":0}
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":785,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:44.170Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":798,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2023-10-24T16:36:44.374Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":1,"level":"WARN","message":"worker died during startup, continue startup","exit_code":1,"worker_pid":785,"levelPath":"warn/service-runner/master","msg":"worker died during startup, continue startup","time":"2023-10-24T16:36:44.434Z","v":0}
^XError: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":807,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:45.385Z","v":0}
Above crash looks like T241263: Service metrics starts crashing if non-resolvable logstash domain is provided
Oop, no, looks like prometheus-labmon.eqiad.wmnet got nix'd in rODNSb4463d8c6ba8: wmnet: drop cloudmetrics CNAMEs yesterday (rel T326266, T336854)
Horrible hacky "fix" — changed host: prometheus-labmon.eqiad.wmnet to host: 127.0.0.1 in the hiera config metrics stanzas
No longer crash-looping, and https://wikifunctions.beta.wmflabs.org/view/en/Z866 appears to work
Oh lovely. Those internal DNS aliases were never documented to be stable, and the retirement of the statsd service was announced about four months ago. It's just that it's UDP, so nothing broke immediately when two layers of firewalls started dropping the traffic.
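To illustrate the UDP point: sending a datagram to an address where nothing listens (or where a firewall drops it) normally succeeds from the sender's side, but the name lookup that precedes it can still fail, which is the ENOTFOUND seen in the logs. A minimal sketch of a fire-and-forget sender that treats both cases as non-fatal, written in Python rather than the services' Node.js and using only the standard library; `send_metric` is a hypothetical name, and this is not how hot-shots or service-runner are implemented.

```python
import socket


def send_metric(host: str, port: int, payload: str) -> bool:
    """Send one statsd-style datagram; return False instead of raising."""
    try:
        # Name resolution is the step that failed with ENOTFOUND in the logs.
        family, socktype, proto, _, addr = socket.getaddrinfo(
            host, port, type=socket.SOCK_DGRAM
        )[0]
    except socket.gaierror:
        # Unresolvable metrics host: drop the metric, keep the service alive.
        return False
    try:
        with socket.socket(family, socktype, proto) as sock:
            # UDP is connectionless: this does not wait for any receiver.
            sock.sendto(payload.encode("utf-8"), addr)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # 127.0.0.1 mirrors the workaround above; 8125 is the conventional
    # statsd port and is an assumption here, as is the metric name.
    print(send_metric("127.0.0.1", 8125, "example.counter:1|c"))
    print(send_metric("prometheus-labmon.eqiad.wmnet", 8125, "example.counter:1|c"))
```

This is also why pointing the config at 127.0.0.1 stops the crash loop: the lookup always succeeds and the datagrams are simply discarded.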
Is there a replacement for writing statsd-like metrics, or are we expected to just stop (or build our own)? The relevant part of that announcement is about clients reading data out of that statsd service, FWICT.
It's briefly mentioned in the Wikitech news page linked from the announcement, but no, there is no replacement maintained by the WMCS team.
Mentioned in SAL (#wikimedia-releng) [2023-10-24T18:44:17Z] <James_F> tools.deployment-prep Re-configure deployment-docker-wikifunctions01 to drop statsd metrics writing, no longer supported, per T349648
Note that there are other references, presumably equally broken: https://codesearch-beta.wmcloud.org/search/?q=prometheus-labmon.eqiad.wmnet&files=&excludeFiles=&repos=