
Wikifunctions Beta services down, alerting, blocking use and CI
Closed, Resolved, Public

Event Timeline

Broke at around 2023-10-24T00:30.

Nothing in the mediawiki-config list or prod or releng SALs.

jforrester@deployment-docker-wikifunctions01:~$ date
Tue Oct 24 16:03:36 UTC 2023
jforrester@deployment-docker-wikifunctions01:~$ sudo docker ps
CONTAINER ID   IMAGE                                                                                                      COMMAND                  CREATED              STATUS              PORTS                    NAMES
147e0bb147be   docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-evaluator/omnibus:latest          "node server.js -c /…"   About a minute ago   Up About a minute   0.0.0.0:6927->6927/tcp   mediawiki-services-function-evaluator.service
c631b4425bff   docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-evaluator/python3-all:latest      "node server.js -c /…"   40 minutes ago       Up 40 minutes       0.0.0.0:6929->6929/tcp   function-evaluator-py.service
ac9c19153f70   docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-evaluator/javascript-all:latest   "node server.js -c /…"   47 minutes ago       Up 47 minutes       0.0.0.0:6928->6928/tcp   function-evaluator-js.service
7a1f39892c4f   docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator:latest               "node server.js -c /…"   6 days ago           Up 6 days           0.0.0.0:6254->6254/tcp   mediawiki-services-function-orchestrator.service

Error response is:

{
    "error": {
        "code": "wikilambda_function_call-not-connected",
        "info": "Could not resolve host 'deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud:6254', probably because the orchestrator is not running. Please consult the README to add the orchestrator to your docker-compose configuration.",
        "*": "See https://wikifunctions.beta.wmflabs.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
    },
    "servedby": "deployment-mediawiki11"
}
jforrester@deployment-deploy03:~$ ping deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud
PING deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud (172.16.1.154) 56(84) bytes of data.
64 bytes from deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud (172.16.1.154): icmp_seq=1 ttl=64 time=0.393 ms
64 bytes from deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud (172.16.1.154): icmp_seq=2 ttl=64 time=0.511 ms

… but:

jforrester@deployment-deploy03:~$ curl deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud:6254/_info
curl: (7) Failed to connect to deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud port 6254: Connection refused

Did the firewall/network permissions get changed?
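
For anyone retracing this step: the API error says "Could not resolve host", but the ping shows the name does resolve, and curl's "Connection refused" usually means the host answered and nothing accepted the connection on that port. A firewall that silently drops packets tends to show up as a timeout instead, though one that actively rejects can look the same as a dead service. Below is a minimal Python sketch of a probe that separates those cases. The name probe_tcp and its return labels are made up for illustration and are not part of any Wikimedia tooling.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as one of:
    'dns-failure', 'refused', 'timeout', 'unreachable', or 'open'.
    (Labels are hypothetical, chosen for this sketch.)
    """
    try:
        # Step 1: name resolution. This is what "Could not resolve host"
        # would actually mean if it were accurate.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"

    family, socktype, proto, _canonname, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        try:
            # Step 2: the TCP handshake itself.
            sock.connect(sockaddr)
        except ConnectionRefusedError:
            # Host reachable, but nothing is listening (or a firewall rejects).
            return "refused"
        except socket.timeout:
            # No answer at all, e.g. packets silently dropped.
            return "timeout"
        except OSError:
            return "unreachable"
    return "open"
```

Run from deployment-deploy03 against port 6254, a result of "refused" rather than "dns-failure" or "timeout" would point towards the service on the host rather than DNS or a dropping firewall.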

samtar@deployment-docker-wikifunctions01:~$ sudo systemctl status mediawiki-services-function-orchestrator
● mediawiki-services-function-orchestrator.service - Systemd runner for mediawiki-services-function-orchestrator
     Loaded: loaded (/lib/systemd/system/mediawiki-services-function-orchestrator.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-10-24 16:10:17 UTC; 1min 58s ago
    Process: 2028762 ExecStartPre=/usr/bin/docker stop mediawiki-services-function-orchestrator.service (code=exited, status=1/FAILURE)
    Process: 2028775 ExecStartPre=/usr/bin/docker rm mediawiki-services-function-orchestrator.service (code=exited, status=1/FAILURE)
    Process: 2028785 ExecStartPre=/usr/bin/docker pull docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator:latest (code=exited, status=0/SUCCESS)
   Main PID: 2028800 (docker)
      Tasks: 9 (limit: 19182)
     Memory: 28.5M
        CPU: 345ms
     CGroup: /system.slice/mediawiki-services-function-orchestrator.service
             └─2028800 /usr/bin/docker run --rm=true --env-file /etc/mediawiki-services-function-orchestrator/env -p 6254:6254 -v /etc/mediawiki-services-function-orchestrator/:/etc/mediawiki-services-function-orchestrator --name mediawiki-services-function-orchestrator.>

Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":2815,"level":"INFO","levelPath":"info","msg":"Worker 2815 listening on 0.0.0.0:6254","time":"2023-1>
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":2815,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots mes>
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: (node:2808) Warning: Accessing non-existent property 'Implementation' of module exports inside circular dependency
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: (Use `node --trace-warnings ...` to show where the warning was created)
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":2808,"level":"INFO","levelPath":"info","msg":"Worker 2808 listening on 0.0.0.0:6254","time":"2023-1>
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":2808,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots mes>
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":2815,"exit_code":1,"levelPath":">
Oct 24 16:12:15 deployment-docker-wikifunctions01 docker-mediawiki-services-function-orchestrator[2028800]: {"name":"function-orchestrator","hostname":"ec17df8d3c6e","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":2808,"exit_code":1,"levelPath":">

Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet sounds suspiciously related to the recently completed T344974: De-provision beta-specific Prometheus


{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":814,"level":"WARN","levelPath":"warn/spec","msg":"Could not load the spec: Error: ENOENT: no such file or directory, open '/srv/service/spec.yaml'","time":"2023-10-24T16:36:42.739Z","v":0}
maxRequestsPerId is 100
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":830,"level":"WARN","levelPath":"warn/spec","msg":"Could not load the spec: Error: ENOENT: no such file or directory, open '/srv/service/spec.yaml'","time":"2023-10-24T16:36:43.773Z","v":0}
maxRequestsPerId is 100
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":783,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n    at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:43.895Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":776,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n    at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:43.909Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":783,"level":"INFO","levelPath":"info","msg":"Worker 783 listening on 0.0.0.0:6927","time":"2023-10-24T16:36:43.951Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":776,"level":"INFO","levelPath":"info","msg":"Worker 776 listening on 0.0.0.0:6927","time":"2023-10-24T16:36:43.999Z","v":0}
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":798,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n    at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:44.100Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":798,"level":"INFO","levelPath":"info","msg":"Worker 798 listening on 0.0.0.0:6927","time":"2023-10-24T16:36:44.108Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":776,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2023-10-24T16:36:44.150Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":783,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2023-10-24T16:36:44.154Z","v":0}
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":785,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n    at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:44.170Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":1,"level":"ERROR","message":"worker died, restarting","worker_pid":798,"exit_code":1,"levelPath":"error/service-runner/master","msg":"worker died, restarting","time":"2023-10-24T16:36:44.374Z","v":0}
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":1,"level":"WARN","message":"worker died during startup, continue startup","exit_code":1,"worker_pid":785,"levelPath":"warn/service-runner/master","msg":"worker died during startup, continue startup","time":"2023-10-24T16:36:44.434Z","v":0}
Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet
{"name":"function-evaluator","hostname":"f4b4e024f6d4","pid":807,"level":"FATAL","err":{"message":"","name":"Error","stack":"Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet\n    at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)\n    at processTicksAndRejections (node:internal/process/task_queues:82:21)","code":"ENOTFOUND","levelPath":"fatal/service-runner/unhandled"},"msg":"Error sending hot-shots message: Error: getaddrinfo ENOTFOUND prometheus-labmon.eqiad.wmnet","time":"2023-10-24T16:36:45.385Z","v":0}

Above crash looks like T241263: Service metrics starts crashing if non-resolvable logstash domain is provided
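
To illustrate the failure mode in the logs: the workers die because the DNS lookup for the metrics host fails and the error surfaces as unhandled. One possible mitigation is to resolve the metrics target once at startup and turn metrics off if the name does not exist. This is only a sketch in Python, not how service-runner or hot-shots behave. resolve_metrics_target is a hypothetical helper, and port 8125 is the conventional statsd port, assumed here rather than taken from the actual config.

```python
import socket
from typing import Callable, Optional, Tuple


def resolve_metrics_target(
    host: str,
    port: int,
    resolver: Callable = socket.getaddrinfo,
) -> Optional[Tuple[str, int]]:
    """Return (ip, port) for a statsd-style UDP target,
    or None if the host name cannot be resolved.

    The resolver is injectable so the failure path can be exercised
    without depending on real DNS.
    """
    try:
        infos = resolver(host, port, type=socket.SOCK_DGRAM)
    except socket.gaierror:
        # Python's counterpart of Node's "getaddrinfo ENOTFOUND".
        return None
    # sockaddr is (ip, port) for IPv4 and (ip, port, flow, scope) for IPv6.
    sockaddr = infos[0][4]
    return (sockaddr[0], sockaddr[1])


if __name__ == "__main__":
    target = resolve_metrics_target("prometheus-labmon.eqiad.wmnet", 8125)
    if target is None:
        print("metrics host does not resolve; continuing without metrics")
    else:
        print(f"sending metrics to {target[0]}:{target[1]}")
```

The design choice is simply that a missing metrics sink degrades to "no metrics" instead of taking the workers down.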

Oops, not T344974 after all: it looks like prometheus-labmon.eqiad.wmnet was removed yesterday in rODNSb4463d8c6ba8: wmnet: drop cloudmetrics CNAMEs (rel T326266, T336854).

Horrible hacky "fix": changed host: prometheus-labmon.eqiad.wmnet to host: 127.0.0.1 in the metrics stanzas of the hiera config.

No longer crash-looping, and https://wikifunctions.beta.wmflabs.org/view/en/Z866 appears to work

Aha, thanks! Do you know what these should be instead?

Oh lovely. Those internal DNS aliases were never documented to be stable, and the retirement of the statsd service was announced about four months ago. It's just that it's UDP, so nothing broke immediately when two layers of firewalls started dropping the traffic.
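
A small illustration of the UDP point, and of why pointing the metrics host at 127.0.0.1 stops the crash loop: a UDP send to an address where nothing is listening normally reports success to the sender, whereas an unresolvable host name fails before any packet is sent. This is a minimal Python sketch. send_statsd_counter is a hypothetical helper, not the hot-shots API, and the behaviour described is what I'd expect for an unconnected UDP socket on Linux.

```python
import socket


def send_statsd_counter(name: str, host: str, port: int) -> int:
    """Send one statsd counter increment ("<name>:1|c") over UDP.

    Returns the number of bytes handed to the kernel. With UDP this says
    nothing about whether anything received the datagram.
    """
    payload = f"{name}:1|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # If host is a name that does not resolve, sendto raises
        # socket.gaierror here. That lookup is the step that fails loudly.
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Nothing needs to be listening on this port for the call to succeed.
    sent = send_statsd_counter("demo.counter", "127.0.0.1", 8125)
    print(f"handed {sent} bytes to the kernel; delivery is not confirmed")
```

So dropped traffic went unnoticed for months, while the removed DNS name broke things at once.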

Is there a replacement for writing statsd-like metrics, or are we expected to just stop (or build our own)? The relevant part of that announcement is about clients reading data out of that statsd service, FWICT.

It's briefly mentioned in the Wikitech news page linked from the announcement, but no, there is no replacement maintained by the WMCS team.

Thanks. Will just drop. Yet another way that Beta Cluster is un-prod-like, ah well.

Mentioned in SAL (#wikimedia-releng) [2023-10-24T18:44:17Z] <James_F> tools.deployment-prep Re-configure deployment-docker-wikifunctions01 to drop statsd metrics writing, no longer supported, per T349648

Note that there are other references, presumably equally broken: https://codesearch-beta.wmcloud.org/search/?q=prometheus-labmon.eqiad.wmnet&files=&excludeFiles=&repos=