Description
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Resolved | Andrew | T327742 Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm | |||
| Resolved | Eevans | T370460 Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) | |||
| Resolved | BUG REPORT | bd808 | T388043 No Puppet resources found on instance deployment-changeprop-1 on project deployment-prep |
Event Timeline
$ ssh deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud $ sudo -i puppet agent -tv Error: No space left on device @ fptr_finalize_flush - /var/lib/puppet/ssl/ssl.lock Error: No space left on device @ fptr_finalize_flush - /var/lib/puppet/ssl/ssl.lock $ df -h Filesystem Size Used Avail Use% Mounted on udev 2.0G 0 2.0G 0% /dev tmpfs 393M 38M 356M 10% /run /dev/sda1 20G 20G 0 100% / tmpfs 2.0G 0 2.0G 0% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock /dev/sda15 124M 12M 113M 10% /boot/efi tmpfs 393M 0 393M 0% /run/user/0 tmpfs 393M 0 393M 0% /run/user/3518 $ sudo su - # du -sh /var/lib/docker 16G /var/lib/docker
Further digging led to finding a 12G log file at /var/lib/docker/containers/e78527c860483b10eb112fac855e9b7bcf4bbc29894004fc2d67227bdc44e8cb/e78527c860483b10eb112fac855e9b7bcf4bbc29894004fc2d67227bdc44e8cb-json.log
I truncted the log to get things into a state where we can think about better long term solutions:
root@deployment-changeprop-1:/var/lib/docker/containers/e78527c860483b10eb112fac855e9b7bcf4bbc29894004fc2d67227bdc44e8cb# truncate -s 0 e78527c860483b10eb112fac855e9b7bcf4bbc29894004fc2d67227bdc44e8cb-json.log
The underlying problem is that changeprop is big mad about deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud not existing:
{ "name": "changeprop", "hostname": "e78527c86048", "pid": 1, "level": "WARN", "message": "Retry count exceeded", "event_str": "{\"$schema\":\"/change-prop/retry/1.0.0\",\"meta\":{\"stream\":\"changeprop.retry.resource_change\",\"uri\":\"https://wikifunctions.beta.wmflabs.org/wiki/Z17\",\"domain\":\"wikifunctions.beta.wmflabs.org\"},\"triggered_by\":\"req:7ebdeef5241ee23c86566808,resource_change:https://wikifunctions.beta.wmflabs.org/wiki/Z17,changeprop.retry.resource_change:https://wikifunctions.beta.wmflabs.org/wiki/Z17,changeprop.retry.resource_change:https://wikifunctions.beta.wmflabs.org/wiki/Z17\",\"emitter_id\":\"changeprop#mw_purge\",\"retries_left\":0,\"original_event\":{\"$schema\":\"/resource_change/1.0.0\",\"meta\":{\"uri\":\"https://wikifunctions.beta.wmflabs.org/wiki/Z17\",\"request_id\":\"7ebdeef5241ee23c86566808\",\"id\":\"89c57c3c-6603-4683-bda3-65c2e701daac\",\"dt\":\"2025-03-05T18:22:47Z\",\"domain\":\"wikifunctions.beta.wmflabs.org\",\"stream\":\"resource_change\"},\"tags\":[\"purge\"]},\"error_status\":504}", "status": 504, "page": "https://wikifunctions.beta.wmflabs.org/wiki/Z17", "description": "getaddrinfo ENOTFOUND deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud", "stack": "HTTPError: getaddrinfo ENOTFOUND deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud\n at /srv/service/node_modules/preq/index.js:246:19\n at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)\n at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:725:18)\n at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)\n at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)\n at Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)\n at process.processImmediate (node:internal/timers:471:21)", "stream": "changeprop.retry.resource_change", "rule_name": "mw_purge", "executor": "RetryExecutor", "levelPath": "warn/retry_count", "msg": "Retry count exceeded", "time": "2025-03-05T18:29:47.840Z", "v": 0 }
deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud should be the replacement.
# docker exec -it changeprop.service /bin/bash $ grep deployment-restbase04 /etc/changeprop/config.yaml uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{{match.meta.uri[1]}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/media-list/{{match.meta.uri[1]}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/mobile-html/{{match.meta.uri[1]}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{decode(match.meta.uri.title)}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{decode(match.meta.uri.title)}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{decode(match.meta.uri.title)}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{decode(match.meta.uri.title)}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{decode(match.meta.uri.title)}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{{match.meta.uri.title}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/mobile-html/{{match.meta.uri.title}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{{match.meta.uri.title}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{message.page_title}/{{message.rev_id}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/title/{message.page_title}/{{message.rev_id}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/title/{message.page_title}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/title/{message.page_title}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{message.page_title}/{{message.rev_id}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/title/{message.prior_state.page_title}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{{match.meta.uri.title}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{{match.meta.uri.title}}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{message.page_title}' uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{message.page_title}'
Hopefully @Ottomata can help me figure out where to update the config:
[18:57] < bd808> ottomata: would you be able to tell me how the changeprop container running on deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud get's its configuration? I can see a docker volume that contains the /etc/changeprop/config.yaml file that the running service reads, but I'm not understanding how that config got there/gets updated. [18:58] < bd808> context is that T388043 was caused by changeprop filling the disk with error logs because it is trying to route to the wrong restbase instance. [18:58] < stashbot> T388043: No Puppet resources found on instance deployment-changeprop-1 on project deployment-prep - https://phabricator.wikimedia.org/T388043
cc @aaron who has recently been working on change-prop in beta.
would you be able to tell me how the changeprop container running on deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud get's its configuration?
I barely know but @aaron recently asked this in Slack too.
It is in...deployment-charts prod repo?
https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep
https://wikitech.wikimedia.org/wiki/Changeprop#Deploying_the_configuration
Apparently https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1069145 updated the files that generate the config. Possibly no one followed up by deploying that configuration? cc @Jgiannelos, @Eevans
I tried to follow the instructions from https://wikitech.wikimedia.org/wiki/Changeprop#Deploying_the_configuration and things went badly:
$ venv/bin/python3 make_beta_config.py . changeprop | ssh deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud sudo sh -xc \''cat > $(docker volume inspect changeprop -f {{.Mountpoint}})/config.yaml && systemctl restart changeprop'\' + docker volume inspect changeprop -f {{.Mountpoint}} + cat + systemctl restart changeprop
root@deployment-changeprop-1:~# systemctl status changeprop --no-pager -l ● changeprop.service - Systemd runner for changeprop Loaded: loaded (/lib/systemd/system/changeprop.service; enabled; preset: enabled) Active: active (running) since Wed 2025-03-05 22:22:57 UTC; 3s ago Process: 1005117 ExecStartPre=/usr/bin/docker stop changeprop.service (code=exited, status=1/FAILURE) Process: 1005123 ExecStartPre=/usr/bin/docker rm changeprop.service (code=exited, status=1/FAILURE) Main PID: 1005130 (docker) Tasks: 7 (limit: 4685) Memory: 12.1M CPU: 163ms CGroup: /system.slice/changeprop.service └─1005130 /usr/bin/docker run --rm=true --env-file /etc/changeprop/env -p 7272:7272 -v changeprop:/etc/changeprop --name changeprop.service docker-registry.wikimedia.org/wikimedia/mediawiki-services-change-propagation:2025-03-03-200218-production /srv/service/server.js -c /etc/changeprop/config.yaml Mar 05 22:22:57 deployment-changeprop-1 systemd[1]: Starting changeprop.service - Systemd runner for changeprop... Mar 05 22:22:57 deployment-changeprop-1 docker-changeprop[1005117]: Error response from daemon: No such container: changeprop.service Mar 05 22:22:57 deployment-changeprop-1 docker-changeprop[1005123]: Error: No such container: changeprop.service Mar 05 22:22:57 deployment-changeprop-1 systemd[1]: Started changeprop.service - Systemd runner for changeprop. Mar 05 22:22:58 deployment-changeprop-1 docker-changeprop[1005130]: {"name":"change-propagation","hostname":"0bf738485250","pid":1,"level":"WARN","levelPath":"warn/metrics","msg":"makeChild() is deprecated.","time":"2025-03-05T22:22:58.998Z","v":0} Mar 05 22:22:59 deployment-changeprop-1 docker-changeprop[1005130]: {"name":"changeprop","hostname":"0bf738485250","pid":1,"level":"FATAL","err":{"message":"libnode.so.108: cannot open shared object file: No such file or directory","name":"Error","stack":"Error: libnode.so.108: cannot open shared object file: No such file or directory\n at Module._extensions..node (node:internal/modules/cjs/loader:1319:18)\n at Module.load (node:internal/modules/cjs/loader:1091:32)\n at Module._load (node:internal/modules/cjs/loader:938:12)\n at Module.require (node:internal/modules/cjs/loader:1115:19)\n at require (node:internal/modules/helpers:130:18)\n at bindings (/srv/service/node_modules/bindings/bindings.js:112:48)\n at Object.<anonymous> (/srv/service/node_modules/node-rdkafka/librdkafka.js:10:32)\n at Module._compile (node:internal/modules/cjs/loader:1233:14)\n at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)\n at Module.load (node:internal/modules/cjs/loader:1091:32)\n at Module._load (node:internal/modules/cjs/loader:938:12)\n at Module.require (node:internal/modules/cjs/loader:1115:19)\n at require (node:internal/modules/helpers:130:18)\n at Object.<anonymous> (/srv/service/node_modules/node-rdkafka/lib/client.js:14:13)\n at Module._compile (node:internal/modules/cjs/loader:1233:14)\n at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)","code":"ERR_DLOPEN_FAILED"},"stack":"Error: libnode.so.108: cannot open shared object file: No such file or directory\n at Module._extensions..node (node:internal/modules/cjs/loader:1319:18)\n at Module.load (node:internal/modules/cjs/loader:1091:32)\n at Module._load (node:internal/modules/cjs/loader:938:12)\n at Module.require (node:internal/modules/cjs/loader:1115:19)\n at require (node:internal/modules/helpers:130:18)\n at bindings (/srv/service/node_modules/bindings/bindings.js:112:48)\n at Object.<anonymous> (/srv/service/node_modules/node-rdkafka/librdkafka.js:10:32)\n at Module._compile (node:internal/modules/cjs/loader:1233:14)\n at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)\n at Module.load (node:internal/modules/cjs/loader:1091:32)\n at Module._load (node:internal/modules/cjs/loader:938:12)\n at Module.require (node:internal/modules/cjs/loader:1115:19)\n at require (node:internal/modules/helpers:130:18)\n at Object.<anonymous> (/srv/service/node_modules/node-rdkafka/lib/client.js:14:13)\n at Module._compile (node:internal/modules/cjs/loader:1233:14)\n at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)","levelPath":"fatal/startup","msg":"Message not supplied","time":"2025-03-05T22:22:59.064Z","v":0}
Something changed as I was faffing about trying to figure out what the "libnode.so.108: cannot open shared object file: No such file or directory" crash was all about. The container is now running and appears to have the updated config that I pushed in T388043#10607728.
Now we have new errors:
Mar 05 22:49:42 deployment-changeprop-1 docker-changeprop[1029900]: {"name":"changeprop","hostname":"08a1130c311a","pid":1,"level":"ERROR","message":"Exec error in changeprop","status":404,"event_str":"{\"$schema\":\"/resource_change/1.0.0\",\"meta\":{\"uri\":\"https://wikifunctions.beta.wmflabs.org/wiki/Z8882\",\"request_id\":\"5ca2bbc8ac375f483f31af57\",\"id\":\"fbd1ff26-a6a8-4ab4-a472-5562e7add327\",\"dt\":\"2025-03-05T22:31:07Z\",\"domain\":\"wikifunctions.beta.wmflabs.org\",\"stream\":\"resource_change\"},\"tags\":[\"purge\"]}","stream":"resource_change","error_body_str":"{\"type\":\"https://mediawiki.org/wiki/HyperSwitch/errors/not_found#route\",\"title\":\"Not found.\",\"method\":\"get\",\"uri\":\"/wikifunctions.beta.wmflabs.org/v1/page/html/Z8882\",\"internalURI\":\"http://deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud:7231/wikifunctions.beta.wmflabs.org/v1/page/html/Z8882\",\"internalMethod\":\"get\"}","rule_name":"mw_purge","executor":"RuleExecutor","levelPath":"error/exec_error","msg":"Exec error in changeprop","time":"2025-03-05T22:49:42.337Z","v":0}
I am going to call this "done" weakly because the host is up, the service is up, and the log file that was 12G is now 3.1M. Things are still obviously not completely happy, but hopefully @aaron or someone else who has reason to be mucking about with changeprop will work on that.