Page MenuHomePhabricator

No Puppet resources found on instance deployment-changeprop-1 on project deployment-prep
Closed, ResolvedPublicBUG REPORT

Event Timeline

bd808 changed the task status from Open to In Progress.Mar 5 2025, 6:24 PM
bd808 claimed this task.
bd808 triaged this task as Medium priority.
$ ssh deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud
$ sudo -i puppet agent -tv
Error: No space left on device @ fptr_finalize_flush - /var/lib/puppet/ssl/ssl.lock
Error: No space left on device @ fptr_finalize_flush - /var/lib/puppet/ssl/ssl.lock
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           393M   38M  356M  10% /run
/dev/sda1        20G   20G     0 100% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      124M   12M  113M  10% /boot/efi
tmpfs           393M     0  393M   0% /run/user/0
tmpfs           393M     0  393M   0% /run/user/3518
$ sudo su -
# du -sh /var/lib/docker
16G     /var/lib/docker

Further digging led to finding a 12G log file at /var/lib/docker/containers/e78527c860483b10eb112fac855e9b7bcf4bbc29894004fc2d67227bdc44e8cb/e78527c860483b10eb112fac855e9b7bcf4bbc29894004fc2d67227bdc44e8cb-json.log

I truncted the log to get things into a state where we can think about better long term solutions:

root@deployment-changeprop-1:/var/lib/docker/containers/e78527c860483b10eb112fac855e9b7bcf4bbc29894004fc2d67227bdc44e8cb# truncate -s 0 e78527c860483b10eb112fac855e9b7bcf4bbc29894004fc2d67227bdc44e8cb-json.log

The underlying problem is that changeprop is big mad about deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud not existing:

{
  "name": "changeprop",
  "hostname": "e78527c86048",
  "pid": 1,
  "level": "WARN",
  "message": "Retry count exceeded",
  "event_str": "{\"$schema\":\"/change-prop/retry/1.0.0\",\"meta\":{\"stream\":\"changeprop.retry.resource_change\",\"uri\":\"https://wikifunctions.beta.wmflabs.org/wiki/Z17\",\"domain\":\"wikifunctions.beta.wmflabs.org\"},\"triggered_by\":\"req:7ebdeef5241ee23c86566808,resource_change:https://wikifunctions.beta.wmflabs.org/wiki/Z17,changeprop.retry.resource_change:https://wikifunctions.beta.wmflabs.org/wiki/Z17,changeprop.retry.resource_change:https://wikifunctions.beta.wmflabs.org/wiki/Z17\",\"emitter_id\":\"changeprop#mw_purge\",\"retries_left\":0,\"original_event\":{\"$schema\":\"/resource_change/1.0.0\",\"meta\":{\"uri\":\"https://wikifunctions.beta.wmflabs.org/wiki/Z17\",\"request_id\":\"7ebdeef5241ee23c86566808\",\"id\":\"89c57c3c-6603-4683-bda3-65c2e701daac\",\"dt\":\"2025-03-05T18:22:47Z\",\"domain\":\"wikifunctions.beta.wmflabs.org\",\"stream\":\"resource_change\"},\"tags\":[\"purge\"]},\"error_status\":504}",
  "status": 504,
  "page": "https://wikifunctions.beta.wmflabs.org/wiki/Z17",
  "description": "getaddrinfo ENOTFOUND deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud",
  "stack": "HTTPError: getaddrinfo ENOTFOUND deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud\n    at /srv/service/node_modules/preq/index.js:246:19\n    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)\n    at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:725:18)\n    at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)\n    at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)\n    at Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)\n    at process.processImmediate (node:internal/timers:471:21)",
  "stream": "changeprop.retry.resource_change",
  "rule_name": "mw_purge",
  "executor": "RetryExecutor",
  "levelPath": "warn/retry_count",
  "msg": "Retry count exceeded",
  "time": "2025-03-05T18:29:47.840Z",
  "v": 0
}

deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud should be the replacement.

# docker exec -it changeprop.service /bin/bash
$ grep deployment-restbase04 /etc/changeprop/config.yaml
                      uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{{match.meta.uri[1]}}'
                  uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/media-list/{{match.meta.uri[1]}}'
                  uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/mobile-html/{{match.meta.uri[1]}}'
                        uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{decode(match.meta.uri.title)}'
                        uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{decode(match.meta.uri.title)}'
                        uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{decode(match.meta.uri.title)}'
                        uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{decode(match.meta.uri.title)}'
                        uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{decode(match.meta.uri.title)}'
                        uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{{match.meta.uri.title}}'
                        uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/mobile-html/{{match.meta.uri.title}}'
                        uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{{match.meta.uri.title}}'
                    uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{message.page_title}/{{message.rev_id}}'
                    uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/title/{message.page_title}/{{message.rev_id}}'
                    uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/title/{message.page_title}'
                    uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/title/{message.page_title}'
                    uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{message.page_title}/{{message.rev_id}}'
                    uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/title/{message.prior_state.page_title}'
                      uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{{match.meta.uri.title}}'
                      uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/html/{{match.meta.uri.title}}'
                      uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{message.page_title}'
                      uri: 'http://deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud:7231/{{message.meta.domain}}/v1/page/summary/{message.page_title}'

Hopefully @Ottomata can help me figure out where to update the config:

[18:57]  <    bd808> ottomata: would you be able to tell me how the changeprop container running on deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud get's its configuration? I can see a docker volume that contains the /etc/changeprop/config.yaml file that the running service reads, but I'm not understanding how that config got there/gets updated.
[18:58]  <    bd808> context is that T388043 was caused by changeprop filling the disk with error logs because it is trying to route to the wrong restbase instance.
[18:58]  < stashbot> T388043: No Puppet resources found on instance deployment-changeprop-1 on project deployment-prep - https://phabricator.wikimedia.org/T388043

cc @aaron who has recently been working on change-prop in beta.

would you be able to tell me how the changeprop container running on deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud get's its configuration?

I barely know but @aaron recently asked this in Slack too.

It is in...deployment-charts prod repo?

https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep

https://wikitech.wikimedia.org/wiki/Changeprop#Deploying_the_configuration

I tried to follow the instructions from https://wikitech.wikimedia.org/wiki/Changeprop#Deploying_the_configuration and things went badly:

$ venv/bin/python3 make_beta_config.py . changeprop | ssh deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud sudo sh -xc \''cat > $(docker volume inspect changeprop -f {{.Mountpoint}})/config.yaml && systemctl restart changeprop'\'
+ docker volume inspect changeprop -f {{.Mountpoint}}
+ cat
+ systemctl restart changeprop
root@deployment-changeprop-1:~# systemctl status changeprop --no-pager -l
● changeprop.service - Systemd runner for changeprop
     Loaded: loaded (/lib/systemd/system/changeprop.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-03-05 22:22:57 UTC; 3s ago
    Process: 1005117 ExecStartPre=/usr/bin/docker stop changeprop.service (code=exited, status=1/FAILURE)
    Process: 1005123 ExecStartPre=/usr/bin/docker rm changeprop.service (code=exited, status=1/FAILURE)
   Main PID: 1005130 (docker)
      Tasks: 7 (limit: 4685)
     Memory: 12.1M
        CPU: 163ms
     CGroup: /system.slice/changeprop.service
             └─1005130 /usr/bin/docker run --rm=true --env-file /etc/changeprop/env -p 7272:7272 -v changeprop:/etc/changeprop --name changeprop.service docker-registry.wikimedia.org/wikimedia/mediawiki-services-change-propagation:2025-03-03-200218-production /srv/service/server.js -c /etc/changeprop/config.yaml

Mar 05 22:22:57 deployment-changeprop-1 systemd[1]: Starting changeprop.service - Systemd runner for changeprop...
Mar 05 22:22:57 deployment-changeprop-1 docker-changeprop[1005117]: Error response from daemon: No such container: changeprop.service
Mar 05 22:22:57 deployment-changeprop-1 docker-changeprop[1005123]: Error: No such container: changeprop.service
Mar 05 22:22:57 deployment-changeprop-1 systemd[1]: Started changeprop.service - Systemd runner for changeprop.
Mar 05 22:22:58 deployment-changeprop-1 docker-changeprop[1005130]: {"name":"change-propagation","hostname":"0bf738485250","pid":1,"level":"WARN","levelPath":"warn/metrics","msg":"makeChild() is deprecated.","time":"2025-03-05T22:22:58.998Z","v":0}
Mar 05 22:22:59 deployment-changeprop-1 docker-changeprop[1005130]: {"name":"changeprop","hostname":"0bf738485250","pid":1,"level":"FATAL","err":{"message":"libnode.so.108: cannot open shared object file: No such file or directory","name":"Error","stack":"Error: libnode.so.108: cannot open shared object file: No such file or directory\n    at Module._extensions..node (node:internal/modules/cjs/loader:1319:18)\n    at Module.load (node:internal/modules/cjs/loader:1091:32)\n    at Module._load (node:internal/modules/cjs/loader:938:12)\n    at Module.require (node:internal/modules/cjs/loader:1115:19)\n    at require (node:internal/modules/helpers:130:18)\n    at bindings (/srv/service/node_modules/bindings/bindings.js:112:48)\n    at Object.<anonymous> (/srv/service/node_modules/node-rdkafka/librdkafka.js:10:32)\n    at Module._compile (node:internal/modules/cjs/loader:1233:14)\n    at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)\n    at Module.load (node:internal/modules/cjs/loader:1091:32)\n    at Module._load (node:internal/modules/cjs/loader:938:12)\n    at Module.require (node:internal/modules/cjs/loader:1115:19)\n    at require (node:internal/modules/helpers:130:18)\n    at Object.<anonymous> (/srv/service/node_modules/node-rdkafka/lib/client.js:14:13)\n    at Module._compile (node:internal/modules/cjs/loader:1233:14)\n    at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)","code":"ERR_DLOPEN_FAILED"},"stack":"Error: libnode.so.108: cannot open shared object file: No such file or directory\n    at Module._extensions..node (node:internal/modules/cjs/loader:1319:18)\n    at Module.load (node:internal/modules/cjs/loader:1091:32)\n    at Module._load (node:internal/modules/cjs/loader:938:12)\n    at Module.require (node:internal/modules/cjs/loader:1115:19)\n    at require (node:internal/modules/helpers:130:18)\n    at bindings (/srv/service/node_modules/bindings/bindings.js:112:48)\n    at Object.<anonymous> (/srv/service/node_modules/node-rdkafka/librdkafka.js:10:32)\n    at Module._compile (node:internal/modules/cjs/loader:1233:14)\n    at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)\n    at Module.load (node:internal/modules/cjs/loader:1091:32)\n    at Module._load (node:internal/modules/cjs/loader:938:12)\n    at Module.require (node:internal/modules/cjs/loader:1115:19)\n    at require (node:internal/modules/helpers:130:18)\n    at Object.<anonymous> (/srv/service/node_modules/node-rdkafka/lib/client.js:14:13)\n    at Module._compile (node:internal/modules/cjs/loader:1233:14)\n    at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)","levelPath":"fatal/startup","msg":"Message not supplied","time":"2025-03-05T22:22:59.064Z","v":0}

Something changed as I was faffing about trying to figure out what the "libnode.so.108: cannot open shared object file: No such file or directory" crash was all about. The container is now running and appears to have the updated config that I pushed in T388043#10607728.

Now we have new errors:

Mar 05 22:49:42 deployment-changeprop-1 docker-changeprop[1029900]: {"name":"changeprop","hostname":"08a1130c311a","pid":1,"level":"ERROR","message":"Exec error in changeprop","status":404,"event_str":"{\"$schema\":\"/resource_change/1.0.0\",\"meta\":{\"uri\":\"https://wikifunctions.beta.wmflabs.org/wiki/Z8882\",\"request_id\":\"5ca2bbc8ac375f483f31af57\",\"id\":\"fbd1ff26-a6a8-4ab4-a472-5562e7add327\",\"dt\":\"2025-03-05T22:31:07Z\",\"domain\":\"wikifunctions.beta.wmflabs.org\",\"stream\":\"resource_change\"},\"tags\":[\"purge\"]}","stream":"resource_change","error_body_str":"{\"type\":\"https://mediawiki.org/wiki/HyperSwitch/errors/not_found#route\",\"title\":\"Not found.\",\"method\":\"get\",\"uri\":\"/wikifunctions.beta.wmflabs.org/v1/page/html/Z8882\",\"internalURI\":\"http://deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud:7231/wikifunctions.beta.wmflabs.org/v1/page/html/Z8882\",\"internalMethod\":\"get\"}","rule_name":"mw_purge","executor":"RuleExecutor","levelPath":"error/exec_error","msg":"Exec error in changeprop","time":"2025-03-05T22:49:42.337Z","v":0}

I am going to call this "done" weakly because the host is up, the service is up, and the log file that was 12G is now 3.1M. Things are still obviously not completely happy, but hopefully @aaron or someone else who has reason to be mucking about with changeprop will work on that.