
changeprop-jobqueue@deployment-prep fails with: getaddrinfo ENOTFOUND cloudmetrics1002.eqiad.wmnet
Closed, Duplicate (Public)

Description

MW jobs do not seem to be processed on the beta cluster (T325786, T325464, and CirrusSearch index updates), and this could be related to a changeprop issue:

Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND cloudmetrics1002.eqiad.wmnet
    at handleCallback (/srv/service/node_modules/hot-shots/lib/statsd.js:372:32)
    at process._tickCallback (internal/process/next_tick.js:63:19)

Subsequent message says: worker died, restarting.
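
For anyone reproducing this, `ENOTFOUND` from `getaddrinfo` means the DNS lookup for the metrics host failed. A minimal sketch of checking which hostnames fail to resolve from a given machine follows; the function name `unresolvable_hosts` and the injectable `resolver` parameter are illustrative assumptions, not part of changeprop or hot-shots.

```python
import socket


def unresolvable_hosts(hosts, resolver=socket.getaddrinfo):
    """Return the hosts whose name lookup fails.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    the check can be exercised without real DNS (an assumption of this sketch).
    """
    failed = []
    for host in hosts:
        try:
            # Port is irrelevant for a pure name lookup, so pass None.
            resolver(host, None)
        except socket.gaierror:
            # Python's counterpart to Node's "getaddrinfo ENOTFOUND".
            failed.append(host)
    return failed
```

Running `unresolvable_hosts(["cloudmetrics1002.eqiad.wmnet"])` on the affected instance would be expected to return that hostname if the lookup still fails there.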

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2023-03-13T17:44:39Z] <James_F> Manually changed cloudmetrics1002 to cloudmetrics1003 on deployment-docker-cpjobqueue01 whilst debugging T326192

This is specified in /etc/cpjobqueue/config.yaml on disc on deployment-docker-cpjobqueue01; FWICT that's not puppetised?
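
To find where a stale hostname is referenced in a hand-edited config like this, a rough sketch might look like the following. The YAML fragment in `SAMPLE` is hypothetical (the real layout of `/etc/cpjobqueue/config.yaml` is not shown in this task), and `lines_mentioning` is an illustrative helper name.

```python
def lines_mentioning(config_text, host):
    """Return (line_number, stripped_line) pairs for lines that contain `host`."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if host in line
    ]


# Hypothetical fragment for illustration only; the keys and port are
# assumptions, not copied from the real config file.
SAMPLE = """\
metrics:
  type: statsd
  host: cloudmetrics1002.eqiad.wmnet
  port: 8125
"""

for number, line in lines_mentioning(SAMPLE, "cloudmetrics1002"):
    print(f"line {number}: {line}")
```

On the real instance, the file contents could be read with `open("/etc/cpjobqueue/config.yaml").read()` and passed in place of `SAMPLE`.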

I'd also note that Beta Cluster is running docker-registry.wikimedia.org/wikimedia/mediawiki-services-change-propagation:v0.9.5 from April 2020 whereas the latest tag is v0.10.5 from Feb 2021. Perhaps this needs switching to just use :latest?

Mentioned in SAL (#wikimedia-releng) [2023-03-13T17:57:00Z] <James_F> Moved deployment-docker-cpjobqueue01 from v0.9.5 to v0.10.5 of change-prop whilst debugging T326192

Ah heck, I missed this task and created T332211: deployment-docker-changeprop01: `worker died, restarting` as a dupe — issue seems to be resolved though!

Yes, thank you!