When investigating some stuck global renames (T400974, T402364), I noticed that some MediaWiki jobs do not succeed (the database changes they should perform never happen), yet there are also no log events indicating that they failed (no exception or anything else on the MediaWiki side).
I eventually tracked down errors like this, which are logged by mediawiki/services/change-propagation here:
- https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2025.07.31?id=dF2AYpgBsPjmLNTo7R87
- https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2025.08.19?id=ZSPBwpgBAlpnrixJODUD
In both cases the error is:
{"type":"internal_http_error","detail":"socket hang up","internalStack":"Error: socket hang up\n at connResetException (node:internal/errors:720:14)\n at TLSSocket.socketOnEnd (node:_http_client:519:23)\n at TLSSocket.emit (node:events:526:35)\n at endReadableNT (node:internal/streams/readable:1376:12)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)","internalURI":"https://mw-jobrunner.discovery.wmnet:4448/rpc/RunSingleJob.php","internalQuery":"{}","internalErr":"socket hang up","internalMethod":"post"}
I don't know what this means in the context of our job infrastructure.
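The generic Node.js meaning is at least hinted at by the stack trace: connResetException("socket hang up") raised from socketOnEnd is what the Node HTTP client reports when the server closes the connection before sending back a complete response. A minimal sketch that reproduces it (plain HTTP instead of the TLS used in production, and a hypothetical throwaway server rather than the real jobrunner):

```js
// Sketch only: reproduce Node's "socket hang up" by having the server
// close the connection without writing any HTTP response.
const http = require('node:http');

const server = http.createServer((req, res) => {
  // End the connection before any response bytes are sent (clean FIN).
  res.socket.end();
});

server.listen(0, () => {
  const { port } = server.address();
  const req = http.request(
    { port, method: 'POST', path: '/rpc/RunSingleJob.php' },
    () => {} // never reached: no response ever arrives
  );
  req.on('error', (err) => {
    // Prints "socket hang up", via the same socketOnEnd /
    // connResetException path as in the stack trace above.
    console.error(err.message);
    server.close();
  });
  req.end('{}');
});
```

If that reading is correct, the question becomes what makes the jobrunner side (or anything in between) drop the connection without responding.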
Searching in logstash for this message, I found that it is a common error with a very clear weekly pattern (it happens almost entirely on weekdays and hardly at all on weekends):
https://logstash.wikimedia.org/goto/4b21f484a524ed37771e352f40f246e6
It affects many kinds of jobs, and each individual job type shows the same weekly pattern. The affected job gets retried afterwards and then presumably succeeds, so in most cases this causes nothing more visible than a delay. Global rename jobs, however, are not idempotent and end up in a state that has to be unstuck manually, which is what motivated me to investigate.
I would like to learn why this happens, and whether it can be fixed or whether we should rework global renames to be more resilient to this situation.
