
Jobs sometimes disappear without a trace (except "Exec error in cpjobqueue" / "Error: socket hang up" from change-propagation service)
Closed, Declined · Public

Description

When investigating some stuck global renames (T400974, T402364), I noticed that some MediaWiki jobs do not succeed (the database changes they should perform are not done), but there are also no log events that would indicate they failed (no exception or anything on the MediaWiki side).

I eventually tracked down errors like this, which are logged by mediawiki/services/change-propagation here:

In both cases the error is:

{"type":"internal_http_error","detail":"socket hang up","internalStack":"Error: socket hang up\n    at connResetException (node:internal/errors:720:14)\n    at TLSSocket.socketOnEnd (node:_http_client:519:23)\n    at TLSSocket.emit (node:events:526:35)\n    at endReadableNT (node:internal/streams/readable:1376:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)","internalURI":"https://mw-jobrunner.discovery.wmnet:4448/rpc/RunSingleJob.php","internalQuery":"{}","internalErr":"socket hang up","internalMethod":"post"}

I don't know what this means.

Searching in logstash for this message, I found that it is a common error, with a very clear weekly pattern (happens almost entirely on weekdays but not weekends):
https://logstash.wikimedia.org/goto/4b21f484a524ed37771e352f40f246e6

(Screenshot: Logstash graph of the error volume over time, showing the weekly pattern.)

It affects many kinds of jobs, and each job type individually shows the same pattern. The job gets retried afterwards and then presumably succeeds, so in most cases this causes nothing more visible than a delay – but global rename jobs are not idempotent and end up in a state that has to be unstuck manually, which is what motivated me to investigate this.

I would like to learn why this happens, and whether it can be fixed or if we should rework global renames to be more resilient to this situation.

Event Timeline

Any job should really try to be as idempotent as possible, otherwise it will always be risky.

If the hangup is from RunSingleJob.php, but nothing was logged, I wonder if the job went through the first time, but changeprop lost the connection and thus retried, which could cause breakage if the job cannot run twice.
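To illustrate what being retry-safe would mean here, a minimal sketch in Python (all names are hypothetical; this is not the actual MediaWiki Job API): the job first checks whether its effect is already in place, so a re-run after a dropped connection becomes a no-op.

```python
# Minimal sketch of an idempotent job handler (plain Python, hypothetical
# names; not the actual MediaWiki Job API). The idea: check whether the
# work has already been done before doing it, so a retry after a dropped
# connection is a harmless no-op.

completed_renames: set[tuple[str, str]] = set()  # stand-in for durable state

def apply_rename(old_name: str, new_name: str) -> None:
    # Stand-in for the database changes the real job would perform.
    print(f"renaming {old_name} -> {new_name}")

def run_rename_job(old_name: str, new_name: str) -> None:
    key = (old_name, new_name)
    if key in completed_renames:
        return  # a previous attempt already did the work; nothing to do
    apply_rename(old_name, new_name)
    completed_renames.add(key)

# Running the job twice has the same effect as running it once:
run_rename_job("OldUser", "NewUser")
run_rename_job("OldUser", "NewUser")
```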

Looking at scap logs, it almost certainly comes from deployments terminating and recreating mw-jobrunner pods, thus killing the jobs in flight.

> Any job should really try to be as idempotent as possible, otherwise it will always be risky.

In principle, of course it should; in practice, the rename code is spread across core, CentralAuth and random extensions, and it is difficult to make it better.

> Looking at scap logs, it almost certainly comes from deployments terminating and recreating mw-jobrunner pods, thus killing the jobs in flight.

You know, I joked that this looks almost as if someone had unplugged the servers; I didn't realize how close to the truth that was… :) I'm glad that at least there's an explanation.

Would it be difficult to make the deployment process depool the pods or something, and give the in-progress jobs some time to finish, before terminating them?

Global rename jobs are mostly idempotent - if they abort and someone re-runs them, the end result is almost always the same as if they didn't abort in the first place. They do require someone to re-run them though, and the user is locked out in the meantime. So it's not great when they get silently discarded.

@Tgr Does the work done in T402830: Global rename jobs should use a lock rather than storing "in progress" state in the database make the jobs rerun without manual intervention?

> Would it be difficult to make the deployment process depool the pods or something, and give the in-progress jobs some time to finish, before terminating them?

There is no way that I know of to do that with standard Kubernetes deployment strategies while still guaranteeing that jobs created after a deployment will use the newly deployed code. The best course of action is to make the jobs idempotent and/or automatically retryable; as far as I can tell, anything else would require a complete rearchitecture of how the jobrunners work in mw-on-k8s.
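For what it's worth, the "drain before terminating" idea from the question above would look roughly like this (a Python sketch of the concept only, not how mw-jobrunner or changeprop actually work): on SIGTERM, which Kubernetes sends before killing a pod, stop pulling new jobs and let the in-flight one finish within the grace period. As noted, this still would not guarantee that jobs created after the deployment run on the new code.

```python
# Sketch of "stop taking new jobs and drain on SIGTERM" (plain Python,
# illustrative only; not how mw-jobrunner / changeprop behave). Kubernetes
# sends SIGTERM before killing a pod and waits for the configured
# termination grace period before sending SIGKILL.

import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # stop accepting new jobs; finish the current one

signal.signal(signal.SIGTERM, handle_sigterm)

def next_job():
    return {"type": "example", "seconds": 1}  # stand-in for queue intake

def run(job):
    time.sleep(job["seconds"])  # stand-in for actually executing the job

while not shutting_down:
    job = next_job()
    run(job)  # if SIGTERM arrives mid-job, the job still finishes, then we exit
```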

Serviceops backlog triaging here: @Tgr, could you confirm whether we can close this task?

> @Tgr Does the work done in T402830: Global rename jobs should use a lock rather than storing "in progress" state in the database make the jobs rerun without manual intervention?

I was somewhat wrong earlier - the job runner tries the rename three times, and I don't think scap killing the pod is special in that respect; the job will be retried because it's not removed from the job queue until it succeeds or the retry count is reached.

T402830 fixed an issue that caused retries to fail when the initial process was killed abruptly. Whether that was the only such issue, I think we'll just have to wait and see.
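For context on why that helps, a rough sketch of the difference (illustrative Python, not the actual CentralAuth implementation from T402830): an expiring lock is effectively released when its holder dies, while a persisted "in progress" flag written by a killed process blocks every later retry until someone clears it by hand.

```python
# Rough sketch of an expiring lock, as opposed to a persisted "in progress"
# flag (illustrative only; not the CentralAuth code changed in T402830).

import time

locks: dict[str, float] = {}   # lock name -> expiry timestamp
LOCK_TTL = 600                 # seconds; assumed value for this sketch

def try_lock(name: str) -> bool:
    now = time.time()
    expiry = locks.get(name)
    if expiry is not None and expiry > now:
        return False                  # another attempt really is still running
    locks[name] = now + LOCK_TTL      # stale locks from killed processes expire
    return True

def run_rename(user: str) -> None:
    if not try_lock(f"rename:{user}"):
        raise RuntimeError("rename already in progress")
    # ... do the work; if the process is killed here, the lock simply
    # expires and a retry can proceed, instead of being blocked forever
    # by a database row that still says "in progress".
    del locks[f"rename:{user}"]
```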

I think, based on the comments above, that we can say that this is declined, and this behavior of the job queue will not change. Given that we fixed the only job type (that we know of) which had a problem with this, I think this is fine.