
refreshUserImpactJob requires a high number of file descriptors
Open, Medium, Public

Description

refreshUserImpactJob is a job that refreshes user impact data so it can be displayed on Special:Homepage and Special:Impact; see https://www.mediawiki.org/wiki/Growth/Positive_reinforcement for project details. The job is scheduled from the maintenance server via the growthexperiments-userImpactUpdateRecentlyRegistered and growthexperiments-userImpactUpdateRecentlyEdited jobs during the UTC morning.
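For illustration, a minimal sketch of how a job of this kind might be queued from a maintenance context is shown below; the 'userId' parameter is an assumption for the example and not necessarily the real RefreshUserImpactJob signature.

```php
<?php
// Hedged sketch only: queueing a refreshUserImpactJob-style job from a
// maintenance context. The 'userId' parameter name is assumed, not taken
// from the actual GrowthExperiments job definition.
use MediaWiki\MediaWikiServices;

$jobQueueGroup = MediaWikiServices::getInstance()->getJobQueueGroup();
$jobQueueGroup->push(
	new JobSpecification(
		'refreshUserImpactJob',
		[ 'userId' => 12345 ] // assumed parameter; the real job may batch users differently
	)
);
```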

Recently, the job started to log mysterious errors (T344427, T341658). The presence of these errors was deemed a blocker for releasing the new Impact module to further wikis. In T344428: refreshUserImpactJob logs mysterious fatal errors, it was determined that the job errors out because it exhausts the limit of open files. Since https://gerrit.wikimedia.org/r/c/967870, the errors have stopped, which confirms that the issue is indeed caused by exhausting file descriptors.

@Joe stated via IRC that we can keep the limit increased for as long as necessary, and that it might not be possible to guarantee the same for k8s jobrunners. Since the limit increase made the errors disappear, we can consider the new Impact module rollout unblocked for the time being. However, we should still aim to identify the exact cause of the FD exhaustion and fix it, especially since the progressive shift to k8s might make the error return at any time. Within this task, the root cause of the problem should be identified and fixed.
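As a debugging aid for the FD exhaustion, the descriptor usage and limits of the job process can be inspected from within PHP. A minimal sketch, assuming the job runs on Linux with the posix extension loaded; nothing below exists in GrowthExperiments today:

```php
<?php
// Hedged debugging sketch (Linux-only, requires the posix extension):
// log how close the current process is to its open-file limit.

function fdUsageSummary(): string {
	// Each entry in /proc/self/fd is one open descriptor; subtract '.' and '..'.
	$openFds = count( scandir( '/proc/self/fd' ) ) - 2;
	$limits = posix_getrlimit();
	return sprintf(
		'open fds: %d, soft limit: %s, hard limit: %s',
		$openFds,
		$limits['soft openfiles'],
		$limits['hard openfiles']
	);
}

wfDebugLog( 'GrowthExperiments', fdUsageSummary() );
```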

Event Timeline

Restricted Application added a subscriber: Aklapper.
KStoller-WMF subscribed.
jijiki subscribed.

@Urbanecm_WMF feel free to switch the tag to serviceops when you require input/actions/investigation from us

I'd greatly appreciate help with getting to a stage where we can reproduce the issue without using the job queue. Currently, the issue has disappeared, but only because @Joe increased the limit on the jobrunner. So far, I have been unable to reproduce the issue without scheduling the jobs, and since the FD limit increase on the jobrunner, I can't reproduce it in any way (I only know that the issue exists because it consistently logged error messages, and that it is FD-exhaustion related, but nothing more).

Once we are able to reliably reproduce the issue in some easy-to-hack way (mwmaint, dev machine, or similar), I think it would be much easier for Growth to continue debugging here. But given our limited knowledge of the infra, we're lost as to what could be done to make the issue easier to reproduce. Is this something serviceops can help with? One possible approach for narrowing it down outside the job queue is sketched below.
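A hedged sketch of that approach, reusing the /proc/self/fd counting from the description above: wrap a suspect code path (for example, the body of a single user impact refresh) in a helper that reports the descriptor delta, and bisect from there. All names below are hypothetical.

```php
<?php
// Hypothetical helper for bisecting FD leaks outside the job queue
// (e.g. from eval.php or a throwaway maintenance script). Linux-specific.

function countOpenFds(): int {
	return count( scandir( '/proc/self/fd' ) ) - 2;
}

function measureFdDelta( callable $suspectCodePath ): int {
	$before = countOpenFds();
	$suspectCodePath();
	// A positive delta means the code path left descriptors open.
	return countOpenFds() - $before;
}

// Example usage:
// $delta = measureFdDelta( static function () {
//     // e.g. refresh impact data for one user here
// } );
// print "leaked descriptors: $delta\n";
```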

FTR, I think this issue now affects serviceops more than the Growth team: from our perspective, https://gerrit.wikimedia.org/r/c/967870 effectively "fixed" our maintenance job by raising the limit. Not sure how big of a problem that is -- maybe we don't actually need to spend any time debugging this further anytime soon.

Urbanecm_WMF lowered the priority of this task from High to Medium. · Jan 4 2024, 12:21 PM