Page MenuHomePhabricator

growthexperiments-deleteoldsurveys OOMKilled during June 1st run
Open, HighPublic3 Estimated Story Points

Description

The June 1st run of this job was OOMKilled:

State:          Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Sun, 01 Jun 2025 03:15:01 +0000
  Finished:     Sun, 01 Jun 2025 04:59:20 +0000

Short-term, we should probably bump our limits to prevent this and re-run the job, but also it should be investigated as to why this job is using so much memory.

Details

Other Assignee
Sgs
Related Changes in Gerrit:

Event Timeline

Growth looks fairly linear over time:

image.png (1×2 px, 138 KB)

I see that batch size is set at a thousand elements, could we experiment with dropping this? But that said, I think there is something else malign at work here as a thousand elements shouldn't cause this incremental increase in memory usage. The log output shows that (for enwiki, unsurprisingly) multiple batches of users are processed before the OOMKill happens, which imo speaks to a failure to free resources used in each iteration of the loop. This probably needs attention from the Growth team rather than a resource change.

We have temporarily removed limits and this job has run successfully, however even when running with a relatively recent historical run recently we can see substantial increases in memory usage that seem extreme for a relatively simple job. The first spike is enwiki, and the second is eswiki. The cliffs correspond with the job moving on from one wiki to another so it seems clear that something isn't being freed or collected during each iteration. Limits will be reintroduced very soon, please have a look at this job to avoid it being killed in future.

image.png (1×2 px, 178 KB)

hnowlan triaged this task as High priority.Jun 4 2025, 4:40 PM

The script calls UserOptionsManager::setOption in a loop, which inits UserOptionsCacheEntry instances for each user. These never get evicted, so memory growth from this would scale linearly with the number of users processed on a given wiki. Edit: these should be cleared when saveOptionsInternal() runs.

Tangentially related question: I'd like to understand how/when old jobs a cleared from the backlog that I can see with kubectl get jobs -l team=growth.

Currently, I'm seeing there:

NAME                                                            COMPLETIONS   DURATION   AGE
growthexperiments-deleteoldsurveys-202507010758                 1/1           5h8m       20d
growthexperiments-deleteoldsurveys-29188995                     0/1           20d        20d
growthexperiments-deleteoldsurveys-29209155                     1/1           4h40m      6d9h
...

The first two jobs seem related to this, neither has any logs available anymore.

How long do we typically keep these entries around? Is there anything we as a team are expected to do here?

Also, I keep seeing this failed job sticking around in the alerts panel:

image.png (808×518 px, 99 KB)

(source: https://alerts.wikimedia.org/?q=team%3Dgrowth)

That is probably somehow related? I looked at https://wikitech.wikimedia.org/wiki/Alertmanager, but did not see anything conclusive.
(Also, what does the "Expires in x minutes" mean in practice? It does not seem to actually do so...)

Tangentially related question: I'd like to understand how/when old jobs a cleared from the backlog that I can see with kubectl get jobs -l team=growth.

Currently, I'm seeing there:

NAME                                                            COMPLETIONS   DURATION   AGE
growthexperiments-deleteoldsurveys-202507010758                 1/1           5h8m       20d
growthexperiments-deleteoldsurveys-29188995                     0/1           20d        20d
growthexperiments-deleteoldsurveys-29209155                     1/1           4h40m      6d9h
...

The first two jobs seem related to this, neither has any logs available anymore.

How long do we typically keep these entries around? Is there anything we as a team are expected to do here?

As your job is set up right now, jobs are kept in the kubernetes API for 21 days maximum, with a limit of 1 failed job and 3 successful. This can be tweaked in puppet with the ttlsecondsafterfinished, failedjobshistorylimit, and successfuljobshistorylimit settings.
This is only the retention period of the jobs in the kubernetes API, not of the logs which can be purged earlier from being accessed through kubectl logs, in which case they should still be available in logstash.

As a team, you should Manually delete failed jobs once you do not need them to be in the kubernetes API anymore in order to reset the alerting.

Also, I keep seeing this failed job sticking around in the alerts panel:

image.png (808×518 px, 99 KB)

(source: https://alerts.wikimedia.org/?q=team%3Dgrowth)

That is probably somehow related? I looked at https://wikitech.wikimedia.org/wiki/Alertmanager, but did not see anything conclusive.
(Also, what does the "Expires in x minutes" mean in practice? It does not seem to actually do so...)

The expiry you see there is the silence expiry, in other words when the alert starts to fire again. It is related, the alert will disappear once you delete the failed job or ttlsecondsafterfinished is reached.

Thank you for that explanation! I guess I'll write myself a TODO item to come up with some sort of "Runbook" for our team for dealing with failed jobs that includes deleting the failed instance of the job from the kubernetes API.

I'll leave this task open because we should still look into that memory consumption issue.

The job failed again today, but the log that I can get from the k8s api is not telling me much about why:

enwiki Processed users up to ID 49756401
enwiki Processed users up to ID 49758094
enwiki Processed users up to ID 49759888
/usr/local/bin/foreachwikiindblist: line 112:  9014 Killed                  ${RUNNER} ${CMD} ${wiki} "${@}" 2>&1
      9015 Done                    | ts "${wiki}"

Where did the output in the task description come from? And from where comes the screenshots of memory usage for the mediawiki-main-app?

This is the output from

kubectl describe pod $(kubectl get pods -l team=growth,cronjob=growthexperiments-deleteoldsurveys | grep OOMKilled | awk '{print $1}')`
[...]
Containers:
  mediawiki-main-app:
    Container ID:  containerd://67d8cfb0c9c49ca61bcedb604bb69f0494ae69d3b00b8ca5151c3b440bbf0e4a
    Image:         docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli:2025-07-31-215657-publish-81
    Image ID:      docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli@sha256:f9ce677617da93c5bc25ffd57563c20751279115a6cf0a58994085a5b
8da8098             
    Port:          <none>
    Host Port:     <none>                                                   
    Command:                                                                
      /bin/bash                                                                                                                                          
    Args:                                                                   
      -c                                                                    
      /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/deleteOldSurveys.php --
cutoff 60                                                                   
    State:          Terminated                                              
      Reason:       OOMKilled                                                                                                                            
      Exit Code:    137
      Started:      Fri, 01 Aug 2025 03:15:03 +0000
      Finished:     Fri, 01 Aug 2025 06:00:24 +0000

The graphs are from the Kubernetes Pod Details dashboard (here setup to show the run of the latest failed run using the pod name)

The job failed again on the August 15th run, oomkilled again.

I'll remove limits so it runs next time, but this should definitely be investigated by Growth-Team

Change #1181722 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] Revert^2 "mw-cron: Disable memory limit"

https://gerrit.wikimedia.org/r/1181722

Change #1181722 merged by jenkins-bot:

[operations/deployment-charts@master] Revert^2 "mw-cron: Disable memory limit"

https://gerrit.wikimedia.org/r/1181722

The job failed again on the August 15th run, oomkilled again.

I'll remove limits so it runs next time, but this should definitely be investigated by Growth-Team

Thank you! I agree, we should somehow find time for it! I'll see if I can add it to our next sprint planning

KStoller-WMF set the point value for this task to 3.Sep 8 2025, 4:19 PM
KStoller-WMF renamed this task from growthexperiments-deleteoldsurveys OOMKilled during June 1st run to growthexperiments-deleteoldsurveys OOMKilled during June 1st run.Oct 7 2025, 4:51 PM
KStoller-WMF updated Other Assignee, added: Sgs.
KStoller-WMF added a subscriber: Sgs.

Based on a conversation with @Ladsgroup, I tried running this in production manually as an one off (given mw-cron has memory limits disabled temporarily):

[urbanecm@deploy2002 ~]$ mwscript-k8s --dblist=growthexperiments -f extensions/GrowthExperiments/maintenance/deleteOldSurveys.php -- --cutoff 60
ℹ️ Your job will run on PHP 8.3 (T405955). If you encounter a compatibility issue, you can use --php_version 8.1 to explicitly select 8.1.
⏳ Starting extensions/GrowthExperiments/maintenance/deleteOldSurveys.php on Kubernetes as job mw-script.codfw.s5adgwla ...
⏳ Waiting for the container to start...
🚀 Job is running.

Okay, this job finished running successfully:

[urbanecm@deploy2002 ~]$ kubectl get pods -l job-name=mw-script.codfw.s5adgwla
NAME                             READY   STATUS      RESTARTS   AGE
mw-script.codfw.s5adgwla-j25wr   0/3     Completed   0          22h
[urbanecm@deploy2002 ~]$

According to https://grafana.wikimedia.org/goto/j5rpPozvR?orgId=1, it used ~650 MiB:

image.png (1×3 px, 272 KB)

That being said, it is very clear something is still cleaning memory, so presumably, the script didn't have enough users to go through...

My hypothesis on this task is that UserOptionsManager's cache goes havoc and consumes all available memory if we let it. To test this theory, I composed a small maintenance script that reads an user option from ~50k of users:

<?php

namespace GrowthExperiments\Maintenance;

use MediaWiki\Maintenance\Maintenance;
use MediaWiki\User\UserIdentityLookup;
use MediaWiki\User\UserOptionsManager;

$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";

class TestUOM extends Maintenance {
	private UserIdentityLookup $userIdentityLookup;
	private UserOptionsManager $userOptionsManager;

	private function initServices(): void {
		$this->userIdentityLookup = $this->getServiceContainer()->getUserIdentityLookup();
		$this->userOptionsManager = $this->getServiceContainer()->getUserOptionsManager();
	}

	public function execute() {
		$this->initServices();

		$userIdentities = iterator_to_array( $this->userIdentityLookup->newSelectQueryBuilder()
			->registered()
			->where( [ 'actor_user < 50000' ] )
			->fetchUserIdentities() );

		$this->output( 'Processing ' . count( $userIdentities ) . ' user identities...' . PHP_EOL );
		foreach ( $userIdentities as $userIdentity ) {
			$value = $this->userOptionsManager->getOption( $userIdentity, 'userjs-testing-T395893' );
			$this->output( sprintf(
				'%s-%s',
				$userIdentity->getId(),
				$value ?? '(none)'
			) . PHP_EOL );
		}
	}
}

$maintClass = TestUOM::class;
require_once RUN_MAINTENANCE_IF_MAIN;

Trying to run it against testwiki:

[urbanecm@deploy2002 ~]$ mwscript-k8s -f --file=TestUOM.php -- /data/TestUOM.php --wiki=testwiki
ℹ️ Your job will run on PHP 8.3 (T405955). If you encounter a compatibility issue, you can use --php_version 8.1 to explicitly select 8.1.
⏳ Starting /data/TestUOM.php on Kubernetes as job mw-script.codfw.lduu6ib3 ...
🚀 Job is running.
📜 Streaming logs:
Processing 49176 user identities...
4382-(none)
17638-(none)
40964-(none)
40976-(none)
10149-(none)
40792-(none)
827-(none)
40977-(none)
40978-(none)
[...]
[urbanecm@deploy2002 ~]$ kubectl get pods -l job-name=mw-script.codfw.lduu6ib3
NAME                             READY   STATUS      RESTARTS   AGE
mw-script.codfw.lduu6ib3-hkc4g   0/3     OOMKilled   0          96s
[urbanecm@deploy2002 ~]$

When accessing user options of arbitrary users, the script runs out of memory, proving the theory that UserOptionsManager (or its callees) are behind this memory leak.

Urbanecm_WMF moved this task from Inbox to Blocked on the Growth-Team board.