deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration
Closed, Resolved (Public)

Assigned To

Authored By

	TheresNoTime
	May 27 2022, 8:42 PM

Description

Root cause

As @Zabe found, this patch (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/798987) introduced a large WikiLambda db migration to resolve T306824: WikiLambda: canonicalize and normalize to work with Benjamin Arrays.

In attempting to process it, deployment-deploy03 ran out of memory. The system hang was caused by the OOM condition, with kswapd0 aggressively trying to manage swap:

 PID  USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php
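
For anyone triaging similar output, here is a minimal sketch of picking the memory-heavy process out of the rows above. It assumes the default `top` column order shown here and that unsuffixed memory fields are in KiB; the helper names (`to_kib`, `parse_rows`, `heavy_processes`) and the 50% threshold are illustrative, not part of any tooling used in this incident.

```python
import re

# Rows copied from the top output above (header line omitted).
TOP_ROWS = """\
   43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php
"""

# Assumption: unsuffixed values are KiB; m/g/t suffixes scale by 1024 each.
UNIT_KIB = {"": 1, "m": 1024, "g": 1024**2, "t": 1024**3}


def to_kib(field):
    """Convert a top memory field such as '7422512' or '6.8g' to KiB."""
    match = re.fullmatch(r"([0-9.]+)([mgt]?)", field)
    if match is None:
        raise ValueError(f"unrecognised memory field: {field!r}")
    return float(match.group(1)) * UNIT_KIB[match.group(2)]


def parse_rows(text):
    """Parse top process rows into dicts, skipping lines that are too short."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 12:
            continue
        rows.append({
            "pid": int(parts[0]),
            "user": parts[1],
            "res_kib": to_kib(parts[5]),
            "state": parts[7],
            "cpu": float(parts[8]),
            "mem": float(parts[9]),
            "command": parts[11],
        })
    return rows


def heavy_processes(rows, mem_threshold=50.0):
    """Return rows whose %MEM is at or above the (hypothetical) threshold."""
    return [row for row in rows if row["mem"] >= mem_threshold]


for proc in heavy_processes(parse_rows(TOP_ROWS)):
    print(proc["pid"], proc["user"], proc["command"], f'{proc["mem"]}% MEM')
```

On the rows above this singles out the `php` process (PID 32551, state `D`); kswapd0 is busy on CPU but holds no resident memory itself.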

Symptoms

Behaviour

System hangs
Unable to SSH
Can ping
Jenkins agent disconnects

Log entries

Fatal error: Out of memory (allocated 7487094784) (tried to allocate 20480 bytes) in /srv/mediawiki-staging/php-master/extensions/WikiLambda/includes/ZObjectFactory.php on line 158
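
As a quick sanity check on the numbers in that log line, the sketch below extracts the two byte counts and converts the allocated figure to GiB. The regular expression is written against the exact message above and is an assumption about its shape, not a general parser for PHP error output; `parse_oom` and `to_gib` are illustrative names.

```python
import re

# The log entry quoted above, verbatim.
LOG_LINE = (
    "Fatal error: Out of memory (allocated 7487094784) "
    "(tried to allocate 20480 bytes) in /srv/mediawiki-staging/php-master/"
    "extensions/WikiLambda/includes/ZObjectFactory.php on line 158"
)

# Assumed message shape, matched against the single line seen in this task.
PATTERN = re.compile(
    r"Out of memory \(allocated (\d+)\) \(tried to allocate (\d+) bytes\)"
)


def parse_oom(line):
    """Return (allocated_bytes, requested_bytes), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


allocated, requested = parse_oom(LOG_LINE)
print(f"{to_gib(allocated):.2f} GiB allocated, {requested // 1024} KiB requested")
# prints: 6.97 GiB allocated, 20 KiB requested
```

That is roughly in line with the 6.8g resident size that `top` reported for the php process.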

samtar@deployment-deploy03:~$ last -5 reboot shutdown root
root     pts/2        172.16.5.8       Fri May 27 21:52   still logged in
root     ttyS0                         Fri May 27 21:43   still logged in
reboot   system boot  4.19.0-20-cloud- Fri May 27 21:43   still running
root     pts/3        172.16.5.8       Fri May 27 21:03 - 21:12  (00:08)
root     ttyS0                         Fri May 27 20:49 - crash  (00:53)

wtmp begins Sun Mar 28 15:07:49 2021

Resolution

Likely resolved by @Zabe manually running the migration script (T309413#7964439), combined with @TheresNoTime manually running the wmf-beta-update-databases.py script.

Action items

We should probably rebuild deployment-deploy03 soon, and upgrade it from debian-10.0-buster
Can we have a secondary deployment server?

Related Objects

Mentioned In:
T309437: Create deployment-deploy04 as future secondary/upgrade
T309415: Grant `Samtar` admin access to the deployment-prep project

Mentioned Here:
T306824: WikiLambda: canonicalize and normalize to work with Benjamin Arrays

Event Timeline

TheresNoTime created this task. May 27 2022, 8:42 PM

Restricted Application added a subscriber: Aklapper. May 27 2022, 8:42 PM

TheresNoTime triaged this task as High priority. May 27 2022, 8:43 PM

TheresNoTime added projects: SRE, Release-Engineering-Team.

TheresNoTime mentioned this in T309415: Grant `Samtar` admin access to the deployment-prep project. May 27 2022, 8:50 PM

I rebooted using the horizon UI.

@dancy rebooted deployment-deploy03 and it is now accessible

Issue repeated, looking at it now

TheresNoTime renamed this task from "deployment-deploy03 unresponsive" to "deployment-deploy03 crashed twice". May 27 2022, 10:05 PM

TheresNoTime removed TheresNoTime as the assignee of this task.

TheresNoTime raised the priority of this task from High to Needs Triage.

TheresNoTime updated the task description.

TheresNoTime added a subscriber: Zabe.

While running a step of beta-update-databases-eqiad, we go OOM and unresponsive:

 PID  USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php

TheresNoTime claimed this task. May 27 2022, 10:58 PM

FTR, it seems like beta-update-databases-eqiad was running out of memory while trying to perform the migration added in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/798987. I manually ran the migration script (sal), hopefully that fixed it.

TheresNoTime renamed this task from "deployment-deploy03 crashed twice" to "deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration". May 27 2022, 11:35 PM

TheresNoTime updated the task description.

TheresNoTime updated the task description. May 27 2022, 11:38 PM

TheresNoTime updated the task description.

TheresNoTime mentioned this in T309437: Create deployment-deploy04 as future secondary/upgrade. May 28 2022, 2:10 PM

hashar moved this task from INBOX to Radar on the Release-Engineering-Team board. Jun 7 2022, 1:16 PM

hashar edited projects, added Release-Engineering-Team (Radar); removed Release-Engineering-Team.

deployment-parsoid12 went out of memory this morning which I have filed as T310069. It might be a duplicate or a different issue ;)

In T309413#7985917, @hashar wrote:

> deployment-parsoid12 went out of memory this morning which I have filed as T310069. It might be a duplicate or a different issue ;)

Could you (or someone else) add me to that task?

In T309413#7986067, @Zabe wrote:

> Could you (or someone else) add me to that task?

Going to mark this resolved (as it kinda/mostly/sorta is)
