deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration
Closed, Resolved (Public)

Description

Root cause

As @Zabe found, this patch introduced a large WikiLambda db migration to resolve T306824: WikiLambda: canonicalize and normalize to work with Benjamin Arrays

In attempting to process it, deployment-deploy03 ran out of memory. The system hang was caused by the OOM condition, plus kswapd0 aggressively trying to manage swap:

 PID  USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php
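
For anyone triaging a similar hang, here is a minimal Python sketch that parses a `top` snapshot like the one above and flags the rows that stand out. The thresholds (50% memory, 90% CPU) and the helper names `parse_top` and `flag_rows` are assumptions made for illustration; they are not part of any existing beta cluster tooling.

```python
# Sketch only: the thresholds and function names below are hypothetical.
TOP_SNAPSHOT = """\
 PID  USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php
"""


def parse_top(text):
    """Turn a top snapshot (header line plus rows) into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # Limit the split so a COMMAND containing spaces stays in one field.
        fields = ln.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


def flag_rows(rows, mem_threshold=50.0, cpu_threshold=90.0):
    """Return (command, pid, reasons) for rows above either threshold."""
    flagged = []
    for row in rows:
        reasons = []
        if float(row["%MEM"]) >= mem_threshold:
            reasons.append("high memory")
        if float(row["%CPU"]) >= cpu_threshold:
            reasons.append("high CPU")
        if reasons:
            flagged.append((row["COMMAND"], row["PID"], reasons))
    return flagged


if __name__ == "__main__":
    for command, pid, reasons in flag_rows(parse_top(TOP_SNAPSHOT)):
        print(f"{command} (PID {pid}): {', '.join(reasons)}")
```

Run against the snapshot above, this flags kswapd0 for CPU and the php process for memory, which matches the reading that the migration exhausted RAM while the kernel thrashed on swap.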

Symptoms

Behaviour

Log entries

Fatal error: Out of memory (allocated 7487094784) (tried to allocate 20480 bytes) in /srv/mediawiki-staging/php-master/extensions/WikiLambda/includes/ZObjectFactory.php on line 158
samtar@deployment-deploy03:~$ last -5 reboot shutdown root
root     pts/2        172.16.5.8       Fri May 27 21:52   still logged in
root     ttyS0                         Fri May 27 21:43   still logged in
reboot   system boot  4.19.0-20-cloud- Fri May 27 21:43   still running
root     pts/3        172.16.5.8       Fri May 27 21:03 - 21:12  (00:08)
root     ttyS0                         Fri May 27 20:49 - crash  (00:53)

wtmp begins Sun Mar 28 15:07:49 2021
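
To put the byte counts in the fatal error into more readable units, here is a small Python sketch that parses the message and converts the figures. The regular expression is written against the single log line quoted above and is an assumption about its shape, not a general parser for PHP error output; `parse_oom` is a hypothetical helper name.

```python
import re

# The log line quoted in this task, split only for readability.
LOG_LINE = (
    "Fatal error: Out of memory (allocated 7487094784) "
    "(tried to allocate 20480 bytes) in "
    "/srv/mediawiki-staging/php-master/extensions/WikiLambda/includes/"
    "ZObjectFactory.php on line 158"
)

# Assumed message shape, based on the one example above.
PATTERN = re.compile(
    r"Out of memory \(allocated (?P<allocated>\d+)\) "
    r"\(tried to allocate (?P<requested>\d+) bytes\) "
    r"in (?P<file>\S+) on line (?P<line>\d+)"
)


def parse_oom(line):
    """Extract byte counts and location from a PHP out-of-memory message."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "allocated_bytes": int(match["allocated"]),
        "requested_bytes": int(match["requested"]),
        "file": match["file"],
        "line": int(match["line"]),
    }


if __name__ == "__main__":
    info = parse_oom(LOG_LINE)
    allocated_gib = info["allocated_bytes"] / 2**30
    requested_kib = info["requested_bytes"] / 2**10
    print(f"allocated: {allocated_gib:.2f} GiB")
    print(f"requested: {requested_kib:.0f} KiB")
    print(f"location:  {info['file']}:{info['line']}")
```

The allocated figure works out to roughly 6.97 GiB, in the same range as the 6.8g resident size that `top` showed for the php process, and the allocation that finally failed was only 20 KiB.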

Resolution

Likely resolved by @Zabe manually running the migration script (T309413#7964439), combined with @TheresNoTime manually running the wmf-beta-update-databases.py script.

Action items

  • We should probably rebuild deployment-deploy03 soon, and upgrade it from debian-10.0-buster
  • Can we have a secondary deployment server?

Event Timeline

TheresNoTime assigned this task to dancy.

@dancy rebooted deployment-deploy03 and it is now accessible

TheresNoTime claimed this task.

Issue repeated, looking at it now

TheresNoTime renamed this task from deployment-deploy03 unresponsive to deployment-deploy03 crashed twice. May 27 2022, 10:05 PM
TheresNoTime removed TheresNoTime as the assignee of this task.
TheresNoTime raised the priority of this task from High to Needs Triage.
TheresNoTime updated the task description.
TheresNoTime added a subscriber: Zabe.

While running a step of beta-update-databases-eqiad, we go OOM and unresponsive:

 PID  USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php

FTR, it seems like beta-update-databases-eqiad was running out of memory while trying to perform the migration added in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/798987. I manually ran the migration script (sal), hopefully that fixed it.

TheresNoTime renamed this task from deployment-deploy03 crashed twice to deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration. May 27 2022, 11:35 PM
TheresNoTime updated the task description.

deployment-parsoid12 went out of memory this morning which I have filed as T310069. It might be a duplicate or a different issue ;)

Could you (or someone else) add me to that task?

{{done}}

Going to mark this resolved (as it kinda/mostly/sorta is)