Root cause
As @Zabe found, this patch introduced a large WikiLambda database migration to resolve T306824: WikiLambda: canonicalize and normalize to work with Benjamin Arrays.
In attempting to process it, deployment-deploy03 ran out of memory. The system hang was caused by the out-of-memory (OOM) condition, combined with kswapd0 trying to aggressively manage swap:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php
Symptoms
Behaviour
- System hangs
- Unable to SSH
- Can ping
- Jenkins agent disconnects
Log entries
Fatal error: Out of memory (allocated 7487094784) (tried to allocate 20480 bytes) in /srv/mediawiki-staging/php-master/extensions/WikiLambda/includes/ZObjectFactory.php on line 158
samtar@deployment-deploy03:~$ last -5 reboot shutdown root
root     pts/2        172.16.5.8       Fri May 27 21:52   still logged in
root     ttyS0                         Fri May 27 21:43   still logged in
reboot   system boot  4.19.0-20-cloud- Fri May 27 21:43   still running
root     pts/3        172.16.5.8       Fri May 27 21:03 - 21:12  (00:08)
root     ttyS0                         Fri May 27 20:49 - crash  (00:53)

wtmp begins Sun Mar 28 15:07:49 2021
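To put the PHP fatal error above in perspective, the following is a minimal Python sketch that extracts the byte counts from that log line and converts them to GiB. The function names (parse_php_oom, to_gib) and the regular expression are illustrative assumptions, not part of any existing tooling on deployment-deploy03.

```python
import re

# Log line copied verbatim from the "Log entries" section of this report.
LOG_LINE = (
    "Fatal error: Out of memory (allocated 7487094784) "
    "(tried to allocate 20480 bytes) in "
    "/srv/mediawiki-staging/php-master/extensions/WikiLambda/includes/"
    "ZObjectFactory.php on line 158"
)

# Hypothetical pattern written for this message format only.
OOM_PATTERN = re.compile(
    r"Out of memory \(allocated (?P<allocated>\d+)\) "
    r"\(tried to allocate (?P<requested>\d+) bytes\)"
)


def parse_php_oom(line):
    """Return (allocated_bytes, requested_bytes), or None if no match."""
    match = OOM_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group("allocated")), int(match.group("requested"))


def to_gib(num_bytes):
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


allocated, requested = parse_php_oom(LOG_LINE)
print(f"allocated: {to_gib(allocated):.2f} GiB, failed request: {requested} bytes")
```

The allocated figure works out to just under 7 GiB, which is in line with the 6.8g resident size that top reported for the php process shortly before the hang.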
Resolution
Likely resolved by @Zabe manually running the migration script (T309413#7964439), combined with @TheresNoTime manually running the wmf-beta-update-databases.py script.
Action items
- We should probably rebuild deployment-deploy03 soon, and upgrade it from debian-10.0-buster
- Can we have a secondary deployment server?