
beta-scap-sync-world failing
Closed, Duplicate · Public

Description

https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/68528/console

['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud'] (ran as mwdeploy@deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud) returned [255]: Connection to 172.16.4.125 port 22 timed out

Event Timeline

and it's back!

['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud'] (ran as mwdeploy@deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud) returned [255]: mwdeploy@deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud: Permission denied (publickey).

My "monitoring" (this) shows:

image.png (257×1 px, 24 KB)

for instance-deployment-parsoid12

Repeatedly failing for the last few hours with

sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild (ran as mwdeploy@deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud) returned [255]: Timeout, server deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud not responding.

Can ping, cannot SSH — soft rebooting

Managed to SSH in after a reboot, looks like it's OOMing? Job still failing with the above "not responding" error, and I can't open another SSH session.

top - 11:13:30 up 8 min,  2 users,  load average: 12.15, 17.47, 9.56
Tasks: 133 total,   1 running, 132 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24.4 us, 31.3 sy,  0.0 ni,  8.1 id, 36.1 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   3949.2 total,    113.6 free,   3677.0 used,    158.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     29.9 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  766 www-data   1 -19 1889064 515388  61868 D   1.0  12.7   0:21.18 php-fpm7.4
  770 www-data   1 -19 1868452 503396  70856 D   1.3  12.4   0:21.55 php-fpm7.4
 1363 www-data   1 -19 1878816 500908  57876 S  12.9  12.4   0:14.48 php-fpm7.4
 2000 www-data   1 -19 1866532 488136  57492 S  12.2  12.1   0:12.73 php-fpm7.4
  784 www-data   1 -19 1865644 486684  57488 S  11.9  12.0   0:22.59 php-fpm7.4
 1364 www-data   1 -19 1854212 475748  57488 S  10.9  11.8   0:13.24 php-fpm7.4
  786 www-data   1 -19 1806356 441032  71264 D   1.0  10.9   0:24.10 php-fpm7.4
 1365 www-data   1 -19 1788680 414700  61968 D   1.0  10.3   0:13.28 php-fpm7.4
  910 root      20   0   83060  59240   2296 S   0.3   1.5   0:00.18 python
  872 root      20   0   83060  59012   2220 S   0.0   1.5   0:00.31 python
  954 root      20   0   83316  57168      0 S   0.0   1.4   0:00.30 python
  576 root      20   0   83316  57164      0 S   0.7   1.4   0:02.33 python
  951 root      20   0   83316  57164      0 S   0.0   1.4   0:00.10 python
  948 root      20   0   83316  57108      0 S   0.0   1.4   0:00.16 python
  918 root      20   0   83316  57072      0 S   0.0   1.4   0:00.10 python
  916 root      20   0   83316  57028      0 S   0.0   1.4   0:00.05 python
  913 root      20   0   83060  56976      0 S   0.0   1.4   0:00.28 python
  904 root      20   0   83060  56880      0 S   0.0   1.4   0:00.06 python

and kswapd burning CPU on memory reclaim isn't exactly going to be helping...

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
35 root      20   0       0      0      0 S  81.3   0.0   2:56.40 kswapd0

> Managed to SSH in after a reboot, looks like it's OOMing?

rsync: fork failed in do_recv: Cannot allocate memory (12) — so that's a yes..?
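(A quick way to confirm whether the OOM killer is actually firing, rather than allocations simply being refused, is to check the kernel log once a shell is available; these are stock Linux commands, nothing specific to this host:)

# look for OOM killer activity in the kernel ring buffer
dmesg -T | grep -iE 'out of memory|oom-kill'
# or via the journal, limited to kernel messages
journalctl -k | grep -i oom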

Should we rebuild deployment-parsoid12 on a larger VM flavour, or is that just masking whatever has caused this? I've not previously seen the resource usage of that VM, but it obviously wasn't this close to the line...

TheresNoTime raised the priority of this task from High to Unbreak Now!. Sep 15 2022, 11:30 AM

UBN! given that it's mostly preventing beta cluster deployment of code

hashar subscribed.

What is possible is that the process uses half of the available memory, and when it invokes fork() the kernel rejects it. I encountered that ages ago (T125050#3153574). A more recent example is Phan running against mediawiki/core: if that takes 2 GB of RAM, when Phan forks to run stty -a (which hardly uses any memory) the kernel still tries to allocate 2 GB and fails :) (T219114#5084302). For Phan we ended up not forking anymore.

The kernel memory check can be disabled using vm.overcommit_memory=1. Otherwise, get a VM with more RAM (maybe the WMCS people can change the flavor to save the trouble of having to rebuild the instance).
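For reference, a minimal sketch of inspecting and changing the overcommit setting (standard sysctl usage; the drop-in file name is just an example):

# current mode: 0 = heuristic, 1 = always allow, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory
# commit accounting (the hard limit only applies in mode 2)
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
# change at runtime
sudo sysctl -w vm.overcommit_memory=1
# persist across reboots (example file name)
echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/90-overcommit.conf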

That being said,

You can find the monitoring for deployment-parsoid12 at https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?orgId=1&var-project=deployment-prep&var-server=deployment-parsoid12 . The two-day view:

deployment-parsoid12_memory.png (303×582 px, 33 KB)

deployment-parsoid12_cpu.png (305×607 px, 35 KB)

deployment-parsoid12_iops.png (299×568 px, 26 KB)

So it looks like Parsoid goes wild with high CPU, iops and memory until there is no memory left. Thus the memory allocation error is a symptom of some process going wild?

From Logstash it looks like there are a few processes parsing a large JSON file, https://en.wikipedia.beta.wmflabs.org/wiki/Data:DutchMuni-json, which I have spotted previously and filed as T310069 (I knew it sounded familiar!).

> From Logstash it looks like there are a few processes parsing a large JSON file, https://en.wikipedia.beta.wmflabs.org/wiki/Data:DutchMuni-json, which I have spotted previously and filed as T310069 (I knew it sounded familiar!).

I've blanked that page; is that enough to stop it parsing (or does something need clearing from a queue)?
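(In case it's queued work rather than live parses, something along these lines should show it; a sketch using the standard MediaWiki maintenance scripts via mwscript, with the wiki name and job type as assumptions:)

# list pending jobs grouped by type (wiki name assumed for beta enwiki)
mwscript showJobs.php --wiki=enwiki --group
# drain a specific job type if one is backed up (job type is just an example)
mwscript runJobs.php --wiki=enwiki --type=refreshLinks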


top - 12:31:19 up  1:26,  2 users,  load average: 0.07, 0.33, 4.91
Tasks: 133 total,   1 running, 132 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.7 sy,  0.0 ni, 97.8 id,  0.2 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   3949.2 total,   2349.9 free,    615.9 used,    983.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2940.1 avail Mem

seems to have fixed it!

TheresNoTime lowered the priority of this task from Unbreak Now! to Needs Triage. Edited Sep 15 2022, 12:32 PM

Dropping from UBN!, immediate issue is resolved

That was the same issue as T310069, albeit this time you got it fixed by blanking https://en.wikipedia.beta.wmflabs.org/wiki/Data:DutchMuni-json :] The root cause is tracked at T288889.