beta-scap-sync-world failure — 'No space left on device'
Open, Needs Triage, Public

Description

https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/87639/console

23:49:52 04:49:52 Finished sync-check-canaries (duration: 00m 28s)
23:49:52 04:49:52 Started sync-apaches
23:50:29 04:49:52 sync-apaches:   0% (ok: 0; fail: 0; left: 6)                           
23:50:29 04:50:29 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud'] (ran as mwdeploy@deployment-mwmaint02.deployment-prep.eqiad1.wikimedia.cloud) returned [70]: 04:49:53 Copying from deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud:/srv/mediawiki-staging to deployment-mwmaint02.deployment-prep.eqiad1.wikimedia.cloud:/srv/mediawiki
23:50:29 04:49:53 Started rsync common
23:50:29 rsync: write failed on "/srv/mediawiki/php-master/.gitignore": No space left on device (28)
23:50:29 rsync error: error in file IO (code 11) at receiver.c(374) [receiver=3.1.3]
23:50:29 04:50:29 Finished rsync common (duration: 00m 35s)
23:50:29 04:50:29 Unhandled error:
23:50:29 Traceback (most recent call last):
23:50:29   File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/cli.py", line 524, in run
23:50:29     exit_status = app.main(app.extra_arguments)
23:50:29   File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/main.py", line 945, in main
23:50:29     exclude_wikiversionsphp=self.arguments.exclude_wikiversions_php,
23:50:29   File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/utils.py", line 404, in context_wrapper
23:50:29     return func(*args, **kwargs)
23:50:29   File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/tasks.py", line 449, in sync_common
23:50:29     subprocess.check_call(rsync)
23:50:29   File "/usr/lib/python3.7/subprocess.py", line 347, in check_call
23:50:29     raise CalledProcessError(retcode, cmd)
23:50:29 subprocess.CalledProcessError: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--new-compress', '--delete', '--exclude=*.swp', '--no-perms', '--stats', '--exclude=**/.git', '--exclude=/wikiversions*.php', '--exclude=**/cache/l10n/*.cdb', 'deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud::common', '/srv/mediawiki']' returned non-zero exit status 11.
23:50:29 04:50:29 pull failed: <CalledProcessError> Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--new-compress', '--delete', '--exclude=*.swp', '--no-perms', '--stats', '--exclude=**/.git', '--exclude=/wikiversions*.php', '--exclude=**/cache/l10n/*.cdb', 'deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud::common', '/srv/mediawiki']' returned non-zero exit status 11.
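
The traceback above only surfaces rsync's numeric exit status (11), while the actual cause ("No space left on device") appears a few lines earlier. A minimal sketch of how a caller could turn that status into a more direct hint follows. This is not scap's actual code: `run_with_hint` and `RSYNC_EXIT_HINTS` are hypothetical names, and the table lists only a few of the exit values documented in the rsync man page.

```python
import subprocess

# A few exit values documented in the rsync man page (not exhaustive).
RSYNC_EXIT_HINTS = {
    11: "error in file I/O (check free disk space on the receiver)",
    12: "error in rsync protocol data stream",
    23: "partial transfer due to error",
    30: "timeout in data send/receive",
}


def run_with_hint(cmd):
    """Run cmd; return None on success, otherwise a readable failure hint."""
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as exc:
        # Fall back to a generic message for codes not in the table.
        hint = RSYNC_EXIT_HINTS.get(exc.returncode, "unknown error")
        return f"command exited with status {exc.returncode}: {hint}"
    return None
```

The function deliberately returns a string instead of re-raising, so the caller decides whether to log it, abort, or retry.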

Event Timeline

Ah, the root filesystem on deployment-mwmaint02 is at 100% usage:

samtar@deployment-mwmaint02:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           395M   40M  355M  11% /run
/dev/sda1        20G   19G     0 100% /
tmpfs           2.0G  4.0K  2.0G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           395M     0  395M   0% /run/user/0
tmpfs           395M     0  395M   0% /run/user/12744

Disk usage by top-level directory:

    9.2 GiB [##########] /srv
    5.3 GiB [#####     ] /var
    2.6 GiB [##        ] /usr
  520.4 MiB [          ] /opt
   51.0 MiB [          ] /boot
   39.9 MiB [          ] /run
   11.3 MiB [          ] /etc
    3.4 MiB [          ] /tmp
    1.9 MiB [          ] /home
  144.0 KiB [          ] /root
e  16.0 KiB [          ] /lost+found
    4.0 KiB [          ] /dev
e   4.0 KiB [          ] /mnt
e   4.0 KiB [          ] /media
.   0.0   B [          ] /proc
    0.0   B [          ] /sys
@   0.0   B [          ]  initrd.img.old
@   0.0   B [          ]  initrd.img
@   0.0   B [          ]  vmlinuz.old
@   0.0   B [          ]  vmlinuz
@   0.0   B [          ]  libx32
@   0.0   B [          ]  lib64
@   0.0   B [          ]  lib32
@   0.0   B [          ]  sbin
@   0.0   B [          ]  lib
@   0.0   B [          ]  bin
    0.0   B [          ]  .cloud-init-finished

Shouldn't /srv be a volume?
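
One quick way to answer that question on the host is to check whether /srv is a mount point at all. A minimal sketch, assuming Python is available on the instance; `is_separate_volume` is a hypothetical helper name, and `os.path.ismount` may not detect every kind of mount (bind mounts within one filesystem, for example).

```python
import os


def is_separate_volume(path):
    """Return True if path is a mount point rather than a plain directory."""
    return os.path.ismount(path)


if __name__ == "__main__":
    print("/srv is a mount point:", is_separate_volume("/srv"))
```

Given the df output above, where only / is mounted from /dev/sda1, this would be expected to report that /srv is not a mount point on this instance.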

Deleted a few archived logs and re-ran a deployment; the root filesystem is now back down to 82%:

samtar@deployment-mwmaint02:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           395M   40M  355M  11% /run
/dev/sda1        20G   16G  3.6G  82% /
tmpfs           2.0G  4.0K  2.0G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           395M     0  395M   0% /run/user/0
tmpfs           395M     0  395M   0% /run/user/12744
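
A note on reading these numbers: in the first listing, 19G used out of 20G still shows 0 available and 100%. GNU df computes Use% from used / (used + avail), rounded up, not from the Size column. The gap between Size and used + avail is usually space reserved for root by the filesystem; that is an assumption here, since the filesystem type is not shown. A small sketch reproducing both percentages from the rounded values printed above (`use_percent` is a hypothetical helper):

```python
import math


def use_percent(used, avail):
    """Approximate df's Use% column: used / (used + avail), rounded up."""
    if used + avail == 0:
        # df prints "-" in this case; returning 0 is a simplification.
        return 0
    return math.ceil(100 * used / (used + avail))


# Values in GiB as printed by `df -h` (already rounded by df).
print(use_percent(19, 0))    # before cleanup: 100
print(use_percent(16, 3.6))  # after cleanup: 82
```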

Last week deployment-jobrunner04 had its 10G /srv partition fill up, which we solved by resizing the OpenStack volume (T327329).

deployment-mwmaint02 has a similar issue: its 20G root partition is too small to host both the system and the MediaWiki files (under /srv). Disk space available over the last six months:

deployment-mwmaint02_disk_available_6months.png (585×882 px, 46 KB)

My guess is that when rsync transfers the l10n cache, it creates a temporary copy of the data (roughly 2.5G), which only just fits in the available disk space. If slightly less space is available than the total size of the l10n cache, rsync crashes and leaves the temporary files behind.
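
The rsync command in the log uses --delay-updates, which holds updated files in a temporary directory until the end of the transfer, so the guess is at least consistent with how that option behaves. A pre-flight check could refuse to start a sync without enough headroom. The sketch below is an illustration only, not something scap does today: `enough_space`, `can_sync`, and the 10% margin are assumptions, and the 2.5 GiB payload is the rough figure quoted in this comment.

```python
import shutil

GIB = 1024 ** 3


def enough_space(free_bytes, payload_bytes, margin=0.10):
    """True if free space covers the payload plus a safety margin."""
    return free_bytes >= payload_bytes * (1 + margin)


def can_sync(dest, payload_bytes, margin=0.10):
    """Check the filesystem holding dest before starting a transfer."""
    return enough_space(shutil.disk_usage(dest).free, payload_bytes, margin)


if __name__ == "__main__":
    # 2.5 GiB is the approximate l10n cache size mentioned above.
    print(can_sync("/", int(2.5 * GIB)))
```

With the 3.6G free after the cleanup, a 2.5 GiB payload passes the check; with the 0 bytes available at the time of the failure, it would not.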

We used to have instances with extended disk space mounted at /srv, but the last round of rebuilds did not take that into account. Short of fixing the l10n cache, one should create a new volume, attach it to the instance, transfer the data, wipe /srv, and mount the volume at /srv.
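
The on-host part of those steps can be sketched as an ordered command plan. This is a rough outline under stated assumptions, not a tested procedure: `migration_plan` is a hypothetical helper, /dev/sdb and /mnt/newsrv are placeholder names, ext4 is an assumed filesystem choice, and creating and attaching the volume happens on the OpenStack side and is not covered. The function only builds the commands; it does not run them.

```python
import shlex


def migration_plan(device, target="/srv", staging="/mnt/newsrv"):
    """Return the ordered commands (argv lists) for moving target onto device."""
    return [
        ["mkfs.ext4", device],            # format the new, empty volume
        ["mkdir", "-p", staging],         # temporary mount point
        ["mount", device, staging],
        ["rsync", "--archive", f"{target}/", f"{staging}/"],  # copy the data
        ["umount", staging],
        ["find", target, "-mindepth", "1", "-delete"],  # wipe old contents
        ["mount", device, target],        # the volume now backs the target
    ]


if __name__ == "__main__":
    # /dev/sdb is a placeholder; confirm the real device name first.
    for argv in migration_plan("/dev/sdb"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Anything writing to /srv should be stopped before the copy, and the mount needs to be made persistent (for example through /etc/fstab or whatever configuration management the instance uses), otherwise it will not survive a reboot.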