
deployment-snapshot01.deployment-prep.eqiad.wmflabs keeps running out of space
Closed, Resolved · Public

Description

deployment-snapshot01.deployment-prep.eqiad.wmflabs keeps running out of space and breaking the scap job

reedy@deployment-snapshot01:~$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
udev                              2.0G     0  2.0G   0% /dev
tmpfs                             396M   42M  354M  11% /run
/dev/vda3                          19G   18G     0 100% /
tmpfs                             2.0G     0  2.0G   0% /dev/shm
tmpfs                             5.0M     0  5.0M   0% /run/lock
tmpfs                             2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/mapper/vd-data--local--disk   21G  5.1G   14G  27% /mnt/dumpsdata
tmpfs                             396M     0  396M   0% /run/user/0
tmpfs                             396M     0  396M   0% /run/user/1226

9.0G is in /srv/mediawiki/php-master/cache/l10n

root@deployment-snapshot01:/srv/mediawiki/php-master/cache/l10n# du -h .
1.8G	./.~tmp~
1.9G	./upstream/.~tmp~
3.8G	./upstream
9.0G	.
root@deployment-snapshot01:/srv/mediawiki/php-master/cache/l10n# du -ch *.php | grep total
1.8G	total
root@deployment-snapshot01:/srv/mediawiki/php-master/cache/l10n# du -ch *.cdb | grep total
1.8G	total
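
For anyone retracing this kind of investigation, the usual pattern is something like the following (the paths are just the obvious candidates here, nothing special):

# Biggest top-level directories on the root filesystem only (-x: don't cross mounts)
du -xh --max-depth=1 / 2>/dev/null | sort -h | tail
# Then drill down into whichever directory wins
du -xh --max-depth=1 /srv 2>/dev/null | sort -h | tail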

Event Timeline

I've manually added a root cron job over there to remove all of /srv/mediawiki/php-master/cache/l10n wholesale once a day, but it's not the greatest of workarounds.
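
For reference, a minimal sketch of what that stopgap might look like in root's crontab; the schedule and the exact command here are assumptions, not the entry actually installed:

# Hypothetical crontab entry: once a day, throw away the whole l10n cache
# and let the next scap run regenerate it
30 4 * * * rm -rf /srv/mediawiki/php-master/cache/l10n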

We're running the next version of scap in beta currently which generates PHP l10n cache files in addition to the normal JSON + CDB files (T99740).

The additional PHP files do take up more space:

[thcipriani@deployment-deploy01 ~]$ du -chs /srv/mediawiki-staging/php-master/cache/l10n/*.php
...
1.8G    total

The .~tmp~ directories come from rsync. If you use the rsync --delay-updates option (which we use inside scap to make the sync more "atomic"), rsync will create these directories, build the "new" versions of the files in there, and then move all the files into place and remove the directory.

Since rsync is failing, it leaves those directories behind.
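
To illustrate, a hand-written sketch (this is not the actual scap invocation, and the rsync source below is made up):

# --delay-updates makes rsync stage each updated file in a .~tmp~/ directory
# inside the destination, then rename everything into place at the very end
rsync -a --delay-updates deploy-host::common/php-master/cache/l10n/ /srv/mediawiki/php-master/cache/l10n/

# When the transfer dies partway through, those staged copies stay behind;
# a one-off cleanup of the leftovers could look like:
find /srv/mediawiki/php-master/cache/l10n -type d -name '.~tmp~' -prune -exec rm -rf {} +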

I note this is becoming a daily problem again...

I assume there's no easy way to move the partitions around live, so the only solution is to create a new instance with a bigger root partition?


Or stop creating PHP and CDB l10n...

The easiest thing to do is to enlarge the virtual disk or add an additional vdisk to house /srv. How do we move forward with one of those options?
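
For the second-vdisk option, assuming the new disk shows up as /dev/vdb and reusing the existing vd volume group seen in the df output above (both assumptions), the manual LVM route is roughly:

# Turn the new disk into an LVM physical volume and add it to the volume group
pvcreate /dev/vdb
vgextend vd /dev/vdb

# Carve out a logical volume, put a filesystem on it, and mount it on /srv
lvcreate -l '100%FREE' -n srv-local-disk vd
mkfs.ext4 /dev/vd/srv-local-disk
mount /dev/vd/srv-local-disk /srv    # plus an fstab entry to make it permanent

Anything already living under /srv would of course have to be copied over before switching the mount.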

How can I add an additional vdisk? I'm happy to do that if someone can point me to instructions. Otherwise the fallback position is to make a new instance with a bigger / and I need to block out time for that.

I asked about this in #wikimedia-cloud and Lucas pointed me to https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances

I'm also happy to do it. I'll find you on IRC to chat about it.

I didn’t realize the question was for deployment-prep – I don’t think those are the right instructions in that case. role::labs::lvm::srv, as advised on that page, seems to mount /dev/mapper/vd-second--local--disk on /srv – but deployment-snapshot01 only has vd-data--local--disk under /dev/mapper. I assume its storage space / logical volumes are arranged differently.

I have tried applying the puppet class; when that failed, I tried running the bash script by hand. Result:

/usr/local/sbin/make-instance-vol second-local-disk '100%FREE' ext4 
100%FREE
  Internal error: Unable to create new logical volume with no extents.

And more specifically, running lvcreate directly,

root@deployment-snapshot01:~# /sbin/lvcreate -l '100%FREE' -n second-local-disk vd
  Internal error: Unable to create new logical volume with no extents.
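
That error generally just means the volume group has no unallocated extents left to give out. A quick way to confirm, assuming the volume group is the vd used above:

# Compare total vs. free space in the volume group
vgs vd
# Or, more verbosely
vgdisplay vd | grep -i free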

Is this a matter of the quota and needing to shrink an existing partition or logical volume? If so I'd better just get more space allocated or create a new instance.

Ah, I did not see your comment above. Unfortunately vd-data--local--disk is not big enough to split it and have dump testing space as well as all the stuff in /srv.

Change 631216 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add deployment-snapshot02 to dsh list for deployment-prep in wmcs

https://gerrit.wikimedia.org/r/631216

Change 631216 merged by ArielGlenn:
[operations/puppet@production] add deployment-snapshot02 to dsh list for deployment-prep in wmcs

https://gerrit.wikimedia.org/r/631216

Welp. Stuck at puppet not running on deploy01 in deployment-prep. Will whack-a-mole-away at it again tomorrow.

I needed to set profile::mediawiki::mcrouter_wancache::use_onhost_memcached: false in the deployment-deploy prefix hiera settings. Now fixing up scap.cfg template.
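
For the record, that change is just a one-line YAML setting in the deployment-deploy prefix hiera (shown here on its own; the other keys in the prefix are omitted):

profile::mediawiki::mcrouter_wancache::use_onhost_memcached: false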

Change 631404 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] allow deployment-prep scap pull to work from instances with new dns names

https://gerrit.wikimedia.org/r/631404

Change 631404 merged by ArielGlenn:
[operations/puppet@production] allow deployment-prep scap pull to work from instances with new dns names

https://gerrit.wikimedia.org/r/631404

Well, the new template is not going out because /usr/local/bin/git-sync-upstream fails to rebase. From the cron job in /var/spool/cron/crontabs/root we have:

Rebasing (1/18)

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".'; stderr: 'Could not pick 8f5497bb84af276c42d3e847a2802dd3a8da43ad'
2020-10-01T10:10:14Z ERROR    sync-upstream: Rebase failed!

Looking at that next unless someone beats me to it.
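
A rough sketch of how one might poke at the stuck rebase on the puppetmaster (the checkout path is an assumption; adjust it to wherever git-sync-upstream keeps the operations/puppet clone):

cd /var/lib/git/operations/puppet
# See which cherry-pick is conflicting and on which files
git status
git show --stat 8f5497bb84af276c42d3e847a2802dd3a8da43ad
# Either resolve the conflicts and continue, or bail out and leave the tree clean
git rebase --continue    # after fixing and `git add`-ing the conflicted files
git rebase --abort       # to get back to the pre-rebase state instead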

That commit is swift-related, and there are a few of those among the 18 unmerged commits by @jbond from the production branch. Tagging him here in hopes that he can sort it out.

I've added the new hiera setting profile::swift::proxy::memcached_servers to both the prefix settings and the instance settings for deploy-ms-fe03, and left the old one in place. I've rebased and tweaked the patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/601429 after jbond pointed me to the two likely conflict issues. Not yet pulled into deployment-prep puppetmaster04, however.

Adding @dpifke since it's his work being affected.

Talked with godog on IRC and, after getting his eyes on everything, went ahead and updated the deployment-prep puppetmaster. Looks like that unstuck everything.

With a lot of help from godog, I found many more swift settings that needed to be renamed to profile:: in the deployment-ms prefix, in project puppet, and in the ms-fe03 instance settings. While I was doing that, I changed

profile::swift::replication_accounts:
  mw_media:
    cluster_codfw: http://deployment-ms-fe02.deployment-prep.eqiad.wmflabs/v1/

to point to ms-fe03, since apparently ms-fe02 is gone.

Ran puppet on ms-fe03, ms-be05, ms-be06 successfully.

It looks like the replacement for swift::params::account_keys in prefix puppet is already in /var/lib/git/labs/private/hieradata/common.yaml as profile::swift::accounts_keys.

Side note: swift folks will probably want to go and clean out any redundant keys from everywhere at some point. In the meantime, I've successfully run ruwiki dumps on snapshot02, so it's time to move any old files off of snapshot01 and prep it for decommissioning. Yay!
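
The copy itself would be nothing fancy; something along these lines, run from snapshot02, should do it (hosts and paths are illustrative, with the dump output living on the /mnt/dumpsdata volume shown in the df output above):

# Pull the old dump run files across before snapshot01 goes away
rsync -a --info=progress2 deployment-snapshot01.deployment-prep.eqiad.wmflabs:/mnt/dumpsdata/ /mnt/dumpsdata/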

Change 631907 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] remove snapshot01 from scap targets in deployment-prep

https://gerrit.wikimedia.org/r/631907

Change 631907 merged by ArielGlenn:
[operations/puppet@production] remove snapshot01 from scap targets in deployment-prep

https://gerrit.wikimedia.org/r/631907

I have shut off deployment-snapshot01 and the new snapshot02 instance is open for business. In a couple of weeks I'll decommission snapshot01 and close this task.

Just a note that the swift patchset https://gerrit.wikimedia.org/r/c/operations/puppet/+/601429 was merged, and the cherry-pick was silently and automagically removed on the deployment-prep puppetmaster with no issues.

Apologies for not being around to help with this last week. I was moving, thus out of office and with limited internet connectivity.

In the future, please feel free to un-cherry-pick any of my patches in deployment-prep if merge conflicts develop; I don't assume anything there will stay running for any length of time, especially if a work-in-progress starts to block other work.

ArielGlenn claimed this task.

snapshot01 has been decommissioned, so this problem is officially gone.