
scandium lost /srv
Closed, Resolved (Public)

Description

scandium has lost its /srv/ssd partition somehow:

14:11:43   <icinga-wm>	PROBLEM - Disk space on scandium is CRITICAL: DISK CRITICAL - /srv/ssd is not accessible: No such file or directory

The zuul-merger instance relies on it to clone repositories under /srv/ssd/zuul/git. Because the partition is gone, the zuul-merger is unable to process patches. From /var/log/zuul/merger.log:

GitCommandError: 'git clone -v ssh://jenkins-bot@ytterbium.wikimedia.org:29418/mediawiki/extensions/Wikibase /srv/ssd/zuul/git/mediawiki/extensions/Wikibase' returned with exit code 128
stderr: 'fatal: could not create leading directories of '/srv/ssd/zuul/git/mediawiki/extensions/Wikibase': Permission denied
GitCommandError: 'git clone -v ssh://jenkins-bot@ytterbium.wikimedia.org:29418/operations/puppet /srv/ssd/zuul/git/operations/puppet' returned with exit code 128
stderr: 'Cloning into '/srv/ssd/zuul/git/operations/puppet'...
error: cannot run /srv/ssd/zuul/git/.ssh_wrapper: No such file or directory
fatal: unable to fork
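
The two failures above (directory not creatable, wrapper script missing) could be caught before the merger starts. A minimal sketch, assuming a hypothetical helper named check_git_dir; it is not part of zuul, and the three checks are only the ones suggested by the log lines above:

```shell
# Hypothetical pre-flight check for the zuul-merger git directory.
# Prints one status line and returns non-zero on the first failed check.
check_git_dir() {
    dir="$1"
    # The directory itself must exist (it vanishes if the mount is hidden).
    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
    # git clone needs to create sub-directories in it.
    [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
    # The log shows git trying to run this wrapper from the same directory.
    [ -x "$dir/.ssh_wrapper" ] || { echo "no executable wrapper: $dir/.ssh_wrapper"; return 1; }
    echo "ok: $dir"
}
```

Usage would be something like `check_git_dir /srv/ssd/zuul/git`, run as the zuul user so the writability test reflects the right permissions.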

Event Timeline

hashar raised the priority of this task to Needs Triage.
hashar updated the task description.
hashar subscribed.

scandium has:

mount { '/srv/ssd':
    ensure  => mounted,
    device  => '/dev/md2',
    fstype  => 'xfs',
    options => 'noatime,nodiratime,nobarrier,logbufs=8',
    require => File['/srv/ssd'],
}

The zuul-merger process clones the git directories under /srv/ssd/zuul/git.

Apparently /dev/md2 is mounted twice, once by the initial server installation and once more by Puppet:

$ mount
/dev/md2 on /srv/ssd type xfs (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,noquota)
/dev/md2 on /srv type xfs (rw,relatime,attr2,nobarrier,inode64,logbufs=8,noquota)
$ cat /etc/fstab
# /srv was on /dev/md2 during installation
UUID=d588649c-4a40-4853-8d33-a82ed028fb1e	/srv	xfs	defaults	0	2
/dev/md2	/srv/ssd	xfs	noatime,nodiratime,nobarrier,logbufs=8	0	0
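
A double mount like this can be spotted mechanically. A rough sketch, assuming a hypothetical helper named find_duplicate_mounts that reads a mount table in /proc/mounts format (device, mount point, type, options, ...). Note that legitimate bind mounts would show up as duplicates too, so the output is a hint rather than a verdict:

```shell
# Hypothetical helper: list block devices that appear at more than one
# mount point. $1 is a mount table file (defaults to /proc/mounts).
find_duplicate_mounts() {
    awk '
        # Only consider real device paths, skipping proc, sysfs, tmpfs, etc.
        $1 ~ /^\// { count[$1]++; points[$1] = points[$1] " " $2 }
        END {
            for (dev in count)
                if (count[dev] > 1)
                    print dev ":" points[dev]
        }
    ' "${1:-/proc/mounts}"
}
```

Called without arguments it inspects the live mount table; on a host in the state shown above it would be expected to report /dev/md2 with both /srv/ssd and /srv.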

So we end up with a wrong directory from Dec 12th:

$ ls -ld /srv/zuul/git
drwxr-xr-x 21 zuul root 4096 Dec 12 23:44 /srv/zuul/git

The proper one being:

$ ls -ld /srv/zuul/git/
drwxr-xr-x 21 zuul root 4096 Dec 12 23:44 /srv/zuul/git/

It is all messed up :-(

So I think we need to drop the /srv mount from /etc/fstab:

UUID=d588649c-4a40-4853-8d33-a82ed028fb1e	/srv	xfs	defaults	0	2

Then unmount both mount points and remount /srv/ssd.
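
The proposal above could look roughly like the following. This is a sketch only: drop_fstab_entry is a hypothetical helper, the temporary file name is an assumption, and the right order for the two umount calls depends on which mount is stacked on top, so check with `mount` first:

```shell
# Hypothetical helper: print an fstab with the entry for one mount point
# removed. $1 is the mount point, $2 the fstab file (default /etc/fstab).
# Comment lines are always kept, even when they mention the mount point.
drop_fstab_entry() {
    awk -v mp="$1" '/^[[:space:]]*#/ || $2 != mp' "${2:-/etc/fstab}"
}

# Intended use, as root (left commented out on purpose):
#   drop_fstab_entry /srv /etc/fstab > /etc/fstab.new   # review the result
#   mv /etc/fstab.new /etc/fstab
#   umount /srv        # order of the two umounts depends on the stacking
#   umount /srv/ssd
#   mount /srv/ssd     # picks up the remaining /srv/ssd line from fstab
```

Writing to a separate file first keeps the original /etc/fstab untouched until the new content has been looked over.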

Ok, I did this. /srv/ssd is now mounted, but /srv is not. However, due to some previous job run, it looks like zuul was cloned at /srv/ssd/zuul when /srv was mounted, meaning now both /srv/ssd/ssd/zuul and /srv/ssd/zuul exist. Likely the next git pull will update /srv/ssd/zuul properly. @hashar, can you remove /srv/ssd/ssd, or should I?

Thank you! I restarted the zuul-merger instance since /srv/ssd/zuul/git is now fine.

/srv/ssd/ssd can be nuked entirely. I lack root access to do so.