gerrit1002 running out of space
Closed, Resolved, Public

Description

There's an icinga alert for gerrit1002, which is running out of space:

root@gerrit1002:/srv/gerrit# df -hT /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda1      ext4   63G   58G  2.3G  97% /
root@gerrit1002:/srv/gerrit# du -sh *
30G	git
4.0K	jvmlogs
15G	plugins

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 28 2020, 1:53 AM
Marostegui triaged this task as High priority. Jan 28 2020, 1:53 AM
Dzahn added a comment. Jan 28 2020, 1:55 AM

This is not the production service. This is a test setup for the 2.16 upgrade.

Marostegui lowered the priority of this task from High to Medium. Jan 28 2020, 1:57 AM

Thanks, I have added a comment to the alert to avoid confusion.

Dzahn added a comment. Jan 28 2020, 2:00 AM

Tried to avoid alerts on this and turn off monitoring with https://gerrit.wikimedia.org/r/c/operations/puppet/+/562619, but that isn't enough: the base checks still pop up in the web UI, even if they don't send notifications.

I will schedule a long downtime.

Thank you :-)

Mentioned in SAL (#wikimedia-operations) [2020-01-28T02:05:33Z] <mutante> gerrit1002 - gzipping a bunch of /var/log/gerrit/ log files (T243808)

Dzahn added a comment. Jan 28 2020, 2:19 AM

@thcipriani ^ This is back to 94% as of right now after ^. And it's been in downtime for a month. Is the test instance usable with the current size?

Also fwiw, when i looked at /srv and the largest files in it i found a single file: gzip compressed data, was "GoogleNews-vectors-negative300.bin" in /srv/gerrit/plugins/plugins/lfs/21/c0 . It's a 1.6 G compressed file.
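
For anyone repeating this kind of search, here is a minimal sketch of listing the largest files under a directory. The function name largest_files and its defaults are assumptions for illustration, not what was actually run on gerrit1002.

```shell
# largest_files DIR [N]: print the N largest regular files under DIR
# (default 10) as "size-in-KiB<TAB>path", largest first.
# -xdev keeps find on a single filesystem, which matters when the
# question is what is filling one particular partition.
largest_files() {
    dir=$1
    n=${2:-10}
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null \
        | sort -rn \
        | head -n "$n"
}
```

Something like largest_files /srv 20 would then give a shortlist, and running the file command on a suspicious entry shows what kind of data it holds.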

Ugh. I did some digging on this machine today: it seems like most of the data there is legitimately in use by gerrit; i.e., there is nothing obvious to trash (aside from rotating some logs early, but that won't get us the kind of space we evidently need).

Is there an easy way to expand the disk space here or move /srv to a separate partition? It seems like even though all the git repos are only about 30GB, we have enough other data to fill up the space :(

See T243983. I added a second disk to this VM, it's an additional 10GB and mounted on /srv/dbdump. Hope that does it.

Marostegui added a subscriber: MoritzMuehlenhoff.

Per the duplicate task I merged here, filed by @MoritzMuehlenhoff:

root@gerrit1002:~# df -hT /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vda1      ext4   63G   60G     0 100% /

Per discussion me and @thcipriani just had, we found lfs objects using 15G, so if we remove /srv/dbdump and recreate it as a 20g partition, we can move the objects there.

Actually, we will use that partition for db readonly, so i think a new /srv/lfs partition 18g would do.

Mentioned in SAL (#wikimedia-operations) [2020-02-20T23:25:52Z] <mutante> ganeti1003 - adding another virtual 20G disk to gerrit1002 (T243808)

Had to fix /etc/network/interfaces again (interface name changed again, ens5 -> ens6 now ens7) and restart to fix networking.

Then formatted with ext4 and mounted additional 20G on /srv/lfs. Added to /etc/fstab to survive reboots.

/dev/vdc         20G   45M   19G   1% /srv/lfs

thcipriani closed this task as Resolved. Feb 25 2020, 7:50 PM
thcipriani claimed this task.

Moved all the lfs files to a symlinked path under new disk on /srv/lfs (thanks @Dzahn):

thcipriani@gerrit1002:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G     0  7.9G   0% /dev
tmpfs           1.6G  173M  1.4G  11% /run
/dev/vda1        63G   43G   18G  72% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/vdb        9.8G   37M  9.3G   1% /srv/dbdump
/dev/vdc         20G   15G  3.7G  81% /srv/lfs
tmpfs           1.6G     0  1.6G   0% /run/user/11634
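
The move to a symlinked path could be sketched roughly as below. This is an assumption-laden illustration, not the commands that were actually run: the helper name relocate_with_symlink is hypothetical, and it assumes the service using the directory has been stopped first.

```shell
# relocate_with_symlink SRC DEST: copy directory SRC to DEST (for example
# on a newly mounted disk), then replace SRC with a symlink to DEST.
relocate_with_symlink() {
    src=$1
    dest=$2
    # Refuse to run unless SRC is a real directory (not already a symlink).
    [ -d "$src" ] && [ ! -L "$src" ] || {
        echo "source must be a real directory: $src" >&2
        return 1
    }
    # Refuse to overwrite an existing destination.
    [ ! -e "$dest" ] || {
        echo "destination already exists: $dest" >&2
        return 1
    }
    # Copy first, preserving ownership, permissions, and timestamps.
    cp -a "$src" "$dest" || return 1
    # Remove the original only after the copy succeeded, then link.
    rm -rf "$src" && ln -s "$dest" "$src"
}
```

Copying before deleting costs time on a 15G tree, but it means a failed copy leaves the original untouched. The destination must sit below the mount point (a subdirectory of /srv/lfs, say), since the mount point itself already exists.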
jbond reopened this task as Open. Apr 1 2020, 10:35 AM
jbond added a subscriber: jbond.

I noticed that the disk on gerrit1002 was full again today:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G     0  7.9G   0% /dev
tmpfs           1.6G  157M  1.5G  10% /run
/dev/vda1        63G   60G     0 100% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/vdb        9.8G   37M  9.3G   1% /srv/dbdump
/dev/vdc         20G   15G  3.7G  81% /srv/lfs
tmpfs           1.6G     0  1.6G   0% /run/user/20774

I noticed that the files in /var/log/gerrit are not being gzipped when they are rotated (as they are on gerrit1001). I manually ran find /var/log/gerrit/ -mtime +2 ! -name \*log -exec gzip -9 {} \; to give some breathing space; however, it seems like the log4j settings on gerrit1002 might need updating.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G     0  7.9G   0% /dev
tmpfs           1.6G  161M  1.5G  11% /run
/dev/vda1        63G   40G   20G  68% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/vdb        9.8G   37M  9.3G   1% /srv/dbdump
/dev/vdc         20G   15G  3.7G  81% /srv/lfs
tmpfs           1.6G     0  1.6G   0% /run/user/20774
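
As a stopgap until rotation compresses logs by itself, the manual cleanup above could be wrapped in a small function, for example for a cron job. This is only a sketch: the function name compress_rotated_logs is an assumption, and compared with the one-off command it additionally restricts the match to regular files and skips files that are already gzipped.

```shell
# compress_rotated_logs DIR [DAYS]: gzip rotated log files in DIR that were
# last modified more than DAYS days ago (default 2).
# Files whose names end in "log" are treated as the live logs and left
# alone, as are files that already carry a .gz suffix.
compress_rotated_logs() {
    logdir=$1
    days=${2:-2}
    find "$logdir" -type f -mtime +"$days" \
        ! -name '*log' ! -name '*.gz' \
        -exec gzip -9 {} +
}
```

Called as compress_rotated_logs /var/log/gerrit, it would pick up dated files such as gerrit.log.2020-04-01 while leaving the current gerrit.log in place.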
Dzahn added a comment. Apr 2 2020, 12:33 PM

Icinga alert is in state OK and also in downtime. And it's testing-only.

Dzahn closed this task as Resolved. Apr 2 2020, 12:35 PM

/dev/vda1 63G 41G 20G 68% /

jbond reopened this task as Open. Apr 2 2020, 2:36 PM

@Dzahn see my comment above. I think we should investigate fixing the log4j properties so this doesn't keep coming up.

Dzahn added a comment. Apr 2 2020, 7:21 PM

Alright. This is a one-time installation though to test the Gerrit upgrade to 2.16 and then remove it again. But that doesn't mean there can't be fixes for a next time.

jbond added a comment. Apr 3 2020, 9:32 AM

My assumption was that it would be a simple fix for anyone who knows log4j well enough, but if you feel it's not worth the effort, feel free to resolve again. Thanks.

Mentioned in SAL (#wikimedia-operations) [2020-04-29T08:49:01Z] <mutante> gerrit1002 - gzipping gerrit.log.2020-04* files in /var/log/gerrit (T243808)

Dzahn closed this task as Resolved. Apr 29 2020, 8:56 AM

The disk space alert has been OK for almost a month. I gzipped the existing logs, and beyond that I don't think it's worth the effort, since it's fine on the production server.