Move Gerrit data out of root partition
Open, Medium, Public

Description

During the Gerrit 3.5 upgrade, Gerrit caches stored under /var/lib/gerrit2 overflowed the root partition. As an immediate fix, we created a new dedicated partition and moved the data there. The incident report is at https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade

From the incident analysis at T323262#8435688 we should relocate Gerrit installation from /var/lib/gerrit2/review_site to /srv/gerrit.

On gerrit1001 the transient /var/lib/gerrit2 partition should be removed.

Partition scheme for both hosts:

gerrit1001.wikimedia.org

Partition          Size   Used   Available
/                  46G    14G    30G
/srv               314G   227G   71G
/var/lib/gerrit2   49G    17G    31G

gerrit2002.wikimedia.org

Partition          Size   Used   Available
/                  73G    20G    50G
/srv               629G   71G    527G

Event Timeline

LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

Change 908604 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: remove duplicate $gerrit_site definition

https://gerrit.wikimedia.org/r/908604

The LFS plugin stores the data under /srv/gerrit/plugins/lfs which will clash with the $GERRIT_SITE/plugins directory holding the jar/js plugins.

Change 908617 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: relocate LFS data

https://gerrit.wikimedia.org/r/908617

Change 911358 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add /srv/gerrit/data/lfs to dirs managed by puppet

https://gerrit.wikimedia.org/r/911358

Change 911358 merged by Dzahn:

[operations/puppet@production] gerrit: add /srv/gerrit/data/lfs to dirs managed by puppet

https://gerrit.wikimedia.org/r/911358

Change 911362 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] backup: add /srv/gerrit/data to fileset for gerrit repos

https://gerrit.wikimedia.org/r/911362

Change 911362 merged by Dzahn:

[operations/puppet@production] backup: add /srv/gerrit/data to fileset for gerrit repos

https://gerrit.wikimedia.org/r/911362

Change 911363 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: make the lfs data path configurable

https://gerrit.wikimedia.org/r/911363

Change 911363 merged by Dzahn:

[operations/puppet@production] gerrit: make the lfs data path configurable

https://gerrit.wikimedia.org/r/911363

I updated the code so that we have a new path for lfs data under /srv/gerrit/data, as suggested by hashar. It now exists (as an empty dir) on all hosts, and the new machine, gerrit1003, will already use the new path. The path has also been added to the Bacula backups.

Now the lfs_path is a class parameter and can be set / overridden in Hiera.

So once we have switched to that we just need to move data on gerrit2002 (gerrit-replica) and decom gerrit1001 and this will be done.

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:19:20Z] <mutante> gerrit1003 - chown -R gerrit2:gerrit2 /srv/gerrit T333143 T326368

Change 920765 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: make new lfs path the default and clean up

https://gerrit.wikimedia.org/r/920765

I think after the patch above is merged we might be able to close this.

Change 908617 abandoned by Hashar:

[operations/puppet@production] gerrit: stop managing /srv/gerrit/plugins/lfs

Reason:

https://gerrit.wikimedia.org/r/908617

Change 920765 merged by Dzahn:

[operations/puppet@production] gerrit: remove lfs_dir parameter, use hardcoded new default

https://gerrit.wikimedia.org/r/920765

This ticket asks that "the transient /var/lib/gerrit2 partition should be removed", while what we have actually done is move the lfs data outside of it:

We can say that both current gerrit servers, gerrit1003 and gerrit2002, no longer have a separate /var/lib/gerrit2 partition.

/var/lib/gerrit2 is still 15G on gerrit1003, but it is simply part of the / partition, which is 68% used with 23G available, and it is not expected to grow much.

Meanwhile the lfs data and git repos are in /srv, which is 37% used with 379G available.

gerrit2002 has even more space in both locations.

gerrit1001 will be shut down and is already no longer considered a gerrit server.

So this should be resolved.
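The statements above about which partition backs each path can be checked with a small sketch like the following. The helper names backing_mount and is_separate_partition are hypothetical (they are not part of Gerrit or the puppet repo), and the sketch assumes POSIX df output, mount points without spaces, and a canonical path without a trailing slash.

```shell
# backing_mount (hypothetical helper): print the mount point of the
# filesystem that holds a path. With `df -P` each filesystem is reported
# on a single line and the mount point is the sixth column.
backing_mount() {
    df -P "$1" | awk 'NR == 2 {print $6}'
}

# is_separate_partition (hypothetical helper): succeed only when the path
# is itself a mount point, i.e. it has its own partition.
# Assumes a canonical path (no trailing slash, no symlinks).
is_separate_partition() {
    [ "$(backing_mount "$1")" = "$1" ]
}

# Example, to be run on a Gerrit host:
#   is_separate_partition /var/lib/gerrit2 || echo "part of $(backing_mount /var/lib/gerrit2)"
```

If the comment above is accurate, the example would be expected to report that /var/lib/gerrit2 is part of / on gerrit1003, while /srv would be reported as its own mount point.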

Reopening since this task is about relocating Gerrit from the root partition to /srv:

From the incident analysis at T323262#8435688 we should relocate Gerrit installation from /var/lib/gerrit2/review_site to /srv/gerrit.

The incident was the H2 database (stored in db directory) overflowing the root partition. There are other growing directories, notably the disk based caches in cache and Lucene search indices in index.

sudo du -m -d1 /var/lib/gerrit2/review_site
...
216	/var/lib/gerrit2/review_site/data
7358	/var/lib/gerrit2/review_site/cache
6973	/var/lib/gerrit2/review_site/index
859	/var/lib/gerrit2/review_site/db
15411	/var/lib/gerrit2/review_site
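To keep an eye on the growing directories named above (db, cache, index), a sketch along these lines sums their sizes and prints the space still available on the filesystem that holds the site. site_growth_check is a hypothetical helper, not an existing tool, and it assumes POSIX du -k and df -Pk output with a mount point that contains no spaces.

```shell
# site_growth_check (hypothetical helper): sum the sizes of the growing
# directories under a Gerrit site and print the total next to the space
# still available on the filesystem holding the site.
site_growth_check() {
    site=$1
    total_kb=0
    for d in db cache index; do
        # Skip directories that do not exist on this host.
        [ -d "$site/$d" ] || continue
        kb=$(du -sk "$site/$d" | awk '{print $1}')
        total_kb=$((total_kb + kb))
    done
    # Fourth column of `df -Pk` is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$site" | awk 'NR == 2 {print $4}')
    echo "growing dirs: $((total_kb / 1024)) MB, available: $((avail_kb / 1024)) MB"
}

# Example, to be run on a Gerrit host:
#   sudo sh -c '. ./site_growth_check.sh; site_growth_check /var/lib/gerrit2/review_site'
```

The file name site_growth_check.sh in the example is likewise an assumption; the function can be pasted into any shell with read access to the site directory.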

OK, it definitely doesn't seem "High" to me though, since there is lots of space, no more lfs data in /, and no longer a separate partition for /var/lib/gerrit2.

If we wanted to move more than the lfs data to /srv/, I wish we had done that on the new server just recently.

Dzahn removed Dzahn as the assignee of this task. Jun 15 2023, 4:06 PM
Dzahn lowered the priority of this task from High to Low.

The /var/lib/gerrit2/review_site/db directory is only 1 GB with all the reviews so far. There are 22GB free. That makes me think this will not be a real problem any time in the next decade or so. (?)

Dzahn raised the priority of this task from Low to Medium. Jun 15 2023, 4:08 PM

Well, to be fair, the cache and index are a bit bigger:

root@gerrit1003:/var/lib/gerrit2/review_site# du -hs cache/
7.2G	cache/
root@gerrit1003:/var/lib/gerrit2/review_site# du -hs index/
6.9G	index/

Change 908604 merged by Jbond:

[operations/puppet@production] gerrit: remove duplicate $gerrit_site definition

https://gerrit.wikimedia.org/r/908604