
gerrit1001 running out of space on /
Closed, ResolvedPublic

Description

Following the Gerrit upgrade (T307334), gerrit1001 ran out of space on /. As a result, any write action (submitting with +2, rebasing, removing a +2 vote) failed with a modal 500 Internal Server Error.

To mitigate, a new volume was created, /var/lib/gerrit2 was copied to it, and the volume was mounted in its place.

Follow up actions:

RelEng

  • Clean up the h2 cache (RelEng) - T323754
  • And drop obsolete h2 cache files (RelEng) - T324771
  • Clean up /srv/ on gerrit1001 (it holds many copies of repositories)
  • Add a step to the Gerrit upgrade procedure to check for disk space (RelEng)
  • Adjust the upgrade documentation to no longer disable all monitoring probes (RelEng)
  • Write incident report https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade (Antoine, RelEng)
  • Prepare a Gerrit overview presentation for SRE (Antoine)

SRE:

  • Bring primary and replica in sync configuration-wise (SRE)
    • summarize disk stuff (Partman recipe etc.) on the task
  • Verify alerting to see if we should have been warned in advance (SRE)

Together?

  • Introduce regular cache clean-up
  • Test failover (and plan regular failovers)

Event Timeline

Largest H2 files:

-rw-r--r--  1 gerrit2 gerrit2 8.2G Nov 17 11:47 git_file_diff.h2.db
-rw-r--r--  1 gerrit2 gerrit2  12G Nov 17 11:47 gerrit_file_diff.h2.db

Then gerrit show-caches reports both caches as using about 128m of disk:

  Name                          |Entries              |  AvgGet |Hit Ratio|
                                |   Mem   Disk   Space|         |Mem  Disk|
--------------------------------+---------------------+---------+---------+
D gerrit_file_diff              |   159 109940 128.13m| 124.0ms |  5%     |
D git_file_diff                 |   108 108286 128.08m| 114.0ms |  0%     |

I suspect the H2 files need to be defragmented/vacuumed from time to time. Alternatively we can remove the files on disk and let them be repopulated from scratch.
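
For reference, a minimal sketch of how the caches can be inspected and flushed through Gerrit's SSH admin interface (assuming admin rights on port 29418; cache name taken from the output above), without touching the H2 files directly:

# Inspect per-cache memory/disk usage (same output as quoted above).
ssh -p 29418 gerrit.wikimedia.org gerrit show-caches

# Flush a single persistent cache so Gerrit repopulates it on demand.
ssh -p 29418 gerrit.wikimedia.org gerrit flush-caches --cache git_file_diff

Note that flushing alone may not shrink the .h2.db files on disk, hence the defragment/remove discussion above.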

LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.
jcrespo subscribed.

If an incident report is planned to be written, let me add the corresponding tag, for tracking purposes only (I'm on clinic duty this week).

I have marked with Sustainability (Incident Followup) and SRE-OnFire based on the incident report template.

LSobanski updated the task description.

SRE:

  • Bring primary and replica in sync configuration-wise (SRE)
    • summarize disk stuff (Partman recipe etc.) on the task

Currently gerrit1001 and gerrit2002 use different partitioning schemes, as the LVM setup on gerrit1001 was altered to fix the incident (thanks @Joe and @Clement_Goubert!). Depending on some more investigation of the cache we might want to persist the current partitioning scheme (or one with more space for /srv) with partman and also use it on gerrit2002.

@Dzahn what are your thoughts on reimaging gerrit2002 with a new partman config? Are there a lot of manual steps needed to configure Gerrit as a replica again, or is this all handled by Puppet?

thcipriani subscribed.

(meta note: tagging with the weird-for-this-task tag: Release-Engineering-Team (GitLab III: GitLab in LA 🪃) because that's our current sprint. @hashar and other Release-Engineering-Team folks will do some followups over the next few weeks during this sprint)

@Dzahn what are your thoughts on reimaging gerrit2002 with a new partman config? Are there a lot of manual steps needed to configure Gerrit as a replica again, or is this all handled by Puppet?

I looked at tickets like T313250 and T243027; if the host name stays the same, we just do a reimage, and we can accept downtime of gerrit-replica for as long as the reimage cookbook takes, then the effort should be relatively low.

LVM/disk setup

I looked at tickets like T313250 and T243027; if the host name stays the same, we just do a reimage, and we can accept downtime of gerrit-replica for as long as the reimage cookbook takes, then the effort should be relatively low.

Ok, thanks for the estimate! Once we have feedback from RelEng regarding the cleanup of the directories (h2 and /srv/) we will know whether we still need a dedicated volume for /var/lib/gerrit2. If we do, a dedicated partman config and a reimage would also be necessary for the new Gerrit host. Otherwise we can roll back the change on gerrit1001 and keep using the default partman config.
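
As a starting point for the "summarize disk stuff" action item, a rough sketch of commands one could run to capture and compare the current layout on both hosts (hostnames and mount points assumed from this task):

# Capture block devices, LVM logical volumes and the relevant mount points
# on the primary and the replica so the two layouts can be diffed.
for host in gerrit1001.wikimedia.org gerrit2002.wikimedia.org; do
  echo "== ${host} =="
  ssh "${host}" 'lsblk; sudo lvs; sudo df -h / /srv /var/lib/gerrit2'
done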

Alerting

Regarding:

  • Verify alerting to see if we should have been warned in advance (SRE)

I grepped through the #wikimedia-operations logs for alerts on gerrit1001 and gerrit.wikimedia.org. It seems the first alerts appeared during the ongoing incident, after the downtime expired. As far as I can see there were no active alerts before the upgrade or in the days leading up to it. Maybe there was a warning in Alertmanager regarding disk space; I'll try to find that history somehow. Alerts from Icinga on IRC:

[09:57:11] <icinga-wm>   PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:59:55] <icinga-wm>   PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit
[11:01:53] <icinga-wm>   RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[11:02:19] <icinga-wm>   RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.039 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:15:16] <godog>       yeah gerrit1001 isn't on that unfortunately yet, it'll be of course eventually
[11:19:47] <icinga-wm>   PROBLEM - Disk space on gerrit1001 is CRITICAL: DISK CRITICAL - free space: / 1321 MB (2% inode=91%): /tmp 1321 MB (2% inode=91%): /var/tmp 1321 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops
[11:37:11] <icinga-wm>   PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit
[11:37:39] <icinga-wm>   PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1532 bytes in 0.007 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:37:55] <icinga-wm>   PROBLEM - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:38:31] <icinga-wm>   PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:38:35] <icinga-wm>   PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The following units failed: gerrit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:47] <icinga-wm>   RECOVERY - gerrit process on gerrit1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:46:25] <icinga-wm>   RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 53356 bytes in 0.066 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:46:29] <icinga-wm>   RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:01] <icinga-wm>   RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[11:47:29] <icinga-wm>   RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.037 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[12:01:41] <icinga-wm>   RECOVERY - Disk space on gerrit1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops

This also relates to:

  • Adjust the upgrade documentation to no longer disable all monitoring probes (RelEng)

Currently the Gerrit upgrade documentation states:

1) Schedule downtime for Gerrit hosts to avoid notifications

    https://icinga.wikimedia.org/cgi-bin/icinga/cmd.cgi?cmd_typ=86&host=gerrit1001
    https://icinga.wikimedia.org/cgi-bin/icinga/cmd.cgi?cmd_typ=86&host=gerrit2002

This disables quite a lot of alerts for the hosts (including the disk space alert). So I went through all of the Icinga checks which would alert during a normal upgrade. I found:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=gerrit.wikimedia.org (checks for gerrit.wikimedia.org)
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gerrit+process (gerrit process)
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=SSH+access (client ssh access)
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit1001&service=HTTPS (https)
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit1001&service=puppet+last+run (puppet run)

Downtiming each service individually seems like quite a lot of work. Also, downtiming a complete host for maintenance is quite common at the moment. So we either need a script to downtime these services, link to the Icinga pages in the upgrade docs, or just accept noise during the upgrade. I'm not sure what the best approach is here. I'd favor just downtiming https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=gerrit.wikimedia.org (checks for gerrit.wikimedia.org) and getting alerts for the rest during an upgrade?
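
If we go the script route, a hedged sketch of what it could look like using Icinga's external command file (the command-file path, author name and the exact host/service pairs are assumptions to be checked against our Icinga setup; service names taken from the IRC log above):

#!/bin/bash
# Schedule a fixed 2h downtime for only the Gerrit-specific service checks,
# leaving host-level checks such as "Disk space" active.
CMDFILE=/var/lib/icinga/rw/icinga.cmd   # assumed path on the Icinga host
NOW=$(date +%s)
DURATION=7200
# Each entry is "host;service" and is substituted as-is into the command.
for entry in \
  "gerrit.wikimedia.org;Gerrit Health Check" \
  "gerrit.wikimedia.org;Gerrit JSON" \
  "gerrit1001;gerrit process" \
  "gerrit1001;SSH access"; do
  printf '[%d] SCHEDULE_SVC_DOWNTIME;%s;%d;%d;1;0;%d;gerrit-upgrade;Gerrit upgrade\n' \
    "$NOW" "$entry" "$NOW" "$((NOW + DURATION))" "$DURATION" >> "$CMDFILE"
done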

Sorry for the lack of updates. I did dig into the Gerrit caches, which are backed by H2 databases. Some findings are at T323754, and it looks like the easiest path is to empty the caches and let Gerrit regenerate them. That would include deleting the web session cache, which would log out every single user; I don't think that is a problem provided some announcement is made before the operation. I will look at scheduling that at some point next week.
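
In practice, that would boil down to something along these lines (a sketch with paths from this task; the exact list of files to delete is what T323754 should confirm):

# Stop Gerrit, drop the persistent cache databases (including the web session
# cache, which logs every user out), then start Gerrit so it rebuilds them.
sudo systemctl stop gerrit
sudo -u gerrit2 rm -v /var/lib/gerrit2/review_site/cache/*.h2.db
sudo systemctl start gerrit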

@Jelto thank you for the excellent analysis of the monitoring. I will look at integrating it into the deployment runbook we have at https://wikitech.wikimedia.org/wiki/Gerrit/Upgrade#Deploying

@Dzahn what are your thoughts on reimaging gerrit2002 with a new partman config? Are there a lot of manual steps needed to configure Gerrit as a replica again, or is this all handled by Puppet?

I looked at tickets like T313250 and T243027; if the host name stays the same, we just do a reimage, and we can accept downtime of gerrit-replica for as long as the reimage cookbook takes, then the effort should be relatively low.

The Gerrit replica is a production service. It serves repositories at https://gerrit-replica.wikimedia.org/ and is used by most of our bots, so a reimage would be slightly disruptive.

For the disk partitioning, I am tempted to move Gerrit's varying data from the root partition to /srv (where the Git repositories are). Besides the cache directory which caused the problem, we also have a Lucene index in index (7G) and another database in db (1G) (which needs compacting; I will look at it in sub task T323754). I can see a few scenarios:

A) Use symbolic links

Do not change much besides moving directories from /var/lib/gerrit2/review_site to /srv/gerrit and leaving symbolic links behind:

Source                               Destination
/var/lib/gerrit2/review_site/db      /srv/gerrit/db
/var/lib/gerrit2/review_site/cache   /srv/gerrit/cache
/var/lib/gerrit2/review_site/index   /srv/gerrit/index

Then remove the /var/lib/gerrit2 partition (move the files to some other place, remove the partition, move the files back to /var/lib/gerrit2).
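
A rough sketch of the move-and-symlink step under option A (directory names from the table above; ownership and exact service handling would need to be confirmed):

# Move the varying data to /srv/gerrit and leave symlinks behind, with
# Gerrit stopped so the databases and index are not written to mid-move.
sudo systemctl stop gerrit
sudo install -d -o gerrit2 -g gerrit2 /srv/gerrit
for d in db cache index; do
  sudo mv "/var/lib/gerrit2/review_site/${d}" "/srv/gerrit/${d}"
  sudo ln -s "/srv/gerrit/${d}" "/var/lib/gerrit2/review_site/${d}"
  sudo chown -h gerrit2:gerrit2 "/var/lib/gerrit2/review_site/${d}"
done
sudo systemctl start gerrit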

B) Preferred: relocate the Gerrit installation from /var/lib/gerrit2/review_site to /srv/gerrit

That involves a lot more changes to Puppet and moving a bunch of files around. In the end we would get all of Gerrit in the same place on a single partition, which might be less confusing.

C) Use two partitions

  • /srv for the Git repositories
  • /var/lib/gerrit2/review_site partition for Gerrit itself (current state of gerrit1001)

The advantage is there is no change in Puppet, but it might be confusing.

I am in favor of relocating all of Gerrit to /srv/gerrit (B): albeit it has the cost of Puppet changes and a migration, the resulting state would be easier to deal with in the future.

Thanks @hashar for the h2 cleanup. The filesystem usage on /var/lib/gerrit2 went down from 75% to 34%. Currently 15GB are used out of 49GB on that partition, which sounds good and gives us enough room for migrations and to operate for some time (see grafana):

2022-11-17_Gerrit_3.5_upgrade-h2-cleanup.png (821×1 px, 80 KB)

And thanks for coming up with the three options for the future folder and partition layout. I would also favor option B. It should be the cleanest solution and avoid adding technical debt (like custom partman configs or symlinks).
This would mean we have to revert gerrit1001 back to the default partitioning scheme (/ plus the rest in /srv). For gerrit2002 and future Gerrit hosts no config change is required with regard to partitioning.

I'm happy to help set up the required Puppet changes and plan the migration.

I also updated the task description and checked some of the checkboxes. I also created a draft of the updated docs on Wikitech. Feel free to review them: https://wikitech.wikimedia.org/w/index.php?title=Gerrit%2FUpgrade&diff=2042063&oldid=2028111

Thanks @Jelto

I will look at preparing patches for option B (relocate Gerrit installation from /var/lib/gerrit2/review_site to /srv/gerrit).

LSobanski lowered the priority of this task from High to Medium.

Change 881359 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: split user and application directories

https://gerrit.wikimedia.org/r/881359

Change 881359 merged by Dzahn:

[operations/puppet@production] gerrit: split user and application directories

https://gerrit.wikimedia.org/r/881359

I believe this was only left open waiting for the incident report to be written, which happened a long time ago, but RelEng or Dzahn to confirm?

@hashar @Jelto Do you have any comments on the question by Volans above?

The incident report is at https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade

The Sustainability (Incident Followup) action is:

B) Relocate the Gerrit installation from /var/lib/gerrit2/review_site to /srv/gerrit

That involves a lot more changes to Puppet and moving a bunch of files around. In the end we would get all of Gerrit in the same place on a single partition, which might be less confusing.

I have to write the migration runbook describing the current and target situation and all the steps necessary for the transition, and write the Puppet patches.

That clashed with a few related projects:

  • overhaul how Gerrit is deployed
  • change the application user from gerrit2 to gerrit
  • prepare the Gerrit 3.5 → 3.6 → 3.7 upgrades

The dev instance I used to test those migrations got shut down (T330312). So now I have to overhaul the way Gerrit authenticates in the WMCS context.

Gerrit itself is fine, but there is a long tail of actions to realign it with all data being on /srv/ and to drop the temporary /var/lib/gerrit2 partition.

I guess we can drop the SRE-OnFire tag?

I guess we can drop the SRE-OnFire tag?

Hashar: alternatively, this could be closed as per the title's scope, and the remaining work could be filed as a separate task. There is no perfect way, but lately we are usually separating the incident task (and immediate mitigation actionables) from longer-term actionables, which, from your words, these have become.

Hashar: alternatively, this could be closed as per the title's scope, and the remaining work could be filed as a separate task. There is no perfect way, but lately we are usually separating the incident task (and immediate mitigation actionables) from longer-term actionables, which, from your words, these have become.

Sure, I will craft a new task tomorrow with a summary of the actions that need to be accomplished and link it from the incident report as a follow-up actionable.

hashar assigned this task to Clement_Goubert.
hashar added a subscriber: Clement_Goubert.

I have finally filed the follow-up task: T333143: Move Gerrit data out of root partition

Marking this one as closed per T323262#8710550 above and assigning to @Clement_Goubert who did the emergency fix.