
gerrit1001 running out of space on /
Closed, ResolvedPublic

Description

Following the Gerrit upgrade (T307334), gerrit1001 ran out of space on /. As a result, any write action (submitting with +2, rebasing, removing a +2 vote) failed with a modal 500 Internal Server Error.

To mitigate, a new volume was created, /var/lib/gerrit2 was copied to it, and the volume was mounted in its place.

Follow up actions:

RelEng

  • Clean up the h2 cache (RelEng) - T323754
  • And drop obsolete h2 cache files (RelEng) - T324771
  • Clean up /srv/ on gerrit1001 (it holds many copies of repositories)
  • Add a step to the Gerrit upgrade procedure to check for disk space (RelEng)
  • Adjust the upgrade documentation to no longer disable all monitoring probes (RelEng)
  • Write incident report https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade (Antoine, RelEng)
  • Prepare a Gerrit overview presentation for SRE (Antoine)

SRE:

  • Bring primary and replica in sync configuration-wise (SRE)
    • summarize disk stuff (Partman recipe etc.) on the task
  • Verify alerting to see if we should have been warned in advance (SRE)

Together?

  • Introduce regular cache clean-up
  • Test failover (and plan regular failovers)

Event Timeline

Largest H2 files:

-rw-r--r--  1 gerrit2 gerrit2 8.2G Nov 17 11:47 git_file_diff.h2.db
-rw-r--r--  1 gerrit2 gerrit2  12G Nov 17 11:47 gerrit_file_diff.h2.db

Then gerrit show-caches reports both caches as using about 128m of disk:

  Name                          |Entries              |  AvgGet |Hit Ratio|
                                |   Mem   Disk   Space|         |Mem  Disk|
--------------------------------+---------------------+---------+---------+
D gerrit_file_diff              |   159 109940 128.13m| 124.0ms |  5%     |
D git_file_diff                 |   108 108286 128.08m| 114.0ms |  0%     |

I suspect the H2 files need to be defragmented/vacuumed from time to time. Alternatively we can remove the files on disk and let them be repopulated from scratch.
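
For reference, a minimal sketch of how the caches can be inspected and flushed through Gerrit's SSH admin interface (assuming admin rights on port 29418; cache name taken from the output above), without touching the H2 files directly:

# Inspect per-cache memory/disk usage (same output as quoted above).
ssh -p 29418 gerrit.wikimedia.org gerrit show-caches

# Flush a single persistent cache so Gerrit repopulates it on demand.
ssh -p 29418 gerrit.wikimedia.org gerrit flush-caches --cache git_file_diff

Note that flushing alone may not shrink the .h2.db files on disk, hence the defragment/remove discussion above.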

LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.
jcrespo subscribed.

If an incident report is planned to be written, let me add the corresponding tag, for tracking purposes only (I'm on clinic duty this week).

I have marked with Sustainability (Incident Followup) and SRE-OnFire based on the incident report template.

LSobanski updated the task description.

SRE:

  • Bring primary and replica in sync configuration-wise (SRE)
    • summarize disk stuff (Partman recipe etc.) on the task

Currently gerrit1001 and gerrit2002 use different partitioning schemes, as the LVM setup on gerrit1001 was altered to fix the incident (thanks @Joe and @Clement_Goubert!). Depending on some more investigation of the cache we might want to persist the current partitioning scheme (or one with more space for /srv) with partman and also use it on gerrit2002.

@Dzahn what are your thoughts on reimaging gerrit2002 with a new partman config? Are there a lot of manual steps needed to configure Gerrit as a replica again, or is this all handled by Puppet?

thcipriani subscribed.

(meta note: tagging with the weird-for-this-task tag: Release-Engineering-Team (GitLab III: GitLab in LA 🪃) because that's our current sprint. @hashar and other Release-Engineering-Team folks will do some followups over the next few weeks during this sprint)

@Dzahn what are your thoughts on reimaging gerrit2002 with a new partman config? Are there a lot of manual steps needed to configure Gerrit as a replica again, or is this all handled by Puppet?

I looked at tickets like T313250 and T243027; if the host name stays the same, we just do a reimage, and we can accept downtime of gerrit-replica for as long as the reimage cookbook takes, then the effort should be relatively low.

LVM/disk setup

I looked at tickets like T313250 and T243027; if the host name stays the same, we just do a reimage, and we can accept downtime of gerrit-replica for as long as the reimage cookbook takes, then the effort should be relatively low.

Ok, thanks for the estimate! Once we have feedback from RelEng regarding the cleanup of the directories (h2 and /srv/) we will know whether we still need a dedicated volume for /var/lib/gerrit2. If we do, a dedicated partman config and a reimage would also be necessary for the new Gerrit host. Otherwise we can roll back the change on gerrit1001 and keep using the default partman config.
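
As a starting point for the "summarize disk stuff" action item, a rough sketch of commands one could run to capture and compare the current layout on both hosts (hostnames and mount points assumed from this task):

# Capture block devices, LVM logical volumes and the relevant mount points
# on the primary and the replica so the two layouts can be diffed.
for host in gerrit1001.wikimedia.org gerrit2002.wikimedia.org; do
  echo "== ${host} =="
  ssh "${host}" 'lsblk; sudo lvs; sudo df -h / /srv /var/lib/gerrit2'
done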

Alerting

Regarding:

  • Verify alerting to see if we should have been warned in advance (SRE)

I grepped through the #wikimedia-operations logs for alerts on gerrit1001 and gerrit.wikimedia.org. It seems the first alerts appeared during the ongoing incident, after the downtime expired. As far as I can see there were no active alerts before the upgrade or in the days leading up to it. Maybe there was a warning in Alertmanager regarding disk space; I'll try to find that history somehow. Alerts from Icinga on IRC:

[09:57:11] <icinga-wm>   PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:59:55] <icinga-wm>   PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit
[11:01:53] <icinga-wm>   RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[11:02:19] <icinga-wm>   RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.039 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:15:16] <godog>       yeah gerrit1001 isn't on that unfortunately yet, it'll be of course eventually
[11:19:47] <icinga-wm>   PROBLEM - Disk space on gerrit1001 is CRITICAL: DISK CRITICAL - free space: / 1321 MB (2% inode=91%): /tmp 1321 MB (2% inode=91%): /var/tmp 1321 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops
[11:37:11] <icinga-wm>   PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit
[11:37:39] <icinga-wm>   PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1532 bytes in 0.007 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:37:55] <icinga-wm>   PROBLEM - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:38:31] <icinga-wm>   PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:38:35] <icinga-wm>   PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The following units failed: gerrit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:47] <icinga-wm>   RECOVERY - gerrit process on gerrit1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:46:25] <icinga-wm>   RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 53356 bytes in 0.066 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:46:29] <icinga-wm>   RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:01] <icinga-wm>   RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[11:47:29] <icinga-wm>   RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.037 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[12:01:41] <icinga-wm>   RECOVERY - Disk space on gerrit1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops

This also relates to:

  • Adjust the upgrade documentation to no longer disable all monitoring probes (RelEng)

Currently the Gerrit upgrade documentation states:

1) Schedule downtime for Gerrit hosts to avoid notifications

    https://icinga.wikimedia.org/cgi-bin/icinga/cmd.cgi?cmd_typ=86&host=gerrit1001
    https://icinga.wikimedia.org/cgi-bin/icinga/cmd.cgi?cmd_typ=86&host=gerrit2002

This disables quite a lot of alerts for the hosts (including the disk space alert). So I went through all of the Icinga checks which would alert during a normal upgrade. I found:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=gerrit.wikimedia.org (checks for gerrit.wikimedia.org)
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gerrit+process (gerrit process)
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=SSH+access (client ssh access)
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit1001&service=HTTPS (https)
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit1001&service=puppet+last+run (puppet run)

Downtiming each service individually seems like quite a lot of work. Also, downtiming a complete host for maintenance is quite common at the moment. So we either need a script to downtime these services, link to the Icinga pages in the upgrade docs, or just accept noise during the upgrade. I'm not sure what the best approach is here. I'd favor just downtiming https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=gerrit.wikimedia.org (checks for gerrit.wikimedia.org) and getting alerts for the rest during an upgrade?
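
If we go the script route, a hedged sketch of what it could look like using Icinga's external command file (the command-file path, author name and the exact host/service pairs are assumptions to be checked against our Icinga setup; service names taken from the IRC log above):

#!/bin/bash
# Schedule a fixed 2h downtime for only the Gerrit-specific service checks,
# leaving host-level checks such as "Disk space" active.
CMDFILE=/var/lib/icinga/rw/icinga.cmd   # assumed path on the Icinga host
NOW=$(date +%s)
DURATION=7200
# Each entry is "host;service" and is substituted as-is into the command.
for entry in \
  "gerrit.wikimedia.org;Gerrit Health Check" \
  "gerrit.wikimedia.org;Gerrit JSON" \
  "gerrit1001;gerrit process" \
  "gerrit1001;SSH access"; do
  printf '[%d] SCHEDULE_SVC_DOWNTIME;%s;%d;%d;1;0;%d;gerrit-upgrade;Gerrit upgrade\n' \
    "$NOW" "$entry" "$NOW" "$((NOW + DURATION))" "$DURATION" >> "$CMDFILE"
done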

Sorry for the lack of updates. I did dig into the Gerrit caches, which are backed by H2 databases. Some findings are at T323754, and it looks like the easiest path is to empty the caches and let Gerrit regenerate them. That would include deleting the web session cache, which would log out every single user; I don't think that is a problem provided some announcement is made before the operation. I will look at scheduling that at some point next week.
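
In practice, that would boil down to something along these lines (a sketch with paths from this task; the exact list of files to delete is what T323754 should confirm):

# Stop Gerrit, drop the persistent cache databases (including the web session
# cache, which logs every user out), then start Gerrit so it rebuilds them.
sudo systemctl stop gerrit
sudo -u gerrit2 rm -v /var/lib/gerrit2/review_site/cache/*.h2.db
sudo systemctl start gerrit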

@Jelto thank you for the excellent analysis of the monitoring. I will look at integrating it into the deployment runbook we have at https://wikitech.wikimedia.org/wiki/Gerrit/Upgrade#Deploying

@Dzahn what are your thoughts on reimaging gerrit2002 with a new partman config? Are there a lot of manual steps needed to configure Gerrit as a replica again, or is this all handled by Puppet?

I looked at tickets like T313250 and T243027; if the host name stays the same, we just do a reimage, and we can accept downtime of gerrit-replica for as long as the reimage cookbook takes, then the effort should be relatively low.

The Gerrit replica is a production service. It serves repositories at https://gerrit-replica.wikimedia.org/ and is used by most of our bots, so a reimage would be slightly disruptive.

For the disk partitioning, I am tempted to move Gerrit's varying data from the root partition to /srv (where the Git repositories are). Besides the cache directory which caused the problem, we also have a Lucene index in index (7G) and another database in db (1G) (which needs compacting; I will look at it in sub task T323754). I can see a few scenarios:

A) Use symbolic links

Do not change much besides moving directories from /var/lib/gerrit2/review_site to /srv/gerrit and leaving symbolic links behind:

Source                               Destination
/var/lib/gerrit2/review_site/db      /srv/gerrit/db
/var/lib/gerrit2/review_site/cache   /srv/gerrit/cache
/var/lib/gerrit2/review_site/index   /srv/gerrit/index

Then remove the /var/lib/gerrit2 partition (move the files to some other place, remove the partition, move the files back to /var/lib/gerrit2).
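
A rough sketch of the move-and-symlink step under option A (directory names from the table above; ownership and exact service handling would need to be confirmed):

# Move the varying data to /srv/gerrit and leave symlinks behind, with
# Gerrit stopped so the databases and index are not written to mid-move.
sudo systemctl stop gerrit
sudo install -d -o gerrit2 -g gerrit2 /srv/gerrit
for d in db cache index; do
  sudo mv "/var/lib/gerrit2/review_site/${d}" "/srv/gerrit/${d}"
  sudo ln -s "/srv/gerrit/${d}" "/var/lib/gerrit2/review_site/${d}"
  sudo chown -h gerrit2:gerrit2 "/var/lib/gerrit2/review_site/${d}"
done
sudo systemctl start gerrit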

B) Preferred: relocate the Gerrit installation from /var/lib/gerrit2/review_site to /srv/gerrit

That involves a lot more changes to Puppet and moving a bunch of files around. In the end we would get all of Gerrit in the same place on a single partition, which might be less confusing.

C) Use two partitions

  • /srv for the Git repositories
  • /var/lib/gerrit2/review_site partition for Gerrit itself (current state of gerrit1001)

The advantage is there is no change in Puppet, but it might be confusing.

I am in favor of relocating all of Gerrit to /srv/gerrit (B): albeit it has the cost of Puppet changes and a migration, the resulting state would be easier to deal with in the future.

Thanks @hashar for the h2 cleanup. The filesystem usage on /var/lib/gerrit2 went down from 75% to 34%. Currently 15GB are used out of 49GB on that partition, which sounds good and gives us enough room for migrations and to operate for some time (see grafana):

2022-11-17_Gerrit_3.5_upgrade-h2-cleanup.png (821×1 px, 80 KB)

And thanks for coming up with the three options for the future folder and partition layout. I would also favor option B. It should be the cleanest solution and avoid adding technical debt (like custom partman configs or symlinks).
This would mean we have to revert gerrit1001 back to the default partitioning scheme (/ plus the rest in /srv). For gerrit2002 and future Gerrit hosts no config change is required with regard to partitioning.

I'm happy to help set up the required Puppet changes and plan the migration.

I also updated the task description and checked some of the checkboxes. I also created a draft of the updated docs on Wikitech. Feel free to review them: https://wikitech.wikimedia.org/w/index.php?title=Gerrit%2FUpgrade&diff=2042063&oldid=2028111

Thanks @Jelto

I will look at preparing patches for option B (relocate Gerrit installation from /var/lib/gerrit2/review_site to /srv/gerrit).

LSobanski lowered the priority of this task from High to Medium.

Change 881359 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: split user and application directories

https://gerrit.wikimedia.org/r/881359

Change 881359 merged by Dzahn:

[operations/puppet@production] gerrit: split user and application directories

https://gerrit.wikimedia.org/r/881359

I believe this was only left open waiting for the incident report to be written, which happened a long time ago, but RelEng or Dzahn to confirm?

@hashar @Jelto Do you have any comments on the question by Volans above?

The incident report is at https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade

The Sustainability (Incident Followup) action is:

B) Relocate the Gerrit installation from /var/lib/gerrit2/review_site to /srv/gerrit

That involves a lot more changes to Puppet and moving a bunch of files around. In the end we would get all of Gerrit in the same place on a single partition, which might be less confusing.

I have to write the migration runbook describing the current and target situation and all the steps necessary for the transition, and write the Puppet patches.

That clashed with a few related projects:

  • overhaul how Gerrit is deployed
  • change the application user from gerrit2 to gerrit
  • prepare the Gerrit 3.5 → 3.6 → 3.7 upgrades

The dev instance I used to test those migrations got shut down (T330312). So now I have to overhaul the way Gerrit authenticates in the WMCS context.

Gerrit itself is fine, but there is a long tail of actions to realign it with all data being on /srv/ and to drop the temporary /var/lib/gerrit2 partition.

I guess we can drop the SRE-OnFire tag?

I guess we can drop the SRE-OnFire tag?

Hashar: alternatively, this could be closed as per the title's scope, and the remaining work could be filed as a separate task. There is no perfect way, but lately we are usually separating the incident task (and immediate mitigation actionables) from longer-term actionables, which, from your words, these have become.

Hashar: alternatively, this could be closed as per the title's scope, and the remaining work could be filed as a separate task. There is no perfect way, but lately we are usually separating the incident task (and immediate mitigation actionables) from longer-term actionables, which, from your words, these have become.

Sure, I will craft a new task tomorrow with a summary of the actions that need to be accomplished and link it from the incident report as a follow-up actionable.

hashar assigned this task to Clement_Goubert.
hashar added a subscriber: Clement_Goubert.

I have finally filed the follow-up task: T333143: Move Gerrit data out of root partition

Marking this one as closed per T323262#8710550 above and assigning to @Clement_Goubert who did the emergency fix.