
gerrit1001 running out of space on /
Open, High, Public

Description

Following the Gerrit upgrade (T307334), gerrit1001 ran out of space on /. As a result, any write action (submitting a +2, rebasing, removing a +2 vote) failed with a modal 500 Internal Server Error.

To mitigate, a new volume was created, /var/lib/gerrit2 was copied to it, and the volume was mounted in its place.
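The mitigation above can be sketched as the following sequence. The volume group name, LV size, and mount points are placeholders, not the values actually used on gerrit1001; with DRY_RUN left at its default of 1, the script only prints the commands it would run:

```shell
# Sketch of the mitigation: create a new logical volume, copy the
# Gerrit site onto it, and mount it over /var/lib/gerrit2.
# VG name, LV size, and paths below are placeholders.
run() {
    # Print the command instead of executing it unless DRY_RUN=0.
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run lvcreate -L 100G -n gerrit2 vg0
run mkfs.ext4 /dev/vg0/gerrit2
run mount /dev/vg0/gerrit2 /mnt/gerrit2-new
run systemctl stop gerrit
run rsync -a /var/lib/gerrit2/ /mnt/gerrit2-new/
run umount /mnt/gerrit2-new
run mount /dev/vg0/gerrit2 /var/lib/gerrit2
run systemctl start gerrit
```

With DRY_RUN=0 the same script would execute the commands for real, which of course requires root and the correct device names for the host.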

Follow up actions:

RelEng

  • Clean up the h2 cache (RelEng) - T323754
  • Clean up /srv/ on gerrit1001 (contains many copies of repositories)
  • Add a step to the Gerrit upgrade procedure to check for disk space (RelEng)
  • Adjust the upgrade documentation to no longer disable all monitoring probes (RelEng)
  • Write incident report https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade (Antoine, RelEng)
  • Prepare a Gerrit overview presentation for SRE (Antoine)

SRE:

  • Bring primary and replica in sync configuration-wise (SRE)
    • summarize disk stuff (Partman recipe etc.) on the task
  • Verify alerting to see if we should have been warned in advance (SRE)

Together?

  • Introduce regular cache clean-up
  • Test failover (and plan regular failovers)

Event Timeline

Largest H2 files:

-rw-r--r--  1 gerrit2 gerrit2 8.2G Nov 17 11:47 git_file_diff.h2.db
-rw-r--r--  1 gerrit2 gerrit2  12G Nov 17 11:47 gerrit_file_diff.h2.db

Then gerrit show-caches reports both caches using 128m of disk:

  Name                          |Entries              |  AvgGet |Hit Ratio|
                                |   Mem   Disk   Space|         |Mem  Disk|
--------------------------------+---------------------+---------+---------+
D gerrit_file_diff              |   159 109940 128.13m| 124.0ms |  5%     |
D git_file_diff                 |   108 108286 128.08m| 114.0ms |  0%     |
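The disk-cache figures above come from Gerrit's SSH admin interface (ssh -p 29418 to the host, then gerrit show-caches). A small filter for the disk-backed ("D") caches, fed here with the two sample rows from above rather than live output:

```shell
# Sample of the two disk-backed cache rows from `gerrit show-caches`.
gerrit_show_caches_sample() {
    cat <<'EOF'
D gerrit_file_diff              |   159 109940 128.13m| 124.0ms |  5%     |
D git_file_diff                 |   108 108286 128.08m| 114.0ms |  0%     |
EOF
}

# Print "name space" for every disk-backed cache: take the second
# pipe-delimited column (Entries: Mem Disk Space) and keep its last field.
gerrit_show_caches_sample | awk '$1 == "D" { split($0, f, "|"); n = split(f[2], s, " "); print $2, s[n] }'
# Prints:
#   gerrit_file_diff 128.13m
#   git_file_diff 128.08m
```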

I suspect the H2 files need to be defragmented/vacuumed from time to time. Alternatively, we can remove the files on disk and let them be repopulated from scratch.
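The "remove and repopulate" option can be sketched like this. The cache layout is the usual review_site/cache/*.h2.db one; here SITE points at a throwaway directory created for the demonstration, not the live installation:

```shell
# Sketch of the "remove the cache files and let Gerrit repopulate
# them" option. Gerrit's persistent caches live as H2 database files
# under review_site/cache/; with the daemon stopped they can simply
# be deleted. SITE is a throwaway directory, not the live site.
SITE=$(mktemp -d)
mkdir -p "$SITE/cache"
touch "$SITE/cache/git_file_diff.h2.db" "$SITE/cache/gerrit_file_diff.h2.db"

# On the real host this would be preceded by: systemctl stop gerrit
rm -f "$SITE"/cache/*.h2.db
# ...and followed by: systemctl start gerrit

ls "$SITE/cache" | wc -l   # prints 0
```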

LSobanski moved this task from Incoming to Work in Progress on the serviceops-collab board.
jcrespo added a subscriber: jcrespo.

If an incident report is planned to be written, let me add the corresponding tag for tracking purposes only (I'm on clinic duty this week).

I have marked with Sustainability (Incident Followup) and SRE-OnFire based on the incident report template.

LSobanski updated the task description.

SRE:

  • Bring primary and replica in sync configuration-wise (SRE)
    • summarize disk stuff (Partman recipe etc.) on the task

Currently gerrit1001 and gerrit2002 use different partitioning schemas, as the LVM setup on gerrit1001 was altered to fix the incident (thanks @Joe and @Clement_Goubert!). Depending on further investigation of the cache, we might want to persist the current partitioning schema (or one with more space for /srv) with partman and also use it on gerrit2002.
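For illustration only, a partman expert recipe persisting a separate /var/lib/gerrit2 volume might look roughly like the following. The sizes and layout are made up for this sketch and would need to match the actual scheme chosen; the real recipe would live alongside the other Wikimedia partman recipes in Puppet:

```
d-i partman-auto/method string lvm
d-i partman-auto/expert_recipe string \
    gerrit ::                                          \
        50000 50000 50000 ext4                         \
            $lvmok{ } method{ format } format{ }       \
            use_filesystem{ } filesystem{ ext4 }       \
            mountpoint{ / } .                          \
        100000 100000 -1 ext4                          \
            $lvmok{ } method{ format } format{ }       \
            use_filesystem{ } filesystem{ ext4 }       \
            mountpoint{ /var/lib/gerrit2 } .
```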

@Dzahn what are your thoughts on reimaging gerrit2002 with a new partman config? Are there a lot of manual steps needed to configure Gerrit as a replica again, or is this all handled by Puppet?

thcipriani added a subscriber: thcipriani.

(meta note: tagging with the weird-for-this-task tag: Release-Engineering-Team (GitLab III: GitLab in LA 🪃) because that's our current sprint. @hashar and other Release-Engineering-Team folks will do some followups over the next few weeks during this sprint)

@Dzahn what are your thoughts on reimaging gerrit2002 with a new partman config? Are there a lot of manual steps needed to configure Gerrit as a replica again, or is this all handled by Puppet?

I looked at tickets like T313250 and T243027. If the host name stays the same, we just do a reimage, and we can accept downtime of gerrit-replica for as long as the reimage cookbook takes, then the effort should be relatively low.

LVM/disk setup

I looked at tickets like T313250 and T243027. If the host name stays the same, we just do a reimage, and we can accept downtime of gerrit-replica for as long as the reimage cookbook takes, then the effort should be relatively low.

Ok, thanks for the estimate! Once we have feedback from RelEng regarding the cleanup of directories (h2 and /srv/), we will know whether we still need a dedicated volume for /var/lib/gerrit2. If so, a dedicated partman config and reimage would also be necessary for the new Gerrit host. Otherwise, we can roll back the change on gerrit1001 and keep using the default partman config.

Alerting

Regarding:

  • Verify alerting to see if we should have been warned in advance (SRE)

I grepped through the #wikimedia-operations logs for alerts on gerrit1001 and gerrit.wikimedia.org. It seems the first alerts appeared during the ongoing incident, after the downtime expired. As far as I can see, there were no active alerts before the upgrade or in the days leading up to it. Maybe there was a warning in Alertmanager regarding disk space; I'll try to find that history somehow. Alerts from Icinga on IRC:

[09:57:11] <icinga-wm>   PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:59:55] <icinga-wm>   PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit
[11:01:53] <icinga-wm>   RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[11:02:19] <icinga-wm>   RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.039 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:15:16] <godog>       yeah gerrit1001 isn't on that unfortunately yet, it'll be of course eventually
[11:19:47] <icinga-wm>   PROBLEM - Disk space on gerrit1001 is CRITICAL: DISK CRITICAL - free space: / 1321 MB (2% inode=91%): /tmp 1321 MB (2% inode=91%): /var/tmp 1321 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops
[11:37:11] <icinga-wm>   PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit
[11:37:39] <icinga-wm>   PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1532 bytes in 0.007 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:37:55] <icinga-wm>   PROBLEM - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:38:31] <icinga-wm>   PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:38:35] <icinga-wm>   PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The following units failed: gerrit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:47] <icinga-wm>   RECOVERY - gerrit process on gerrit1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[11:46:25] <icinga-wm>   RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 53356 bytes in 0.066 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[11:46:29] <icinga-wm>   RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:01] <icinga-wm>   RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[11:47:29] <icinga-wm>   RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.037 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[12:01:41] <icinga-wm>   RECOVERY - Disk space on gerrit1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops
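Excerpts like the above can be pulled from the channel logs with a filter along these lines; the sample here uses two lines from the excerpt rather than the full log files:

```shell
# Two sample lines from the #wikimedia-operations log excerpt above.
irc_log_sample() {
    cat <<'EOF'
[11:19:47] <icinga-wm>   PROBLEM - Disk space on gerrit1001 is CRITICAL: DISK CRITICAL - free space: / 1321 MB (2% inode=91%)
[12:01:41] <icinga-wm>   RECOVERY - Disk space on gerrit1001 is OK: DISK OK
EOF
}

# Keep only PROBLEM lines mentioning gerrit1001.
irc_log_sample | grep 'PROBLEM' | grep 'gerrit1001'
```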

Which also leads to:

  • Adjust upgrade documentation to no more disable all monitoring probes (RelEng)

Currently the Gerrit upgrade documentation states:

1) Schedule downtime for Gerrit hosts to avoid notifications

    https://icinga.wikimedia.org/cgi-bin/icinga/cmd.cgi?cmd_typ=86&host=gerrit1001
    https://icinga.wikimedia.org/cgi-bin/icinga/cmd.cgi?cmd_typ=86&host=gerrit2002

This disables quite a lot of alerts for the hosts (including the disk space alert). So I went through all of the Icinga checks that would alert during a normal upgrade and found:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=gerrit.wikimedia.org (checks for gerrit.wikimedia.org)
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gerrit+process (gerrit process)
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=SSH+access (client ssh access)
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit1001&service=HTTPS (https)
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit1001&service=puppet+last+run (puppet run)

Downtiming each service individually seems like quite a lot of work, and downtiming a complete host for maintenance is quite common at the moment. So we either need a script to downtime these services, a link to the Icinga page in the upgrade docs, or we just accept the noise during the upgrade. I'm not sure what the best approach is here. I'd favor downtiming just https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=gerrit.wikimedia.org (the checks for gerrit.wikimedia.org) and accepting alerts for the rest during an upgrade?
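A helper script of the kind suggested could look roughly like this. The downtime-service command, the check names, and the flags are placeholders for whatever mechanism we settle on (on our infrastructure this would more likely wrap an existing cookbook); DRY_RUN defaults to 1, so the sketch only prints what it would do:

```shell
# Sketch of a "downtime only the Gerrit-specific checks" helper.
# `downtime-service` and the check names are placeholders.
run() {
    # Print the command instead of executing it unless DRY_RUN=0.
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

HOST="gerrit1001"
for check in "Gerrit Health Check" "Gerrit JSON" "gerrit process" "SSH access" "HTTPS"; do
    run downtime-service --host "$HOST" --service "$check" --duration 2h --comment "Gerrit upgrade"
done
```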