Following the Gerrit upgrade (T307334) gerrit1001 ran out of space on /. This resulted in any write action (submit +2, rebase, remove +2 vote) resulting in a modal error 500 Internal Server Error.
To mitigate, a new volume was created, /var/lib/gerrit2 was copied to it and mounted.
Follow up actions:
RelEng
- [x] Clean up the h2 cache (RelEng) - T323754
- [ ] Clean up `/srv/` on gerrit1001 (lot of copy of repositories)
- [ ] Add a step to the Gerrit upgrade procedure to check for disk space (RelEng)
- [ ] Adjust upgrade documentation to no more disable all monitoring probes (RelEng)
- [x] Write incident report https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade (Antoine, RelEng)
- [ ] Prepare a Gerrit overview presentation for SRE (Antoine)
SRE:
- [ ] Bring primary and replica in sync configuration-wise (SRE)
-- [x] summarize disk stuff (Partman recipe etc.) on the task
- [x] Verify alerting to see if we should have been warned in advance (SRE)
Together?
- [ ] Introduce regular cache clean-up
- [ ] Test failover (and plan regular failovers)