
gitlab-restore: version detection fail / restore fail
Closed, ResolvedPublic

Description

I noticed the gitlab-restore unit failed today.

I have not had time to debug it much yet, but the status is:

● backup-restore.service - GitLab Backup Restore
   Loaded: loaded (/lib/systemd/system/backup-restore.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2022-05-11 01:43:12 UTC; 47s ago
  Process: 4671 ExecStart=/mnt/gitlab-backup/gitlab-restore.sh (code=exited, status=1/FAILURE)
 Main PID: 4671 (code=exited, status=1/FAILURE)

May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: nginx: (pid 9123) 142099s; run: log: (pid 662) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: postgres-exporter: (pid 9135) 142099s; run: log: (pid 661) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: postgresql: (pid 16061) 172662s; run: log: (pid 682) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: puma: (pid 9143) 142098s, want down; run: log: (pid 683) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: redis: (pid 16077) 172661s; run: log: (pid 678) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: redis-exporter: (pid 9149) 142098s; run: log: (pid 664) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: sidekiq: (pid 9158) 142097s; run: log: (pid 660) 3462960s
May 11 01:43:11 gitlab2001 gitlab-restore.sh[4671]: timeout: run: puma: (pid 9143) 142130s, want down
May 11 01:43:12 gitlab2001 systemd[1]: backup-restore.service: Main process exited, code=exited, status=1/FAILURE
May 11 01:43:12 gitlab2001 systemd[1]: backup-restore.service: Failed with result 'exit-code'.

I have already been able to reproduce it a second time by manually starting the unit.

It's timing out while trying to shut down puma.

In the process tree, it hangs on:

├─5360 /opt/gitlab/embedded/bin/ruby /opt/gitlab/embedded/bin/omnibus-ctl gitlab /opt/gitlab/embedded/service/omnibus-ctl* stop puma


Manually running that command confirms it:

/opt/gitlab/embedded/bin/ruby /opt/gitlab/embedded/bin/omnibus-ctl gitlab /opt/gitlab/embedded/service/omnibus-ctl* stop puma

timeout: run: puma: (pid 9143) 142431s, want down

Additionally, I noticed:

/mnt/gitlab-backup/gitlab-restore.sh: line 14: [: 14.10.2: unary operator expected

On that line it compares $installed_version to $backup_version, and the way it tries to detect those versions somehow fails.
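
For context, this is the classic failure mode of an unquoted test in bash: if one of the two version variables ends up empty (because the detection somehow matched nothing), the comparison collapses into a two-argument [ expression and bash complains about a unary operator. A minimal sketch, not the actual script code, and the exact operator used on line 14 is an assumption:

installed_version="14.10.2"
backup_version=""                        # detection matched nothing, variable is empty

# Unquoted, `[ $installed_version != $backup_version ]` expands to `[ 14.10.2 != ]`,
# which is exactly what produces "[: 14.10.2: unary operator expected".
# Quoting both sides keeps the comparison well-formed even when a variable is empty:
if [ "$installed_version" != "$backup_version" ]; then
    echo "version mismatch: installed='$installed_version' backup='$backup_version'" >&2
fi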

I will look into it more tomorrow; it was already a bit late here. I just wanted to record and share the info.

Event Timeline

Change 791029 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: fix regex for restore version check

https://gerrit.wikimedia.org/r/791029

Change 791029 merged by Jelto:

[operations/puppet@production] gitlab: fix regex for restore version check

https://gerrit.wikimedia.org/r/791029

Jelto claimed this task.
Jelto added subscribers: LSobanski, thcipriani.

Thanks for opening the task!

I can confirm: puma was stuck and couldn't be stopped, which blocked the restore. There were multiple issues:

A bug in the version check. The regex used to grep for the version was \d*\.\d\.\d. With our new GitLab version 14.10.1, the version check failed (because "10" has two digits). Because of this failed version check, the restore script tried to restore a 14.9.2 backup (from production) onto the 14.10.2 GitLab installation on the replica. I fixed the regex in https://gerrit.wikimedia.org/r/c/operations/puppet/+/791029.
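
A hedged sketch of what the corrected detection can look like; only the regex change is the point, and where the installed version is read from is an assumption:

# Old pattern \d*\.\d\.\d allows a single digit for the minor and patch parts,
# so "14.10.2" never matches; \d+\.\d+\.\d+ accepts one or more digits per component.

# Backup filenames look like 1653294190_2022_05_23_14.10.2_gitlab_backup.tar:
backup_file="1653294190_2022_05_23_14.10.2_gitlab_backup.tar"
backup_version=$(grep -oP '\d+\.\d+\.\d+' <<< "$backup_file")

# Reading the installed version from the omnibus version manifest is an assumption here:
installed_version=$(grep -oP '\d+\.\d+\.\d+' /opt/gitlab/version-manifest.txt | head -n1)

echo "backup=$backup_version installed=$installed_version"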

Then puma was stuck (maybe because of the wrong backup). I force-stopped puma using gitlab-ctl kill puma.

Then the backup was not complete (it was around 10GB, while backups are normally around 15GB currently), most probably due to not enough disk space on /mnt/gitlab-backup. The restore failed with gzip: stdin: unexpected end of file. So I deleted an old backup, scheduled a new backup on gitlab1001 with /mnt/gitlab-backup/gitlab-backup.sh full, and synced the backup to the replica with /usr/bin/rsync -avp --delete /mnt/gitlab-backup/latest/ rsync://gitlab2001.wikimedia.org/data-backup.
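
Since a truncated archive only surfaces as a gzip error halfway through the restore, a cheap pre-flight check on the replica would catch this earlier. A rough sketch: the size threshold and the idea of adding this to the restore script are my own, only the paths come from this task:

backup=/mnt/gitlab-backup/latest/latest.tar

# A full backup is around 15GB at the moment; anything far below that is suspicious.
min_bytes=$((12 * 1024 * 1024 * 1024))
actual_bytes=$(stat -c %s "$backup")
if [ "$actual_bytes" -lt "$min_bytes" ]; then
    echo "backup looks truncated: $actual_bytes bytes" >&2
    exit 1
fi

# Reading the tar index end to end fails immediately on a partially written archive,
# instead of failing later inside gitlab-backup with "unexpected end of file".
if ! tar -tf "$backup" > /dev/null; then
    echo "backup archive is not readable" >&2
    exit 1
fi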

Then I scheduled a manual restore on gitlab2001 with sudo systemctl start backup-restore, which failed with Errno::ENOSPC: No space left on device - sendfile. So I also deleted backups on gitlab2001.

Then I scheduled another manual restore on gitlab2001 with sudo systemctl start backup-restore, which was successful.

I'm going to close this task.

I'd like to give T307142 and maybe T274463 more priority, because otherwise this kind of issue will come up more frequently. If GitLab usage increases over the next weeks, we won't be able to do the backup-restore cycle because the disks will be too small; with even more usage we won't be able to do backups at all.

Adding @LSobanski and @thcipriani for awareness ⬆️

Reopening, puma still fails to stop:

May 12 01:33:19 gitlab2001 gitlab-restore.sh[26219]: timeout: run: puma: (pid 6268) 44600s, want down

I'll try to find out why this is happening.

Mentioned in SAL (#wikimedia-operations) [2022-05-12T16:21:21Z] <mutante> gitlab2001 - trying to stop 'puma' for debugging T308089

Change 791410 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: use 'gracefull-kill' instead of stop to stop puma

https://gerrit.wikimedia.org/r/791410

Change 791410 merged by Dzahn:

[operations/puppet@production] gitlab: use 'gracefull-kill' instead of stop to stop puma

https://gerrit.wikimedia.org/r/791410

Mentioned in SAL (#wikimedia-operations) [2022-05-12T19:53:11Z] <mutante> gitlab2001 - systemctl start backup-restore - systemd[1]: Started GitLab Backup Restore. after gerrit:791410 for T308089

After deploying the change above, this is what happens when manually starting the unit:

May 12 19:55:23 gitlab2001 gitlab-restore.sh[17479]: timeout: run: puma: (pid 672) 3552s, want down
May 12 19:55:24 gitlab2001 gitlab-restore.sh[17479]: found stuck pids still running in process group: 672 18300 18302 18304 18306 18308 18310 18315 18317, sending 
May 12 19:55:30 gitlab2001 gitlab-restore.sh[17479]: ok: down: sidekiq: 0s, normally up
May 12 19:55:32 gitlab2001 gitlab-restore.sh[17479]: running gitlab-backup restore
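
For reference, what changed with gerrit:791410 is which omnibus-ctl subcommand the restore script uses to bring puma down. A hedged sketch of the relevant step, not verbatim from gitlab-restore.sh: stop waits for a clean shutdown and times out on a wedged puma, while graceful-kill escalates and kills leftover pids in the process group, which is what the "found stuck pids still running in process group" line above is reporting.

# before (times out when puma is wedged):
#   gitlab-ctl stop puma
# after gerrit:791410 (escalates to killing the process group):
gitlab-ctl graceful-kill puma
gitlab-ctl stop sidekiq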

The restore ran and ended like this:

May 12 20:08:13 gitlab2001 gitlab-restore.sh[17479]: Internal API available: FAILED - Internal API error (502)
May 12 20:10:23 gitlab2001 puppet-agent[30156]: Enabling Puppet.
May 12 20:10:23 gitlab2001 systemd[1]: backup-restore.service: Succeeded.

So it did succeed, good! Not sure about that internal API error though.

Dzahn changed the task status from Open to In Progress.May 12 2022, 8:28 PM

Change 797301 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: retry rails console, don't keep gitlab-secrets.json

https://gerrit.wikimedia.org/r/797301

Jelto triaged this task as Medium priority.May 23 2022, 1:59 PM

> So it did succeed, good! Not sure about that internal API error though.

API error should be fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/797301
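
The 502 shows up when the restore script queries GitLab before puma/Workhorse are fully back up, so retrying instead of giving up on the first attempt is enough. A hedged sketch of such a retry loop; the exact command and retry counts in 797301 may well differ:

# One way to check that the Rails side is back up; whether 797301 uses
# gitlab-rails runner or the interactive console for this is an assumption.
for attempt in $(seq 1 10); do
    if /usr/bin/gitlab-rails runner 'puts Gitlab::VERSION' > /dev/null 2>&1; then
        echo "rails reachable after $attempt attempt(s)"
        break
    fi
    echo "rails not ready yet (attempt $attempt), retrying in 30s"
    sleep 30
done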

Change 797301 merged by Jelto:

[operations/puppet@production] gitlab: retry rails console, don't keep gitlab-secrets.json

https://gerrit.wikimedia.org/r/797301

Change 799016 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: switch backup location to /srv, don't use /etc

https://gerrit.wikimedia.org/r/799016

Change 799016 abandoned by Dzahn:

[operations/puppet@production] gitlab: switch backup location to /srv, don't use /etc

Reason:

superseded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/799280

https://gerrit.wikimedia.org/r/799016

Change 799016 restored by Dzahn:

[operations/puppet@production] gitlab: switch backup location to /srv, don't use /etc

https://gerrit.wikimedia.org/r/799016

Mentioned in SAL (#wikimedia-operations) [2022-05-26T00:56:51Z] <mutante> gitlab1001 - T308089 T274463 - '<+icinga-wm> PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB' - manually deleted 1653294190_2022_05_23_14.10.2_gitlab_backup.tar (we have May 24 and 25, 26 could not finish writing backup) - RECOVERY - Disk space on gitlab1001 is OK

Mentioned in SAL (#wikimedia-operations) [2022-05-26T00:58:18Z] <mutante> gitlab1001 - T308089 T274463 - gitlab1001 - systemctl start full-backup

Mentioned in SAL (#wikimedia-operations) [2022-05-26T01:01:21Z] <mutante> gitlab1003 - T308089 T274463 - gitlab1003 - systemctl status backup-restore is failed because it's looking for /mnt/gitlab-backup/latest/latest.tar needs gerrit:799016

Change 799016 merged by Dzahn:

[operations/puppet@production] gitlab: switch backup location to /srv, don't use /etc

https://gerrit.wikimedia.org/r/799016

Mentioned in SAL (#wikimedia-operations) [2022-05-26T01:20:04Z] <mutante> gitlab1003 - T308089 T274463 - gitlab1001 - deleted backups from April 4 and April 5 from /srv/gitlab-backup AND deleted partial failed backups from May 26 from /mnt/gitlab-backup; deployed both gerrit:799016 and gerrit:799280 ; restarting full-backup service

Mentioned in SAL (#wikimedia-operations) [2022-05-26T01:27:51Z] <mutante> T308089 T274463 - gitlab1001 - systemctl start rsync-config-backup-gitlab1003.wikimedia.org - Suceeded - RECOVERY - Check systemd state on gitlab1001 is OK

Mentioned in SAL (#wikimedia-operations) [2022-05-26T01:46:39Z] <mutante> T308089 T274463 - gitlab1001 - still not enough disk space to finish full backup. moved backup of May 24th to /root/ . deleted latest.tar; started full-backup service once again

After T274463#7966543 ff we now have a working backup-restore service again:

<+icinga-wm> RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state