gitlab-restore: version detection fail / restore fail
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Dzahn
	May 11 2022, 1:50 AM

Description

I noticed the gitlab-restore unit failed today.

I did not get to debug it much yet but the status is:

● backup-restore.service - GitLab Backup Restore
   Loaded: loaded (/lib/systemd/system/backup-restore.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2022-05-11 01:43:12 UTC; 47s ago
  Process: 4671 ExecStart=/mnt/gitlab-backup/gitlab-restore.sh (code=exited, status=1/FAILURE)
 Main PID: 4671 (code=exited, status=1/FAILURE)

May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: nginx: (pid 9123) 142099s; run: log: (pid 662) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: postgres-exporter: (pid 9135) 142099s; run: log: (pid 661) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: postgresql: (pid 16061) 172662s; run: log: (pid 682) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: puma: (pid 9143) 142098s, want down; run: log: (pid 683) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: redis: (pid 16077) 172661s; run: log: (pid 678) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: redis-exporter: (pid 9149) 142098s; run: log: (pid 664) 3462960s
May 11 01:42:39 gitlab2001 gitlab-restore.sh[4671]: run: sidekiq: (pid 9158) 142097s; run: log: (pid 660) 3462960s
May 11 01:43:11 gitlab2001 gitlab-restore.sh[4671]: timeout: run: puma: (pid 9143) 142130s, want down
May 11 01:43:12 gitlab2001 systemd[1]: backup-restore.service: Main process exited, code=exited, status=1/FAILURE
May 11 01:43:12 gitlab2001 systemd[1]: backup-restore.service: Failed with result 'exit-code'.

and I have already been able to reproduce it a second time when manually starting it.

It's timing out trying to shutdown puma.

after: ├─5360 /opt/gitlab/embedded/bin/ruby /opt/gitlab/embedded/bin/omnibus-ctl gitlab /opt/gitlab/embedded/service/omnibus-ctl* stop puma

Manually running that command confirms it:

/opt/gitlab/embedded/bin/ruby /opt/gitlab/embedded/bin/omnibus-ctl gitlab /opt/gitlab/embedded/service/omnibus-ctl* stop puma

timeout: run: puma: (pid 9143) 142431s, want down

Additionally I noticed:

/mnt/gitlab-backup/gitlab-restore.sh: line 14: [: 14.10.2: unary operator expected

On that line it is comparing $installed_version to $backup_version and the way it tries to detect those versions somehow fails.

I will look more tomorrow. It was already a bit late here. I just wanted to save the info though and share it.

Details

Subject	Repo	Branch	Lines +/-
gitlab: fix regex for restore version check	operations/puppet	production	+1 -1
gitlab: retry rails console, don't keep gitlab-secrets.json	operations/puppet	production	+7 -8
gitlab: switch backup location to /srv, don't use /etc	operations/puppet	production	+6 -6
gitlab: use 'gracefull-kill' instead of stop to stop puma	operations/puppet	production	+3 -3

Customize query in gerrit

Related Objects

Mentioned In: T274463: Backups for GitLab
Mentioned Here: T274463: Backups for GitLab
T307142: bring new gitlab hardware servers into production

Event Timeline

Dzahn created this task.May 11 2022, 1:50 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 11 2022, 1:50 AM

Dzahn updated the task description. (Show Details)May 11 2022, 1:51 AM

Change 791029 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: fix regex for restore version check

https://gerrit.wikimedia.org/r/791029

gerritbot added a project: Patch-For-Review.May 11 2022, 1:12 PM

Change 791029 merged by Jelto:

[operations/puppet@production] gitlab: fix regex for restore version check

https://gerrit.wikimedia.org/r/791029

Thanks for opening the task!

I can confirm, puma was stuck and can't be stopped which blocked the restore. There were multiple issues:

A bug in the version check. The regex used to grep for the version was \\d*\.\\d\.\\d. With our new GitLab version 14.10.1 the version check fails (because 10 has two digits). Because of this failed version check, the restore script tried to restore a 14.9.2 backup (from production) to a 14.10.2 GitLab installation on the Replica. I fixed the regex in https://gerrit.wikimedia.org/r/c/operations/puppet/+/791029.

Then puma was stuck (maybe because of the wrong backup). I force stopped puma using gitlab-ctl kill puma.

Then the backup was not complete (it was around 10GB, normally backups are around 15GB currently), most probably due to not enough disk space on /mnt/gitlab-backup. Restore failed with gzip: stdin: unexpected end of file. So I delete a old backup, scheduled a new backup on gitlab1001 with /mnt/gitlab-backup/gitlab-backup.sh full and sync the backup to the replica with /usr/bin/rsync -avp --delete /mnt/gitlab-backup/latest/ rsync://gitlab2001.wikimedia.org/data-backup

Then I scheduled a manual restore on gitlab2001 with sudo systemctl start backup-restore which failed with Errno::ENOSPC: No space left on device - sendfile. So I deleted also backups on gitlab2001.

Then I scheduled a manual restore on gitlab2001 with sudo systemctl start backup-restore which was successful.

I'm going to close this task.

I'd like give T307142 and maybe T274463 more priority because otherwise this kind of issue will come up again more frequently. And if usage of GitLab increases over the next weeks we wont be able to do the backup-restore cycle because the disks will too small. And with even more usage we wont be able to do backups at all.

Adding @LSobanski and @thcipriani for awareness ⬆️

Maintenance_bot removed a project: Patch-For-Review.May 11 2022, 1:30 PM

Dzahn awarded a token.May 11 2022, 3:52 PM

Reopening, puma still fails to stop:

May 12 01:33:19 gitlab2001 gitlab-restore.sh[26219]: timeout: run: puma: (pid 6268) 44600s, want down

I'll try to find out why this is happening.

brennen moved this task from Inbox to Infrastructure on the GitLab board.May 12 2022, 2:25 PM

brennen edited projects, added GitLab (Infrastructure); removed GitLab.

Mentioned in SAL (#wikimedia-operations) [2022-05-12T16:21:21Z] <mutante> gitlab2001 - trying to stop 'puma' for debugging T308089

Change 791410 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: use 'gracefull-kill' instead of stop to stop puma

https://gerrit.wikimedia.org/r/791410

gerritbot added a project: Patch-For-Review.May 12 2022, 5:08 PM

Change 791410 merged by Dzahn:

[operations/puppet@production] gitlab: use 'gracefull-kill' instead of stop to stop puma

https://gerrit.wikimedia.org/r/791410

Mentioned in SAL (#wikimedia-operations) [2022-05-12T19:53:11Z] <mutante> gitlab2001 - systemctl start backup-restore - systemd[1]: Started GitLab Backup Restore. after gerrit:791410 for T308089

after deploying the change above, now this is what happens when manually starting the unit:

May 12 19:55:23 gitlab2001 gitlab-restore.sh[17479]: timeout: run: puma: (pid 672) 3552s, want down
May 12 19:55:24 gitlab2001 gitlab-restore.sh[17479]: found stuck pids still running in process group: 672 18300 18302 18304 18306 18308 18310 18315 18317, sending 
May 12 19:55:30 gitlab2001 gitlab-restore.sh[17479]: ok: down: sidekiq: 0s, normally up
May 12 19:55:32 gitlab2001 gitlab-restore.sh[17479]: running gitlab-backup restore

restore ran and ended like this:

May 12 20:08:13 gitlab2001 gitlab-restore.sh[17479]: Internal API available: FAILED - Internal API error (502)
May 12 20:10:23 gitlab2001 puppet-agent[30156]: Enabling Puppet.
May 12 20:10:23 gitlab2001 systemd[1]: backup-restore.service: Succeeded.

So it did succeed, good! Not sure about that internal API error though.

Dzahn changed the task status from Open to In Progress.May 12 2022, 8:28 PM

Maintenance_bot removed a project: Patch-For-Review.May 12 2022, 8:30 PM

Change 797301 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: retry rails console, don't keep gitlab-secrets.json

https://gerrit.wikimedia.org/r/797301

gerritbot added a project: Patch-For-Review.May 23 2022, 1:55 PM

So it did succeed, good! Not sure about that internal API error though.

API error should be fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/797301

Change 797301 merged by Jelto:

[operations/puppet@production] gitlab: retry rails console, don't keep gitlab-secrets.json

https://gerrit.wikimedia.org/r/797301

Maintenance_bot removed a project: Patch-For-Review.May 24 2022, 2:30 PM

Change 799016 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: switch backup location to /srv, don't use /etc

https://gerrit.wikimedia.org/r/799016

gerritbot added a project: Patch-For-Review.May 25 2022, 1:45 AM

Change 799016 abandoned by Dzahn:

[operations/puppet@production] gitlab: switch backup location to /srv, don't use /etc

Reason:

superseded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/799280

https://gerrit.wikimedia.org/r/799016

Change 799016 restored by Dzahn:

[operations/puppet@production] gitlab: switch backup location to /srv, don't use /etc

https://gerrit.wikimedia.org/r/799016

Mentioned in SAL (#wikimedia-operations) [2022-05-26T00:56:51Z] <mutante> gitlab1001 - T308089 T274463 - '<+icinga-wm> PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB' - manually deleted 1653294190_2022_05_23_14.10.2_gitlab_backup.tar (we have May 24 and 25, 26 could not finish writing backup) - RECOVERY - Disk space on gitlab1001 is OK

Mentioned in SAL (#wikimedia-operations) [2022-05-26T00:58:18Z] <mutante> gitlab1001 - T308089 T274463 - gitlab1001 - systemctl start full-backup

Mentioned in SAL (#wikimedia-operations) [2022-05-26T01:01:21Z] <mutante> gitlab1003 - T308089 T274463 - gitlab1003 - systemctl status backup-restore is failed because it's looking for /mnt/gitlab-backup/latest/latest.tar needs gerrit:799016

Change 799016 merged by Dzahn:

[operations/puppet@production] gitlab: switch backup location to /srv, don't use /etc

https://gerrit.wikimedia.org/r/799016

Dzahn added a project: serviceops.May 26 2022, 1:13 AM

Mentioned in SAL (#wikimedia-operations) [2022-05-26T01:20:04Z] <mutante> gitlab1003 - T308089 T274463 - gitlab1001 - deleted backups from April 4 and April 5 from /srv/gitlab-backup AND deleted partial failed backups from May 26 from /mnt/gitlab-backup; deployed both gerrit:799016 and gerrit:799280 ; restarting full-backup service

Mentioned in SAL (#wikimedia-operations) [2022-05-26T01:27:51Z] <mutante> T308089 T274463 - gitlab1001 - systemctl start rsync-config-backup-gitlab1003.wikimedia.org - Suceeded - RECOVERY - Check systemd state on gitlab1001 is OK

Mentioned in SAL (#wikimedia-operations) [2022-05-26T01:46:39Z] <mutante> T308089 T274463 - gitlab1001 - still not enough disk space to finish full backup. moved backup of May 24th to /root/ . deleted latest.tar; started full-backup service once again

After T274463#7966543 ff we now have a working backup-restore service again:

<+icinga-wm> RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

Dzahn closed this task as Resolved.May 31 2022, 11:37 PM

gitlab-restore: version detection fail / restore failClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

gitlab-restore: version detection fail / restore fail
Closed, ResolvedPublic
Actions