
Switchover gitlab (gitlab1004 -> gitlab2002)
Closed, Resolved · Public

Description

GitLab will be switched from eqiad to codfw as part of the March 2023 Datacenter Switchover (T327920), one day before the actual switchover so it does not block dependencies. This task tracks the failover of the GitLab production instance in eqiad (gitlab1004) to codfw (gitlab2002).

Docs: https://wikitech.wikimedia.org/wiki/GitLab/Failover
Last task: T307142#7971192 (checklist can be adapted for this year's failover)
Time: 10:00 UTC, 27th of February

Checklist:

Preparations before downtime:

  • check that gitlab1004 and gitlab2002 use the same ssh host keys for the ssh-gitlab daemon (see the sketch after this list)
  • prepare change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002 and to make gitlab2002 the profile::gitlab::active_host operations/puppet/+/891863
  • prepare change to point the DNS entry for gitlab.wikimedia.org to gitlab2002 and gitlab-replica-old.wikimedia.org to gitlab1004 operations/dns/+/891888
  • apply gitlab-settings to gitlab1004 and gitlab2002
  • announce downtime some days ahead on the ops/releng lists and via a broadcast message
  • lower TTL for gitlab DNS records to 300s (also PTR) operations/dns/+/891886 (see the sketch after this list)
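
A minimal sketch of two of the checks above. The assumption that ssh-gitlab reuses the host keys under /etc/ssh/, and the cumin invocation, are mine and not taken from the failover docs:

  # Compare ssh host key fingerprints on both hosts.
  sudo cumin 'gitlab1004.wikimedia.org,gitlab2002.wikimedia.org' \
    'sha256sum /etc/ssh/ssh_host_*_key.pub'

  # Confirm the lowered TTL is live once operations/dns/+/891886 is deployed;
  # the TTL column of each answer should show 300 or less.
  dig +noall +answer gitlab.wikimedia.org A
  dig +noall +answer -x "$(dig +short gitlab.wikimedia.org A | head -1)"

If the fingerprints differ, the keys need to be synced before the failover; otherwise clients will see host key warnings after the DNS switch.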

Scheduled downtime:

  • Announce downtime in #wikimedia-gitlab
  • pause all GitLab Runners (in gitlab-settings: ./runners active | tee active.txt && ./runners pause < active.txt)
  • downtime gitlab1004 with sudo cookbook sre.hosts.downtime -r "Running failover to gitlab2002 - T329931" -M 120 'gitlab1004.wikimedia.org'
  • stop puppet on gitlab1004 with sudo disable-puppet "Running failover to gitlab2002 - T329931"
  • stop GitLab on gitlab1004 with gitlab-ctl stop nginx
  • stop ssh-gitlab daemon on gitlab1004 with systemctl stop ssh-gitlab
  • create full backup on gitlab1004 with /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
  • sync backup, on gitlab1004 run /usr/bin/rsync -avp /srv/gitlab-backup/ rsync://gitlab2002.wikimedia.org/data-backup
  • merge change to set profile::gitlab::service_name: 'gitlab.wikimedia.org' on gitlab2002 and run puppet operations/puppet/+/891863
  • trigger restore on gitlab2002 with sudo systemctl start backup-restore.service (for logs, run journalctl -f -u backup-restore.service)
  • merge change to point the DNS entry for gitlab.wikimedia.org to gitlab2002 and gitlab-replica-old.wikimedia.org to gitlab1004 operations/dns/+/891888
  • verify installation (see the smoke-test sketch after this list)
  • enable puppet on gitlab1004 with sudo run-puppet-agent -e "Running failover to gitlab2002 - T329931"
  • start ssh-gitlab daemon on gitlab1004 with systemctl start ssh-gitlab
  • unpause all GitLab Runners (in gitlab-settings: ./runners unpause < active.txt)
  • announce end of downtime
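
A minimal smoke-test sketch for the verification step, run once the restore on gitlab2002 has finished. gitlab-ctl status and the /-/health and /-/readiness endpoints are standard GitLab Omnibus facilities; whether the health endpoints are reachable from an arbitrary client is an assumption (they are IP-restricted by default):

  # All omnibus-managed services should report "run:".
  sudo gitlab-ctl status

  # Basic application health: expect "GitLab OK" and a JSON "status": "ok".
  curl -fsS https://gitlab.wikimedia.org/-/health
  curl -fsS https://gitlab.wikimedia.org/-/readiness

A further check is an anonymous git ls-remote or clone of a known public project over both HTTPS and SSH.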

Event Timeline

LSobanski moved this task from Incoming to Backlog on the collaboration-services board.
Jelto updated the task description.

I scheduled a new broadcast message on GitLab for the upcoming switchover next Monday. The message will be displayed from tomorrow morning until after the maintenance window on Monday.
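
For reference, such a message can be scheduled through the standard GitLab broadcast messages API; a sketch with placeholder token and timestamps, not the values actually used:

  curl -fsS --request POST \
    --header "PRIVATE-TOKEN: <admin-token>" \
    --data-urlencode "message=GitLab maintenance: failover to codfw on Monday 2023-02-27 from 10:00 UTC" \
    --data "starts_at=2023-02-24T08:00:00Z" \
    --data "ends_at=2023-02-27T12:00:00Z" \
    "https://gitlab.wikimedia.org/api/v4/broadcast_messages"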

Change 891863 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw)

https://gerrit.wikimedia.org/r/891863

Change 891886 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Lower TTL on gitlab records to 300 seconds to facilitate failover

https://gerrit.wikimedia.org/r/891886

Change 891888 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Update records for gitlab from gitlab1004 -> gitlab2002

https://gerrit.wikimedia.org/r/891888

Change 891886 merged by Jelto:

[operations/dns@master] Lower TTL on gitlab records to 300 seconds to facilitate failover

https://gerrit.wikimedia.org/r/891886

Mentioned in SAL (#wikimedia-operations) [2023-02-27T10:17:50Z] <eoghan@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: Running failover to gitlab2002- T329931

Mentioned in SAL (#wikimedia-operations) [2023-02-27T10:18:03Z] <eoghan@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: Running failover to gitlab2002- T329931

Jelto updated the task description.

Change 891863 merged by EoghanGaffney:

[operations/puppet@production] Switch active gitlab host from gitlab1004 (eqiad) -> gitlab2002 (codfw)

https://gerrit.wikimedia.org/r/891863

Change 891888 merged by EoghanGaffney:

[operations/dns@master] Update records for gitlab from gitlab1004 -> gitlab2002

https://gerrit.wikimedia.org/r/891888

Maintenance finished, GitLab is back again. If you face any issues, feel free to post them here.

I'm leaving the task open in case any issues come up. Otherwise we can close this task tomorrow.

Thanks @eoghan for running the switchover!

A "Backup freshness" alert with summary:"No backups: 1 (gitlab2002), Fresh: 116 jobs"" triggered. This should resolve on the next bacula run this night. Bacula was disabled for the old instance gitlab1004 and enabled for gitlab2002.

Change 892892 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable restore for replicas, disable on active_host

https://gerrit.wikimedia.org/r/892892

As mentioned in T330717, the new production host gitlab2002 still had the restore enabled and executed a restore last night.

The restore timer is controlled manually via the hiera key profile::gitlab::enable_restore: true. This key was not updated during the switchover, so the timer was not removed and the restore was triggered at 2:00 UTC, replaying a backup taken at 0:04 UTC.

So we lost roughly two hours of data.
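
A sketch of how to confirm the timer state on both hosts after a failover; the unit name backup-restore.timer is an assumption, derived from the backup-restore.service used during the switchover:

  # On the production host (now gitlab2002): no restore timer should exist.
  sudo systemctl list-timers 'backup-restore*'

  # On the replica (now gitlab1004): the timer should be present and active.
  sudo systemctl status backup-restore.timer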

I posted some updates about this to wikitech-l and engineering-all. I'll also post a short message to Slack and IRC to make sure everyone is aware of it.

The above change should fix this and automatically enable and disable the restore, depending on the state of the host (replica or production host).

I started an incident report in 2023-02-28_GitLab_data_loss.

It seems we were quite lucky and only got one git connection over SSH during that time. I'm still grepping through the other logs to estimate the number of affected users/actions.
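
A sketch of the kind of query involved, against Gitaly's JSON logs on gitlab2002. The log path is the Omnibus default and grpc.method is a standard Gitaly log field, but the exact field layout in our logs is an assumption:

  # Count write RPCs (pushes) during the lost window, 00:04-02:00 UTC.
  sudo jq -r 'select(."grpc.method" == "SSHReceivePack" or ."grpc.method" == "PostReceivePack") | .time' \
    /var/log/gitlab/gitaly/current \
    | awk '$0 >= "2023-02-28T00:04" && $0 < "2023-02-28T02:00"' | wc -l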

Change 892892 merged by Jelto:

[operations/puppet@production] gitlab: enable restore for replicas, disable on active_host

https://gerrit.wikimedia.org/r/892892

The rsync jobs between the production host and the replicas are only created, never removed, when the list of replicas changes. On the former production instance gitlab1004 the jobs rsync-config-backup-gitlab2002.wikimedia.org.timer and rsync-data-backup-gitlab2002.wikimedia.org.timer are still present.

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/gitlab/manifests/rsync.pp#40

The jobs failed because the rsync server is disabled automatically on the new production instance. However, we should find a way to manage the rsync jobs properly; removing them by hand after each failover sounds error-prone.
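
Until Puppet handles removal, a sketch of the manual cleanup on the former production host (unit names taken from above; whether disabling the units conflicts with a later Puppet run is an open question):

  # List any leftover per-replica rsync timers.
  systemctl list-units 'rsync-*-backup-*.timer' --all

  # Stop and disable the stale units on gitlab1004.
  sudo systemctl disable --now rsync-config-backup-gitlab2002.wikimedia.org.timer
  sudo systemctl disable --now rsync-data-backup-gitlab2002.wikimedia.org.timer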

^ this is fixed now, subtask T330744 closed.

I've done some more research into what happened during the backup and restore incident on the new production instance gitlab2002.

According to the GitLab Gitaly and SSH logs, only one commit over SSH happened during the incident. In two other projects I found merges and some kind of upload:

https://logstash.wikimedia.org/goto/3e9666227f80f732b35548191d15e4f8

The API/webserver logs are quite verbose and are not fully parsed in Logstash, so it's a bit tricky to find other actions such as comments or activity on merge requests.

I've spoken with Jelto and we're happy to close this ticket.

The switchover process itself completed largely without issue; however, the data-loss incident did occur that night. We've identified a few safeguards and have either implemented them or opened tickets for them. There is also future work around the architecture of the GitLab service that will be added to the roadmap later.