After some troubleshooting with @eoghan we found out that systemd::unmask sets the refreshonly parameter to true on the exec command, which means the unmask command only runs when the exec is notified by another resource, not on every Puppet run.
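For context, a minimal sketch of the refreshonly behavior (resource title and command are illustrative, not the actual systemd::unmask code):

```
# refreshonly => true: the exec never runs on a normal agent run, it only
# fires when another resource sends it a refresh (notify/subscribe) event.
sudo puppet apply --noop -e '
  exec { "unmask logrotate.timer":
    command     => "/bin/systemctl unmask logrotate.timer",
    refreshonly => true,
  }
'
```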
With the merge of https://gerrit.wikimedia.org/r/903693 the aphlict logrotate job was removed from the Phabricator hosts and kept on the aphlict host.
Thanks to @eoghan for pointing me to the patch which disabled logrotate on the Phabricator hosts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/902396/8/modules/phabricator/manifests/aphlict.pp
I can confirm logrotate failed 4 days ago. The timer seems to be stuck (inactive (dead)):
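A hedged sketch of how to inspect and restart the stuck unit on an affected host (assuming the standard Debian logrotate.timer/logrotate.service units):

```
# Show the timer/service state and the last trigger time:
sudo systemctl status logrotate.timer logrotate.service
sudo systemctl list-timers logrotate.timer
# If the timer ended up masked because the unmask exec never fired, undo it manually:
sudo systemctl unmask logrotate.timer
sudo systemctl start logrotate.timer
```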
Thanks @Arnoldokoth for the tests. Yes, the "incremental" backup seems to be significantly smaller and faster than a full backup. But it's not fully clear to me what happens internally and which backup file we need for a restore. Documentation around the incremental backup feature is quite limited. The output of the backup jobs is also very limited when the CRON=1 parameter is set.
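For reference, a hedged sketch of the upstream gitlab-backup invocations involved (backup IDs are placeholders):

```
# Incremental backup layers repository changes onto an existing full backup:
sudo gitlab-backup create INCREMENTAL=yes PREVIOUS_BACKUP=<backup-id>
# CRON=1 suppresses progress output, hence the sparse job logs; drop it
# to get the verbose output while testing:
sudo gitlab-backup create INCREMENTAL=yes PREVIOUS_BACKUP=<backup-id> CRON=1
# A restore always takes a single archive ID:
sudo gitlab-backup restore BACKUP=<backup-id>
```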
Wed, Mar 15
The restore failed with a deadlock in the postgres DB restore:
Tue, Mar 14
Mon, Mar 13
Blackbox checks for releases have been failing since they were added to Prometheus: https://logstash.wikimedia.org/goto/a0d856fad268c5301031fb6e47c56b9b and https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&viewPanel=3. A Puppet issue on the Prometheus hosts had prevented the blackbox check from being configured properly before.
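To see what the probe returns without waiting on Prometheus, the exporter can be queried directly; a hedged sketch (port is the blackbox-exporter default, the module name is an assumption about our config):

```
# Ask the blackbox exporter to probe the target and print the key metrics:
curl -s 'http://localhost:9115/probe?module=http_2xx&target=https://releases.wikimedia.org' \
  | grep -E '^probe_(success|http_status_code)'
```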
A docker registry container is running on runner-1029 in WMCS now. A quick test with the pull-through cache enabled seemed to work: with the cache enabled, a docker pull lands the image data in /var/lib/docker-registry.
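A hedged sketch of the minimal pull-through cache setup this corresponds to (flags and paths are illustrative, not the Puppetized config):

```
# Run the upstream registry image with the proxy feature pointed at Docker Hub:
docker run -d --name registry-mirror -p 5000:5000 \
  -v /var/lib/docker-registry:/var/lib/registry \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  registry:2
# Point the local docker daemon at the mirror, then restart dockerd:
echo '{ "registry-mirrors": ["http://localhost:5000"] }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
```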
Wed, Mar 8
The restore script has a safeguard implemented on production (see change above):
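Illustratively, such a guard amounts to something like this (a sketch, not the actual script; the hostname check is an assumption):

```
# Refuse to run a destructive restore on the production GitLab instance:
if [[ "$(hostname -f)" == "gitlab2002.wikimedia.org" ]]; then
  echo "This host is the production instance, refusing to restore" >&2
  exit 1
fi
```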
Tue, Mar 7
I deleted the new host cumin-master-1001.devtools.eqiad1.wikimedia.cloud.
The above estimates cover only the current GitLab usage. At some point we may migrate all projects from Gerrit to GitLab. The sizes on gerrit1001 are:
Adding @Volans to get some feedback about running cookbooks in WMCS. I'm not sure if creating a dedicated cumin master helps us run cookbooks in WMCS.
Mon, Mar 6
I created cumin-master-1001 in the devtools project and configured the host according to https://wikitech.wikimedia.org/wiki/Help:Cumin_master.
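A quick smoke test from the new master, assuming the OpenStack backend described on that wiki page (the target query is illustrative):

```
# Target every host in the devtools project via the OpenStack backend:
sudo cumin 'O{project:devtools}' 'uptime'
```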
Thu, Mar 2
One of the WMCS shared runners, runner-29, now has a dedicated disk for storing the mirrored Docker Hub images:
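A hedged way to verify the setup on the runner (device name and size are whatever Cinder assigned):

```
# Confirm the dedicated volume is mounted at the registry's data directory:
lsblk
df -h /var/lib/docker-registry
```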
Wed, Mar 1
I've done some more research on the above ideas:
^ the alert is firing again. I guess the downtime expired?
Tue, Feb 28
Timers/jobs for gitlab2002 were removed on gitlab1004:
thanks for keeping an eye on that!
The rsync jobs between the production host and the replicas are only created, not removed, when the list of replicas changes. On the former production instance gitlab1004 the timers rsync-config-backup-gitlab2002.wikimedia.org.timer and rsync-data-backup-gitlab2002.wikimedia.org.timer are still present; they have to be cleaned up by hand, roughly as sketched below.
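A hedged sketch of the manual cleanup (the unit file location is an assumption about how Puppet lays these out):

```
# Stop and disable the stale timers for the ex-replica:
sudo systemctl stop rsync-config-backup-gitlab2002.wikimedia.org.timer \
                    rsync-data-backup-gitlab2002.wikimedia.org.timer
sudo systemctl disable rsync-config-backup-gitlab2002.wikimedia.org.timer \
                       rsync-data-backup-gitlab2002.wikimedia.org.timer
# Remove the leftover unit files and reload systemd:
sudo rm /etc/systemd/system/rsync-{config,data}-backup-gitlab2002.wikimedia.org.{timer,service}
sudo systemctl daemon-reload
```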
After merging the above change the restore timer is gone from gitlab2002. So we should not see a ProbeDown alert caused by restores on the production instance again.
I started an incident report in 2023-02-28_GitLab_data_loss
As mentioned in T330717, the new production host gitlab2002 still had the restore enabled and executed a restore last night.
So gitlab2002 was down for around 20 minutes while performing the restore. The above change should disable the restore on the production host.
gitlab2002 was switched from a replica to the production instance yesterday in T329931.
Mon, Feb 27
@Dzahn can you follow up on the task status?
A "Backup freshness" alert with summary:"No backups: 1 (gitlab2002), Fresh: 116 jobs"" triggered. This should resolve on the next bacula run this night. Bacula was disabled for the old instance gitlab1004 and enabled for gitlab2002.
Maintenance finished, GitLab is back again. If you face any issues, feel free to post them here.
Feb 23 2023
I deleted all unused volumes in the gitlab-runners project. This freed up some storage/volume quota in the project.
I scheduled a new broadcast message on GitLab for the upcoming switchover next Monday. The message will be displayed from tomorrow morning until the maintenance window ends on Monday.
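For reference, broadcast messages can be scheduled through the GitLab API as well as the admin UI; a hedged sketch (token, text, and timestamps are placeholders):

```
# Schedule a banner that runs from tomorrow morning until after the window:
curl --request POST --header "PRIVATE-TOKEN: <admin-token>" \
  "https://gitlab.wikimedia.org/api/v4/broadcast_messages" \
  --data "message=GitLab maintenance: host switchover on Monday" \
  --data "starts_at=2023-02-24T06:00:00Z" \
  --data "ends_at=2023-02-27T12:00:00Z"
```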
Feb 22 2023
After switching gitlab2002 and gitlab1003 I get an SSH host key warning, although the hosts use the same host key:
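A hedged note on why this can happen: OpenSSH caches host keys per name and per IP, so even an unchanged key trips the check when the cached entry for the old IP no longer matches. Clearing the stale known_hosts entries and reconnecting usually resolves it:

```
# Drop the cached entries for the switched hosts and re-accept on next connect:
ssh-keygen -R gitlab2002.wikimedia.org
ssh-keygen -R gitlab1003.wikimedia.org
```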