User Details
- User Since
- Jun 7 2021, 7:25 AM (152 w, 3 d)
- Availability
- Available
- LDAP User
- Jelto
- MediaWiki User
JWodstrcil (WMF)
Yesterday
I think this can be closed, thanks for taking and copying the notes here @debt!
A job tried to restart envoy, but envoy is not running on contint1003 (because it's a passive host).
Mon, May 6
Sun, May 5
One note: the puppet sync had been broken since the end of March due to permission issues on the directory. Some folders were owned by root instead of gitpuppet (in both the labs/private and puppet repos). I fixed the permissions using chown gitpuppet:gitpuppet -R /srv/git/operations/puppet/. Puppet is happy again on the devtools puppet server. Thanks again to @taavi for the help.
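Roughly what the fix looks like, as a sketch; the labs/private path below is an assumption and may differ on the devtools host:

# List anything under the repo that is not owned by gitpuppet
# (root-owned files are what broke the sync):
find /srv/git/operations/puppet -not -user gitpuppet -ls

# Re-own the repo recursively; repeat for the private repo
# (the second path is a guess, check where it lives on the host):
chown -R gitpuppet:gitpuppet /srv/git/operations/puppet
chown -R gitpuppet:gitpuppet /srv/git/labs/private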
The exporter runs on the test instance now. I'll enable the exporter on the prod machines and add them to Prometheus next week.
Sat, May 4
GitLab Runner configuration values are available now in the exporter:
A first metric is fetched successfully in the exporter:
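For reference, a quick way to eyeball the exporter output from the host itself; the port is an assumption, since the task doesn't state which one the exporter listens on:

# Fetch the metrics page and filter for runner-related series
# (replace 9252 with the port the exporter actually binds to):
curl -s http://localhost:9252/metrics | grep -i gitlab_runner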
Thanks @Dzahn for the troubleshooting; it was indeed a missing OIDC secret. The secret was not copied from the old to the new puppet server (because we put it in the wrong location).
Fri, May 3
Thu, May 2
Tue, Apr 30
One execution of the timer job failed due to a timeout. We have seen that multiple times before. The next execution was successful, so I'm closing the task.
Mon, Apr 29
This ran at the same time as a backup-restore on the replica (which disables GitLab's SSH service). If that happens more often, we might have to tweak the scheduling or add a proper dependency between those jobs.
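One way to avoid the overlap, sketched under the assumption that the two jobs are the systemd units backup-restore.service and rsync-data-backup-gitlab1003.wikimedia.org.service (the real unit names may differ):

# Option 1: shift the rsync timer so the two jobs no longer overlap
# (the OnCalendar value below is only a placeholder):
sudo systemctl edit rsync-data-backup-gitlab1003.wikimedia.org.timer
#   [Timer]
#   OnCalendar=
#   OnCalendar=*-*-* 14:00:00

# Option 2: order the units explicitly, so the rsync job waits for a
# running restore instead of racing it:
sudo systemctl edit rsync-data-backup-gitlab1003.wikimedia.org.service
#   [Unit]
#   After=backup-restore.service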
Fri, Apr 26
Thu, Apr 25
This happened while restarting the runner for software updates.
Expected due to the ongoing upgrade in T363349. I created a silence for backup-restore.service until tomorrow, after our maintenance window.
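For reference, one way to create such a silence with the upstream Alertmanager CLI; the matcher label and the Alertmanager URL are assumptions, not what was actually used:

# Silence alerts for the failing unit for 24 hours:
amtool --alertmanager.url=http://alertmanager:9093 silence add \
  --comment="T363349 GitLab upgrade, expected failure" \
  --duration=24h \
  'name=~".*backup-restore.*"'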
Tue, Apr 23
I've run some queries in Superset and it seems this was Amazonbot scraping Phabricator; see https://superset.wikimedia.org/superset/dashboard/p/56nOdzPB0q8/ for example.
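The gist of the query, as a sketch; the table and column names are assumptions based on the webrequest schema, not the exact Superset query:

# Run via the presto CLI (or Superset's SQL Lab) to rank user agents
# hitting Phabricator; adjust the time window as needed:
presto --execute "
  SELECT user_agent, COUNT(*) AS hits
  FROM wmf.webrequest
  WHERE uri_host = 'phabricator.wikimedia.org'
    AND user_agent LIKE '%Amazonbot%'
  GROUP BY user_agent
  ORDER BY hits DESC
  LIMIT 20"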
Thu, Apr 18
Wed, Apr 17
Tue, Apr 16
This happened due to maintenance on the wikikube cluster in T290020.
In https://gerrit.wikimedia.org/r/1019039 I tried to add more probes to the service::catalog entry for miscweb. However, the current Puppet implementation does not support multiple blackbox checks. In https://gerrit.wikimedia.org/r/1020185 I tried to add this feature, but it requires significantly more refactoring in other Puppet modules as well, so I abandoned it for now.
Mon, Apr 15
I think we are mostly settled on which runners have which kind of access to WMF and external infrastructure. Also, the permissions for these runners seem to work as expected (default access to Cloud Runners, opt-in access to Trusted Runners).
Puppet runs fail on some machines that use the new puppetmaster in devtools; here is an example from gitlab-runner-1002.devtools.eqiad1.wikimedia.cloud:
Fri, Apr 12
This was because of the update in T362298. I'm not 100% sure why this was not silenced, as the cookbook creates a downtime of 180 minutes:
Thu, Apr 11
The Trusted Dockerfile Runner is available now and first tests with building the buildkit image were successful. I also adjusted the docs and added Dockerfile support to one of the test runners.
Wed, Apr 10
Apr 10 13:00:05 gitlab2002 systemd[1]: Starting rsync GitLab data backup primary to a secondary server...
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: sending incremental file list
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: rsync: [sender] link_stat "/srv/gitlab-backup/*_gitlab_backup.tar" failed: No such file or directory (2)
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: sent 94 bytes received 20 bytes 76.00 bytes/sec
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: total size is 0 speedup is 0.00
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
Apr 10 13:00:06 gitlab2002 systemd[1]: rsync-data-backup-gitlab1003.wikimedia.org.service: Main process exited, code=exited, status=23/n/a
Apr 10 13:00:06 gitlab2002 systemd[1]: rsync-data-backup-gitlab1003.wikimedia.org.service: Failed with result 'exit-code'.
Apr 10 13:00:06 gitlab2002 systemd[1]: Failed to start rsync GitLab data backup primary to a secondary server.
@Volker_E re this ticket and your Slack message:
Tue, Apr 9
I've done some more research regarding self-building the dockerfile-frontend image. I compared the upstream image and the WMF image, and it's quite obvious that they are different images.
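Roughly how the two can be compared, as a sketch; the WMF registry path and the tag are assumptions:

# Pull both images and compare their IDs and build dates:
docker pull docker/dockerfile:1.7.0
docker pull docker-registry.wikimedia.org/buildkit/dockerfile:1.7.0

docker inspect --format '{{.Id}} {{.Created}}' docker/dockerfile:1.7.0
docker inspect --format '{{.Id}} {{.Created}}' \
  docker-registry.wikimedia.org/buildkit/dockerfile:1.7.0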
Apr 8 2024
The service on the host was down for 5 minutes; I'll resolve the task as it's up again. We can investigate further if that happens again.
Logrotate failed on moscovium:
Apr 08 00:01:49 moscovium logrotate[4171620]: error: error running shared postrotate script for '/var/log/apache2/*.log '
Apr 08 00:01:56 moscovium systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Apr 08 00:01:56 moscovium systemd[1]: logrotate.service: Failed with result 'exit-code'.
Apr 08 00:01:56 moscovium systemd[1]: Failed to start Rotate log files.
Apr 5 2024
This should be fixed now. The WMF known_hosts file contains gitlab.wikimedia.org with the ecdsa-sha2-nistp256 algorithm only:
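A quick way to verify this, as a sketch; the known_hosts path is an assumption and depends on where the file is managed:

# Show which keys the known_hosts file holds for the GitLab host:
ssh-keygen -F gitlab.wikimedia.org -f /etc/ssh/ssh_known_hosts

# Compare against the key the server currently presents:
ssh-keyscan -t ecdsa gitlab.wikimedia.org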
Apr 4 2024
Apr 3 2024
Related to a bigger Kubernetes incident (T361706).
One important note (thanks @eoghan for pointing this out): GitLab HA is marked as a premium feature here. The 2000-user reference architecture and zero-downtime upgrades are marked as "free", so we have to double-check which features are premium and which are free.
Apr 2 2024
The new puppetserver looks fine. I unregistered and re-registered one runner (runner-1026.gitlab-runners.eqiad1.wikimedia.cloud) and it looks good. Also, the private profile::gitlab::runner::token is correct.
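For reference, the manual equivalent of that re-registration, as a sketch; the executor and image are assumptions, and the token value would come from the private profile::gitlab::runner::token:

# Drop the old registration, then register against the instance again:
sudo gitlab-runner unregister --name runner-1026.gitlab-runners.eqiad1.wikimedia.cloud
sudo gitlab-runner register --non-interactive \
  --url https://gitlab.wikimedia.org \
  --registration-token "$RUNNER_TOKEN" \
  --executor docker \
  --docker-image debian:bookworm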
The Trusted Dockerfile Runner gitlab-runner2004 is available now. The first project allowed to use this runner is buildkit. I merged the change above to also build the dockerfile-frontend image in CI, which should be a good test.
Apr 1 2024
Mar 29 2024
Mar 28 2024
This was because of the upgrade in T361165.
Mar 27 2024
This is related to maintenance in T360759.
Mar 26 2024
For details, see T303534#9660437.