User Details
- User Since
- Jun 7 2021, 7:25 AM (152 w, 3 d)
- Availability
- Available
- LDAP User
- Jelto
- MediaWiki User
JWodstrcil (WMF)
Yesterday
I think this can be closed, thanks for taking and copying the notes here @debt!
A job tried to restart envoy, but envoy is not running on contint1003 (because it's a passive host).
Mon, May 6
Sun, May 5
One note: the puppet sync had been broken since the end of March due to permission issues on the directory. Some folders were owned by root instead of gitpuppet (in both the labs/private and puppet repos). I fixed the permissions using chown gitpuppet:gitpuppet -R /srv/git/operations/puppet/. Puppet is happy again on the devtools puppet server. Thanks again to @taavi for the help.
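Roughly what the fix looks like, as a sketch; the labs/private path below is an assumption and may differ on the devtools host:

# List anything under the repo that is not owned by gitpuppet
# (root-owned files are what broke the sync):
find /srv/git/operations/puppet -not -user gitpuppet -ls

# Re-own the repo recursively; repeat for the private repo
# (the second path is a guess, check where it lives on the host):
chown -R gitpuppet:gitpuppet /srv/git/operations/puppet
chown -R gitpuppet:gitpuppet /srv/git/labs/private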
The exporter runs on the test instance now. I'll enable the exporter on the prod machines and add them to Prometheus next week.
Sat, May 4
GitLab Runner configuration values are available now in the exporter:
A first metric is fetched successfully in the exporter:
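For reference, a quick way to eyeball the exporter output from the host itself; the port is an assumption, since the task doesn't state which one the exporter listens on:

# Fetch the metrics page and filter for runner-related series
# (replace 9252 with the port the exporter actually binds to):
curl -s http://localhost:9252/metrics | grep -i gitlab_runner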
Thanks @Dzahn for the troubleshooting; it was indeed a missing OIDC secret. The secret was not copied from the old to the new puppet server (because we put it in the wrong location).
Fri, May 3
Thu, May 2
Tue, Apr 30
One execution of the timer job failed due to a timeout. We have seen that multiple times before. The next execution was successful, so I'm closing the task.
Mon, Apr 29
This ran at the same time as a backup-restore on the replica (which disables GitLab's SSH service). If that happens more often, we might have to tweak the scheduling or add a proper dependency between those jobs.
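One way to avoid the overlap, sketched under the assumption that the two jobs are the systemd units backup-restore.service and rsync-data-backup-gitlab1003.wikimedia.org.service (the real unit names may differ):

# Option 1: shift the rsync timer so the two jobs no longer overlap
# (the OnCalendar value below is only a placeholder):
sudo systemctl edit rsync-data-backup-gitlab1003.wikimedia.org.timer
#   [Timer]
#   OnCalendar=
#   OnCalendar=*-*-* 14:00:00

# Option 2: order the units explicitly, so the rsync job waits for a
# running restore instead of racing it:
sudo systemctl edit rsync-data-backup-gitlab1003.wikimedia.org.service
#   [Unit]
#   After=backup-restore.service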
Fri, Apr 26
Thu, Apr 25
This happened while restarting the runner for software updates.
Expected due to the ongoing upgrade in T363349. I created a silence for backup-restore.service until tomorrow, after our maintenance window.
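For reference, one way to create such a silence with the upstream Alertmanager CLI; the matcher label and the Alertmanager URL are assumptions, not what was actually used:

# Silence alerts for the failing unit for 24 hours:
amtool --alertmanager.url=http://alertmanager:9093 silence add \
  --comment="T363349 GitLab upgrade, expected failure" \
  --duration=24h \
  'name=~".*backup-restore.*"'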
Tue, Apr 23
I've run some queries in Superset and it seems this was Amazonbot scraping Phabricator; see https://superset.wikimedia.org/superset/dashboard/p/56nOdzPB0q8/ for example.
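The gist of the query, as a sketch; the table and column names are assumptions based on the webrequest schema, not the exact Superset query:

# Run via the presto CLI (or Superset's SQL Lab) to rank user agents
# hitting Phabricator; adjust the time window as needed:
presto --execute "
  SELECT user_agent, COUNT(*) AS hits
  FROM wmf.webrequest
  WHERE uri_host = 'phabricator.wikimedia.org'
    AND user_agent LIKE '%Amazonbot%'
  GROUP BY user_agent
  ORDER BY hits DESC
  LIMIT 20"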
Thu, Apr 18
Wed, Apr 17
Tue, Apr 16
This happened due to maintenance on the wikikube cluster in T290020.
In https://gerrit.wikimedia.org/r/1019039 I tried to add more probes to the service::catalog entry for miscweb. However, the current Puppet implementation does not support multiple blackbox checks. In https://gerrit.wikimedia.org/r/1020185 I tried to add this feature, but it requires significantly more refactoring in other Puppet modules as well, so I abandoned it for now.
Mon, Apr 15
I think we are mostly settled on which runners have which kind of access to WMF and external infrastructure. Also, the permissions for these runners seem to work as expected (default access to Cloud Runners, opt-in access to Trusted Runners).
Puppet runs fail on some machines that use the new puppetmaster in devtools; here is an example from gitlab-runner-1002.devtools.eqiad1.wikimedia.cloud:
Fri, Apr 12
This was because of the update in T362298. I'm not 100% sure why this was not silenced, as the cookbook creates a downtime of 180 minutes:
Thu, Apr 11
The Trusted Dockerfile Runner is available now and first tests with building the buildkit image were successful. I also adjusted the docs and added Dockerfile support to one of the test runners.
Wed, Apr 10
Apr 10 13:00:05 gitlab2002 systemd[1]: Starting rsync GitLab data backup primary to a secondary server...
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: sending incremental file list
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: rsync: [sender] link_stat "/srv/gitlab-backup/*_gitlab_backup.tar" failed: No such file or directory (2)
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: sent 94 bytes received 20 bytes 76.00 bytes/sec
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: total size is 0 speedup is 0.00
Apr 10 13:00:06 gitlab2002 gitlab-backup-periodic-rsync.sh[383550]: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
Apr 10 13:00:06 gitlab2002 systemd[1]: rsync-data-backup-gitlab1003.wikimedia.org.service: Main process exited, code=exited, status=23/n/a
Apr 10 13:00:06 gitlab2002 systemd[1]: rsync-data-backup-gitlab1003.wikimedia.org.service: Failed with result 'exit-code'.
Apr 10 13:00:06 gitlab2002 systemd[1]: Failed to start rsync GitLab data backup primary to a secondary server.
@Volker_E re this ticket and your Slack message:
Tue, Apr 9
I've done some more research regarding self-building the dockerfile-frontend image. I compared the upstream image and the WMF image, and it's quite obvious that they are different images.
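Roughly how the two can be compared, as a sketch; the WMF registry path and the tag are assumptions:

# Pull both images and compare their IDs and build dates:
docker pull docker/dockerfile:1.7.0
docker pull docker-registry.wikimedia.org/buildkit/dockerfile:1.7.0

docker inspect --format '{{.Id}} {{.Created}}' docker/dockerfile:1.7.0
docker inspect --format '{{.Id}} {{.Created}}' \
  docker-registry.wikimedia.org/buildkit/dockerfile:1.7.0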
Apr 8 2024
The service on the host was down for 5 minutes; I'll resolve the task as it's up again. We can investigate further if that happens again.
Logrotate failed on moscovium:
Apr 08 00:01:49 moscovium logrotate[4171620]: error: error running shared postrotate script for '/var/log/apache2/*.log '
Apr 08 00:01:56 moscovium systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Apr 08 00:01:56 moscovium systemd[1]: logrotate.service: Failed with result 'exit-code'.
Apr 08 00:01:56 moscovium systemd[1]: Failed to start Rotate log files.
Apr 5 2024
This should be fixed now. The WMF known_hosts file contains gitlab.wikimedia.org with the ecdsa-sha2-nistp256 algorithm only:
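A quick way to verify this, as a sketch; the known_hosts path is an assumption and depends on where the file is managed:

# Show which keys the known_hosts file holds for the GitLab host:
ssh-keygen -F gitlab.wikimedia.org -f /etc/ssh/ssh_known_hosts

# Compare against the key the server currently presents:
ssh-keyscan -t ecdsa gitlab.wikimedia.org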
Apr 4 2024
Apr 3 2024
Related to a bigger Kubernetes incident (T361706).
One important note (thanks @eoghan for pointing this out): GitLab HA is marked as a premium feature here. The 2000-user reference architecture and zero-downtime upgrades are marked as "free", so we have to double-check which features are premium and which are free.
Apr 2 2024
The new puppetserver looks fine. I unregistered and re-registered one runner (runner-1026.gitlab-runners.eqiad1.wikimedia.cloud) and it looks good. Also, the private profile::gitlab::runner::token is correct.
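For reference, the manual equivalent of that re-registration, as a sketch; the executor and image are assumptions, and the token value would come from the private profile::gitlab::runner::token:

# Drop the old registration, then register against the instance again:
sudo gitlab-runner unregister --name runner-1026.gitlab-runners.eqiad1.wikimedia.cloud
sudo gitlab-runner register --non-interactive \
  --url https://gitlab.wikimedia.org \
  --registration-token "$RUNNER_TOKEN" \
  --executor docker \
  --docker-image debian:bookworm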
The Trusted Dockerfile Runner gitlab-runner2004 is available now. The first project allowed to use this runner is buildkit. I merged the change above to also build the dockerfile-frontend image in CI, which should be a good test.
Apr 1 2024
Mar 29 2024
Mar 28 2024
This was because of the upgrade in T361165.
Mar 27 2024
This is related to maintenance in T360759.
Mar 26 2024
For details, see T303534#9660437.