dduvall (Dan Duvall)
Staff Software Engineer

User Details

User Since
Oct 7 2014, 4:24 PM (450 w, 5 d)
Availability
Available
IRC Nick
marxarelli
LDAP User
Dduvall
MediaWiki User
DDuvall (WMF) [ Global Accounts ]

Recent Activity

Apr 12 2023

dduvall added a comment to T334254: Error when using multi arch build on gitlab with blubber and kokkuri.
  • which is I think executed within the arm64 image and then are causing the error: standard_init_linux.go:219: exec user process caused: exec format error. My understanding is that if I want to use copies I must be setting lives.in, in my cases all the folders/users are existing, would there be a way to skip this detection and tell blubber to trust my image?
Apr 12 2023, 7:42 PM · Release-Engineering-Team (Priority Backlog 📥), GitLab

Apr 7 2023

dduvall edited projects for T334254: Error when using multi arch build on gitlab with blubber and kokkuri, added: Release-Engineering-Team (Priority Backlog 📥); removed Release-Engineering-Team.
Apr 7 2023, 10:34 PM · Release-Engineering-Team (Priority Backlog 📥), GitLab
dduvall added a comment to T334254: Error when using multi arch build on gitlab with blubber and kokkuri.

@dcausse thanks for filing this.

Apr 7 2023, 10:32 PM · Release-Engineering-Team (Priority Backlog 📥), GitLab
dduvall closed T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push as Resolved.

Confirmed that we can now push blubber's multi platform image to our registry. See https://gitlab.wikimedia.org/repos/releng/blubber/-/jobs/90258

Apr 7 2023, 9:59 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops

Mar 29 2023

dduvall updated the task description for T333483: Less_Exception_Parser: File `resources/lib/codex-design-tokens/theme-wikimedia-ui-legacy.less` not found. in interface.less.
Mar 29 2023, 6:42 PM · Anti-Harassment (AHaT Sprint 28: The Mad Hatter Hat), MW-1.41-notes (1.41.0-wmf.3; 2023-04-03), Growth-Team, Timeless, GrowthExperiments, Wikimedia-production-error
dduvall added projects to T333483: Less_Exception_Parser: File `resources/lib/codex-design-tokens/theme-wikimedia-ui-legacy.less` not found. in interface.less: GrowthExperiments, Timeless.
Mar 29 2023, 6:42 PM · Anti-Harassment (AHaT Sprint 28: The Mad Hatter Hat), MW-1.41-notes (1.41.0-wmf.3; 2023-04-03), Growth-Team, Timeless, GrowthExperiments, Wikimedia-production-error
dduvall created T333483: Less_Exception_Parser: File `resources/lib/codex-design-tokens/theme-wikimedia-ui-legacy.less` not found. in interface.less.
Mar 29 2023, 6:39 PM · Anti-Harassment (AHaT Sprint 28: The Mad Hatter Hat), MW-1.41-notes (1.41.0-wmf.3; 2023-04-03), Growth-Team, Timeless, GrowthExperiments, Wikimedia-production-error
dduvall added a comment to T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.

I was able to reproduce the problem locally and it has to do with how nginx (doesn't) apply the set $auth_request_path to the nested location block that matches exact manifest/blob digests. From the commit message of the associated patch:

Mar 29 2023, 5:55 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops

Mar 28 2023

dduvall added a comment to T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.

Thank you, @JMeybohm. That's very helpful.

Mar 28 2023, 9:43 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops

Mar 27 2023

dduvall renamed T333047: Gitlab CI failure: image pull failed: Back-off pulling image "docker-hub-mirror.staging.cloud.releng.team/library/rust:latest" from Gitlab CI failure: ContainersNotReady: "containers with unready status: [build helper istio-proxy]" to Gitlab CI failure: image pull failed: Back-off pulling image "docker-hub-mirror.staging.cloud.releng.team/library/rust:latest".
Mar 27 2023, 4:34 PM · Release-Engineering-Team, GitLab (CI & Job Runners)

Mar 22 2023

dduvall created T332804: Isolate buildkitd build container namespaces on trusted runners.
Mar 22 2023, 5:41 PM · Release-Engineering-Team (They Live 🕶️🧟), serviceops-collab
dduvall added a comment to T322579: give releng access to logs to debug buildkit-to-wmf-registry publishing.

Friendly ping. :)

Mar 22 2023, 4:09 PM · serviceops-radar, Release-Engineering-Team (Radar), serviceops-collab
dduvall added a comment to T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.

I'm finally circling back to this, and I ran another test yesterday. The additional logging in jwt-authorizer confirms that this is an auth failure due to the token scope not matching the request URL. However, the reason for this mismatch is strange. (See below.)

Mar 22 2023, 4:07 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops
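
To illustrate the kind of mismatch described in the comment above, here is a minimal sketch of a scope check against a registry request path. It is not the jwt-authorizer implementation: the function names are hypothetical, and the only thing borrowed from a real interface is the Docker Registry HTTP API V2 path layout (/v2/<name>/manifests/..., /v2/<name>/blobs/...).

```python
import re

# Registry API V2 paths look like /v2/<name>/manifests/<ref> or /v2/<name>/blobs/<digest>.
_REGISTRY_PATH = re.compile(r"^/v2/(?P<name>.+)/(?:manifests|blobs|tags)/")


def repository_from_path(path):
    """Return the repository name embedded in a registry request path, or None."""
    match = _REGISTRY_PATH.match(path)
    return match.group("name") if match else None


def scope_allows(token_repository, request_path):
    """Hypothetical check: the token's repository must equal the one in the path."""
    return repository_from_path(request_path) == token_repository


if __name__ == "__main__":
    repo = "repos/releng/blubber"
    print(scope_allows(repo, "/v2/repos/releng/blubber/manifests/sha256:abc"))
    # If the path handed to the authorizer is not the real request path,
    # the same token is rejected even though the client did nothing wrong.
    print(scope_allows(repo, ""))
```

Under this sketch, any component that forwards an empty or different path to the authorizer produces an auth failure that looks like a scope mismatch.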

Mar 20 2023

dduvall closed T327416: Mitigate thundering herd on GitLab runners as Resolved.

The last few rounds of load testing, whereby 100 concurrent image-build jobs are triggered at once, have all been successful in that:

Mar 20 2023, 8:16 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄), GitLab (CI & Job Runners)

Mar 9 2023

dduvall added a comment to T331497: [Java] client pipeline tests fail with pipeline CPS method mismatches .

Just noting that the specific CPS mismatch here is the use of a CPS transformed method in a constructor. Constructors are never CPS transformed.

Mar 9 2023, 5:21 PM · Release Pipeline (Blubber), ci-test-error, Metrics-Platform-Planning (Metrics Platform Kanban)

Mar 1 2023

dduvall closed T330433: Isito interferes with HTTP traffic from buildkitd build containers as Resolved.

Deployed.

Mar 1 2023, 5:34 PM · GitLab (CI & Job Runners), Release-Engineering-Team (GitLab V: Event Horizon 🌄)
dduvall added a comment to T330433: Isito interferes with HTTP traffic from buildkitd build containers.

I did some more investigation on this to better understand how exactly Istio is interfering with the traffic from the network namespaces on the buildkit0 bridge interface, and in doing so discovered the traffic.sidecar.istio.io/kubevirtInterfaces annotation that solves this problem in a less hacky way—or at least in an official Istio hacky way as the resulting iptables PREROUTING chain looks similar to what we added.

Mar 1 2023, 12:53 AM · GitLab (CI & Job Runners), Release-Engineering-Team (GitLab V: Event Horizon 🌄)

Feb 28 2023

dduvall claimed T330433: Isito interferes with HTTP traffic from buildkitd build containers.
Feb 28 2023, 7:58 PM · GitLab (CI & Job Runners), Release-Engineering-Team (GitLab V: Event Horizon 🌄)

Feb 24 2023

dduvall updated the task description for T330520: Test out alternative k8s platforms for gitlab-cloud-runner.
Feb 24 2023, 6:55 PM · Release-Engineering-Team (They Live 🕶️🧟)
dduvall created T330520: Test out alternative k8s platforms for gitlab-cloud-runner.
Feb 24 2023, 6:54 PM · Release-Engineering-Team (They Live 🕶️🧟)

Feb 23 2023

dduvall added a comment to T330433: Isito interferes with HTTP traffic from buildkitd build containers.

Awesome!

Feb 23 2023, 9:57 PM · GitLab (CI & Job Runners), Release-Engineering-Team (GitLab V: Event Horizon 🌄)

Feb 21 2023

dduvall created T330239: Reggie raises frequent "sqlite3.OperationalError: database is locked" under high load.
Feb 21 2023, 10:10 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄)
dduvall closed T329213: Refactor buildkitd deployment for better build container isolation as Resolved.

BuildKit has been redeployed. It now runs in privileged mode and has CNI configured for network isolation of build containers.

Feb 21 2023, 8:13 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄)

Feb 17 2023

dduvall added a comment to T329216: GitLab CI jobs failing with "You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit".

For starters, perhaps we can create an account on docker.io for Release Engineering and generate a public puller token to use on WMCS runners. According to the GitLab CI runner docs, the docker executor should respect the ~/.docker/config.json for the runner's user. In this scenario, the creds would not be leaked to jobs.

Feb 17 2023, 7:10 PM · mwcli, Release-Engineering-Team, mwbot-rs, GitLab (CI & Job Runners)
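
As a rough sketch of the proposal above, the runner user's ~/.docker/config.json could carry the puller credentials in the standard "auths" form, where the "auth" value is the base64 encoding of "username:token". The account name and token below are placeholders, and the helper name is hypothetical.

```python
import base64
import json

# Registry key Docker uses for Docker Hub credentials in config.json.
DOCKER_HUB_REGISTRY = "https://index.docker.io/v1/"


def docker_config(username, token, registry=DOCKER_HUB_REGISTRY):
    """Build the dict to serialize as ~/.docker/config.json for one registry."""
    credentials = f"{username}:{token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"auths": {registry: {"auth": encoded}}}


if __name__ == "__main__":
    # Placeholder values; a real deployment would read the token from a secret store.
    print(json.dumps(docker_config("example-puller", "example-token"), indent=2))
```

Because the file lives in the runner user's home directory rather than in job variables, this layout matches the point that the credentials would not be exposed to jobs.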

Feb 16 2023

dduvall closed T325586: 1.40.0-wmf.23 deployment blockers as Resolved.
Feb 16 2023, 10:34 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
dduvall claimed T329213: Refactor buildkitd deployment for better build container isolation.
Feb 16 2023, 8:45 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄)

Feb 15 2023

dduvall added a comment to T325586: 1.40.0-wmf.23 deployment blockers.

Thanks very much for your help today, @cscott and @ssastry. I've re-rolled group1.

Feb 15 2023, 11:33 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
dduvall added a comment to T325586: 1.40.0-wmf.23 deployment blockers.

Thanks! I will merge and deploy the backport. I might wait until tomorrow to re-roll train to group1, depending on my own time constraints.

Feb 15 2023, 10:43 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
dduvall added a comment to T325586: 1.40.0-wmf.23 deployment blockers.

Let me consult with Scott and we'll respond here with a proposal.

Feb 15 2023, 9:04 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
dduvall added a comment to T325586: 1.40.0-wmf.23 deployment blockers.

So, in terms of decision, what we need to figure out is whether we suppress logs as in bullet point #1 OR if we should try to fix the code and try to get it on the train. Since this is in the Parsoid repo, it is more involved and requires a new tag, vendor release, patch to update vendor in core, etc.

Feb 15 2023, 9:00 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
dduvall added a comment to T325586: 1.40.0-wmf.23 deployment blockers.

Ya, so, 3 things:

  • Suppressing notices as in T329740#8618948 gets rid of the noise (and suppresses the spike)
  • The other errors Scott referenced and one I was concerned about are only a handful and seem to be seen during rollback (afaict) and are transient ones.
  • The UTF-8 errors are also all a handful and all old known things which could potentially be bad wikitext being posted -- nothing specific to this train.

So, in terms of decision, what we need to figure out is whether we suppress logs as in bullet point #1 OR if we should try to fix the code and try to get it on the train. Since this is in the Parsoid repo, it is more involved and requires a new tag, vendor release, patch to update vendor in core, etc.

Feb 15 2023, 8:56 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
dduvall added a comment to T325586: 1.40.0-wmf.23 deployment blockers.

@cscott I'm seeing a large spike in errors from Parsoid today. See https://logstash.wikimedia.org/goto/a7eb8ccc8cf7a123b9577e32e98f1b1c

Feb 15 2023, 7:29 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
dduvall added a comment to T327416: Mitigate thundering herd on GitLab runners.

Summary of our current solution:

Feb 15 2023, 4:39 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄), GitLab (CI & Job Runners)

Feb 14 2023

dduvall renamed T329213: Refactor buildkitd deployment for better build container isolation from Try deploying buildkitd as a GitLab CI service to Refactor buildkitd deployment for better build container isolation.
Feb 14 2023, 8:19 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄)
dduvall added a comment to T278365: CVE-2023-29138: Using checkuser api module with bad user name can still break Special:CheckUserLog even after security fixes.

I should clarify. I did not refactor the security patch implementation, only the context so it would apply cleanly.

Feb 14 2023, 6:13 PM · Security-Team, SecTeam-Processed, CheckUser, Security
dduvall added a comment to T278365: CVE-2023-29138: Using checkuser api module with bad user name can still break Special:CheckUserLog even after security fixes.

I've refactored the patch following application failure during scap stage-train.

Feb 14 2023, 6:12 PM · Security-Team, SecTeam-Processed, CheckUser, Security

Feb 10 2023

thcipriani awarded T329266: Debian security update for git silently broke mediawiki-i18n-check-docker a Yellow Medal token.
Feb 10 2023, 10:34 PM · Release-Engineering-Team (Priority Backlog 📥), Vuln-Misconfiguration, SecTeam-Processed, Security, Security-Team
dduvall added a comment to T329266: Debian security update for git silently broke mediawiki-i18n-check-docker.

Hey @dduvall - Thanks for looking into this! +1 to your patch above and yes, it should be fine to push through gerrit since the "secret" static analysis script has been public for a long time :)

Feb 10 2023, 9:48 PM · Release-Engineering-Team (Priority Backlog 📥), Vuln-Misconfiguration, SecTeam-Processed, Security, Security-Team
dduvall claimed T329266: Debian security update for git silently broke mediawiki-i18n-check-docker.
Feb 10 2023, 8:34 PM · Release-Engineering-Team (Priority Backlog 📥), Vuln-Misconfiguration, SecTeam-Processed, Security, Security-Team
dduvall added a comment to T329266: Debian security update for git silently broke mediawiki-i18n-check-docker.

Patch file for integration/config ^. I assume it should be fine to submit this for review since I've already updated the job in Jenkins, but I thought I'd check with you first, @sbassett.

Feb 10 2023, 8:32 PM · Release-Engineering-Team (Priority Backlog 📥), Vuln-Misconfiguration, SecTeam-Processed, Security, Security-Team
dduvall added a comment to T329266: Debian security update for git silently broke mediawiki-i18n-check-docker.

I've updated the job from changes I made locally and it seems to work as expected.

Feb 10 2023, 8:27 PM · Release-Engineering-Team (Priority Backlog 📥), Vuln-Misconfiguration, SecTeam-Processed, Security, Security-Team
dduvall added a comment to T329266: Debian security update for git silently broke mediawiki-i18n-check-docker.

Given this job is run inside a container as user nobody (which is the cause of the error since the repo is mounted and has different ownership), I think it's probably safe to add git config --global --add safe.directory to the script, but the checks (the entire script really) seem quite difficult to reason about and the status of the git command should be handled separately from those of sed/grep.

Feb 10 2023, 7:52 PM · Release-Engineering-Team (Priority Backlog 📥), Vuln-Misconfiguration, SecTeam-Processed, Security, Security-Team
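
The two points in the comment above (mark the mounted repo as safe, and handle git's exit status separately from the filtering step) can be sketched as follows. This is an illustration, not the actual job script: the function name, the ls-files call, and the suffix filter are assumptions, and passing the repo path to safe.directory is likewise assumed since the comment does not give the argument.

```python
import subprocess


def tracked_files(repo_dir, suffix, run=subprocess.run):
    """List tracked files ending in `suffix`, failing loudly if git itself fails."""
    # Trust the mounted checkout even though it is owned by another user.
    run(["git", "config", "--global", "--add", "safe.directory", repo_dir],
        check=True)
    # check=True raises CalledProcessError on a git failure, so a broken git
    # call can never be mistaken for "no matching files".
    listing = run(["git", "-C", repo_dir, "ls-files"],
                  check=True, capture_output=True, text=True)
    # Filtering happens in Python; an empty result is a normal outcome here,
    # unlike a grep pipeline where "no match" also yields a non-zero status.
    return [name for name in listing.stdout.splitlines() if name.endswith(suffix)]
```

The `run` parameter exists only so the function can be exercised without a real repository.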

Feb 8 2023

dduvall created T329213: Refactor buildkitd deployment for better build container isolation.
Feb 8 2023, 6:12 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄)

Jan 19 2023

dduvall claimed T327416: Mitigate thundering herd on GitLab runners.
Jan 19 2023, 6:58 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄), GitLab (CI & Job Runners)
dduvall closed T287211: Figure out the future of (or replacements for) PipelineLib in a GitLab world as Resolved.

This is done. Evaluation has long been done and we have a working image build system that can also publish (and soon will perform test deployments).

Jan 19 2023, 5:10 PM · Release-Engineering-Team (GitLab IV: Mise En Place 🍱), SecTeam-Processed, Security-Team, GitLab (CI & Job Runners), User-brennen, Release Pipeline
dduvall removed a subtask for T287211: Figure out the future of (or replacements for) PipelineLib in a GitLab world: T327332: Make a tool to convert .pipeline/config.yaml to .gitlab-ci.yaml.
Jan 19 2023, 5:08 PM · Release-Engineering-Team (GitLab IV: Mise En Place 🍱), SecTeam-Processed, Security-Team, GitLab (CI & Job Runners), User-brennen, Release Pipeline
dduvall removed a parent task for T327332: Make a tool to convert .pipeline/config.yaml to .gitlab-ci.yaml: T287211: Figure out the future of (or replacements for) PipelineLib in a GitLab world.
Jan 19 2023, 5:08 PM · Release-Engineering-Team (They Live 🕶️🧟), GitLab (Project Migration)

Jan 18 2023

dduvall added a comment to T316519: Flink application and flink-kubernetes-operator production docker images.

Should we change this? Should we set the runs.as to something different when building images based off of the production-images flink image with blubber?

Jan 18 2023, 4:18 PM · Event-Platform Value Stream (Sprint 07), Patch-For-Review, Data-Engineering-Planning

Jan 6 2023

dduvall closed T322344: Move cloud runner CI jobs to trusted runners as Resolved.

You would have needed that with the previous configuration as well; the updates config only makes this easier on the SRE end. I have set up a systemd-nspawn container which has the terraform source config. So when we need an update I can simply run "apt-get download terraform" and import the deb on apt1001.

Jan 6 2023, 7:42 PM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)
dduvall closed T322344: Move cloud runner CI jobs to trusted runners, a subtask of T297426: Provision untrusted instance-wide GitLab job runners to handle user-level projects and merge requests from forks, as Resolved.
Jan 6 2023, 7:41 PM · Release-Engineering-Team (Priority Backlog 📥), serviceops-collab, User-brennen, GitLab (CI & Job Runners)
dduvall added a comment to T325385: Trusted gitlab runner containers need access to staging k8s cluster.

(Orthogonal but worth a discussion in our next meeting.)

Jan 6 2023, 6:59 PM · Kubernetes, serviceops, serviceops-collab, GitLab
dduvall closed T325580: 1.40.0-wmf.17 deployment blockers as Resolved.
Jan 6 2023, 5:44 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments

Dec 16 2022

dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.

I'm thinking the best course for us at this point might be to rely on binaries verified by checksum for now and not the upstream deb package.

Nah, let's not do that. thirdparty/terraform was missing because the older config caused issues with reprepro and had been reverted. I have now merged a patch to readd thirdparty/terraform and import terraform 1.3.6, please give your image build another shot.

Dec 16 2022, 6:00 PM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)

Dec 15 2022

dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.

I'm thinking the best course for us at this point might be to rely on binaries verified by checksum for now and not the upstream deb package.

Dec 15 2022, 7:08 PM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)
dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.

We're now seeing errors during our image build. It seems the component and package may no longer be in our repo.

Dec 15 2022, 7:06 PM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)
dduvall added a comment to T325069: Align the GitLab runner tags.

One more thought: Since the DigitalOcean runners will be instance wide starting in the new year and will be configured to run untagged jobs, I'm not sure their tags will matter as much for the most general use cases.

Dec 15 2022, 6:56 PM · GitLab, Release-Engineering-Team, serviceops-collab
dduvall added a comment to T325069: Align the GitLab runner tags.

I know that I am a leading contributor to this problem, but "cloud" is pretty closely tied to Wikimedia Cloud Services within the technical side of the movement. Could we find a different keyword to use to describe the untrusted runners?

I opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/92. See also my comment in the MR. I have some concerns about using a provider-specific tag because that means end users have to change CI configurations when we switch to a different provider. After thinking a bit more about tagging for public cloud runners, I would like to keep using cloud as the primary tag. I think cloud is the best current description for public cloud providers without being too technical for end users. Furthermore, the term/tag cloud was already announced in the latest monthly Tech Department Updates.

I understand that people with more history and time in the foundation may have a different opinion about that, especially because there was no other "cloud" at that time.

We can add a second, provider specific tag like digitalocean for easier maintenance.

I would vote for

Trusted runners: trusted
Shared runners in public cloud: cloud, kubernetes, optional digitalocean
Shared runners in WMCS: wmcs

Dec 15 2022, 6:53 PM · GitLab, Release-Engineering-Team, serviceops-collab

Dec 8 2022

dduvall added a comment to T323394: Add Gitlab JWT support to Reggie.

Which way are we going with this? FWIW I think fronting with nginx would allow offloading of both auth (via jwt-authorizer) and GET/HEAD request caching. Honestly, if we could get the existing nginx ingress working for internal requests, it could handle all of that.

Dec 8 2022, 6:00 PM · Patch-For-Review, Release-Engineering-Team (GitLab V: Event Horizon 🌄)
dduvall added a comment to P42318 (An Untitled Masterwork).

It's possible for a blubber.yaml to be in a subdirectory of .pipeline as well. Could you run the command again with a less restrictive pattern, maybe just grep -iq blubber.yaml?

Dec 8 2022, 4:36 PM
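
The effect of loosening the pattern, as suggested above, can be sketched like this. The strict pattern is a hypothetical stand-in (the pattern actually used in the paste is not shown here); the loose one mirrors grep -i blubber.yaml, including the fact that an unescaped dot matches any character.

```python
import re

# Hypothetical stand-in for a pattern that only accepts the top-level file.
STRICT = re.compile(r"^\.pipeline/blubber\.yaml$")
# Equivalent in spirit to `grep -i blubber.yaml`: case-insensitive, unanchored.
LOOSE = re.compile(r"blubber.yaml", re.IGNORECASE)


def uses_blubber(paths, pattern=LOOSE):
    """Return True if any path in the listing matches the pattern."""
    return any(pattern.search(path) for path in paths)


if __name__ == "__main__":
    listing = [".pipeline/service/blubber.yaml", "README.md"]
    print(uses_blubber(listing, STRICT), uses_blubber(listing, LOOSE))
```

The unanchored form trades a few possible false positives for not missing configs nested under .pipeline.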
dduvall closed T324361: Provision Ingress with TLS for reggie to allow image reuse in pipelines as Resolved.

Paired with @dancy and @jeena on this today.

Dec 8 2022, 12:29 AM · Release-Engineering-Team (GitLab III: GitLab in LA 🪃)

Dec 2 2022

dduvall claimed T324361: Provision Ingress with TLS for reggie to allow image reuse in pipelines.
Dec 2 2022, 7:38 PM · Release-Engineering-Team (GitLab III: GitLab in LA 🪃)
dduvall created T324361: Provision Ingress with TLS for reggie to allow image reuse in pipelines.
Dec 2 2022, 7:38 PM · Release-Engineering-Team (GitLab III: GitLab in LA 🪃)

Dec 1 2022

dduvall closed T324037: Upgrade jwt-authorizer on all registry hosts, a subtask of T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push, as Resolved.
Dec 1 2022, 5:16 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops
dduvall closed T324037: Upgrade jwt-authorizer on all registry hosts as Resolved.

Done. See T322691#8433642

Dec 1 2022, 5:16 PM · serviceops-collab, GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥)

Nov 30 2022

dduvall reopened T322691: Build and import new release of jwt-authorizer (1.1.0) as "Open".

@Jelto or @Dzahn we'll need this built for buster as well since the registry hosts are all buster based.

Nov 30 2022, 4:23 PM · GitLab (CI & Job Runners), serviceops-collab, Release-Engineering-Team (Priority Backlog 📥)
dduvall reopened T322691: Build and import new release of jwt-authorizer (1.1.0), a subtask of T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push, as Open.
Nov 30 2022, 4:23 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops

Nov 29 2022

dduvall created T324037: Upgrade jwt-authorizer on all registry hosts.
Nov 29 2022, 4:38 PM · serviceops-collab, GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥)

Nov 17 2022

dduvall moved T323147: Try DigitalOcean object storage for buildkit caching from Ready to Done on the Release-Engineering-Team (GitLab III: GitLab in LA 🪃) board.
Nov 17 2022, 9:09 PM · GitLab (CI & Job Runners), Release-Engineering-Team (GitLab III: GitLab in LA 🪃)
dduvall closed T323147: Try DigitalOcean object storage for buildkit caching as Declined.

The buildkitd deployment running in cloud-runner now has access to the DO Spaces credentials. However, there's a pretty large snafu here that I ran into while testing with a blubber MR.

Nov 17 2022, 9:08 PM · GitLab (CI & Job Runners), Release-Engineering-Team (GitLab III: GitLab in LA 🪃)
dduvall closed T323147: Try DigitalOcean object storage for buildkit caching, a subtask of T323140: Ensure efficient Gitlab CI operations for scap, as Declined.
Nov 17 2022, 9:08 PM · Release-Engineering-Team (GitLab III: GitLab in LA 🪃), Scap

Nov 15 2022

dduvall created T323164: Provision Horizontal Pod Autoscaler (HPA) for GitLab cloud runners.
Nov 15 2022, 9:23 PM · GitLab (CI & Job Runners), Release-Engineering-Team (GitLab III: GitLab in LA 🪃)
dduvall added a comment to T308615: Add DigitalOcean resource monitoring for cloud runner nodes.

I recently enabled the prometheus metrics exporter and prometheus operator in the gitlab-runner chart.

Nov 15 2022, 9:10 PM · Release-Engineering-Team (GitLab V: Event Horizon 🌄), User-brennen, GitLab (CI & Job Runners)
dduvall edited Description on Release-Engineering-Team (GitLab III: GitLab in LA 🪃).
Nov 15 2022, 8:59 PM

Nov 10 2022

dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.

@dduvall It's available for bullseye now:

[apt1001:/tmp] $ sudo -E reprepro ls terraform
terraform | 1.3.4 | bullseye-wikimedia | amd64

\o/ thank you!!!

Nov 10 2022, 1:24 AM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)
dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.

@dduvall The way you describe above works but it seems to me that is simply because it skips the "signed by" line in APT sources and does not verify. It would also work if you skip the installation of the gpg key entirely.

Nov 10 2022, 1:24 AM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)
dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.

Would it be ok to add GetInRelease: no to the reprepro updates file? According to the manpage:

Nov 10 2022, 12:38 AM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)
dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.

I opened a ticket with upstream. (#89323)

Nov 10 2022, 12:22 AM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)

Nov 9 2022

dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.

Hmmh, this is quite strange. There's no obvious error in the config (but the claimed signed on 1970 may be the culprit) and the key has been imported correctly on apt1001. This will need some further debugging, I'll try to get to it in the next days, currently quite busy.

@Dzahn In the meantime, to unblock the work, could you fetch the deb via secure apt (by following the upstream docs to set up the apt source and then running "apt-get download terraform") and then import the package using "reprepro includedeb"?

Nov 9 2022, 7:48 PM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)

Nov 8 2022

dduvall created T322691: Build and import new release of jwt-authorizer (1.1.0).
Nov 8 2022, 8:38 PM · GitLab (CI & Job Runners), serviceops-collab, Release-Engineering-Team (Priority Backlog 📥)
dduvall added a comment to T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.

I enabled debug logging for buildkitd on the gitlab-runner hosts and re-ran the failed job. It appears that auth failure may be the root cause: the client is definitely misbehaving, but there should not be an auth failure here. It's unclear why.

Nov 8 2022, 8:15 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops
dduvall added a comment to T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.

@JMeybohm can you provide the nginx access log entries from that time period as well? I'm trying to rule out auth failure as a factor and docker-registry log entries do not include the subrequests between nginx and jwt-authorizer.

Nov 8 2022, 7:42 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops
dduvall added a comment to T322579: give releng access to logs to debug buildkit-to-wmf-registry publishing.

[...]
Also: both the registry and nginx keep access logs, so I guess it's enough to export one of the two.

I'd say export those from docker-registry. It writes access log plus additional stuff that might be useful

Nov 8 2022, 7:40 PM · serviceops-radar, Release-Engineering-Team (Radar), serviceops-collab
dduvall renamed T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push from WMF container registry does not accept a manifest list (aka OCI manifest index, or "fat" manifest) to Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.
Nov 8 2022, 5:28 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops
dduvall updated subscribers of T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.

Thanks for debugging this further, @JMeybohm and @hashar. In your recent re-run of the job it seems the manifest list is accepted by the registry, and it's the subsequent manifest push that fails, so you're right: this doesn't seem to be related to a lack of manifest list support but perhaps to some errant behavior on buildkit's side that manifests under this multi-platform push condition. I'll re-word and triage accordingly.

Nov 8 2022, 5:21 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops
dduvall added a comment to T322344: Move cloud runner CI jobs to trusted runners.
...
ERROR: Condition 'DA418C88A3219F7B' not fulfilled for '/srv/wikimedia/lists/thirdparty%2Fterraform-bullseye_bullseye_InRelease'.
Signatures in '/srv/wikimedia/lists/thirdparty%2Fterraform-bullseye_bullseye_InRelease':
'DA418C88A3219F7B' (signed 1970-01-01): bad signature
Error: Not enough signatures found for remote repository thirdparty/terraform-bullseye (https://apt.releases.hashicorp.com bullseye)!
There have been errors!
Nov 8 2022, 4:55 PM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)

Nov 7 2022

dduvall updated subscribers of T322344: Move cloud runner CI jobs to trusted runners.

Thanks for the review/merge, @MoritzMuehlenhoff and @Dzahn! I don't see the packages yet but I'm assuming the actual import is a manual step?

Nov 7 2022, 9:54 PM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)
dduvall added a comment to T318382: Upgrade docker on integration hosts for fixes to BuildKit builder.

Thanks, Moritz. Can we do the same for buster, pulling in the newer 20.10.18~3~debian-buster packages as well? I think using the same exact version on contint and integration agents is probably best.

Nov 7 2022, 9:52 PM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Epic, Release Pipeline (Blubber)
dduvall added a project to T322579: give releng access to logs to debug buildkit-to-wmf-registry publishing: Release-Engineering-Team (Radar).
Nov 7 2022, 9:40 PM · serviceops-radar, Release-Engineering-Team (Radar), serviceops-collab
dduvall updated subscribers of T322579: give releng access to logs to debug buildkit-to-wmf-registry publishing.

Thanks for filing this!

Nov 7 2022, 9:39 PM · serviceops-radar, Release-Engineering-Team (Radar), serviceops-collab
dduvall awarded T322579: give releng access to logs to debug buildkit-to-wmf-registry publishing a Love token.
Nov 7 2022, 9:28 PM · serviceops-radar, Release-Engineering-Team (Radar), serviceops-collab
dduvall added a comment to T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.

I think this might require an update of our docker-registry (which we have not planned for currently). Is this blocking something apart from testing with multi arch images (T272500)?

Nov 7 2022, 5:22 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops

Nov 5 2022

dduvall added a comment to T321316: Self-build and publish buildkit helper images.

In the last IC sync meeting we discussed that it makes more sense to add the self-hosted dockerfile/copy image to production-images first instead of hosting that separately in GitLab.

Nov 5 2022, 6:15 PM · Release-Engineering-Team, GitLab (CI & Job Runners), serviceops-collab

Nov 4 2022

dduvall added a comment to T321316: Self-build and publish buildkit helper images.

This may be blocked by T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push, which is preventing the publishing of multi-platform images. The dockerfile-copy image needs to be multi-platform to achieve parity with the upstream version.

Nov 4 2022, 11:06 PM · Release-Engineering-Team, GitLab (CI & Job Runners), serviceops-collab
dduvall created T322453: Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push.
Nov 4 2022, 11:03 PM · Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥), serviceops

Nov 3 2022

dduvall created T322344: Move cloud runner CI jobs to trusted runners.
Nov 3 2022, 4:39 PM · Patch-For-Review, serviceops-collab, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)

Nov 1 2022

dduvall added a comment to T318866: "qemu: uncaught target signal 11" building local dev container on M1 Mac with Docker Desktop.

See https://gitlab.wikimedia.org/repos/releng/blubber/-/merge_requests/15 for multi-platform support in Blubber, currently under review.

Nov 1 2022, 10:59 PM · Patch-For-Review, ARM support, Wikimedia-Developer-Portal
dduvall added a comment to T318866: "qemu: uncaught target signal 11" building local dev container on M1 Mac with Docker Desktop.

Reading T321316: Self-build and publish buildkit helper images clued me in that the copy action where the crash is happening is very likely done via the docker/dockerfile-copy image, which buildkit uses as a helper.

Nov 1 2022, 3:47 PM · Patch-For-Review, ARM support, Wikimedia-Developer-Portal
dduvall added a comment to T321316: Self-build and publish buildkit helper images.

Thanks @Joe for the analysis! That pushed me in the right direction.

I also found the code/project for the copy binary here: https://github.com/tonistiigi/copy . There is also a Dockerfile in there.

I'll try to migrate that to a GitLab project, in the best case with a job to build the copy binary; otherwise we can import it. First I have to verify whether that's the correct code repository and what the copy tool is doing.

Nov 1 2022, 3:42 PM · Release-Engineering-Team, GitLab (CI & Job Runners), serviceops-collab

Oct 25 2022

thcipriani awarded T301168: Migrate Blubber project to GitLab a Barnstar token.
Oct 25 2022, 10:28 PM · Release-Engineering-Team (GitLab II: Wrath of Kahn 👾), User-dduvall, GitLab (Project Migration)

Oct 24 2022

dduvall closed T301168: Migrate Blubber project to GitLab as Resolved.
Oct 24 2022, 9:24 PM · Release-Engineering-Team (GitLab II: Wrath of Kahn 👾), User-dduvall, GitLab (Project Migration)