Intro
This is a task to recap and record the outcome of the 2022 SRE Summit session "Adoption of aarch64 (aka arm64) in WMF production?"
For the rest of this task, consider the terms aarch64, arm64 and arm/ARM as interchangeable. For those interested:
- aarch64 is the 64-bit execution state of the ARM CPU architecture, introduced with ARMv8-A. It is also the target name used by various GNU tools and used to be the name of an LLVM backend.
- ARM64/arm64 is the architecture name as used by various operating systems, including Debian Linux, macOS and Windows.
- ARM is the company that designs CPUs. Sometimes written as arm.
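Since these names get mixed freely in tooling, here is a minimal sketch of how a script could normalise them to the Debian architecture name. The alias table and the `debian_arch` helper are illustrative assumptions, not existing WMF code, and the table is deliberately not exhaustive.

```python
import platform

# Aliases seen in practice, mapped to the Debian architecture name.
# Illustrative only: extend the table for any other architecture you care about.
_DEBIAN_ARCH = {
    "aarch64": "arm64",  # e.g. what Linux reports on 64-bit ARM
    "arm64": "arm64",    # e.g. what macOS reports, and the Debian name itself
    "x86_64": "amd64",   # e.g. what Linux reports on 64-bit x86
    "amd64": "amd64",    # e.g. what Windows reports (as "AMD64"), and the Debian name
}


def debian_arch(machine=None):
    """Return the Debian architecture name for a machine string.

    With no argument, the local machine type from platform.machine() is used.
    """
    if machine is None:
        machine = platform.machine()
    key = machine.strip().lower()
    try:
        return _DEBIAN_ARCH[key]
    except KeyError:
        raise ValueError(f"unrecognised architecture: {machine!r}")
```

Calling `debian_arch()` without arguments reports the architecture of the host the script runs on, which is handy when deciding which package or image variant to pick.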
Problem statement
Our breakout session was split into 4 parts. The first part was a quick overview of the problem.
Tasks tagged ARM support, snippets of conversations between WMF devs, and industry announcements were showcased to help everyone understand the issue at hand: running amd64 workloads (mostly OCI images right now) under emulation on arm64, whether transparent to the user or not, is extremely slow and leads to failures due to extremely long execution times.
The current use cases identified as suffering the most:
- CI
- Deployment pipeline
- MediaWiki developers in general
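To make the emulation problem concrete, the sketch below checks whether an OCI image index offers a variant for the host architecture. It assumes the index has already been fetched and parsed into a dict; `image_architectures` and `needs_emulation` are hypothetical helper names, and single-architecture images that ship without an index are not handled.

```python
def image_architectures(index, os_name="linux"):
    """Collect the architectures an OCI image index provides for one OS."""
    archs = set()
    for manifest in index.get("manifests", []):
        plat = manifest.get("platform") or {}
        if plat.get("os") == os_name and plat.get("architecture"):
            archs.add(plat["architecture"])
    return archs


def needs_emulation(index, host_arch, os_name="linux"):
    """True when no variant in the index matches the host architecture."""
    return host_arch not in image_architectures(index, os_name)


# Hypothetical index listing only an amd64 variant.
amd64_only = {
    "schemaVersion": 2,
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
    ],
}
```

With this index, `needs_emulation(amd64_only, "arm64")` is true, which is the situation developers on arm64 machines run into today; once an arm64 entry is published alongside the amd64 one, the same check returns false.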
We reached the understanding that developers have so far met this problem only seldom, but we are at a critical point where they will start to meet it far more often. They are coming up with ad-hoc/custom solutions (and sharing them), all of which reproduce locally what we could be providing in our infrastructure.
With an eye to the future, it was noted that there might be workloads in production that could benefit from ARM's better overall performance, better performance per watt, or simply lower power consumption for those workloads. Grafana power usage graphs per cluster were showcased as a possible way to identify the lowest-hanging fruit.
Vote on adopting arm64 in WMF production
Unanimous agreement that, in principle, we should start adopting aarch64 (aka arm64) in WMF production.
Challenges
We then went on to brainstorm the challenges we will face in making the above happen, in no specific order for now:
- We will eventually need to procure hardware -> Challenge: Vendor evaluation
- Utilizing some cloud that already provides arm64 might help speed some things up -> Challenge: Cloud evaluation
- We will need to host our own arm64 Debian Packages -> Challenge: Reprepro arm64 support
- We might be able to bridge some gaps while obtaining hardware using cross building temporarily -> Challenge: Get some cross building infrastructure, even if temporarily
- We will need to augment our current amd64 package/image tracking infrastructure -> Challenge: Identify what needs to be done
- We will need to adapt our CPU architecture specific puppet manifests to aarch64 -> Challenge: track down and amend
- arm64 OCI images should be treated with the same reproducibility and testability concerns as amd64 -> Challenge: figure out what this means
- Down the line, and assuming even partial adoption of arm64 in our production, we will need to come up with a migration plan for the applicable workloads -> Challenge: Figure those workloads out and craft migration paths
- Virtualization issues. Clusters like WMCS or Ganeti will gradually need arm64 support, and KVM issues might arise
- Currently, data center operations are certified by vendors. We will need the same thing for our arm64 vendors if they differ
- We will need arm64 support in our image registry
- Cloud VPS support
Solutions
Not all of the above challenges were met with easy solutions (that wouldn't be possible), but here are a few:
- Reprepro => Apparently it's easy to enable multi-architecture support on it
- We will need to implement a tool to keep track of our build needs
- Once we obtain hardware, we'll create a pilot installation.
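For the Reprepro item in the list above, the change is expected to centre on the `Architectures` field of its `conf/distributions` file. The sketch below only edits that text; the `add_architecture` helper and the sample stanza are hypothetical, and any Reprepro-side steps needed after the change are out of scope here.

```python
def add_architecture(distributions_text, arch):
    """Add arch to every 'Architectures:' field in a distributions file.

    Pure text transformation: nothing is read from or written to disk.
    """
    out = []
    for line in distributions_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "architectures":
            archs = value.split()
            if arch not in archs:
                archs.append(arch)
            line = f"{name}: {' '.join(archs)}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical stanza; the codename and components are placeholders.
sample = (
    "Codename: example\n"
    "Architectures: amd64 source\n"
    "Components: main\n"
)
updated = add_architecture(sample, "arm64")
```

The function is idempotent, so running it a second time on `updated` leaves the text unchanged.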