Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session)
Open, Needs Triage, Public

Description

Intro

This is a task to recap and record the outcome of the 2022 SRE Summit session "Adoption of aarch64 (aka arm64) in WMF production?"

For the rest of this task, consider the terms aarch64, arm64 and arm/ARM as interchangeable. For those interested:

  • aarch64 is the 64-bit execution state of the ARM CPU architecture, introduced with ARMv8-A. It is also the target name used by various GNU tools and the name of the LLVM backend.
  • ARM64/arm64 is the architecture name used by various operating systems, including Debian Linux, macOS and Windows.
  • ARM is the company that designs CPUs. Sometimes written as arm.
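
Because the same architecture goes by several names, any tooling that compares a host with the architecture of an image has to normalize those names first. The following is a minimal Python sketch of that idea, assuming Debian-style names (amd64, arm64) as the canonical form. The alias table and the function names `normalize_arch` and `needs_emulation` are illustrative only and do not refer to an existing WMF tool.

```python
import platform
from typing import Optional

# Illustrative alias table (an assumption, not exhaustive): maps names reported
# by kernels and toolchains to Debian-style architecture names.
_ARCH_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "amd64",
    "amd64": "amd64",
}


def normalize_arch(name: str) -> str:
    """Return the Debian-style name for a CPU architecture string."""
    key = name.strip().lower()
    if key not in _ARCH_ALIASES:
        raise ValueError(f"unknown architecture: {name!r}")
    return _ARCH_ALIASES[key]


def needs_emulation(image_arch: str, host_arch: Optional[str] = None) -> bool:
    """Return True if an image built for image_arch cannot run natively on the host.

    When host_arch is omitted, the architecture of the running machine is used.
    """
    host = normalize_arch(host_arch or platform.machine())
    return normalize_arch(image_arch) != host
```

For example, `needs_emulation("amd64", "aarch64")` is true, which is the slow case described in the problem statement below, while `needs_emulation("arm64", "aarch64")` is false.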

Problem statement

Our breakout session was split into four parts. The first part was a quick overview of the problem.

Tasks tagged "ARM support", snippets of conversations between WMF developers, and industry announcements were showcased to help everyone understand the issue at hand: running amd64 workloads (mostly OCI images right now) under emulation on arm64, whether or not the emulation is transparent to the user, is extremely slow and leads to failures caused by very long execution times.

The current use cases identified as suffering the most:

  • CI
  • Deployment pipeline
  • General MediaWiki developers

We reached the understanding that developers have seldom met this problem so far, but we are at the critical point where they will start to meet it far more often. They are coming up with (and sharing, nonetheless) ad-hoc/custom solutions to it, all of which reproduce locally what we could be offering in our infrastructure.

With an eye to the future, it was noted that there might be workloads in production that could benefit from ARM's better overall performance, better performance per watt, or simply lower power consumption for those workloads. Grafana power usage graphs per cluster were showcased as a possible way to identify the lowest-hanging fruit.

Vote on adopting arm64 in WMF production

There was unanimous agreement that, in principle, we should start adopting aarch64 (aka arm64) in WMF production.

Challenges

We then brainstormed the challenges we will face to make the above happen. They are listed in no specific order for now.

  • We will eventually need to procure hardware -> Challenge: Vendor evaluation
  • Utilizing some cloud that already provides arm64 might help speed some things up -> Challenge: Cloud evaluation
  • We will need to host our own arm64 Debian Packages -> Challenge: Reprepro arm64 support
  • We might be able to bridge some gaps while obtaining hardware using cross building temporarily -> Challenge: Get some cross building infrastructure, even if temporarily
  • We will need to augment our current amd64 package/image tracking infrastructure -> Challenge: Identify what needs to be done
  • We will need to adapt our CPU architecture specific puppet manifests to aarch64 -> Challenge: track down and amend
  • arm64 OCI images should be treated with the same reproducibility and testability concerns as amd64 -> Challenge: figure out what this means
  • Down the line, and assuming even partial adoption of arm64 in our production, we will need to come up with a migration plan for the applicable workloads -> Challenge: Figure those workloads out and craft migration paths
  • Virtualization issues: clusters like WMCS or Ganeti will gradually need arm64 support, and KVM issues might arise
  • Currently, data center operations are certified by vendors. We will need the same thing for our arm64 vendors if they differ
  • We will need arm64 support in our image registry
  • Cloud VPS support
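
For the "track down and amend" challenge around architecture-specific puppet manifests, a first pass could be a plain text search for hard-coded x86 architecture names. The sketch below is only one possible approach: the token list, the file suffixes and the function name `find_arch_references` are assumptions made for illustration, and a real audit would also need to catch architecture assumptions that are not spelled out literally.

```python
import re
from pathlib import Path

# Assumption: architecture-specific code names one of these tokens literally.
# The lookarounds avoid matching inside longer alphanumeric words.
ARCH_PATTERN = re.compile(r"(?<![A-Za-z0-9])(amd64|x86_64|i386)(?![A-Za-z0-9])")


def find_arch_references(root, suffixes=(".pp", ".erb", ".yaml")):
    """Yield (path, line_number, line) for lines that name an x86 architecture."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if ARCH_PATTERN.search(line):
                yield path, number, line.strip()
```

Running `for path, number, line in find_arch_references("."): print(path, number, line)` from the root of a checkout would list the candidate lines to review.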

Solutions

Not all of the above challenges were met with easy solutions (that wouldn't be possible), but here are a few:

  • Reprepro => Apparently it's easy to enable multi-architecture support on it
  • We will need to implement a tool to keep track of our build needs
  • Once we obtain hardware, we'll create a pilot installation.
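
The session did not define what the build-tracking tool would look like. As a rough illustration of its core check, the sketch below takes an inventory of artifacts (packages or images) with the architectures each one has been built for, and reports what is still missing. The function name `missing_builds`, the inventory shape and the artifact names are all hypothetical.

```python
def missing_builds(built, required=("amd64", "arm64")):
    """Map each artifact to the sorted list of required architectures it lacks.

    built: dict mapping an artifact name to the architectures it has been built for.
    Artifacts that already cover every required architecture are left out.
    """
    gaps = {}
    for name, archs in built.items():
        missing = sorted(set(required) - set(archs))
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical inventory, for illustration only.
inventory = {
    "example-package": {"amd64"},
    "example-image": {"amd64", "arm64"},
}
print(missing_builds(inventory))  # {'example-package': ['arm64']}
```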

Event Timeline

I don't really understand the problem statement, what are the arm64 things we'd run in production? Is it just image building?

Good point. I amended the write-up with a part I had forgotten, namely that for some workloads (e.g. batch jobs) ARM performance per watt and/or overall power consumption could be better.

That being said, the problem statement is a bit vague on purpose: aside from the most obvious need (image building, as you point out), other use cases might/will arise in the next few years. The various challenges listed aim to identify blockers to the adoption of arm64 for these future use cases. Figuring out precisely which use cases those might be is something that we did not do, as it is a pretty involved process.

Let me also add that, as far as MediaWiki and its surrounding services go (that is, things in the hot path of end user requests), I have my doubts that ARM64 would prove to be substantially beneficial. Let's see if I'll regret this phrase in the next few years.

jijiki renamed this task from SRE Summit 2022 Outcome of Session "Adoption of aarch64 (aka arm64) in WMF production?" to Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session). Nov 7 2022, 4:01 PM