We are going to experiment with setting up a proof-of-concept toolforge implementation outside of cloud-vps. This is the top-level tracking task for this experiment.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T407296 Toolforge on bare metal POC |
| Open | None | | T407299 What are the parts of toolforge? |
| Open | None | | T407140 Plan networking for Toolforge-on-Metal experiment |
| Open | None | | T409309 Create new VRF and networks for Toolforge-on-Metal |
| Resolved | | taavi | T411081 Improve how virt networks are configured in cloudgw |
| Open | | Andrew | T406630 Do something with cloudcontrol100[8-10]-dev |
| Open | None | | T407502 Identify candidate tools to test migrating to the Toolforge on bare metal POC |
Event Timeline
I have some questions about this project, which I'm hoping some Tiger Team folks will have answers to.
Typically an experiment seeks to answer a question. What question are we answering here? I started with something like
"Can we rebuild toolforge outside of cloud-vps?"
...but the answer to that is definitely 'yes' since software is software. So that gets me to something more specific, like
"Will it be more complicated to build toolforge outside of cloud-vps?"
...but the answer to that is also almost certainly 'yes,' so complication is clearly not the primary metric. Another question could be
"How hard will it be to build toolforge without cloud-vps?"
Answering that would probably start with making a list of what would be involved in building it. And having that list would be pretty useful regardless! So I'm creating a task for that (T407299) with the expectation that someone has already done this and that task will just link to a doc someplace. Answering that question doesn't require us to build anything, though. So if we're going to build something, we're probably answering something like:
"After building a pilot bare-metal toolforge replacement, how hard was it, and does it seem like something that would be easy to maintain, or hard to maintain?"
If that's the question we're going to answer (and I think it is), then next we should define a scope for what we mean by 'pilot toolforge replacement' so we know when to stop building and when to start evaluating. I propose we define that scope by selecting an existing small set of tools to run on the new toolforge, and when they work then we're done.
If others agree with that, then we need to decide if we're targeting simple tools or complicated ones. It's my opinion that we should intentionally choose the most pathologically-complex tools we can find for the pilot, because if we choose trivial ones then we'll wind up deciding that the whole bare-metal migration is easy when all we've actually proved is that doing easy things is easy.
In case it's useful, there was a document started on thinking what's needed from toolforge side: https://docs.google.com/document/d/145cfnZYR2QSYlVtxBR2HCgHx46I7681_H3I74c3j_vI/edit?tab=t.0
I agree with @Andrew: I think that the experiment is not yet well defined, and I also agree with the questions he raises. Maybe we can phrase it as hypotheses, like with the annual plan, something like:
- Hypothesis 1: deploying toolforge in bare metal is easier than deploying on VMs
- 1.1: more reproducible
- 1.2: less effort needed
- ...
- Hypothesis 2: maintaining toolforge in bare metal is easier than maintaining it on VMs
- 2.1: capacity planning is easier
- 2.2: debugging is easier
- 2.3: incident management is easier
- 2.4: upgrades are easier
- ...
- Hypothesis 3: all current workflows can be implemented in bare metal too
- 3.1: <really complicated tool1> is able to function the same
- ...
- Hypothesis 4: toolforge in bare metal is more stable for users
- Hypothesis 5: toolforge in bare metal is more reliable for users
- ...
Then we can define the POC so that it hits all those questions and lets us answer them; some could even be tackled in parallel if they are well defined. I think that @CCiufo-WMF might be the one with the insight to fill in the product-side gaps here that should set the technical direction to follow (the key ordering being: product definition first, then technical definition).
I agree that this is the question we're trying to answer and that we should start by identifying candidate tools to migrate as part of the POC. I've spun out T407502 to capture that and started documenting the overall effort at https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_on_bare_metal.
In T407296#11274680, @Andrew wrote:
"After building a pilot bare-metal toolforge replacement, how hard was it, and does it seem like something that would be easy to maintain, or hard to maintain?"
I'd reword this from "it" to "we", with "we" here being decidedly a larger group of people (than the previously extant WMCS team) with a different set of shared best practices.
So something like
"After building a pilot bare-metal toolforge replacement, can we maintain it more sustainably, thus providing more value to Toolforge's user base?"
I also agree that writing APP-style hypotheses (whether they end up in the APP or not) might be useful to clarify our goals.
Does "we" include every SRE in the SRE department? Or is that only the "tools-infra" team? Or is it "tools-platform" + "tools-infra" teams?
"After building a pilot bare-metal toolforge replacement, can we maintain it more sustainably, thus providing more value to Toolforge's user base?"
Can we find some proxy metrics for "sustainability" and/or "maintainability"? Some proposals:
- SLOs (uptime, MTTR, ...)
- lottery factor (for Toolforge and for supporting systems like Ceph, Openstack, etc.)
- number of contributors (for Toolforge and for supporting systems)
- number of on-call SREs (for Toolforge and for supporting systems)
- delay in upgrades (e.g. how far we are from the latest version of k8s)
- review time for patches
- number of bugs reported
- general team health (happiness, tech debt, ...)
These are just some very quick proposals, I'm not sure if they are good metrics and if they are relevant to what we are trying to achieve.
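To make the SLO-style proposals above (uptime, MTTR) concrete, here is a minimal sketch of how such proxy metrics could be computed from incident windows. The incident dates and the 30-day review period are purely illustrative assumptions, not real Toolforge data.

```python
from datetime import datetime, timedelta

# Hypothetical incident windows (start, end) over a 30-day review period.
# These timestamps are made up for illustration only.
incidents = [
    (datetime(2025, 1, 3, 10, 0), datetime(2025, 1, 3, 11, 30)),
    (datetime(2025, 1, 17, 22, 15), datetime(2025, 1, 18, 0, 45)),
]

period = timedelta(days=30)

# Total downtime: sum of incident durations.
downtime = sum((end - start for start, end in incidents), timedelta())

# Mean time to recovery: average incident duration.
mttr = downtime / len(incidents)

# Uptime as a fraction of the review period.
uptime = 1 - downtime / period

print(f"MTTR: {mttr}, uptime: {uptime:.4%}")
```

Comparing numbers like these between the current and the bare-metal deployment only makes sense if incidents are recorded with the same criteria in both.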
Does "we" include every SRE in the SRE department? Or is that only the "tools-infra" team? Or is it "tools-platform" + "tools-infra" teams?
If the PoC pans out, my hope is that down the line that "we" would include every SRE in the SRE department. Such discussions are at the very very very very early stages and as such there can be no guarantee. And again, they are dependent on the PoC succeeding.
Can we find some proxy metrics for "sustainability" and/or "maintainability"? Some proposals:
I'd add "% of time spent on keeping the lights on vs strategic work". These metrics are already being surfaced by managers in the Essential Works/APP work reports.
they are dependent on the PoC succeeding.
What is the definition of "succeeding"? :) I'm not sure if everybody has the same definition in their mind. In other words: what is the "Definition of Done" for this task?
Until now I hadn't seen this wiki page created by @CCiufo-WMF, which contains some key questions we want to answer with the PoC work. I propose that answering those questions could be the "Definition of Done" for this task.
This one question however will be very hard to answer even when we get to the point of having a running "Toolforge on Metal" deployment:
Will hosting on bare metal allow us to provide a higher level of availability and reliability?
How do we measure "availability" and "reliability"? And how can we compare them between the existing Toolforge deployment and the "metal" deployment?
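One way to make the comparison at least apples-to-apples would be to run identical periodic health probes against both deployments and compare success rates. A minimal sketch, with entirely hypothetical probe results standing in for real monitoring data:

```python
# Hypothetical probe logs: one boolean per periodic health check.
# A real comparison would run identical probes against both deployments
# over the same time window; these numbers are invented for illustration.
probes_vps = [True] * 995 + [False] * 5      # existing Toolforge
probes_metal = [True] * 998 + [False] * 2    # bare-metal deployment

def availability(results):
    """Fraction of successful probes over the measurement window."""
    return sum(results) / len(results)

print(f"cloud-vps: {availability(probes_vps):.2%}")
print(f"metal:     {availability(probes_metal):.2%}")
```

Even with identical probes, differences in tool workload and traffic between the two deployments would still confound the comparison, so any such numbers would need careful interpretation.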