
wmcs.spicerack: Setup a host to run cookbooks from prod network
Open, Medium, Public

Description

We want to be able to run the WMCS-related cookbooks from a production host as well. That is especially useful once we get
to the point of automated remediations and such, but also when your workplace network is not stable.

This task is to investigate and implement a solution to it.

A solution could be to generate an ssh key for the root user on the production cumin host, and configure that host
(details to be investigated) to use a bastion host and that key when connecting to cloud instances.
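
To make the idea a bit more concrete, here is a minimal sketch of how such a connection could be assembled, assuming OpenSSH's `-J` (ProxyJump) and `-i` (identity file) options. The host names, the key path and the `build_ssh_command` helper are hypothetical placeholders for illustration, not part of any existing cumin or spicerack setup, and how the bastion hop itself authenticates is left open.

```python
import shlex
from typing import List, Optional


def build_ssh_command(
    target: str,
    bastion: str,
    identity_file: str,
    user: str = "root",
    remote_command: Optional[str] = None,
) -> List[str]:
    """Build an ssh argv that reaches `target` through `bastion`.

    All arguments are placeholders chosen for this sketch.
    """
    argv = [
        "ssh",
        # Key offered to the target instance.
        "-i", identity_file,
        # Only offer the key given above, not every key in an agent.
        "-o", "IdentitiesOnly=yes",
        # Hop through the bastion (OpenSSH ProxyJump).
        # Assumption: credentials for the bastion hop itself come from
        # ssh_config or an agent; this sketch does not configure them.
        "-J", bastion,
        f"{user}@{target}",
    ]
    if remote_command is not None:
        argv.append(remote_command)
    return argv


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = build_ssh_command(
        target="instance.project.example",
        bastion="bastion.example.org",
        identity_file="/root/.ssh/id_cloud_example",
        remote_command="uptime",
    )
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(cmd))
```

The same settings could probably live in an ssh config file on the host instead of being passed on the command line; which of the two fits better is part of what this task would need to investigate.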

Event Timeline

dcaro triaged this task as Medium priority. Mar 4 2021, 11:39 AM
dcaro created this task.

Could you clarify the scope between:

  1. production hosts that currently have WMCS as the service team (cloudvirt, cloudcephosd, etc.)
  2. Cloud VPSes that the WMCS team currently semi-manages (i.e. that have other roots, possibly custom puppetmasters etc.)
  3. Cloud VPSes that the WMCS team is currently managing fully (operates config mgmt such as the puppetmaster), not necessarily exclusively (e.g. I think Toolforge has additional admins)

These three have different constraints, risk tolerances, set of administrators, authentication constraints (or keys even), etc. and will likely need separate solutions

It would be also helpful to be a bit more specific about the workflows you're interested in specifically - is that for existing cookbooks or future ones? What is that you are interested in doing, at a higher level? Thanks!

Could you clarify the scope between:

I'll try :)

  1. production hosts that currently have WMCS as the service team (cloudvirt, cloudcephosd, etc.)

For these we don't really need anything new; the current cumin setup can already connect to and manage them.

  2. Cloud VPSes that the WMCS team currently semi-manages (i.e. that have other roots, possibly custom puppetmasters etc.)

For these we will need one ssh key. They include projects like 'cloudinfra' and 'admin' (TODO: get a full list)

  3. Cloud VPSes that the WMCS team is currently managing fully (operates config mgmt such as the puppetmaster), not necessarily exclusively (e.g. I think Toolforge has additional admins)

And these will need another ssh key, to avoid leaking dangerous ones, probably one per project? (to discuss)
Though with asymmetric encryption I don't know how much risk there is in sharing the public keys (maybe you, @faidon, can help assess that).

These three have different constraints, risk tolerances, set of administrators, authentication constraints (or keys even), etc. and will likely need separate solutions

It would be also helpful to be a bit more specific about the workflows you're interested in specifically - is that for existing cookbooks or future ones? What is that you are interested in doing, at a higher level? Thanks!

This is mostly for future cookbooks (including the toolforge ones being reviewed, https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/667183/5). It's not blocking us from using them right now, so it's not critical.
But that will unblock two future flows:

  • Low prio: Being able to run the cookbooks in a more reliable network (sometimes ISPs are not nice).
  • Mid prio, but long term: Being able to run cookbooks unattended, for example, scheduled cleanups or automated remediations.

And as a side bonus, that aligns also with the current philosophy of wanting to run cookbooks only from released packages on puppetized hosts (at least when running on that host).

To clarify, it's left for discussion what host they should run on (if using existing cumin hosts, or a new VM in ganeti, etc.), always open to suggestions.

(I'd suggest to focus on the nitty-gritty like SSH keys later -- I'm not the right person to ask for these either :)

For me the biggest question right now is more of a policy/administrative one: the proposal to run workflows for (2) and (3) (in the "labs" realm, and the per-project sub-realms) from production crosses some administrative boundaries that there has been a desire not to cross in the past. As a concrete example, one of the reasons is that there has been a desire to be able to open up global roots in WMCS or roots in Toolforge to volunteers etc. who are not necessarily global roots in production (we have at least a few such individuals that I know of). This implementation would effectively go against that goal, and create another special snowflake situation where we either need to increase the level of rigor (NDAs etc.) to give those folks production access to this special in-between box, or allow only the intersection of roots in the two realms to execute orchestration workflows. This is of course a decision we could revisit, but it's not a very lightweight decision. Directionally, I'd like to avoid special unique in-betweens as much as possible, and focus on creating proper layering and clear interfaces between both administrative realms and WMF teams. Hope that makes sense :)

In the meantime, I'd be interested to hear what would prevent you from running these workflows in a labs-realm space, such as a VM in the cloudinfra project, rather than a cross-realm VM. It feels like it could address the two requirements you mentioned. Perhaps there are other requirements or constraints that need to be part of the conversation here?

So that host is meant to be accessed only by SREs; it's not meant to be used by other Toolforge/VPS admins, who would have to fall back to running the cookbooks from their workstations/laptops.

In the meantime, I'd be interested to hear what would prevent you from running these workflows in a labs-realm space, such as a VM in the cloudinfra project, rather than a cross-realm VM. It feels like it could address the two requirements you mentioned. Perhaps there are other requirements or constraints that need to be part of the conversation here?

This has already been widely discussed. If you remember, we also talked about it the last time we met 1:1, where I explained the need to automate WMCS team operational tasks, which include actions in both realms, and (mid-long term) to automate operational tasks for VPS and Toolforge non-staff members too (those will probably require access only to VMs).
In summary, the requirement is being able to automate WMCS team operational tasks, and those include access to both realms.

dcaro removed dcaro as the assignee of this task. Aug 10 2021, 5:04 PM
dcaro raised the priority of this task from Medium to Needs Triage.
nskaggs triaged this task as Medium priority. Aug 10 2021, 10:04 PM