Decision request - Tool account management and Striker
Closed, Resolved, Public

Description

Problem

Currently Striker (toolsadmin.wikimedia.org) exposes the only user-accessible method to create and modify tool accounts. It does this by holding LDAP credentials with full write access and exposing a web interface for authenticated users.

There are, however, several other use cases that need to store persistent information about tools:

Plus there are numerous other workflows that consume read-only tool information that are not listed here.

All of these use separate code to read and possibly write tool entries in LDAP.

Direct LDAP write access has historically been restricted to the wikiprod realm. This is one of the main reasons why Striker runs on cloudweb* hardware and not on a Cloud VPS VM or in the Toolforge k8s cluster.

Constraints and risks

TBD

The other highly privileged action that Striker does, adding and removing members from the tools Cloud VPS project, is out of scope here. That is an OpenStack operation, meaning we could already restrict it with custom roles and more specific policies.

Options

Option 1

Do nothing.

Pros:

  • No engineering work required
  • It already works

Cons:

  • All future use cases will need to come up with their ad hoc code and think about deployment considerations for LDAP write access
  • Some options (e.g. user-installable CLIs) will not be possible to implement

Option B

Keep the LDAP logic in Striker, and expose an API that all write operations will use.

Pros:

  • Minimal engineering work (Striker is already deployed and has LDAP credentials that work)
  • Will unlock some features that would otherwise not be possible

Cons:

  • Heavy coupling between frontend/UI features and backend logic
  • Will need to come up with a way to authenticate those API calls
  • Does not unblock moving Striker off of wikiprod hardware

Option Γ (Gamma)

As option B, but also migrate all read operations to the new API where possible.

Pros:

  • All LDAP logic gets consolidated in one place, everything else is "just" standard HTTP calls
  • Allows better caching and such

Cons:

  • More engineering work than in option B
  • All other cons of option B apply

Option Beryllium

Write a new backend service which will consolidate all LDAP writing logic and expose a standard HTTP API for those operations.

Pros:

  • Uncouples UI and backend logic
  • Unblocks various new use cases

Cons:

  • Requires the most engineering work to implement
  • Increases system complexity by introducing yet another service
  • Will need to come up with a way to authenticate those API calls
  • Will not enable API editing for resources currently managed in Striker (including toolinfo records, GitLab repos, Phab projects, membership requests)

Option Purple

As option Beryllium, but also consolidate all read operations to the new service when possible.

Pros:

  • All LDAP logic gets consolidated in one place, everything else is "just" standard HTTP calls
  • Allows better caching and such

Cons:

  • Even more upfront engineering work than in option Beryllium
  • All other cons of option Beryllium apply

Option Games

Introduce APIs for everything Striker can do (like in option Γ), then move the frontend code somewhere else.

Pros:

  • Uncouples UI and backend logic
  • Unblocks various new use cases

Cons:

  • Engineering work required to implement
  • Django might not be the best tool to implement a headless backend
  • Hard to migrate UI components one at a time

Event Timeline

taavi updated the task description.

I can definitely see the benefit of the HTTP API, way easier to use than the LDAP interface.

But also, I'm trying to think, how is exposing the REST API different than exposing the LDAP URI for RW directly? It may be the same level of security/auth/authn complications, etc. Or maybe better, because there is usually better tooling, middleware and such, for HTTP.

The main advantage is that we can expose a specific, narrow interface for these operations instead of exposing the entire LDAP write interface. For example, while you probably could write an OpenLDAP ACL to permit a member of a tool to disable and re-enable that tool (unless the tool has been forcibly disabled by an admin but not deleted yet), it's much easier to implement that (in an easily understandable and testable way) with some custom code in a small dedicated service.

Thanks @taavi for starting this discussion and describing all the options!

I think two valuable goals are:

  • Uncoupling UI and backend logic
  • Being able to deploy most things as Toolforge components, inside the Tools k8s cluster

In option Purple, would it be possible to deploy the new "LDAP API" service to wikiprod hardware, and move the rest of Striker to run within the Tools cluster? Are there other features apart from LDAP that require Striker to be deployed in wikiprod hardware?

Long shot, but maybe we could try to hand off the new "LDAP API" service in option Purple to a production team? I think that service might potentially find other use cases outside of Toolforge/WMCS.

In option Purple, would it be possible to deploy the new "LDAP API" service to wikiprod hardware

Yes.

, and move the rest of Striker to run within the Tools cluster? Are there other features apart from LDAP that require Striker to be deployed in wikiprod hardware?

This is a large part of that work and part of why I'm exploring the API idea in the first place, but this won't entirely unblock that. There are two other things in Striker that have historically been deemed "scary":

  • Adding new members to the tools Cloud VPS project. As briefly mentioned in the task description I believe that could be solved by ensuring that workflow uses a restricted-enough service account.
  • Creating new developer accounts. In an optimal world that would always happen in Bitu.

In addition, T359554: Use IDP for authentication in Striker needs to happen, but I want that to happen regardless of the outcome here and it is already in my near-term roadmap for Striker.

Long shot, but maybe we could try to hand off the new "LDAP API" service in option Purple to a production team? I think that service might potentially find other use cases outside of Toolforge/WMCS.

I'm fine with asking if others have potential use cases, but if not, I suspect that it's better to make a Toolforge-specific service than to make something that's in theory generic but in practice only used by us. For example:

  • what if we want to move the tool disable process information to this service instead of having it in a separate database?
  • what if, in a distant future where we've removed the shared NFS cluster and everyone is using a really good web UI instead of logging onto a bastion host, we want to move the primary data store from LDAP to something more suitable for the requirements of that time?
taavi renamed this task from [DRAFT] Decision request - Tool account management and Striker to Decision request - Tool account management and Striker. May 13 2025, 3:52 PM

it's better to make a Toolforge-specific service than to make something that's in theory generic but in practice only used by us

Agreed! But we should consider carefully the requirements for this service and its interface. I think I would like this service to be as "thin" as possible, does it need to contain Toolforge-specific logic? Could it be a generic LDAP adapter, with some minimal logic to restrict the damage you can do through its API?

For example in the scenario you mentioned:

we want to move the primary data store from LDAP to something more suitable

if the LDAP API service is very "thin", instead of patching that service to use a different data store, we would just stop calling that service completely, and call a different service instead.

I was holding back on deciding this until it was blocking some work, I guess that it has started blocking :)
Some random comments.

Agreed! But we should consider carefully the requirements for this service and its interface. I think I would like this service to be as "thin" as possible, does it need to contain Toolforge-specific logic? Could it be a generic LDAP adapter, with some minimal logic to restrict the damage you can do through its API?

This would increase the security issues with that service, as it would expand the possible actions (just saying, that might make it harder to harden).

we want to move the primary data store from LDAP to something more suitable

I have been thinking about this a bit too, the main blocker to do something like this is the need for unix auth (nfs + bastion/homes currently), once we move to persistent volumes of sorts and stop relying on unix auth, that would be simpler to do.

For authentication on that service

I can imagine several options:

  • Have a 'frontend' service that proxies/does the calls to that "striker api", this could then just use a password that gives it access (as only the frontend would use that service), this might be the easiest to implement (and mock).
  • Have "striker api" authenticate using sso/oauth of sorts, so the users can access it directly, this would limit a bit more the potential issues if someone "hijacks" a session (as they would only have access to whatever that user has access).

For how to evolve towards the desired goal

I think it would not be too hard to couple the move with the "striker 2.0" redesign process, especially if we end up going with the "redesign the new features from scratch, then redesign the existing UI in chunks" (option 2 in T393010: [DRAFT] Decision Request - Initial product approach to integrating Toolforge UI functionality with Toolsadmin).
It would perfectly match the deployment in k8s move too.

Django might not be the best tool to implement a headless backend

True, though if we get to the point where we stop serving any web UI from it, we could kinda easily change the framework for something else. We might want to avoid using "rest" frameworks and such for that (not really exposing CRUD operations for resources), as that's usually pretty framework-specific (and thus, hard to migrate later if we want).

Only do LDAP?

I think we might benefit too from leaving there other flows that might need risky creds, like creation of gitlab projects, phabricator and such (would have to review all the creds we have there).

Could it be a generic LDAP adapter, with some minimal logic to restrict the damage you can do through its API?

This would increase the security issues with that service, as it would expand the possible actions (just saying, that might make it harder to harden).

Yes, that's fair. What are the write actions that the API needs to perform against LDAP? From a quick search in the Striker codebase, I can only see adding and deleting SSH keys. I imagine it also needs to modify LDAP group membership, but I can't find that in the Striker codebase.

Have "striker api" authenticate using sso/oauth of sorts, so the users can access it directly

How would you handle authorization in this case, i.e. how would the API know which actions are authorized for a given user?

I think we might benefit too from leaving there other flows that might need risky creds, like creation of gitlab projects, phabricator and such (would have to review all the creds we have there).

"Risky" is subjective, I think two clear boundaries could be:

  • connections to services that can only be reached from the wikiprod network (only LDAP at the moment? not sure)
  • connections to services where toolforge is only one of many users (LDAP, gitlab, phabricator, etc.)
taavi triaged this task as Medium priority. May 14 2025, 1:29 PM
taavi added a project: cloud-services-team.

Could it be a generic LDAP adapter, with some minimal logic to restrict the damage you can do through its API?

This would increase the security issues with that service, as it would expand the possible actions (just saying, that might make it harder to harden).

Yes, that's fair. What are the write actions that the API needs to perform against LDAP? From a quick search in the Striker codebase, I can only see adding and deleting SSH keys. I imagine it also needs to modify LDAP group membership, but I can't find that in the Striker codebase.

SSH key management, like the registration workflow, is also one of those things that would ideally go to Bitu, but I'm not sure on which timeline that is happening.

The main write actions here are creating tools, modifying tool membership, and disabling/re-enabling tools.

Have "striker api" authenticate using sso/oauth of sorts, so the users can access it directly

How would you handle authorization in this case, i.e. how would the API know which actions are authorized for a given user?

I very explicitly left any mention of authentication and authorization out of this task :-) A general Toolforge API authn/authz system is something where we/I need to do more research and design work, and will likely have a separate decision request at some point.

For options Beryllium and Purple, I would expect that for now the communication between Striker and the new LDAP-writing service would use a simple shared secret or something similar as a stopgap until the general API authentication thing happens.
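As an illustration of the stopgap described above, here is a minimal sketch of a shared-secret check on the service side. The function name, the use of an "Authorization: Bearer" header and the way the secret is passed in are all assumptions made for this example; nothing in the task specifies them.

```python
import hmac
from typing import Optional


def is_trusted_caller(authorization_header: Optional[str], shared_secret: str) -> bool:
    """Return True only if the request carries the expected shared secret.

    Hypothetical helper: the header format ("Bearer <token>") is an
    assumption for this sketch, not something decided in the task.
    """
    # Refuse everything if the service was started without a secret,
    # so an empty configuration never matches an empty token.
    if not shared_secret or not authorization_header:
        return False

    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False

    # compare_digest does a constant-time comparison, which avoids leaking
    # how much of the token matched through response timing.
    return hmac.compare_digest(token.encode("utf-8"), shared_secret.encode("utf-8"))
```

With this shape only Striker (or whatever else holds the secret) can call the write endpoints, and replacing the check later with a general API authentication system would not change the endpoints themselves.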

I think we might benefit too from leaving there other flows that might need risky creds, like creation of gitlab projects, phabricator and such (would have to review all the creds we have there).

"Risky" is subjective, I think two clear boundaries could be:

  • connections to services that can only be reached from the wikiprod network (only LDAP at the moment? not sure)
  • connections to services where toolforge is only one of many users (LDAP, gitlab, phabricator, etc.)

Yeah, the GitLab and Phabricator credentials are both not very privileged and I don't think they should be a concern when talking about where Striker lives.

it's better to make a Toolforge-specific service than to make something that's in theory generic but in practice only used by us

Agreed! But we should consider carefully the requirements for this service and its interface. I think I would like this service to be as "thin" as possible, does it need to contain Toolforge-specific logic? Could it be a generic LDAP adapter, with some minimal logic to restrict the damage you can do through its API?

In theory yes. In practice I think even the "simple" workflows do require some logic that, while in theory generic, would end up being Toolforge specific. For example, the workflow to authorize re-enabling a disabled tool looks something like this:

  • If the tool was force disabled by an admin (represented by a specific pwdAccountLockedTime attribute in LDAP), only admins can re-enable it.
  • If the tool was disabled more than 40 days ago, deny because it's already in the process of being deleted.
  • Otherwise, a member of the tool can re-enable it.
    • Except that you can add another tool as a member of a tool, which is represented in a really weird way in LDAP but still should give members of that other tool access to that tool.
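The rules above could be sketched roughly as follows. This is only one reading of them: the rules are applied in the order listed, and how they combine in edge cases (for example an admin re-enabling a force-disabled tool after the 40 days) is not spelled out in this thread. The DisabledTool type and its fields are hypothetical stand-ins for whatever would actually be read from LDAP.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Optional

# 40 days, as mentioned in the workflow description above.
DELETION_GRACE = timedelta(days=40)


@dataclass(frozen=True)
class DisabledTool:
    """Hypothetical in-memory view of a disabled tool."""
    name: str
    disabled_at: datetime
    force_disabled_by_admin: bool
    # Members of the tool, including members of tools added as members.
    effective_members: FrozenSet[str]


def can_reenable(tool: DisabledTool, user: str, user_is_admin: bool,
                 now: Optional[datetime] = None) -> bool:
    """Decide whether `user` may re-enable `tool`, applying the rules in order."""
    now = now or datetime.now(timezone.utc)

    # Rule 1: a tool force disabled by an admin can only be re-enabled by admins.
    if tool.force_disabled_by_admin:
        return user_is_admin

    # Rule 2: after 40 days the tool is already being deleted, so deny.
    if now - tool.disabled_at > DELETION_GRACE:
        return False

    # Rule 3: otherwise any (effective) member of the tool may re-enable it.
    return user in tool.effective_members
```

Even in this small form the check depends on Toolforge-specific concepts (force disabling, the deletion grace period, tools as members of tools), which is the point being made about a "generic" adapter.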

Agreed! But we should consider carefully the requirements for this service and its interface. I think I would like this service to be as "thin" as possible, does it need to contain Toolforge-specific logic? Could it be a generic LDAP adapter, with some minimal logic to restrict the damage you can do through its API?

In theory yes. In practice I think even the "simple" workflows do require some logic that, while in theory generic, would end up being Toolforge specific.

I would keep it as small and specific as possible, as in, have 1 endpoint for each flow toolforge needs, so it's as restrictive as we can.

Option Purple is my favourite so far, but I'm still a bit confused about what the new service would look like. Apologies if I'm fixating on this, we can also split this topic into a separate Decision Request or Design Document.

Regardless of whether the new service is generic or Toolforge-specific (I don't feel too strongly about this), I don't think the new service should contain too much "business logic".

  • If the tool was force disabled by an admin (represented by a specific pwdAccountLockedTime attribute in LDAP), only admins can re-enable it.
  • If the tool was disabled more than 40 days ago, deny because it's already in the process of being deleted.

Could these checks be performed in Striker instead? Striker could read the LDAP attributes using the "read" endpoint of the new service, perform any required checks, then call the "write" endpoint of the new service. If Striker is the only client of the new service, there would be no way to circumvent the checks anyway, unless you get access to the secret that Striker is using to connect to the new service.

What is the advantage/need of performing the checks above inside the new service?

The main write actions here are creating tools, modifying tool membership, and disabling/re-enabling tools.

What if the new service was a simple REST API that can perform these actions, without doing any checks apart from checking that it is writing to LDAP objects with a "tools" or "toolsbeta" prefix?

Option Purple is my favourite so far, but I'm still a bit confused about what the new service would look like. Apologies if I'm fixating on this, we can also split this topic into a separate Decision Request or Design Document.

I think this is a rather core element of what this decision request is here, so I'd prefer if we could come to an agreement here.

Regardless of whether the new service is generic or Toolforge-specific (I don't feel too strongly about this), I don't think the new service should contain too much "business logic".

  • If the tool was force disabled by an admin (represented by a specific pwdAccountLockedTime attribute in LDAP), only admins can re-enable it.
  • If the tool was disabled more than 40 days ago, deny because it's already in the process of being deleted.

Could these checks be performed in Striker instead? Striker could read the LDAP attributes using the "read" endpoint of the new service, perform any required checks, then call the "write" endpoint of the new service. If Striker is the only client of the new service, there would be no way to circumvent the checks anyway, unless you get access to the secret that Striker is using to connect to the new service.

What is the advantage/need of performing the checks above inside the new service?

The main argument for doing the checks in the new service is that if the authorization logic is kept inside Striker, it would mean that we can never enable these write actions in anything other than Striker. For example, this would block offering any sort of non-read-only tool management commands in the CLI.

In addition, some parts of that logic would have to be re-implemented by any T332478 solution if the logic is kept in Striker. In particular, figuring out what I've dubbed the "effective members" of a tool (i.e. members of the tool, plus members of any tools that have been added as members of the original tool, which can happen recursively) is something that is both realistically a Toolforge-specific operation and would have to be replicated elsewhere.
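To make the "effective members" idea concrete, here is a rough sketch of the recursive expansion. It works on a plain dictionary rather than on LDAP, because the actual LDAP representation of tool-in-tool membership is not described in this thread; the function name and the data shape are assumptions for illustration only.

```python
from typing import Dict, Set


def effective_members(tool: str, members_of: Dict[str, Set[str]]) -> Set[str]:
    """Return all users with access to `tool`, following tools added as members.

    `members_of` maps a tool name to its direct members. A direct member that
    is itself a key of the mapping is treated as a tool and expanded; anything
    else is treated as a user. (Hypothetical data model.)
    """
    users: Set[str] = set()
    seen = {tool}      # tools already queued, so membership cycles terminate
    stack = [tool]

    while stack:
        current = stack.pop()
        for member in members_of.get(current, set()):
            if member in members_of:
                # The member is another tool: expand its members as well.
                if member not in seen:
                    seen.add(member)
                    stack.append(member)
            else:
                users.add(member)

    return users
```

For example, if tools.a has members alice and tools.b, and tools.b has member bob, then the effective members of tools.a are alice and bob. The `seen` set matters because nothing stops two tools from being members of each other.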

Plus, at some point, keeping enough authorization logic in Striker starts to negate the "Striker is no longer doing anything too privileged, we can host it on a non-wikiprod box" effect we want.

The main write actions here are creating tools, modifying tool membership, and disabling/re-enabling tools.

What if the new service was a simple REST API that can perform these actions, without doing any checks apart from checking that it is writing to LDAP objects with a "tools" or "toolsbeta" prefix?

Does the above answer this question?

Could it be a generic LDAP adapter, with some minimal logic to restrict the damage you can do through its API?

Have "striker api" authenticate using sso/oauth of sorts, so the users can access it directly

How would you handle authorization in this case, i.e. how would the API know which actions are authorized for a given user?

Same way that it does now :), if you are part of the tool's group (that info comes from the idp too), then you have access to change the tool config/membership/etc.

It might depend if we have the user's browser accessing directly the striker api (ex. having javascript code doing the requests), or we have another 'frontend' service that does it on their behalf (then we'd need oauth I think).

I think we might benefit too from leaving there other flows that might need risky creds, like creation of gitlab projects, phabricator and such (would have to review all the creds we have there).

"Risky" is subjective, I think two clear boundaries could be:

  • connections to services that can only be reached from the wikiprod network (only LDAP at the moment? not sure)
  • connections to services where toolforge is only one of many users (LDAP, gitlab, phabricator, etc.)

agree 100% subjective xd

I would separate it on "scope" of the credentials, as in, credentials that would give access to do things further than a regular user, or would give access just to do what a regular user does, like:

  • LDAP -> you'd have access to more than what a user does (you can change anything in LDAP)
  • gitlab -> you'd have extra access too, as you would be able to create repositories under toolforge-repos and manage tool groups for all tools
  • phabricator -> same, as you'd have extra access to create project tags, repositories, etc.
  • (future) openstack? -> this one might be both, depending on how it's implemented
  • (future) toolforge-ui -> this would give you access to only what the user can do in the tool

Keeping those behind the API would shrink even more the things an attacker could do if the toolforge deployment in k8s got broken into (as they would have access only to that thin API, instead of directly gitlab/phabricator/ldap).

Another interesting dimension to keep in mind is whether we can have an extra deployment of it, for example for toolsbeta or lima-kilo. That would be hard to do for both gitlab and phabricator, but having a mock deployment of striker-api that just returns 'ok' to the API request would be pretty easy.
That is, splitting from striker the parts that are hard to replicate and don't change often from the parts that are easy to replicate and change more often (especially if we want to put effort into adding extra features and redesigning parts of it).
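A mock of that kind could be as small as the following WSGI application, using only the Python standard library. The response body, the port and the function names are made up for this sketch; a real mock would mirror whatever the actual API ends up returning.

```python
import json
from wsgiref.simple_server import make_server


def mock_striker_api(environ, start_response):
    """Answer every request, whatever the path or method, with a JSON 'ok'."""
    body = json.dumps({"status": "ok"}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(port: int = 8000) -> None:
    """Run the mock locally, e.g. next to a lima-kilo or toolsbeta setup."""
    with make_server("127.0.0.1", port, mock_striker_api) as server:
        server.serve_forever()
```

Calling serve() starts the mock on localhost; anything that talks to the API could then be pointed at it without needing real gitlab, phabricator or LDAP credentials.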

I guess that makes the dimensions:

  • reachable from prod only: yes/no
  • gives wide access/gives only user access
  • deploying a working test instance: easy/hard
taavi claimed this task.

This was discussed in today's Toolforge monthly meeting. The summary of the agreement there is:

  • We want to migrate the privileged LDAP write operations from Striker to this new service
  • That new service is responsible for only allowing writes that affect Toolforge (or Toolsbeta, as configured) related objects, as the LDAP directory is also used by other parts of the Wikimedia infrastructure.
  • The service will expose an API that allows high-level operations ("add maintainer to tool", "disable tool") instead of providing a generic filtered LDAP write interface.
  • Further authorization (is this specific user allowed to act on this specific tool) is for now left in Striker. Designing a user-facing API with those checks included is left for the future.
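As a rough illustration of the agreed shape, the sketch below exposes two high-level operations and refuses anything outside the configured projects. It keeps state in memory instead of writing to LDAP, and it assumes tool accounts are named "<project>.<toolname>" (for example "tools.example"); the class, method and exception names are hypothetical. Per the summary above, per-user authorization is not part of it.

```python
from typing import Dict, Set, Tuple


class OutOfScopeError(Exception):
    """Raised when an operation targets an account outside the allowed projects."""


class ToolAccountService:
    """In-memory stand-in for the new service (no real LDAP writes here)."""

    def __init__(self, allowed_projects: Tuple[str, ...] = ("tools",)) -> None:
        self.allowed_projects = allowed_projects
        self.maintainers: Dict[str, Set[str]] = {}
        self.disabled: Set[str] = set()

    def _check_scope(self, account: str) -> None:
        # Assumed naming scheme: "<project>.<toolname>".
        project, sep, tool = account.partition(".")
        if not sep or not tool or project not in self.allowed_projects:
            raise OutOfScopeError(account)

    def add_maintainer(self, account: str, user: str) -> None:
        """High-level operation: 'add maintainer to tool'."""
        self._check_scope(account)
        self.maintainers.setdefault(account, set()).add(user)

    def disable_tool(self, account: str) -> None:
        """High-level operation: 'disable tool'."""
        self._check_scope(account)
        self.disabled.add(account)
```

Because callers can only name an operation and a tool, there is no way to express an arbitrary LDAP modification through this interface, which is the difference from a generic filtered LDAP write proxy.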