Figure out process for deleting an unused tool
Open, High, Public

Description

See Toolforge (Tools to be deleted) for information on how to nominate a tool for deletion.

We have a growing list of tools that their maintainers have volunteered for deletion. We probably also have quite a large number of tools that are being "name squatted" and could be reclaimed with gentle prodding of their maintainers. But we don't have a well-defined process for what steps are actually necessary to delete a tool. We should make that checklist and then figure out whether automating it is worthwhile.

Deletion checklist

  • Remove all maintainers from tool group
  • Remove tool from maintainers list for all other tools
  • Archive tool's crontab
  • Stop all running jobs owned by tool on job grid
  • Delete all deployments owned by the tool on Kubernetes
  • Revoke Kubernetes credentials for tool
  • Remove Kubernetes namespace for tool
  • Archive any ToolsDB databases owned by the tool
  • Revoke database credentials for tool
  • Revoke elasticsearch credentials for tool
  • Archive tool homedir
  • Archive any Diffusion repositories owned by the tool
  • Delete tool account and group from LDAP

Event Timeline

bd808 added subscribers: madhuvishy, Andrew, yuvipanda and 3 others.

@valhallasw, @scfc, @yuvipanda, @madhuvishy, @Andrew, @chasemp: what am I forgetting? Update the list in the description please and thank you.

For everyone's understanding: This is a complement to T102066 (abandoned tools that nobody wants to take over and/or should die because of technical reasons)?

I think the tools that have been self-nominated so far are mostly failed experiments or things that have graduated to other hosting/folded into other tools. But, yes the archival step would be intended to preserve a record in case someone petitioned to revive the tool as a fork.

Yeah, I think some of them were created only for testing, for learning Toolforge, or in an attempt to build a tool that never succeeded (like mine). Because of this, there should also be an easier process for removing a user's own tool within some time after creation (7 days?).

With a backlog of 45 (a.t.m.) requests for tool deletions, is the checklist above complete so we can start honoring (or at least study) those requests? Thanks.

Having a list of abandoned tools is not a cause for alarm, whether the list has 1 entry or 1000. Implementing a repeatable cleanup process is a hard problem. I would like to devote some time in the coming months to attempting to find a resolution, but I am also in no hurry to do so. We now have a process documented in this task and on wikitech that tool maintainers can use to detach themselves from a tool they no longer wish to maintain. This was the most important aspect to solve. The remaining work is administrative and not something that should be a concern for the Toolforge maintainer community.

Bstorm triaged this task as High priority. Feb 11 2020, 5:54 PM

This looks like enough steps that we should probably be automating things! A few questions:

  • Is it important that this process be fully reversible? (If 'yes' then we have to worry about restoring access creds in the future, and preventing namespace collisions)
  • Do we want an api for this (so that e.g. striker can archive/restore a tool) or just a CLI?
  • Is there any important tool state stored in Striker's database, or is all the important state in NFS and ldap?

  • Is it important that this process be fully reversible? (If 'yes' then we have to worry about restoring access creds in the future, and preventing namespace collisions)

I'm open to hearing arguments about why "undo" of a delete is important, but my current opinion is that it is not.

  • Do we want an api for this (so that e.g. striker can archive/restore a tool) or just a CLI?

Fully automated would be awesome, but a checklist with N steps requiring use of M cli tools would be a perfectly reasonable way to start. There are several different backends to deal with (Kubernetes, Grid Engine, ToolsDB, Wiki Replicas, Phabricator, LDAP, ...) which I think means that getting to full one-click automation will be non-trivial. It would probably end up being a command-and-control endpoint that called out to several other endpoints to do the work in any case so that we could isolate credentials and failure zones.

  • Is there any important tool state stored in Striker's database, or is all the important state in NFS and ldap?

The only thing that is in Striker's DB really would be toolinfo records which are read by admin.toolforge.org, Hay's Directory, and (coming soon) Toolhub. These can currently be deleted by a tool maintainer or an admin via the Striker UI. Exposing that as an API could be done, but would require us to figure out API auth for Striker.

Proposed: Striker has a disable/enable option; when a tool is disabled (via ldap login block) it also touches the password-last-updated date.

Grid and k8s should check if the

Whoa, that's what happens when I try to type during a meeting. Here's a proposed workflow:

  1. An admin or a tool member can mark a tool as 'disabled'. They can also mark it as 'enabled'. When a tool is marked as disabled a datestamp is set and *waves hands* running procs, pods and crons are stopped and prevented from restarting.
  2. A cronjob or an admin-run script (maybe running on an NFS server) periodically checks for tools that have been disabled for more than 30 days. Those tools have their files and config archived into a tarball to the best of our ability; credentials are revoked and ldap entries are removed.
  3. If, after a tool has been archived, a user requests that a tool be revived (possibly via the abandoned tool workflow) we will (by hand) copy the archived tarball from step 2 into the tool dir of their choice. After that they're on their own.
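The workflow above can be sketched as a small state model: a tool carries a disabled datestamp, can be re-enabled until it is archived, and becomes an archive candidate once it has been disabled for more than 30 days. The `Tool` class and its method names are hypothetical and only illustrate the proposal; they are not the actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

ARCHIVE_AFTER = timedelta(days=30)  # grace period from step 2


@dataclass
class Tool:
    name: str
    disabled_at: Optional[datetime] = None
    archived: bool = False

    def disable(self, now: datetime) -> None:
        """Step 1: mark as disabled, keeping the first datestamp."""
        if self.disabled_at is None:
            self.disabled_at = now

    def enable(self) -> None:
        """Step 1 in reverse; not possible once the tool is archived."""
        if self.archived:
            raise ValueError("archived tools are only restored by hand")
        self.disabled_at = None

    def ready_to_archive(self, now: datetime) -> bool:
        """Step 2: disabled for more than 30 days and not yet archived."""
        return (
            not self.archived
            and self.disabled_at is not None
            and now - self.disabled_at > ARCHIVE_AFTER
        )


if __name__ == "__main__":
    tool = Tool("example")
    tool.disable(datetime(2021, 6, 1, tzinfo=timezone.utc))
    print(tool.ready_to_archive(datetime(2021, 7, 15, tzinfo=timezone.utc)))
```

Stopping processes, pods, and crons (the "waves hands" part of step 1) is deliberately left out of the sketch.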

We will use the ldap pwdAccountLockedTime field to indicate that a tool is disabled (and when it was disabled).

Setting or clearing this will be left to Striker. I'm not sure if the option should be available in user space or not; I'd like users to be able to decom their own tools but the ability to re-enable will interfere with admin powers to kill off malicious tools.

Here's how to find disabled tools with the cli:

ldapsearch -W -H ldap://ldap-labs.eqiad.wikimedia.org:389 -D uid=novaadmin,ou=people,dc=wikimedia,dc=org -b ou=people,ou=servicegroups,dc=wikimedia,dc=org "(pwdAccountLockedTime=*)" "+"
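A script consuming that command's output would need to pull the lock timestamp out of each entry. The sketch below is a simplified parser for LDIF text, assuming plain unfolded `attribute: value` lines and timestamps in the `YYYYmmddHHMMSSZ` form; the function name and the sample entries are made up for illustration, and a real consumer would more likely use an LDAP library than parse command output.

```python
from datetime import datetime, timezone
from typing import Dict


def parse_locked_entries(ldif_text: str) -> Dict[str, datetime]:
    """Map each entry DN to its pwdAccountLockedTime as a UTC datetime.

    Simplification: ignores LDIF line folding and base64-encoded values.
    """
    locked: Dict[str, datetime] = {}
    current_dn = None
    for line in ldif_text.splitlines():
        if line.startswith("dn: "):
            current_dn = line[len("dn: "):]
        elif line.startswith("pwdAccountLockedTime: ") and current_dn:
            raw = line.split(": ", 1)[1]
            stamp = datetime.strptime(raw, "%Y%m%d%H%M%SZ")
            locked[current_dn] = stamp.replace(tzinfo=timezone.utc)
    return locked


# Hypothetical sample output; the entry names are invented.
SAMPLE = """\
dn: uid=tools.example,ou=people,ou=servicegroups,dc=wikimedia,dc=org
pwdAccountLockedTime: 20210615120000Z

dn: uid=tools.other,ou=people,ou=servicegroups,dc=wikimedia,dc=org
createTimestamp: 20200101000000Z
"""

if __name__ == "__main__":
    for dn, when in parse_locked_entries(SAMPLE).items():
        print(dn, when.isoformat())
```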

We're going to need several new agents and additional changes to support this. Here's an updated version of the checklist from the original description.

First, for the disabling phase:

  • Archive crontab [6]
  • Stop all running jobs owned by tool on job grid [2]
  • Delete all deployments owned by the tool on Kubernetes [3]
  • Prevent the launching of new grid jobs [4]
  • Prevent the launching of new k8s deployments [5]

Then for the archiving phase:

  • Remove all maintainers from tool group [1]
  • Remove tool from maintainers list for all other tools [1]
  • Archive any ToolsDB databases owned by the tool [1]
  • Revoke database credentials for tool [7]
  • Revoke elasticsearch credentials for tool [1]
  • Archive tool homedir [1]
  • Delete tool account and group from LDAP [1]

And I'm not sure we should do this one automatically at all:

  • Archive any Diffusion repositories owned by the tool

[1] Cron job that detects long-disabled tools, edits ldap and archives files. Possibly runs on an NFS server in order to archive things off of NFS (as per the current closed-cps-project archive process)

[2] Cron job that runs on a grid master, detects disabled tools and stops their associated jobs

[3] Cron job that runs on a k8s host, detects disabled tools and stops associated k8s deployments; detects archive candidates and revokes k8s permissions

[4] Disable logins as disabled tool.

[5] Disable logins as disabled tool.

[6] Runs on cron submit host, archives user crontab

[7] This can probably be handled by maintain-dbusers
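To see which agent ends up owning which steps, the two checklists and their note numbers can be written down as data and grouped. This is only a restatement of the lists above in Python; the `STEPS` table and `steps_by_agent` helper are hypothetical names, and the Diffusion step is omitted because no agent was assigned to it.

```python
from typing import Dict, List

# (phase, step, note number of the responsible agent), copied from the lists above.
STEPS = [
    ("disable", "Archive crontab", 6),
    ("disable", "Stop all running jobs owned by tool on job grid", 2),
    ("disable", "Delete all deployments owned by the tool on Kubernetes", 3),
    ("disable", "Prevent the launching of new grid jobs", 4),
    ("disable", "Prevent the launching of new k8s deployments", 5),
    ("archive", "Remove all maintainers from tool group", 1),
    ("archive", "Remove tool from maintainers list for all other tools", 1),
    ("archive", "Archive any ToolsDB databases owned by the tool", 1),
    ("archive", "Revoke database credentials for tool", 7),
    ("archive", "Revoke elasticsearch credentials for tool", 1),
    ("archive", "Archive tool homedir", 1),
    ("archive", "Delete tool account and group from LDAP", 1),
]


def steps_by_agent(phase: str) -> Dict[int, List[str]]:
    """Group the steps of one phase by the note number of their agent."""
    grouped: Dict[int, List[str]] = {}
    for step_phase, step, agent in STEPS:
        if step_phase == phase:
            grouped.setdefault(agent, []).append(step)
    return grouped


if __name__ == "__main__":
    for agent, steps in sorted(steps_by_agent("archive").items()):
        print(agent, steps)
```

Grouped this way, agent [1] carries six of the seven archive steps, which is consistent with running it as a single cron job next to the files it archives.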

Change 699577 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Toolforge bastions: add a broken shell for disabled tools

https://gerrit.wikimedia.org/r/699577

pwdPolicySubentry=cn=disabled,ou=ppolicies,dc=wikimedia,dc=org may be a better setting to rely on, at least for positive account locking. See the analysis at T168692#5065458 for some reasoning behind that.

That doesn't include a datestamp does it?

No, it does not. I think it's reasonable to try using pwdAccountLockedTime to track the time of change, but we should also do the belt-and-suspenders addition of setting pwdPolicySubentry=cn=disabled,ou=ppolicies,dc=wikimedia,dc=org on the tool user account. In reality those accounts do not have passwords anyway, but these are the easy LDAP schema bits to fiddle with.
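Setting both attributes together could be expressed as a single LDIF modify record, for example to feed to `ldapmodify`. The sketch below only builds that text; the helper name `disable_ldif` and the example DN are hypothetical, and whether the directory allows these attributes to be written this way depends on server configuration that this thread does not describe.

```python
from datetime import datetime, timezone

DISABLED_POLICY = "cn=disabled,ou=ppolicies,dc=wikimedia,dc=org"


def disable_ldif(dn: str, when: datetime) -> str:
    """Build an LDIF modify record that sets both 'disabled' markers."""
    # Format the lock time as a UTC GeneralizedTime value.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%SZ")
    return "\n".join([
        f"dn: {dn}",
        "changetype: modify",
        "replace: pwdAccountLockedTime",
        f"pwdAccountLockedTime: {stamp}",
        "-",
        "replace: pwdPolicySubentry",
        f"pwdPolicySubentry: {DISABLED_POLICY}",
        "",
    ])


if __name__ == "__main__":
    # Hypothetical tool account DN, for illustration only.
    example_dn = "uid=tools.example,ou=people,ou=servicegroups,dc=wikimedia,dc=org"
    print(disable_ldif(example_dn, datetime(2021, 6, 15, 12, 0, tzinfo=timezone.utc)))
```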

Change 699577 merged by Andrew Bogott:

[operations/puppet@production] Toolforge bastions: add a broken shell for disabled tools

https://gerrit.wikimedia.org/r/699577

Change 701455 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add config file for the disable_tool script

https://gerrit.wikimedia.org/r/701455

Change 701455 merged by Andrew Bogott:

[operations/puppet@production] Add config file for the disable_tool script

https://gerrit.wikimedia.org/r/701455

Change 701458 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::wmcs::nfs::primary: Fix exchanged ldap pass and dn

https://gerrit.wikimedia.org/r/701458

Change 701458 merged by Andrew Bogott:

[operations/puppet@production] profile::wmcs::nfs::primary: Fix exchanged ldap pass and dn

https://gerrit.wikimedia.org/r/701458

Change 701928 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] toolforge: add a profile for installing the disable_tool script

https://gerrit.wikimedia.org/r/701928

Change 701928 merged by Andrew Bogott:

[operations/puppet@production] toolforge: add a profile for installing the disable_tool script

https://gerrit.wikimedia.org/r/701928

Change 702745 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::toolforge::disable_tool: fix typos

https://gerrit.wikimedia.org/r/702745

Change 702745 merged by Andrew Bogott:

[operations/puppet@production] profile::toolforge::disable_tool: standardize on the singular 'disable_tool'

https://gerrit.wikimedia.org/r/702745

Change 706035 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nfs: use disable-tool.py to archive disabled+expired tools

https://gerrit.wikimedia.org/r/706035

Change 706035 merged by Andrew Bogott:

[operations/puppet@production] nfs: use disable-tool.py to archive disabled+expired tools

https://gerrit.wikimedia.org/r/706035

Change 706768 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] disable_tool: run every 5 minutes rather than every 10

https://gerrit.wikimedia.org/r/706768

Change 706768 merged by Andrew Bogott:

[operations/puppet@production] disable_tool: run every 5 minutes rather than every 10

https://gerrit.wikimedia.org/r/706768