Page MenuHomePhabricator

Spicerack: add distributed locking support
Open, MediumPublic

Description

With the ever growing number of cookbooks and cookbook runs, we're now at a point in which we start to need to be able to limit the amount of specific cookbooks that can be run in parallel.
In some cookbooks some steps require that other cookbooks run are not performing the same step at the same time. In other cases we might want to limit the number of parallel runs.
Those locks should naturally be distributed and have a TTL with a cleanup of stale locks in place.

Requirements

  1. As a cookbook developer I want to be able to say:
    • No more than 3 parallel run of this cookbook can be running at any given time
    • because the usual time this cookbooks take to run is N minutes, I want to set the TTL for the lock at say 2xN minutes
    • This specific line of code (or method) can be run only once at any given time, no parallelism
  2. As a cookbook user:
    • I want to be able to bypass the locking mechanism and run the cookbook in "unsafe" mode from the locks point of view for emergency reasons or because the locking backend has an outage
    • I want to know which cookbooks can lock one another. [out of scope]
    • If I encounter a lock, I want to know the user who holds the lock, the cumin host, how much time the lock has left, and whether the cookbook is actively running or waiting at a user prompt [out of scope, has nothing to do with the locking feature].
    • If my cookbook is holding the lock, I want to know the above information as well.
    • If my cookbook is waiting at a prompt, I want to be paged so I can complete the prompt. [out of scope, has nothing to do with the locking feature]
    • I need this information to be available throughout my cookbook run, via CLI or dashboard. [it's available in the console and the logs, and there is no need to know about it, it's just a contention prevention mechanism, and depending on the settings multiple concurrent runs will be available for specific locks.]
  1. As a Spicerack developer:
    • Each cookbook should set a lock for itself with a default reasonable concurrency limit and TTL, easily customizable for each cookbook.
    • I want to check if existing libraries exists that do what's needed and implement it only if necessary
    • All the libraries must be in Debian
    • The backend where the locks are stored is not important as long as it's reliable and distributed

Existing libraries

After some research for Python libraries that implement a distributed locking mechanism with configurable concurrency/parallelism and a TTL, I didn't find any that were even close to the set of features needed here.
So if we need to implement one we should first decide which backend we want to use for it.

Locks backend

Given the current infrastructure at the WMF it seems natural to pick one of the existing backends that are easily usable for locking:

  1. Poolcounter
    • PRO: in house, supports queue size (concurrency) for the locks, has a Python library (actually more than one)
    • CON: requires a TCP connection to be kept open for each lock for the whole duration of the lock and some cookbooks last a week or more. It's also a cross-DC connection in some cases, to ensure the distributed locking.
  2. etcd
    • PRO: already used in Spicerack via conftool, has native support for basic locks (even better with APIv3)
    • CON: the APIv3 library for Python has not been released in 2 years and has 61 commits in master not officially released, some of them needed to fully support our cluster (DNS SRV record support has 2 PRs, none merged and multi-endpoint support is merged in master but not released). We could though use the APIv2 library (preferred) or the HTTP GRPC gateway one (less preferred).
  3. Redis
    • CON: the Redlock Python libraries seems to be mostly unmaintained
    • CON: its usage at the WMF has declined in recent times, might not be the smartest move to use it for this.

Given the above it would seem natural to me to pick etcd as backend, to be decided with which library/API but most likely the APIv2 python library for now, to be replaced with the APIv3 one if/once available.

Implementation proposal

  1. Use etcd as backend
  2. Pick a namespace for the keys and setup the ACLs for it (proposed name: /spicerack_locks
  3. Use the API2 for now
  4. Use the very simple native support for exclusive locks in etcd APIv2 to acquire the lock to write a more complex object that has the additional metadata needed to implement the feature. Decide how many native locks to use:
    1. Just one with a hardcoded name to be used for all write actions performed by Spicerack that manage locks
    2. Three different ones, one for each different key prefix (the automatic lock creation at every cookbook run (basically between START and END), spicerack modules custom locks, cookbooks custom locks)
    3. Many, one for each different lock to create
  5. To acquire a spicerack lock first the native simple lock will be acquired, then a JSON object will be read/written from/to etcd to set the lock. If the lock cannot be created because it has reached its size limit it will retry for few times with some sleeps in between.
  6. An example object written to etcd will be of the form (the comments are added here to better explain the fields):
key: '/spicerack_locks/cookbooks/sre.hosts.decommission'
value: {
    '4e2677c7-541a-4f9e-afbb-cfdb69c440a1': {  # UUID of the single lock acquired, auto-generated
        'concurrency': 5,  # How many parallel runs are allowed
        'ttl': 120,  # After how many seconds the lock should be considered stale and garbage collected
        'created': '2023-07-17 10:50:35.974419',  # The UTC datetime of when the lock was acquired
        'owner': 'user@host',  # The username and hostname where the cookbook that took the lock is running
    }
}
    • Alternatively it's possible to use the etcd's native TTL support making each UUID based lock a key in etcd with TTL inside the lock name's directory and each time traverse the directory to get all the existing locks. This should be done while holding a write lock to avoid race conditions at least until we can use the API v3 and use transactions.
  1. From the API point of view there will be a context manager method acquired(key: str, *, concurrency: int, ttl: int) -> None to be used with the with ... statement plus an acquire(key: str, *, concurrency: int, ttl: int) -> str and release(lock_id: str) -> None methods for standalone use
  2. At each cookbook run Spicerack will automatically acquire the lock for the key /spicerack_locks/cookbooks/$COOKBOOK_NAME
  3. Each cookbook can also create additional locks for specific steps getting a lock object via the usual accessor mechanism: spicerack.lock(). The key for the locks created this way will have a prefix /spicerack_locks/custom/$LOCK_NAME.
  4. Spicerack modules that will implement locking for some of their steps will use locks with a prefix /spicerack_locks/$MODULE_NAME/ (e.g. /spicerack_locks/spicerack.dhcp/).

I'll send shortly some CR to gerrit with a draft of the above proposal.

Details

SubjectRepoBranchLines +/-
operations/cookbooksmaster+15 -1
operations/cookbooksmaster+15 -1
operations/cookbooksmaster+9 -7
operations/cookbooksmaster+8 -5
operations/cookbooksmaster+37 -10
operations/software/spicerackmaster+6 -0
operations/cookbooksmaster+16 -5
operations/cookbooksmaster+3 -0
operations/cookbooksmaster+2 -0
operations/cookbooksmaster+6 -1
operations/puppetproduction+17 -5
operations/software/spicerackmaster+6 -1
operations/software/spicerackmaster+35 -6
operations/cookbooksmaster+6 -1
operations/cookbooksmaster+6 -1
operations/cookbooksmaster+6 -1
operations/cookbooksmaster+6 -1
operations/software/spicerackmaster+2 -5
operations/cookbooksmaster+7 -30
operations/software/spicerackmaster+61 -4
operations/software/spicerackmaster+87 -54
operations/software/spicerackmaster+34 -1
operations/software/spicerackmaster+146 -14
operations/software/spicerackmaster+22 -12
operations/software/spicerackmaster+21 -0
operations/software/spicerackmaster+1 K -1
Show related patches Customize query in gerrit

Event Timeline

Volans triaged this task as Medium priority.Jul 17 2023, 10:55 AM
Volans created this task.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 938822 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] locking: add new module for distributed locking

https://gerrit.wikimedia.org/r/938822

Change 938823 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] cookbook: add --no-locks CLI argument

https://gerrit.wikimedia.org/r/938823

Change 938824 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] cookbooks: acquire lock for each cookbook run

https://gerrit.wikimedia.org/r/938824

Per email discussion with @jbond , adding another locking backend for consideration:

Swift

  • PRO: Well-known/supported at WMF
  • PRO: Native support for TTLs and custom metadata
  • PRO: Could use requests, python-openstack or s3-compatible libraries for implementation.
  • CON: Eventual consistency means cookbooks launched within a few seconds of each other may not see one anothers' locks. That's extremely unlikely but it is possible. At that point, you'd have to pick a 'winner', presumably based on timestamp, and inform both cookbook users that the lock exists and the 'loser' is waiting for the winner's cookbook to finish.
bking updated the task description. (Show Details)
bking updated the task description. (Show Details)

@bking please use the task comments to discuss changes instead of modifying the task description

Swift

  • CON: Eventual consistency means cookbooks launched within a few seconds of each other may not see one anothers' locks. That's extremely unlikely but it is possible. At that point, you'd have to pick a 'winner', presumably based on timestamp, and inform both cookbook users that the lock exists and the 'loser' is waiting for the winner's cookbook to finish.

I think this is a very big con and in my mind excludes it from consideration. You say its extremely unlikely that " cookbooks [would be] launched within a few seconds of each other" but i disagree and thing this is both likley and desirable. As a recent example i set up two new vms one in codfw and on in eqiad and would have liked to start both cookbooks simultaneity (one from cumin1001 and one from cumin2002) and have them both do the right thing without tripping each other up. however i had to stagger them due to the lack of locking on the dns repo.

To ensure that the generated read/write traffic on the etcd cluster will be ok and not cause any issue I've made some tests using the CRs sent so far (in particular https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/938822/ ).

In particular, running this code:

import logging
from time import sleep
from os import getpid

from spicerack.locking import get_lock_instance

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.WARNING)

lock = get_lock_instance(config_file='etcd_configrc', prefix='cookbooks', owner=getpid(), dry_run=False)
with lock.acquired("mykey", concurrency=2, ttl=1800):
    print('with spicerack lock acquired.')
    sleep(1)

I get this output, from which it's clear the number of read and writes to etcd made by the library:

DEBUG:etcd.client:New etcd client created for http://127.0.0.1:2379
DEBUG:etcd.client:Retrieved list of machines: ['http://localhost:2379']
DEBUG:etcd.client:Machines cache initialised to ['http://localhost:2379']
DEBUG:spicerack.locking:Acquiring lock for key mykey: {'concurrency': 2, 'created': '2023-10-02 13:50:49.946912', 'owner': 91607, 'ttl': 1800}
DEBUG:etcd.lock:Initiating lock for /spicerack/locks/etcd with uuid 5347b11cd63d4a74951984316abf80aa
DEBUG:etcd.client:Issuing read for key /spicerack/locks/etcd with args {'recursive': True}
DEBUG:etcd.lock:Lock not found, writing it to /spicerack/locks/etcd
DEBUG:etcd.client:Writing 5347b11cd63d4a74951984316abf80aa to key /spicerack/locks/etcd ttl=15 dir=False append=True
DEBUG:etcd.lock:Lock key /spicerack/locks/etcd/00000000000000000733 written, sequence is 00000000000000000733
DEBUG:etcd.client:Issuing read for key /spicerack/locks/etcd with args {'recursive': True}
DEBUG:etcd.lock:Lock keys found: ['/spicerack/locks/etcd/00000000000000000733']
DEBUG:etcd.lock:No key before our one, we are the locker
DEBUG:etcd.lock:Lock acquired!
DEBUG:etcd.client:Issuing read for key /spicerack/locks/etcd/00000000000000000733 with args {}
DEBUG:etcd.client:Issuing read for key /spicerack/locks/cookbooks/mykey with args {'timeout': 60}
DEBUG:etcd.client:Writing {"cba7acc5-76af-4e4a-8b48-4ae9a467536b": {"concurrency": 2, "created": "2023-10-02 13:50:49.992084", "owner": 91607, "ttl": 1800}} to key /spicerack/locks/cookbooks/mykey ttl=None dir=False append=False
INFO:spicerack.locking:Acquired lock for key /spicerack/locks/cookbooks/mykey: {'concurrency': 2, 'created': '2023-10-02 13:50:49.992084', 'owner': 91607, 'ttl': 1800}
DEBUG:etcd.lock:Releasing existing lock /spicerack/locks/etcd/00000000000000000733
DEBUG:etcd.client:Deleting /spicerack/locks/etcd/00000000000000000733 recursive=None dir=None extra args={}
DEBUG:etcd.client:Calculated params = {}
with spicerack lock acquired.
DEBUG:spicerack.locking:Releasing lock for key mykey with ID cba7acc5-76af-4e4a-8b48-4ae9a467536b
DEBUG:etcd.lock:Initiating lock for /spicerack/locks/etcd with uuid 7cf711f4da144124bfa99e2d1d5640a5
DEBUG:etcd.client:Issuing read for key /spicerack/locks/etcd with args {'recursive': True}
DEBUG:etcd.lock:Lock not found, writing it to /spicerack/locks/etcd
DEBUG:etcd.client:Writing 7cf711f4da144124bfa99e2d1d5640a5 to key /spicerack/locks/etcd ttl=15 dir=False append=True
DEBUG:etcd.lock:Lock key /spicerack/locks/etcd/00000000000000000736 written, sequence is 00000000000000000736
DEBUG:etcd.client:Issuing read for key /spicerack/locks/etcd with args {'recursive': True}
DEBUG:etcd.lock:Lock keys found: ['/spicerack/locks/etcd/00000000000000000736']
DEBUG:etcd.lock:No key before our one, we are the locker
DEBUG:etcd.lock:Lock acquired!
DEBUG:etcd.client:Issuing read for key /spicerack/locks/etcd/00000000000000000736 with args {}
DEBUG:etcd.client:Issuing read for key /spicerack/locks/cookbooks/mykey with args {'timeout': 60}
DEBUG:etcd.client:Writing {} to key /spicerack/locks/cookbooks/mykey ttl=None dir=False append=False
INFO:spicerack.locking:Released lock for key /spicerack/locks/cookbooks/mykey: {'concurrency': 2, 'created': '2023-10-02 13:50:49.992084', 'owner': 91607, 'ttl': 1800}
DEBUG:etcd.lock:Releasing existing lock /spicerack/locks/etcd/00000000000000000736
DEBUG:etcd.client:Deleting /spicerack/locks/etcd/00000000000000000736 recursive=None dir=None extra args={}
DEBUG:etcd.client:Calculated params = {}

In particular there are:

  • 4 Writes
  • 2 Deletes
  • 8 Reads

Those will be the minimum operations that will be made by every cookbook run, more if the cookbook acquires some specific lock for specific operations.
Keep in mind that most of the operations are issued by the etcd library to get the exlusive lock to write the spicerack object.

@Joe do you think this will be acceptable for the etcd cluster?

For the record as Giuseppe is out, I had a chat with @CDanis going over the plan and numbers and we didn't find anything worrisome or blockers. I'll proceed with the current implementation, anyway it will be off by default and switched on only with a puppet change to update the config file, that will also allow to easily stop using the locks in case there is any issue.

Change 938822 merged by jenkins-bot:

[operations/software/spicerack@master] locking: add new module for distributed locking

https://gerrit.wikimedia.org/r/938822

Change 938823 merged by jenkins-bot:

[operations/software/spicerack@master] cookbook: add --no-locks CLI argument

https://gerrit.wikimedia.org/r/938823

Change 964852 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] locking: load also ~/.etcdrc for the running user

https://gerrit.wikimedia.org/r/964852

Change 964852 merged by jenkins-bot:

[operations/software/spicerack@master] locking: load also ~/.etcdrc for the running user

https://gerrit.wikimedia.org/r/964852

Change 966172 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] spicerack: add _spicerack_lock private accessor

https://gerrit.wikimedia.org/r/966172

Change 966173 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] dhcp: acquire exclusive per-DC lock on write

https://gerrit.wikimedia.org/r/966173

Change 966187 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.dhcp: make the lock per-host

https://gerrit.wikimedia.org/r/966187

Change 966188 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.provision: make the lock per-host

https://gerrit.wikimedia.org/r/966188

Change 966189 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.reimage: make the lock per-host

https://gerrit.wikimedia.org/r/966189

Change 966190 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.reboot-single: make the lock per-host

https://gerrit.wikimedia.org/r/966190

Change 938824 merged by jenkins-bot:

[operations/software/spicerack@master] cookbooks: acquire lock for each cookbook run

https://gerrit.wikimedia.org/r/938824

Change 966172 merged by jenkins-bot:

[operations/software/spicerack@master] spicerack: add _spicerack_lock private accessor

https://gerrit.wikimedia.org/r/966172

Change 966490 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] dhcp: adapt to new Spicerack's dhcp() API

https://gerrit.wikimedia.org/r/966490

Change 966173 merged by jenkins-bot:

[operations/software/spicerack@master] dhcp: acquire exclusive per-DC lock on write

https://gerrit.wikimedia.org/r/966173

Change 966547 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] documentation: add section for distributed locking

https://gerrit.wikimedia.org/r/966547

Change 966547 merged by jenkins-bot:

[operations/software/spicerack@master] documentation: add section for distributed locking

https://gerrit.wikimedia.org/r/966547

Change 966490 merged by jenkins-bot:

[operations/cookbooks@master] dhcp: adapt to new Spicerack's dhcp() API

https://gerrit.wikimedia.org/r/966490

Change 966812 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] locking: fix path for Spicerack modules locks

https://gerrit.wikimedia.org/r/966812

Change 966812 merged by jenkins-bot:

[operations/software/spicerack@master] locking: fix path for Spicerack modules locks

https://gerrit.wikimedia.org/r/966812

Change 966187 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.dhcp: make the lock per-host

https://gerrit.wikimedia.org/r/966187

Change 966188 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: make the lock per-host

https://gerrit.wikimedia.org/r/966188

Change 966189 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: make the lock per-host

https://gerrit.wikimedia.org/r/966189

Change 966190 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reboot-single: make the lock per-host

https://gerrit.wikimedia.org/r/966190

Change 966854 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] locking: delete the key on etcd if no locks remain

https://gerrit.wikimedia.org/r/966854

Change 966854 merged by jenkins-bot:

[operations/software/spicerack@master] locking: delete the key on etcd if no locks remain

https://gerrit.wikimedia.org/r/966854

Change 966873 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] spicerack: enable distributed locking in production

https://gerrit.wikimedia.org/r/966873

Change 966886 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] documentation: expand distributed locking docs

https://gerrit.wikimedia.org/r/966886

Change 966886 merged by jenkins-bot:

[operations/software/spicerack@master] documentation: expand distributed locking docs

https://gerrit.wikimedia.org/r/966886

Change 966873 merged by Volans:

[operations/puppet@production] spicerack: enable distributed locking in production

https://gerrit.wikimedia.org/r/966873

Mentioned in SAL (#wikimedia-operations) [2023-10-19T06:34:45Z] <volans> enabled distributed locking support in spicerack/cookbooks T341973

Disributed locking is now live in Spicerack and used by the Cookbooks.
For a general overview see https://doc.wikimedia.org/spicerack/master/introduction.html#distributed-locking

In case of emergency or etcd malfunctions cookbooks can be run with the --no-locks CLI flag (see cookbook -h) to bypass the locking mechanism or the whole locking support can be easily disabled globally setting profile::spicerack::etcd_config: ~ in Puppet's hieradata/common/profile/spicerack.yaml.

I've also sent an email to the SREs to let anyone know about it.

Change 967164 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.deploy.python-code: fine tune lock

https://gerrit.wikimedia.org/r/967164

Change 967165 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.discovery.datacenter: customize lock arguments

https://gerrit.wikimedia.org/r/967165

Change 967166 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.discovery.service-route: customize lock args

https://gerrit.wikimedia.org/r/967166

Change 967167 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.dns.netbox: make the lock exclusive

https://gerrit.wikimedia.org/r/967167

Change 967164 merged by jenkins-bot:

[operations/cookbooks@master] sre.deploy.python-code: fine tune lock

https://gerrit.wikimedia.org/r/967164

Change 967167 merged by Volans:

[operations/cookbooks@master] sre.dns.netbox: make the lock exclusive

https://gerrit.wikimedia.org/r/967167

Change 967236 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.puppet.sync-netbox-hiera: fine tune lock

https://gerrit.wikimedia.org/r/967236

Change 967236 merged by jenkins-bot:

[operations/cookbooks@master] sre.puppet.sync-netbox-hiera: fine tune lock

https://gerrit.wikimedia.org/r/967236

Change 967628 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.ganeti.*: customize lock arguments

https://gerrit.wikimedia.org/r/967628

Change 967629 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.gitlab.*: customize lock arguments

https://gerrit.wikimedia.org/r/967629

Change 967629 merged by Jbond:

[operations/cookbooks@master] sre.gitlab.*: customize lock arguments

https://gerrit.wikimedia.org/r/967629

Change 973194 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] documentation: add example of locking

https://gerrit.wikimedia.org/r/973194

Change 973194 merged by jenkins-bot:

[operations/software/spicerack@master] documentation: add example of locking

https://gerrit.wikimedia.org/r/973194

Change 974972 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hardware.upgrade-firmware: add custom locking

https://gerrit.wikimedia.org/r/974972

Change 974973 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.decommission: acquire lock for each host

https://gerrit.wikimedia.org/r/974973

Change 974972 merged by jenkins-bot:

[operations/cookbooks@master] sre.hardware.upgrade-firmware: add custom locking

https://gerrit.wikimedia.org/r/974972

Change 974973 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.decommission: acquire lock for each host

https://gerrit.wikimedia.org/r/974973

Change #967165 merged by jenkins-bot:

[operations/cookbooks@master] sre.discovery.datacenter: customize lock arguments

https://gerrit.wikimedia.org/r/967165

Change #967166 merged by jenkins-bot:

[operations/cookbooks@master] sre.discovery.service-route: customize lock args

https://gerrit.wikimedia.org/r/967166