
dbctl: expose diff via API in a more structured way
Open, MediumPublic

Description

From the point of view of a cookbook using dbctl, it would be really useful to be able to access the diff in a more structured way, in addition to the list of diff lines already formatted by the diff library.

A practical example is the depool/pool cookbook, which wants to check that no spurious dbctl changes get committed along with the depool/pool action.
It currently has some heuristics to do so here, but they don't catch all the cases.
In particular, they miss the cases where an instance is the only member of a group (e.g. vslow), so the whole block is removed/created upon depool/repool. Being the last item of a JSON list also creates issues with the diff because of the comma on the previous line. Some cases are caught by the current heuristic, but it is far from optimal and doesn't cover them all.

Ideally dbctl should also expose, from the dbctl.config.diff() method, some more structured data about the diff, so that a client could check it more easily.

Spicerack locks can't help here, as dbctl is expected to also be run directly via the CLI.

Event Timeline

Volans triaged this task as Medium priority.

@Scott_French is this something you could help us with as you are touching dbctl lately? (And I am so thankful you are doing it!).

This is the error I got at the time:

[06:26:42] marostegui@cumin2002:~$ sudo cookbook sre.mysql.pool -r "Fixed corruption" -t T381142 --fast db1223
Acquired lock for key /spicerack/locks/cookbooks/sre.mysql.pool:db1223: {'concurrency': 1, 'created': '2024-11-29 06:26:55.673620', 'owner': 'marostegui@cumin2002 [339837]', 'ttl': 3600}
START - Cookbook sre.mysql.pool db1223 quickly with 2 steps - Fixed corruption
Updated Phabricator task T381142
pool instance db1223 at 25%
The current diff has 2 spurious changes, aborting:
['--- eqiad/sectionLoads/DEFAULT live\n',
 '+++ eqiad/sectionLoads/DEFAULT generated\n',
 '@@ -7,6 +7,7 @@\n',
 '         "db1166": 300,\n',
 '         "db1175": 500,\n',
 '         "db1198": 500,\n',
 '-        "db1212": 500\n',
 '+        "db1212": 500,\n',
 '+        "db1223": 125\n',
 '     }\n',
 ' ]\n']
Exception raised while executing cookbook sre.mysql.pool:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 251, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/mysql/pool.py", line 134, in run
    self.gradual_pooling()
  File "/srv/deployment/spicerack/cookbooks/sre/mysql/pool.py", line 168, in gradual_pooling
    self.commit_change(message)
  File "/srv/deployment/spicerack/cookbooks/sre/mysql/pool.py", line 179, in commit_change
    self.check_diff(diff)
  File "/srv/deployment/spicerack/cookbooks/sre/mysql/pool.py", line 223, in check_diff
    raise RuntimeError("Unable to proceed due to spurious changes in the diff")
RuntimeError: Unable to proceed due to spurious changes in the diff
Released lock for key /spicerack/locks/cookbooks/sre.mysql.pool:db1223: {'concurrency': 1, 'created': '2024-11-29 06:26:55.673620', 'owner': 'marostegui@cumin2002 [339837]', 'ttl': 3600}
END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1223 quickly with 2 steps - Fixed corruption

Ah, that's an interesting problem!

Abstractly, "fully automated" supervision of the diffs would require (1) structured / semantic differences between live and generated configs and (2) a definition, for a given operation, of what edits are expected.

In a sense, DbConfig.diff_configs gets part of the way there, in terms of what it does internally (semi-structured?) - i.e., returning a merged view of textual diffs, where the latter are computed for each leaf object that differs between the two configs.

Now, even if it were to instead return the underlying per-differing-leaf-object results, that still wouldn't quite be sufficient, since cases like T383760#10461481 would be hard to get right.

One way I've seen a "similarly shaped" problem solved before is to refine the notion of leaf object - e.g., in the above, it would be eqiad/sectionLoads/DEFAULT/replicas/db1223 - and introduce a set of type-aware enums that represent how leaves differ (e.g., existence, value).

That would then report who differs (db1223) in what ways (existence) and in what contexts (eqiad/sectionLoads/DEFAULT/replicas), and allow one to express an invariant like: I'm pooling db1223, which has no special groups, so I expect to see only that object differ (existence or value, depending on the pooling step) in one context.
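As a concrete illustration of that idea, the type-aware representation could look roughly like this (DeltaKind, Delta, and the field names are hypothetical, for illustration only, not an existing dbctl API):

```python
# Hypothetical sketch of a type-aware leaf delta; names are
# illustrative, not part of dbctl.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any

class DeltaKind(Enum):
    """How a leaf object differs between live and generated configs."""
    ADD = auto()     # leaf exists only in the generated config
    REMOVE = auto()  # leaf exists only in the live config
    CHANGE = auto()  # leaf exists in both, but with different values

@dataclass(frozen=True)
class Delta:
    """Who differs (path[-1]), how (kind), in what context (path[:-1])."""
    kind: DeltaKind
    path: tuple      # e.g. ("eqiad", "sectionLoads", "DEFAULT", "replicas", "db1223")
    old: Any = None  # None for ADD
    new: Any = None  # None for REMOVE

# The pooling invariant from the text: only the expected instance differs.
delta = Delta(
    kind=DeltaKind.ADD,
    path=("eqiad", "sectionLoads", "DEFAULT", "replicas", "db1223"),
    new=125,
)
assert delta.path[-1] == "db1223"   # who differs
assert delta.kind is DeltaKind.ADD  # in what way
```

A frozen dataclass keeps deltas immutable and comparable, which makes them easy to assert on in checks and tests.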

The obvious downside of course is that this is really "in the weeds" of the semantics of how an operation maps to dbconfig diffs (tightly coupled to schema).

All of that is to say, I probably don't have time to work on this in the near term, but I'll give some thought to whether there's a simpler solution :)

A possible simpler approach could be to use something like dictdiffer and return the diff as Python objects for programmatic consumption, as opposed to a JSON-encoded text diff for human consumption.

Ah, great find, @Volans! It looks like dictdiffer computes exactly the kind of diff I had in mind :)

A tricky part may still be in defining invariants to assert on the diff results, reflecting what we expect for a given operation (i.e., awareness of what's allowed to change and how).

However, getting something "simple" for basic operations like depool and (stepped) pool seems totally tractable.

Revisiting this afternoon:

I think it would be fairly straightforward to provide a "dictdiffer-inspired" [0] structured diff in DbConfig without taking a dependency on the package itself (a python3-dictdiffer Debian package does exist, but only in testing and unstable as of now).

I'm optimistic here because (1) we require a very limited subset of the functionality in dictdiffer and (2) the dbconfig schema (i.e., structure and admissible types) constrains the space of inputs a lot.

With that, it would also be fairly straightforward to define assertions like those described in T383760#10468322 for pool / depool that are resistant to all the problems described in this task.

[0] The key difference would be the fact that, for our purposes, it's likely sufficient (and preferable) to describe diffs in terms of individual leaf deltas rather than sub-tree deltas (e.g., the API is more useful if it tells us db1206 with value 100 disappeared from eqiad/groupLoadsBySection/s1/vslow, rather than "vslow": {"db1206": 100} did from eqiad/groupLoadsBySection/s1 as dictdiffer would).
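A leaf-focused diff of the kind this footnote describes can indeed be sketched in roughly that much code. This is a simplification assuming JSON-like nested dicts only (dbconfig's list levels are omitted for brevity); Delta, DeltaKind, and diff_leaves are illustrative names, not the actual implementation:

```python
# Sketch of a leaf-focused ("unrolled") structured diff; assumes
# JSON-like nested dicts, with illustrative names only.
from dataclasses import dataclass
from enum import Enum
from typing import Any

class DeltaKind(Enum):
    ADD = "add"
    REMOVE = "remove"
    CHANGE = "change"

@dataclass(frozen=True)
class Delta:
    kind: DeltaKind
    path: tuple  # full path down to the leaf, e.g. (..., "vslow", "db1206")
    old: Any
    new: Any

def diff_leaves(old, new, path=()):
    """Yield one Delta per differing leaf, recursing into nested dicts."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            if key not in new:
                yield from _leaves(old[key], path + (key,), DeltaKind.REMOVE)
            elif key not in old:
                yield from _leaves(new[key], path + (key,), DeltaKind.ADD)
            else:
                yield from diff_leaves(old[key], new[key], path + (key,))
    elif old != new:
        yield Delta(DeltaKind.CHANGE, path, old, new)

def _leaves(value, path, kind):
    """Unroll an added/removed subtree into per-leaf deltas."""
    if isinstance(value, dict):
        for key, sub in value.items():
            yield from _leaves(sub, path + (key,), kind)
    else:
        old = value if kind is DeltaKind.REMOVE else None
        new = value if kind is DeltaKind.ADD else None
        yield Delta(kind, path, old, new)
```

With this shape, removing the last member of a group reports db1206 with value 100 disappearing from eqiad/groupLoadsBySection/s1/vslow, rather than the whole "vslow" subtree.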

Revisiting this afternoon:

I think it would be fairly straightforward to provide a "dictdiffer-inspired" [0] structured diff in DbConfig without taking a dependency on the package itself (a python3-dictdiffer Debian package does exist, but only in testing and unstable as of now).

FYI we have it packaged in our APT already (used by other projects):

$ sudo -E reprepro ls python3-dictdiffer
python3-dictdiffer | 0.9.0-3~wmf11u1 | bullseye-wikimedia | amd64, i386
python3-dictdiffer | 0.9.0-3~wmf12u1 | bookworm-wikimedia | amd64, i386

[0] The key difference would be the fact that, for our purposes, it's likely sufficient (and preferable) to describe diffs in terms of individual leaf deltas rather than sub-tree deltas (e.g., the API is more useful if it tells us db1206 with value 100 disappeared from eqiad/groupLoadsBySection/s1/vslow, rather than "vslow": {"db1206": 100} did from eqiad/groupLoadsBySection/s1 as dictdiffer would).

I think we might need both, because on one side we want to ensure we're only modifying the data of the DB in question, but we might also want to alert/warn/prevent the change in case some group is removed, because it would mean that those queries would go to all DBs in that section, IIRC.

I just ran into this bug again while using the cookbook; adding the output in case it helps, but it is the same as originally posted:

The current diff has 2 spurious changes, aborting:
['--- eqiad/sectionLoads/s5 live\n',
 '+++ eqiad/sectionLoads/s5 generated\n',
 '@@ -7,6 +7,7 @@\n',
 '         "db1161": 200,\n',
 '         "db1185": 500,\n',
 '         "db1200": 200,\n',
 '-        "db1210": 500\n',
 '+        "db1210": 500,\n',
 '+        "db1230": 30\n',
 '     }\n',
 ' ]\n']
Exception raised while executing cookbook sre.mysql.pool:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/mysql/pool.py", line 134, in run
    self.gradual_pooling()
  File "/srv/deployment/spicerack/cookbooks/sre/mysql/pool.py", line 168, in gradual_pooling
    self.commit_change(message)
  File "/srv/deployment/spicerack/cookbooks/sre/mysql/pool.py", line 179, in commit_change
    self.check_diff(diff)
  File "/srv/deployment/spicerack/cookbooks/sre/mysql/pool.py", line 223, in check_diff
    raise RuntimeError("Unable to proceed due to spurious changes in the diff")
RuntimeError: Unable to proceed due to spurious changes in the diff
Released lock for key /spicerack/locks/cookbooks/sre.mysql.pool:db1230: {'concurrency': 1, 'created': '2025-01-29 07:40:44.889061', 'owner': 'root@cumin1002 [3016488]', 'ttl': 7200}
END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1230 gradually with 4 steps - Repooling after rebuild index T384994

Icinga downtime and Alertmanager silence (ID=c6f72334-4f9d-4ebc-91e8-ac9d789522f9) set by fceratto@cumin1002 for 4:00:00 on 1 host(s) and their services with reason: Repooling after clone - T383760

db1169.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-02-04T08:57:21Z] <fceratto@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Repooling after clone - T383760

@Marostegui FYI, as far as I know @Scott_French already has a working draft patch.

Ah, I wasn't aware of that @Volans, as I don't see it attached to the task. @FCeratto-WMF and @Scott_French, can you two coordinate, just to avoid stepping on each other's toes? Thanks!

Apologies for complicating this by not updating the task!

I put together a sketch of what the two different APIs might look like (i.e., a dictdiffer-like "true" structural diff vs. a diff that only enumerates affected leaf entities) last week, and chatted briefly with @Volans out of band.

I can post those patches later today so we can weigh which one will be more useful for constructing meaningful safety checks.

@Volans @Marostegui - Did you also want to track implementation of the safety checks here? If so, that might be a natural way to break up the work between @FCeratto-WMF and me.

Alright, as promised, I've pushed two topic branches that contain demos of the two structured-diff APIs we've talked about here.

The (tip) commits of interest here are:

The commit messages contain a description of what these two APIs do and how they differ from each other (i.e., what information they can represent, and what implications it has for users implementing safety checks).

Both commits add exactly the same unit test cases, in an effort to demonstrate how the two APIs contrast with each other.

For the subtree-focused version, I opted not to pull in a dependency on dictdiffer. This is mainly to present a more usable (e.g., structured types) and consistent (e.g., tree paths are always lists) API without having to bolt one on. Also, it's less than 20 lines of code to achieve this in a manner suitable for our purposes.

Anyway, if folks are interested in taking a quick look - particularly at the unit tests to demonstrate what the diff representations look like - that would be greatly appreciated.

Thanks a lot for the effort @Scott_French, and for the symmetry of the tests, which makes it easier to compare the two!
I understand why you did it this way; in my original idea I would have loved the deltas to carry some more awareness, with properties like host, group, etc., to make them easier to parse.

Between the two, the current leaves proposal seems the simpler one to parse, as the host is always the last element of the path, so it will be very quick to write a check that ensures it's the only host affected in a diff.

I agree that it's not the diff's job to ensure correctness of the configuration; if we need to ensure that a section always has at least 1 host in a group, that should be enforced somewhere else, like we currently enforce the minimum number of replicas.

The subtrees proposal would be more useful to humans, but for that we already have the line diffs, which work fine.
Hence I would vote for the leaves one. Implementation details can be discussed on the patch once there is agreement on the direction.

Many thanks for weighing in on the two API options, @Volans.

Agreed that the diff_leaves approach is probably the more straightforward to work with for the particular types of safety checks we're interested in here - i.e., as a replacement for the current logic to check the text-based diffs.

Concretely, picking an arbitrary host from the live dbconfig in eqiad, suppose we were to depool instance db1232 from section s1 in eqiad. This should produce the following dbconfig leaf deltas:

[
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "sectionLoads", "s1", 1, "db1232"],
        old=200,
        new=None,
    ),
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "groupLoadsBySection", "s1", "api", "db1232"],
        old=100,
        new=None,
    ),
]

The rough equivalent of the existing PoolDepoolRunner.check_diff logic - which asserts that the only affected entity is self.args.instance - would be to simply loop over Delta instances and confirm that path[-1] == self.args.instance.

This could of course become nuanced in various ways if desired. For example, for a depool operation, you could always assert that kind == DeltaKind.REMOVE and the path touches only the expected properties (e.g., sectionLoads or groupLoadsBySection for a core section, or externalLoads for other section types). You could do similarly for a slow-pool operation, where kind would always be one of ADD or CHANGE.
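That per-operation check could be sketched like this (Delta/DeltaKind mirror the illustrative examples in this thread, and check_depool_deltas is a hypothetical name, not the actual PoolDepoolRunner code):

```python
# Illustrative sketch of the per-operation invariant; names mirror the
# examples in this thread, not the actual cookbook code.
from dataclasses import dataclass
from enum import Enum
from typing import Any

class DeltaKind(Enum):
    ADD = "add"
    REMOVE = "remove"
    CHANGE = "change"

@dataclass(frozen=True)
class Delta:
    kind: DeltaKind
    path: tuple
    old: Any
    new: Any

# Properties a (de)pool is expected to touch (path[1], after the DC).
ALLOWED = {"sectionLoads", "groupLoadsBySection", "externalLoads"}

def check_depool_deltas(deltas, instance):
    """True iff every delta removes only `instance` from an expected
    load property; anything else counts as a spurious change."""
    return all(
        d.path[-1] == instance          # only the expected entity differs
        and d.kind is DeltaKind.REMOVE  # a depool only removes entries
        and d.path[1] in ALLOWED        # and only in load properties
        for d in deltas
    )
```

A pool-step variant would instead accept DeltaKind.ADD or DeltaKind.CHANGE, per the text above.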

If we were to go with the subtree diff, it would produce the exact same results for db1232. Where things differ would be cases like db1206, where you would end up with something like:

[
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "sectionLoads", "s1", 1, "db1206"],
        old=1,
        new=None,
    ),
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "groupLoadsBySection", "s1", "dump"],
        old={"db1206": 100},
        new=None,
    ),
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "groupLoadsBySection", "s1", "vslow"],
        old={"db1206": 100},
        new=None,
    ),
]

i.e., one needs to recurse into the old value to confirm it's what's expected. That's fairly simple in this case, but would get more involved for db1166, which would remove the DEFAULT subtree from groupLoadsBySection:

[
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "sectionLoads", "DEFAULT", 1, "db1166"],
        old=300,
        new=None,
    ),
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "groupLoadsBySection", "DEFAULT"],
        old={"dump": {"db1166": 100}, "vslow": {"db1166": 100}},
        new=None,
    ),
]

In contrast, the leaf-focused approach would again simply "unroll" this to:

[
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "sectionLoads", "DEFAULT", 1, "db1166"],
        old=300,
        new=None,
    ),
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "groupLoadsBySection", "DEFAULT", "dump", "db1166"],
        old=100,
        new=None,
    ),
    Delta(
        kind=DeltaKind.REMOVE,
        path=["eqiad", "groupLoadsBySection", "DEFAULT", "vslow", "db1166"],
        old=100,
        new=None,
    ),
]

... which is far simpler to process for this particular kind of safety check.

Anyway, I'm happy to post a MR for the diff_leaves option if this seems like the desirable one. @FCeratto-WMF - it would be great to get your thoughts on this before I do, since you'll be implementing the safety check.

Also, if you have any preference on whether to retain the ActionResult return value pattern for surfacing errors (even though this is not a CLI action), that would be great as well.

Change #1122099 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/cookbooks@master] pool.py: Add basic typing to allow mypy checks

https://gerrit.wikimedia.org/r/1122099

Change #1122099 merged by Federico Ceratto:

[operations/cookbooks@master] pool.py: Add basic typing to allow mypy checks

https://gerrit.wikimedia.org/r/1122099

@FCeratto-WMF, @Ladsgroup, @Marostegui, @KOfori: has Data Persistence reached a decision between the two offered options?
I've expressed my preference earlier, but either solution works fine.

Given that @Scott_French has already provided the implementation of both solutions, and also described how the dbctl diff check in the (de)pool cookbook should be modified to take advantage of this new dbctl API, I think making progress on this would be an easy and quick win.
Once done, this will allow the (de)pool cookbook to be used for all (de)pooling operations, both manually and programmatically, called from the other existing scripts that currently (de)pool databases.
This will cover the large majority of dbctl writes and, in addition to the existing checks to avoid conflicts, can easily be further protected with a custom exclusive lock shared between the pooling and depooling operations.

We discussed the two options in various meetings within the team. At this time we don't have a decision but we will get back to this task when ready.

Hi, I wrote in T387209#10674140 but I'll repeat myself here: the diffing is not needed at all. IMHO we should just decline this task and remove check_diff and the other diffing altogether. Simply wait for dbctl diff to be empty, then immediately make the change and commit it. Why:

  • We have been using this method in much of our automation for many years and nothing has broken. The chance of overlap is practically zero.
  • This adds a lot of complexity (as seen by the discussion here), while the benefit is so rare that it might matter once a year.
  • Even if it happens, it's not great but it's not the end of the world: something that was probably made by another automation gets committed at the same time. It's well within our error budget.

TLDR: YAGNI.

@Ladsgroup @FCeratto-WMF - I understand your point and you have way more experience than me in the area so I cannot provide any "authoritative" technical reply, but from an external point of view I see two main things:

  1. We are using the argument that "it never happened so it is fine" / "the chances are small", which is definitely not future-proof: it may hold now but not in the future, especially if we want to move to a safer, more automated set of procedures (and more code is added).
  2. The added complexity is not a lot IMHO: the structured diff is basically already implemented and the path forward is clear; it is just a matter of telling Riccardo and Scott "we prefer X or Y" to make the work happen. There will be a small safety win for various cookbooks, and it will not add any extra complexity to the DBAs' life, at least as far as I can see.

I don't think that we should refuse automation improvements in this way, but in the end this is your team's decision so we'll respect that. Please take a moment to reconsider before rejecting the task; it feels to me that we are dropping good improvements that are already available and could be quick wins because of communication misunderstandings.

@Ladsgroup @FCeratto-WMF - I understand your point and you have way more experience than me in the area so I cannot provide any "authoritative" technical reply, but from an external point of view I see two main things:

  1. We are using the argument that "it never happened so it is fine" / "the chances are small", which is definitely not future-proof: it may hold now but not in the future, especially if we want to move to a safer, more automated set of procedures (and more code is added).

My point is that we already have pretty decent protection. It checks that the diff is empty before moving forward, and that takes care of everything except changes that happen within a fraction of a second. The chances of that are abysmally low. It also doesn't really address the problem, since you still have the window between diffing and committing, which is roughly the same time span. So you need locking, etc.

On top of that, if we reach a point of having so many repools and depools at the same time (which is also not realistic given the number of hosts we have), there are way bigger bottlenecks, such as SAL being overwhelmed, the Phabricator phaste db, etc.

  2. The added complexity is not a lot IMHO: the structured diff is basically already implemented and the path forward is clear; it is just a matter of telling Riccardo and Scott "we prefer X or Y" to make the work happen. There will be a small safety win for various cookbooks, and it will not add any extra complexity to the DBAs' life, at least as far as I can see.

Let me give you an example: a couple of weeks ago, the diffing started to have a bug where it couldn't understand the output, and because of that many cookbooks stopped working. What happened? We basically had to stop using many cookbooks that do depooling, a couple of maintenance tasks got postponed, and we had to revert patches on slow repool, falling back to a bash script to be able to continue. So it is a lot of complexity and a failure point, and it has already bitten us and wasted non-negligible DBA time.

I don't think that we should refuse automation improvements in this way, but in the end this is your team's decision so we'll respect that.

My point is that I don't think it's an improvement in automation and in the long term it will harm us due to extra complexity.

@Ladsgroup @FCeratto-WMF - I understand your point and you have way more experience than me in the area so I cannot provide any "authoritative" technical reply, but from an external point of view I see two main things:

  1. We are using the argument that "it never happened so it is fine" / "the chances are small", which is definitely not future-proof: it may hold now but not in the future, especially if we want to move to a safer, more automated set of procedures (and more code is added).

My point is that we already have pretty decent protection. It checks that the diff is empty before moving forward, and that takes care of everything except changes that happen within a fraction of a second. The chances of that are abysmally low. It also doesn't really address the problem, since you still have the window between diffing and committing, which is roughly the same time span. So you need locking, etc.

Totally understand, but IIUC from Riccardo and Scott the point here is to start a process that will eventually lead to an ecosystem of cookbooks properly using locking, etc. The idea is to be able to build more complex workflows in the future, starting from a proper, shared locking mechanism.

On top of that, if we reach a point of having so many repools and depools at the same time (which is also not realistic given the number of hosts we have), there are way bigger bottlenecks, such as SAL being overwhelmed, the Phabricator phaste db, etc.

I don't think that this is the goal, but we shouldn't limit our automation goals based on what we think other, unrelated tools may or may not support.

  2. The added complexity is not a lot IMHO: the structured diff is basically already implemented and the path forward is clear; it is just a matter of telling Riccardo and Scott "we prefer X or Y" to make the work happen. There will be a small safety win for various cookbooks, and it will not add any extra complexity to the DBAs' life, at least as far as I can see.

Let me give you an example: a couple of weeks ago, the diffing started to have a bug where it couldn't understand the output, and because of that many cookbooks stopped working. What happened? We basically had to stop using many cookbooks that do depooling, a couple of maintenance tasks got postponed, and we had to revert patches on slow repool, falling back to a bash script to be able to continue. So it is a lot of complexity and a failure point, and it has already bitten us and wasted non-negligible DBA time.

I cannot support this point, really sorry. While I understand that bugs in automation may cause some time "wasted" improving the codebase, and that it is definitely frustrating to change plans, it is a process that happens with any software (even battle-tested bash scripts) and eventually leads to less tech debt and TOIL.
This task was discussed in Atlanta by several people, and Riccardo spent months in Data Persistence drawing a picture of how DBA-related automation could improve in the future to get the best of what our tools currently offer (adopting more standard/shared procedures at the same time). Dismissing all this work for these reasons feels like Data Persistence prefers to keep diverging from SRE automation and to keep building its own automation stack. If this is the case, let's just make sure that everybody is on board, to avoid misunderstandings and wasted time. We know each other and I am pretty sure this is not the case, but I am telling you how it looks from the other side :)

I don't think that we should refuse automation improvements in this way, but in the end this is your team's decision so we'll respect that.

My point is that I don't think it's an improvement in automation and in the long term it will harm us due to extra complexity.

I explained above why I don't think we are adding any complexity (quite the opposite), but if DP as a whole thinks the same way, let's make sure to be on the same page with the rest of SRE :)

To be honest, I am completely lost on where and when we drifted from the two options Scott offered at T383760#10523861. At some point the idea was to fix this bug, and at some point we ended up creating many different tasks, including a whole new locking mechanism: T387209, T389947.
At this point, I am not sure what we are even discussing and what the main blocker is.

To me, this is the main comment: https://phabricator.wikimedia.org/T383760#10650432 - what is preventing us from adopting either of these two solutions that Scott kindly wrote for us and unblocking the rest, @FCeratto-WMF?

We discussed next steps on IRC in #wikimedia-data-persistence minutes ago and agreed that for the time being we are ok with removing check_diff to unblock depooling in the cookbook.

Thanks for the follow-up, all!

To emphasize, the motivation for the structured-diff feature is to enable making the pre-commit check in PoolDepoolRunner simpler and more predictable.

My understanding is that the issue today involving es2040 is "typical" of that seen with the current check_diff - i.e., entirely the result of parsing text diffs of json-formatted content, a failure mode that should disappear with a suitable check of the structured diff. For example, if using the diff-of-leaves version of the feature, the result for that particular delta would have been:

[
    Delta(
        kind=DeltaKind.ADD,
        path=["codfw", "externalLoads", "es7", 1, "es2040"],
        old=None,
        new=6,
    ),
]

... where it's straightforward to validate that es2040 is the only affected entity.

In any case, I entirely understand if making this happen is not the highest priority among various ongoing automation improvements, but I'm around and happy to help if that changes :)

Also, especially when higher-level locking is introduced that guards the entire [mutate sections / instances -> commit] cycle, entirely agreed that the pre-flight diff check (i.e., check for any diff at all) is more "load-bearing" than the pre-commit diff check. The latter is more a defense-in-depth for the ways in which locks will be fallible, such as if tradeoffs are made for usability (e.g., TTLs) or more generally in the absence of fencing tokens.

Change #1153671 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/cookbooks@master] sre.mysql.pool: Remove diff check functionality

https://gerrit.wikimedia.org/r/1153671

We discussed next steps on IRC in #wikimedia-data-persistence minutes ago and agreed that for the time being we are ok with removing check_diff to unblock depooling in the cookbook.

I think we can look into adding the diff check back in Q2 of next FY, given that Q1 will be quite hectic and I'm actively offloading some work during that time. If someone else on the team gets to it sooner, I'd of course appreciate it.

Change #1153989 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/cookbooks@master] pool.py: bugfix: remove diff check

https://gerrit.wikimedia.org/r/1153989

Change #1153989 abandoned by Federico Ceratto:

[operations/cookbooks@master] pool.py: bugfix: remove diff check

Reason:

Implemented in a different CR

https://gerrit.wikimedia.org/r/1153989

Change #1153671 merged by jenkins-bot:

[operations/cookbooks@master] sre.mysql.pool: Remove diff check functionality

https://gerrit.wikimedia.org/r/1153671

Also, especially when higher-level locking is introduced that guards the entire [mutate sections / instances -> commit] cycle, entirely agreed that the pre-flight diff check (i.e., check for any diff at all) is more "load-bearing" than the pre-commit diff check. The latter is more a defense-in-depth for the ways in which locks will be fallible, such as if tradeoffs are made for usability (e.g., TTLs) or more generally in the absence of fencing tokens.

If I understand your comment correctly you are describing a cycle that goes:

  1. Acquire high level lock with a TTL
  2. Pre-flight diff check (check for any diff at all) (Wait for a little bit if necessary)
  3. Make changes
  4. Perform pre-commit structured/semantic diff (See Scott's example with es2040 https://phabricator.wikimedia.org/T383760#10882369 )
  5. Release the lock

...and, I'd imagine, if any of the steps fails or times out an error is logged and the lock is released?

If people are happy with this solution we can think of a roadmap to integrate the various changes.
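The five steps above could be wired up roughly as follows. All helper hooks here (get_diff, make_changes, structured_deltas, commit) are hypothetical stand-ins, not real dbctl/spicerack APIs:

```python
# Sketch of the proposed cycle; every hook below is a hypothetical
# stand-in, not a real dbctl/spicerack API.
import logging
from contextlib import contextmanager

logger = logging.getLogger("pool-cycle")

@contextmanager
def exclusive_lock(key: str, ttl: int):
    """Step 1: acquire the shared (de)pool lock with a TTL.  The
    finally clause releases it whether the body succeeds or raises."""
    logger.info("acquired %s (ttl=%ss)", key, ttl)
    try:
        yield
    finally:
        logger.info("released %s", key)  # step 5, also on error/timeout

def run_cycle(instance, get_diff, make_changes, structured_deltas, commit):
    with exclusive_lock(f"dbctl/pool/{instance}", ttl=3600):
        # Step 2: pre-flight check - refuse to start on a dirty config.
        if get_diff():
            raise RuntimeError("uncommitted dbctl changes, aborting")
        # Step 3: apply the (de)pool changes.
        make_changes(instance)
        # Step 4: pre-commit structured diff check - only `instance`
        # may be affected (cf. the es2040 example above).
        deltas = structured_deltas()
        if any(d["path"][-1] != instance for d in deltas):
            raise RuntimeError("spurious changes in the diff, aborting")
        commit(f"(de)pool {instance}")
```

Any failure raises out of the `with` block, so the lock release in the finally clause covers both the success and the error paths, matching the "error is logged and the lock is released" expectation above.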