Page MenuHomePhabricator

Make dbctl check for depooled future masters
Closed, ResolvedPublic

Description

When we operate sX sections, if a host that is being promoted to master via sudo dbctl --scope XXX section sX set-master is depooled, dbctl will error with:

Execution FAILED
Reported errors:
Section sX has no master

This is good as otherwise you'll end up with a section with no master. The future new master always needs to be pooled (even with weight 0).

When testing dbctl + parsercache this wasn't the case, and I was allowed to promote a depooled replica to master, which means the section ended up with no master. How to reproduce (this is safe to run as of today as right now the main config for parsercache is still MW):

root@cumin1002:~# sudo dbctl --scope codfw section pc4 set-master pc2015
WARNING: 'pc2015' is not a candidate master for section 'pc4'

I ignore the warning, as it is just a warning and commit:

root@cumin1002:~# dbctl config commit -m "Test pc4 master switch"
codfw/externalLoads/pc4 live                                       codfw/externalLoads/pc4 generated
[                                                                  {}
    {
        "pc2016": 1
    },
    {}
]

Enter y or yes to confirm: y
Previous configuration saved. To restore it run: dbctl config restore /var/cache/conftool/dbconfig/20240516-103039-marostegui.json
WARNING:conftool.announce:dbctl commit (dc=all): 'Test pc4 master switch', diff saved to https://phabricator.wikimedia.org/P62494 and previous config saved to /var/cache/conftool/dbconfig/20240516-103039-marostegui.json

The section ends empty.

pc2015 is a replica on pc4 but it is depooled:

root@cumin1002:~# dbctl instance pc2015 get
{
    "pc2015": {
        "host_ip": "10.192.32.132",
        "note": "",
        "port": 3306,
        "sections": {
            "pc4": {
                "percentage": 100,
                "pooled": false,
                "weight": 1
            }
        }
    },
    "tags": "datacenter=codfw"
}

Event Timeline

We do have a consistency check at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/master/conftool/extensions/dbconfig/config.py#569 but is used only for sectionLoads, I think it's generic enough to be extended to externalLoads too.

From a quick glance the two places where we need to add it are;

In general we could revisit how section validity is managed and try to factor out the things that are in common.

Interesting! Agreed, yeah - having _check_section applied to external-flavored sections as well seems desirable.

More broadly, I can't see anything in check_config [0] (i.e., "master matches intent" and "replica count is allowed") that wouldn't also sensibly apply to external-flavored sections (setting aside the DEFAULT special case). I might be missing something, though.

Happy to give this a go if hands are needed :)

[0] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/8e5efcc919869483751627f46b92787db6873816/conftool/extensions/dbconfig/config.py#183

Change #1032849 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/conftool@master] dbctl: extend dbconfig checks to external sections

https://gerrit.wikimedia.org/r/1032849

CDanis triaged this task as Medium priority.May 20 2024, 2:32 PM

Change #1034163 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/conftool@master] dbctl: break up test_check_config test case

https://gerrit.wikimedia.org/r/1034163

Scott_French changed the task status from Open to In Progress.May 21 2024, 5:06 PM

Thanks, all, for the reviews. Before merging, I wanted to check one last time that all external-flavored sections have a self-consistent config that will pass validation. It looks like x2 in eqiad has:

{
  "master": "db1152",
  "min_replicas": 1,
  "readonly": false,
  "ro_reason": "test",
  "flavor": "external",
  "omit_replicas_in_mwconfig": true
}

i.e., both min_replicas: 1 and omit_replicas_in_mwconfig: true which will produce a dbconfig that (soon) will not validate.

@Marostegui - Could you take a look and (unless there's some reason to reconsider this specific check) set min_replicas to 0 like in codfw?

Thanks, all, for the reviews. Before merging, I wanted to check one last time that all external-flavored sections have a self-consistent config that will pass validation. It looks like x2 in eqiad has:

{
  "master": "db1152",
  "min_replicas": 1,
  "readonly": false,
  "ro_reason": "test",
  "flavor": "external",
  "omit_replicas_in_mwconfig": true
}

i.e., both min_replicas: 1 and omit_replicas_in_mwconfig: true which will produce a dbconfig that (soon) will not validate.

@Marostegui - Could you take a look and (unless there's some reason to reconsider this specific check) set min_replicas to 0 like in codfw?

Changed - I guess we set that up just because when x2 was created. There were many "ifs" with x2...

Great, thank you for taking a look. I'll merge my changes and follow up here once they're deployed (a process I need to figure out).

Change #1032849 merged by jenkins-bot:

[operations/software/conftool@master] dbctl: extend dbconfig checks to external sections

https://gerrit.wikimedia.org/r/1032849

Change #1034163 merged by jenkins-bot:

[operations/software/conftool@master] dbctl: break up test_check_config test case

https://gerrit.wikimedia.org/r/1034163

Change #1035596 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/software/conftool@master] Release 3.0.0

https://gerrit.wikimedia.org/r/1035596

Change #1035596 merged by jenkins-bot:

[operations/software/conftool@master] Release 3.0.0

https://gerrit.wikimedia.org/r/1035596

Packages for buster, bullseye, and bookworm are ready, and have been copied over to apt1002.

I've not imported them into the repository yet, and plan to do so shortly before deploying. The latter won't happen until early next week.

Alas, this is unlikely to happen this week within scheduling constraints. Now aiming for Tuesday the 18th (next week) starting at 14:00 UTC. I'll advertise this in IRC on Monday and again Tuesday.

While the only significant functional changes in this release are for dbctl and requestctl (installed to a small number of hosts), I of course still need to update all hosts that have conftool installed.

Interestingly, debmonitor reports 605 hosts with python3-conftool installed, while a cumin query P:conftool::client (a superset of :dbctl_client and :requestctl_client) returns only 579.

The differences are:

  • All aqs hosts (role::aqs) except aqs1013 (23 hosts). This appears to be the result of losing P:lvs::realserver [0] and then aqs1013 being reimaged a few weeks later.
  • Both cloudcumin hosts (role::cluster::cloud_management), which have P:spicerack (2 hosts). That in turn installs the spicerack package, which of course pulls in python3-conftool.
  • wdqs2023, which appears to have been temporarily moved to a different role without a reimage [1].

Anyway, I'll want to make sure to catch these hosts as well (note to self).

[0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014100
[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1027001

Mentioned in SAL (#wikimedia-operations) [2024-06-18T14:09:33Z] <swfrench-wmf> included conftool 3.0.0 into buster-wikimedia on apt.w.o for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T15:47:38Z] <swfrench-wmf> included conftool 3.0.0 into buster/bullseye/bookworm-wikimedia on apt.w.o for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T16:23:15Z] <swfrench-wmf> depooled / pooled mw2441.codfw.wmnet to smoke-test python3-conftool for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T16:29:07Z] <swfrench-wmf> conftool on cumin2002 updated to 3.0.0 for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T16:39:29Z] <swfrench-wmf> validated dbctl 3.0.0 on cumin2002 (noop edit to note: on parsercache spare pc2014) for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T16:42:34Z] <swfrench-wmf> conftool on puppetmaster2001 updated to 3.0.0 for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:12:09Z] <swfrench-wmf> updated conftool to 3.0.0 on remaining buster hosts in codfw for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:14:43Z] <swfrench-wmf> updated conftool to 3.0.0 on remaining bookworm hosts in codfw for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:16:23Z] <swfrench-wmf> updated conftool to 3.0.0 on remaining bullseye hosts in codfw for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:34:06Z] <swfrench-wmf> updated conftool to 3.0.0 on buster hosts in eqiad for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:35:10Z] <swfrench-wmf> updated conftool to 3.0.0 on bookworm hosts in eqiad for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:37:28Z] <swfrench-wmf> updated conftool to 3.0.0 on bullseye hosts in eqiad for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T18:17:05Z] <swfrench-wmf> updated conftool to 3.0.0 on hosts (cp,ncredir) in ulsfo for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T18:27:03Z] <swfrench-wmf> updated conftool to 3.0.0 on hosts (cp,ncredir) in magru for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T18:33:12Z] <swfrench-wmf> updated conftool to 3.0.0 on hosts (cp,ncredir) in drmrs for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T18:38:59Z] <swfrench-wmf> updated conftool to 3.0.0 on hosts (cp,ncredir) in eqsin for T365123

Mentioned in SAL (#wikimedia-operations) [2024-06-18T18:44:42Z] <swfrench-wmf> updated conftool to 3.0.0 on hosts (cp,ncredir) in esams for T365123

Well, that was a blast!

Many thanks to @Volans for riding along to talk out some of the issues.

Some notes for follow-up:

Conflicting conftool_3.0.0.orig.tar.gz content:

The procedure I used to build the packages (modeled on [0]) resulted in the orig source archive picking up divergent views of debian/changelog across the three distribution-specific builds (i.e., reflecting my rebuild edits).

My expectation was that by setting --git-upstream-tag=3.0.0, this would be stable across all three builds, but this didn't happen. After including the first of three builds (buster) via reprepro, the other two failed due to the inconsistency in conftool_3.0.0.orig.tar.gz.

I was able to rebuild the packages in a way that achieves the expected output by splitting off my changelog updates for rebuilds to a separate branch (and using --git-debian-branch to be able to invoke the build from there). Indeed, now that I look at [1] as see:

[DEFAULT]
upstream-tree=BRANCH
upstream-branch=master

So yeah, I think I see why my approach - which assumes upstream-tree is TAG(default) and commits on master - would not work, and why using a separate branch for rebuild commits would.

In any case, we then removed the errant 3.0.0 packages from wikimedia-buster and included all again.

Long story short, I think we should document and / or automate (e.g., with a script) the build procedure.

Also, in chatting with Moritz, he had the neat idea to potentially introduce a conftool-next component to pilot changes in a way where only a subset of hosts are exposed to them (i.e., making the include -> oops -> remove -> re-add cycle here a bit less harrowing).

Task: T367921

New ERROR logging:

In https://gerrit.wikimedia.org/r/992104, the default log level for YAML loading errors was increased from debug to error to address T355256.

However, given the way conftool enumerates possibly-non-existent config files during initialization, and this is now rather noisy. We should either (1) not attempt to config load files that do not exist or (2) limit the scope of ERROR logging to only the invalid-YAML case (original intent).

Task: T367919

[0] https://wikitech.wikimedia.org/wiki/Httpbb#Deploying_httpbb

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/master/debian/gbp.conf

Forgot to mention: there's still one straggler on 2.3.3 per debmonitor: elastic2099 which down (T367598).

Scott_French added a subscriber: Joe.

Alright, this should all be complete. I'll follow up in T367921 about the build / release instructions, and keep an eye on T367598 for the return of elastic2099. Many thanks to @Joe for already having taken care of T367919.