Page MenuHomePhabricator

dbconfig bug - "2 instances found for query ..."
Open, HighPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):
From cumin:

$ sudo dbctl instance db2103 set-weight 0
Traceback (most recent call last):
  File "/usr/bin/dbctl", line 33, in <module>
    sys.exit(load_entry_point('conftool==2.3.2', 'console_scripts', 'dbctl')())
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/__init__.py", line 208, in main
    return cli.run_action()
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/cli.py", line 31, in run_action
    result = getattr(self, "_run_on_{}".format(self.args.object_name))()
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/cli.py", line 76, in _run_on_instance
    *self.instance.weight(name, self.args.weight, self.args.section, self.args.group)
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/entities.py", line 288, in weight
    return self.write_callback(set_weight, (instance,), section=section, group=group)
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/entities.py", line 162, in write_callback
    obj = self.get(*id)
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/entities.py", line 90, in get
    raise ValueError(
ValueError: 2 instances found for query 'db2103' and scope 'None', expected 1.

What happens?:
ValueError: 2 instances found for query 'db2103' and scope 'None', expected 1.

What should have happened instead?:
db2103 weight edit to set it to 0

Software version (skip for WMF-hosted wikis like Wikipedia):

$ dbctl --version
dbctl 2.3.2

Other information (browser name/version, screenshots, etc.):

MariaDB [zarcillo]> select * from servers where hostname like 'db2103%';
+--------------------+----------+-------+------+------+------+-----------+
| fqdn               | hostname | dc    | rack | ipv4 | ipv6 | last_boot |
+--------------------+----------+-------+------+------+------+-----------+
| db2103.codfw.wmnet | db2103   | codfw | A3   | NULL | NULL | NULL      |
+--------------------+----------+-------+------+------+------+-----------+

conftool with get/edit works as intended:

ladsgroup@cumin1001:~$ sudo dbctl --scope codfw instance db2103 get
{
    "db2103": {
        "host_ip": "10.192.0.118",
        "note": "",
        "port": 3306,
        "sections": {
            "s1": {
                "candidate_master": true,
                "percentage": 100,
                "pooled": true,
                "weight": 300
            }
        }
    },
    "tags": "datacenter=codfw"
}
~ $ ssh db2103.codfw.wmnet  uptime
 06:43:21 up 76 days, 23:41,  0 users,  load average: 3.85, 4.23, 4.39

~ $ ssh db2112.codfw.wmnet  uptime
 06:43:28 up 77 days, 18 min,  0 users,  load average: 1.70, 1.87, 2.05

Raised in: T350142

Event Timeline

I was having a look, I checked etcd and I didn't find two records that could match the name db2103. I also can't repro it, both on cumin1001 and cumin2002:

 cumin1001  0 08:12:23 ~ $ sudo dbctl instance db2103 set-weight 295
 cumin1001  0 08:12:34 ~ $ sudo dbctl instance db2103 get
{
    "db2103": {
        "host_ip": "10.192.0.118",
        "note": "",
        "port": 3306,
        "sections": {
            "s1": {
                "candidate_master": true,
                "percentage": 100,
                "pooled": true,
                "weight": 295
            }
        }
    },
    "tags": "datacenter=codfw"
}

 cumin2002  0 08:13:09 ~ $ sudo dbctl instance db2103 set-weight 300
 cumin2002  0 08:13:31 ~ $ sudo dbctl instance db2103 get
{
    "db2103": {
        "host_ip": "10.192.0.118",
        "note": "",
        "port": 3306,
        "sections": {
            "s1": {
                "candidate_master": true,
                "percentage": 100,
                "pooled": true,
                "weight": 300
            }
        }
    },
    "tags": "datacenter=codfw"
}

Change 971437 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/software/conftool@master] debug: printing results when return object count > 1

https://gerrit.wikimedia.org/r/971437

I've also run while read line; do sudo dbctl instance "${line}" get; sleep 1; done < dblist getting the dblist from etcd and I couldn't repro the error.

do we have some history of etcd content somewhere? It could be possibly some bad config being set and then get expired?

There is no automatic expiration on any key written/edited by dbctl AFAIK.
As for the history content it's a bit complicated but from a quick look at the mirror logs I see that there was a db2103 key in eqiad at some point this morning:

$ sudo grep db2103 /var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log
Oct 12 21:06:43 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2545105
Oct 13 05:58:06 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2545743
Oct 13 06:13:12 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2545746
Oct 13 06:28:19 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2545749
Oct 13 06:43:25 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2545752
Nov  7 06:21:45 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/eqiad/db2103 at index 2625063
Nov  7 06:22:44 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2625070
Nov  7 06:26:49 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2625071
Nov  7 06:26:57 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/eqiad/db2103 at index 2625072
Nov  7 06:31:29 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/eqiad/db2103 at index 2625074
Nov  7 06:32:02 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2625075
Nov  7 07:28:35 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/eqiad/db2103 at index 2625077
Nov  7 08:12:34 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2625078
Nov  7 08:13:31 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2625079
Nov  7 12:39:33 conf2005 etcdmirror-conftool-eqiad-wmnet[270715]: [etcd-mirror] INFO: Replicating key /conftool/v1/dbconfig-instance/codfw/db2103 at index 2625948

@ABran-WMF I took the liberty to check your bash history on cumin1001 and I see a sudo dbctl --scope eqiad instance db2103 edit at 6:21:40 that seems to match the first mention of eqiad in the logs above.
Is it possible you run the edit with eqiad by mistake and that's what created the second object?

I think that in any case running the commands with -s/--scope codfw should have worked, but if this is the issue that we saw then it's easy to repro creating a duplicate of another less-critical DB object and test our theories.

Ok, I think I found the problem, the write_callback(self, callback, id, **args) doesn't pass any datacenter selection when calling obj = self.get(*id), while its signature allows for it: get(self, name, dc=None). (that in turns calls get_all() but the dc is always propagated there).

So I would expect the following methods to not work also when setting the scope because basically it's not propagated: depool, pool, weight, candidate_master, note, while get and edit are not calling the callback and hence are not affected.

[source code available at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/master/conftool/extensions/dbconfig/entities.py ]

I think that the above in conjunction with the above theory if that's confirmed should explain everything.

@ABran-WMF I took the liberty to check your bash history on cumin1001 and I see a sudo dbctl --scope eqiad instance db2103 edit at 6:21:40 that seems to match the first mention of eqiad in the logs above.
Is it possible you run the edit with eqiad by mistake and that's what created the second object?

don't worry it's more than OK. I ran the command a bit late in the process to check there was any "scope" mistake (i.e. if the host had anything referenced on eqiad)

The debugging sequence was:

705  07/11/23 06:23:58 sudo dbctl instance db2103 set-weight 0 && sudo dbctl config commit -m "Set db2103 with weight 0 T350142" && sudo db-switchover --timeout=25 --only-slave-move db2112 db2103
706  07/11/23 06:24:15 sudo dbctl instance db2103 set-weight 0
707  07/11/23 06:25:24 sudo dbctl --scope codfw instance db2103 set-weight 0
708  07/11/23 06:25:40 sudo dbctl --scope codfw instance db2103 set-weight 0
709  07/11/23 06:25:49 vim $(which dbctl)
710  07/11/23 06:26:44 sudo dbctl --scope codfw instance db2103 edit
711  07/11/23 06:26:52 sudo dbctl --scope eqiad instance db2103 edit
712  07/11/23 06:27:17 sudo dbctl --scope eqiad instance db99103 edit
713  07/11/23 06:31:26 sudo dbctl --scope eqiad instance db2103 edit
714  07/11/23 06:32:24 sudo dbctl instance db2103 set-weight 0 --scope codfw
715  07/11/23 06:32:30 sudo dbctl instance db2103 set-weight 0 -s codfw
716  07/11/23 06:32:36 sudo dbctl -s codfw instance db2103 set-weight 0

and I got the error right off the bat:

705  07/11/23 06:23:58 sudo dbctl instance db2103 set-weight 0 && sudo dbctl config commit -m "Set db2103 with weight 0 T350142" && sudo db-switchover --timeout=25 --only-slave-move db2112 db2103

But your history has a little bit later in the file:

#1699338100
sudo dbctl --scope eqiad instance db2103 edit

and that's at 2023-11-07 06:21:40

weird! history did not register it this way, so it could be a way to try and reproduce this error 🤔 should we try? (maybe with a less impactful server 😄)

Indeed, do you have a dbctl host that is already out of production or not used for any reason, or we could pick a replica with low weight in the secondary dc

I'll try on db1136 from T344036 as it's depooled and supposed to be removed in T350458 @Marostegui are you OK with this?

managed to "properly reproduce" the bug with an old server:

$ sudo dbctl --scope codfw instance db1141 edit && sudo dbctl instance db1141 set-weight 0
Traceback (most recent call last):
  File "/usr/bin/dbctl", line 33, in <module>
    sys.exit(load_entry_point('conftool==2.3.2', 'console_scripts', 'dbctl')())
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/__init__.py", line 208, in main
    return cli.run_action()
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/cli.py", line 31, in run_action
    result = getattr(self, "_run_on_{}".format(self.args.object_name))()
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/cli.py", line 76, in _run_on_instance
    *self.instance.weight(name, self.args.weight, self.args.section, self.args.group)
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/entities.py", line 288, in weight
    return self.write_callback(set_weight, (instance,), section=section, group=group)
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/entities.py", line 162, in write_callback
    obj = self.get(*id)
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/entities.py", line 90, in get
    raise ValueError(
ValueError: 2 instances found for query 'db1141' and scope 'None', expected 1.

Creating an object in etcd a bit later could fix this issue afaiu

Great, thanks. Then I think T350656#9312531 should explain everything :)