
Cleanup dbctl possible leftovers from older config
Closed, Resolved (Public)

Description

We recently had an issue where dbctl ended up confused because of a wrong/outdated section (s10). The root cause turned out to be a non-mw host that was accidentally pooled under a section that shouldn't exist; dbctl accepted this instead of returning an error.

It would be great to review dbctl data and make sure there are no older hosts, sections or references to other data that are outdated and could confuse monitoring, pooling/depooling tools, and humans. In particular:

  • Checking if there are leftover sections, such as s10 that shouldn't be on mw config/dbctl
  • Checking if there are references to databases (hostsByName) that are not intended for mediawiki, such as db2135. It is likely there are still decommissioned hosts, or hosts that used to be on mw config but now have misc or other roles.
  • Checking there are no depooled hosts that are idle and forgotten

In the future, it would also be nice to have automatic monitoring, such as T256845 (out of scope for this ticket).

Event Timeline

  • Checking if there are leftover sections, such as s10 that shouldn't be on mw config/dbctl

There is s11; I don't know whether it's needed or not.

  • Checking if there are references to databases (hostsByName) that are not intended for mediawiki, such as db2135. It is likely there are still decommissioned hosts, or hosts that used to be on mw config but now have misc or other roles.

Ran a script that went through the list of instances. These showed up:

db1111
db1127
db1132
db1143
db1144:3314
db2135

Will check them soon.

  • Checking there are no depooled hosts that are idle and forgotten
ladsgroup@cumin1001:~$ sudo dbctl instance all get  | jq 'select(..|.pooled? == false)'
{
  "db1111": {
    "host_ip": "10.64.0.128",
    "port": 3306,
    "sections": {
      "s8": {
        "percentage": 100,
        "pooled": false,
        "weight": 325
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}
{
  "db1127": {
    "host_ip": "10.64.0.97",
    "port": 3306,
    "sections": {
      "s7": {
        "percentage": 100,
        "pooled": false,
        "weight": 400
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}
{
  "db1132": {
    "host_ip": "10.64.16.35",
    "port": 3306,
    "sections": {
      "s1": {
        "groups": {
          "api": {
            "pooled": true,
            "weight": 100
          }
        },
        "percentage": 100,
        "pooled": false,
        "weight": 200
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}
{
  "db1143": {
    "host_ip": "10.64.16.174",
    "port": 3306,
    "sections": {
      "s4": {
        "weight": 400,
        "percentage": 100,
        "pooled": false
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}
{
  "db1144:3314": {
    "host_ip": "10.64.16.175",
    "port": 3314,
    "sections": {
      "s4": {
        "percentage": 100,
        "pooled": false,
        "weight": 150
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}

They are mostly 10.6 ones and depooled (T311106#8114679)
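
For anyone who wants the same check outside of jq, here is a minimal Python sketch. It assumes the `dbctl instance all get` output is a stream of concatenated JSON objects shaped like the ones pasted above (one instance key plus a top-level "tags" string). The function names are made up for illustration and are not part of dbctl. Unlike the recursive jq filter, it only looks at the section-level "pooled" flag, not at the per-group ones.

```python
import json


def iter_json_objects(text):
    """Yield each object from a stream of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def depooled_sections(instance_obj):
    """Return (instance, section) pairs whose section-level 'pooled' is False."""
    found = []
    for name, data in instance_obj.items():
        if not isinstance(data, dict):
            continue  # skip the top-level "tags" string
        for section, cfg in data.get("sections", {}).items():
            if cfg.get("pooled") is False:
                found.append((name, section))
    return found
```

Feeding it the saved command output would look like `for obj in iter_json_objects(text): print(depooled_sections(obj))`, where `text` holds the output shown above.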

Just to confirm: was s10, which we know is no longer needed, removed? It is unclear from your comment (maybe it was done beforehand, or it was never needed, e.g. db2135 was assigned to a non-existent section, and that was enough to "depool" it).

If the latter, I wonder if it would be too much work to go over each defined host (including valid ones) and make sure each is assigned to a valid section.

Whether s11 is needed or not, the cloud team should be able to say. IIRC, it used to be a codfw-only write "testing" section for labswiki, labtestwiki or something like that.

I'm not seeing any host with an invalid section, but there was a host with neither a section nor a name:

{'host_ip': '10.192.32.187', 'port': 3306, 'sections': {}, 'note': ''}
187.32.192.10.in-addr.arpa. 3600 IN	PTR	db2135.codfw.wmnet.
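
A check of this kind (every defined host must sit in at least one valid section) could be sketched roughly as below. This is not the script that was actually run. The function name, the `valid_sections` set and the "db9999" entry are hypothetical and only there for illustration; the db2135 entry is the one found above.

```python
def audit_sections(instances, valid_sections):
    """Map each problematic instance name to a short description of the problem."""
    problems = {}
    for name, data in instances.items():
        sections = data.get("sections", {})
        if not sections:
            problems[name] = "no sections"
            continue
        unknown = sorted(s for s in sections if s not in valid_sections)
        if unknown:
            problems[name] = "unknown sections: " + ", ".join(unknown)
    return problems


# Illustrative input: db2135 as found above, plus a made-up host in s10.
instances = {
    "db2135": {"host_ip": "10.192.32.187", "port": 3306, "sections": {}, "note": ""},
    "db9999": {"sections": {"s10": {"pooled": False}}},  # hypothetical host
    "db1111": {"sections": {"s8": {"pooled": False}}},
}
valid_sections = {"s1", "s4", "s7", "s8"}  # assumed list, not the real one

report = audit_sections(instances, valid_sections)
# report == {'db2135': 'no sections', 'db9999': 'unknown sections: s10'}
```

In practice the set of valid sections would have to come from wherever sections are actually defined, which this sketch does not try to guess.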

Change 821689 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] conftool-data: Remove db2135 from the list

https://gerrit.wikimedia.org/r/821689

I checked and all of the depooled ones are intentional. The only thing not yet fixed is db2135 ^

Change 821689 merged by Ladsgroup:

[operations/puppet@production] conftool-data: Remove db2135 from the list

https://gerrit.wikimedia.org/r/821689

Ladsgroup removed a project: Patch-For-Review.
Ladsgroup moved this task from In progress to Done on the DBA board.

I'm not seeing any host with an invalid section, but there was a host with neither a section nor a name:

{'host_ip': '10.192.32.187', 'port': 3306, 'sections': {}, 'note': ''}

I think db2135 was there from when wikitech was living in m5.
Thanks for fixing it!

And btw, to check all the depooled hosts I have:

root@cumin1001:~#   /home/marostegui/check_depooled.sh --all