
Create or modify an existing tool that quickly shows the db replication status in case of master failure
Closed, Resolved · Public

Description

When we have to do an unscheduled failover, we cannot use switchover, as that assumes the master is up and running.

Creating a failover.py would be quite a hard task, as we would have to account for all the possible ways in which servers can fail - it will likely be handled by orchestrator.

For now, it would be nice to have a simple way to check the replication status (up to what point a host has replicated, using GTID or binlog position).

This was run last time: P15586, but:

  • I don't want to be thinking about how to use undeployed scripts, or about syntax problems, while under pressure
  • I don't want to check which are direct and which are indirect replicas
  • I don't care about non-core hosts (backups, dbstores, labsdbs, ...)

Maybe it could be integrated into db-replication-tree, but it won't work just by adding an additional field there, as it has to work even if the master cannot be connected to and other hosts could be down. Maybe it could use zarcillo or some other cached list (orchestrator?).

Ideally, it would show something like this, clearly:

db1111 master => (unavailable) or "1123456778 / GTID: 12345678"
db1112 candidate => "1123456778" (good, with color?)
db1113 direct replica => "12345677" (bad, not up to date)
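
For reference, a minimal sketch of the per-host check such a tool would need to run, assuming direct MariaDB access from a maintenance host (pymysql, the host list and the credentials here are placeholder assumptions, not the eventual implementation):

import pymysql

# Placeholder list of instances; the real tool should read this from a cached
# inventory (zarcillo, orchestrator, ...) rather than from replication itself.
HOSTS = ["db1111.eqiad.wmnet", "db1112.eqiad.wmnet", "db1113.eqiad.wmnet"]

def replication_position(host):
    """Return (binlog file, position, GTID) as far as this host has executed."""
    conn = pymysql.connect(host=host, user="repl_check", password="********",
                           cursorclass=pymysql.cursors.DictCursor, connect_timeout=2)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone() or {}
            cur.execute("SELECT @@gtid_slave_pos AS gtid")
            gtid = cur.fetchone()["gtid"]
        return (status.get("Relay_Master_Log_File"),
                status.get("Exec_Master_Log_Pos"), gtid)
    finally:
        conn.close()

for host in HOSTS:
    try:
        print(host, replication_position(host))
    except Exception as exc:  # host down, e.g. the failed master
        print(host, "(unavailable)", exc)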

Event Timeline

jcrespo added a subscriber: jbond.

Jbond: we already collect those metrics, what we don't have is a way to show them easily.

Jbond: we already collect those metrics, what we don't have is a way to show them easily.

thanks :)

It could have a format similar to the existing db-replication-tree, but it cannot use replication for discovery as we assume replication will be broken/master unavailable:

Screenshot_20210427_145618.png (1×2 px, 409 KB)

LSobanski triaged this task as Medium priority. May 4 2021, 9:37 AM
LSobanski moved this task from Triage to Refine on the DBA board.

With orchestrator we can sort of do that (note the clouddb1020 failure):

root@dborch1001:~# /usr/bin/orchestrator-client -c which-cluster-master -a s5  < /usr/bin/orchestrator -c topology-tabulated
db1130.eqiad.wmnet:3306           |     0s|ok     |10.4.19-MariaDB-log|rw|STATEMENT|>>,semi:master
+ db1096.eqiad.wmnet:3315         |     0s|ok     |10.4.21-MariaDB-log|ro|ROW      |>>,GTID
+ db1100.eqiad.wmnet:3306         |     0s|ok     |10.4.19-MariaDB-log|ro|STATEMENT|>>,GTID,semi:replica
+ db1110.eqiad.wmnet:3306         |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID,semi:replica
+ db1113.eqiad.wmnet:3315         |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID
+ db1144.eqiad.wmnet:3315         |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID
+ db1150.eqiad.wmnet:3315         |     0s|ok     |10.4.19-MariaDB    |ro|nobinlog |GTID
+ db1161.eqiad.wmnet:3306         |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID,semi:replica
  + db1154.eqiad.wmnet:3315       |     0s|ok     |10.4.21-MariaDB-log|ro|ROW      |>>,GTID
    + clouddb1016.eqiad.wmnet:3315|     0s|ok     |10.4.19-MariaDB    |ro|nobinlog |GTID
    - clouddb1020.eqiad.wmnet:3315|unknown|invalid|10.4.19-MariaDB    |ro|nobinlog |GTID
    + clouddb1021.eqiad.wmnet:3315|     0s|ok     |10.4.19-MariaDB    |ro|nobinlog |GTID
+ db2123.codfw.wmnet:3306         |     0s|ok     |10.4.19-MariaDB-log|ro|STATEMENT|>>,GTID,semi:master
  + db2075.codfw.wmnet:3306       |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID,semi:replica
  + db2089.codfw.wmnet:3315       |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID
  + db2101.codfw.wmnet:3315       |     0s|ok     |10.4.21-MariaDB    |ro|nobinlog |GTID
  + db2111.codfw.wmnet:3306       |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID,semi:replica
  + db2113.codfw.wmnet:3306       |     0s|ok     |10.4.19-MariaDB-log|ro|STATEMENT|>>,GTID,semi:replica
  + db2128.codfw.wmnet:3306       |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID,semi:replica
    + db2094.codfw.wmnet:3315     |     0s|ok     |10.4.21-MariaDB-log|ro|ROW      |>>,GTID
  + db2137.codfw.wmnet:3315       |     0s|ok     |10.4.19-MariaDB-log|ro|ROW      |>>,GTID
+ dbstore1003.eqiad.wmnet:3315    |     0s|ok     |10.4.19-MariaDB    |ro|nobinlog |GTID

There are a few things:

  • It doesn't have colours
  • I haven't tested the failure of an intermediate master, but I would assume it is still listed there (just as invalid/unknown). I would also guess the host stays listed until the expiration time is reached (when orchestrator deletes it; a week at the moment if I recall correctly, but it is configurable) - this can be tested with the test-s4 testing section if needed.
  • I haven't yet found how to display the last executed statement from the CLI, but it is in the GUI (actually, all this info is in the GUI of course)

For backups (though I think DBAs may have an equivalent need) I realized that the main thing I need is to differentiate by "groups": mw "core" servers, backup sources, analytics, test and cloud (and, to some extent, their subgroups, e.g. "special" replicas).

Example needs I have: "give me the list of s1 backup sources", "give me all eqiad backup sources". I also realized I didn't know which were proper core mw servers to quickly replace a failing primary candidate during an incident.

I wonder if the orchestrator API could be used to add some custom data on top, or something?

Orchestrator has tags, which can be useful - I filed this task to investigate the tags some time ago: T266869: Investigate using orchestrator tags for different type of hosts.

root@dborch1001:~# orchestrator-client -c tag -i db1124 --tag name=master
db1124.eqiad.wmnet:3306
root@dborch1001:~# orchestrator-client -c tag -i db1125 --tag name=candidate
db1125.eqiad.wmnet:3306
root@dborch1001:~# /usr/bin/orchestrator-client -c which-cluster-master -a test-s4 < /usr/bin/orchestrator -c topology-tabulated -c topology-tags
db1124.eqiad.wmnet:3306   [0s,ok,10.4.20-MariaDB-log,rw,STATEMENT,>>,semi:master] [name=master]
+ db1125.eqiad.wmnet:3306 [0s,ok,10.4.21-MariaDB-log,ro,ROW,>>,GTID,semi:replica] [name=candidate]
Ladsgroup moved this task from Backlog to In Progress on the SRE-tools board.
Ladsgroup subscribed.

I looked at this a bit. Probably gonna work more on it.

There are three ways I can think of, not mutually exclusive.

  • Running a script on cumin that analyzes the output of orchestrator-client.
    • This requires installing orchestrator-client on cumin. If we go down this path, I suggest splitting off a dedicated host like dbmaint1001 and moving all of the db work there, instead of just adding more stuff to cumin.
    • The other reason I don't like this is that you need to build a tool to parse the free-form output of orchestrator-client.
  • Running JS code that calls the API and analyzes it.
    • I already wrote this. Simply open orchestrator, open inspect element, run P22613 and then type something like report('s4'); to get the report. It sorts the hosts by binlog position and adds information such as whether a host is not accessible, etc.
    • It's a very basic POC; we can improve it to, for example, take tags into account or sort by another metric. An example of the data it consumes is https://orchestrator.wikimedia.org/api/cluster/alias/s3
  • Running a Python script on cumin that calls the orchestrator API and analyzes it (a rough sketch of this idea follows the list).
    • I like this; the only complexity is that you need to authenticate to be able to use the API, which is going to be either ewww or slow or both. @jbond was very kind to provide a POC of a change that would let cumin hosts bypass CAS-SSO: https://gerrit.wikimedia.org/r/c/operations/puppet/+/770981 and I think that'd be nice. I will work on it to make it mergeable.
    • It could be bash instead of Python, but why?
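
Roughly, option 3 could look like the sketch below (untested; field names such as Key, ExecBinlogCoordinates and Binlog_format are my recollection of the orchestrator API and may need adjusting; CAS/authentication handling is deliberately omitted, since that is exactly the open problem):

import sys
import requests

ORCH = "https://orchestrator.wikimedia.org"

def cluster_instances(section):
    # Assumes the caller is already allowed through (the CAS-SSO bypass discussed above).
    resp = requests.get(f"{ORCH}/api/cluster/alias/{section}", timeout=10)
    resp.raise_for_status()
    return resp.json()

def report(section):
    instances = cluster_instances(section)
    # Sort by executed binlog position, freshest first.
    instances.sort(
        key=lambda i: (i["ExecBinlogCoordinates"]["LogFile"],
                       i["ExecBinlogCoordinates"]["LogPos"]),
        reverse=True,
    )
    for i in instances:
        name = f'{i["Key"]["Hostname"]}:{i["Key"]["Port"]}'
        pos = f'{i["ExecBinlogCoordinates"]["LogFile"]}:{i["ExecBinlogCoordinates"]["LogPos"]}'
        note = "" if i.get("IsLastCheckValid", True) else " (unreachable?)"
        print(f'{name}  {i.get("Binlog_format", "?")}  {pos}{note}')

if __name__ == "__main__":
    report(sys.argv[1])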

What do you think?

To avoid getting distracted, I would like to summarize the core request and differentiate it from the several proposed solutions. The core needs of the original request were:

  • Knowing the binlog/GTID position of all servers, *even if replication is broken*, to identify whether one is higher than the others
  • Knowing easily and quickly which is the best candidate to set as the new primary, in order of preference (it is up, it was designated as a candidate, it is in a "good state" (no missed transactions), it is in a "good" group (e.g. not sanitarium), it wasn't under maintenance, etc.)

Note most of this is possible already, but when a primary is down, and we are under a lot of stress/alerts, clear tooling would help a lot (e.g. knowing "what is going on" is more difficult than "acting"), and reliable tools (which will work even if several nodes or services have failed) are very nice.

@jcrespo Did you test the POC I mentioned? P22613. It is exactly what you're saying (ordering by position, state, problems, ignoring dbstore dbs, etc.) and works when the master is down.

Thanks for working on this @Ladsgroup.
I would discard option #1 if we can - I wouldn't want to have another host to maintain and we should try to stick to cumin* hosts as much as possible.

Option #2 looks good, although I am not able to see the binlog file in particular - but I do see multi-instance and the log pos. Maybe I am missing something, but it's not a big deal at the moment. We should definitely tag candidate masters as candidate, the same way we do with dbctl now, and include the tagging and untagging as part of the switchovers. Not just for this change, but for future uses.

Option #3 looks the cleanest to me as it can be integrated with many other things if needed but, we should probably wait for you to see how doable it is.

Change 771257 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] switchover-tmpl.sh: Add orchestrator tag notes

https://gerrit.wikimedia.org/r/771257

Change 771257 merged by Marostegui:

[operations/software@master] switchover-tmpl.sh: Add orchestrator tag notes

https://gerrit.wikimedia.org/r/771257

@jcrespo Did you test the POC I mentioned? P22613. It is exactly what you're saying (ordering by position, state, problems, ignoring dbstore dbs, etc.) and works when the master is down.

Sure - as I said, orchestrator would be a nice option for sourcing the data - but (and I know it is a POC):

  • Can we have it as a deployed command line tool or webpage?
  • Can we order it by the best option to fail over to? The first suggestion "in order" I got was "db2110", which was not a useful one - although I guess it was nice because it told us that a remote host was further along in the binlog (I am guessing this was because orch runs on codfw, so it was a latency + replication-still-ongoing thing, not realistic on a primary error).
  • Can we filter out "bad" options (not necessarily hide them, but mark them clearly)? E.g. dbstore role <> dbstore* instances (backup sources are named db*), although they could be pooled in an emergency, I guess, with much lower priority. We want first only "core" hosts (hosts that are already pooled in mw), no indirect replicas, no hosts from the other dc, and hosts with STATEMENT (or that can be switched to STATEMENT) among the suggestions (compare to dbctl, maybe?). My guess is there should be a set of rules for that, starting with the candidate, if it is in a good state. For example, if a "bad" instance is further along, we will want to replay those extra transactions manually on a "good" instance? I guess we will also need to translate between old and new primary coords (that will require later work and more discussion).

I tried other methods to get the s4 candidate masters. I cannot see them at https://noc.wikimedia.org/dbconfig/eqiad.json or https://noc.wikimedia.org/db.php (where it would have been very fast for me to spot them).

I was unable to get it on my own using the dbctl command line help. I arrived at https://wikitech.wikimedia.org/wiki/Dbctl#Find_candidate_masters_for_a_given_section, which was actually very useful. When run, it gave me:

{
  "db1141": {
    "host_ip": "10.64.0.219",
    "port": 3306,
    "sections": {
      "s4": {
        "candidate_master": true,
        "percentage": 100,
        "pooled": true,
        "weight": 400
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}
{
  "db1149": {
    "host_ip": "10.64.48.232",
    "port": 3306,
    "sections": {
      "s4": {
        "weight": 300,
        "candidate_master": true,
        "percentage": 100,
        "pooled": true,
        "groups": {
          "api": {
            "weight": 100,
            "pooled": true
          }
        }
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}
{
  "db1160": {
    "host_ip": "10.64.0.49",
    "port": 3306,
    "sections": {
      "s4": {
        "candidate_master": true,
        "percentage": 100,
        "pooled": true,
        "weight": 400
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}

3 candidates - is that intended? Puppet seems out of sync with dbctl: "$ git grep 'candidate master for s4'" (only showing 1 and a future one).

Obviously, all in favor of adding tags to orchestrator, if that makes life easier :-)

Please note I don't expect things to be all nice and easy as of now - I am already very happy this is getting attention - nor am I saying how it can/should be done. I am commenting as an "outsider" (only very occasionally attending outages), trying to communicate the difficulties I have getting the information I need out, especially without knowing all the internal details (the same thing, I think, happens in the reverse direction when handling database backups). 0:-)

Thank you for looking at this.

@jcrespo Did you test the POC I mentioned? P22613. It is exactly what you're saying (ordering by position, state, problems, ignoring dbstore dbs, etc.) and works when the master is down.

Sure - as I said, orchestrator would be a nice option for sourcing the data - but (and I know it is a POC):

  • Can we have it as a deployed command line tool or webpage?

That's why I think option #3 is better.

  • Can we order it by the best option to fail over to? The first suggestion "in order" I got was "db2110", which was not a useful one - although I guess it was nice because it told us that a remote host was further along in the binlog (I am guessing this was because orch runs on codfw, so it was a latency + replication-still-ongoing thing, not realistic on a primary error).
  • Can we filter out "bad" options (not necessarily hide them, but mark them clearly)? E.g. dbstore role <> dbstore* instances (backup sources are named db*), although they could be pooled in an emergency, I guess, with much lower priority. We want first only "core" hosts (hosts that are already pooled in mw), no indirect replicas, no hosts from the other dc, and hosts with STATEMENT (or that can be switched to STATEMENT) among the suggestions (compare to dbctl, maybe?). My guess is there should be a set of rules for that, starting with the candidate, if it is in a good state. For example, if a "bad" instance is further along, we will want to replay those extra transactions manually on a "good" instance? I guess we will also need to translate between old and new primary coords (that will require later work and more discussion).

We should probably mark as not suitable anything that is multi-instance (dbstore, sanitarium hosts, most of the backup sources...) and anything that has slaves under it (that would also scratch the codfw and sanitarium masters). That would give us a realistic list of possible good hosts. Going even further, we could list _only_ hosts with STATEMENT-based replication, as only candidate masters use it. So the list would be pretty clean with all those exclusions.
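
As a sketch of those exclusion rules on top of the same orchestrator payload (Key and MasterKey are assumptions about the API field names; "has replicas" is derived from the payload itself, and port != 3306 is used as a heuristic for multi-instance):

def suitable_candidates(instances, active_dc="eqiad"):
    """Keep only plausible new primaries according to the rules above."""
    # Hosts that appear as someone's master have replicas hanging off them
    # (this also removes the codfw master and sanitarium masters).
    masters_of_someone = {i["MasterKey"]["Hostname"] for i in instances if i.get("MasterKey")}
    good = []
    for i in instances:
        host, port = i["Key"]["Hostname"], i["Key"]["Port"]
        if port != 3306:
            continue  # multi-instance (dbstore, sanitarium, most backup sources)
        if host in masters_of_someone:
            continue  # has replicas under it
        if i.get("Binlog_format") != "STATEMENT":
            continue  # only candidate masters run STATEMENT
        if f".{active_dc}." not in host:
            continue  # stay in the active datacenter
        good.append(i)
    return good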

I tried other methods to get the s4 candidate masters. I cannot see them at https://noc.wikimedia.org/dbconfig/eqiad.json or https://noc.wikimedia.org/db.php (where it would have been very fast for me to spot them).

Yeah, they aren't listed there. When we migrated to dbctl, we lost the ability to put comments on the config like we had when using db-eqiad, db-codfw.

I was unable to get it on my own using the dbctl command line help. I arrived at https://wikitech.wikimedia.org/wiki/Dbctl#Find_candidate_masters_for_a_given_section, which was actually very useful. When run, it gave me:

{
  "db1141": {
    "host_ip": "10.64.0.219",
    "port": 3306,
    "sections": {
      "s4": {
        "candidate_master": true,
        "percentage": 100,
        "pooled": true,
        "weight": 400
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}
{
  "db1149": {
    "host_ip": "10.64.48.232",
    "port": 3306,
    "sections": {
      "s4": {
        "weight": 300,
        "candidate_master": true,
        "percentage": 100,
        "pooled": true,
        "groups": {
          "api": {
            "weight": 100,
            "pooled": true
          }
        }
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}
{
  "db1160": {
    "host_ip": "10.64.0.49",
    "port": 3306,
    "sections": {
      "s4": {
        "candidate_master": true,
        "percentage": 100,
        "pooled": true,
        "weight": 400
      }
    },
    "note": ""
  },
  "tags": "datacenter=eqiad"
}

3 candidates - is that intended? Puppet seems out of sync with dbctl: "$ git grep 'candidate master for s4'" (only showing 1 and a future one).

Yeah, it is a bit messy due to the planning with SDC and hosts with bigger disks. I will clean that up as soon as I finish this comment, to leave it like the rest of the sections. Well spotted.
I think for now we should use puppet as the source of truth for candidate masters (I will clean up s4 in puppet too).

As I mentioned at T266869#7780563 hosts are now tagged in orchestrator, and with https://gerrit.wikimedia.org/r/c/operations/software/+/771257 we should be able to keep a clean list of valid candidates.
Ideally I would use either orchestrator or zarcillo to maintain the canonical list once we've defined our final source of truth.

When we migrated to dbctl, we lost the ability to put comments on the config like we had when using db-eqiad, db-codfw.

Ah, I understand now: https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/noc/db.php

Because db.php is technically a "wmf mediawiki page", the only info it gets is pure mw config (e.g. the generated https://noc.wikimedia.org/dbconfig/eqiad.json config) and not the original dbctl data, which has additional SRE info. :-(

I will follow your advice and use puppet for now.

Fixed the dbctl notes for s4. Also checked the rest of the sections.

I updated that script to completely skip any replica that has replicas of its own, which automatically removes the codfw master and so on. I'm going to work on getting orchestrator web access from cumin and write the script for it.

I updated that script to completely skip any replica that has replicas of its own, which automatically removes the codfw master and so on. I'm going to work on getting orchestrator web access from cumin and write the script for it.

[question] Could it be more beneficial to add support for the orchestrator API directly into spicerack?

I updated that script to completely skip any replica that has replicas of its own, which automatically removes the codfw master and so on. I'm going to work on getting orchestrator web access from cumin and write the script for it.

I think it is cool to show them, but maybe mark them as not suitable for failover or something; definitely showing them in the topology would be nice, I reckon.

I updated that script to completely skip any replica that has replicas of its own, which automatically removes the codfw master and so on. I'm going to work on getting orchestrator web access from cumin and write the script for it.

[question] Could it be more beneficial to add support for the orchestrator API directly into spicerack?

Generally, sure: once we have more integration of db tooling such as dbctl and co. in spicerack, we can move this there and take advantage of it in cookbooks. I.e. I think there are more important pieces that we need to move there first.

In this particular case? I don't think so: this is not a cookbook, and having a master switchover cookbook is years in the future.

I updated that script to completely skip any replica that has replicas of its own, which automatically removes the codfw master and so on. I'm going to work on getting orchestrator web access from cumin and write the script for it.

I think it is cool to show them, but maybe mark them as not suitable for failover or something; definitely showing them in the topology would be nice, I reckon.

I can play with it much more easily in Python; as for that JS code, I will change it later.

Change 770981 had a related patch set uploaded (by Ladsgroup; author: jbond):

[operations/puppet@production] idp: Open up orchestrator to cumin host

https://gerrit.wikimedia.org/r/770981

Change 770981 merged by Ladsgroup:

[operations/puppet@production] idp: Open up orchestrator to cumin host

https://gerrit.wikimedia.org/r/770981

Change 771570 had a related patch set uploaded (by Ladsgroup; author: jbond):

[operations/puppet@production] idp: Open up orchestrator to cumin host, take II

https://gerrit.wikimedia.org/r/771570

Change 771570 merged by Ladsgroup:

[operations/puppet@production] idp: Open up orchestrator to cumin host, take II

https://gerrit.wikimedia.org/r/771570

So that patch, which was supposed to open it up, broke it in two different ways:

  • When a request was made to the API endpoint, it required the IP to be one of the cumin hosts (or mysql roots, whatever you want to call it), meaning that if a user was logged in and accessing it from outside, it would not let them in. Basically, it broke it for users.
  • When you were making the request from cumin, you'd get resource denied:
ladsgroup@cumin1001:~$ curl https://orchestrator.wikimedia.org/api/cluster/alias/s4
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>
ladsgroup@cumin1001:~$ curl https://orchestrator.wikimedia.org
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://idp.wikimedia.org/login?service=https%3a%2f%2forchestrator.wikimedia.org%2f">here</a>.</p>
</body></html>

Which I assume is because of another layer of defense in Apache not allowing access if there is no user logged in? I will check in depth later, preferably somewhere that is not production.

Change 771642 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] idp: Open up orchestrator to cumin host, take III

https://gerrit.wikimedia.org/r/771642

So that patch, which was supposed to open it up, broke it in two different ways:

  • When a request was made to the API endpoint, it required the IP to be one of the cumin hosts (or mysql roots, whatever you want to call it), meaning that if a user was logged in and accessing it from outside, it would not let them in. Basically, it broke it for users.

I think this is caused because you will need to have the require ip block + the CAS block.

  • When you were making the request from cumin, you'd get resource denied:

This is because mysql_root_clients only contains IPv4 addresses, whereas cumin tries to connect to orchestrator.wikimedia.org via IPv6.

I have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/771642 to try and address these, but it may need some massaging. Can you ping me when you plan to deploy, so I can test a few other bits if the current patch doesn't work as is?

Awesome, thanks. I will deploy this tomorrow.

Change 771642 merged by Ladsgroup:

[operations/puppet@production] idp: Open up orchestrator to cumin host, take III

https://gerrit.wikimedia.org/r/771642

The patch broke orch in two ways.

  • The first was a mismatched element name; that's easily fixable.
  • The second: Invalid command '<Macro', perhaps misspelled or defined by a module not included in the server configuration. I just copied the configuration instead of using the macro to bring it back. A proper fix can be done later.

Change 771861 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] orchestrator: Fix Apache settings

https://gerrit.wikimedia.org/r/771861

Change 771861 merged by Ladsgroup:

[operations/puppet@production] orchestrator: Fix Apache settings

https://gerrit.wikimedia.org/r/771861

orchestrator is now online with correct access (including API access from cumin)

Change 771866 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] idp: Open up orchestrator to cumin host, take IV

https://gerrit.wikimedia.org/r/771866

So I ported the JS code to Python; it's P22832. You can run it like this on cumin:

ladsgroup@cumin1001:~$ python3 master_finder.py s4
{"has replicas": true, "name": "db2110.codfw.wmnet:3306", "position": 919165983}
{"name": "db1147.eqiad.wmnet:3306", "position": 919150399}
{"name": "db1148.eqiad.wmnet:3306", "position": 919150399}
{"name": "db1149.eqiad.wmnet:3306", "position": 919150399}
{"name": "db1150.eqiad.wmnet:3314", "multiinstace": true, "position": 919150399}
{"name": "db1141.eqiad.wmnet:3306", "position": 919150101}
{"name": "db1142.eqiad.wmnet:3306", "position": 919150101}
{"name": "db1143.eqiad.wmnet:3306", "position": 919150101}
{"name": "db1146.eqiad.wmnet:3314", "multiinstace": true, "position": 919150101}
{"has replicas": true, "name": "db1121.eqiad.wmnet:3306", "position": 919149801}
{"name": "db1144.eqiad.wmnet:3314", "multiinstace": true, "position": 919149801}
{"name": "db1145.eqiad.wmnet:3314", "multiinstace": true, "position": 919149801}
{"name": "db1160.eqiad.wmnet:3306", "problems": ["not_replicating"], "position": 706410601}

Ordered based on position. What do you want to see changed before I send it to operations/software?

Not sure what's available, but it would probably be useful to see the binlog format and whether a host has slaves hanging off it or not.

Made some changes.

ladsgroup@cumin1001:~$ python3 master_finder.py s4
{"has replicas": true, "name": "db2110.codfw.wmnet:3306", "binlog format": "STATEMENT", "position": "db1138-bin.005536:40736867"}
{"name": "db1150.eqiad.wmnet:3314", "multiinstace": true, "binlog format": "MIXED", "position": "db1138-bin.005536:40442713"}
{"name": "db1145.eqiad.wmnet:3314", "multiinstace": true, "binlog format": "MIXED", "position": "db1138-bin.005536:40442086"}
{"name": "db1146.eqiad.wmnet:3314", "multiinstace": true, "binlog format": "ROW", "position": "db1138-bin.005536:40442086"}
{"name": "db1147.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005536:40442086"}
{"name": "db1148.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005536:40442086"}
{"name": "db1149.eqiad.wmnet:3306", "binlog format": "STATEMENT", "position": "db1138-bin.005536:40442086"}
{"name": "db1144.eqiad.wmnet:3314", "multiinstace": true, "binlog format": "ROW", "position": "db1138-bin.005536:40438642"}
{"name": "db1141.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005536:40438214"}
{"name": "db1142.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005536:40438214"}
{"has replicas": true, "name": "db1121.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005536:40434872"}
{"name": "db1143.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005536:40434872"}
{"name": "db1160.eqiad.wmnet:3306", "problems": ["not_replicating"], "binlog format": "STATEMENT", "position": "db1138-bin.005532:706410601"}

"slaves hanging or not." is "has replicas" option. I only show it when there is one. I want to show a note if there is a disqualifying factor.

Yeah, it doesn't need to show them, just whether there are replicas. So that's good enough!

The code to do the db switchover is https://github.com/wikimedia/operations-software-wmfmariadbpy/blob/master/wmfmariadbpy/cli_admin/switchover.py and from what I'm seeing it wouldn't work without the old master unless there is a massive refactor.

That's the main thing, and what T196366: Implement (or refactor) a script to move slaves when the master is not available also needs. The difficult part is gathering the list of slaves of the old master, which would be unreachable. Having orchestrator probably solves that issue (either that or zarcillo, whatever we consider the source of truth).
The script for this task should only list the replicas (and the other things we already mentioned: position, binlog, etc.).
For T196366 (which could maybe be combined into this script) we should probably make it run with something like
./emergency_switchover NEWMASTER LIST_OF_REPLICAS and it should only configure replication on the list of replicas and let them start replicating from $NEWMASTER. Then dbctl can be handled separately (like we do now, where switchover.py doesn't handle that).
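
For illustration only, the repointing loop of such an emergency_switchover could look roughly like this (assuming MariaDB GTID is consistent across the topology and that replication credentials are already configured on each replica; pymysql and the root credentials are stand-ins, not what switchover.py actually does):

import pymysql

def repoint_replica(replica_host, new_master_host, port=3306):
    """Make one replica start replicating from the new primary using GTID."""
    conn = pymysql.connect(host=replica_host, user="root", password="********")
    try:
        with conn.cursor() as cur:
            cur.execute("STOP SLAVE")
            # MASTER_USER/MASTER_PASSWORD are not set here: unspecified options
            # keep their previous values, so the existing replication account is reused.
            cur.execute(
                "CHANGE MASTER TO MASTER_HOST=%s, MASTER_PORT=%s, MASTER_USE_GTID=slave_pos",
                (new_master_host, port),
            )
            cur.execute("START SLAVE")
    finally:
        conn.close()

def emergency_switchover(new_master, replicas):
    for replica in replicas:
        repoint_replica(replica, new_master)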

Change 771866 merged by Ladsgroup:

[operations/puppet@production] orchestrator: Use macros in apache config

https://gerrit.wikimedia.org/r/771866

That's the main thing, and what T196366: Implement (or refactor) a script to move slaves when the master is not available also needs. The difficult part is gathering the list of slaves of the old master, which would be unreachable. Having orchestrator probably solves that issue (either that or zarcillo, whatever we consider the source of truth).
The script for this task should only list the replicas (and the other things we already mentioned: position, binlog, etc.).
For T196366 (which could maybe be combined into this script) we should probably make it run with something like
./emergency_switchover NEWMASTER LIST_OF_REPLICAS and it should only configure replication on the list of replicas and let them start replicating from $NEWMASTER. Then dbctl can be handled separately (like we do now, where switchover.py doesn't handle that).

Thanks. I will pick those up next, but before doing so: do you have any feature requests for the current code before I submit it to operations/software? E.g. highlighting the db that doesn't have any issues, or whatever you feel would be useful.

Thanks. I will pick those up next, but before doing so: do you have any feature requests for the current code before I submit it to operations/software? E.g. highlighting the db that doesn't have any issues, or whatever you feel would be useful.

Based on the current output of the script you pasted at P22832, I think that if you highlight (or list first) those which have NO replicas, NO multi-instance, use STATEMENT-based replication AND are in the active DC, that's probably the fastest way to identify the candidate master for that section.

This is the output for s4:

{"has replicas": true, "name": "db2110.codfw.wmnet:3306", "binlog format": "STATEMENT", "position": "db1138-bin.005603:990422071"}
{"has replicas": true, "name": "db1121.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1141.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1142.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1143.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1144.eqiad.wmnet:3314", "multiinstace": true, "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1145.eqiad.wmnet:3314", "multiinstace": true, "binlog format": "MIXED", "position": "db1138-bin.005603:990204250"}
{"name": "db1146.eqiad.wmnet:3314", "multiinstace": true, "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1147.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1148.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1149.eqiad.wmnet:3306", "binlog format": "ROW", "position": "db1138-bin.005603:990204250"}
{"name": "db1150.eqiad.wmnet:3314", "multiinstace": true, "binlog format": "MIXED", "position": "db1138-bin.005603:990204250"}
{"name": "db1160.eqiad.wmnet:3306", "binlog format": "STATEMENT", "position": "db1138-bin.005603:990204250"}

If we apply the conditions I mentioned above:
STATEMENT based -> db2110 and db1160
NO replicas -> db1160
Active DC -> db1160

And db1160 is indeed the s4 candidate master, and should probably be highlighted as the best candidate available.
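
That selection could be encoded as a small predicate over the script's output rows (field names like "has replicas" and "multiinstace" follow the pasted P22832 output and are assumptions about its internals):

def is_best_candidate(row, active_dc="eqiad"):
    # Heuristic from the discussion: no replicas hanging off it, not multi-instance,
    # STATEMENT-based binlog format, no known problems, and in the active datacenter.
    return (
        not row.get("has replicas")
        and not row.get("multiinstace")
        and not row.get("problems")
        and row.get("binlog format") == "STATEMENT"
        and f".{active_dc}." in row["name"]
    )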

Change 774585 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/software@master] dbtools: Add master_finder.py

https://gerrit.wikimedia.org/r/774585

@Ladsgroup in a second iteration we should add a USAGE to it and also specify that the section needs to match the sections we have in orchestrator.

I'm not following what you mean by USAGE 😅 Can you elaborate?

A usage message to show when the script is executed without any parameters :)

I see. Done now. Can you take a look?

Btw confirmed it works fine when the master is dead:

Ordered based on log position, the freshest first:
{"name": "db1125.eqiad.wmnet:3306", "problems": ["not_replicating"], "binlog format": "ROW", "position": "db1124-bin.000088:91070853"}

Change 774585 merged by jenkins-bot:

[operations/software@master] dbtools: Add master_finder.py

https://gerrit.wikimedia.org/r/774585

Ladsgroup moved this task from In progress to Done on the DBA board.

Let's call this done. I'll pick up T196366: Implement (or refactor) a script to move slaves when the master is not available next (probably next week)