Page MenuHomePhabricator

#dbctl: manage 'externalLoads' data
Closed, ResolvedPublic

Description

externalLoads is used for configuration of which DB servers store so-called external storage data. dbctl should manage this as well.

There's a few different-from-usual things going on here, from dbctl's perspective:

  • Obviously, the top-level key is different -- externalLoads instead of sectionLoads.
  • There's no groups for the externalLoads sections.
  • Many of the sections in externalLoads refer to $wmgOldExtTemplate, the same set of three servers. Any changes to that grouping need to apply to all the sections involved.
  • Weights seem to be either 0 (for the master) or 1 (when pooled as a replica), but we'd possibly want to change that?

Event Timeline

Those weights are treated the same way as the other sX sections, but given that we have few hosts we normally use them like that (0/1).
Sometimes when we are repooling a cold host we might go for 3 on the existing one and 1 for the one that's being slowly reepooled. Then 3/2 and once we are fully ready we just set 1 to both of them.

Sometimes we do set the master to 1 if one of the hosts is depooled.
But essentially this is like any other section s1-s8

In db-eqiad/codfw.php we currently provide IP addresses as the values of the keys in externalLoads instead of using hostsByName to translate hostnames to IP, e.g.:

'externalLoads' => [
	# es2
	'cluster24' => [
		'10.64.32.184' => 0, # es1015, C2 11TB 128GB, master
		'10.64.0.6'    => 1, # es1011, A2 11TB 128GB
		'10.64.16.186' => 1, # es1013, B1 11TB 128GB
	],
]

However looking at the code path, I don't think there's any reason why this has to be so. I see in LBFactoryMulti.php that newExternalLB calls newLoadBalancer which calls makeServerArray which implements the translation: $serverInfo['host'] = $this->hostsByName[$serverName] ?? $serverName;

So I'm planning on emitting hostnames when I implement externalLoads output in dbctl, but wanted to verify with @Krinkle and @aaron that I understood the situation correctly. @Krinkle also suggested that if we are going to rely on this behavior, then it should be explicitly tested in mediawiki-core.

Here's a mini-design proposal for the dbctl feature itself (@Volans and @Joe please review):

  • Add a flavor enum to the section schema. The default will be regular, which will cause the section to be output in the config in sectionLoads, as usual.
  • The flavor externalstore will cause the section to be output in externalLoads.
    • Any groups set in the instances of a section with flavor=externalstore will be ignored.
  • The flavor template will cause the section to not be output anywhere. Rather this is only used to be referenced by other sections.
    • The name of a flavor=template section must begin with template_.
  • Also add to the section schema a base_template field. The value of this field must be the name of a flavor=template section.
    • It is not allowed to set both base_template and flavor=template -- strictly one level of inheritance is allowed.
    • When base_template=T is set on a section X, dbctl will emit section X's config in the following way:
      • The master of section T is emitted as the master of section X.
      • All of section T's instances will be emitted as part of section X.
      • The value of the master field for section X is ignored. dbctl will produce a warning at section edit / section set_master time to edit the master of a section that has base_template set.
      • Any instances somehow associated with section X are ignored. dbctl will produce a warning at instance edit / instance pool time to associate an instance with a section that has base_template set.

Alternatives considered and rejected:

  • Allowing instances to be associated with sections that have base_template set. This raises all sorts of questions:
    • "What to do when an instance is associated with X that doesn't occur in T" could be answered by just taking the union of the replica instances, but...
    • what do you do when the same instance is associated with X and with T?
      • Having the weight and pooled status from X take precedence over T sounds 'natural', but it seems inevitable this will produce some sort of surprising accidental misconfiguration in the future.
      • Disallowing an overlap in instances -- making that condition an error -- solves the above concern, while also allowing some template-using sections to pool 'extra' instances specific to them -- but it also means any edit on an instance possibly requires reading from etcd the definitions of all instances and many sections.
  • Not supporting templates. This would make maintenance on the existing $wmgOldExtTemplate server group, used by 15 sections in externalLoads, very arduous.
    • One possible workaround: allow batch pool/depool on sections in dbctl. However this is also a bunch of new code, would result in very time-consuming edits with the number of etcd operations involved, there would be the potential for individual operations to fail and possibly cause inconsistencies, etc. The template system actually seems simpler to use, maintain, and implement.
  • Don't support templates in dbctl, but instead add some kind of plumbing to reference the existing $wmgOldExtTemplate somehow, without redefinining it within dbctl. So $wmgOldExtTemplate would still live on within db-eqiad/codfw.php.
    • It isn't clear how to do this, but it would mean that the output JSON is no longer a sufficient snapshot of the configuration, and it would add complexity to the PHP that reads dbctl JSON. This seems to incresae maintenance burden rather than reduce it.

Here's a mini-design proposal for the dbctl feature itself (@Volans and @Joe please review):

  • Add a flavor enum to the section schema. The default will be regular, which will cause the section to be output in the config in sectionLoads, as usual.
  • The flavor externalstore will cause the section to be output in externalLoads.
    • Any groups set in the instances of a section with flavor=externalstore will be ignored.

I would rather forbid/fail here if possible. If the flavor is externalstore it cannot have groups and we could enforce it at schema level if we don't want to do anything special in the code.

  • The flavor template will cause the section to not be output anywhere. Rather this is only used to be referenced by other sections.

From now on you lost me 😜 I think we could get away with a much simpler solution, because we need to remember that this is the DBA side of things, not the MW one. And we don't need to keep track of all the old oddities MW needs in the PHP files.
I think we just need to be able to output either 4 variables or an associative array with 4 elements: one for the current $wmgOldExtTemplate for ext1 and one each for the cluster24, cluster25 and extension1 groups. I think it would be wrong to track all the keys in externalLoads in dbconfig.
Thoughts?

A couple comments:

  • I concur with @Volans I'd keep the first iteration (at least) *very* simple. I find inheritance in configurations to be hard to reason about and mostly a complication. But maybe I'm not seeing the problem you're solving.
  • I know adding tags to a schema is a pain (in fact, it will need a data migration) but the flavour thing you were proposing seems like the kind of thing that should be a tag, so scope=eqiad,flavour=main could be a set of tags for an instance object

That would allow for easier querying (confctl doesn't natively query on object values) on all sides.

My thought with wanting to build templates into dbctl was mostly that it would mean dbctl's output would make it obvious what was happening, without needing to also look at the associated Mediawiki configs. Also I didn't want to add more PHP glue on the mediawiki-config side, because the status quo is already a bit gross.

Anyway I think I've been talked out of adding template inheritance in dbctl. What is the best key to use in dbctl's externalLoads output to signify $wmgOldExtTemplate? I am thinking cluster1 will do fine, and then the obvious cluster24, cluster25, extension1.

I suppose it does also make sense to add a flavor tag to section objects, despite the migration needed.

@Marostegui please give me some input on the UI -- what would you like dbctl to call the section that will represent "everything that refers to $wmgOldExtTemplate"? For now I'm thinking cluster1, does that sound good to you? (So you would write something like dbctl instance es1018 depool --section cluster1)

@Krinkle @aaron -- could one of you please find some time soon to make sure I'm understanding things correctly as I state them below?

In db-eqiad/codfw.php we currently provide IP addresses as the values of the keys in externalLoads instead of using hostsByName to translate hostnames to IP, e.g.:

'externalLoads' => [
	# es2
	'cluster24' => [
		'10.64.32.184' => 0, # es1015, C2 11TB 128GB, master
		'10.64.0.6'    => 1, # es1011, A2 11TB 128GB
		'10.64.16.186' => 1, # es1013, B1 11TB 128GB
	],
]

However looking at the code path, I don't think there's any reason why this has to be so. I see in LBFactoryMulti.php that newExternalLB calls newLoadBalancer which calls makeServerArray which implements the translation: $serverInfo['host'] = $this->hostsByName[$serverName] ?? $serverName;

So I'm planning on emitting hostnames when I implement externalLoads output in dbctl, but wanted to verify with @Krinkle and @aaron that I understood the situation correctly. @Krinkle also suggested that if we are going to rely on this behavior, then it should be explicitly tested in mediawiki-core.

@Marostegui please give me some input on the UI -- what would you like dbctl to call the section that will represent "everything that refers to $wmgOldExtTemplate"? For now I'm thinking cluster1, does that sound good to you? (So you would write something like dbctl instance es1018 depool --section cluster1)

We use them as:
es2
es3

("soon" we'll have es4 and es5 too).
We also call them like that on tendril, so maybe better to call them like that on dbctl as well.

@CDanis my 2 cents are with Manuel to use es[123] on the dbctl side and have the mapping es->cluster in mediawiki-config code as it is right now (with comments in the db-$dc.php files).
One thing that I don't remember if it was mentioned is that the es1 cluster is read-only and doesn't have any replication setup between them. Some additional code/knob is probably needed to allow this setup IIRC.

@CDanis my 2 cents are with Manuel to use es[123] on the dbctl side and have the mapping es->cluster in mediawiki-config code as it is right now (with comments in the db-$dc.php files).
One thing that I don't remember if it was mentioned is that the es1 cluster is read-only and doesn't have any replication setup between them. Some additional code/knob is probably needed to allow this setup IIRC.

Correct, and we'll have es2, es3 also on RO "soon" as they are filling up, so we'll have es4 and es5.

es1: currently read only clusters, all <24
es2 es3: currently rw clusters containing cluster24 and 25 respectively (but distribution may change in the future), they will become read only after the following are setup:
es4 es5: new rw clusters, previsibly cluster 26 and cluster 27 respectively, yet to be purchased and setup

The cluster -> section is not dynamic information, the section -> hosts is.

We could involve mw stakeholders to make that a reality for mw cofiguration, as they are sometimes also confused by the logical vs. hw setup.

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).

I'm not familiar with what's being asked exactly, deferring to @aaron for now. Happy to help later if needed.

In db-eqiad/codfw.php we currently provide IP addresses as the values of the keys in externalLoads instead of using hostsByName to translate hostnames to IP, e.g.:

'externalLoads' => [
	# es2
	'cluster24' => [
		'10.64.32.184' => 0, # es1015, C2 11TB 128GB, master
		'10.64.0.6'    => 1, # es1011, A2 11TB 128GB
		'10.64.16.186' => 1, # es1013, B1 11TB 128GB
	],
]

However looking at the code path, I don't think there's any reason why this has to be so. I see in LBFactoryMulti.php that newExternalLB calls newLoadBalancer which calls makeServerArray which implements the translation: $serverInfo['host'] = $this->hostsByName[$serverName] ?? $serverName;

So I'm planning on emitting hostnames when I implement externalLoads output in dbctl, but wanted to verify with @Krinkle and @aaron that I understood the situation correctly. @Krinkle also suggested that if we are going to rely on this behavior, then it should be explicitly tested in mediawiki-core.

I see no 'templateOverridesByServer' usage, so as long as the hostnames used in 'externalLoads' are defined in 'hostsByName', that sounds OK.

Any rough ETA on when externalLoads will be able to be handled by dbctl?

Any rough ETA on when externalLoads will be able to be handled by dbctl?

Before EoQ.

Change 556224 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software/conftool@master] dbctl: generate externalLoads

https://gerrit.wikimedia.org/r/556224

Change 556224 merged by jenkins-bot:
[operations/software/conftool@master] dbctl: generate externalLoads

https://gerrit.wikimedia.org/r/556224

Change 556236 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/mediawiki-config@master] dbctl: also read externalLoads from dbctl

https://gerrit.wikimedia.org/r/556236

Change 556712 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] dbctl: update schemata for 1.3.0-1

https://gerrit.wikimedia.org/r/556712

Mentioned in SAL (#wikimedia-operations) [2019-12-12T15:59:53Z] <cdanis> T229686 ✔️ cdanis@install1002.wikimedia.org ~ 🕚☕ sudo -E reprepro -C main include stretch-wikimedia conftool_1.3.0-1_amd64.changes

Mentioned in SAL (#wikimedia-operations) [2019-12-12T16:02:52Z] <cdanis> T229686 upgrade python3-conftool and python3-conftool-dbctl on cumin hosts

Change 556712 merged by CDanis:
[operations/puppet@production] dbctl: update schemata for 1.3.0-1

https://gerrit.wikimedia.org/r/556712

Mentioned in SAL (#wikimedia-operations) [2019-12-12T20:17:09Z] <cdanis> T229686 adding instances backing es1/es2/es3/x1 to dbctl's instance data

Mentioned in SAL (#wikimedia-operations) [2019-12-12T20:18:54Z] <cdanis> T229686 adding sections es1/es2/es3/x1 to dbctl's section data

Mentioned in SAL (#wikimedia-operations) [2019-12-12T20:20:24Z] <cdanis@cumin2001> dbctl commit (dc=all): 'T229686 add sections es1/es2/es3/x1 and their instances', diff saved to https://phabricator.wikimedia.org/P9866 and previous config saved to /var/cache/conftool/dbconfig/20191212-202023-cdanis.json

Change 556800 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] conftool-data: add es[123]|x1 sections and instances

https://gerrit.wikimedia.org/r/556800

Change 556800 merged by CDanis:
[operations/puppet@production] conftool-data: add es[123]|x1 sections and instances

https://gerrit.wikimedia.org/r/556800

Mentioned in SAL (#wikimedia-operations) [2019-12-13T21:58:00Z] <cdanis> testing I0e0de86d by hand on mwdebug1001 T229686

I tested this on mwdebug1001 by manually installing my patch there.

I can verify it seems to work on both a variety of enwiki reads
https://logstash.wikimedia.org/app/kibana#/context/logstash-*/mediawiki/AW8BSA_mKWrIH1QROCC1?_a=h@53d830d&_g=h@44136fa

and also on a metawiki write:
https://logstash.wikimedia.org/app/kibana#/context/logstash-*/mediawiki/AW8BW5l6KWrIH1QRPMRr?_a=h@fbd2ac0&_g=h@44136fa

Earlier logs before the patch was installed referred to the externalLoads servers by IP address, as that is how they're written in the live PHP configs:
https://logstash.wikimedia.org/app/kibana#/context/logstash-*/mediawiki/AW8BQ27uKWrIH1QRNtmM?_a=h@a9e6f28&_g=h@44136fa

Mentioned in SAL (#wikimedia-operations) [2019-12-13T22:35:41Z] <cdanis> T229686 ✔️ cdanis@mwdebug1001.eqiad.wmnet /srv/mediawiki 🕠🍺 scap pull

Change 556236 merged by jenkins-bot:
[operations/mediawiki-config@master] dbctl: also read externalLoads from dbctl

https://gerrit.wikimedia.org/r/556236

Mentioned in SAL (#wikimedia-operations) [2019-12-16T14:03:28Z] <cdanis@deploy1001> Synchronized wmf-config/etcd.php: enable dbctl for externalLoads 6dfb30c76 T229686 (duration: 00m 53s)

Change 558054 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/mediawiki-config@master] db-codfw: remove dbctl-obsoleted externalLoads section

https://gerrit.wikimedia.org/r/558054

Change 558054 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw: remove dbctl-obsoleted externalLoads section

https://gerrit.wikimedia.org/r/558054

Change 558062 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/mediawiki-config@master] db-eqiad: remove dbctl-obsoleted externalLoads section

https://gerrit.wikimedia.org/r/558062

Mentioned in SAL (#wikimedia-operations) [2019-12-16T14:39:44Z] <cdanis@deploy1001> Synchronized wmf-config/etcd.php: db-codfw: remove dbctl-obsoleted externalLoads section 519e37461 T229686 (duration: 00m 53s)

Change 558062 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad: remove dbctl-obsoleted externalLoads section

https://gerrit.wikimedia.org/r/558062

Mentioned in SAL (#wikimedia-operations) [2019-12-16T14:45:08Z] <cdanis@deploy1001> Synchronized wmf-config/db-codfw.php: db-codfw: remove dbctl-obsoleted externalLoads section 519e37461 T229686 (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2019-12-16T14:46:31Z] <cdanis@deploy1001> Synchronized wmf-config/db-eqiad.php: db-eqiad: remove dbctl-obsoleted externalLoads section 5413a6d73 T229686 (duration: 00m 54s)

Seems to be working well.