Move wikireplicas dbproxy haproxy config to etcd
Closed, DeclinedPublic
Actions

Assigned To

None

Authored By

	• razzi
	Mar 23 2022, 7:10 AM

Description

Currently haproxy config is written using a puppet resource which reads from hiera. This means that when clouddb hosts are to be depooled (for example updating views like https://phabricator.wikimedia.org/T302233), we have to create a puppet patch like this one and apply it. This is manual and makes automating depooling in a cookbook difficult (the first pass at a cookbook tried to use haproxy commands directly which did not work with our haproxy permissions and would make haproxy out of sync with the filesystem).

A better approach would be to put the configuration in etcd and use confd to write the latest etcd values to disk (/etc/haproxy/conf.d/multi-db-replicas.cfg) and reload haproxy when etcd changes. This way etcd can be manually updated which is easy to do programmatically.

Rough outline of steps:

copy the hiera data about which servers are depooled in etcd
make confd write etcd to disk (we can use a different name, like /etc/haproxy/conf.d/wikireplicas.cfg, so that the current system is undisturbed)
- this should be configured to watch the key(s) we chose to rewrite the config files
when we're happy with the new config file, switch haproxy to read from the new config, and remove the old
make haproxy reload when the config file changes
test depooling a clouddb server by using etcdctl; once this works update the cookbook to depool using this command

Details

Subject	Repo	Branch	Lines +/-
dbproxy: add clouddb sections to conftool	operations/puppet	production	+9 -2
profile::mariadb::proxy::multiinstance_replicas: haproxy config from etcd	operations/puppet	production	+55 -0
wikireplicas: depool clouddb1017-1020 and repool 15 and 16	operations/puppet	production	+38 -38
wikireplicas: depool clouddb1017-1020 and repool 15 and 16	operations/puppet	production	+38 -38

Customize query in gerrit

Related Objects

Mentioned In: T313445: hw troubleshooting: Move dbproxy1019 from C5 to B6
Mentioned Here: T302233: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them

Event Timeline

• razzi created this task.Mar 23 2022, 7:10 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 23 2022, 7:10 AM

@Joe I'm interested in your input, since you mentioned etcd is the way to go here - does the above plan make sense?

If so, for etcd data modeling, would it make sense to have a single to have a single keyfor this dictionary-like structure for this profile::mariadb::proxy::multiinstance_replicas::section_overrides or break it into multiple keys and depend on a prefix?

I'll admit I've never used etcd prior to this (though data engineering uses zookeeper which is similar); please excuse my novice-level understanding.

Change 773386 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] [WIP] profile::mariadb::proxy::multiinstance_replicas: haproxy config from etcd

https://gerrit.wikimedia.org/r/773386

gerritbot added a project: Patch-For-Review.Mar 24 2022, 6:03 AM

• EChetty changed the task status from Open to In Progress.Mar 24 2022, 2:54 PM

• EChetty assigned this task to • razzi.

• EChetty added a project: Data-Engineering-Kanban.

• EChetty moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

• EChetty moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.

I have opened a patch for this that writes to a config file other than the actual one, so it can be inspected for correctness: https://gerrit.wikimedia.org/r/c/operations/puppet/+/773386

A couple questions remain:

How to get the data into confd? This will require adding keys like

conftool/v1/dbproxy-mariadb-hosts-pooled/s7/dbstore1018.eqiad.wmnet

with values like

{"ip": "10.64.0.29", "state": "active"}

A full set of keys and values can be derived from the current puppet-based depooling status.

for every clouddb server / section. The current values are in puppet.

Is the key structure that I used workable? Key is <prefix>/dbproxy-mariadb-hosts-pooled/<section>/<fqdn>

Hey sorry I didn't take the time to reply earlier, I was a bit busy.

We already have a structure in etcd to describe pools of backends and what is pooled/not pooled there.

It's the node object hierarchy.

So basically you would have to add something like the following to conftool-data:

# file name: node/dbproxy.yaml
dbproxy-mariadb:
  dbstore1018.eqiad.wmnet: [s7, ...] # any other section you might be interested in.

I think this would be the simplest way to go, but this would only provide you the pooled status for confd use.

I get that's somewhat clunky, as you'd need to get the IP address from another source - like say hiera. It's doable and it's the path of least resistance, but we can also think of adding a new type of objects.

That is done by adding a new class of objects to the conftool schema at ./modules/profile/files/conftool/schema.yaml.

So basically, I'd say this new type of object would need the following:

a scope (or another name) tag to indicate which functional group of dbproxies we're aiming at
the fields should be as you described, an ip address and a boolean pooled/not pooled

Then, we can have confd on the dbproxies just change the file on disk and notify haproxy to reload it.

Some pointers given we're not working with a big TZ overlap:

The conftool docs at https://gerrit.wikimedia.org/g/operations/software/conftool
The way we check and reload vcl when using confd at ./modules/varnish/files/confd-reload-vcl.sh

By declaring the host -> ip mappings using https://github.com/kelseyhightower/confd/blob/master/docs/templates.md#map earlier in the etcd template, we should be able to keep the data stored in etcd in line with the current node statuses; as simple as "pooled/not pooled". I'd prefer this, since repeating the ip address in etcd is a potential point of confusion.

When I compile the puppet catalog and inspect the template it generates, it gives some strange output; all the databases are clouddb1099 which doesn't exist (https://puppet-compiler.wmflabs.org/pcc-worker1003/34632/dbproxy1018.eqiad.wmnet/change.dbproxy1018.eqiad.wmnet.pson, search for /etc/haproxy/conf.d/multi-db-replicas.cfg), but the current file with which puts the puppet values straight into the haproxy config uses the same values, so I guess these are dummy values for the puppet catalog compiler.

Now I need to update the replicas.tpl.erb to use the standard config schema, but I'm a bit lost on how the keys and values would look in this case. There is a category, host name, and service name, but how to they combine into a key? I'm guessing it would look like /dbproxy-mariadb/clouddb1013.eqiad.wmnet/s1 or /dbproxy-mariadb/s1/clouddb1013.eqiad.wmnet.

Looking at the docs here (https://wikitech.wikimedia.org/wiki/Conftool#The_tools), it seems the values look like:

{"clouddb1013.eqiad.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=eqiad"}

Perhaps first I should get the conftool data onto etcd with conftool-merge by just merging the changes in this file: https://gerrit.wikimedia.org/r/c/operations/puppet/+/773386/10/conftool-data/node/eqiad.yaml and then set and query the etcd keys and values once that's done?

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.Apr 12 2022, 11:24 PM

Change 779926 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] dbproxy: add clouddb sections to conftool

https://gerrit.wikimedia.org/r/779926

Change 779941 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] wikireplicas: depool clouddb1017-1020 and repool 15 and 16

https://gerrit.wikimedia.org/r/779941

Change 779941 abandoned by Razzi:

[operations/puppet@production] wikireplicas: depool clouddb1017-1020 and repool 15 and 16

Reason:

Accedentally added a patch on top of an unrelated one

https://gerrit.wikimedia.org/r/779941

Change 779945 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] wikireplicas: depool clouddb1017-1020 and repool 15 and 16

https://gerrit.wikimedia.org/r/779945

Change 779945 merged by Razzi:

[operations/puppet@production] wikireplicas: depool clouddb1017-1020 and repool 15 and 16

https://gerrit.wikimedia.org/r/779945

Ladsgroup subscribed.May 4 2022, 8:25 PM

JAllemandou moved this task from In Progress to Paused on the Data-Engineering-Kanban board.May 24 2022, 6:41 PM

BTullis claimed this task.May 25 2022, 4:07 PM

JArguello-WMF removed BTullis as the assignee of this task.Jun 16 2022, 5:29 PM

JArguello-WMF removed a project: Data-Engineering-Kanban.

JArguello-WMF added a subscriber: BTullis.

Change 779926 abandoned by Btullis:

[operations/puppet@production] dbproxy: add clouddb sections to conftool

Reason:

Not proceeding with this approach for now.

https://gerrit.wikimedia.org/r/779926

Thanks @Joe for your input on this ticket. I think that we've decided it's too much work for us to take on at the moment, so I'll decline this ticket.
Maybe we will come back to the question of dynamic configuation of the clouddb proxy layer at a later date. I really like the idea of being able to adjust the proxies at runtime, but it's just not high enough up the priority list at the moment.

• nskaggs mentioned this in T313445: hw troubleshooting: Move dbproxy1019 from C5 to B6.Jul 29 2022, 3:50 PM

Move wikireplicas dbproxy haproxy config to etcdClosed, DeclinedPublicActions

Description

Details

Related Objects

Event Timeline

Move wikireplicas dbproxy haproxy config to etcd
Closed, DeclinedPublic
Actions