
Redesign and rebuild the wikireplicas service using a multi-instance architecture
Open, High, Public

Description

In the interest of keeping (or making) the wikireplicas a performant and sustainable service, new backend architectures are needed. This task covers designing and building the frontend and WMCS-supported scripting architectures that allow more backend flexibility with minimal loss of ease of use for end users of cloud services.

Since this task encompasses general access and orchestration, special access needs such as PAWS and Quarry will need their own tasks, and there is probably room for a lot of other subtasks as well.

Related Objects

Status · Assigned
Resolved · Marostegui
Resolved · RobH
Open · None
Open · Jhernandez
Open · Bstorm
Resolved · Bstorm
Resolved · Marostegui
Resolved · Marostegui
Resolved · Marostegui
Resolved · Bstorm
Resolved · Bstorm
Resolved · MoritzMuehlenhoff
Resolved · Marostegui
Resolved · Marostegui
Resolved · Cmjohnson
Resolved · dcaro
Open · Marostegui
Declined · None
Resolved · Kormat
Resolved · ArielGlenn
Open · Bstorm
Declined · Bstorm
Open · Bstorm
Resolved · Bstorm
Open · Jhernandez
Open · razzi
Resolved · Marostegui
Resolved · Milimetric
Resolved · Bstorm
Resolved · Bstorm
Open · None
Resolved · Bstorm
Resolved · Andrew
Resolved · Bstorm
Open · None
Open · Jhernandez
Resolved · Marostegui
Resolved · Ragesoss
Resolved · Bstorm
Open · None

Event Timeline

There are a very large number of changes, so older changes are hidden.

It seems like we would need a real layer 7 proxy for the MySQL protocol to do "name based vhosts". In theory ProxySQL could do that (https://proxysql.com/documentation/how-to-setup-proxysql-sharding/), but I'm not sure if anyone would be excited about introducing ProxySQL for this feature. I have never actually used it myself, but I keep ending up looking at docs for it because of the mysql-proxy questions we have for PAWS.

Yeah, it's possible to act on some pretty funky bits of matching in a TCP payload, but that seems somewhat daft with regard to MariaDB. Some of proxysql might be just the thing for PAWS, on the other hand, to handle both this complication and T253134: Find an alternative solution for the mysql-proxy in PAWS. Even so, that'd be a different scale and design than this kind of thing, which also makes me favor the IP-based routing notion above, using listen blocks.

This seems possible to me. Today there would be 16 service names to route (8 slices x 2 access patterns) if done this way. We already actually create distinct DNS records for s[1-8].{analytics,web}.db.svc.eqiad.wmflabs. Today all the "analytics" records point to one load balancer IP and the "web" records point to another. These records are managed using the wmcs-wikireplica-dns script. These are all internal IPs too (10.x.x.x), so there shouldn't be any major concerns about IP exhaustion.
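
To make the naming concrete, here is a minimal Python sketch of the service-name enumeration described above; it is illustrative only, not the actual wmcs-wikireplica-dns code, and the load balancer IPs are placeholders:

```
# Sketch only: enumerate the 16 service names (8 sections x 2 access patterns).
SECTIONS = [f"s{i}" for i in range(1, 9)]          # s1 .. s8
ACCESS_PATTERNS = ("analytics", "web")

def service_names(domain="db.svc.eqiad.wmflabs"):
    """Yield the 16 FQDNs, e.g. s1.analytics.db.svc.eqiad.wmflabs."""
    for access in ACCESS_PATTERNS:
        for section in SECTIONS:
            yield f"{section}.{access}.{domain}"

# Today every "analytics" name points at one LB IP and every "web" name at
# another; the addresses below are placeholders, not the real service IPs.
PLACEHOLDER_LB_IPS = {"analytics": "10.0.0.1", "web": "10.0.0.2"}
records = {name: PLACEHOLDER_LB_IPS[name.split(".")[1]] for name in service_names()}
```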

I just realized this might also give us a simple way to run both systems in parallel for a testing period. We could publish DNS records as s[1-8].{analytics,web}.db.svc.eqiad1.wikimedia.cloud pointing to the 16 new service ips as soon as the backing cluster existed. The current eqiad.wmflabs records could then stay pointed at the legacy service ips and continue to route to the legacy cluster. Then on the date of the hard switch to the new cluster we could update the eqiad.wmflabs DNS records to use the new service ips as well.

On that cutoff date, I think we should also consider retiring the legacy *.labsdb DNS records that we have been keeping around since the current (soon to be legacy) wiki replicas cluster was deployed.

I just realized this might also give us a simple way to run both systems in parallel for a testing period.

Oooh! I like that. That might be a cool idea. Adding @Marostegui for comment on that kind of thing as well.

Yeah, we should have both systems running in parallel for some time.
My idea was to have that and slowly turn off the old ones. As there are many pieces here, I am sure we'll keep finding stuff only once we've shut down the old system, so it would be good to have it handy just in case we need to start it again until we are fully confident that the new one is working fine.

Regarding the proxying, it might indeed require quite a lot of work. I think we can also experiment with just round-robin DNS first (although of course that doesn't give us persistent connections); it can be a good indication of whether the new and old systems are fully transparent, as in: the user doesn't notice which set of hosts they are connecting to. Once that happens, that is a good indication that the old system can start to be powered off.

Change 621067 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: create cumin aliases for wikireplica servers

https://gerrit.wikimedia.org/r/621067

Change 621088 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/cookbooks@master] wikireplicas: add wikireplica cookbook to add a wiki

https://gerrit.wikimedia.org/r/621088

Change 621067 merged by Bstorm:
[operations/puppet@production] wikireplicas: create cumin aliases for wikireplica servers

https://gerrit.wikimedia.org/r/621067

Change 621343 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins

https://gerrit.wikimedia.org/r/621343

Change 621088 merged by jenkins-bot:
[operations/cookbooks@master] wikireplicas: add wikireplica cookbook to add a wiki

https://gerrit.wikimedia.org/r/621088

Change 621574 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/cookbooks@master] wikireplicas: fix typo in the dns script for wikireplicas

https://gerrit.wikimedia.org/r/621574

Change 621574 merged by jenkins-bot:
[operations/cookbooks@master] wikireplicas: fix typo in the dns script for wikireplicas

https://gerrit.wikimedia.org/r/621574

I spent a bit of time testing out the idea of switching PAWS to using proxysql instead of mysql-proxy. I can see a big value in putting proxysql in front of the wikireplicas overall, but it honestly doesn't do the exact things we need for paws except query routing. I'm going to need to hack on mysql-proxy, I think, to add the appropriate routing.

Ok, this doesn't look that hard except that it may require maintaining a git checkout of the mediawiki config that is up to date on the mysql-proxy pod. The main thing that needs to change is that the database needs to be matched to the server in DNS, which should be a CNAME for the shard DNS, which I hope will go to an IP that has a listen block that routes to the right hosts and ports on haproxy.

Just have to figure out where to squeeze that in there...

Ok, this doesn't look that hard except that it may require maintaining a git checkout of the mediawiki config that is up to date on the mysql-proxy pod.

Depending on what you need from the mw-config clone, you may be able to read files from https://noc.wikimedia.org/ instead. The wmcs-wikireplica-dns script loads dblists from there instead of a local clone of the config. One advantage of this is that noc is updated by scap, so there should never be configuration drift where the upstream HEAD has changed but the local clone has not pulled yet. (Ok, "never" is absolutist. How about "extremely rare" instead?) My versions tool uses this technique as well. You may of course need to introduce some caching of the config which would bring drift potential back, but at least you would not need to manage a git clone with a sidecar container.
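
For illustration, a hedged sketch of what loading the section dblists from noc could look like; the /conf/dblists/<section>.dblist URL layout and the section list are assumptions to verify against the actual site:

```
# Sketch: build a wiki-database -> section map from noc.wikimedia.org
# dblists instead of keeping a local mediawiki-config clone in sync.
import requests

NOC_DBLIST_URL = "https://noc.wikimedia.org/conf/dblists/{section}.dblist"

def load_db_to_section(sections=("s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8")):
    """Return a dict mapping each wiki database name to its section."""
    db_to_section = {}
    for section in sections:
        resp = requests.get(NOC_DBLIST_URL.format(section=section), timeout=10)
        resp.raise_for_status()
        for line in resp.text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                db_to_section[line] = section
    return db_to_section
```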

The main thing that needs to change is that the database needs to be matched to the server in DNS, which should be a CNAME for the shard DNS, which I hope will go to an IP that has a listen block that routes to the right hosts and ports on haproxy.

Just have to figure out where to squeeze that in there...

I think @elukey and his team did something based on DNS for the dbstore1XXX hosts, which are running multi-instance as well. I am not fully sure of the details of how that was finally implemented, but maybe it can give you some ideas.

The system used in the Analytics cluster is documented at https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB#Database_setup. It works pretty well in my experience but it would be a huge burden for Toolforge users to have to add port number lookups to all code connecting to the Wiki Replicas. Even if we make nice helper functions available in multiple programming languages I would expect a very large number of tools to break because they are not actively maintained.

The service name ideas above are an attempt to find a setup that moves the port mapping burden onto us in the infrastructure layer rather than pushing that work down to all of the users.
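
To show the difference from a tool author's point of view, here is a rough sketch of both approaches; the host names, the section-to-port convention (s1 on 3311, s2 on 3312, ...), and the credentials handling are assumptions for illustration, not the deployed configuration:

```
# Illustrative only: hosts, ports, and the port convention are assumptions.
import os
import pymysql

CNF = os.path.expanduser("~/replica.my.cnf")   # Toolforge credentials file

def connect_port_style(dbname, db_to_section):
    """Analytics-style: the caller has to look up the section and port."""
    section = db_to_section[dbname]        # e.g. "s1"
    port = 3310 + int(section[1:])         # assumed convention: s1 -> 3311, s2 -> 3312, ...
    return pymysql.connect(host="multiinstance-replica.example.org", port=port,
                           database=dbname + "_p", read_default_file=CNF)

def connect_service_name_style(dbname):
    """Service-name style: the infrastructure hides the port mapping."""
    return pymysql.connect(host=f"{dbname}.analytics.db.svc.eqiad.wmflabs", port=3306,
                           database=dbname + "_p", read_default_file=CNF)
```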

So looking into Quarry a bit, I found that the setup is substantially the same as it is for PAWS. Connections to the replicas are simply done with a variable:

REPLICA_HOST: 'enwiki.analytics.db.svc.eqiad.wmflabs'

The nice thing here is that there's no deprecated proxy that needs some odd Lua inserted in the client connection for ALL clients. It's just Python. That said, the UI assumes you are already connected and then behaves sort of like a mysql shell (only not).

It'll need a "database to connect to" UI element that keys the app to connect to the right replica instance. The beautiful thing here is that we may not need to worry about nonsense like I did here https://phabricator.wikimedia.org/rPAWS26e5ac6bfd1df925314822cfdf0ce9362449321c because we can just prepend the database name to the URL instead.
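
As a tiny sketch of the "prepend the database name" idea (the helper name is hypothetical):

```
# Hypothetical helper: derive the per-wiki replica host from the database
# chosen in the UI, instead of a single hard-coded REPLICA_HOST.
def replica_host(dbname, access="analytics", domain="db.svc.eqiad.wmflabs"):
    """e.g. replica_host('enwiki') -> 'enwiki.analytics.db.svc.eqiad.wmflabs'"""
    return f"{dbname}.{access}.{domain}"
```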

I can confirm that the sql command in the misctools package should work if we are proxying based on section names.

Change 627379 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: Proposal for a proxy setup on multi-instance replicas

https://gerrit.wikimedia.org/r/627379

Change 621343 merged by Jbond:
[operations/puppet@production] cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins

https://gerrit.wikimedia.org/r/621343

Now it actually works and might even be safe enough to do a PR with. It should work now since the DNS aliases already exist.
https://github.com/toolforge/paws/commit/b3a348a337af0f507bbcb3f87882671faf3c75d4

Doing the paws work on this ticket feels like it is burying things a bit. I'm going to open a separate ticket for the notes on my quarry work.

Bstorm renamed this task from "Experiment with and design options for multi-instance or multi-section wikireplicas frontend architecture" to "Redesign and rebuild the wikireplicas service using a multi-instance architecture". Oct 29 2020, 7:14 PM

Ok, apparently mysql-proxy is smarter than I thought, and it can tell that I'm not pointing it at different IP addresses when I give it a list of different names that point at the same IP. As of now, the otherwise working code to connect in the new way is commented out. I may need to add another switch that generates the set of proxies when there are multiple proxy addresses to connect to. At the very least, the code is all in place to do this now for PAWS.

Change 642503 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: modify views scripts to work on any replica style

https://gerrit.wikimedia.org/r/642503

Change 642570 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: extend maintain_dbusers to multiinstance replicas

https://gerrit.wikimedia.org/r/642570

Change 642503 merged by Bstorm:
[operations/puppet@production] wikireplicas: modify views scripts to work on any replica style

https://gerrit.wikimedia.org/r/642503

Change 642570 merged by Bstorm:
[operations/puppet@production] wikireplicas: extend maintain_dbusers to multiinstance replicas

https://gerrit.wikimedia.org/r/642570

Change 644949 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: extend maintain_dbusers to multiinstance replicas

https://gerrit.wikimedia.org/r/644949

Change 644949 merged by Bstorm:
[operations/puppet@production] wikireplicas: extend maintain_dbusers to multiinstance replicas

https://gerrit.wikimedia.org/r/644949

Change 644950 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: extend maintain_dbusers to multiinstance--test 2

https://gerrit.wikimedia.org/r/644950

Change 644950 merged by Bstorm:
[operations/puppet@production] wikireplicas: extend maintain_dbusers to multiinstance--test 2

https://gerrit.wikimedia.org/r/644950

Change 644952 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: extend maintain_dbusers to multiinstance

https://gerrit.wikimedia.org/r/644952

Change 644952 merged by Bstorm:
[operations/puppet@production] wikireplicas: extend maintain_dbusers to multiinstance

https://gerrit.wikimedia.org/r/644952

Change 645173 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: fix the harvest-replicas functionality

https://gerrit.wikimedia.org/r/645173

Change 645173 merged by Bstorm:
[operations/puppet@production] wikireplicas: fix the harvest-replicas functionality

https://gerrit.wikimedia.org/r/645173

Change 651857 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/private@master] multiinstance proxies: workaround puppetdb not being in cloud

https://gerrit.wikimedia.org/r/651857

Change 651857 merged by Bstorm:
[labs/private@master] multiinstance proxies: workaround puppetdb not being in cloud

https://gerrit.wikimedia.org/r/651857

Change 627379 merged by Bstorm:
[operations/puppet@production] wikireplicas: Work toward a proxy setup on multi-instance replicas

https://gerrit.wikimedia.org/r/627379

Change 664860 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: fix the centralauth management bit of the view scripts

https://gerrit.wikimedia.org/r/664860

Change 664860 merged by Bstorm:
[operations/puppet@production] wikireplicas: fix the centralauth management bit of the view scripts

https://gerrit.wikimedia.org/r/664860

Change 670907 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/toollabs@master] wikireplicas: switch sql command to multiinstance replicas

https://gerrit.wikimedia.org/r/670907

Change 670907 merged by jenkins-bot:
[labs/toollabs@master] wikireplicas: switch sql command to multiinstance replicas

https://gerrit.wikimedia.org/r/670907