
Set up IP addresses for the new wiki replicas setup
Closed, DeclinedPublic

Assigned To
None
Authored By
Bstorm
Nov 5 2020, 10:16 PM
Referenced Files
F33946999: wikireplicas-floating ips.png
Dec 16 2020, 10:22 AM
F33918352: wikireplicas-desired(1).png
Nov 18 2020, 3:33 PM
F33918233: wikireplicas-current.png
Nov 18 2020, 11:38 AM
F33918238: wikireplicas-transition.png
Nov 18 2020, 11:38 AM
F33918236: wikireplicas-desired.png
Nov 18 2020, 11:38 AM

Description

For the multi-instance wiki replicas, we will need one IP address per section assigned to the proxies (dbproxy101[89].eqiad.wmnet). That's 8 VIPs for each of the two proxy types (16 total), which will only need to move when new proxy servers are put into service.

At this time, everything will connect via IPv4, but we would like these Cloud VMs to be IPv6 capable in the future.

The naming associated with these inside Cloud VPS will be:
s1.analytics.db.svc.wikimedia.cloud
s2.analytics.db.svc.wikimedia.cloud
s3.analytics.db.svc.wikimedia.cloud
s4.analytics.db.svc.wikimedia.cloud
s5.analytics.db.svc.wikimedia.cloud
s6.analytics.db.svc.wikimedia.cloud
s7.analytics.db.svc.wikimedia.cloud
s8.analytics.db.svc.wikimedia.cloud
s1.web.db.svc.wikimedia.cloud
s2.web.db.svc.wikimedia.cloud
s3.web.db.svc.wikimedia.cloud
s4.web.db.svc.wikimedia.cloud
s5.web.db.svc.wikimedia.cloud
s6.web.db.svc.wikimedia.cloud
s7.web.db.svc.wikimedia.cloud
s8.web.db.svc.wikimedia.cloud

The equivalent names are currently managed by Designate (using the eqiad.wmflabs domain). So we need to allocate 16 IPs (and possibly IPv6 IPs?) and have the name records somewhere sane.

Currently, we do not have multi-instance replicas, so everything connects to a single IP for each proxy. The existing proxies are:

  • dbproxy1018.eqiad.wmnet (10.64.37.27)
  • dbproxy1019.eqiad.wmnet (10.64.37.28)

These are likely to remain the proxy machines after all this. Cloud instances connect to MariaDB via these hosts on port 3306. They will still do that, but the proxies will route to different backend hosts and ports than they do now.

Event Timeline

Bstorm added subscribers: Marostegui, Kormat.

Added members of the Data Persistence team to correct me if I am wrong on my understanding of anything in the description :)

Added members of the Data Persistence team to correct me if I am wrong on my understanding of anything in the description :)

Everything looks good.
Expanding on what you already said: dbproxy1018 and dbproxy1019 also manage the different weights per host:

$ cat hieradata/hosts/dbproxy1018.yaml
# These hosts support the wikireplica-analytics service
profile::mariadb::proxy::replicas::servers:
  labsdb1010:
    address: '10.64.37.23:3306'
    weight: 1
  labsdb1011:
    address: '10.64.37.24:3306'
    weight: 1
profile::mariadb::proxy::firewall: 'public'

$ cat hieradata/hosts/dbproxy1019.yaml
# These hosts support the wikireplica-web service
profile::mariadb::proxy::replicas::servers:
  labsdb1009:
    address: '10.64.4.14:3306'
    weight: 2
  labsdb1010:
    address: '10.64.37.23:3306'
    weight: 1
profile::mariadb::proxy::firewall: 'public'

My first impression when reading the ticket is that using the wikimedia.cloud domain to point to production IPv4 addresses feels weird.
Also, the svc subdomain, which belongs to no actual CloudVPS tenant, is confusing. I know this is explicitly mentioned in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS#*.{deployment}.wikimedia.cloud but this would be the first actual case.

In the following paste you can see all the .svc. subdomains belong to a particular project and there are no 'direct' svc FQDNs under the main eqiad1.wikimedia.cloud subdomain:

aborrero@cloudcontrol1005:~ $ sudo wmcs-openstack recordset list --all-projects eqiad1.wikimedia.cloud. | grep svc.eqiad1.wikimedia.cloud
aborrero@cloudcontrol1005:~ $ sudo wmcs-openstack zone list --all-projects | grep svc | grep wikimedia.cloud
| cfacd98f-19cf-4f92-98e0-f2f4dde08b75 | newprojectdomaintest1          | svc.newprojectdomaintest1.eqiad1.wikimedia.cloud.    | PRIMARY | 1583180029 | ACTIVE | NONE   |
| 808263bd-522d-42c3-88a1-40af24b2929e | tools                          | svc.tools.eqiad1.wikimedia.cloud.                    | PRIMARY | 1584559778 | ACTIVE | NONE   |
| 3706c91c-94bd-42d7-aaff-19dbe5752cc8 | entity-detection               | svc.entity-detection.eqiad1.wikimedia.cloud.         | PRIMARY | 1584538808 | ACTIVE | NONE   |
| aa355dad-c6c5-44de-994e-9b01803d9b57 | meet                           | svc.meet.eqiad1.wikimedia.cloud.                     | PRIMARY | 1586424653 | ACTIVE | NONE   |
| a7d20e51-7459-4ecd-8f90-81ac4960cf2e | cloudvirt-canary               | svc.cloudvirt-canary.eqiad1.wikimedia.cloud.         | PRIMARY | 1586545364 | ACTIVE | NONE   |
| 8bc300ef-3882-48c4-adec-a8cb8c2fb1d6 | metrics-infra                  | svc.metrics-infra.eqiad1.wikimedia.cloud.            | PRIMARY | 1586887107 | ACTIVE | NONE   |
| 59fc232a-41c0-47ad-8d24-8660e916db86 | metricsinfra                   | svc.metricsinfra.eqiad1.wikimedia.cloud.             | PRIMARY | 1586887373 | ACTIVE | NONE   |
| 3daac632-dd1b-4f0d-8db0-245965b7fe78 | sre-sandbox                    | svc.sre-sandbox.eqiad1.wikimedia.cloud.              | PRIMARY | 1588009649 | ACTIVE | NONE   |
| 345c1fbc-4f15-4ab5-b87e-38bd284b55b9 | paws                           | svc.paws.eqiad1.wikimedia.cloud.                     | PRIMARY | 1592322469 | ACTIVE | NONE   |
| 3a26c084-2912-48fc-8737-91f844ef5341 | chat                           | svc.chat.eqiad1.wikimedia.cloud.                     | PRIMARY | 1593683715 | ACTIVE | NONE   |
| 3bbd31fb-70b6-468e-a547-a9405482a35b | wmde-templates-alpha           | svc.wmde-templates-alpha.eqiad1.wikimedia.cloud.     | PRIMARY | 1594322678 | ACTIVE | NONE   |
| 079ca9f3-240c-4311-bbed-e13f55c1c61b | mailman                        | svc.mailman.eqiad1.wikimedia.cloud.                  | PRIMARY | 1594328462 | ACTIVE | NONE   |
| 41118b4c-e907-45cc-b2ed-7819d1c48b12 | newp                           | svc.newp.eqiad1.wikimedia.cloud.                     | PRIMARY | 1595452119 | ACTIVE | NONE   |
| 3a322b08-d448-4511-b643-7db0e5a08241 | mariadb104-test                | svc.mariadb104-test.eqiad1.wikimedia.cloud.          | PRIMARY | 1595452489 | ACTIVE | NONE   |
| 4b53a937-17c1-4689-9e80-b53a3371249c | redwarn                        | svc.redwarn.eqiad1.wikimedia.cloud.                  | PRIMARY | 1597336277 | ACTIVE | NONE   |
| 8001d0b5-4020-49c4-b838-22f3f42caefc | sccache                        | svc.sccache.eqiad1.wikimedia.cloud.                  | PRIMARY | 1597860126 | ACTIVE | NONE   |
| a02ba033-856c-4dab-9095-f45925b83fdf | pipelinelib-experimental       | svc.pipelinelib-experimental.eqiad1.wikimedia.cloud. | PRIMARY | 1604508599 | ACTIVE | NONE   |
| 7347b52a-0599-4d4d-a843-d07f3d8677bd | wikicommunityhealth            | svc.wikicommunityhealth.eqiad1.wikimedia.cloud.      | PRIMARY | 1604510318 | ACTIVE | NONE   |

proposal #1: update our DNS guidelines to warn against project-less svc records in the wikimedia.cloud domain.
proposal #2: consider allocating the record sets under the svc.clouddb-services.eqiad1.wikimedia.cloud subdomain, i.e, s1.analytics.svc.clouddb-services.eqiad1.wikimedia.cloud IN A 10.x.x.x

Moreover, so far the wikimedia.cloud domain only hosts addresses in the 172.16.x.x range, i.e., CloudVPS addresses. If I understand the ticket description correctly, we would be hosting production IPv4 addresses (either 10.x.x.x or 208.x.x.x). Perhaps the wikimediacloud.org domain is better suited for it? The next proposal is not a strong one, just me making sure we are evaluating the option:

proposal #3: consider using the wikimediacloud.org domain, i.e, s1.analytics.eqiad1.wikimediacloud.org IN A 10.x.x.x.

But, unfortunately, if we take into account the potential complexity of T266331: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation, what if we don't use a cloud domain at all for the DB proxies? This also makes sense when we soon try to introduce PTR records. In proposal #2, for example, we would need to create a PTR record on production DNS servers for a cloud domain. Proposal #4 below feels more correct when you think about the PTR records:

proposal #4: consider using a production domain for production IP addresses, i.e, s1.analytics.svc.eqiad.wmnet IN A 10.x.x.x

Also, mind cloud --> production network isolation. I strongly advise against using private addressing for the proxies; that will make things more difficult in the near future. I'm not even sure the .svc. keyword would be needed if using wikimedia.org.

proposal #5: consider using production public IPv4 addresses, ie: s1.analytics.svc.wikimedia.org IN A 208.x.x.x

Given that the wiki replicas are mostly a cloud-specific service, I would like to have the 'cloud' keyword somewhere in the FQDN. Perhaps the right thing to do is to follow the DNS pattern in proposal #2 and make sure we use production public IPv4 addresses on the proxy side, i.e., a new proposal combining #2 and #5:

proposal #6: consider using production public IPv4 addresses in a project-specific service subdomain, i.e, s1.analytics.svc.clouddb-services.eqiad1.wikimedia.cloud IN A 208.x.x.x

This proposal 6 would end up creating a cloud domain PTR record in a production DNS zone, which also feels weird.

NOTE: @aborrero sees drawbacks in basically all proposals, but perhaps proposal #5 is the one that makes more sense.

Proposal #4 and #5 would directly clash with stuff we do in production, and/or create confusion as to what services are for cloud and which aren't, and I strongly oppose them.

I have no preference regarding the other proposals.

I'm leaning towards this solution (which wasn't explicitly stated in my previous comment):

  • use production public IPv4 address
  • use the wikimediacloud.org domain or even a replica-specific subdomain

Example:

s1.analytics.db.eqiad1.wikimediacloud.org IN A 208.x.x.x
s1.web.db.eqiad1.wikimediacloud.org IN A 208.x.x.x

This may have netbox implications (PTR records etc, mind T266331: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation) but at this point seems to be the lesser evil.

For the netbox implications:

  • if the IP addresses used are in a prefix already "automated" the PTR records will be automatically generated and included in the DNS
  • if the domain names are within a specific $ORIGIN we can manage them as autogenerated or not independently from the rest of the $ORIGINS in the same zonefile
  • I would suggest that we go into a direction where either both direct and reverse records are autogenerated or they are both manual, avoiding a mixed configuration.

I've been trying to understand the whole setup and failed.
Despite going through https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas (which seems outdated? no mention of dbproxy, for example) and T260389.
Maybe because I don't have much knowledge about Wiki Replicas?

It would be useful to have a WikiTech page explaining it all. With diagrams.

About IPs specifically:
Where will they live? On clouddb hosts? on dbproxy hosts?
What needs to reach them? CloudVPS VMs?
Why are multiple IPs needed?

A quick look shows:

enwiki.web.db.svc.eqiad.wmflabs -> s1.web.db.svc.eqiad.wmflabs
s1.web.db.svc.eqiad.wmflabs -> 10.64.37.28
10.64.37.28 is also dbproxy1019.eqiad.wmnet

So my guess is dbproxy?

We don't allocate non-host IPs from the host ranges (the few that exist are legacy).
VIPs (extra IPs) are usually for load balancing (or active/passive setups), for example through LVS (with the .svc. prefix).
Speaking of load balancers, T260389#6384054 mentions a load balancer IP; is it possible to have more information about it?

About DNS names: 10.64.x addresses will always have a .eqiad.wmnet FQDN and 208.x addresses a .wikimedia.org FQDN. What's managed by Neutron (and in WMCS) is then less relevant for SRE.

About using public IPs, the same questions apply: on which hosts will those IPs live? What are the requirements?

I should be able to help you better once I understand the whole picture a bit more.

Despite going through https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas (which seems outdated? no mention of dbproxy, for example) and T260389.

I'll review the page. We probably didn't go into the proxy layer in it except in the operational docs. That document also describes the replicas as they are, not as we are building them. The dbproxy setup is quite simple as it is, but it must become more complex.

The basic idea is that right now, there are two dbproxy servers that cloud users can contact. Those are used to load balance and apply weights and pooling/depooling functions to the actual database servers that are the wikireplicas. Cloud users do not contact the wikireplicas servers directly (wikireplicas servers are named labsdb10* and now clouddb10*).

About IPs specifically:
Where will they live? On clouddb hosts? on dbproxy hosts?

On dbproxy hosts. The idea here is that cloud users can use the default mariadb port to connect. The proxies then route to the multi-instance, multi-port setup on the backend servers.

What needs to reach them? CloudVPS VMs?

Yes. CloudVPS VMs.

Why are multiple IPs needed?

Currently, we have an unsustainable setup of using single mariadb instances that have multisource replication. That means ALL of the wiki databases are accessed on the same IP/port, and the dbproxy servers just serve to provide depooling, weights and load balancing. The naming was set up from the beginning to allow, one day, different IP addresses to correspond to different database sections (s1, s2, s3, s4...). They have all pointed at the same two IPs for years. Since we have reached the breaking point and must stop using this sort of mass multi-source replication, we are trying to break up the backend replicas into multi-instance database servers using different DB ports than 3306. The proxies will have to do much more complex routing, knowing which server and port to send clients to. Since this is TCP proxying on the mysql protocol, you cannot route based on paths and names easily (I have never seen it successfully done for that protocol). We need IP addresses that correspond to each section. Then the proxy frontend for each IP will know its set of backends to route to.
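To make that concrete, here is a purely hypothetical sketch of what a per-section proxy configuration could look like, in the same hieradata style as above. The hiera key, VIPs, hostnames and ports are made up for illustration; the real config will be whatever comes out of the puppet work.

# Hypothetical sketch only, not the real puppet schema: one frontend
# address per section, each pooling that section's multi-instance port.
profile::mariadb::proxy::multiinstance::section_frontends:
  s1:
    listen: '10.64.x.1:3306'          # placeholder s1 VIP on the proxy
    servers:
      clouddb1013:
        address: '10.64.x.x:3311'     # assumed s1 instance port
        weight: 1
  s4:
    listen: '10.64.x.4:3306'          # placeholder s4 VIP
    servers:
      clouddb10xx:
        address: '10.64.x.x:3314'     # assumed s4 instance port
        weight: 1

Each section frontend listens on its own address on the standard port 3306, while the backends it pools are the multi-instance mariadb ports for that section.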

We don't allocate non-host IPs from the host ranges (the few that exist are legacy).
VIPs (extra IPs) are usually for load balancing (or active/passive setups), for example through LVS (with the .svc. prefix).
Speaking of load balancers, T260389#6384054 mentions a load balancer IP; is it possible to have more information about it?

dbproxies are database load balancers running haproxy. They aren't part of the LVS setup, but they currently do use the svc prefix (with cloud domain naming). I am hoping y'all can help me know which IPs to correctly use here because I have no idea.

@Bstorm maybe a stupid question, couldn't we use different ports on the proxies, one port per section, using the same IP address like we do in the multi-instance DBs in prod (see also profile::mariadb::section_ports) ?

@Bstorm maybe a stupid question, couldn't we use different ports on the proxies, one port per section, using the same IP address like we do in the multi-instance DBs in prod (see also profile::mariadb::section_ports) ?

Not a stupid question at all! The very short answer is that we don't want to dramatically increase complexity for volunteer developers. It's more complex than that, though, including the design of the entire service and other services supported by WMCS, like PAWS.

If you look at current wikireplicas DNS (we use scripts to control it in puppet and the config is here modules/openstack/files/util/wikireplica_dns.yaml), the intent has been to simplify connecting to the replicas all along while allowing us to make complex backend changes where necessary. We have a system of CNAMEs for each wiki database name where, if you connect to enwiktionary.analytics.db.svc.eqiad.wmflabs, you will be routed to the correct "s#" for that wiki, and always have been. We simply had never gotten around to actually splitting the replica servers up into multi-instance. This is just the implementation of how the wiki replicas were intended to work. We need IP addresses because host-based routing isn't possible on the mysql protocol. So my first answer is that this is really just continuing and completing the existing design of many cloud services that depend on the wikireplicas, as it was prepped years ago, now that we have the hardware to do it.

The second part is why it was all built like this in the first place: Using different ports requires volunteer developers to learn the sections layout (https://noc.wikimedia.org/db.php), the section-to-port mappings and then update all their code that currently can just connect to a server named "enwiki" or any "s#" they like so that it connects to the right database. That's fine for someone who works here or regularly contributes to Mediawiki prod, but it is a pretty steep learning curve for volunteer developers who are going to have enough trouble adapting to needing different hostnames for different databases. We have the DNS already prepared for that style of coding and have for years (even before I was hired), but people are aware that they didn't always need it. They will need it soon. That's easier to communicate to and retrain our user community than trying to teach them the ins and outs of profile::mariadb::section_ports. Even my own code in cloud would break if I had to change to that without lots of work for PAWS, Quarry, Toolforge tooling, etc. We keep the complexity and implementation details on the other side of the proxies, if that makes sense. Wikireplicas are a cloud community service, not just a backend for our use, so they need to be understandable to a new developer with good ideas.

That's not touching on existing published notebooks, examples and Quarry queries that are published for community consumption already that we'd rather do our best not to break. This will break cross-database joins in many cases as is, but that cannot be helped.

This might help discussion around this as well: https://gerrit.wikimedia.org/r/c/operations/puppet/+/627379

The IP addresses in the patch are just to illustrate the concept and are intentionally incorrect/impossible. I would need the correct IP addresses to move forward with that whole thing.

Also the host is clouddb1020 because that's going to be used for dbproxy puppet testing. Please understand that these IPs are not to be used on clouddb servers in actual deployment.

On the naming:

I'm leaning towards this solution (which wasn't explicitly stated in my previous comment):

  • use production public IPv4 address
  • use the wikimediacloud.org domain or even a replica-specific subdomain

Example:

s1.analytics.db.eqiad1.wikimediacloud.org IN A 208.x.x.x
s1.web.db.eqiad1.wikimediacloud.org IN A 208.x.x.x

This may have netbox implications (PTR records etc, mind T266331: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation) but at this point seems to be the lesser evil.

This would probably make good sense, since these would be analogous to the service IPs for the Openstack APIs. Our current setup using eqiad.wmflabs in Designate is kind of confusing, and we could probably still use something like the wmcs-wikireplicas-dns script setup to create all the CNAMES in designate with the service names either using the clouddb-services project or a project just for the wikireplicas. I think we'd also want the "svc" subhost in the DNS, though, wouldn't we? Like: s1.analytics.db.svc.eqiad1.wikimediacloud.org? eqiad1 is an openstack region name for us, but I don't think that is very problematic to reuse for this. Then Designate would just be making cnames like enwiki.analytics.db.svc.clouddb-services.eqiad1.wikimedia.cloud pointing at s1.analytics.db.svc.eqiad1.wikimediacloud.org.

I think we should avoid using the wikimedia.cloud domain at all.

Also, why do we need the svc subdomain? I believe the whole wikimediacloud.org domain is service-oriented. Take for example openstack.eqiad1.wikimediacloud.org. This FQDN is a service address, and it doesn't contain any svc for this reason.

This is the naming scheme I'm talking about:

enwiki.analytics.db.eqiad1.wikimediacloud.org IN CNAME s1.analytics.db.eqiad1.wikimediacloud.org
s1.analytics.db.eqiad1.wikimediacloud.org     IN A     208.x.x.x

This DNS data being managed by production DNS services (and operations/dns.git). No openstack designate involved.

This DNS data being managed by production DNS services (and operations/dns.git). No openstack designate involved.

So we would no longer have automation on adding new wikis to the replicas for DNS?
Managing that piece by script has some advantages. Also that would mean adding something like 900 new cnames to operations/dns.git by hand that are only for use on the cloud VMs. That's doable, but right now the script keeps things in sync nicely. The script reads from noc.wikimedia.org and makes changes accordingly. That way it is always as correct as the upstream info. If we abandon that, we now have a situation where we have to keep ops/dns up to date by hand for a fair-sized data set.

We could do the same against operations/dns, but that's an entirely different framework, so that would require a rewrite of wiki-replicas-dns and moving the non-replicas material out of that. I'm not saying it's the wrong answer, but that's a fairly large scope creep there that will require additional coordination (because this option would mean a script making unreviewed changes to ops/dns and then syncing/reloading things).
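To illustrate the shape of that automation (a minimal sketch only, not the real wmcs-wikireplicas-dns code; the hardcoded mapping below stands in for what the script actually pulls from the upstream config):

# Hypothetical sketch of the CNAME-sync idea: read an upstream
# wiki -> section mapping, then emit one CNAME per wiki for both
# service types. Names and the mapping source are stand-ins.
SECTION_OF = {          # in reality this comes from the upstream config
    "enwiki": "s1",
    "frwiki": "s6",
    "wikidatawiki": "s8",
}

def desired_records(domain="db.svc.eqiad.wmflabs"):
    for wiki, section in sorted(SECTION_OF.items()):
        for svc in ("analytics", "web"):
            yield f"{wiki}.{svc}.{domain}.", f"{section}.{svc}.{domain}."

for cname, target in desired_records():
    # the real tool would create/update these through the Designate API
    print(f"{cname} IN CNAME {target}")

Multiply that by roughly 900 wikis times two service names and it's clear why keeping it script-driven is attractive.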

Actually, I'm willing to say that for purposes of the wikireplicas rebuild, moving the db cnames to gdns really should be considered out of scope, @aborrero, unless I'm missing something in terms of how it could be accomplished very simply and effectively.

Thanks for the explanations.

Another criterion to consider is reducing the number of firewall holes from CloudVPS VMs to production private (see T267789).

Maybe this would benefit from a doc similar to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/notes/NAT_loophole/NFS.

There are 2 interdependent topics: 1/ IPs to use and 2/ DNS domain to use
Whatever IPs we go with, they will always have A/AAAA/PTR DNS records that follow production standards. Then it's up to WMCS to add the relevant extra A/CNAME records you need (as long as they don't go against prod standards and maintainability).

Using host vlan (subnet) IPs (eg. 10.64.37.0/24 or 208.80.154.0/26) for anything other than hosts' interface IPs is not something we support; the few existing cases are there for historical reasons and are a pain in our automation.

All Virtual IPs should use LVS, (for example the 10.2.2.0/24 and 208.80.154.224/27 ranges), even if there is only 1 backend.

Because of the need to reduce/eliminate the number of flows from CloudVPS to private, public IPs should be preferred.

Unfortunately due to IPv4 exhaustion we can't accommodate 16 (or more) public v4 IPs. This could be a great use of v6 space, but it's not supported yet in CloudVPS (see T37947).
However 1 or 2 are fine.

About using ports vs. IPs: with the above, T267376#6614181 becomes a question of moving the complexity either onto SRE (by using a non-standard setup) or onto CloudVPS users.

Some extra thoughts/considerations:

  • Using public IPs, services would still be ACL protected to only be reachable from relevant clients
  • Could those multiple (user-facing) IPs live within CloudVPS, with the 1 public IP used to sync data from prod to CloudVPS?
  • It looks like MySQL supports DNS SRV records, maybe that could help users with using ports?
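On the SRV idea, a hypothetical record could carry the port along with the target host (names, TTL and port below are purely illustrative):

_mysql._tcp.s1.analytics.db.svc.wikimedia.cloud. 300 IN SRV 0 1 3311 clouddb1013.eqiad.wmnet.

That is priority 0, weight 1, port 3311, target clouddb1013.eqiad.wmnet; a client library with SRV support could resolve host and port from DNS instead of having the port spelled out in its own config. Whether the relevant client libraries actually support this is exactly the open question.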

I had a meeting today with @Marostegui trying to make a bit more sense of all this. I focused on creating diagrams that hopefully will help us all be on the same page.

First, the current status. This is how wiki-replicas work today:

wikireplicas-current.png (772×2 px, 178 KB)

About the current status:

  • there are the fundamental problems both @Bstorm and @Marostegui are trying to solve around scalability, service reliability, etc. This is the main reason why the wiki-replicas are being reworked in this project.
  • there are network connections from CloudVPS private addresses to production private ranges, using the infamous cloud-support vlan. This is undesirable for several reasons.
  • the labsdb servers (the wiki-replica servers themselves) are also sitting in the infamous cloud-support vlan.

Then, we have the desired final status, which is something like in this diagram:

wikireplicas-desired(1).png (772×2 px, 198 KB)

I tried to incorporate feedback by several of you, some key points:

  • @ayounsi mentioned we would need to use LVS with a public IPv4 address in front of the proxy layer. I'm not sure if that fits the plans by @Bstorm to have 16 IP addresses in the proxy layers. I guess not.
  • @Marostegui confirmed that there is no need for either the proxy layer or the wiki-replica servers to live in the cloud-support vlan, and we could potentially use any production private IPv4 range to host them.
  • connections from CloudVPS client VMs will hit the LVS public IPv4 address, coming from the private VM addresses (172.16.x.x)

@Marostegui also mentioned that there would probably be a period of 'co-existence' of the two setups, a transition period. I'm including a diagram for completeness, but it is really difficult to capture a dynamic thing (a transition) in a static thing (the diagram), so take this with a grain of salt:

wikireplicas-transition.png (1×2 px, 325 KB)

From my perspective, this whole physical layout should be discussed together with the DNS domain question.
Another thing to consider is whether we need CloudVPS VM private addresses leaking into prod at all (be it to LVS or the dbproxy layer). So here is my question for @Bstorm and @Marostegui: can the dbproxy handle / is it accepted / is it desirable to see all incoming mysql connections as coming from a single public IPv4 address (nat.openstack.eqiad1.wikimediacloud.org, 185.15.56.1)? Do we need to know which client is using the proxy (for ratelimiting, access control, etc.)?

Another thing to consider is whether we need CloudVPS VM private addresses leaking into prod at all (be it to LVS or the dbproxy layer). So here is my question for @Bstorm and @Marostegui: can the dbproxy handle / is it accepted / is it desirable to see all incoming mysql connections as coming from a single public IPv4 address (nat.openstack.eqiad1.wikimediacloud.org, 185.15.56.1)? Do we need to know which client is using the proxy (for ratelimiting, access control, etc.)?

Since mysql sessions are distinct on the same server, I don't *think* it should matter. The proxy is just haproxy in this case. It would just see a lot of NATing. Connection and access control has all been done at the mariadb level to date. We depool connections at haproxy and apply weights, but that affects backends, not frontends. I cannot think of a reason that would be a problem.

Ok,

so this is our plan:

client 
->
 enwiki.web.db.eqiad1.wikimediacloud.org IN CNAME s1.web.whatever.wikimedia.org
 ->
   s1.web.whatever.wikimedia.org LVS (208.x.x.x)
   ->
     dbproxy1019 s1 address (10.x.x.x) haproxy
     ->
       clouddb1013 mysql


client 
->
 enwiki.analytics.db.eqiad1.wikimediacloud.org IN CNAME s1.analytics.whatever.wikimedia.org
 ->
   s1.analytics.whatever.wikimedia.org LVS (208.x.x.x)
   ->
     dbproxy1018 s1 address (10.x.x.x) haproxy
     ->
       clouddb1013 mysql
  • we delegate the subdomain db.eqiad1.wikimediacloud.org to designate @ eqiad1
  • we create arbitrary CNAME records in that subdomain for the wiki.{analytics|web} combos pointing to the LVS FQDNs (in the form of s1.(analytics|web).whatever.wikimedia.org)
  • LVS hosts 1 public IPv4 address per section per (analytics|web) combo. That's 8x2=16 public IPv4 addresses
  • LVS has exactly 1 host as backend, either dbproxy1019 (web) or dbproxy1018 (analytics). The dbproxy hosts use private 10.x addresses.
  • HAproxy running in dbproxy servers pool/depool the db sections in each final backend mysql server (using 10.x addresses)

The general network flow is what was depicted in the following diagram:

wikireplicas-desired(1).png (772×2 px, 198 KB)

please @Marostegui and @ayounsi validate this makes sense so we can move forward.

Thanks Arturo for working this out. I don't have much to say here, as this is mostly owned by WMCS; we simply expose our clouddb ports behind the HAProxy. Whether this makes sense from a network point of view is something for @ayounsi.
I would assume, though, that the logic behind all this allows us to simply pool/depool hosts directly in the haproxy config (like we do now) and we don't have to touch anything DNS-based?

The question that I have now is how to depool an entire HAProxy.
Let me elaborate on that.
Right now our two proxies (dbproxy1018 and dbproxy1019) have just one cname:

wikireplica-analytics    5M  IN CNAME    dbproxy1018.eqiad.wmnet.
wikireplica-web          5M  IN CNAME    dbproxy1019.eqiad.wmnet.

If we need to take one down for maintenance, we simply change the cname to point to the other one.
With this new model, it is not clear to me how this would be achieved.

I would assume, though, that the logic behind all this allows us to simply pool/depool hosts directly in the haproxy config (like we do now) and we don't have to touch anything DNS-based?

Yup! That's the idea.

The question that I have now is how to depool an entire HAProxy.

That's a good question. If the DNS lives in Designate, we'd probably have to provide a straightforward way to depool the entire host (redirect traffic) in spicerack/cookbooks.

Bstorm updated the task description. (Show Details)

Removed the reference to clouddb1020. Testing with that server would make the networking much more problematic. We will just have to test with one of the proxies by taking it out of the pool once this is sorted out.

Change 645114 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: let clouddb1020 join the party

https://gerrit.wikimedia.org/r/645114

Change 645114 merged by Marostegui:
[operations/puppet@production] wikireplicas: let clouddb1020 join the party

https://gerrit.wikimedia.org/r/645114

  • LVS hosts 1 public IPv4 address per section per (analytics|web) combo. That's 8x2=16 public IPv4 addresses

That many v4 IPs is going to be hard to do (see T267376#6620374)
After reading T267376#6614181 I'm still not understanding why using distinct ports instead of distinct IPs is not doable?
The first part looks like tooling that needs to be updated.

Would it not be possible to connect to (for example):
wikireplicas.whatever.wikimedia.org:3301 instead of s1.web.whatever.wikimedia.org:3306
Or wikireplicas.whatever.wikimedia.org:3302 instead of enwiktionary.analytics.db.svc.eqiad.wmflabs.
And keep a service/FQDN/port mapping instead of doing such mapping using CNAMES?
LVS would "forward" those ports to the relevant ports on the dbproxy hosts and dbproxies to the relevant DB.

Using different ports requires volunteer developers to learn the sections layout

I'm not sure that's true. They would need to "learn" (or look up) that frwiki is wikireplicas.whatever.wikimedia.org:XXX, but they would not need to learn the actual sections. Associating a database with a fixed FQDN + a specific port doesn't seem to be that much of a difficulty for a user, especially as it's quite common to do so on the Internet. Ports are cheap, v4 IPs not so much :)
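As a sketch of what that looks like on the client side (assuming pymysql; the hostname, port and credentials path are made-up placeholders, not a proposal of actual names):

# Most client libraries already take a port argument, so switching a tool
# to port-based selection is usually a small change like this one.
# Host, port and credentials path below are illustrative assumptions only.
import os
import pymysql

conn = pymysql.connect(
    host="wikireplicas.whatever.wikimedia.org",  # single shared FQDN (hypothetical)
    port=3311,                                   # per-section port (assumed)
    read_default_file=os.path.expanduser("~/replica.my.cnf"),  # Toolforge-style credentials
    database="enwiki_p",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")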

It's not only about learning, entirely. It is a lot of internal infrastructure about the Foundation to learn for a student programmer, and I do not understand for a moment how that would be controversial (after spending a couple of years supporting them). The idea behind our frontend setup is that nobody has to look up anything other than the wiki database they are connecting to (and even that has already made our customers unhappy). Most of the people conversant in the sections work for WMDE or the Foundation, so we have actually kept it mostly out of the communication plan.

As is, the popularity of the service is partly due to its ease of use. You can connect to a single name and access all the public wiki database information outside of content. The change to multiinstance in general has already generated significant concerns among the user base. Some may simply take down the tools that the wikis use for things like vandalism protection and just not do them any more as is, and the quarterly goals related to this (including the communication plan already in flight) have hinged on the idea that the infrastructure built four years ago using DNS would now become the method that is mandatory to connect instead.

If we pivot now to a less convenient and more support-heavy system (support in the sense of WMCS techs spending more time in education and assisting with code refactors), that will not be a good look, for sure. I also think it is a very bad idea for the longevity of the system as a core piece of many wiki tools before we can provide an alternative (which we are working on with other teams).

Let me be 100% clear about something here, too: I am not opinionated at all about where these IPs need to end up, VLAN-wise. They just need to be connected to cloud VMs on one end, and wikireplicas servers on the prod network on the other. The proxies *could* even become VMs (making IPAM entirely my problem but requiring perhaps a new firewall hole?...or maybe the same firewall hole changed a bit) with appropriate documentation so the Data Persistence team can continue to support operations on them. We are mostly just asking if there is a way to do this on hardware (which is the current design). If there genuinely is not a way to do it, then perhaps we need to find a way to do it inside cloud. I prefer to do it this way so that *no* VM can access the database servers directly, but if you think it would be all the same (and @Marostegui agrees as well) we could make it so that some special VMs that end-users cannot log into (other than cloud-only roots) can connect to them. Another option is to put them in a cloud-services VLAN inside the cloudsw setup :)

I do still believe we need this, but I don't have strong feelings about exactly how it happens as long as it works in the end.

After reading T267376#6614181 I'm still not understanding why using distinct ports instead of distinct IPs is not doable?
The first part looks like tooling that needs to be updated.

Yes, this is "only" a matter of changing every client that connects to the Wiki Replica databases to add port number variation. There is even a reasonably well documented port based selection method for a similar multi-instance database cluster in the Analytics network.

The part that is probably not obvious to folks outside of the Technical Engagement team is that this collection of clients is very large--hundreds to thousands of bots, webservices, utility scripts, and services--and that there is no central codebase for these clients. There is no reasonable way to search across all of the potentially affected code. The clients include code written in PHP, Python2, Python3, Ruby, Java, NodeJS, Perl, and likely even more exotic programming languages. These tools are maintained by hundreds to thousands of separate volunteers, and although we have mailing lists and wikitech site notices to use to notify people of breaking changes we know from experience that even a long and high effort information campaign will not reach a significant percentage of these tool maintainers.

Requiring port based section selection rather than IP based just will not work for this service. Not because it is technically impossible, but because conversion of the existing clients in a reasonably short amount of time and with reasonably limited disruption to the end user community for these tools is improbable.

Would it be feasible to use SNI on haproxy to select the backend? It looks like it has SNI support.

It feels like there are multiple issues being discussed here, so perhaps it's worth breaking this down and talking about some of these issues separately? The last few comments seem to be about the IP numbering and assignment issue, so I'll focus on that below.

From what I understand from the discussion above, this is a service that's going to run in production, and offered to users in at least the cloud services guest network. Assuming that's right, we would consider this a service exposed from production to the "public" (even if limited to the guest network that lives adjacent to production). Per past conversations and conventions, such services need to be exposed by the (production) public network, i.e. standard service IPv4/IPv6 addresses, with wikimedia.org hostnames, ideally using our standard ingress (layer 4 loadbalancers/LVS), accessible by private and public networks, including the WMCS tenant network. While possible to revisit this decision, and create a new private range that we would treat in a special way ("public-for-WMCS") it feels to me so far like the reasons that we decided our current strategy still stand, despite this new additional challenge.

It seems there is now a request for 16 IPv4 addresses, one per database section, and based on the above, 16 public IPs, for which there is a bit of a pushback above. To put the pushback to this idea a bit into context: IPv4 address space has been exhausted globally, and we need to be careful with how we assign IP space, both because of a limited runway, and because there are RIR policies we have to adhere to, for the conservation of IP space. Our strategy for this has been to invest in IPv6 (rather than buy/hoard IPv4 space). While we still do have enough addresses in eqiad, and we are still assigning new IPv4 for new projects, we still need to treat this as a constrained resource and be mindful of new assignments. Furthermore, while the service can be (and would be) IPv6-enabled, this does not seem like it would resolve anything here, as the timeline for IPv6 in WMCS extends further than this project's, plus we'd also run into the same risks of legacy code support that may not be IPv6-ready. Does that part make sense, and are my assumptions right?

So, @ayounsi's proposal as I understand it is to run this from a single IP, with multiple ports, one per service/instance/db section. This is, on its face, a reasonable idea -- ports exist to expose multiple services off one IP after all :) However, it seems that in this case, as @Bstorm and @bd808 are pointing out, there is a risk of code (written in a diverse set of languages, maintained by a hundreds of individuals, most of whom volunteers with limited time, if currently present and available at all) that would not support setting a non-default port, which would complicate and prolong this project.

Given we have a trade-off to make here, would it be possible to quantify this risk and perhaps the effort required to assist with that migration? We know that for a lot of code, setting the database port is fairly standard and widely used (e.g. by public hosters), but it sounds like there may be a tail of tools that will not support that. Would it be worth figuring out how long of a tail this may be? I am guessing that auditing all of these consumers is hard or perhaps even an impossible task, so could it perhaps make sense to perhaps e.g. check either a random sample of them, and/or the top-N bots/frameworks? I also realize that even _that_ smaller scale analysis will absolutely take some time and effort by someone. However, I think it's important to emphasize here that there are no options here that do not require time and effort by someone, either in the present or in the future (i.e. debt to be paid at a later point). IP space costs time, and increasing amounts of effort and money; nonstandard exceptional setups create work in the present and risk, problems and work in the future, etc.

Also, are there any other options or stopgaps to explore here that could either mitigate this, or give tool authors a reasonable amount of time to adjust? Any ideas?

Hope all this made sense - happy to hear everyone's thoughts.

From what I understand from the discussion above, this is a service that's going to run in production, and offered to users in at least the cloud services guest network.

That's mostly right, but it is based on the assumption that things will remain as they currently are. Right now, that is entirely true. All of the DNS entries now point to two physical proxy hosts in the production network at their primary, internal IP addresses. So it is more accurate to say this is a service that does run in production as currently laid out. The consumers are definitely in the cloud system. The ask is to enable the next generation of that service, hopefully with better compliance with the Foundation's security and network design expectations.

The version here with 16 new IPs on the existing hosts is the short path in terms of the full infrastructure of the wiki replicas (which includes DNS orchestration via scripts in cloud, PAWS, Quarry, etc.) because so much is already done in this direction. This was the envisioned design when all the DNS aliases were created as a way to find the right IP for the right database. They've just until now been pointed to the same IP as long as all sections were hosted on single hosts (multi-source replication...which we now know crashes and causes all kinds of other issues like blocking upgrade to buster). However, that's just the shortest operational and design path that matches the already-executed communication plan with the public.

A note on IPv6, since the main consumer of the wikireplicas is Toolforge:

  1. Grid engine doesn't support ipv6 and never will, unfortunately (dead software)
  2. Kubernetes will and sorta-kinda does, but yes, client libraries may be a problem.

That's made IPv6 a non-starter so far without even touching other libraries, unfortunately.

On the issue of the scarcity of IPv4 addresses, the ongoing usability of this service is critical to the maintenance of all of the wiki anti-vandalism tools in the cloud that I am aware of, making it something I certainly am willing to sacrifice resources for (and would trade some servers for the IPs if I had to), but I do understand that we only have so many, especially in the public realm. I have no particular desire to put these proxies in public, but that is my understanding of the rules. If they were placed in another VLAN or in VMs (requiring the particular proxy VMs to access all 12 backend database servers in production), I'd be just as happy if that were allowed.

Also, as @Kormat pointed out, mysql/mariadb cannot route by hostname, which was something I really tried early on in this. For haproxy, in particular, the only way to route to, load balance and allow depooling of the correct backend instances (and to provide a layer in front of the databases that cloud instances contact rather than contacting the database servers themselves) is to expose ports or IPs. You cannot address ports in DNS, so we have ~900 DNS CNAMEs that our users use, which correspond to wiki databases and route to the correct section IPs. That system was a lot of work put in place 4 years ago to get us to this point, and it is what we've educated our users to use over time.

Given we have a trade-off to make here, would it be possible to quantify this risk and perhaps the effort required to assist with that migration?

Hmm...
The existing plan is taking up perhaps the largest portion of my time to implement and a significant portion of the time of @Jhernandez (the dev advocate on this), because the actual splitting of the instances is causing a lot of discussion, generating tickets, and has taken down a number of Quarry queries and so forth, all in order to deal with the fact that cross-wiki joins will not work. The response of several community members has been to simply preemptively take down their work because not having things on the same server makes it harder, so we are trying to support them through it. This aspect has also taken time from my management in terms of planning and problem solving. It is requiring a fair bit of education on our part and has required some data mining as well to find cross-wiki joins and try to ease the fears and concerns of our user base. In some cases, there are use cases that simply will not work anymore at all just because we are splitting into multi-instance replicas. One example of work on that end is T267992. This is all without requiring users to look up which section each wiki belongs to. The idea of placing the burden of the port/section structure on the end users is not one I have a lot of hope for at all: it requires us to pivot our communication in the middle of this contentious process and explain that they not only have to look up the section of the wiki in question, but also have to connect to non-standard ports. It is possible to instrument looking up the section via a connection to noc.wikimedia.org or a full checkout of the mediawiki-config repo, but that's a fair bit of dev effort for small tools that are sometimes literally just an SQL script that runs on a schedule to feed into a webapp or bot. Others are highly relied-upon but largely unmaintained web applications that will effectively need a new library created to do that mapping for people in python, php and rust at the very least. I could start over on the Quarry and PAWS refactor to make them work with it, and though it won't be as performant as the existing DNS-based solution, I could do that since those don't expose a low level to end users.

The IP-based plan requires users to simply do things as they have always been told to do them instead of exploiting the implementation detail that the wikis all live on the same server, and that is a pretty big ask with significant resource consumption already.

I should say that I am not terribly worried about most actually-used programming libraries not being capable of using a port argument (though I am sure some of our users could find some). I am worried that people will simply shut down their tools or let them break instead of climbing this new hill, unless we can invest developer time to provide libraries to do the lookups in python 3.5+, python2 (unfortunately), php (several versions of compatibility) and, oddly, rust (several magnustools), just to cover the most popular tools, because of how they have already responded so far. This is informed by the responses some people have given to the multi-instance changes even though they haven't even happened yet (several editors removed Quarry queries rather than having to update them to connect to specific databases) and by how some tool maintainers reacted to other changes in the environment (upgrading Grid Engine, etc). We have to make changes, but the more changes we can abstract away from the user experience the better. Cloud users are pretty forgiving about many things, but they tend to be less so with hard breaks that appear to have been optionally chosen by WMCS.

We also would want to have simple shell tools to look things up for SQL users on Toolforge (this is not hard in that particular case, but it requires time). I would also suggest that we'd need to commit significantly more time and resources into developer advocacy to work with and re-educate our users, provide code samples and so forth well beyond what they already are doing. Ideally, this would include rewriting segments of code for the users if a WMF-maintained library wasn't available. Please note, there are no WMCS-maintained programming libraries outside of those required to launch on Toolforge at this time because we are not resourced for it.

Even with a lot of this support, I believe (informed conjecture here) we'd eat some losses in terms of tools and edits to the wikis. This is a cost we have experienced for many necessary changes that is hard to quantify because sometimes tools keep working even though they've gone unmaintained (until they break, and this is a breaking change). It would be helpful perhaps to start maintaining some kind of metric around this, but we don't do it so far.

To take a stab at the scale of the user base we are talking about supporting on this kind of transition:
Every tool has access to the replicas, and there are a couple thousand of them. There are 938 namespaces with at least one running pod on the Kubernetes cluster at this time, with most of those being actual tools (we have less than 10 non-tool namespaces) https://grafana-labs.wikimedia.org/d/000000004/tools-activity?viewPanel=4&orgId=1. There are many tools that run scheduled jobs or run webservices on the grid, which makes it a bit trickier to quantify which tools actually see use. Right now, there are 569 Grid jobs running https://sge-status.toolforge.org/. Most (if not all) tools that do anti-vandalism hit the replicas, and the majority of candidates for "Coolest Tool Awards" did when I was helping with that. It should be emphasized as well that there is not a one-to-one correspondence between the most useful tools and the most well-maintained tools. Some things are very tricky to get updates on, which is why large changes in cloud services require communication and support resourcing. The petscan tool certainly hits the replicas, since it has special increased limits. Any code that hits any replica would need changes to keep running in a port-based solution (from the numbers above that's up to ~1400 bits of scattered code). The grid jobs are a firmer number because cron jobs and grid web services (which are difficult to launch because the process is mostly undocumented) are rarely "hello world". I know many K8s webservices are simple "hello world" attempts. It's hard to say how much of the 30% of edits that come from Cloud Services involve a replicas query from all this. This doesn't count the Quarry users (who are generally checking ideas or doing research rather than making edits) or PAWS users. I did the numbers on PAWS, and 303 notebooks connect to the wikireplicas at this time. Those would all break without changes if ports were used, and they would not break if DNS is used unless they do a cross-wiki join or connect to a different wiki than they say.

Another measure of scale of use (at least) that I have is the general query logging of the wikireplicas themselves. We were not able to turn it on for any real length of time because of the rate of querying that comes in, but when @Marostegui turned it on for 120 seconds it produced a 17MB log file (with 4702 instances of the word SELECT in the log). If we extrapolate from that, an hour would fill roughly half a GB, which is why it's not turned on. That's only one wikireplica server. There are three available to the users at this time. The one that was tested is one of a load-balanced pair. That's cool, but I have no way to translate that into lines of code.

Also, are there any other options or stopgaps to explore here that could either mitigate this, or give tool authors a reasonable amount of time to adjust? Any ideas?

My initial reply is a bit scattershot and long, but I had time because my laptop broke and puppet is hard to check out sometimes on a slow network. 😆

As for other options:

  • We have a cloud vlan that we'd been abandoning in favor of public IPs; we could un-abandon it?
  • This layer with 16 VIPs could also live inside cloud in more ways than one. It's absolutely hideous, but it could be that the 16 VIPs are on VM proxies, and those VM proxies connect to the right ports on the physical proxies, which then connect to the database servers on their respective IPs and ports. I have no idea if mariadb would behave well through all that, but it definitely might!
  • Implement a new kind of routing proxy that allows users to use things exactly as they are and route things into the right location. This sounds really hard, since existing routing SQL proxies are targeted at sharding rather than quite this kind of thing where there are no shard keys. (I did experiment a bit with existing versions of this kind of software and didn't get far).

The timetable isn't entirely up to me here. The replicas are blocking DBA work, in part. We proposed a timeline here https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#Timeline to try to keep up with that work.

Don't let me derail the conversation, this is mostly for myself to remember to discuss this in the following team meetings/etc. I'll move to a new task if anyone thinks it's worth it xd

I see a couple of things that, though it's too late to start working on them this time, will avoid lots of issues in the mid-to-long term if we get to implement them, and imo should get a strong focus after the current move is done:

  • A way to measure usage by users. This will allow us both to make informed decisions about where to put effort and to guide the design of the system to adapt to user needs, and to estimate, foresee and plan, so we are able to take measures before there's no time.
  • A way to abstract the host-port-db URL selection for users. That will allow us to dynamically move dbs around without having to change any user's code directly.

A very "simple" first step toward both could be an http api that, given a db name/project name, returns the url to connect to (so users just have to do requests.get/http_get/https.get/reqwests::Client::new()::get(..), whatever the language; simple enough to not need a new library, though it still requires changing lots of bits of code and risks some tools getting abandoned). That would also give a first measure of how many tools connect to which databases (though it tells us nothing about for how long or what they do with them, but it's a start).
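A rough sketch of what that api could look like (everything here is hypothetical: the route, the hardcoded mapping and the returned hostnames would really come from the canonical section config):

# Minimal sketch of the suggested db -> connection-info lookup API.
# Route, mapping and hostnames below are illustrative assumptions only.
from flask import Flask, abort, jsonify

app = Flask(__name__)

# In reality this would be loaded from the canonical wiki -> section mapping.
SECTION_OF = {"enwiki": "s1", "frwiki": "s6", "wikidatawiki": "s8"}

@app.route("/v1/<cluster>/<dbname>")
def lookup(cluster, dbname):
    if cluster not in ("web", "analytics") or dbname not in SECTION_OF:
        abort(404)
    section = SECTION_OF[dbname]
    return jsonify({
        "host": f"{section}.{cluster}.db.svc.wikimedia.cloud",
        "port": 3306,
    })

Tools would do a plain HTTP GET and use the returned host/port, which also lets us move databases around later without touching their code.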

I think that the next step on that goal would be to try to figure out those two, but that will most probably require implementing counters on the DB side (as Brooke mentioned, gathering raw data is not plausible, so we'd have to summarize on the fly instead). But once there's the indirection of the db connector, we can play around with POCs more easily to find out what works before having to move everything to that system.

No matter the solution chosen, I wanted to advocate for ensuring we abstract implementation details away from the end users. I'd like to keep it very simple for end users (including us!) to utilize the service, while ensuring flexibility to make architectural changes in the future without user impact or concern. Ideally architecture changes would not have as much impact as the current scenario. While I wouldn't hide technical details about port, host, or section a wiki is in, I would discourage users from depending on those details.

We had an internal conversation and we think we can move forward with the following architecture:

wikireplicas-floating ips.png (806×3 px, 232 KB)

The main diff compared to the previous architecture diagram is that we introduce another proxy layer inside the CloudVPS virtual network. This proxy layer 1 is able to abstract away database sections for customers.
Then proxy layer 2 receives connections from the first layer, coming from the general egress CloudVPS NAT public IPv4 address, and is responsible for proxying again to the actual database, using different TCP port selectors which are known only between proxy layer 1 and proxy layer 2.

The flow is something like this:

client 
->
 enwiki.web.svc.clouddbservices.eqiad1.wikimedia.cloud IN CNAME s1.web.svc.clouddbservices.eqiad1.wikimedia.cloud
 ->
    s1.web.svc.clouddbservices.eqiad1.wikimedia.cloud (proxy layer 1 -- 172.16.x.x -- TCP/3306)
   ->
     dbproxy1019 (proxy layer 2 -- web db address (208.80.x.x) -- s1 port TCP/XXXX)
     ->
       clouddb1013 or any other db mysql

client 
->
 enwiki.analytics.svc.clouddbservices.eqiad1.wikimedia.cloud IN CNAME s1.analytics.svc.clouddbservices.eqiad1.wikimedia.cloud
 ->
    s1.analytics.svc.clouddbservices.eqiad1.wikimedia.cloud (proxy layer 1 -- 172.16.y.y -- TCP/3306)
   ->
     dbproxy1018 (proxy layer 2 -- analytics db address (208.80.y.y) -- s1 port TCP/YYYY)
     ->
       clouddb1013 or any other db mysql

We would only need 2 public IPv4 addresses on the production side, one mapped to each of the 2 proxy servers in proxy layer 2: one for the web db and the other for the analytics db. Whether these 2 public IPv4 addresses are managed by LVS or directly attached to the proxy servers is up to you folks on the production side.
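For reference, the section-to-port selector shared between the two proxy layers would be the same kind of mapping that profile::mariadb::section_ports provides in production puppet; a sketch of the shape, with assumed (not authoritative) values:

# Assumed section -> TCP port scheme, for illustration only; the real map
# is whatever puppet exposes to both proxy layers.
s1: 3311
s2: 3312
s3: 3313
s4: 3314
s5: 3315
s6: 3316
s7: 3317
s8: 3318

Proxy layer 1 listens on each section address on TCP/3306 and forwards to the matching port on the proxy layer 2 public address, so customers never see these ports.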

Actually, using both LVS and proxy layer 2 is a bit redundant from the architecture point of view, but we understand that the DBA side might want to retain their current workflows (which involve haproxy) and the SRE side might want to introduce consistency and use LVS. Given this doesn't impact how customers interact with the replicas, we don't mind much either way, as long as performance etc. is good enough.

The architecture proposed in this very comment has received some early testing by @Bstorm already, and we believe it checks all the items in the several checklists we all have. So, again, please @Marostegui and @ayounsi validate that this works for you.

In the future we might move the replica database servers inside the cloud realm and only have replication traffic crossing the edge network, which should help with performance and simplify the architecture. But that would be a big change for this iteration and we don't want to introduce artificial/unneeded delays in this project.


As I mentioned earlier, we don't have strong opinions on how clients should be reaching the wikireplica databases, but we do need:

  1. Being able to depool hosts from the LB gracefully (ie: clouddb1016:3315 needs a mariadb restart),
  2. Being able to fully switch traffic from one LB to another (ie: a dbproxy goes down, or needs to go offline for maintenance).

And to add my personal note, I fully agree with @nskaggs and @Bstorm - this service is very popular because it is super easy to use. We are already adding overhead with the multi-instance conversion (but we really have no other option).
Volunteers/tools work with database names (wikis) most of the time, and from my point of view they should keep working with that approach as much as possible, so abstracting ports/IPs away from them as much as possible is very much needed.

Because it isn't that hard to clean up later if I am wrong, I'm going to allocate the 16 vips inside CloudVPS, using the clouddb-services project which is already restricted to WMCS and Data Persistence. That will allow me to set up the PoC of this last design, and I think it checks everyone's boxes.

Change 651301 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: set up VM haproxy layer

https://gerrit.wikimedia.org/r/651301

Change 651778 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: set up VM haproxy layer

https://gerrit.wikimedia.org/r/651778

Change 651778 merged by Bstorm:
[operations/puppet@production] wikireplicas: set up VM haproxy layer

https://gerrit.wikimedia.org/r/651778

Change 651845 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas proxy: Make port map available to cloud

https://gerrit.wikimedia.org/r/651845

Change 651845 merged by Bstorm:
[operations/puppet@production] wikireplicas proxy: Make port map available to cloud

https://gerrit.wikimedia.org/r/651845

Thanks. From an SRE standardization point of view this latest architecture seems to check all the boxes.

Change 657155 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: Add new DNS names for multiinstance replicas

https://gerrit.wikimedia.org/r/657155

Declining this task in favor of the new subtask.

Change 657155 merged by Bstorm:
[operations/puppet@production] wikireplicas: Add new DNS names for multiinstance replicas

https://gerrit.wikimedia.org/r/657155