Define naming scheme for connecting to new wiki replica cluster
Closed, ResolvedPublic

Description

After some conversation with @jcrespo and @Marostegui, promoting just role based names (i.e. wikireplica-web & wikireplica-analytics) for the new Wiki Replica cluster limits our longer term ability to separate and route traffic.

We need a naming convention that allows describing both the target wiki and the role (real-time or analytics).

Once a scheme is chosen we need to add all of the permutations to DNS.

Event Timeline

bd808 created this task.Sep 2 2017, 4:14 PM
Restricted Application added a project: User-bd808.Sep 2 2017, 4:14 PM
Restricted Application added a subscriber: Aklapper.
bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.Sep 5 2017, 4:53 PM
bd808 added a comment.Sep 5 2017, 5:32 PM

Some semi-random thoughts that will probably eventually spawn other tasks:

  • <wiki>.labsdb will be with us for a long, long time and should eventually be made CNAMEs to some other hostnames when we cut people over forcibly to the new servers.
  • <wiki>.labsdb should probably point to the "web" flavor in the new routing. Existing usage is going to be split across both use-cases, but it will be easier to get people to update their code for a carrot (can run longer queries without being stopped by the long query killer) than for a stick (usage now slower because of SLA change).
  • The "web" flavor should probably start life with a query killer in place. This may help us head off some of the eventual "but this query hasn't changed in N time units and now doesn't work" frustration from users.
  • Quarry should be pointed at the "analytics" flavor.
  • /usr/bin/sql should be pointed to the "analytics" flavor.
chasemp added a subscriber: chasemp.Sep 5 2017, 7:48 PM

I really dislike foo.labsdb. It is trading all sanity for conciseness I think. I think wikireplica-web.eqiad.wmnet and wikireplica-analytics.eqiad.wmnet are service urls and we should follow some service url standard here with a respective FQDN so that our sanity is preserved. We have a few other use cases (services that should be using service urls) that should fall in line here but this is probably the most painful to change longterm.

Strawdog(s) -- (enwiki.labsdb can CNAME to one of these):

  1. en.web.db.svc.eqiad.wmflabs | en.analytics.db.svc.eqiad.wmflabs
  2. en-db.web.svc.eqiad.wmflabs | en-analytics.web.svc.eqiad.wmflabs
  3. en-web.svc.eqiad.wmflabs | en-analytics.svc.eqiad.wmflabs

(svc identifier follows production nomenclature)

bd808 added a comment.Sep 5 2017, 8:00 PM

The <wiki>.(web|analytics).db.svc.eqiad.wmflabs convention makes sense to me. Per my ramblings in T174860#3581008, we would CNAME <wiki>.labsdb to <wiki>.web.db.svc.eqiad.wmflabs as soon as the full "go live" is given for the new cluster. Those CNAMEs will be with us in DNS for a long time, but should not cause any long term burden.
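
To make the convention concrete, here is a minimal sketch of how the names could be built. The function names (service_name, legacy_alias) and constants are hypothetical and not part of any existing tooling; only the name pattern and the <wiki>.labsdb to "web" CNAME idea come from the discussion above.

```python
ROLES = ("web", "analytics")
BASE_DOMAIN = "db.svc.eqiad.wmflabs"


def service_name(wiki, role):
    """Return the proposed service name for a wiki database and role."""
    if role not in ROLES:
        raise ValueError("unknown role: %r" % (role,))
    return "%s.%s.%s" % (wiki, role, BASE_DOMAIN)


def legacy_alias(wiki):
    """Return (legacy name, CNAME target) for the <wiki>.labsdb alias.

    The legacy name points at the "web" flavor, as proposed above.
    """
    return ("%s.labsdb" % wiki, service_name(wiki, "web"))
```

For example, legacy_alias("enwiki") pairs enwiki.labsdb with enwiki.web.db.svc.eqiad.wmflabs.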

bd808 added a comment.Sep 7 2017, 1:05 AM

@jcrespo & @Marostegui: do either of you have a reasoned objection to us using the <wiki>.(web|analytics).db.svc.eqiad.wmflabs naming scheme for Cloud Services user access to the new servers?

It sounds good to me

jcrespo added a comment.EditedSep 7 2017, 8:12 AM

Looks ok to me. I was worried if underscores would be allowed on dns entries (which some wikis sadly have, which are also wildcards for mysql), but it seems to be accepted (it is only frowned upon on hostnames).

Are you (cloud) going to take care of changing the dns every time a wiki is added?
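
The underscore distinction can be illustrated with a small check. This is only a sketch: the helper names are made up, the host name rule follows the usual RFC 952/1123 letters-digits-hyphen convention, and the generic DNS label rule (any 1 to 63 octets) follows RFC 2181. The wikidb name be_x_oldwiki is used purely as an example of a name containing underscores.

```python
import re

# Host name label: letters, digits and hyphens, no leading or trailing hyphen.
HOSTNAME_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_hostname_label(label):
    """True if the label is acceptable as part of a host name."""
    return HOSTNAME_LABEL.fullmatch(label) is not None


def is_dns_label(label):
    """True if the label fits the generic DNS limit of 1 to 63 octets."""
    return 1 <= len(label.encode("utf-8")) <= 63
```

Under these rules be_x_oldwiki is a usable DNS label but would not pass as a host name label.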

That seems reasonable to me

bd808 added a comment.Sep 7 2017, 3:29 PM

Are you (cloud) going to take care of changing the dns every time a wiki is added?

Do we already have some process for this with the *.labsdb DNS names? I assumed so, but I actually don't know where that code/configuration lives.

$ dig enwiki.labsdb

; <<>> DiG 9.9.5-3ubuntu0.15-Ubuntu <<>> enwiki.labsdb
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29834
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;enwiki.labsdb.                 IN      A

;; ANSWER SECTION:
enwiki.labsdb.          1640    IN      A       10.64.4.11

;; Query time: 0 msec
;; SERVER: 208.80.155.118#53(208.80.155.118)
;; WHEN: Thu Sep 07 15:29:24 UTC 2017
;; MSG SIZE  rcvd: 47
jcrespo added a comment.EditedSep 7 2017, 3:36 PM

There is: puppet:modules/toollabs/files/sql

And

puppet:modules/role/manifests/labs/dnsrecursor.pp

I do not dare to say I know what is actually going on there.

bd808 added a comment.Sep 7 2017, 4:02 PM

puppet:modules/role/manifests/labs/dnsrecursor.pp

This plus modules/role/templates/labs/dns/db_aliases.erb seems to be the magic. We will need to do some rethinking of how this actually works to give more control of mapping things. The current setup is that there are 3 buckets of service names corresponding to the prior 3 labsdb100[123] hosts hard coded in the ::role::labs::dnsrecursor module. These are iterated to generate a DNS zone file in /var/zones/labsdb that looks something like:

c1      3600    IN      A       10.64.4.11
s1      3600    IN      A       10.64.4.11
s2      3600    IN      A       10.64.4.11
s4      3600    IN      A       10.64.4.11
enwiki  3600    IN      A       10.64.4.11
...
c2      3600    IN      A       10.64.37.4
c3      3600    IN      A       10.64.37.5
s3      3600    IN      A       10.64.37.5
s5      3600    IN      A       10.64.37.5
s6      3600    IN      A       10.64.37.5
s7      3600    IN      A       10.64.37.5
dewiki  3600    IN      A       10.64.37.5
...
tools   3600    IN      A       10.64.37.9

The proposed web.db.svc.eqiad.wmflabs and analytics.db.svc.eqiad.wmflabs would be generated similarly. The open question is designing a data structure to hold the mapping of <wiki> to backing host that is easily edited, usable in the code, and provides the flexibility needed to actually manage the use of the backing server cluster. For the first iteration, we could CNAME things to either dbproxy1010.eqiad.wmnet or dbproxy1011.eqiad.wmnet as appropriate. It's unclear to me if that provides the desired long term flexibility or not.
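
As a rough illustration of the bucket iteration described above, here is a Python sketch. The real logic lives in the Puppet ERB template; render_zone and BUCKETS are hypothetical names, and the data is just the subset of names visible in the listing above.

```python
def render_zone(buckets, ttl=3600):
    """Render one A record line per service name, grouped by backing host."""
    lines = []
    for address, names in buckets:
        for name in names:
            # Columns are padded to 8 characters to mimic the listing above.
            lines.append("%-8s%-8d%-8s%-8s%s" % (name, ttl, "IN", "A", address))
    return "\n".join(lines)


# Only the names shown in the listing above; elided entries are left out.
BUCKETS = [
    ("10.64.4.11", ["c1", "s1", "s2", "s4", "enwiki"]),
    ("10.64.37.4", ["c2"]),
    ("10.64.37.5", ["c3", "s3", "s5", "s6", "s7", "dewiki"]),
]
```

Calling render_zone(BUCKETS) produces lines in the same shape as the zone file excerpt.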

As advice you have heard from me many times: all refactoring is cool and I am more than OK with it, but please do not let it block the urgent part, which is decommissioning the old servers (I think bd808 is of the same mind, based on his "first iteration" comment). You can always do a proper refactoring later.

bd808 added a comment.Sep 7 2017, 4:15 PM

I'll put up a patch to generate the full mappings as CNAMEs to the two dbproxy servers. I agree that we can iterate on this over time as opposed to stopping for days/weeks to determine the "perfect" solution that will almost surely turn out to be insufficient 6-9 months from now due to something we did not foresee in the initial design.

bd808 added a comment.Sep 8 2017, 12:55 AM

Using *.db.svc.eqiad.wmflabs is a tiny bit more complicated than the current *.labsdb service names at our DNS layer. The *.labsdb service names are maintained as zone overlay files in our PDNS recursors. As noted above, this is done with the modules/role/templates/labs/dns/db_aliases.erb template and a lot of hardcoded wikidb name configuration in ::role::labs::dnsrecursor. The use of an overlay file works because the primary DNS servers, which are managed via OpenStack Designate, don't know anything about the labsdb zone. It's just some made up stuff that only the recursors understand how to answer.

eqiad.wmflabs however is a proper zone:

$ dig -t SOA +noall +answer eqiad.wmflabs.
eqiad.wmflabs.          59      IN      SOA     labs-ns0.wikimedia.org. root.wmflabs.org. 1504830032 3600 600 86400 3600

This implies that *.db.svc.eqiad.wmflabs will need to be managed by the authoritative DNS servers. This in turn means that we will need to use OpenStack Designate to manage the zones. This is not horrible and should not slow down implementation much, but it does mean we have to set up a reasonable way to do this for the large number of service names that are needed.

zone contents

Historically in the Cloud Services replicas, <wikidb>.labsdb includes all of the exposed wikis as well as a few other names:

  • c[123] - direct connections to the prior cluster hosts
  • s[1-7] - mappings based on the production shard which are expected to allow access to the same collection of wikidb's that would be reachable on a corresponding production shard host

In the new setup, c1, c2, and c3 can be omitted. Historically these service names were used by Tools that maintained state (Tool owned databases) co-located with the wikidb tables. The new cluster will not support this use-case, so this layer of mappings is no longer necessary.

Use of s[1-7] however will still be useful for some use-cases. If we maintain the contract of mapping the same wikidbs to the same shard as production then we can use CNAME records for each <wikidb> pointing to the appropriate s[1-7] service name. Each of the s[1-7] service names would in turn be an A record pointing to the same IPv4 as one of the haproxy instances.
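
The two-level lookup can be sketched as a tiny resolver over an in-memory record table. Everything here is illustrative: resolve and RECORDS are hypothetical, 192.0.2.10 is a documentation placeholder address rather than a real haproxy IP, and the enwiki to s1 mapping is assumed for the example.

```python
def resolve(name, records, max_depth=8):
    """Follow CNAME records until an A record is reached."""
    for _ in range(max_depth):
        rtype, value = records[name]
        if rtype == "A":
            return value
        name = value  # CNAME: continue with the target name
    raise RuntimeError("CNAME chain too long")


ZONE = "web.db.svc.eqiad.wmflabs."
RECORDS = {
    # Shard service name: A record pointing at a proxy (placeholder IP).
    "s1." + ZONE: ("A", "192.0.2.10"),
    # Per-wiki name: CNAME to the shard service name.
    "enwiki." + ZONE: ("CNAME", "s1." + ZONE),
}
```

With this layout, moving a whole shard only requires changing the single s1 A record; every wikidb CNAME that points at it follows automatically.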

managing the zones

To manage the list of CNAME records, I think we should use a utility script that pulls the s[1-7] dblist files from https://noc.wikimedia.org/conf/ and talks directly to the Designate API to add/update records. This script would be run manually similar to the way we run maintain-views. When run it would ensure that the wikidbs in each dblist file exist in both DNS zones and are pointing to the correct s[1-7] service name in the same zone.
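
A minimal sketch of the record computation such a script would need, under stated assumptions: dblist files are taken to be plain text with one wikidb name per line (with optional # comments), and parse_dblist and desired_cnames are hypothetical helpers. Fetching the files and talking to the Designate API are deliberately left out.

```python
ZONES = ("web.db.svc.eqiad.wmflabs.", "analytics.db.svc.eqiad.wmflabs.")


def parse_dblist(text):
    """Return wikidb names from dblist text, skipping blanks and # comments."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            names.append(line)
    return names


def desired_cnames(dblists, zones=ZONES):
    """Map each <wikidb>.<zone> name to its <shard>.<zone> CNAME target.

    dblists maps a shard name such as "s1" to the text of its dblist file.
    """
    wanted = {}
    for zone in zones:
        for shard, text in sorted(dblists.items()):
            for wikidb in parse_dblist(text):
                wanted["%s.%s" % (wikidb, zone)] = "%s.%s" % (shard, zone)
    return wanted
```

The resulting mapping could then be compared against what is currently in each zone to decide which records to add or update.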

A second utility script could be created to make managing the A records for s[1-7].(web|analytics).db.svc.eqiad.wmflabs easy. This would make failing a shard's traffic over from one host to another easier.
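
One way the failover step might look, assuming the A records are kept in a nested mapping of zone to shard to address list. That structure, the fail_over name, and the placeholder addresses used when trying it out are assumptions for illustration; pushing the change to DNS is not shown.

```python
import copy


def fail_over(config, zone, shard, addresses):
    """Return a copy of config with one shard's A record addresses replaced."""
    if shard not in config["zones"][zone]:
        raise KeyError("unknown shard %s in %s" % (shard, zone))
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated["zones"][zone][shard] = list(addresses)
    return updated
```

Returning a modified copy keeps the old state around, which makes it easy to diff before and after, or to roll back.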

The nicest part of this scheme in my opinion is that it removes one of the "oh yeah also add the wikidb to file X" steps from adding a new wiki. It also separates managing the DNS entries from Puppet runs as a bonus.

final cutover

When we are ready to force all traffic to the new cluster we can replace the current ::role::labs::dnsrecursor template and config with a single static file that has the point in time list of service names as CNAME records to their appropriate s[1-7].web.db.svc.eqiad.wmflabs A record.

I'm not fully aware of the technical considerations in this task, but is it possible to do <whatever>.eqiad.wmcloud instead of "wmflabs"? And, would we even want to start doing that?

Krenair added a subscriber: Krenair.Sep 9 2017, 1:31 PM
bd808 added a comment.Sep 9 2017, 11:09 PM

I'm not fully aware of the technical considerations in this task, but is it possible to do <whatever>.eqiad.wmcloud instead of "wmflabs"? And, would we even want to start doing that?

We had an epic discussion about this on irc and decided that we are not ready to start using wmcloud for this yet. The reasons were long and boring but can be summarized as we want to 'save' that switch for when we bring up a second availability zone for Cloud VPS or at least have a very firm idea of how we will do that and what we want the domain structure to look like.

Andrew added a subscriber: Andrew.Sep 11 2017, 1:15 PM

I've created the db.svc.eqiad.wmflabs. domain.

Managing this domain will be a bit of a pain, since it's in noauth-project. Manipulation of noauth-project doesn't seem to be supported in the designate client until Newton; I've hacked up a fix to support it in the client running on labvirt1001 (it's a very simple change) but I can't decide if getting that hacked version rolled out on the cluster is worth the headache... the other option is to use direct API calls. An example of how to do that (for querying but not modifying domains) is in puppet/modules/openstack2/files/liberty/admin_scripts/novastats

Krenair added a comment.EditedSep 17 2017, 9:53 PM

I've created the db.svc.eqiad.wmflabs. domain.

Managing this domain will be a bit of a pain, since it's in noauth-project. [...]

So let's transfer the domain to some project like 'admin'? We have a similar project for wmflabs.org and 128-25.155.80.208.in-addr.arpa called wmflabsdotorg

I am curious as to what your designateclient hack is though

bd808 added a comment.Sep 18 2017, 4:31 AM

I have a script that manages the domains that I'll upload as an operations/puppet.git patch soon. I used it to fully populate the web.db.svc.eqiad.wmflabs and analytics.db.svc.eqiad.wmflabs zones with s[1-7] A records and CNAME records for all of the wikidb names pointing to the s[1-7] 'shard' that they match. The script is configured with a YAML file that looks like:

wikireplica_dns.yaml
---
zones:
  web.db.svc.eqiad.wmflabs.:
    s1:
      - 10.64.37.15
    s2:
      - 10.64.37.15
    s3:
      - 10.64.37.15
    s4:
      - 10.64.37.15
    s5:
      - 10.64.37.15
    s6:
      - 10.64.37.15
    s7:
      - 10.64.37.15
  analytics.db.svc.eqiad.wmflabs.:
    s1:
      - 10.64.37.14
    s2:
      - 10.64.37.14
    s3:
      - 10.64.37.14
    s4:
      - 10.64.37.14
    s5:
      - 10.64.37.14
    s6:
      - 10.64.37.14
    s7:
      - 10.64.37.14
$ python wikireplica_dns.py --help
usage: wikireplica_dns.py [-h] [-v] [--config CONFIG] [--aliases]

Wiki Replica DNS Manager

optional arguments:
  -h, --help       show this help message and exit
  -v, --verbose    Increase logging verbosity
  --config CONFIG  Path to YAML config file
  --aliases        Update per-wiki CNAME records
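
For readers who want a feel for the shape of such a tool, here is a sketch that reproduces the command line above with argparse and flattens the YAML structure into A records. It is not the script from the patch: build_parser and a_records are hypothetical, the defaults are guesses, and the config is assumed to be already parsed into a dict.

```python
import argparse


def build_parser():
    """Build a parser matching the usage text shown above."""
    parser = argparse.ArgumentParser(
        prog="wikireplica_dns.py", description="Wiki Replica DNS Manager")
    parser.add_argument(
        "-v", "--verbose", action="count", default=0,
        help="Increase logging verbosity")
    parser.add_argument("--config", help="Path to YAML config file")
    parser.add_argument(
        "--aliases", action="store_true",
        help="Update per-wiki CNAME records")
    return parser


def a_records(config):
    """Yield (fqdn, address) pairs from the parsed config structure."""
    for zone, shards in sorted(config["zones"].items()):
        for shard, addresses in sorted(shards.items()):
            for address in addresses:
                yield ("%s.%s" % (shard, zone), address)
```

Keeping the addresses as lists, as in the YAML above, leaves room for a shard name to carry more than one A record later.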

That seems really great

Change 378739 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] wmcs: Add wikireplica_dns management script

https://gerrit.wikimedia.org/r/378739

Change 378739 merged by Rush:
[operations/puppet@production] wmcs: Add wikireplica_dns management script

https://gerrit.wikimedia.org/r/378739

bd808 moved this task from To Do to Done on the User-bd808 board.Jul 15 2020, 9:17 PM