Paste P6465

statsv multi DC discussion - 2017-12-12

Authored by Ottomata on Dec 14 2017, 3:25 PM.
ottomata| hey yall, is there a puppet way of knowing which is the current active primary DC?
bblack| ottomata| there's not a single, global conception of that AFAIK in puppet. Only different switches for different things. Also, the general long-term trend is away from "X is the primary DC" in puppet, towards "X is the primary DC" in etcd or, even better, active-active.
ottomata| ya bblack, i'd prefer to do active active. i'm working on multi-DC statsv (varnishkafka -> statsd), but because this is statsd
ottomata| and all DCs produce to a single active statsd host
bblack| s/statsd/prometheus/ ? 
ottomata| i'm just moving clients off of the analytics kafka cluster, not building new ones! 
ottomata| hmm, i suppose i could regex-extract the master statsd from the hiera statsd hostname...
ottomata| hmmmm
bblack| in other areas, we're actively rewriting/replacing statsd stuff with prometheus stuff
ottomata| yeah, but this is statsv, which is statsd metrics from clients/browsers. the perf team mostly uses it i think
bblack| but yeah, I guess in the meantime, you could roll your own switch
ottomata| so we'd have to rewrite how the clients emit the metrics i think... or i suppose translate from statsd format to prometheus in our consumer
bblack| (and maybe control the hiera statsd hostname effect from that switch, instead of having it separate)
ottomata| hmm
ottomata| godog| ^ thoughts on that?
bblack| if it's PII, I don't know that we want it in ops prometheus
bblack| I was thinking of ops statsd
ottomata| shouldn't be PII
ottomata| afaik
bblack| ok
godog| yeah I don't think so either re: PII
godog| ottomata| reading
godog| ottomata| so yeah extracting the statsd dc from the hostname in hiera isn't pretty
ottomata| nope
ottomata| godog| what about:
ottomata| statsd_active_datacenter: eqiad
ottomata| # Main statsd instance
ottomata| statsd: statsd.%{statsd_active_datacenter}.wmnet:8125
godog| ottomata| I see, so e.g. producers in a datacenter that's not active would know they are not supposed to send data to statsd ?
ottomata| hmm, no godog, thinking of doing it like:
ottomata| varnishkafka producers know which statsd is active, and they only produce to kafka in that datacenter
ottomata| the statsv python consumer instances (which consume from kafka and produce to statsd), will run in both DCs, consuming from their local kafka clusters
ottomata| normally, there will be no data in statsv topic in codfw main kafka
ottomata| so the statsv consumer will be doing nothing there
ottomata| but, if you change active statsd instance
ottomata| puppet will reconfigure and bump varnishkafka, causing all of them to produce to the other DC statsv topic
ottomata| then the statsv python consumer will see messages there and start producing to statsd
ottomata| one q about that setup: should statsv in codfw always produce to active statsd (I think yes?) or to its local statsd? It seems like it won't matter since graphite is mirrored both ways, but since everybody else produces to active statsd independent of datacenter, probably statsv should do the same.
godog| ottomata| got it, yeah it should always produce to the active statsd for simplicity
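The statsv consumer design ottomata lays out above (run in both DCs, consume only the local Kafka cluster, always forward to the single active statsd) could be sketched roughly like this. This is an illustrative sketch, not the real statsv source: it assumes the kafka-python client, and the topic name, group id, bootstrap address, and statsd discovery hostname are all made up for the example.

```python
# Hypothetical sketch of the statsv daemon described above -- NOT the real
# statsv source. Library and hostnames are assumptions, not from the log.
import socket

STATSD_ADDR = ("statsd.discovery.wmnet", 8125)  # illustrative active-DC address

def to_statsd_packet(metric, value, metric_type="ms"):
    """Render one metric in statsd wire format, e.g. 'foo:38|ms'."""
    return "%s:%s|%s" % (metric, value, metric_type)

def forward(metrics, sock, addr=STATSD_ADDR):
    """Forward decoded (metric, value, type) tuples to the active statsd over UDP."""
    for metric, value, metric_type in metrics:
        packet = to_statsd_packet(metric, value, metric_type)
        sock.sendto(packet.encode("utf-8"), addr)

def run():
    # Consume only from the local-DC Kafka cluster. In the passive DC the
    # 'statsv' topic is normally empty, so this loop simply idles there; if
    # varnishkafka is pointed at the other DC, this consumer picks up the work.
    from kafka import KafkaConsumer  # kafka-python
    consumer = KafkaConsumer("statsv",
                             bootstrap_servers="localhost:9092",
                             group_id="statsv")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for msg in consumer:
        pass  # decode msg.value into (metric, value, type) tuples, then forward(...)
```

The important property is that both consumers are always running with identical config; which one is "active" is decided purely by where varnishkafka produces.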
godog| ottomata| on the topic (hah) of knobs in puppet needed during a switchover we're trying to reduce them, so perhaps the discovery dns records might be an alternative
ottomata| godog| i need to get them in puppet though, in order to render a template
ottomata| can I do that?
ottomata| oh wait, etcd/confd lets you do that right? hmmm
ottomata| wikitech looking...
volans| yes please, no puppet commit for switchovers
ottomata| godog| how do you do this now? you change the dns IP for statsd.eqiad.wmnet?
ottomata| to something in codfw?
godog| ottomata| that's the current sad state of affairs yes
ottomata| haha could use a little help...
ottomata| wow, so i couldn't even rely on that. crazy ok.
ottomata| couldn't rely on the hiera value to infer
volans| ottomata| what do you need to do?
ottomata| volans| i need to know which statsd instance is active, in order to change a varnishkafka conf file to make it produce to a DC specific kafka cluster
ottomata| the kafka_config puppet function helps here
ottomata| i need to call
ottomata| $config = kafka_config('main', $statsd_active_datacenter)
ottomata| where $statsd_active_datacenter is either 'eqiad' or 'codfw'
volans| I'm not sure I follow, why loading a different config based on statsd?
ottomata| trying to make this multi DC in the main kafka cluster (so I can get it off of the analytics cluster)
ottomata| at any given time, there is only a single active statsd instance
volans| yes I know
ottomata| but there is a kafka cluster in eqiad and in codfw
ottomata| so the varnishkafka producers that send this data to kafka, need to know which kafka cluster the data should be sent to, since the statsv python consumer (that produces to statsd) will receive messages in that datacenter
godog| ottomata| I have to go as well, I've subscribed to tho
ottomata| k!
ottomata| is there a confd template example in puppet i can find somewhere...i'm looking...
ottomata| found some! searched for {{
volans| so I think there are two approaches here, either use a discovery DNS and ensure that your tool does a proper refresh with TTL
volans| or confd to dynamically change a config file if it's something in etcd
volans| directly
ottomata| hmm, can't really use discovery dns easily for this; what I need to discover is the active DC name in order to look up the proper list of kafka brokers.
ottomata| i think godog said he changes a statsd discovery dns?
ottomata| which I could use to render the varnishkafka config file...but i'd still have to extract the dc name somehow...
ottomata| grr
ottomata| or no hm
ottomata| no
ottomata| i think this won't work. the kafka broker names are stored in hiera. i can't look them up when confd renders the config file.
ottomata| oh but confd can render from zookeeper???hmmmmm....
ottomata| the broker names are in zookeeper.
ottomata| ergh this is a little crazy hmm.
volans| ottomata| sorry, trying to fix another 2 things.. cannot dig into this now
ottomata| haha k
ottomata| HMMM OR
ottomata| another idea (i think i'm just typing to myself here)
ottomata| i can configure the vk producers to produce to a DC based on the cache::route_table
ottomata| that'll make statsv active/active, even if it will always produce to the single active statsd.
ottomata| bblack| is it ever possible for an entry in cache::route_table to not point at a primary DC (eqiad,codfw)?
ottomata| oh docs say yes
ottomata| "The edge sites (ulsfo and esams) should normally point at one of the primary sites (although it is possible to point an edge at another edge as well and route through it, but this would probably be a rare operational scenario)."
bblack| ottomata| routing is actually handled on a per-service basis now too, and the cache::route_table is used in conjunction with the per-service routing info...
ottomata| ?
bblack| ottomata| so for an active/active service, there is no distinction between the two core sites, and traffic is split with no singular final destination. e.g. ulsfo+codfw frontend requests both exit codfw backend caches hitting codfw services, without touching eqiad. esams+eqiad requests do similarly, staying on their side of the fence and don't touch codfw.
bblack| but for a service which is active/passive and currently only enabled on the eqiad side, requests from ulsfo+codfw end up routing through the eqiad caches to reach the service in eqiad.
bblack| and a different service might be the opposite: active in codfw and passive/offline in eqiad, in which case requests from esams+eqiad end up routing through the codfw caches to reach it.
bblack| it's all based on the service, and one cache cluster can have several services configured differently, such that some requests go eqiad->codfw and some others go codfw->eqiad.
ottomata| ah, ok makes sense. i think in this case the active/active stuff is fine, as i'm kinda basing this on the text caches.
ottomata| since statsv varnishkafka runs on the text caches
bblack| huh?
bblack| I was just answering the question about the cache routing, I don't know what you're talking about with linking it to how you're handling statsv
ottomata| statsv varnishkafka runs on text caches, and logs requests that match /beacon/statsv to kafka
ottomata| i'm trying to make statsv varnishkafka use the main kafka cluster in either primary DC
ottomata| rather than the analytics/jumbo cluster in only eqiad
bblack| cache::route_table sort of info can't help you there, it's not for that.
ottomata| so, if text caches are configured to route requests based on cache::route_table, it should be safe to do so. for varnishkafka too. it doesn't actually matter which kafka cluster the varnishkafka produces to, i just need to pick one.
ottomata| if varnishkafka only ran in eqiad and codfw, it'd be simple
ottomata| i'd just pick the local one
ottomata| but, it also runs in non-primary DCs
bblack| yes, so they need some switches for which to use
bblack| which are not cache::route_table, which is for a different purpose
ottomata| oh?
ottomata| not for routing text cache requests to online (primary) DCs?
bblack| not for deciding where statsv data should be sent 
ottomata| well, statsv data comes from varnish
bblack| cache::route_table and the per-service routing can be modified in custom ways in a lot of different operational scenarios, some of which might have nothing to do with the statsv destination's concerns
ottomata| so if varnish is routed away from a primary DC, it seems safe to assume that varnishkafka should be routed away from that DC
bblack| in any case, I think I missed part of the conversation somewhere above, I thought the statsd destination wasn't even A/A, looking backwards now...
ottomata| it isn't but statsv is 
bblack| "safe to assume" is the problem
ottomata| or, i'm trying to make it so
ottomata| well, safe enough, since it won't really matter because statsD is not active active...the statsv python consumer will be running in each primary DC...and producing to only the active statsD discovery address
ottomata| so, it's more about multi-DC active/active kafka & a daemon producing to statsd.
ottomata| if you read (too much) scrollback, you'd see i was trying to choose the kafka cluster based on the value of the statsd master... buuuut that was getting insane since it is done via confd discovery dns
ottomata| and the kafka configs i need are in hiera
volans| chasemp: it reached the patched code and seems that it passed it, at least for now
bblack| discovery DNS sounds like the right solution here
ottomata| yeah, but kafka isn't in discovery DNS, and that's the value i need
bblack| you mean statsv?
ottomata| no i need to configure varnishkafka
ottomata| to produce to kafka
bblack| yes
ottomata| so i need the list of kafka brokers
bblack| and you put the discovery dns hostname in varnishkafka's config, identically at all sites, and discovery DNS does the rest
ottomata| which is in hiera
ottomata| so, maybe, especially since...i can use confd with zookeeper, right?
bblack| you've switched topics from statsv to kafka brokers, or?
ottomata| the broker names are in zookeeper
ottomata| statsv uses kafka
bblack| what is "statsv"?
ottomata| it's /beacon/statsv -> varnishkafka -> kafka topic 'statsv' -> python consumer statsv -> statsd
bblack| no, I meant more-specifically in statements about
bblack| s/about/above/
bblack| I thought we were talking about statsv the python daemon
ottomata| statsv is really just a python kafka consumer that produces messages from kafka to statsd
ottomata| ah
ottomata| yeah, i want the 'statsv service' (which includes the http requests, kafka backend, and statsv consumer process) to be multi DC
ottomata| so, bblack, configuring varnishkafka with confd might be possible, especially if i can look up values in the confd template from zookeeper
ottomata| buut, its just gonna get complicated
volans| on the host, logging into with the install_console
bblack| ottomata| I don't think anyone was suggesting integrating confd directly with vk
ottomata| ok...?
bblack| ottomata| so, statsv is a daemon. it currently runs on hafnium as part of role::webperf. I'm guessing from site.pp there are plans for it and other things to now run on webperf[12]001 for the multi-dc part.
ottomata| yes
bblack| ottomata| some other stuff (kafka brokers?) need to contact statsv, and the question is how to route requests to webperf[12]001 from kafka brokers in various DCs?
ottomata| the producer of the data (varnishkafka) is decoupled from the consumer of the data (statsv daemon)
ottomata| so, not quite.
ottomata| the question is how to route requests from varnishkafka when there are multiple kafka clusters to choose from
bblack| ottomata| that's not a statsv/statsd question though, that's a vk->kafka-clusters question.
ottomata| bblack, we'd have this problem for all webrequest too, if for some reason we wanted to support multi DC analytics stuff, or if someone needed webrequest logs reliably in all DCs
bblack| can we handle one question at a time? we seem to be ping-ponging all over the place
ottomata| bblack, ok.
ottomata| the statsv question is: how do we make it so statsv can run multi DC, but with an active/passive statsd?
ottomata| that is
ottomata| statsv needs to be multi DC, but we need to be sure that any given statsv message only is produced once (to the active statsd instance)
ottomata| (since the statsd metrics are mirrored between DCs by graphite)
ottomata| sorry i guess this is hard to type out, it's a little complicated... i've been drawing crappy pictures today to keep it straight in my mind...
ottomata| bblack, is that question better?
bblack| so, both ends of this pipeline, vk and statsv, both make their connections *to* a kafka cluster? nobody "connects to" statsv, then?
bblack| and then statsv forwards to statsd by connecting to statsd?
bblack| I'm just trying to figure out who is even connecting to what and where data is going now
ottomata| correct
bblack| are the per-DC kafka clusters independent or do they also share data with each other?
ottomata| this is publish subscribe
ottomata| they are effectively independent. for active active pubsub like changeprop
ottomata| we prefix topics with dc name
ottomata| e.g.
ottomata| eqiad.mediawiki.revision-create, codfw.mediawiki.revision-create
ottomata| then
bblack| so they do share data, you just use keys in some cases to make it distinct
ottomata| Kafka mirror maker mirrors data between the two
ottomata| so data from both DCs is available to consumers in each DC
ottomata| but producers only produce to their local DCs
ottomata| that works because mediawiki, (etc.) only run in primary DCs
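The DC-prefixed topic scheme ottomata describes (producers write only the local prefix, MirrorMaker copies between DCs, consumers subscribe to every prefix) can be modeled in a few lines. The helper names here are made up for illustration; the topic naming follows the eqiad.mediawiki.revision-create example from the log.

```python
# Illustrative model of the DC-prefixed topic convention; function names are
# invented, only the naming scheme comes from the discussion.
CORE_SITES = ("eqiad", "codfw")

def producer_topic(local_site, stream):
    """Producers write only to their local DC's prefixed topic."""
    return "%s.%s" % (local_site, stream)

def consumer_topics(stream, sites=CORE_SITES):
    """MirrorMaker mirrors each DC's prefixed topics to the other DC, so a
    consumer in either DC subscribes to all DC-prefixed variants of a stream."""
    return [producer_topic(site, stream) for site in sites]
```

An unprefixed topic like 'statsv' falls outside this convention, which is why (as noted below) it would not be mirrored.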
bblack| and statsv is basically just a stateless translator that pulls from kafka and pushes to statsd?
ottomata| correct
bblack| ok
bblack| so first, not that it really affects the final solution much: is there a reason to run statsv independently on webperf[12]001, instead of just running (many more) instances of it directly on the kafka brokers?
bblack| just thinking in terms of not introducing more network hops and more resiliency problems
ottomata| so, it wouldn't hurt
ottomata| but, 'statsv' topic is single partition
ottomata| so
ottomata| extra consumer processes would be idle
bblack| ok
ottomata| but would take over if main one dies
bblack| ignore that for now
ottomata| k :)
bblack| the vk statsv *data*: does it have these per-DC key prefixes?
ottomata| no
bblack| ok
ottomata| it's basically a limited webrequest log
ottomata| with statsd format like data in the query params
ottomata| e.g.
ottomata| "?PagePreviewsApiResponse=38ms&PagePreviewsApiResponse=35ms"
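Given the sample query string above, the statsd-style data in the beacon params could be decoded roughly like this. This is a simplification for illustration only (the real statsv parsing rules aren't described in the log); treating a trailing 'ms' as a timing metric and everything else as a counter is an assumption.

```python
# Hypothetical decoder for /beacon/statsv query strings; the type-inference
# rules are assumptions, not the real statsv behavior.
from urllib.parse import parse_qsl

def parse_statsv_qs(query_string):
    """Split a statsv query string into (metric, value, statsd-type) tuples.
    A trailing 'ms' marks a timing metric; anything else is treated as a counter."""
    metrics = []
    for metric, raw in parse_qsl(query_string.lstrip("?")):
        if raw.endswith("ms"):
            metrics.append((metric, int(raw[:-2]), "ms"))
        else:
            metrics.append((metric, int(raw), "c"))
    return metrics
```

Note that parse_qsl preserves duplicate keys in order, which matters here since the same metric name can appear more than once per beacon.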
bblack| but in any case, it's only going to send the data to <some kafka broker> once. the problem is on the subscriber side.
ottomata| correct
bblack| if we have two statsv daemons in eqiad+codfw, and they each pull from their local <kafka cluster>, and they both forward to a single statsd host, we still have a problem because events will be duplicated, right?
ottomata| (effectively/mostly correct, there is an edge case here but it is irrelevant: kafka is at-least-once delivery)
ottomata| yes, since graphite mirrors the data
ottomata| wait
ottomata| yes, IF statsv data is in both DC kafka clusters
bblack| it is due to the mirroring
ottomata| not necessarily
ottomata| we only mirror datacenter prefixed topics
bblack| ok, so....
ottomata| if the topic was just 'statsv', it wouldn't be mirrored
bblack| you have two separate multi-dc problems here, at least. maybe 3
ottomata| haha
ottomata| yeah, there a bunch of ways to change this, but the fact that statsd/graphite is active/passive AND mirrored makes the kafka/statsv a little weird
bblack| 1) Telling varnishkafka which kafka cluster to publish to.
bblack| 2) Telling statsv which kafka cluster to publish to.
bblack| 3) Telling statsv which statsd to forward to.
bblack| err typo there on 2
bblack| 2) Telling statsv which kafka cluster to subscribe to.
ottomata| 3) doesn't matter, it's discovery DNS for statsd
ottomata| 2) also doesn't matter, it should always be local
ottomata| statsv daemon in codfw should consume from kafka in codfw
bblack| what happens if you take the kafka cluster in codfw offline? we lose all codfw stats? or we decide that kafka being online in codfw is a precondition for running traffic through codfw, or?
ottomata| good q. thinking
bblack| the "should always be local" is tricky
bblack| because we have N (where N is large) interdependent services, and even in an ideal world where everything's mostly active-active, some of those A/A services might be offline for maintenance in one DC while others are offline in the other.
ottomata| this works for mediawiki stuff, because the answer to ^^ is yes.
ottomata| kafka codfw must work for mediawiki to run in codfw
bblack| so I tend to think every independently-failing service needs its own x-DC switches, independent of others
ottomata| this is kinda why i was tying it to cache text routing.
bblack| when we tie them together things get tricky
ottomata| if cache text is offline in codfw, then no messages will be produced to codfw kafka
ottomata| except, in this case, the source data is from varnish webrequest logs.
bblack| what if MW can't run in codfw right now because of <X>, and kafka can't run in eqiad right now because of <Y>, etc...
bblack| youv'e tied them together on any kind of DC-level outage or maintenance.
ottomata| that would be like saying what if Mysql can't run in eqiad, and mediawiki can't run in codfw...
bblack| ideally, even if kafka is dead in eqiad and MW is dead in codfw, MW@eqiad should still be able to log to kafka@codfw
ottomata| they are a little more decoupled than mysql/mediawiki though.
ottomata| i dunno...
ottomata| that sounds dangerous to me.
bblack| well, because it's not really a/a, it's mirrored with key hacks :)
ottomata| but ya, right now, if mw can't produce to kafka locally, then messages will be dropped
ottomata| bblack, that was a compromise argued with gabriel long ago....
ottomata| :p
bblack| varnish webrequest logs are coming from an a/a source. we do sometimes depool a whole core DC out of varnish-level service, but the norm is A/A sources here.
bblack| we're not going to tie that to VK availability and say "we have to shut down the caches in codfw because kafka is offline in codfw and we're missing logs"
ottomata| hmm, right. i see that.
ottomata| ok, no cache::route_table for this :)
ottomata| but wait
ottomata| no
ottomata| we wouldn't do that.
ottomata| these are statsd metrics.
ottomata| hmm
ottomata| hm, yeah, but it'd be better to change a value for statsv rather than caches.
ottomata| is what you are saying?
ottomata| haha, so I should make my own statsv route table :p
bblack| well, the problem exists at multiple layers here, either 2 or 3 different layers depending how you look at it.
bblack| this same basic problem exists for the primary webrequest kafka traffic too
ottomata| ya
ottomata| indeed
bblack| right now our only answer is it's eqiad-only, and if eqiad kafka goes offline we keep serving users and lose the logging
ottomata| its just that i'm trying to make this single limited use of them multi DC
ottomata| if we were trying to do all webrequest traffic multi DC
ottomata| THEN we'd be having some FUN
bblack| well, we should be having that kind of fun, we've been talking about multidc-for-everything for a long time now, and this is part of it :)
ottomata| usually the way this is done with kafka
ottomata| is every DC has a local Kafka cluster
ottomata| and producers in that DC produce local only
ottomata| then in some main DC (or DCs) there is an 'aggregate' big kafka cluster
ottomata| that mirrors all the DC specific ones
bblack| ok
ottomata| but that doesn't solve your problem of having to take down a whole kafka cluster.
ottomata| fwiw, we've never had to do that
bblack| well the other option is you commit to not doing that
ottomata| kafka is built pretty well to be peerless and online all the time
ottomata| yeah
bblack| I think that's an acceptable answer so long as it's real. what we've seen with some other services is a tendency to use the "other" DC for maintenance and treat multi-DC as something where one side can go offline whenever to <do things>
ottomata| so far that's not how kafka main (our only multi DC kafka cluster) works
bblack| (which is probably ultimately "wrong" in the long term for just about any service, but it is what it is)
ottomata| we do do things first in codfw, just because the risk is less
ottomata| but, we don't take it offline
bblack| ok
ottomata| change-prop runs active-active in both DCs
ottomata| each consuming from and producing to the local kafka clusters
bblack| so let's assume a world where the kafka cluster in a DC is considered always-online. And I guess we chose to believe that strongly enough to decouple outages (in other words, we're not taking down services because kafka's down in that DC. We're losing logs until kafka is restored).
bblack| then we don't have to worry about multi-DC switching at the kafka publishing level. Local services publish to local kafka.
ottomata| that feels right to me, and what we've done so far
bblack| in which case, for sources that are at the edge-only sites, we statically map them to the closest core, not using cache::route_table (which might remap things differently temporarily, for unrelated reasons)
bblack| some hieradata says that vk-statsv in ulsfo publishes to kafka@codfw, no switching.
ottomata| hm, true, since we are saying they are always online.
bblack| we could hypothetically switch it in hieradata, but it's not intended to ever happen as part of some operational/maintenance/outage thing
bblack| but then when we do the next dc-failover trial....
bblack| well I guess it still doesn't matter then. We still wouldn't edit the hieradata. We'd just turn off the publishing services on that side as appropriate (or at least, turn off their inbound requests)
ottomata| aye
bblack| and then statsv-daemon's subscription should also be statically configured in hieradata to pull from the local DC kafka cluster as well
ottomata| ya, that's already easy
ottomata| and done in lots of places
bblack| and then statsv->statsd uses statsd dns discovery (which if it's a/p, may go through short periods of no-response, which should be ok)
ottomata| yup
ottomata| kafka_route_table:
ottomata|   main:
ottomata|     eqiad: main-eqiad
ottomata|     codfw: main-codfw
ottomata|     esams: main-eqiad
ottomata|     ulsfo: main-codfw
ottomata| ?
ottomata| the values are names of kafka clusters in the kafka_clusters hash
bblack| that's for vk->kafka publish?
bblack| you may as well go ahead and add eqsin as well, to codfw
ottomata| well, the only vk instance that we have that will use 'main' clusters is statsv
ottomata| so yes
bblack| which is "main" again?
ottomata| 'main' is the only cluster in both eqiad and codfw
ottomata| the only other is 'jumbo' (which will replace 'analytics')
ottomata| but jumbo is only in eqiad
ottomata| so main is the only one where we do multi DC stuff
bblack| ok
ottomata| i could add jumbo routes in there too, but they'd be unused
ottomata| jumbo:
ottomata|   eqiad: jumbo-eqiad
ottomata|   codfw: jumbo-eqiad
ottomata|   esams: jumbo-eqiad
ottomata|   ulsfo: jumbo-eqiad
bblack| it might be nice just to plumb that for future-proofing, but I donno if it's worth it
ottomata| i'll add it in the patch with a comment and we can see what reviewers say :p
ottomata| oof, i could augment the kafka_cluster_name & kafka_clusters puppet functions to use this data... then if you did
ottomata| kafka_config('main', 'ulsfo')
ottomata| you'd get the config for the main-codfw cluster
bblack| yeah I guess
ottomata| bblack i guess it could also be higher level. instead of just cache::route_table, a static piece of data that listed the closest primary DC
ottomata| that was not kafka specific
bblack| kafka_config($cluster, $publisher_site)
ottomata| yeah
ottomata| $::site (since kafka_config will be called on the varnishkafka host for which config is being rendered)
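The kafka_config($cluster, $publisher_site) lookup being sketched here is essentially a static table walk. A minimal model in Python (rather than Puppet; the function name is invented, and the table mirrors the kafka_route_table pasted earlier, with eqsin added per bblack's suggestion):

```python
# Illustrative model of the proposed kafka_config(cluster, site) lookup.
# The route table data comes from the discussion; the function is a sketch.
KAFKA_ROUTE_TABLE = {
    "main": {
        "eqiad": "main-eqiad",
        "codfw": "main-codfw",
        "esams": "main-eqiad",
        "ulsfo": "main-codfw",
        "eqsin": "main-codfw",
    },
}

def kafka_cluster_for(cluster_group, site):
    """Resolve the concrete Kafka cluster a producer at `site` should use.
    The mapping is static: edges map to their closest core DC and are not
    expected to change during failovers or maintenance."""
    return KAFKA_ROUTE_TABLE[cluster_group][site]
```

In the Puppet version, $site would be $::site on the varnishkafka host, so every host resolves its own producer target from the same hieradata.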
bblack| for the cache::route_table stuff we chose to have the flexibility, though
ottomata| ya, that'd be left as is
bblack| e.g. if we do want to take varnish@codfw offline, we don't want that to imply taking ulsfo and eqsin offline. we re-route them to use eqiad even though it's further away.
bblack| basically, we're not committing to any one site's online-ness, unlike kafka
ottomata| aye, i'm saying just a separate static mapping that can be used as seen fit, but not modified
ottomata| like
ottomata| # Maps datacenter name to the geographically closest
ottomata| # primary data center (either eqiad or codfw).
ottomata| primary_datacenter_map:
ottomata|   eqiad: eqiad
ottomata|   codfw: codfw
ottomata|   esams: eqiad
ottomata|   ulsfo: codfw
ottomata|   eqsin: codfw
bblack| yeah, maybe. it has the potential for abuse once it's out there, though.
bblack| we'd need to be clear about what use-cases it's for and not-for
bblack| (which is probably only kafka at this time)
ottomata| would you prefer i made it kafka specific?
_joe_| bblack| how is traffic going to be routed from eqsin?
ottomata| i could also just put it in statsv profile for now, to keep grubby abusing hands off of it :)
bblack| _joe_: eqsin->codfw (both physically and logically)
_joe_| ok
bblack| ottomata| right, my fear is someone might use it in places, thinking we'll edit that hieradata on multi-dc switches or outages, when we won't. or use it for a service that's expecting to have site-level maintenance outages or whatever.
bblack| maybe put it in a generic place, but just deal with it in naming/comments
ottomata| i mean, i can add lots of comments saying that's not what it's for
ottomata| ok
ottomata| i'll make a patch and we will see...
ottomata| thanks bblack.
bblack| maybe call it "static_core_dc_map"
ottomata| aye
ottomata| do we use 'core' more than 'primary'?
bblack| gets rid of the "primary" wording that sounds failover-y, and calls out its unchanging nature
ottomata| to refer to eqiad and codfw?
_joe_| so your plan to make texas the center of the wiki world is finally coming to a realization. You just need active-active mediawiki :)
ottomata| ayy ok
ottomata| yeah
bblack| I donno if we use "core" in this sense elsewhere. I often do informally, but not sure where formally.
bblack| "git grep core" is not much help lol
bblack| calls them "primary sites"
bblack| which is probably wrong now that we're talking about it
bblack| because people see "primary" in other related places/names/variables and will mentally imply the whole primary/secondary active/passive thing between eqiad+codfw
ottomata| yeah
bblack| we should probably stick to nomenclature where "core" means eqiad+codfw vs non-eqiad+codfw, and the word primary is only used when talking about a/p failover scenarios.
ottomata| haha bblack, 'main'? man, why didn't we call these kafka clusters 'core'?!
ottomata| i hate the name 'main'
ottomata| always did, just didn't come up with anything better.
bblack| it would be confusing now anyways :)
bblack| this conversation is making me try to recall when the last time was, that we've taken all of varnish offline at a DC, other than a DC-level network outage of course.
bblack| I don't think we actually do that. We do shut off frontend traffic, and we do re-route for various reasons.
ottomata| Hmmm, you know, i think maybe a kafka-specific mapping is better... there's no guarantee that a new 'core' datacenter would have kafka main in it...
bblack| but I don't think we've ever just shut off all varnish-be services at a DC for maint or heavy changes
bblack| true
ottomata| right, it's generally just rerouted traffic, ya? there is always some trickle of weirdo requests
bblack| in the long run, we could/should expect a 3rd core DC to appear at least briefly
bblack| e.g. if we were to replace eqiad with another site, we'd probably have 3 core sites during the transition period
ottomata| aye
bblack| but I guess traffic-routing is still "different" than other services
bblack| because it's supporting the edge sites, mostly
ottomata| ya
ottomata| are there other services at the edges?
bblack| if services were running active/active, and codfw suddenly died (the whole DC) or lost too much transport to be useful, we'd still want to route around it and have ulsfo remap to eqiad, for instance
bblack| as opposed to calling varnish@codfw always-up and statically-mapping ulsfo->codfw
bblack| ottomata| not services like we're talking about here, no. just other meta-services for infrastructure, most of which are naturally-redundant in some other unrelated way to this multi-dc stuff.
bblack| e.g. ntp, dns, etcd, etc
bblack| or non-runtime like a tftp server
bblack| mostly the important thing is that they don't carry any important state, so they're "easy"
bblack| it's stateful things that are hard
bblack| although etcd, I donno, I haven't looked at that part of things in a long time, how we're handling it
bblack| static SRV entries to the closer core DCs I guess for now, but we had talked about putting etcd daemons at the edges, too
ottomata| aye
bblack| in theory, kafka brokers at the edges could make sense at some point as well, I donno
bblack| (I mean, analytics ones)
bblack| there's separately the issue that we might want kafka brokers at the edge for kafka-driven PURGEs where the edge caches are subscribing
ottomata| aye
ottomata| i remember talking about that
ottomata| then each cache would have its own consumer group and commit when it has purged
bblack| so, rewinding a bit to the whole "use local kafka" and "kafka cluster in a core DC is always online" sorts of things...
bblack| that strategy definitely does make sense if you consider that all the services pub/subbing to that kafka cluster are also local to that DC
ottomata| (bblack is in a mood to talk! I love it! :D)
ottomata| bblack, that should generally be how it works, with mirror maker handling cross-DC stuff
bblack| if the DC dies or goes offline from the network, the services+kafka go away together.
ottomata| ya
bblack| if we decide to route traffic away from that DC voluntarily for <some reason> it works too, traffic/load just drops off there and it all works ok.
ottomata| yup, or if there is a network problem, stuff will eventually mirror.
ottomata| purges will eventually make it when network problem is resolved
bblack| but, if we stick to that static edge->core mapping for kafka data from varnishes and this idea that the core ones never go down.
bblack| it implies that even if maintenance-wise you never allow kafka@codfw to go down for kafka-reasons.... if codfw as a whole (or its network links) go down, we're faced with losing all the stats for the live traffic in ulsfo+eqsin (which we've re-routed through eqiad), or deciding to shut off all the edge sites at that half of our world to preserve stats integrity (seems like not a great idea)
bblack| which argues for the idea that we should have a switch that can remap kafka data from ulsfo+eqsin->eqiad when codfw goes down.
ottomata| bblack, which sounds like a slight reason in favor of using cache::route_table (or something)
ottomata| ah
ottomata| yeah, so hopefully
ottomata| kafka_datacenter_map:
  main:
    # eqiad and esams should use main-eqiad.
    eqiad: main-eqiad
    esams: main-eqiad
    # codfw, ulsfo and eqsin should use main-codfw.
    codfw: main-codfw
    ulsfo: main-codfw
    eqsin: main-codfw
ottomata| could do it
ottomata| you could change the value of ulsfo: main-codfw to main-eqiad
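The operator switch being discussed here can be sketched as a small lookup: the map mirrors the hiera example pasted above, and the override argument simulates remapping ulsfo+eqsin away from a dead codfw. This is a hypothetical illustration, not actual puppet/confd code.

```python
# Sketch of the datacenter -> kafka cluster switch discussed above.
# The map copies the hiera example; kafka_cluster_for and the
# overrides mechanism are hypothetical, for illustration only.
KAFKA_DATACENTER_MAP = {
    "eqiad": "main-eqiad",
    "esams": "main-eqiad",
    "codfw": "main-codfw",
    "ulsfo": "main-codfw",
    "eqsin": "main-codfw",
}

def kafka_cluster_for(site, overrides=None):
    """Return the Kafka cluster a cache site should produce to,
    applying any operator overrides before the static map."""
    overrides = overrides or {}
    return overrides.get(site, KAFKA_DATACENTER_MAP[site])

print(kafka_cluster_for("ulsfo"))  # main-codfw

# codfw down: operator remaps ulsfo and eqsin to eqiad
failover = {"ulsfo": "main-eqiad", "eqsin": "main-eqiad"}
print(kafka_cluster_for("ulsfo", failover))  # main-eqiad
```

The counter-argument raised next in the discussion applies directly: if this lookup is flipped operationally rather than only at architecture changes, it belongs in etcd/confd rather than in puppet.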
bblack| yes, but now we're saying we'd operationally use those switches, not just "if we build more DCs or change our architecture or whatever"
ottomata| ?
bblack| in which case we're now facing the counter-argument that if you're going to operationally use the switches, they shouldn't be in puppet
ottomata| haha
ottomata| yeah, this sounds like discovery to me
ottomata| but, its tricky, because kafka brokers have their own discovery for broker hostnames
ottomata| via zookeeper
ottomata| so if we did that, we should use confd + zookeeper for discovery of kafka broker list
bblack| if codfw dies, or we're doing a planned simulation of codfw death, we're not shutting off ulsfo+eqsin, and we don't want to lose their stats, so we need an operator switch.
bblack| in "because kafka brokers have their own discovery for broker hostnames", you mean "because kafka clients have their own discovery for broker hostnames" right?
ottomata| sorta, yes, in that a broker is also a client
bblack| ok
ottomata| so
ottomata| for kafka
ottomata| brokers
bblack| the case we're talking about mostly here is VK
ottomata| you don't tell them about the other brokers in config
ottomata| they use zookeeper to find each other
ottomata| but kafka clients are decoupled from zookeeper
bblack| does VK use zookeeper to find its cluster?
bblack| ok, no
ottomata| so you give them a list of bootstrap kafka broker names
ottomata| one is enough
ottomata| and it uses the kafka API (which in turn uses zookeeper) to find the brokers
bblack| scary
bblack| ok
ottomata| its more than just hostnames too!
ottomata| its actually full on kafka topic partition leadership info
bblack| (scary because we have an explicit list of webrequest brokers we configure ipsec for, and someone could ignore that list in changing the zookeeper data...)
ottomata| so if you are consuming from a topic
ottomata| the client gets the list of partitions, and which brokers are leaders for those partitions
ottomata| and uses that to start consuming
ottomata| if something changes (broker goes offline, leadership changes, etc.) the client is notified, and they do a new metadata request to kafka to learn what changed
ottomata| and re-subscribe to new leaders, etc.
bblack| so ideally VK gets the names of 1-2 hosts in each DC to bootstrap from, and ZK is consistent across both and will handle failover?
ottomata| (re ipsec: hopefully we'll be done with that next quarter! :D)
ottomata| yeah, it's best if you can give as many hosts as you can for bootstrapping, in case the 2 you give it are down
ottomata| but it doesn't use the list of bootstrap hosts to actually communicate
ottomata| it only uses them for initial bootstrap on startup
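The bootstrap behavior described above could look roughly like this in a varnishkafka config fragment (assuming varnishkafka's convention of passing librdkafka properties with a `kafka.` prefix; the broker hostnames follow the naming scheme in this paste but the exact list is illustrative):

```
# Illustrative bootstrap-only broker list for varnishkafka.
# librdkafka contacts these hosts only for the initial metadata
# request; afterwards it talks to whichever brokers the cluster
# metadata names as partition leaders.
kafka.metadata.broker.list = kafka2001.codfw.wmnet:9092,kafka2002.codfw.wmnet:9092
```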
bblack| but I guess, it's up to us to use custom tooling or confd to do manual failovers in ZK
ottomata| ya, if we did this, it'd be like:
ottomata| datacenter -> kafka cluster name mapping in confd/etcd.
ottomata| but
ottomata| when rendering config templates with confd
ottomata| we'd use that cluster name to find the broker names in zookeeper
ottomata| i'm logged into zookeeper now, in codfw:
ottomata| [zk: localhost:2181(CONNECTED) 5] ls /kafka/main-codfw/brokers/ids
ottomata| [2002, 2001, 2003]
ottomata| [zk: localhost:2181(CONNECTED) 7] get /kafka/main-codfw/brokers/ids/2001
ottomata| {"jmx_port":9999,"timestamp":"1511878095746","endpoints":["PLAINTEXT://kafka2001.codfw.wmnet:9092"],"host":"kafka2001.codfw.wmnet","version":2,"port":9092}
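Turning that znode payload into a broker list for config rendering (the confd idea mentioned above) is mostly JSON parsing. A minimal sketch, using only the stdlib and the exact payload from the zkCli output above; `broker_endpoint` is a hypothetical helper name:

```python
import json

def broker_endpoint(znode_json):
    """Extract host:port from a Kafka broker registration znode,
    in the format shown in the zkCli output above."""
    data = json.loads(znode_json)
    return "%s:%d" % (data["host"], data["port"])

# Payload copied verbatim from /kafka/main-codfw/brokers/ids/2001:
znode = ('{"jmx_port":9999,"timestamp":"1511878095746",'
         '"endpoints":["PLAINTEXT://kafka2001.codfw.wmnet:9092"],'
         '"host":"kafka2001.codfw.wmnet","version":2,"port":9092}')

print(broker_endpoint(znode))  # kafka2001.codfw.wmnet:9092
```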
bblack| should we do the static (non-zk) config with a list of broker hosts from both DCs though, instead of something switching in confd/etcd?
bblack| and then deal with failover exclusively in the zk world?
ottomata| could do, i suppose it wouldn't hurt, except we'd have to maintain it
ottomata| and the data is already in zookeeper
bblack| well yeah but data inside of zookeeper does no good for the hostlist we need to reach the service that has the zookeeper data
ottomata| then again, i'm maintaining that list now in hiera anyway
ottomata| so moving it from hiera -> confd/etcd isn't that big of a deal i guess
ottomata| oh
ottomata| ya
ottomata| we'd need to discover zookeeper
bblack| clearly, we just need to mirror all the things in a giant circle of hieradata->etcd->zookeeper->automatic-hieradata-commits, and use kafka topics for all the ->
ottomata| hah, which is re-coupling kafka clients with zookeeper
ottomata| hahaha yeah
bblack| but seriously, manual changes to a discovery list in puppet/hieradata shouldn't be a big deal, because it doesn't operationally matter much if they fall out of sync a bit anyways.
bblack| it's not a realtime switch-commit. you're just updating the discovery hints.
ottomata| yeah, and it rarely changes
bblack| some of this stuff really deserves some off-site design discussion time. not really so much the kafka questions in particular, but solidifying our standards and future plans about all the related meta-topics this touches on for multi-dc work and etcd and zookeeper and so-on.
ottomata| aye
ottomata| bblack not sure if you've seen this, but I have some other kafka plans in mind too:
ottomata| this is more app level stuff, but relevant probably to cache purges
ottomata| working on making that a next FY program
ottomata| (slowly :) )
Ottomata created this paste.Dec 14 2017, 3:25 PM