
Migrate webperf from hafnium to webperf1001
Closed, Resolved · Public

Description

This actionable of T158837 is now unblocked.

Roles on hafnium:

  • webperf

Services in webperf:

  • navtiming
  • statsv

For this task:

  • Verify everything on hafnium is puppetized.
  • Figure out how we want to do the migration.

Thoughts

When thinking about migration, we should also think about a future switch-over. Ideally we'd use the same mechanism.

The status quo is that if things fail, we'd presumably remove the role from hafnium, make sure the process is dead, apply the role to the new server, and run puppet right away. However, in order to happen quickly, this "status quo" would require 2 code changes and 2 manual commands on a server.

Ideally we'd figure out a way that also ensures the two instances cannot run at the same time (which would cause duplicate reporting in statsd). The first thing that comes to mind as a possible way to ensure de-duplication is to use a Kafka consumer group. That way Kafka is responsible for only ever sending the same message to one of our webperf servers. The only downside is that if we do that by default, we will also by default consume Kafka across DCs (and send data back to statsd across DCs), which isn't good for latency. That's fine for edge cases during a switchover, but doesn't seem useful as the default state.
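As a rough illustration of the consumer-group idea, a minimal sketch using the kafka-python client (the broker address, group name and handler here are hypothetical; the topic name is the processed NavigationTiming topic discussed below):

```
# Sketch only: broker address and group name are hypothetical.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging_NavigationTiming',                   # example topic (see discussion below)
    bootstrap_servers=['kafka1001.eqiad.wmnet:9092'],  # hypothetical broker
    group_id='webperf_navtiming',                      # shared group: each message goes to one member only
)

for message in consumer:
    print(message.value)                               # placeholder for the actual navtiming processing
```

With both webperf hosts using the same group_id, Kafka assigns each topic partition to only one group member, which is what provides the de-duplication.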

Alternatively, we could have both machines enable the role by default, but have some kind of switch in the code that only controls whether the service is running or not. We'd only keep the primary one running, and then make one code change to switch which one is running. During the switch, one of them may start before the other stops, which Kafka would mediate in a sensible way.

Event Timeline

Krinkle triaged this task as Unbreak Now! priority. Feb 8 2018, 7:25 AM
Krinkle created this task.
Krinkle lowered the priority of this task from Unbreak Now! to Needs Triage. Feb 8 2018, 7:25 AM
Krinkle updated the task description.

I like the idea of letting Kafka handle this since it is, after all, built to handle exactly this situation...

Realistically, why does latency matter? I'd expect that the normal case would end up being a consumer in DC-A (say, eqiad) that's processing 10X messages per minute, while the consumer in DC-B (codfw) is processing only X messages.

Realistically, why does latency matter? I'd expect that the normal case would end up being a consumer in DC-A (say, eqiad) that's processing 10X messages per minute, while the consumer in DC-B (codfw) is processing only X messages.

Ah, that assumes each webperf instance could subscribe to navtiming events received by the same DC only. I like that idea, but as far as I know that's not currently possible.

The Varnish front-ends (in all DCs+POPs) receive EventLogging beacons and a server-local varnishkafka instance writes them to Kafka. I'm not sure whether we write to a local Kafka or to the primary one, but regardless, the EventLogging service is not active-active. The primary EventLogging cluster (eg. in eqiad) consumes all raw EventLogging data (eg. eventlogging-raw) and processes the stream (schema validation, normalisation, add extra fields, strip private fields) and then writes the valid events in their normalised form back to a Kafka topic (eg. eventlogging_NavigationTiming).

That is the (single) topic our webperf instance(s) have to consume from. And if we did that plainly from both DCs, with the same webperf/navtiming consumer ID, I'd expect Kafka to somewhat evenly balance the load to both consumers. Which means half our webperf data would go across DCs, twice, which increases the likelihood of related data ending up in different buckets.

I believe @Ottomata has some ideas about making EventLogging multi-dc, but I'm not familiar with the details. If we do that, though, we still need a strategy for how the eqiad/webperf consumer takes over responsibility for the codfw-navtiming messages if the codfw/webperf consumer is failing. If they're treated as separate consumers, consuming separate dc-specific topics, the automatic balancing won't happen, which also means messages won't automatically be rerouted to another consumer when one is unresponsive or lagging.

That is the (single) topic our webperf instance(s) have to consume from. And if we did that plainly from both DCs, with the same webperf/navtiming consumer ID, I'd expect Kafka to somewhat evenly balance the load to both consumers. Which means half our webperf data would go across DCs, twice, which increases the likelihood of related data ending up in different buckets.

AFAIK, Kafka doesn't do any balancing across consumers. Each consumer requests the next message as soon as it's processed the last, so if one consumer is running more slowly than the other, it will simply process fewer messages.

So, we'd have a consumer running in each DC, both connected to the same Kafka topic, which receives all messages. Presuming that one consumer is able to keep up with the rate of messages coming in, two consumers won't be slower, regardless of the cross-DC latency.

AFAIK, Kafka doesn't do any balancing across consumers

It does across consumers in the same consumer-group. If they are in different consumer-groups, then they are different consumers.

If you ran a process in eqiad and a process in codfw in the same consumer group (and your topic had more than one partition), each process would in usual operation get an equal split of the messages.

If you run these 2 processes in different consumer groups, each process would get all messages.

Oh right, I forgot about the partition/consumer connection. Been a couple of years, had to clear the cobwebs...

Is there anything (in etcd or otherwise) that indicates which data center is "live" at a given time? We could run in different consumer groups in both locations, but simply discard the messages coming in to the secondary data center.

(What I'm really saying is, I really don't like the idea of having to push code or run puppet in order to have resiliency for this (or any other) service. Let's make a choice up front that allows for the assumption of failure at any time.)

As part of Event Data Platform, event intake will become standardized for analytics and production events, and as such will be multi-DC first. For EventBus now (which is in both DCs), consumers can choose to either process only events that originated in their datacenter, or to process all events (by choosing which topics to subscribe to).

Since this service runs on EventLogging (analytics), which is only in eqiad, you might want to just hold off until Event Data Platform settles before figuring out how to make it multi DC.

Or, you could just have the codfw consumer running in the same consumer group, and not worry about the extra bit of latency.

[..] So, we'd have a consumer running in each DC, both connected to the same Kafka topic, which receives all messages. Presuming that one consumer is able to keep up with the rate of messages coming in, two consumers won't be slower, regardless of the cross-DC latency.

The processing won't be slower, indeed. With 1 active Kafka cluster and two consumers that are intentionally identified as the same logical consumer group, the central Kafka manages the shared offset pointer and guarantees not to hand out the same message twice to the same group, which would otherwise break metrics (this shared-consumer principle is how we usually consume from Kafka). My concern was:

  • This approach is the "best" in terms of automatic switchover (there is nothing to switch over) and "best" in terms of fault tolerance. If one lags behind, doesn't ack anymore, is just generally slow, or goes down entirely, nothing bad happens: Kafka will naturally balance its portion of the messages to the other consumer(s).
  • But, this approach also means that both DCs are active by default, which I'm not entirely comfortable with right now, but maybe that's just something to let go of? The main reason I'd like to avoid it is that it means that by default half the traffic goes cross-dc twice: from Kafka to the "other" webperf consumer, and from there back to the main statsd. That seems inefficient and somewhat fragile, especially with StatsD going over UDP.

The main active/active principle isn't an issue, but I'd prefer to get there bottom-up, e.g. have statsd and Kafka active/active first, and then have this work on top of that. Until then, it'd make more sense to have the 99% scenario (e.g. anytime except during an automatic switchover) stay within a DC, given both the dependency and the dependant are also there, which reduces various risk factors and complications.

  • But, this approach also means that both DCs are active by default, which I'm not entirely comfortable with right now, but maybe that's just something to let go of? The main reason I'd like to avoid it is that it means that by default half the traffic goes cross-dc twice: from Kafka to the "other" webperf consumer, and from there back to the main statsd. That seems inefficient and somewhat fragile, especially with StatsD going over UDP.

The main active/active principle isn't an issue, but I'd prefer to get there bottom-up, e.g. have statsd and Kafka active/active first, and then have this work on top of that. Until then, it'd make more sense to have the 99% scenario (e.g. anytime except during an automatic switchover) stay within a DC, given both the dependency and the dependant are also there, which reduces various risk factors and complications.

My concern here is that if we don't start doing things active/active, we'll never get comfortable with them. UDP over a private backhaul isn't particularly fragile, and even if it were, the actual implication of missing data is minimal. This is a great use case for getting comfortable!

That said: T149617 says that etcd is exposing a config variable that indicates the master data center, so the "discard if in secondary" option is one we can take. (And that should be easy, too...)

Also, reminds me, we need to make sure that the parameters for kafka_brokers and statsd destination are injected through Puppet in such a way that automatically notifies/restarts the service when they change. As far as I know, Puppet doesn't track Hiera variables over time, so there's no detection there, but as long as they're written to disk somewhere, change detection (notify) should work. Given we invoke the service via systemd unit files, I'd assume this all just works as intended, but would be good to verify at least once after we've got both servers up and running (maybe by making a no-op change to the kafka_brokers list at some point).

That said: T149617 says that etcd is exposing a config variable that indicates the master data center, so the "discard if in secondary" option is one we can take. (And that should be easy, too...)

Hm.. the easiest would be to have the servers work as separate consumers (not related), and have one consume in a no-op fashion through the etcd var. However, that wouldn't result in an atomic switchover. Without a safeguard of some sort, they'd notice the etcd change at different times and quite likely cause duplication for a short time.

Instead, we could keep them in the same logical consumer group but somehow make one not connect in the first place unless it's active. Either by adding an etcd-polling loop around the navtiming code (iterate every 1-5 min or something), or by making Puppet start/stop the service (limited to 30 min intervals).
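A rough sketch of the etcd-polling option, assuming the python-etcd client suggested later in this task; the etcd endpoint, key path and local-DC constant are hypothetical, and the consume step is just a placeholder for the existing navtiming loop:

```
# Sketch only: endpoint, key and DC name are hypothetical.
import time
import etcd  # python-etcd

LOCAL_DC = 'eqiad'  # per-host value, e.g. set via Puppet

client = etcd.Client(host='etcd.example.wmnet', port=2379)

def master_dc():
    return client.read('/example/master_datacenter').value

while True:
    if master_dc() == LOCAL_DC:
        # Placeholder: run the navtiming Kafka consumer for a bounded slice,
        # then fall through and re-check whether we are still the master DC.
        print('consuming: this DC is the master')
    else:
        print('idle: %s is the master DC' % master_dc())
    time.sleep(60)  # re-check etcd roughly every minute
```

Because both hosts would still share the same consumer group, a brief overlap while the gate flips would be mediated by Kafka rather than produce duplicate reporting.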

Anyway, I agree we should try active/active. In addition, it also means we can finally deploy changes to webperf/navtiming without producing small gaps in the Graphite metrics, because our Puppet runs wouldn't happen exactly simultaneously, and so during a deploy the other instance naturally takes over its portion for a few seconds. (Note: We intentionally don't continue from the last offset on start-up because we can only write to the current minute with statsd.)

from there back to the main statsd

IIUC, you could send to the local-DC statsd, no? (Is there one?) Graphite replicates metrics both ways, correct? Oh, or do you need a single statsd instance to get the metrics so statsd can do its aggregation properly?

UDP over a private backhaul isn't particularly fragile

Ha, remember udp2log? We sent hundreds of thousands of UDP messages/sec between DCs...and were only about 4% lossy! :p

from there back to the main statsd

IIUC, you could send to the local-DC statsd, no? (Is there one?) Graphite replicates metrics both ways, correct?

For all intents and purposes, statsd is not multi-dc. All servers have to write to the same load-balanced cluster of statsd instances. I assume the same applies to Graphite.

If we do have replication going in both ways (I don't know), it's presumably intended as a hot standby so that we can immediately start writing to the local one after a switch over. But at any single point in time both DCs should write to the same Graphite, otherwise we'd have random metrics overwritten through replication from the mostly-dormant secondary DC.

http://python-etcd.readthedocs.io/en/latest/ seems like a reasonable python-etcd client, which would in turn allow us to switch based on the master datacenter parameter.

Change 392030 had a related patch set uploaded (by Krinkle; owner: Dzahn):
[operations/puppet@production] webperf1001/2001 start using webperf role

https://gerrit.wikimedia.org/r/392030

I've associated the draft patch from last year with this task:

[operations/puppet@production] webperf1001/2001 start using webperf role
https://gerrit.wikimedia.org/r/392030

Documenting things to make sure of from perf meeting this week:

  • Making sure that under normal conditions, only 1 instance is live, and that it is in the primary DC (eg. eqiad).
  • Making sure that it is safe to have multiple instances running simultaneously. This can happen during a switchover, in the window between the two DCs each checking etcd.

For the latter, we need to make sure that all instances consume from the same Kafka cluster with the same consumer group, and that they produce to the same statsd address. With that, if two instances run in parallel for any reason, either only one will get messages from Kafka (given a single partition), or (given two topics: NavTiming+SaveTiming) no two instances will process the same messages.

Even if both get messages for the same topic (if it gets partitioned), the data points should merge just fine via Statsd.
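For reference, the merging works because statsd aggregates everything it receives for a metric within a flush interval, regardless of which sender it came from. A minimal sketch of the fire-and-forget UDP write, with a hypothetical statsd address and metric name:

```
# Sketch only: statsd host and metric name are hypothetical.
import socket

STATSD = ('statsd.example.wmnet', 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_timing(metric, ms):
    # statsd line protocol: "<name>:<value>|ms" records one timer sample.
    sock.sendto('{}:{}|ms'.format(metric, ms).encode('utf-8'), STATSD)

# Samples sent by either webperf instance land in the same bucket.
send_timing('frontend.navtiming.responseStart', 230)
```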

One thing to double check is that the webperf/navtiming consumer does not manually commit offsets and (given Statsd) also cannot resume from a stored offset. However, it should be possible to achieve the following (see the sketch after this list):

  • Autocommit offsets.
  • Don't resume from stored offset on start-up.
  • Abide by the consumer group so that multiple instances don't get the same data.
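A possible kafka-python configuration for that combination (broker address and group name are hypothetical; seeking to the end after the first poll is one way to ignore any previously committed offset on start-up):

```
# Sketch only: broker address and group name are hypothetical.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging_NavigationTiming',
    bootstrap_servers=['kafka1001.eqiad.wmnet:9092'],
    group_id='webperf_navtiming',   # abide by the consumer group
    enable_auto_commit=True,        # autocommit offsets
    auto_offset_reset='latest',     # with no committed offset, start from "now"
)

# Don't resume from a stored offset on start-up: once partitions are assigned,
# jump to the end and only process new messages.
consumer.poll(timeout_ms=0)
consumer.seek_to_end()

for message in consumer:
    print(message.value)            # placeholder for the navtiming -> statsd processing
```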

Change 392030 abandoned by Dzahn:
webperf1001/2001 start using webperf role

Reason:
per IRC. it will be done differently, moving navtiming/coal separately and webperf1001 before webperf2001

https://gerrit.wikimedia.org/r/392030

Change 429242 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] webperfX001: start using the webperf role

https://gerrit.wikimedia.org/r/429242

Change 429252 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] Make webperf role install coal things

https://gerrit.wikimedia.org/r/429252

Change 429242 merged by Dzahn:
[operations/puppet@production] webperfX001: start using the webperf role

https://gerrit.wikimedia.org/r/429242

Change 429825 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] Remove references to hafnium

https://gerrit.wikimedia.org/r/429825

Change 429825 merged by Dzahn:
[operations/puppet@production] Remove references to hafnium

https://gerrit.wikimedia.org/r/429825

@Operations - would love to get a merge on https://gerrit.wikimedia.org/r/#/c/429252/ when someone gets a chance.

Change 429252 merged by Ottomata:
[operations/puppet@production] Make webperf role install coal things

https://gerrit.wikimedia.org/r/429252

Change 430421 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] coal: require python-tz

https://gerrit.wikimedia.org/r/430421

@Operations - ottomata got the prior CR merged fine. https://gerrit.wikimedia.org/r/#/c/430421/ needs to go as well, but can happen any time.

Mentioned in SAL (#wikimedia-operations) [2018-05-03T13:05:23Z] <godog> stop and mask coal service on graphite hosts - T186774