Simplify navtiming multi-dc logic
Open, Needs Triage, Public

Description

[summary of a chat with @Krinkle]

Currently, and for historical reasons, navtiming follows the mediawiki main/master site via etcd to establish where to consume from (eqiad/codfw), and sends to statsd after processing.

The parent task's goal is to establish a Prometheus processor for navtiming, and there's significant progress towards that. Since Prometheus polls both navtiming sites at all times, and there's no danger of double-counting, in a Prometheus-only future we can simplify how navtiming operates, specifically:

  • navtiming consumes from kafka-jumbo, which is eqiad-only at all times (this is true today, and will stay true)
  • Use a single consumer group for the eqiad and codfw navtiming processes; this way load is effectively spread among all navtiming processes
  • When one of the webperf hosts needs to go down for maintenance or similar, no action is required: the other host will automatically pick up the slack via the single consumer group
  • No more etcd following is required; eqiad and codfw webperf hosts operate exactly the same (even though both are consuming from eqiad)
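
To illustrate the failover behaviour described above, here is a minimal sketch in Python. It is a toy simulation of how a shared consumer group spreads partitions over its live members, not navtiming's code and not a Kafka client; in a real deployment the Kafka group coordinator and the client's configured assignor do this. The member names and the partition count are hypothetical.

```python
from typing import Dict, List


def assign_partitions(members: List[str], partitions: List[int]) -> Dict[str, List[int]]:
    """Spread partitions round-robin over the live members of one consumer group.

    Toy model only: real Kafka clients negotiate this through the group
    coordinator whenever a member joins or leaves (a "rebalance").
    """
    live = sorted(members)
    assignment: Dict[str, List[int]] = {member: [] for member in live}
    if not live:
        return assignment
    for index, partition in enumerate(sorted(partitions)):
        assignment[live[index % len(live)]].append(partition)
    return assignment


# Hypothetical: one topic with 6 partitions, one navtiming process per site.
partitions = list(range(6))

# Both sites up: the load is split between them.
both = assign_partitions(["webperf-eqiad", "webperf-codfw"], partitions)

# eqiad host down for maintenance: codfw takes over every partition.
only_codfw = assign_partitions(["webperf-codfw"], partitions)
```

With both members present each one owns half of the partitions; when one leaves, the remaining member is assigned all of them without any operator action, which is the property the single consumer group relies on.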

Event Timeline

Would it be possible to run navtiming in an 'active/active single compute' multi-DC setup?

That is:

  • navtiming@eqiad consumes from kafka main-eqiad, but only eqiad prefixed topics, and produces to eqiad prometheus.
  • navtiming@codfw consumes from kafka main-codfw, but only codfw prefixed topics, and produces to codfw prometheus.

Then there is no need to 'follow' mediawiki. Aggregate metrics can be calculated using Thanos.
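
A rough sketch of that idea in Python, under stated assumptions: the topic names are made up for illustration, the `<dc>.` prefix convention is assumed from the description above, and the final sum merely stands in for the kind of cross-site aggregation that would actually be done with a query in Thanos.

```python
from typing import Dict, List


def topics_for_dc(dc: str, all_topics: List[str]) -> List[str]:
    """Return only the topics carrying this datacenter's prefix."""
    prefix = dc + "."
    return [topic for topic in all_topics if topic.startswith(prefix)]


def global_total(per_dc_counts: Dict[str, float]) -> float:
    """Combine per-site counters into one global figure (stand-in for a Thanos query)."""
    return sum(per_dc_counts.values())


# Hypothetical topic names, for illustration only.
all_topics = [
    "eqiad.example.navigation_timing",
    "codfw.example.navigation_timing",
    "eqiad.example.paint_timing",
]

eqiad_topics = topics_for_dc("eqiad", all_topics)
codfw_topics = topics_for_dc("codfw", all_topics)

# Each site only counts what it processed locally; the global view is the sum.
total = global_total({"eqiad": 120.0, "codfw": 80.0})
```

Because each site processes a disjoint set of topics, the per-site counters never overlap, so summing them does not double-count.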

navtiming consumes from kafka-jumbo; I lack the context as to why. But other than that, that's indeed the plan I outlined in the description.

I think it's different? I'm suggesting that navtiming in each DC consume from Kafka main in its own DC, reading only that DC's prefixed topics. Because they'd be using different Kafka clusters (main-eqiad or main-codfw), they'd effectively be in distinct 'consumer groups'. If navtiming@eqiad is totally turned off, no eqiad messages would be processed while it is offline. This could be mitigated by doing a rolling restart of navtiming@eqiad with multiple processes in the same consumer group, as you say.

Basically, make navtiming fully and distinctly multi-DC by using Kafka main instead of kafka jumbo-eqiad.

Note that as of today there's a single navtiming process per site.

Navtiming consumes from eventlogging_ topics; are those available in kafka-main too? I don't know offhand where or how those topics are generated. At any rate, moving from jumbo to main seems bigger in scope; we can use the single consumer group solution as a stepping stone though!

Navtiming consumes from eventlogging_ topics

Oh, right! No, that is not available in Kafka main.

Carry on then. :)