Simplify navtiming multi-dc logic
Open, Needs Triage, Public

Assigned To

None

Authored By

	fgiunchedi
	May 16 2023, 1:07 PM

Description

[summary of a chat with @Krinkle]

Currently, and for historical reasons, navtiming follows the MediaWiki main/master site via etcd to establish where to consume from (eqiad/codfw), and sends to statsd after processing.

The parent task's goal is to establish a Prometheus processor for navtiming, and there's significant progress towards that. Since Prometheus polls both navtiming sites at all times, and there's no danger of double-counting, in a Prometheus-only future we can simplify how navtiming operates, specifically:

navtiming consumes from kafka-jumbo, which is eqiad-only at all times (this is true today, and will stay true)
Use a single consumer group for the eqiad and codfw navtiming processes, so that load is spread across all navtiming processes
When one of the webperf hosts needs to go down for maintenance or similar, no action is required: the other host automatically picks up the slack via the single consumer group
No etcd following is required any more: eqiad and codfw webperf hosts operate exactly the same (even though both consume from eqiad)
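
To illustrate the single consumer group idea, here is a minimal sketch in plain Python. It is not navtiming's code and not the Kafka client's real assignor: it simply models a group as a list of live members and deals partitions out round-robin. The host names and the partition count of 6 are made up for illustration.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Deal partitions round-robin across the live members of ONE consumer group.

    Simplified model for illustration only; a real Kafka group uses a
    configurable assignor and the group coordinator to do this.
    """
    ordered = sorted(members)
    assignment: Dict[str, List[int]] = {m: [] for m in ordered}
    if not ordered:
        return assignment
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


# Hypothetical values: 6 partitions, one navtiming process per site.
partitions = list(range(6))

# Both hosts up: the load is split between them.
both_up = assign_partitions(partitions, ["webperf-eqiad", "webperf-codfw"])

# eqiad host down for maintenance: codfw picks up every partition.
codfw_only = assign_partitions(partitions, ["webperf-codfw"])

print(both_up)
print(codfw_only)
```

With both members present, each host is assigned half of the partitions; once one member is dropped from the list, the remaining host is assigned all of them, which is the "no action required" behaviour described above.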

Related Objects

Status	Assigned	Task
Open	None	T228380 Tech debt: sunsetting of Graphite
Open	None	T205870 Fully migrate producers off statsd
Open	None	T319329 Expand navigation timing metrics to include user experience metrics and modernise navigation timing
Open	None	T175087 Create a navtiming processor for Prometheus
Open	None	T336764 Simplify navtiming multi-dc logic

Event Timeline

fgiunchedi created this task. May 16 2023, 1:07 PM

Would it be possible to run navtiming in 'active/active single compute' multi dc setup?

That is:

navtiming@eqiad consumes from kafka main-eqiad, but only eqiad prefixed topics, and produces to eqiad prometheus.
navtiming@codfw consumes from kafka main-codfw, but only codfw prefixed topics, and produces to codfw prometheus.

Then there is no need to 'follow' mediawiki. Aggregate metrics can be calculated using thanos.
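
A rough sketch of what this per-site wiring could look like, purely for illustration. `SiteConfig`, `site_config` and `topics_for` are hypothetical names, and the `<site>.` topic prefix and the example topic names are assumptions, not taken from navtiming.

```python
from dataclasses import dataclass
from typing import List

SITES = ("eqiad", "codfw")


@dataclass(frozen=True)
class SiteConfig:
    kafka_cluster: str    # which Kafka main cluster to consume from
    topic_prefix: str     # only topics with this prefix are consumed (assumed "<site>.")
    prometheus_site: str  # which Prometheus site picks up the metrics


def site_config(site: str) -> SiteConfig:
    """Derive everything from the local site name; no etcd lookup involved."""
    if site not in SITES:
        raise ValueError(f"unknown site: {site!r}")
    return SiteConfig(
        kafka_cluster=f"main-{site}",
        topic_prefix=f"{site}.",
        prometheus_site=site,
    )


def topics_for(site: str, all_topics: List[str]) -> List[str]:
    """Keep only the topics prefixed with the local site."""
    prefix = site_config(site).topic_prefix
    return [t for t in all_topics if t.startswith(prefix)]


# Hypothetical topic names, for illustration only.
example_topics = ["eqiad.example.topic", "codfw.example.topic"]
print(site_config("codfw"))
print(topics_for("eqiad", example_topics))
```

Each process depends only on the site it runs in, so neither side needs to know which site MediaWiki currently treats as primary.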

In T336764#8855608, @Ottomata wrote:

Would it be possible to run navtiming in 'active/active single compute' multi dc setup?

That is:

navtiming@eqiad consumes from kafka main-eqiad, but only eqiad prefixed topics, and produces to eqiad prometheus.

navtiming@codfw consumes from kafka main-codfw, but only codfw prefixed topics, and produces to codfw prometheus.

navtiming consumes from kafka-jumbo, I lack the context as to why. But other than that, that's indeed the plan I outlined in the description

navtiming consumes from kafka-jumbo, I lack the context as to why. But other than that, that's indeed the plan I outlined in the description

I think it's different? I'm suggesting that navtiming in each DC consume from Kafka main in its own DC, reading only that DC's prefixed topics. Because they'd be using different Kafka clusters (main-eqiad or main-codfw), they'd effectively be in distinct 'consumer groups'. If navtiming@eqiad is totally turned off, no eqiad messages would be processed while it is offline. This could be mitigated by doing a rolling restart of navtiming@eqiad with multiple processes in the same consumer group, as you say.

Basically, make navtiming fully and distinctly multi DC by using Kafka main instead of kafka jumbo-eqiad.

In T336764#8856325, @Ottomata wrote:

navtiming consumes from kafka-jumbo, I lack the context as to why. But other than that, that's indeed the plan I outlined in the description

I think it's different? I'm suggesting that navtiming in each DC consume from Kafka main in its own DC, reading only that DC's prefixed topics. Because they'd be using different Kafka clusters (main-eqiad or main-codfw), they'd effectively be in distinct 'consumer groups'. If navtiming@eqiad is totally turned off, no eqiad messages would be processed while it is offline. This could be mitigated by doing a rolling restart of navtiming@eqiad with multiple processes in the same consumer group, as you say.

Note that as of today there's a single navtiming process per site

Basically, make navtiming fully and distinctly multi DC by using Kafka main instead of kafka jumbo-eqiad.

Navtiming consumes from eventlogging_ topics; are those available in kafka-main too? I don't know offhand where/how those topics are generated. At any rate, moving from jumbo to main seems bigger in scope; we can use the single consumer group solution as a stepping stone though!

Navtiming consumes from eventlogging_ topics

OH! right. No that is not available in Kafka main.

Carry on then. :)

larissagaulia moved this task from Inbox, needs triage to Backlog: Future Goals, non-prioritized on the Performance-Team board. May 22 2023, 6:39 PM

Krinkle removed a project: Performance-Team. Aug 17 2023, 2:54 PM

Krinkle removed subscribers: Gilles, dpifke, Krinkle.
