
Split ChangeProp metrics by wiki
Open, NormalPublic

Description

In the parent task, one of the requirements was to be able to watch job queue backlog sizes and other metrics on a wiki-by-wiki basis. Since we have the domain name in every event, we can split up all the metrics by domain and add a template variable to the Grafana dashboard to allow selecting a domain (defaulting to all domains). This will let us drill really deep into analyzing the queue, and it will make it possible to create new kinds of graphs giving us insight into which projects create the most pressure on the queue.
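As a rough sketch of the idea, the domain could become one component of each hierarchical statsd key. The `changeprop` prefix, job name, and metric name below are hypothetical placeholders, not the actual ChangeProp key scheme:

```python
def per_wiki_metric(domain: str, job: str, metric: str) -> str:
    """Build a statsd key with the wiki domain as one path component.

    Dots in the domain are replaced, since statsd treats '.' as a
    hierarchy separator.
    """
    safe_domain = domain.replace('.', '_')
    return f"changeprop.{safe_domain}.{job}.{metric}"

print(per_wiki_metric("en.wikipedia.org", "refreshLinks", "backlog"))
# changeprop.en_wikipedia_org.refreshLinks.backlog
```

A Grafana template variable could then match on the domain segment to filter any graph down to one wiki.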

However, this means multiplying the number of distinct metrics by almost 800, and we already have a lot. We have at least 4 metrics per job type for execution and delay monitoring, around 10 metrics related to Kafka brokers per job type, plus metrics for the Redis connection, deduplication, etc., so multiplying this already significant number of metrics by 800 might break statsd.
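A back-of-envelope estimate makes the scale of the problem concrete. The per-job metric counts are the rough figures from the paragraph above; the number of job types is a placeholder assumption, since the task doesn't state it:

```python
# Rough estimate of the per-wiki metric explosion.
JOB_TYPES = 50            # assumption; the actual count is not given here
METRICS_PER_JOB = 4 + 10  # execution/delay metrics + Kafka broker metrics
WIKIS = 800

current_metrics = JOB_TYPES * METRICS_PER_JOB
per_wiki_metrics = current_metrics * WIKIS
print(current_metrics, per_wiki_metrics)  # 700 vs 560000 distinct keys
```

Even with conservative assumptions, the split turns hundreds of statsd keys into hundreds of thousands.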

@fgiunchedi do you think our metrics reporting infrastructure can sustain such an increase in metric variety? Would switching services to Prometheus improve the situation?

Event Timeline

Pchelolo created this task.Sep 14 2017, 7:34 PM
Restricted Application added a project: Analytics.Sep 14 2017, 7:34 PM
Restricted Application added a subscriber: Aklapper.

This might be something worth sending into Druid (& Pivot), if the point is more about exploration and finding problems as they happen. We are going to work with the Performance Team next quarter to port a bunch of Navtiming metrics over to Druid.

Prometheus is probably cool too! :)

fdans moved this task from Incoming to Radar on the Analytics board.Sep 21 2017, 4:21 PM

(apologies about the delay, I completely missed this!)

Yeah, it is likely statsd isn't going to like an 800x increase. Going with Prometheus and high cardinalities isn't going to make the situation much better, I think (around 1k values per single label is a good practical limit).
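In Prometheus terms, the domain would become a label rather than part of the metric name, but the series count multiplies the same way: every distinct combination of label values is a separate time series. A minimal illustration, with the job-type count an assumption as before:

```python
# Each distinct combination of label values is a separate time series,
# so series per metric name = product of the label cardinalities.
label_cardinalities = {
    "domain": 800,  # one value per wiki domain, already near the ~1k guideline
    "type": 50,     # assumed number of job types
}

series_per_metric = 1
for cardinality in label_cardinalities.values():
    series_per_metric *= cardinality

print(series_per_metric)  # 40000 series for a single metric name
```

So while Prometheus handles labels more gracefully than statsd handles key explosion, the underlying cardinality cost doesn't go away.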

What @Ottomata suggested makes sense, I think: namely, putting the "event data" in Druid/Pivot, so that once we know what the problematic job is, we can drill down per wiki too.