Page MenuHomePhabricator

Fix aggregation of change-prop exec time metric
Open, MediumPublic

Description

The change-prop exec time metric is composed by the following query:

avg by (rule) (cpjobqueue_normal_rule_processing{quantile="0.5"}) * 1000

the cpjobqueue_normal_rule_processing vector includes the k8s pod name and instance as labels that change during a redeploy or rebalance. The hope was that avg by (rule) will average out all the different pods/instances and give a nice continuous metric. It works OK for rules processed by a single pod - those get aggregated.

However, for partitioned rules, where several pods process the rule simultaneously, the avg aggregation breaks on redeploys - for the latest time period after a redeploy, it correctly averages over all pods, but for time period before a redeploy it returns NaN. Example query:

avg by (rule) (cpjobqueue_normal_rule_processing{quantile="0.5",rule="cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite"}) * 1000

removing the avg aggregation you can see that the metric is there for all time periods, so the aggregation is the problem

cpjobqueue_normal_rule_processing{quantile="0.5",rule="cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite"} * 1000

Event Timeline

I wanted to ping this as functionality that would still be useful. Reviewing the helmfiles for cpjobqueue almost every change in the last 6 months was a change to job concurrency trying to address a problem. I suspect that having a direct view into concurrency would be valuable for most of the investigations that led to concurrency changes. Most of those don't involve partitioning so perhaps it's not as critical in many cases, but still seems important.

A small modification (add >0 to filter NaN) to the query in the ticket description allows it to return plausible values. I'm not clear anymore on how to go from this to per-job concurrency though (the intial ask in T266762). I also suspect this doesn't have enough information. cirrusSearchElasticaWrite is partitioned to write against three different clusters, and concurrency controls are per-partition. We would want the concurrency graphs to represent each of those separately. Ideally those graphs would include the labels from the partitioner, and not just the partition numbers.

avg by (rule) (cpjobqueue_normal_rule_processing{quantile="0.5",rule="cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite"} > 0) * 1000