In the parent task, one of the requirements was to be able to watch job queue backlog sizes and other metrics on a wiki-by-wiki basis. Since we have the domain name in every event, we can split all the metrics by domain and add a template variable to the Grafana dashboard for selecting a domain (defaulting to all domains). This would let us drill deep into analyzing the queue, and would make it possible to create new kinds of graphs showing which projects put the most pressure on the queue.
However, this means multiplying the number of distinct metrics by almost 800, and we already have a lot. We have at least 4 metrics per job type for execution and delay monitoring, around 10 metrics related to Kafka brokers per job type, plus metrics for Redis connections, deduplication, etc. Multiplying an already significant number of metrics by 800 might break statsd.
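To put a rough number on it, here is a back-of-the-envelope sketch. The per-job-type figures (4 and 10) and the factor of 800 come from the estimate above; the number of job types (50) and the count of other metrics are hypothetical placeholders, since I don't have exact figures, so treat the output as an order of magnitude only.

```python
def estimate_metric_count(job_types, other_metrics=0,
                          exec_delay_per_job=4, kafka_per_job=10,
                          domains=800):
    """Return (current_count, per_domain_count) of distinct metrics.

    exec_delay_per_job, kafka_per_job and domains follow the rough
    figures quoted in this comment; job_types and other_metrics are
    assumptions the caller has to supply.
    """
    per_job = exec_delay_per_job + kafka_per_job
    current = job_types * per_job + other_metrics
    # Splitting by domain multiplies every existing metric by the
    # number of domains.
    return current, current * domains


# job_types=50 is a made-up value for illustration only.
current, per_domain = estimate_metric_count(job_types=50)
print(f"now: {current} metrics, split by domain: {per_domain} metrics")
```

Even with a modest number of job types, the per-domain total lands in the hundreds of thousands, which is why I'm worried about statsd.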
@fgiunchedi do you think our metrics reporting infrastructure can sustain such an increase in the number of distinct metrics? Would switching the services to Prometheus improve the situation?