From T238658#5973616 (moving here to reduce noise there)
I'd say yes. We can draft a plan to make sure we don't end up breaking dashboards; below is such an effort.
Taking into account that:
- Services not on kubernetes don't follow any kind of standard as far as dashboard creation goes, but rather each follows its own conventions. Thus they probably don't even use the service label.
- Services on kubernetes use https://grafana.wikimedia.org/d/stpmz_7Wz/template-dashboard?orgId=1&refresh=1m as a basis, adding their own service-specific metrics, and even then they stick to the basic structure.
An off-the-top-of-my-head proposed plan would be:
- Deprecate the service label in service-runner
- Scope this. I'd propose, for the reasons above, just kubernetes services that get deployed via helm. This foregoes all service-runner services that are in other clusters, as they are due to migrate to kubernetes eventually anyway. Those should keep using statsd-exporter as they currently do.
- Clarify in T242861 whether we want a service label in our charts and what it would stand for.
- Add the above label to all charts (and the scaffolding). This probably includes non-service-runner services as well; the kubernetes service label is probably more generic anyway.
- Deploy new versions of the above charts for all services.
- Proceed with the removal of the service label. Services are expected to pick that up as they upgrade dependencies. Non-kubernetes services should again have no problem, as they don't use the service label.
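For illustration only, the chart-level label from the plan above could look something like this; the label key, the values path, and using the chart name as the value are all assumptions pending the outcome of T242861:

```yaml
# Hypothetical excerpt from a chart's deployment template.
# The "service" key and the choice of value are assumptions,
# not a settled convention.
metadata:
  labels:
    app: "{{ .Values.main_app.name }}"  # assumed values path
    # A generic label applied uniformly across charts, so dashboards
    # can keep filtering by service once service-runner drops its own:
    service: "{{ .Chart.Name }}"
```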
Do we have enough information to decide whether or not to exchange this metric for logs emitted when a request takes longer than a configurable time?
I don't feel like I do. In fact, I am not at all sure that exchanging one for the other provides us with enough value. Currently, looking at those graphs can give insights that are not at all easy to obtain with kibana (that might change if kibana becomes more usable and easier to navigate/search). But as things stand, I feel I would lose functionality if we went down this path, to the point that I would sooner give up retention than that level of granularity.
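To make the trade-off concrete, the "logs instead of a metric" option would look roughly like the sketch below: nothing is recorded per request, and only durations above a configurable threshold produce a log line. The names (`timed_request`, `is_slow`) and the threshold value are hypothetical, not anything service-runner provides:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("slow-requests")


def is_slow(elapsed_s: float, threshold_s: float) -> bool:
    # The "configurable time" from the discussion: only durations
    # above the threshold would be surfaced at all.
    return elapsed_s > threshold_s


@contextmanager
def timed_request(route: str, threshold_s: float = 1.0):
    """Emit a log line only for requests slower than threshold_s,
    instead of recording a latency metric for every request."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if is_slow(elapsed, threshold_s):
            logger.warning("slow request on %s: %.3fs", route, elapsed)
```

The cost is exactly the one described above: sub-threshold requests leave no trace, so the latency distribution the graphs currently show is gone.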
Switching to aggregating by status code class (2xx, 4xx, 5xx, etc.) is easy enough. Keep in mind that this will apply to all consumers of hyperswitch, but I don't think that will be a problem.
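The aggregation itself is a one-liner; a minimal sketch (the function name is mine, not hyperswitch's):

```python
def status_class(status_code: int) -> str:
    """Collapse an HTTP status code to its class, e.g. 404 -> "4xx",
    so metrics carry one series per class instead of one per code."""
    return f"{status_code // 100}xx"
```

The cardinality win is real (a handful of series instead of dozens), at the cost of no longer distinguishing, say, 404 from 429 in the metric itself.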
What are we going to aggregate, and where, though? Some more information would be most welcome.