As a WDQS maintainer, I want all metrics that WDQS reports to Prometheus to have a consistent name regardless of the Python version in use, so that I can use the same dashboards for all WDQS nodes.
In https://github.com/prometheus/client_python/commit/a4dd93bcc6a0422e10cfa585048d1813909c6786 counter metrics were forcibly suffixed with _total.
Since the switch to python3 (buster?), all counter metrics have _total appended. This notably includes the blazegraph_lastupdated counter, which is used to monitor the update lag. As a consequence, nodes based on stretch report blazegraph_lastupdated, while nodes based on buster report blazegraph_lastupdated_total.
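The naming difference can be illustrated with a minimal stdlib-only sketch. The helper exposed_counter_name and its appends_total flag are hypothetical names used only for illustration; they model the behavior described above and are not part of the prometheus client API. The handling of a name that already ends in _total is an assumption about the newer client, not something confirmed in this task.

```python
def exposed_counter_name(declared_name: str, appends_total: bool) -> str:
    """Return the sample name a counter would be exposed under.

    appends_total=False models the older client (stretch nodes), which
    exposes the declared name unchanged. appends_total=True models the
    client after the commit linked above (buster nodes).
    """
    if not appends_total:
        return declared_name
    suffix = "_total"
    # Assumption: a name already ending in _total does not get a doubled suffix.
    if declared_name.endswith(suffix):
        declared_name = declared_name[: -len(suffix)]
    return declared_name + suffix


stretch_name = exposed_counter_name("blazegraph_lastupdated", appends_total=False)
buster_name = exposed_counter_name("blazegraph_lastupdated", appends_total=True)
print(stretch_name)
print(buster_name)
```

The same declared counter therefore ends up as two different time series names, which is why a dashboard or alert that queries only blazegraph_lastupdated sees no data from the buster nodes.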
Our proposed solution is to reimage the "old" wdqs instances so they run the latest OS, which will bring the whole wdqs fleet into alignment (all nodes reporting blazegraph_lastupdated_total). Then we just need to change the alert to use the new metric name.
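While the reimage is in progress, both names exist in the fleet. As a rough sketch (not part of the proposal above), a selector that matches either name could be built like this. The function transitional_selector is a hypothetical helper; the resulting string relies on the Prometheus convention that regex matchers on __name__ are fully anchored, and how it would fit into the actual alert expression is left open.

```python
import re

OLD_NAME = "blazegraph_lastupdated"
NEW_NAME = OLD_NAME + "_total"


def transitional_selector(old: str, new: str) -> str:
    """Build a PromQL selector string that matches either metric name."""
    pattern = "|".join(re.escape(name) for name in (old, new))
    return '{__name__=~"%s"}' % pattern


selector = transitional_selector(OLD_NAME, NEW_NAME)
print(selector)

# Local sanity check of the pattern using a full match, mirroring anchored matching.
name_pattern = re.compile("|".join(re.escape(n) for n in (OLD_NAME, NEW_NAME)))
matches_old = name_pattern.fullmatch(OLD_NAME) is not None
matches_new = name_pattern.fullmatch(NEW_NAME) is not None
matches_other = name_pattern.fullmatch("blazegraph_lastupdated_totals") is not None
```

Once every node reports blazegraph_lastupdated_total, a selector like this would no longer be needed and the alert can reference the new name directly.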
AC
- update lag is properly monitored on wdqs1011-wdqs1013
- counter metrics work properly for buster