
Some wdqs metrics changed when switching to python3
Closed, Resolved, Public

Description

As a WDQS maintainer, I want all metrics that WDQS reports to Prometheus to have a consistent name regardless of the Python version in use, so that I can use the same dashboards for all WDQS nodes.

In https://github.com/prometheus/client_python/commit/a4dd93bcc6a0422e10cfa585048d1813909c6786 counter metrics were forcibly suffixed with _total.
Since the switch to Python 3 (buster?), all counter metrics have _total appended, notably the blazegraph_lastupdated counter, which is used to monitor update lag. As a consequence, nodes based on stretch report blazegraph_lastupdated, while nodes based on buster report blazegraph_lastupdated_total.
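The renaming can be modelled in a few lines. This is only a simplified stand-in for the suffix rule introduced by the commit linked above, not the prometheus_client code itself; the function name exposed_sample_name is made up for illustration, and the gauge case is included only for contrast.

```python
def exposed_sample_name(name, metric_type):
    """Return the sample name a scrape would show under the simplified rule.

    Assumption: counters always end up with exactly one "_total" suffix,
    while gauges are exposed under the name they were declared with.
    """
    suffix = "_total"
    if metric_type == "counter":
        # Strip an existing suffix first so it is never doubled.
        base = name[: -len(suffix)] if name.endswith(suffix) else name
        return base + suffix
    if metric_type == "gauge":
        return name
    raise ValueError("unsupported metric type: %s" % metric_type)


# Declared as a counter, the lag metric is renamed on newer client versions.
print(exposed_sample_name("blazegraph_lastupdated", "counter"))
# -> blazegraph_lastupdated_total

# Declared as a gauge, the name is left untouched.
print(exposed_sample_name("blazegraph_lastupdated", "gauge"))
# -> blazegraph_lastupdated
```

Under this model, the same exporter code yields two different series names depending only on which client library version the host has installed, which matches the stretch/buster split seen here.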

Our proposed solution is to reimage the "old" wdqs instances onto the latest OS, which will bring the whole wdqs fleet into alignment (every node will report blazegraph_lastupdated_total). Then we just need to change the alert to use the new metric name.

AC

  • update lag is properly monitored on wdqs1011-wdqs1013
  • counter metrics work properly for buster

Event Timeline

Restricted Application added a subscriber: Aklapper.

We could also upgrade all nodes to buster. Since the new instances are already running on Buster, this should be easy enough (and needs to be done anyway).

RKemper updated the task description.

Example of how to reimage a host (run from a cumin host): sudo -i wmf-auto-reimage-host --conftool -p T269204 wdqs2004.codfw.wmnet

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

wdqs2004.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012032006_ryankemper_9499_wdqs2004_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2020-12-03T20:08:13Z] <ryankemper> T269204 Re-imaging wdqs2004 to upgrade it to buster: sudo -i wmf-auto-reimage-host --conftool -p T269204 wdqs2004.codfw.wmnet

Completed auto-reimage of hosts:

['wdqs2004.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012050009_ryankemper_15678_wdqs1004_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

wdqs2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012050009_ryankemper_26827_wdqs2001_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2020-12-05T00:09:52Z] <ryankemper> T269204 reimaging the following instances to debian buster: wdqs1004, wdqs2001, wdqs1003

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012050010_ryankemper_15827_wdqs1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wdqs1004.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2001.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs1003.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

wdqs2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012050404_ryankemper_7260_wdqs2005_codfw_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012050404_ryankemper_23858_wdqs1008_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1005.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012050404_ryankemper_23845_wdqs1005_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

wdqs2002.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012050404_ryankemper_7306_wdqs2002_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2020-12-05T04:05:33Z] <ryankemper> T269204 reimaging the following instances to debian buster (one each from [public, internal] x [eqiad, codfw]): wdqs1005, wdqs2002, wdqs1008, wdqs2005

Completed auto-reimage of hosts:

['wdqs1008.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs1005.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2002.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2005.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

wdqs2006.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012071937_ryankemper_1518_wdqs2006_codfw_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

wdqs2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012071937_ryankemper_1511_wdqs2003_codfw_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012071937_ryankemper_31984_wdqs1009_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012071937_ryankemper_31916_wdqs1006_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2020-12-07T19:38:10Z] <ryankemper> T269204 reimaging the following instances to debian buster => eqiad public:wdqs1006, codfw public:wdqs2003, codfw internal:wdqs2006, test:wdqs1009

Completed auto-reimage of hosts:

['wdqs1006.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs1009.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2003.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2006.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

wdqs2008.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012080055_ryankemper_29846_wdqs2008_codfw_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

wdqs2007.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012080055_ryankemper_29823_wdqs2007_codfw_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012080054_ryankemper_29304_wdqs1010_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012080054_ryankemper_29291_wdqs1007_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2020-12-08T00:56:09Z] <ryankemper> T269204 reimaging the following instances to debian buster: eqiad public->wdqs1007, codfw public->wdqs2007, codfw internal->wdqs2008, test->wdqs1010

Completed auto-reimage of hosts:

['wdqs1010.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs1007.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2008.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2007.codfw.wmnet']

and were ALL successful.

Change 646888 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] wdqs: Counters now must end in _total

https://gerrit.wikimedia.org/r/646888

Change 646888 merged by Ryan Kemper:
[operations/puppet@production] wdqs: Switch lag metric to be a gauge

https://gerrit.wikimedia.org/r/646888

Mentioned in SAL (#wikimedia-operations) [2020-12-14T18:25:21Z] <ryankemper> T269204 Restarting wdqs-blazegraph prometheus exporter across all wdqs instances: sudo cumin -b 12 'P{wdqs*}' 'sudo systemctl restart prometheus-blazegraph-exporter-wdqs-blazegraph.service'

Restarting prometheus-blazegraph-exporter-wdqs-blazegraph.service after switching from Counter to Gauge now shows the correct blazegraph_lastupdated metric when running curl localhost:9193 (9193 is the port for wdqs-blazegraph).
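The manual curl check above could also be scripted. The following is a rough sketch using only the Python standard library; the helper names metric_names and fetch_metric_names are hypothetical, the URL simply mirrors the curl localhost:9193 call, and the sample text and its values are invented for illustration rather than taken from a real host.

```python
import urllib.request


def metric_names(exposition_text):
    """Collect the sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample name ends at the first "{" (labels) or whitespace.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def fetch_metric_names(url="http://localhost:9193/", timeout=5.0):
    """Scrape the exporter once (assumed to answer on this URL)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return metric_names(response.read().decode("utf-8"))


# Invented sample resembling what the exporter might return after the change.
sample = """\
# TYPE blazegraph_lastupdated gauge
blazegraph_lastupdated 1607970000.0
"""
names = metric_names(sample)
print("blazegraph_lastupdated" in names)        # -> True
print("blazegraph_lastupdated_total" in names)  # -> False
```

On a wdqs host, calling fetch_metric_names() in place of the sample text would perform the same check against the live exporter.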

Still waiting to see if the downstream consumers of this metric are working properly now, but that *should* have fixed it.