Upgrade production prometheus-node-exporter to >= 0.16
Closed, Resolved (Public)

Description

This task tracks the upgrade of node-exporter to >= 0.16. Note that we want to upgrade to the version that matches Debian Buster, to ease migration (the Buster tracking task is T213527).

Prior upgrades: T152580 and T166561. In this case, however, a number of metrics have been renamed, so we'll have to stay backwards compatible with the old names at least for a little while; see also https://github.com/prometheus/node_exporter/blob/v0.16.0/docs/V0_16_UPGRADE_GUIDE.md.

Proposed plan of attack:

  • Make sure Buster/Stretch/Jessie all have the same node-exporter version available (in practice this means backporting 0.17 to jessie, which shouldn't pose particular problems)
  • Test the upgrade in beta:
    • Deploy the compatibility recording rules above
    • Upgrade node-exporter
    • Verify dashboards still report data as expected under old names
  • Extend the upgrade to production, one site at a time (if easy to do)
  • (Optional, can be postponed) audit/change dashboards to use new metric names, and retire compatibility recording rules -- moved to new task

Details

Repo | Branch | Lines +/-
operations/puppet | production | +2 -0
operations/puppet | production | +1 -0
operations/puppet | production | +1 -0
operations/puppet | production | +32 -88
operations/puppet | production | +37 -83
operations/puppet | production | +1 -0
operations/puppet | production | +12 -8
operations/puppet | production | +26 -44
operations/puppet | production | +2 -0
operations/puppet | production | +1 -0
operations/puppet | production | +15 -0
operations/puppet | production | +1 -1
operations/puppet | production | +2 -0
operations/puppet | production | +1 -1
operations/puppet | production | +1 -0
operations/puppet | production | +7 -1
operations/puppet | production | +52 -34
operations/puppet | production | +1 -1
operations/puppet | production | +2 -2
operations/puppet | production | +1 -0
operations/puppet | production | +1 -1
operations/puppet | production | +86 -16
operations/puppet | production | +385 -2
operations/puppet | production | +2 -2
operations/puppet | production | +595 -0

Event Timeline

fgiunchedi triaged this task as Medium priority. Jan 14 2019, 2:57 PM
fgiunchedi created this task.

Change 484793 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] role: add prometheus2 rules (new format)

https://gerrit.wikimedia.org/r/484793

Changeset [1] contains a current snapshot of the converted rules files. The Prometheus server needs these to maintain its current behavior; the changeset also adds the backwards-compatibility rules from [2].

I manually tested the buster packages on stretch and jessie and they appear to run in the default configuration. They may work as-is if we load them into the distro-specific repositories. TODO: test our custom configuration.

[1]: https://gerrit.wikimedia.org/r/484793
[2]: https://github.com/prometheus/node_exporter/blob/v0.17.0/docs/example-16-compatibility-rules.yml

I also just realized that the compatibility rules are in Prometheus v2 format, though we'll need them in v1 format as well to decouple the v1 -> v2 migration from the node-exporter upgrade.
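
To illustrate the format difference, here is a minimal Python sketch that renders the same recording rule in both syntaxes. The helper names and the group name are hypothetical, and the rule itself (recording `node_memory_MemTotal_bytes` from `node_memory_MemTotal`) is only an example in the spirit of the upstream compatibility file, not a copy of it. It assumes the Prometheus 1.x rule syntax `<name> = <expression>` and the 2.x YAML layout with `groups`, `record` and `expr`.

```python
from typing import List, Tuple


def render_v1(record: str, expr: str) -> str:
    """Render one recording rule as a Prometheus 1.x rule-file line."""
    return f"{record} = {expr}"


def render_v2(group: str, rules: List[Tuple[str, str]]) -> str:
    """Render a Prometheus 2.x rule group as YAML text.

    The YAML is assembled by hand to avoid a third-party dependency;
    this is only safe for simple expressions like the one below.
    """
    lines = ["groups:", f"  - name: {group}", "    rules:"]
    for record, expr in rules:
        lines.append(f"      - record: {record}")
        lines.append(f"        expr: {expr}")
    return "\n".join(lines)


# Hypothetical example rule: (recorded name, source expression)
rule = ("node_memory_MemTotal_bytes", "node_memory_MemTotal")

v1_text = render_v1(*rule)
v2_text = render_v2("node_exporter_compat", [rule])
```

Keeping the rules as (record, expr) pairs and rendering them twice would let both server generations share one source of truth while the v1 -> v2 migration is in progress.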

After deploying the rules and node-exporter v0.17 to deployment-prometheus02, it appears the rules are not for backwards compatibility, but for forwards compatibility. Dashboards will need to be updated before node-exporter 0.17 is deployed.
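
The direction problem can be sketched with a small simulation. This is an illustrative model, not how Prometheus evaluates rules: a rule is treated as "if the source name is exposed, the recorded name becomes queryable too". The three renames are taken from the 0.16 upgrade guide as I understand it; `available_names` and the variable names are hypothetical.

```python
from typing import Dict, Set

# old name -> new name (a small sample of the renamed metrics)
RENAMES: Dict[str, str] = {
    "node_memory_MemTotal": "node_memory_MemTotal_bytes",
    "node_filesystem_avail": "node_filesystem_avail_bytes",
    "node_network_receive_bytes": "node_network_receive_bytes_total",
}


def available_names(exposed: Set[str], rules: Dict[str, str]) -> Set[str]:
    """Names that can be queried once rules (source -> recorded) are applied."""
    recorded = {target for source, target in rules.items() if source in exposed}
    return exposed | recorded


# Forwards compatibility: record the NEW name from the OLD metric.
forward_rules = dict(RENAMES)
# Backwards compatibility: record the OLD name from the NEW metric.
backward_rules = {new: old for old, new in RENAMES.items()}

old_exporter = set(RENAMES.keys())    # pre-0.16 exposes the old names
new_exporter = set(RENAMES.values())  # 0.17 exposes the new names

# With forward rules, a 0.17 exporter still yields no old names, so
# dashboards that query old names would go empty after the upgrade.
after_upgrade_forward = available_names(new_exporter, forward_rules)
after_upgrade_backward = available_names(new_exporter, backward_rules)
```

Under this model, forward rules only help dashboards that already use the new names while hosts still run the old exporter, which matches the observation that dashboards must be updated before 0.17 is deployed.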

Tested command line flags for prometheus-node-exporter v0.17

ARGS='--collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$ --collector.filesystem.ignored-fs-types=^(overlay|autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|nsfs|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$ --collector.filesystem.ignored-mount-points=^/(sys|proc|dev|var/lib/docker|var/lib/kubelet)($|/) --collector.textfile.directory=/var/lib/prometheus/node.d --collector.buddyinfo --collector.conntrack --collector.diskstats --collector.edac --collector.entropy --collector.filefd --collector.filesystem --collector.hwmon --collector.loadavg --collector.mdadm --collector.meminfo --collector.netdev --collector.netstat --collector.netstat.fields="^(.*)" --collector.sockstat --collector.stat --collector.tcpstat --collector.textfile --collector.time --collector.uname --collector.vmstat --collector.vmstat.fields="^(.*)" --web.listen-address=:9100'
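
As a quick sanity check of the `ignored-devices` pattern, the following Python sketch tests a few device names against it. It assumes the intended pattern uses `\d` digit classes (`nvme\d+n\d+p` followed by `\d+`), and it uses Python's `re` module as a stand-in for Go's regexp engine, which behaves the same for this particular pattern as far as I can tell. The device names are made-up examples.

```python
import re

# Assumed intended value of --collector.diskstats.ignored-devices
IGNORED_DEVICES = re.compile(r"^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$")


def is_ignored(device: str) -> bool:
    """True if the diskstats collector would skip this device name."""
    return IGNORED_DEVICES.search(device) is not None


# Partitions and pseudo devices are skipped, whole disks are kept.
examples = ["sda", "sda1", "nvme0n1", "nvme0n1p1", "loop0", "dm-0"]
ignored = [name for name in examples if is_ignored(name)]
kept = [name for name in examples if not is_ignored(name)]
```

The pattern only matches names that end in digits after a known prefix, so whole disks such as `sda` or `nvme0n1` and device-mapper names such as `dm-0` are still reported.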

Change 485889 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] role: add forwards-compatibility rules to prometheus

https://gerrit.wikimedia.org/r/485889

Change 484793 merged by Filippo Giunchedi:
[operations/puppet@production] role: add prometheus2 backwards-compatibility rules

https://gerrit.wikimedia.org/r/484793

Change 486192 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: upgrade to node-exporter 0.17

https://gerrit.wikimedia.org/r/486192

Change 486493 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] aptrepo: add prometheus-node-exporter component for jessie

https://gerrit.wikimedia.org/r/486493

Change 486493 merged by Cwhite:
[operations/puppet@production] aptrepo: add prometheus-node-exporter components for all dists

https://gerrit.wikimedia.org/r/486493

Change 488593 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: install node exporter 0.17 in beta

https://gerrit.wikimedia.org/r/488593

Change 485889 merged by Cwhite:
[operations/puppet@production] role: add backwards-compatibility rules to prometheus

https://gerrit.wikimedia.org/r/485889

Change 486192 merged by Cwhite:
[operations/puppet@production] prometheus: upgrade to node-exporter 0.17

https://gerrit.wikimedia.org/r/486192

Change 489325 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: post-upgrade node-exporter cleanup

https://gerrit.wikimedia.org/r/489325

We currently pin prometheus-node-exporter to 0.17.0+ds-2 on the selected hosts and for buster, but yesterday 0.17.0+ds-3 migrated to testing/buster. I could change the puppet code to pick -3 on buster only, but I'd suggest we also upgrade the components for jessie and stretch to -3 and bump the version everywhere. https://packages.qa.debian.org/p/prometheus-node-exporter/news/20190131T180815Z.html lists a number of fixes, and at least the TMPDIR change seems relevant for those releases as well.

Uploading -3 internally and changing puppet to install that version sounds good to me!

Change 489753 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in labs

https://gerrit.wikimedia.org/r/489753

Change 489754 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in eqsin

https://gerrit.wikimedia.org/r/489754

Change 489756 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: upgrade prometheus-node-exporter to latest patchset

https://gerrit.wikimedia.org/r/489756

Change 489756 merged by Muehlenhoff:
[operations/puppet@production] prometheus: upgrade prometheus-node-exporter to latest patchset

https://gerrit.wikimedia.org/r/489756

Change 489754 merged by Cwhite:
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in eqsin

https://gerrit.wikimedia.org/r/489754

Change 490095 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: use non-namespaced hiera key to enable site lookup

https://gerrit.wikimedia.org/r/490095

Change 490095 merged by Cwhite:
[operations/puppet@production] prometheus: use non-namespaced hiera key to enable site lookup

https://gerrit.wikimedia.org/r/490095

Change 490203 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: do not change trusty hosts

https://gerrit.wikimedia.org/r/490203

Change 490204 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: attempt to force apt update

https://gerrit.wikimedia.org/r/490204

Change 490229 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in esams

https://gerrit.wikimedia.org/r/490229

I've just noticed, based on a diffscan email, that the new version of prometheus-node-exporter ALSO binds to :::9100 on ipv6 and listens to all ipv6 clients, while the old node exporter version would only bind to a specific interface on ipv4 and listen on that interface.

This means that on publicly-exposed hosts, we expose node-exporter to the world over ipv6 if there is no firewall rule. Some servers have no firewall by design, so node-exporter is actually reachable from the open internet.

The solution is probably to just not listen over ipv6.

Change 490304 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: use web_listen_address with node-exporter 0.17

https://gerrit.wikimedia.org/r/490304

Change 490304 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: use web_listen_address with node-exporter 0.17

https://gerrit.wikimedia.org/r/490304

Indeed, this was a regression in the commandline options for node-exporter, fixed now in https://gerrit.wikimedia.org/r/490304
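
A check like the following could flag this class of regression before deployment. It is a minimal sketch with hypothetical function names: it parses a `--web.listen-address` value and reports whether the host part is empty or an unspecified address. Per Go's `net.Listen` documentation, an empty host or a literal unspecified IP address listens on all available unicast and anycast addresses, which is why `:9100` ended up reachable over IPv6. The address `192.0.2.10` is a documentation-range placeholder.

```python
from typing import Tuple


def split_listen_address(addr: str) -> Tuple[str, int]:
    """Split 'host:port' or '[v6host]:port' into (host, port).

    The host may be empty, as in ':9100'.
    """
    host, _, port = addr.rpartition(":")
    return host.strip("[]"), int(port)


def is_wildcard_listen_address(addr: str) -> bool:
    """True if the address has no host or an unspecified (wildcard) host."""
    host, _ = split_listen_address(addr)
    return host in ("", "0.0.0.0", "::")


# ':9100' has an empty host, so the exporter listens everywhere.
exposed_everywhere = is_wildcard_listen_address(":9100")
# Binding to one specific address limits where the exporter is reachable.
bound_to_one_address = not is_wildcard_listen_address("192.0.2.10:9100")
```

On hosts without a firewall, only the second form keeps the exporter off interfaces it was never meant to serve.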

Change 490203 merged by Cwhite:
[operations/puppet@production] prometheus: do not change trusty hosts

https://gerrit.wikimedia.org/r/490203

Change 490204 merged by Cwhite:
[operations/puppet@production] prometheus: attempt to force apt update

https://gerrit.wikimedia.org/r/490204

Change 490229 merged by Cwhite:
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in esams

https://gerrit.wikimedia.org/r/490229

Change 490651 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in ulsfo

https://gerrit.wikimedia.org/r/490651

Change 490651 merged by Cwhite:
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in ulsfo

https://gerrit.wikimedia.org/r/490651

Change 490689 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in codfw

https://gerrit.wikimedia.org/r/490689

Change 490690 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in eqiad

https://gerrit.wikimedia.org/r/490690

Change 489753 merged by Cwhite:
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in labs

https://gerrit.wikimedia.org/r/489753

Change 490693 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in labs

https://gerrit.wikimedia.org/r/490693

Change 490693 merged by Cwhite:
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in labs

https://gerrit.wikimedia.org/r/490693

Noticed this today on bast3002, probably harmless but needs investigation:

Feb 20 10:26:50 bast3002 systemd[1]: Starting Collect ipmitool sensor metrics for prometheus-node-exporter...
Feb 20 10:26:50 bast3002 sh[14417]: awk: not an option: -nf

On further investigation, the log messages appear to be from the shebang of the ipmitool awk script.

Patch submitted upstream and a Debian bug has been filed.

Looks like -n is/was a gawk option?

-n
--non-decimal-data
Enable automatic interpretation of octal and hexadecimal values in input
data (see Nondecimal Data).
CAUTION: This option can severely break old programs. Use with care. Also
note that this option may disappear in a future version of gawk.
https://www.gnu.org/software/gawk/manual/html_node/Options.html

Change 492408 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: disable shipped node-exporter ipmitool and smartmon timers

https://gerrit.wikimedia.org/r/492408

Change 492408 merged by Cwhite:
[operations/puppet@production] prometheus: disable shipped node-exporter ipmitool and smartmon timers

https://gerrit.wikimedia.org/r/492408

Change 490689 merged by Cwhite:
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in codfw

https://gerrit.wikimedia.org/r/490689

Change 493047 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: require package before disabling services

https://gerrit.wikimedia.org/r/493047

Change 493047 merged by Cwhite:
[operations/puppet@production] prometheus: require package before masking services

https://gerrit.wikimedia.org/r/493047

Change 493269 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: clean up node exporter transition code

https://gerrit.wikimedia.org/r/493269

Change 489325 abandoned by Cwhite:
prometheus: post-upgrade node-exporter cleanup

https://gerrit.wikimedia.org/r/489325

Change 497252 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Don't set a fixed package version for prometheus-node-exporter on buster

https://gerrit.wikimedia.org/r/497252

Change 497252 merged by Muehlenhoff:
[operations/puppet@production] Don't set a fixed package version for prometheus-node-exporter on buster

https://gerrit.wikimedia.org/r/497252

Change 490690 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in eqiad

https://gerrit.wikimedia.org/r/490690

Change 490690 merged by Cwhite:
[operations/puppet@production] hiera: upgrade prometheus-node-exporter to 0.17 in eqiad

https://gerrit.wikimedia.org/r/490690

Change 499667 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: clean up node exporter transition code

https://gerrit.wikimedia.org/r/499667

Change 493269 abandoned by Cwhite:
prometheus: clean up node exporter transition code

Reason:
superseded

https://gerrit.wikimedia.org/r/493269

Change 488593 abandoned by Cwhite:
hiera: install node exporter 0.17 in beta

Reason:
no longer needed

https://gerrit.wikimedia.org/r/488593

Change 499667 merged by Cwhite:
[operations/puppet@production] prometheus: clean up node exporter transition code

https://gerrit.wikimedia.org/r/499667

colewhite updated the task description.

@CDanis thanks for the heads up. Should look better now.

Change 490223 abandoned by Cwhite:
hiera: upgrade prometheus-node-exporter to 0.17 in esams

https://gerrit.wikimedia.org/r/490223

Change 490224 abandoned by Cwhite:
hiera: upgrade prometheus-node-exporter to 0.17 in esams

https://gerrit.wikimedia.org/r/490224