
Toolforge: collect prometheus node exporter metrics from new k8s worker nodes
Closed, Resolved, Public



image.png (535×2 px, 219 KB)

It seems Prometheus cannot collect these metrics.

Event Timeline

aborrero triaged this task as Medium priority. (Aug 8 2019, 5:37 PM)
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

This may be a good starting task for @Phamhi, apart from the ones we have already.

In Horizon, under "Instance Console Log", I can see the following logs for tools-worker-1030:

[  OK  ] Stopped Prometheus exporter for machine metrics.
         Starting Prometheus exporter for machine metrics...
[FAILED] Failed to start Prometheus exporter for machine metrics.
See 'systemctl status prometheus-node-exporter.service' for details.

During prometheus-node-exporter.service startup, the following error occurs:

Aug 06 18:04:45 tools-worker-1030 prometheus-node-exporter[661]: flag provided but not defined: -collector.buddyinfo

This is because the installed prometheus-node-exporter package is outdated and does not yet support the -collector.buddyinfo argument.
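To spot this failure mode on other nodes, the rejected flag can be pulled out of the journal lines. This is only a minimal sketch: the helper name `undefined_flags` is hypothetical, and it assumes the error text always has the shape seen in the log line quoted above.

```python
import re

# Assumes the message format from the journal line quoted above.
UNDEFINED_FLAG = re.compile(r"flag provided but not defined: (-\S+)")


def undefined_flags(log_lines):
    """Return the flags the exporter rejected, in order of appearance."""
    found = []
    for line in log_lines:
        match = UNDEFINED_FLAG.search(line)
        if match:
            found.append(match.group(1))
    return found


journal = [
    "Aug 06 18:04:45 tools-worker-1030 prometheus-node-exporter[661]: "
    "flag provided but not defined: -collector.buddyinfo",
]
print(undefined_flags(journal))
```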

To fix this, the prometheus-node-exporter package on tools-worker-10{30..40} needs to be updated from 0.14.0~git20170523-1 to 0.17.0+ds-3.

I can confirm that the "working" nodes such as 029 already have prometheus-node-exporter package version 0.17.0+ds-3.

That makes sense:

aborrero@tools-worker-1030:~$ apt-cache policy prometheus-node-exporter
  Installed: 0.14.0~git20170523-1
  Candidate: 0.17.0+ds-3
  Version table:
     0.17.0+ds-3 0
       1001 jessie-wikimedia/main amd64 Packages
 *** 0.14.0~git20170523-1 0
       1001 jessie-wikimedia/backports amd64 Packages
        100 /var/lib/dpkg/status
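The same check could be scripted across nodes by reading the Installed and Candidate fields from that output. A rough sketch, assuming the output has the shape pasted above; `parse_policy` and `upgrade_pending` are hypothetical helper names, and "pending" here only means the two fields differ.

```python
import re

# Sample copied from the apt-cache policy output quoted above.
SAMPLE = """\
  Installed: 0.14.0~git20170523-1
  Candidate: 0.17.0+ds-3
  Version table:
     0.17.0+ds-3 0
       1001 jessie-wikimedia/main amd64 Packages
 *** 0.14.0~git20170523-1 0
       1001 jessie-wikimedia/backports amd64 Packages
        100 /var/lib/dpkg/status
"""


def parse_policy(text):
    """Return (installed, candidate) versions found in the policy output."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(Installed|Candidate):\s*(\S+)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields.get("Installed"), fields.get("Candidate")


def upgrade_pending(text):
    """True when both fields are present and the installed version differs."""
    installed, candidate = parse_policy(text)
    return None not in (installed, candidate) and installed != candidate


installed, candidate = parse_policy(SAMPLE)
print(installed, "->", candidate, "pending:", upgrade_pending(SAMPLE))
```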

I would suggest we:

  1. create a puppet patch to ensure worker nodes run the latest version, so this error doesn't recur when we create new worker VMs on Debian jessie (hopefully not for much longer, since we are already working on an updated deployment)
  2. manually update all the packages in the affected servers. Should be easy using cumin.
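For step 2, the host list and the per-host command could be laid out as below. This is a sketch only: it does not call cumin, it just prints a plan. The `apt-get` invocation is an assumption about how the upgrade might be done, not the command actually used, and the helper names are hypothetical.

```python
# Assumed upgrade command; not taken from the task.
UPGRADE_CMD = "apt-get install --only-upgrade -y prometheus-node-exporter"


def worker_hosts(first=1030, last=1040, prefix="tools-worker-"):
    """Expand tools-worker-10{30..40} into individual host names."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def upgrade_plan(hosts):
    """Pair every host with the command that would be run on it."""
    return [(host, UPGRADE_CMD) for host in hosts]


for host, command in upgrade_plan(worker_hosts()):
    print(f"{host}: {command}")
```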

Mentioned in SAL (#wikimedia-cloud) [2019-08-12T16:08:26Z] <phamhi> updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes (T230147)

Mentioned in SAL (#wikimedia-cloud) [2019-08-12T20:39:59Z] <phamhi> toolsbeta-test-puppet-sandbox instance created for T230147

I created a new instance "toolsbeta-test-puppet-sandbox" with the jessie image, and it looks like it came with prometheus-node-exporter version 0.14.0, not 0.17.0. As per Arturo's suggestion, I am looking into creating a Puppet patch for this issue.

Just for information, there's more than one quirk in building new Jessie K8s nodes. It may be worth it to just document the problem, because pinning doesn't always prevent chicken/egg issues.

That said, this one seems like it might actually be fixable with a pin. :)

Does it make more sense to close this ticket as the original issue has been resolved? We then create a new ticket to prevent this issue from re-occurring?

No, I think this is only resolved if "new kubernetes worker nodes" can export metrics. They'll fail if we spin up another one. I'm perfectly fine with just documenting that the package needs an upgrade (since there are packages that need downgrades as well), but a puppet pin of the package would resolve it as well. The reason I'm ok with just updating the docs is that this concerns Jessie nodes, and we are going to deprecate Jessie. Otherwise, we'd surely insist on fixing this in puppet so the build is reproducible.

We certainly need to spin up more nodes and those ones will have the same problem at this point.

I have updated the docs located at to include the command to update the prometheus-node-exporter package after the build.