
PAWS: get new service and cluster metrics into prometheus
Closed, Resolved · Public

Description

Figure out prometheus and a grafana dashboard for the new PAWS service. This should include both the jupyterhub metrics that are built in and already exported, as well as the Kubernetes cluster metrics for capacity planning.

We have 2 prometheus servers available:

  • tools prometheus
  • metricsinfra prometheus

Arturo thinks we should perhaps try using the metricsinfra one to reduce coupling with the tools project.

Event Timeline

@bd808 suggested tools-prometheus because we have that set up more for dashboard use (and it monitors the clouddb metrics, for instance) vs metricsinfra which has so far been more focused on alerting. I'm not strongly opinionated myself, which is why I'm tagging him in to discuss.

I seem to recall there is a retention difference there if nothing else.

Bstorm renamed this task from PAWS: get new service into prometheus to PAWS: get new service and cluster metrics into prometheus. Jun 25 2020, 10:45 PM
Bstorm triaged this task as Medium priority.
Bstorm updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2020-06-26T21:57:34Z] <bstorm> applied the metrics manifests to kubernetes to enable metrics-server, cadvisor, etc. T256361

I guess my current opinion is that the metricsinfra project is only partially done and not resourced to be completed. I don't care where the data lives really, but it should be somewhere that we can trust to generate trend reports.

I will say that we deliberately set the retention in metricsinfra to be very short, because so far it has only been meant as a shinken replacement. On the other hand, we should also consider whether the tools-prometheus servers have the storage space to accommodate the metrics from a second k8s cluster. We do currently monitor the clouddb-services project with tools-prometheus for metrics trending and metricsinfra for alerting, though that is a little weird.

Change 610175 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] paws: add project to our prometheus alert-manager system

https://gerrit.wikimedia.org/r/610175

tools-prometheus is at 29% for disk space, so I think we can get away with adding paws stats there for trending. Going to give that a shot.

Change 610189 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: set up prometheus to get paws metrics

https://gerrit.wikimedia.org/r/610189

Change 610192 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/private@master] paws-prometheus: add dummy value for the paws-k8s pk

https://gerrit.wikimedia.org/r/610192

Change 610192 merged by Bstorm:
[labs/private@master] paws-prometheus: add dummy value for the paws-k8s pk

https://gerrit.wikimedia.org/r/610192

Change 610175 merged by Bstorm:
[operations/puppet@production] paws: add project to our prometheus alert-manager system

https://gerrit.wikimedia.org/r/610175

Change 610189 merged by Bstorm:
[operations/puppet@production] tools-prometheus: set up prometheus to get paws metrics

https://gerrit.wikimedia.org/r/610189

Change 610383 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: set up prometheus to get paws metrics

https://gerrit.wikimedia.org/r/610383

Change 610383 merged by Bstorm:
[operations/puppet@production] tools-prometheus: set up prometheus to get paws metrics

https://gerrit.wikimedia.org/r/610383

Change 610394 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: fix a typo in the config that repeated an entry

https://gerrit.wikimedia.org/r/610394

Change 610394 merged by Bstorm:
[operations/puppet@production] tools-prometheus: fix a typo in the config that repeated an entry

https://gerrit.wikimedia.org/r/610394

Ok, so we have a partial victory here: https://tools-prometheus.wmflabs.org/tools/targets. Most of this works (though it is interesting that some of the Toolforge cadvisor pods are hanging or something). The one thing that doesn't work is hitting the jupyterhub metrics, because that's an external IP. The labsaliaser doesn't know how to deal with our odd neutron port to get the internal IP for use inside wmcs. We need to figure that out, I think.

For that cadvisor thing, it seems to just need a longer timeout for scrapes.
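If it comes to tuning that, this is roughly what a longer per-job timeout looks like in a prometheus scrape config; a minimal sketch only, where the job name, interval, timeout, and target are illustrative assumptions rather than the actual tools-prometheus values:

```yaml
# Sketch: give slow cadvisor targets a longer per-job scrape timeout.
# Names, ports, and durations below are illustrative assumptions.
scrape_configs:
  - job_name: 'toolforge-cadvisor'
    scrape_interval: 60s
    scrape_timeout: 55s   # must stay below scrape_interval
    static_configs:
      - targets: ['tools-k8s-worker-1:4194']  # hypothetical target
```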

The next step is to ensure that toolforge dashboards don't roll in paws stats and vice versa. I see they are mixed up at the moment because they aren't pegged to job names.
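The usual fix is to give the two clusters distinct job names and have the Grafana panel queries filter on the job label (e.g. `node_load5{job="paws-node"}`). A sketch under that assumption; the job and target names here are hypothetical, not the real config:

```yaml
# Sketch: distinct job names per cluster so dashboard queries can
# select on the job label instead of mixing Toolforge and PAWS data.
scrape_configs:
  - job_name: 'tools-node'    # Toolforge workers (hypothetical name)
    static_configs:
      - targets: ['tools-k8s-worker-1:9100']
  - job_name: 'paws-node'     # PAWS workers (hypothetical name)
    static_configs:
      - targets: ['paws-k8s-worker-1:9100']
```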

Change 610877 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] paws-prometheus: add node exporter info to tools-prometheus for paws

https://gerrit.wikimedia.org/r/610877

Change 610877 merged by Bstorm:
[operations/puppet@production] paws-prometheus: add node exporter info to tools-prometheus for paws

https://gerrit.wikimedia.org/r/610877

Ok, so besides jupyterhub, which is waiting on T257534, we have an interesting issue with monitoring etcd. The stacked control-plane deployment of kubeadm presumes, I think, that you are going to run prometheus inside Kubernetes: the metrics URLs all listen only on localhost. I can trivially make them listen publicly by editing /etc/kubernetes/manifests/etcd.yaml on each control-plane node, but I would rather put that in puppet or kubeadm's config. The option appears to exist in kubeadm config under https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2#LocalEtcd. I'll test it on a random, unconnected VM to see what the resulting server looks like. If that works, I could retrofit the option into the cluster.

Puppetizing it would likely be messy. I could add it by hand and document it, but we have enough manual steps already. The configuration layout should be in puppet.
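For reference, a minimal sketch of how that option could look in a kubeadm v1beta2 ClusterConfiguration, assuming etcd's --listen-metrics-urls flag and its conventional metrics port 2381; the bind address below is an assumption, not the merged cluster config:

```yaml
# Sketch: expose etcd metrics on a stacked control plane via kubeadm's
# LocalEtcd extraArgs. Values below are illustrative assumptions.
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      listen-metrics-urls: "http://0.0.0.0:2381"
```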

Change 610980 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] kubeadm: If using a stacked control plane, expose etcd metrics

https://gerrit.wikimedia.org/r/610980

Change 610980 merged by Bstorm:
[operations/puppet@production] kubeadm: If using a stacked control plane, expose etcd metrics

https://gerrit.wikimedia.org/r/610980

Change 611370 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: Add the paws etcd exports

https://gerrit.wikimedia.org/r/611370

Metrics exposed from etcd, now just need to collect 'em.
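For collection, a sketch of the corresponding scrape job, assuming the etcd metrics endpoints end up on port 2381 of the control-plane nodes; the hostnames and port are assumptions, not the actual tools-prometheus entries:

```yaml
# Sketch: scrape the etcd metrics endpoints on the paws control-plane
# nodes. Hostnames and port are illustrative assumptions.
scrape_configs:
  - job_name: 'paws-etcd'
    static_configs:
      - targets:
          - 'paws-k8s-control-1:2381'
          - 'paws-k8s-control-2:2381'
          - 'paws-k8s-control-3:2381'
```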

Change 611370 merged by Bstorm:
[operations/puppet@production] tools-prometheus: Add the paws etcd exports

https://gerrit.wikimedia.org/r/611370

So the haproxy stats weren't working quite right, but then I saw:

haproxy_exporter_csv_parse_failures{instance="k8s.svc.paws.eqiad1.wikimedia.cloud:9901",job="paws-haproxy"} 18088

So the exporter cannot read the stats? Checking into that.

Jul 10 00:01:31 paws-k8s-haproxy-1 prometheus-haproxy-exporter[26351]: time="2020-07-10T00:01:31Z" level=error msg="Parser expected at least 33 CSV fields, but got: 1" source="haproxy_exporter.go:386"

Yeah, it's upset :)

Also:

Jul 10 00:13:31 paws-k8s-haproxy-1 prometheus-haproxy-exporter[26351]: time="2020-07-10T00:13:31Z" level=error msg="Can't read CSV: parse error on line 5, column 14: bare \" in non-quoted-field" source="haproxy_exporter.go:348"

The answer here was setting prometheus::haproxy_exporter::endpoint: http://localhost:8404/stats;csv in the prefix puppet.
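In hiera form that is just the following; which prefix it lives in is left as an assumption, the key and value are the ones mentioned above:

```yaml
# Hiera data for the haproxy prefix (sketch; the prefix is assumed,
# the key/value are the ones described in the comment above).
prometheus::haproxy_exporter::endpoint: 'http://localhost:8404/stats;csv'
```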

I'm pretty happy with the haproxy stats now. I think I've improved the use of the load monitoring for both paws and tools in this pass as well: https://grafana-labs.wikimedia.org/d/5O3YKfbWz/k8s-haproxy?orgId=1&refresh=5m

I did a little research and confirmed that Prometheus not only does not support setting the Host header, the development team is somewhat hostile to the idea of adding arbitrary headers to scrapes outside of auth headers. So we will not have jupyterhub stats until you can introspect a Kubernetes ingress-ed service from inside the cloud.

I think T257534: CloudVPS: a VM is unable to contact floating IPs of other VMs is mostly ready to go. Please try using paws.wmcloud.org directly and let me know if it works. My tests indicate that it should work just fine now!
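If that holds, a sketch of what the jupyterhub scrape could look like, assuming JupyterHub's /hub/metrics endpoint and the fact that Prometheus derives the Host header from the target address, so scraping the ingress hostname directly sidesteps the header limitation above; path, scheme, and job name are assumptions about the PAWS setup:

```yaml
# Sketch: scrape JupyterHub through the ingress hostname once VMs can
# reach it. Path, scheme, and job name are illustrative assumptions.
scrape_configs:
  - job_name: 'paws-jupyterhub'
    metrics_path: '/hub/metrics'
    scheme: https
    static_configs:
      - targets: ['paws.wmcloud.org']
```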

Bstorm claimed this task.

Metrics are in, as is a dashboard. We will almost certainly want to expand the dashboard to include more performance metrics in the future.