
PAWS: get new service and cluster metrics into prometheus
Closed, Resolved · Public

Description

Figure out prometheus and a grafana dashboard for the new PAWS service. This should include both the jupyterhub metrics that are built in and already exported, as well as the Kubernetes cluster metrics for capacity planning.

We have 2 prometheus servers available:

  • tools prometheus
  • metricsinfra prometheus

Arturo thinks we should perhaps try using the metricsinfra one to reduce coupling with the tools project.

Event Timeline

@bd808 suggested tools-prometheus because we have that set up more for dashboard use (and it monitors the clouddb metrics, for instance) vs metricsinfra which has so far been more focused on alerting. I'm not strongly opinionated myself, which is why I'm tagging him in to discuss.

I seem to recall there is a retention difference there if nothing else.

Bstorm renamed this task from PAWS: get new service into prometheus to PAWS: get new service and cluster metrics into prometheus. Jun 25 2020, 10:45 PM
Bstorm triaged this task as Medium priority.
Bstorm updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2020-06-26T21:57:34Z] <bstorm> applied the metrics manifests to kubernetes to enable metrics-server, cadvisor, etc. T256361

I guess my current opinion is that the metricsinfra project is only partially done and not resourced to be completed. I don't care where the data lives really, but it should be somewhere that we can trust to generate trend reports.

I will say that we deliberately set the retention in metricsinfra to be very short, because so far it has only been meant as a shinken replacement. On the other hand, we should also consider whether the tools-prometheus servers have the storage space to accommodate the metrics from a second k8s cluster. We do currently monitor the clouddb-services project with tools-prometheus for metrics trending and metricsinfra for alerting, though that is a little weird.

Change 610175 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] paws: add project to our prometheus alert-manager system

https://gerrit.wikimedia.org/r/610175

tools-prometheus is at 29% for disk space, so I think we can get away with adding paws stats there for trending. Going to give that a shot.

Change 610189 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: set up prometheus to get paws metrics

https://gerrit.wikimedia.org/r/610189

Change 610192 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/private@master] paws-prometheus: add dummy value for the paws-k8s pk

https://gerrit.wikimedia.org/r/610192

Change 610192 merged by Bstorm:
[labs/private@master] paws-prometheus: add dummy value for the paws-k8s pk

https://gerrit.wikimedia.org/r/610192

Change 610175 merged by Bstorm:
[operations/puppet@production] paws: add project to our prometheus alert-manager system

https://gerrit.wikimedia.org/r/610175

Change 610189 merged by Bstorm:
[operations/puppet@production] tools-prometheus: set up prometheus to get paws metrics

https://gerrit.wikimedia.org/r/610189

Change 610383 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: set up prometheus to get paws metrics

https://gerrit.wikimedia.org/r/610383

Change 610383 merged by Bstorm:
[operations/puppet@production] tools-prometheus: set up prometheus to get paws metrics

https://gerrit.wikimedia.org/r/610383

Change 610394 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: fix a typo in the config that repeated an entry

https://gerrit.wikimedia.org/r/610394

Change 610394 merged by Bstorm:
[operations/puppet@production] tools-prometheus: fix a typo in the config that repeated an entry

https://gerrit.wikimedia.org/r/610394

Ok, so we have a partial victory here: https://tools-prometheus.wmflabs.org/tools/targets. Most of this works (though it is interesting that some of the Toolforge cadvisor pods are hanging or something). The one thing that doesn't work is hitting the jupyterhub metrics, because that's an external IP. The labsaliaser doesn't know how to deal with our odd neutron port to get the internal IP for use inside wmcs. We need to figure that out, I think.

For that cadvisor thing, it seems to just need a longer timeout for scrapes.
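If it comes to tuning that, this is roughly what a longer per-job timeout looks like in a prometheus scrape config; a minimal sketch only, where the job name, interval, timeout, and target are illustrative assumptions rather than the actual tools-prometheus values:

```yaml
# Sketch: give slow cadvisor targets a longer per-job scrape timeout.
# Names, ports, and durations below are illustrative assumptions.
scrape_configs:
  - job_name: 'toolforge-cadvisor'
    scrape_interval: 60s
    scrape_timeout: 55s   # must stay below scrape_interval
    static_configs:
      - targets: ['tools-k8s-worker-1:4194']  # hypothetical target
```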

The next step is to ensure that toolforge dashboards don't roll in paws stats and vice versa. I see they are mixed up at the moment because they aren't pegged to job names.
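The usual fix is to give the two clusters distinct job names and have the Grafana panel queries filter on the job label (e.g. `node_load5{job="paws-node"}`). A sketch under that assumption; the job and target names here are hypothetical, not the real config:

```yaml
# Sketch: distinct job names per cluster so dashboard queries can
# select on the job label instead of mixing Toolforge and PAWS data.
scrape_configs:
  - job_name: 'tools-node'    # Toolforge workers (hypothetical name)
    static_configs:
      - targets: ['tools-k8s-worker-1:9100']
  - job_name: 'paws-node'     # PAWS workers (hypothetical name)
    static_configs:
      - targets: ['paws-k8s-worker-1:9100']
```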

Change 610877 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] paws-prometheus: add node exporter info to tools-prometheus for paws

https://gerrit.wikimedia.org/r/610877

Change 610877 merged by Bstorm:
[operations/puppet@production] paws-prometheus: add node exporter info to tools-prometheus for paws

https://gerrit.wikimedia.org/r/610877

Ok, so besides jupyterhub, which is waiting on T257534, we have an interesting issue with monitoring etcd. The stacked control-plane deployment of kubeadm presumes, I think, that you are going to run prometheus inside Kubernetes: the metrics URLs all listen only on localhost. I can trivially make them listen publicly by editing /etc/kubernetes/manifests/etcd.yaml on each control-plane node, but I would rather put that in puppet or kubeadm's config. The option appears to exist in kubeadm config under https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2#LocalEtcd. I'll test it on a random, unconnected VM to see what the resulting server looks like. If that works, I could retrofit the option into the cluster.

Puppetizing it would likely be messy. I could add it by hand and document it, but we have enough manual steps already. The configuration layout should be in puppet.
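For reference, a minimal sketch of how that option could look in a kubeadm v1beta2 ClusterConfiguration, assuming etcd's --listen-metrics-urls flag and its conventional metrics port 2381; the bind address below is an assumption, not the merged cluster config:

```yaml
# Sketch: expose etcd metrics on a stacked control plane via kubeadm's
# LocalEtcd extraArgs. Values below are illustrative assumptions.
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      listen-metrics-urls: "http://0.0.0.0:2381"
```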

Change 610980 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] kubeadm: If using a stacked control plane, expose etcd metrics

https://gerrit.wikimedia.org/r/610980

Change 610980 merged by Bstorm:
[operations/puppet@production] kubeadm: If using a stacked control plane, expose etcd metrics

https://gerrit.wikimedia.org/r/610980

Change 611370 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: Add the paws etcd exports

https://gerrit.wikimedia.org/r/611370

Metrics exposed from etcd, now just need to collect 'em.
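For collection, a sketch of the corresponding scrape job, assuming the etcd metrics endpoints end up on port 2381 of the control-plane nodes; the hostnames and port are assumptions, not the actual tools-prometheus entries:

```yaml
# Sketch: scrape the etcd metrics endpoints on the paws control-plane
# nodes. Hostnames and port are illustrative assumptions.
scrape_configs:
  - job_name: 'paws-etcd'
    static_configs:
      - targets:
          - 'paws-k8s-control-1:2381'
          - 'paws-k8s-control-2:2381'
          - 'paws-k8s-control-3:2381'
```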

Change 611370 merged by Bstorm:
[operations/puppet@production] tools-prometheus: Add the paws etcd exports

https://gerrit.wikimedia.org/r/611370

So the haproxy stats weren't working quite right, but then I saw:

haproxy_exporter_csv_parse_failures{instance="k8s.svc.paws.eqiad1.wikimedia.cloud:9901",job="paws-haproxy"} 18088

So the exporter cannot read the stats? Checking into that.

Jul 10 00:01:31 paws-k8s-haproxy-1 prometheus-haproxy-exporter[26351]: time="2020-07-10T00:01:31Z" level=error msg="Parser expected at least 33 CSV fields, but got: 1" source="haproxy_exporter.go:386"

Yeah, it's upset :)

Also:

Jul 10 00:13:31 paws-k8s-haproxy-1 prometheus-haproxy-exporter[26351]: time="2020-07-10T00:13:31Z" level=error msg="Can't read CSV: parse error on line 5, column 14: bare \" in non-quoted-field" source="haproxy_exporter.go:348"

The answer here was setting prometheus::haproxy_exporter::endpoint: http://localhost:8404/stats;csv in the prefix puppet.
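In hiera form that is just the following; which prefix it lives in is left as an assumption, the key and value are the ones mentioned above:

```yaml
# Hiera data for the haproxy prefix (sketch; the prefix is assumed,
# the key/value are the ones described in the comment above).
prometheus::haproxy_exporter::endpoint: 'http://localhost:8404/stats;csv'
```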

I'm pretty happy with the haproxy stats now. I think I've improved the use of the load monitoring for both paws and tools in this pass as well: https://grafana-labs.wikimedia.org/d/5O3YKfbWz/k8s-haproxy?orgId=1&refresh=5m

I did a little research and confirmed that Prometheus not only does not support setting the Host header, the development team is somewhat hostile to the idea of adding arbitrary headers to scrapes outside of auth headers. So we will not have jupyterhub stats until you can introspect a Kubernetes ingress-ed service from inside the cloud.

I think T257534: CloudVPS: a VM is unable to contact floating IPs of other VMs is mostly ready to go. Please try using paws.wmcloud.org directly and let me know if it works. My tests indicate that it should work just fine now!
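If that holds, a sketch of what the jupyterhub scrape could look like, assuming JupyterHub's /hub/metrics endpoint and the fact that Prometheus derives the Host header from the target address, so scraping the ingress hostname directly sidesteps the header limitation above; path, scheme, and job name are assumptions about the PAWS setup:

```yaml
# Sketch: scrape JupyterHub through the ingress hostname once VMs can
# reach it. Path, scheme, and job name are illustrative assumptions.
scrape_configs:
  - job_name: 'paws-jupyterhub'
    metrics_path: '/hub/metrics'
    scheme: https
    static_configs:
      - targets: ['paws.wmcloud.org']
```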

Bstorm claimed this task.

Metrics are in, as is a dashboard. We will almost certainly want to expand the dashboard to include more performance metrics in the future.