
toolforge: new k8s: figure out metrics / observability
Open, Normal, Public

Description

We haven't planned anything yet for prometheus or the like.

It would be interesting to have metrics on the ingress, at the very least; a rough scrape sketch follows the list below:

  • the nginx daemon doing ingress
  • how the custom admission controllers are doing
  • how the front haproxy is doing
  • other traffic inside the cluster
  • api-server, calico and other kube-system pods.
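Most of the components above already expose Prometheus-format metrics, so the work is mostly on the discovery/scraping side. As a purely illustrative sketch (the annotation-based convention and everything below are assumptions, nothing is set up yet), the nginx ingress controller serves its metrics on port 10254 at /metrics, so flagging the ingress pod for an annotation-driven discovery job could look like:

# illustrative pod template fragment: flag the ingress pod for scraping via
# the common prometheus.io/* annotation convention (10254 is the default
# nginx-ingress metrics port)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "10254"
    prometheus.io/path: "/metrics"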

Event Timeline

aborrero created this task. Thu, Nov 7, 2:28 PM

Change 550442 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: etcd: enable TLS for metrics endpoint

https://gerrit.wikimedia.org/r/550442

Change 550442 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: etcd: enable TLS for metrics endpoint

https://gerrit.wikimedia.org/r/550442
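For context, enabling TLS on etcd's metrics endpoint typically boils down to pointing listen-metrics-urls at an https URL and giving prometheus the matching CA (plus a client cert, if the endpoint requires one). A rough sketch, not the actual puppet change, with hostnames and paths purely illustrative:

# etcd config fragment (sketch): serve /metrics and /health over TLS on a
# dedicated port
listen-metrics-urls: https://0.0.0.0:2381

# matching prometheus scrape job (sketch); the target hostname is illustrative
- job_name: etcd
  scheme: https
  tls_config:
    ca_file: /srv/prometheus/tools/etcd-ca.pem
  static_configs:
  - targets: ['etcd-node.example.wmflabs:2381']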

aborrero updated the task description. Tue, Nov 12, 4:53 PM

hey @Bstorm. I'm evaluating 2 setups for prometheus in the new k8s cluster:

  1. let the prometheus running on tools-prometheus discover and scrape all the metrics in the new cluster by using the new k8s API.
  2. run prometheus inside the new k8s cluster, which is what most documentation assumes, and then use prometheus federation to get the internal metrics into tools-prometheus (roughly sketched after this comment).

In the new k8s cluster, which is RBAC-based, how difficult would it be to generate the client TLS config required for prometheus to scrape metrics using the k8s API? We would need that for option 1.
The legacy cluster had a similar setup, but using bearer tokens for auth.

I imagine that if prometheus is running inside the cluster, it uses a service account, right?
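For reference, on the tools-prometheus side option 2 would boil down to a /federate scrape job pulling selected series from the in-cluster prometheus. A minimal sketch, where the target address and the match[] selector are illustrative:

# federation sketch for option 2 (address and selector are placeholders)
- job_name: 'new-k8s-federate'
  scheme: https
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job=~"kubernetes-.*"}'
  static_configs:
  - targets: ['prometheus.toolsbeta.wmflabs.org:443']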

For client TLS for option 1: not too hard. We could honestly do it much the same way as we do for the custom controllers. The scripts would need to be adapted a bit to download the cert rather than keep it in a secret object, and the cert would also need to be renewed periodically (which the custom controllers also need, and which needs documentation/process around it, so I'm glad that came up!).

Depending on the work required for federation, they might be comparable amounts of work.

Ok, I've been following your suggestion and refactored the script you had for the custom admission controllers a bit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550673

This script can be used to generate a couple of files that can be put on the tools-puppetmaster and then used to deploy the certs to the prometheus servers we have. If this is something we only do once or twice a year, I don't think it's a big deal.
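For reference, a minimal sketch of the kind of flow such a script can use to get a client cert signed by the cluster CA through the CSR API (the object name and approval steps are illustrative; the PEM CSR itself would be generated with openssl beforehand). The important detail is that the CSR carries CN=prometheus, since with x509 auth the CN becomes the kubernetes username that RBAC rules are bound to:

# illustrative CertificateSigningRequest: ask the cluster CA to sign a client
# cert whose CN is "prometheus"
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: toolforge-prometheus
spec:
  request: <base64-encoded PEM CSR, CN=prometheus>
  usages:
  - digital signature
  - key encipherment
  - client auth
# then, roughly:
#   kubectl certificate approve toolforge-prometheus
#   kubectl get csr toolforge-prometheus -o jsonpath='{.status.certificate}' | base64 -d > prometheus.crt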

Mentioned in SAL (#wikimedia-cloud) [2019-11-13T17:20:07Z] <arturo> live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster (T237643)

I have a config that may more or less work, but it's not ready yet.

  1. using this script I generated a cert for prometheus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550673 (and scp'ed the certs as required)
  2. I added the RBAC config for prometheus into the new toolsbeta k8s cluster:
# from https://github.com/prometheus/prometheus/blob/master/documentation/examples/rbac-setup.yml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: User
  name: prometheus
  namespace: default
  3. added this prometheus config to tools-prometheus for starters:
#
# arturo's config
#
- job_name: 'new-k8s-nodes'
  scheme: https
  kubernetes_sd_configs:
  - role: node
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'new-k8s-pods'
  scheme: https
  kubernetes_sd_configs:
  - role: pod
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'new-k8s-ingresses'
  scheme: https
  kubernetes_sd_configs:
  - role: ingress
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  metrics_path: /probe
  params:
    module: [http_2xx]
  relabel_configs:
  - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
    regex: (.+);(.+);(.+)
    replacement: ${1}://${2}${3}
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox-exporter.example.com:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_ingress_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_ingress_name]
    target_label: kubernetes_name

Prometheus seems happy, and so does k8s; however, discovery apparently isn't working somehow, as prometheus reports that no metrics are fetched for those new jobs... Will keep investigating.
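One possible cause worth checking (speculative, assuming tools-prometheus runs outside the new cluster): the tls_config nested under kubernetes_sd_configs is only used for the discovery calls, while the actual scrape uses the job-level scheme/tls_config, and kubernetes.default.svc only resolves from inside the cluster. The nodes job would then need to look roughly like this:

# sketch of the nodes job with job-level TLS and the external apiserver as
# the scrape address (untested)
- job_name: 'new-k8s-nodes'
  scheme: https
  tls_config:                       # used for the scrape itself
    ca_file: /srv/prometheus/tools/new-k8s.ca
    cert_file: /srv/prometheus/tools/prometheus.crt
    key_file: /srv/prometheus/tools/prometheus.key
  kubernetes_sd_configs:
  - role: node
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:                     # used only for service discovery
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: k8s.toolsbeta.eqiad1.wikimedia.cloud:6443   # external apiserver, not kubernetes.default.svc
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics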

Change 551191 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: enable scraping for the new k8s cluster

https://gerrit.wikimedia.org/r/551191

Mentioned in SAL (#wikimedia-cloud) [2019-11-15T14:44:53Z] <arturo> stop live-hacks on tools-prometheus-01 T237643

Mentioned in SAL (#wikimedia-cloud) [2019-11-15T14:46:02Z] <arturo> stop live-hacks on toolsbeta-test-k8s-haproxy-1 T237643

Change 551797 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/private@master] ssl: add dummy private key for toolforge-k8s-prometheus

https://gerrit.wikimedia.org/r/551797

Change 551797 merged by Arturo Borrero Gonzalez:
[labs/private@master] ssl: add dummy private key for toolforge-k8s-prometheus

https://gerrit.wikimedia.org/r/551797

Change 551805 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/private@master] ssl: move toolforge-k8s-prometheus priv key to a proper location

https://gerrit.wikimedia.org/r/551805

Change 551805 merged by Arturo Borrero Gonzalez:
[labs/private@master] ssl: move toolforge-k8s-prometheus priv key to a proper location

https://gerrit.wikimedia.org/r/551805

Change 551191 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: enable scraping for the new k8s cluster

https://gerrit.wikimedia.org/r/551191

Mentioned in SAL (#wikimedia-cloud) [2019-11-19T12:46:25Z] <arturo> deploy changes to tools-prometheus to account for the new k8s cluster (T237643)

Change 551816 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix syntax in the inlined config yaml

https://gerrit.wikimedia.org/r/551816

Change 551816 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix syntax in the inlined config yaml

https://gerrit.wikimedia.org/r/551816

Change 551817 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix syntax for label in the new-k8s-nodes job

https://gerrit.wikimedia.org/r/551817

Change 551817 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix syntax for label in the new-k8s-nodes job

https://gerrit.wikimedia.org/r/551817

Mentioned in SAL (#wikimedia-cloud) [2019-11-19T13:49:24Z] <arturo> re-create nginx-ingress pod due to deployment template refresh (T237643)

I'm working on this grafana dashboard as a way to start using metrics collected by prometheus: https://grafana-labs.wikimedia.org/d/toolforge-kubernetes/toolforge-kubernetes?refresh=1m&orgId=1

I discovered a couple of things to improve on the prometheus side and also on the metrics production side. There are some missing metrics, like memory and CPU used by containers, etc.
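For what it's worth, per-container CPU and memory (container_cpu_usage_seconds_total, container_memory_working_set_bytes, ...) are normally served by the kubelet's cAdvisor endpoint rather than by the pods themselves. A sketch of a scrape job for it through the apiserver proxy, mirroring the nodes job above (the job name is illustrative):

# cAdvisor sketch: same discovery as the nodes job, different metrics path
- job_name: 'new-k8s-cadvisor'
  scheme: https
  tls_config:
    ca_file: /srv/prometheus/tools/new-k8s.ca
    cert_file: /srv/prometheus/tools/prometheus.crt
    key_file: /srv/prometheus/tools/prometheus.key
  kubernetes_sd_configs:
  - role: node
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor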

aborrero triaged this task as Normal priority. Wed, Nov 20, 10:35 AM