toolforge: new k8s: figure out metrics / observability
Closed, Resolved, Public

Description

We haven't planned anything yet for prometheus or the like.

It would be interesting to have metrics on the ingress, at the very least. Areas worth covering:

  • the nginx daemon doing ingress
  • how the custom admission controllers are doing
  • how the front haproxy is doing
  • other traffic inside the cluster
  • api-server, calico and other kube-system pods.

Details

Project            Branch      Lines +/-
operations/puppet  production  +13 -24
operations/puppet  production  +1 -1
operations/puppet  production  +5 -5
operations/puppet  production  +46 -41
operations/puppet  production  +1 -1
operations/puppet  production  +6 -2
operations/puppet  production  +8 -21
operations/puppet  production  +36 -0
operations/puppet  production  +163 -0
operations/puppet  production  +36 -0
operations/puppet  production  +2 -6
operations/puppet  production  +0 -6
operations/puppet  production  +216 -0
operations/puppet  production  +12 -0
operations/puppet  production  +1 -1
operations/puppet  production  +14 -4
operations/puppet  production  +49 -4
operations/puppet  production  +7 -1
operations/puppet  production  +17 -0
operations/puppet  production  +1 -4
operations/puppet  production  +3 -1
operations/puppet  production  +219 -2
labs/private       master      +0 -0
labs/private       master      +3 -0
operations/puppet  production  +14 -3

Related Objects

Status    Assigned
Resolved  Bstorm
Resolved  Bstorm
Open      None
Resolved  bd808
Resolved  Bstorm
Resolved  aborrero
Stalled   None
Open      Bstorm
Resolved  Bstorm
Open      None
Resolved  Bstorm
Resolved  aborrero
Resolved  aborrero
Resolved  aborrero
Resolved  aborrero
Resolved  aborrero
Resolved  aborrero
Resolved  aborrero

Event Timeline

Change 550442 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: etcd: enable TLS for metrics endpoint

https://gerrit.wikimedia.org/r/550442

Change 550442 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: etcd: enable TLS for metrics endpoint

https://gerrit.wikimedia.org/r/550442

aborrero updated the task description. Nov 12 2019, 4:53 PM

hey @Bstorm. I'm evaluating 2 setups for prometheus in the new k8s cluster:

  1. let the prometheus instance running on tools-prometheus discover and scrape all the metrics in the new cluster through the new k8s API.
  2. run prometheus inside the new k8s cluster, which is what much of the documentation assumes, then use prometheus federation to get the internal metrics into tools-prometheus.
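
A minimal sketch of what option 2 could look like on the tools-prometheus side, assuming an in-cluster prometheus is reachable from outside the cluster. The job name, the target address and the `match[]` selector are hypothetical; `/federate`, `honor_labels` and `match[]` are the standard prometheus federation settings. Note that federation is pull-based: tools-prometheus would scrape the in-cluster instance.

```yaml
scrape_configs:
  # Hypothetical job on tools-prometheus pulling from an in-cluster prometheus.
  - job_name: 'new-k8s-federate'
    scheme: https
    honor_labels: true        # keep the labels set by the in-cluster prometheus
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'       # assumption: federate every series; narrow this in practice
    static_configs:
      - targets:
          - 'prometheus.k8s.example.org:443'   # hypothetical address
```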

In the new k8s cluster, which is RBAC based, how difficult would it be to generate the client TLS config that prometheus needs to scrape metrics through the k8s API? We would need that for option 1.
The legacy cluster had a similar arrangement, but used bearer tokens for auth.

I imagine that if prometheus is running inside the cluster, it uses a service account, right?

For client TLS for option 1, not too hard. We could honestly do it much the same way as we do for the custom controllers. The scripts would need to be adapted a bit for downloading the cert rather than keeping it in a secret object--also it would need to be renewed periodically (which the custom controllers also need--and need documentation/process around, so I'm glad that came up!)
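
One way such a client cert could be requested, sketched under the assumption that the Kubernetes CertificateSigningRequest API is used (the custom controller scripts may do it differently). The object name is hypothetical, and the `request` value is a placeholder for a base64-encoded PEM CSR generated elsewhere; the CN in that CSR becomes the k8s username. Older clusters expose this API as `certificates.k8s.io/v1beta1`, where `signerName` is optional or absent.

```yaml
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: prometheus-client            # hypothetical name
spec:
  # Placeholder: base64-encoded PEM CSR with CN=prometheus, generated out of band.
  request: <base64-encoded PEM CSR>
  signerName: kubernetes.io/kube-apiserver-client
  usages:
    - client auth                    # the cert is only used to authenticate to the API
```

After an admin approves the request with `kubectl certificate approve`, the signed cert appears in the object's `.status.certificate` field and can be downloaded from there. It still needs periodic renewal, as noted above.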

Depending on the work required for federation, they might be comparable amounts of work.

Ok, I followed your suggestion and refactored the script you had for the custom admission controllers a bit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550673

This script generates a couple of files that can be put on tools-puppetmaster, which can then deploy the certs to the prometheus servers we have. If this is something we only do once or twice a year, I don't think it's a big deal.

Mentioned in SAL (#wikimedia-cloud) [2019-11-13T17:20:07Z] <arturo> live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster (T237643)

I have a config that may more or less work, but it is not ready yet.

  1. using this script I generated a cert for prometheus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550673 (and scp'ed the certs as required)
  2. I added the RBAC config for prometheus into the new toolsbeta k8s cluster:
# from https://github.com/prometheus/prometheus/blob/master/documentation/examples/rbac-setup.yml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: User
  name: prometheus
  namespace: default
  3. added this prometheus config to tools-prometheus for starters:
#
# arturo's config
#
- job_name: 'new-k8s-nodes'
  scheme: https
  kubernetes_sd_configs:
  - role: node
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'new-k8s-pods'
  scheme: https
  kubernetes_sd_configs:
  - role: pod
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'new-k8s-ingresses'
  scheme: https
  kubernetes_sd_configs:
  - role: ingress
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  metrics_path: /probe
  params:
    module: [http_2xx]
  relabel_configs:
  - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
    regex: (.+);(.+);(.+)
    replacement: ${1}://${2}${3}
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox-exporter.example.com:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_ingress_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_ingress_name]
    target_label: kubernetes_name

Prometheus seems happy, and so does k8s. However, discovery apparently doesn't work, as prometheus reports that no metrics are fetched for those new jobs. Will keep investigating.
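
One thing worth checking here, offered as an assumption rather than a confirmed diagnosis: in prometheus, the `tls_config` under `kubernetes_sd_configs` is used only for discovery requests, while the scrape itself uses the job-level `tls_config`. In addition, `kubernetes.default.svc:443` is an in-cluster DNS name that a prometheus running outside the cluster probably cannot resolve. A sketch of the nodes job with both points addressed, reusing the paths and hostname from the config above:

```yaml
- job_name: 'new-k8s-nodes'
  scheme: https
  # Job-level tls_config: applies to the scrape requests themselves.
  tls_config:
    ca_file: /srv/prometheus/tools/new-k8s.ca
    cert_file: /srv/prometheus/tools/prometheus.crt
    key_file: /srv/prometheus/tools/prometheus.key
  kubernetes_sd_configs:
  - role: node
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    # SD-level tls_config: applies only to discovery calls to the API server.
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  # Scrape through the API server's externally reachable address.
  - target_label: __address__
    replacement: k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
```

This sketch drops `insecure_skip_verify: true` on the assumption that the CA file matches the API server certificate; if it does not, verification would fail and the flag would be needed again while testing.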

Change 551191 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: enable scraping for the new k8s cluster

https://gerrit.wikimedia.org/r/551191

Mentioned in SAL (#wikimedia-cloud) [2019-11-15T14:44:53Z] <arturo> stop live-hacks on tools-prometheus-01 T237643

Mentioned in SAL (#wikimedia-cloud) [2019-11-15T14:46:02Z] <arturo> stop live-hacks on toolsbeta-test-k8s-haproxy-1 T237643

Change 551797 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/private@master] ssl: add dummy private key for toolforge-k8s-prometheus

https://gerrit.wikimedia.org/r/551797

Change 551797 merged by Arturo Borrero Gonzalez:
[labs/private@master] ssl: add dummy private key for toolforge-k8s-prometheus

https://gerrit.wikimedia.org/r/551797

Change 551805 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/private@master] ssl: move toolforge-k8s-prometheus priv key to a proper location

https://gerrit.wikimedia.org/r/551805

Change 551805 merged by Arturo Borrero Gonzalez:
[labs/private@master] ssl: move toolforge-k8s-prometheus priv key to a proper location

https://gerrit.wikimedia.org/r/551805

Change 551191 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: enable scraping for the new k8s cluster

https://gerrit.wikimedia.org/r/551191

Mentioned in SAL (#wikimedia-cloud) [2019-11-19T12:46:25Z] <arturo> deploy changes to tools-prometheus to account for the new k8s cluster (T237643)

Change 551816 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix syntax in the inlined config yaml

https://gerrit.wikimedia.org/r/551816

Change 551816 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix syntax in the inlined config yaml

https://gerrit.wikimedia.org/r/551816

Change 551817 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix syntax for label in the new-k8s-nodes job

https://gerrit.wikimedia.org/r/551817

Change 551817 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix syntax for label in the new-k8s-nodes job

https://gerrit.wikimedia.org/r/551817

Mentioned in SAL (#wikimedia-cloud) [2019-11-19T13:49:24Z] <arturo> re-create nginx-ingress pod due to deployment template refresh (T237643)

I'm working on this grafana dashboard as a way to start using metrics collected by prometheus: https://grafana-labs.wikimedia.org/d/toolforge-kubernetes/toolforge-kubernetes?refresh=1m&orgId=1

I discovered a couple of things to improve on the prometheus side and also on the metrics-producing side. Some metrics are missing, such as the memory and CPU used by containers.

aborrero triaged this task as Medium priority. Nov 20 2019, 10:35 AM

Change 552789 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] protmeheus: haproxy: add support for Debian Buster

https://gerrit.wikimedia.org/r/552789

Change 552789 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] protmeheus: haproxy: add support for Debian Buster

https://gerrit.wikimedia.org/r/552789

Change 552794 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: haproxy: include prometheus exporter

https://gerrit.wikimedia.org/r/552794

Change 552794 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: haproxy: enable prometheus metrics

https://gerrit.wikimedia.org/r/552794

Change 553105 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: proxy: enable nginx prometheus metrics

https://gerrit.wikimedia.org/r/553105

Change 553105 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: proxy: enable nginx prometheus metrics

https://gerrit.wikimedia.org/r/553105

Change 553113 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: add job for nginx metrics in the front proxy

https://gerrit.wikimedia.org/r/553113

Change 553113 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: add job for nginx metrics in the front proxy

https://gerrit.wikimedia.org/r/553113

Change 553117 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix port for nginx exporter

https://gerrit.wikimedia.org/r/553117

Change 553117 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix port for nginx exporter

https://gerrit.wikimedia.org/r/553117

aborrero claimed this task. Nov 26 2019, 7:05 PM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Created a couple of grafana dashboards:

  • this one is for haproxy in front of the apiserver and nginx-ingress:

https://grafana-labs.wikimedia.org/d/5O3YKfbWz/toolforge-k8s-haproxy

  • this one aggregates metrics for the whole ingress path:

https://grafana-labs.wikimedia.org/d/R7BPaEbWk/toolforge-ingress?refresh=1m&orgId=1

aborrero closed this task as Resolved. Nov 27 2019, 5:07 PM

I declare this is mostly done, at least until we start to have real traffic in the service and see where we lack metrics.

Change 556369 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: metrics: include some hints and comments

https://gerrit.wikimedia.org/r/556369

Change 556369 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: metrics: include some hints and comments

https://gerrit.wikimedia.org/r/556369

aborrero reopened this task as Open. Dec 19 2019, 10:18 AM

Reopening task. We decided it would be interesting to have more metrics, for example the number of ingress objects. Will try deploying https://github.com/kubernetes/kube-state-metrics
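
As an illustration of the kind of data kube-state-metrics makes available, a prometheus recording rule that counts ingress objects per namespace could look roughly like this. `kube_ingress_info` is a metric exposed by kube-state-metrics; the group name and the recorded series name are hypothetical.

```yaml
groups:
  - name: toolforge-k8s-objects            # hypothetical group name
    rules:
      # One series per namespace, holding the number of ingress objects in it.
      - record: namespace:kube_ingress_info:count
        expr: count by (namespace) (kube_ingress_info)
```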

Change 559506 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: add kube-state-metrics.yaml

https://gerrit.wikimedia.org/r/559506

There are several open questions about this patch. Will have to do a couple of iterations.

Comments on patch. You actually already solved the biggest security concern. It just needs a small tweak.

Change 559506 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: add kube-state-metrics.yaml

https://gerrit.wikimedia.org/r/559506

Change 559771 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: kube-state-metrics: drop toleration to run on control nodes

https://gerrit.wikimedia.org/r/559771

Change 559771 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: kube-state-metrics: drop toleration to run on control nodes

https://gerrit.wikimedia.org/r/559771

Change 559820 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: kube-state-metrics: updates to the service endpoint

https://gerrit.wikimedia.org/r/559820

Change 559820 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: kube-state-metrics: updates to the service endpoint

https://gerrit.wikimedia.org/r/559820

Change 559830 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: add job for kube-state-metrics

https://gerrit.wikimedia.org/r/559830

Change 559830 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: add job for kube-state-metrics

https://gerrit.wikimedia.org/r/559830

Change 561654 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: deploy cadvisor.yaml

https://gerrit.wikimedia.org/r/561654

Mentioned in SAL (#wikimedia-cloud) [2020-01-03T11:21:49Z] <arturo> upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for T237643

Mentioned in SAL (#wikimedia-cloud) [2020-01-03T11:27:01Z] <arturo> [new k8s] cadvisor is running in the metrics namespace now (T237643)

Change 561654 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: deploy cadvisor.yaml

https://gerrit.wikimedia.org/r/561654

Change 561831 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: collect metrics from cadvisor in the new k8s cluster

https://gerrit.wikimedia.org/r/561831

Change 561831 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: collect metrics from cadvisor in the new k8s cluster

https://gerrit.wikimedia.org/r/561831

Change 561839 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix label config for cadvisor metrics

https://gerrit.wikimedia.org/r/561839

Change 561839 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix label config for cadvisor metrics

https://gerrit.wikimedia.org/r/561839

Change 561887 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix regexp for cadvisor discovery

https://gerrit.wikimedia.org/r/561887

Change 561887 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix regexp for cadvisor discovery

https://gerrit.wikimedia.org/r/561887

Change 561888 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: give prometheus permission to read pod/proxy resources

https://gerrit.wikimedia.org/r/561888

Change 561888 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: give prometheus permission to read pod/proxy resources

https://gerrit.wikimedia.org/r/561888

Change 562800 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: cleanup metrics manifests and files

https://gerrit.wikimedia.org/r/562800

Change 562800 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: cleanup metrics manifests and files

https://gerrit.wikimedia.org/r/562800

Change 562802 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: fix metrics directory

https://gerrit.wikimedia.org/r/562802

Change 562802 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: fix metrics directory

https://gerrit.wikimedia.org/r/562802

Change 562837 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix regex for cadvisor in the new k8s cluster

https://gerrit.wikimedia.org/r/562837

Change 562838 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: prometheus: scrape metrics from each individual ingress pod

https://gerrit.wikimedia.org/r/562838

Change 562837 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix regex for cadvisor in the new k8s cluster

https://gerrit.wikimedia.org/r/562837

Change 562838 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: prometheus: scrape metrics from each individual ingress pod

https://gerrit.wikimedia.org/r/562838

aborrero closed this task as Resolved. Jan 16 2020, 2:34 PM

I think we are good for now. Closing task.