toolforge: new k8s: figure out metrics / observability
Closed, Resolved, Public

Assigned To

Authored By

	aborrero
	Nov 7 2019, 2:28 PM

Description

We haven't planned anything yet for prometheus or similar tooling.

It would be interesting to have metrics on the ingress, at the very least.

the nginx daemon doing ingress
how the custom admission controllers are doing
how the front haproxy is doing
other traffic inside the cluster
api-server, calico and other kube-system pods.

Details

Subject	Repo	Branch	Lines +/-
toolforge: new k8s: prometheus: scrape metrics from each individual ingress pod	operations/puppet	production	+13 -24
toolforge: prometheus: fix regex for cadvisor in the new k8s cluster	operations/puppet	production	+1 -1
toolforge: new k8s: fix metrics directory	operations/puppet	production	+5 -5
toolforge: new k8s: cleanup metrics manifests and files	operations/puppet	production	+46 -41
toolforge: new k8s: give prometheus permission to read pod/proxy resources	operations/puppet	production	+1 -1
toolforge: prometheus: fix regexp for cadvisor discovery	operations/puppet	production	+6 -2
toolforge: prometheus: fix label config for cadvisor metrics	operations/puppet	production	+8 -21
toolforge: prometheus: collect metrics from cadvisor in the new k8s cluster	operations/puppet	production	+36 -0
toolforge: new k8s: deploy cadvisor.yaml	operations/puppet	production	+163 -0
toolforge: prometheus: add job for kube-state-metrics	operations/puppet	production	+36 -0
toolforge: new k8s: kube-state-metrics: updates to the service endpoint	operations/puppet	production	+2 -6
toolforge: new k8s: kube-state-metrics: drop toleration to run on control nodes	operations/puppet	production	+0 -6
toolforge: new k8s: add kube-state-metrics.yaml	operations/puppet	production	+216 -0
toolforge: k8s: metrics: include some hints and comments	operations/puppet	production	+12 -0
toolforge: prometheus: fix port for nginx exporter	operations/puppet	production	+1 -1
toolforge: prometheus: add job for nginx metrics in the front proxy	operations/puppet	production	+14 -4
toolforge: proxy: enable nginx prometheus metrics	operations/puppet	production	+49 -4
protmeheus: haproxy: add support for Debian Buster	operations/puppet	production	+7 -1
toolforge: new k8s: haproxy: enable prometheus metrics	operations/puppet	production	+17 -0
toolforge: prometheus: fix syntax for label in the new-k8s-nodes job	operations/puppet	production	+1 -4
toolforge: prometheus: fix syntax in the inlined config yaml	operations/puppet	production	+3 -1
toolforge: prometheus: enable scraping for the new k8s cluster	operations/puppet	production	+219 -2
ssl: move toolforge-k8s-prometheus priv key to a proper location	labs/private	master	+0 -0
ssl: add dummy private key for toolforge-k8s-prometheus	labs/private	master	+3 -0
toolforge: new k8s: etcd: enable TLS for metrics endpoint	operations/puppet	production	+14 -3

Related Objects

Status	Assigned	Task
		Restricted Task
Resolved	• Bstorm	T246122 Upgrade the Toolforge Kubernetes cluster to v1.16
		Restricted Task
Resolved	bd808	T232536 Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail
Resolved	• Bstorm	T236565 "tools" Cloud VPS project jessie deprecation
Resolved	aborrero	T101651 Set up toolsbeta more fully to help make testing easier
Resolved	• Bstorm	T166949 Homedir/UID info breaks after a while in Tools Kubernetes (can't read replica.my.cnf)
Resolved	• Bstorm	T246059 Add admin account creation to maintain-kubeusers
Resolved	• Bstorm	T154504 Make webservice backend default to kubernetes
Declined	None	T245230 Investigate cpu/ram requests and limits for DaemonSets pods
Resolved	• Bstorm	T214513 Deploy and migrate tools to a Kubernetes v1.15 or newer cluster
Resolved	aborrero	T237643 toolforge: new k8s: figure out metrics / observability
Resolved	aborrero	T237557 new proxy and etcd nodes unreachable by ssh for tools-prometheus
Resolved	aborrero	T238058 toolforge: prometheus-node-exporter not working on tools-proxy-06
Resolved	aborrero	T238096 Toolforge: prometheus: refresh setup
Resolved	aborrero	T245180 Document and test failing over prometheus
Resolved	aborrero	T240402 Deploy or consciously decide not to deploy metrics-server in toolforge kubernetes
Resolved	aborrero	T241853 Move metrics-server and kube-state-metrics into the new metrics namespace

Event Timeline


aborrero added a subtask: T237557: new proxy and etcd nodes unreachable by ssh for tools-prometheus. Nov 8 2019, 12:09 PM

• Bstorm mentioned this in T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade. Nov 8 2019, 11:34 PM

aborrero closed subtask T237557: new proxy and etcd nodes unreachable by ssh for tools-prometheus as Resolved. Nov 11 2019, 11:49 AM

Change 550442 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: etcd: enable TLS for metrics endpoint

https://gerrit.wikimedia.org/r/550442

gerritbot added a project: Patch-For-Review. Nov 12 2019, 10:51 AM

Change 550442 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: etcd: enable TLS for metrics endpoint

https://gerrit.wikimedia.org/r/550442

Maintenance_bot removed a project: Patch-For-Review. Nov 12 2019, 12:10 PM

aborrero closed subtask T238058: toolforge: prometheus-node-exporter not working on tools-proxy-06 as Resolved. Nov 12 2019, 12:54 PM

aborrero updated the task description. Nov 12 2019, 4:53 PM

hey @Bstorm. I'm evaluating 2 setups for prometheus in the new k8s cluster:

1) let the prometheus running on tools-prometheus discover and scrape all the metrics in the new cluster by using the new k8s API.
2) run prometheus inside the new k8s cluster, which is something a lot of documentation assumes. Then use prometheus federation so tools-prometheus can pull the internal metrics from it.

In the new k8s cluster, which is RBAC based, how difficult would it be to generate the client TLS config required for prometheus to scrape metrics using the k8s API? We would need that for option 1).
The legacy cluster had a similar setup, but using bearer tokens for auth.

I imagine that if prometheus is running inside the cluster, it uses a service account, right?
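
For reference, the federation setup usually comes down to one extra scrape job on the outer server. This is a minimal sketch, not something that was deployed here: the target address `prometheus.new-k8s.example.org:9090` and the `new-k8s-.*` job name pattern are hypothetical placeholders.

```yaml
# Hypothetical federation job on tools-prometheus.
# The target address and the job name pattern are placeholders.
scrape_configs:
- job_name: 'new-k8s-federate'
  # keep the labels (job, instance, ...) as set by the in-cluster prometheus
  honor_labels: true
  metrics_path: /federate
  params:
    # only pull series whose job label matches this selector
    'match[]':
    - '{job=~"new-k8s-.*"}'
  static_configs:
  - targets:
    - 'prometheus.new-k8s.example.org:9090'
```

With this layout the in-cluster prometheus does discovery with its service account, and tools-prometheus only needs network access to the `/federate` endpoint.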

For client TLS for option 1, not too hard. We could honestly do it much the same way as we do for the custom controllers. The scripts would need to be adapted a bit to download the cert rather than keep it in a secret object. It would also need to be renewed periodically (which the custom controllers also need, along with documentation/process around it, so I'm glad that came up!)

Depending on the work required for federation, they might be comparable amounts of work.

Ok, I followed your suggestion and refactored the script you had for the custom admission controllers a bit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550673

This script can be used to generate a couple of files that can be put on tools-puppetmaster and then used to deploy the certs to the prometheus servers we have. If this is something we only do once or twice a year, I don't think it's a big deal.

Mentioned in SAL (#wikimedia-cloud) [2019-11-13T17:20:07Z] <arturo> live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster (T237643)

I have a config that may more or less work, but it is not ready yet.

using this script I generated a cert for prometheus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550673 (and scp'ed the certs as required)
I added the RBAC config for prometheus into the new toolsbeta k8s cluster:

# from https://github.com/prometheus/prometheus/blob/master/documentation/examples/rbac-setup.yml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: User
  name: prometheus
  namespace: default

added this prometheus config to tools-prometheus for starters:

#
# arturo's config
#
- job_name: 'new-k8s-nodes'
  scheme: https
  kubernetes_sd_configs:
  - role: node
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'new-k8s-pods'
  scheme: https
  kubernetes_sd_configs:
  - role: pod
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'new-k8s-ingresses'
  scheme: https
  kubernetes_sd_configs:
  - role: ingress
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
      insecure_skip_verify: true
  metrics_path: /probe
  params:
    module: [http_2xx]
  relabel_configs:
  - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
    regex: (.+);(.+);(.+)
    replacement: ${1}://${2}${3}
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox-exporter.example.com:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_ingress_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_ingress_name]
    target_label: kubernetes_name

Prometheus seems happy and so does k8s. However, discovery apparently doesn't work, as prometheus reports that no metrics are fetched for those new jobs... Will keep investigating.
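
One possible direction for that investigation, as a sketch only (an assumption, not a confirmed diagnosis and not the config that was eventually merged). Two details of the config above matter when prometheus runs outside the cluster: `kubernetes.default.svc:443` is a cluster-internal DNS name, and a `tls_config` nested under `kubernetes_sd_configs` applies to discovery requests only, not to the scrapes. The variant below points `__address__` at the external API endpoint and repeats the TLS settings at job level, assuming the API server certificate is valid for that name.

```yaml
# Sketch only: variant of the 'new-k8s-nodes' job for a prometheus running
# outside the cluster. File paths and hostname are taken from the config above.
- job_name: 'new-k8s-nodes'
  scheme: https
  # TLS settings used for the scrape requests themselves
  tls_config:
    ca_file: /srv/prometheus/tools/new-k8s.ca
    cert_file: /srv/prometheus/tools/prometheus.crt
    key_file: /srv/prometheus/tools/prometheus.key
  kubernetes_sd_configs:
  - role: node
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    # TLS settings used for service discovery only
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  # Assumption: scrape through the externally reachable API endpoint
  # instead of the in-cluster service name
  - target_label: __address__
    replacement: k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
```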

Change 551191 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: enable scraping for the new k8s cluster

https://gerrit.wikimedia.org/r/551191

gerritbot added a project: Patch-For-Review. Nov 15 2019, 2:38 PM

Mentioned in SAL (#wikimedia-cloud) [2019-11-15T14:44:53Z] <arturo> stop live-hacks on tools-prometheus-01 T237643

Mentioned in SAL (#wikimedia-cloud) [2019-11-15T14:46:02Z] <arturo> stop live-hacks on toolsbeta-test-k8s-haproxy-1 T237643

Change 551797 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/private@master] ssl: add dummy private key for toolforge-k8s-prometheus

https://gerrit.wikimedia.org/r/551797

Change 551797 merged by Arturo Borrero Gonzalez:
[labs/private@master] ssl: add dummy private key for toolforge-k8s-prometheus

https://gerrit.wikimedia.org/r/551797

aborrero mentioned this in rLPRI78c0e9bb286c: ssl: add dummy private key for toolforge-k8s-prometheus. Nov 19 2019, 12:27 PM

Change 551805 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[labs/private@master] ssl: move toolforge-k8s-prometheus priv key to a proper location

https://gerrit.wikimedia.org/r/551805

Change 551805 merged by Arturo Borrero Gonzalez:
[labs/private@master] ssl: move toolforge-k8s-prometheus priv key to a proper location

https://gerrit.wikimedia.org/r/551805

aborrero mentioned this in rLPRI0207d0e11fef: ssl: move toolforge-k8s-prometheus priv key to a proper location. Nov 19 2019, 12:34 PM

Change 551191 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: enable scraping for the new k8s cluster

https://gerrit.wikimedia.org/r/551191

Mentioned in SAL (#wikimedia-cloud) [2019-11-19T12:46:25Z] <arturo> deploy changes to tools-prometheus to account for the new k8s cluster (T237643)

Change 551816 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix syntax in the inlined config yaml

https://gerrit.wikimedia.org/r/551816

Change 551816 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix syntax in the inlined config yaml

https://gerrit.wikimedia.org/r/551816

Change 551817 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix syntax for label in the new-k8s-nodes job

https://gerrit.wikimedia.org/r/551817

Change 551817 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix syntax for label in the new-k8s-nodes job

https://gerrit.wikimedia.org/r/551817

Mentioned in SAL (#wikimedia-cloud) [2019-11-19T13:49:24Z] <arturo> re-create nginx-ingress pod due to deployment template refresh (T237643)

aborrero mentioned this in T238654: toolforge: new k8s: issues with routing interfering with DNS in the cluster as well as the webhook controllers. Nov 19 2019, 2:00 PM

I'm working on this grafana dashboard as a way to start using metrics collected by prometheus: https://grafana-labs.wikimedia.org/d/toolforge-kubernetes/toolforge-kubernetes?refresh=1m&orgId=1

I discovered a couple of things to improve on the prometheus side and also on the metrics production side. Some metrics are missing, like the memory and CPU used by containers.

aborrero triaged this task as Medium priority. Nov 20 2019, 10:35 AM

Change 552789 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] protmeheus: haproxy: add support for Debian Buster

https://gerrit.wikimedia.org/r/552789

Change 552789 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] protmeheus: haproxy: add support for Debian Buster

https://gerrit.wikimedia.org/r/552789

Change 552794 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: haproxy: include prometheus exporter

https://gerrit.wikimedia.org/r/552794

Change 552794 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: haproxy: enable prometheus metrics

https://gerrit.wikimedia.org/r/552794

Change 553105 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: proxy: enable nginx prometheus metrics

https://gerrit.wikimedia.org/r/553105

Change 553105 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: proxy: enable nginx prometheus metrics

https://gerrit.wikimedia.org/r/553105

Change 553113 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: add job for nginx metrics in the front proxy

https://gerrit.wikimedia.org/r/553113

Change 553113 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: add job for nginx metrics in the front proxy

https://gerrit.wikimedia.org/r/553113

Change 553117 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix port for nginx exporter

https://gerrit.wikimedia.org/r/553117

Change 553117 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix port for nginx exporter

https://gerrit.wikimedia.org/r/553117

Created a couple of grafana dashboards:

this one is for haproxy in front of the apiserver and nginx-ingress:

https://grafana-labs.wikimedia.org/d/5O3YKfbWz/toolforge-k8s-haproxy

this one aggregates metrics for the whole ingress path:

https://grafana-labs.wikimedia.org/d/R7BPaEbWk/toolforge-ingress?refresh=1m&orgId=1

I consider this mostly done, at least until we start having real traffic in the service and see where we lack metrics.

aborrero added a subtask: T240402: Deploy or consciously decide not to deploy metrics-server in toolforge kubernetes. Dec 11 2019, 9:22 AM

Change 556369 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: metrics: include some hints and comments

https://gerrit.wikimedia.org/r/556369

Change 556369 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: metrics: include some hints and comments

https://gerrit.wikimedia.org/r/556369

• Bstorm closed subtask T240402: Deploy or consciously decide not to deploy metrics-server in toolforge kubernetes as Resolved. Dec 11 2019, 8:03 PM

Reopening task. We decided it would be interesting to have more metrics, for example the number of ingress objects. Will try deploying https://github.com/kubernetes/kube-state-metrics

Change 559506 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: add kube-state-metrics.yaml

https://gerrit.wikimedia.org/r/559506

• Bstorm awarded a token. Dec 19 2019, 2:57 PM

In T237643#5754305, @gerritbot wrote:

Change 559506 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: add kube-state-metrics.yaml

https://gerrit.wikimedia.org/r/559506

there are several open questions about this patch. Will have to do a couple of iterations.

Comments on patch. You actually already solved the biggest security concern. It just needs a small tweak.

Change 559506 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: add kube-state-metrics.yaml

https://gerrit.wikimedia.org/r/559506

Change 559771 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: kube-state-metrics: drop toleration to run on control nodes

https://gerrit.wikimedia.org/r/559771

Change 559771 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: kube-state-metrics: drop toleration to run on control nodes

https://gerrit.wikimedia.org/r/559771

Change 559820 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: kube-state-metrics: updates to the service endpoint

https://gerrit.wikimedia.org/r/559820

Change 559820 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: kube-state-metrics: updates to the service endpoint

https://gerrit.wikimedia.org/r/559820

Change 559830 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: add job for kube-state-metrics

https://gerrit.wikimedia.org/r/559830

Change 559830 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: add job for kube-state-metrics

https://gerrit.wikimedia.org/r/559830

kube-state-metrics is working now!
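
For readers who want a picture of what such a job can look like, here is a hedged sketch that scrapes kube-state-metrics through the API server service proxy. It is not the job merged in change 559830. The job name, the `metrics` namespace, the service name `kube-state-metrics` and port `8080` are assumptions, and the API address is the toolsbeta one used earlier in this task. It would also need the `services/proxy` resource in the prometheus ClusterRole.

```yaml
# Sketch only: scrape kube-state-metrics via the API server service proxy.
# Namespace, service name and port are assumptions, not taken from the patch.
- job_name: 'new-k8s-kube-state-metrics'
  scheme: https
  tls_config:
    ca_file: /srv/prometheus/tools/new-k8s.ca
    cert_file: /srv/prometheus/tools/prometheus.crt
    key_file: /srv/prometheus/tools/prometheus.key
  # path format: /api/v1/namespaces/<ns>/services/<name>:<port>/proxy/<path>
  metrics_path: /api/v1/namespaces/metrics/services/kube-state-metrics:8080/proxy/metrics
  static_configs:
  - targets:
    - k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
```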

TODO:

update docs https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Deploying_k8s
use the new metrics in grafana dashboards, for example https://grafana-labs-admin.wikimedia.org/d/toolforge-kubernetes/toolforge-kubernetes

Change 561654 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: deploy cadvisor.yaml

https://gerrit.wikimedia.org/r/561654

Mentioned in SAL (#wikimedia-cloud) [2020-01-03T11:21:49Z] <arturo> upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for T237643

Mentioned in SAL (#wikimedia-cloud) [2020-01-03T11:27:01Z] <arturo> [new k8s] cadvisor is running in the metrics namespace now (T237643)

Change 561654 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: deploy cadvisor.yaml

https://gerrit.wikimedia.org/r/561654

Mentioned in SAL (#wikimedia-cloud) [2020-01-03T11:51:02Z] <arturo> [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 (T237643)

Change 561831 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: collect metrics from cadvisor in the new k8s cluster

https://gerrit.wikimedia.org/r/561831

Change 561831 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: collect metrics from cadvisor in the new k8s cluster

https://gerrit.wikimedia.org/r/561831
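
As an illustration of how a job like this can be shaped when prometheus lives outside the cluster, the sketch below discovers pods and scrapes each cadvisor pod through the API server pod proxy. It is not the merged config. The job name, the `cadvisor-.+` pod name pattern and port `8080` are assumptions; the `metrics` namespace comes from the SAL entry above, and the API address is the toolsbeta one used earlier in this task.

```yaml
# Sketch only: scrape cadvisor pods through the API server pod proxy.
# Pod name pattern and port are assumptions.
- job_name: 'new-k8s-cadvisor'
  scheme: https
  tls_config:
    ca_file: /srv/prometheus/tools/new-k8s.ca
    cert_file: /srv/prometheus/tools/prometheus.crt
    key_file: /srv/prometheus/tools/prometheus.key
  kubernetes_sd_configs:
  - role: pod
    api_server: https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
    tls_config:
      ca_file: /srv/prometheus/tools/new-k8s.ca
      cert_file: /srv/prometheus/tools/prometheus.crt
      key_file: /srv/prometheus/tools/prometheus.key
  relabel_configs:
  # keep only cadvisor pods in the metrics namespace (regex is fully anchored)
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
    action: keep
    regex: metrics;cadvisor-.+
  # every target is reached through the API server
  - target_label: __address__
    replacement: k8s.toolsbeta.eqiad1.wikimedia.cloud:6443
  # path format: /api/v1/namespaces/<ns>/pods/<name>:<port>/proxy/<path>
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
    regex: (.+);(.+)
    target_label: __metrics_path__
    replacement: /api/v1/namespaces/${1}/pods/${2}:8080/proxy/metrics
  # all targets share one __address__, so set instance explicitly
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: instance
```

Setting `instance` from the node name keeps the series from different cadvisor pods apart, since they would otherwise all inherit the API server address as their instance label.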

Change 561839 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix label config for cadvisor metrics

https://gerrit.wikimedia.org/r/561839

Change 561839 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix label config for cadvisor metrics

https://gerrit.wikimedia.org/r/561839

Change 561887 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix regexp for cadvisor discovery

https://gerrit.wikimedia.org/r/561887

Change 561887 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix regexp for cadvisor discovery

https://gerrit.wikimedia.org/r/561887

Change 561888 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: give prometheus permission to read pod/proxy resources

https://gerrit.wikimedia.org/r/561888

Change 561888 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: give prometheus permission to read pod/proxy resources

https://gerrit.wikimedia.org/r/561888
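
A hedged sketch of what this kind of permission change looks like in the ClusterRole, based on the RBAC config posted earlier in this task with `pods/proxy` added. The actual one-line patch may differ, and `rbac.authorization.k8s.io/v1` is used here as an assumption where the earlier manifest used `v1beta1`.

```yaml
# Sketch only: prometheus ClusterRole extended with pods/proxy so that
# scrapes going through the API server pod proxy are authorized.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  - pods/proxy
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
```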

Change 562800 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: cleanup metrics manifests and files

https://gerrit.wikimedia.org/r/562800

Change 562800 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: cleanup metrics manifests and files

https://gerrit.wikimedia.org/r/562800

Change 562802 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: fix metrics directory

https://gerrit.wikimedia.org/r/562802

Change 562802 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: fix metrics directory

https://gerrit.wikimedia.org/r/562802

aborrero closed subtask T241853: Move metrics-server and kube-state-metrics into the new metrics namespace as Resolved. Jan 8 2020, 12:50 PM

This grafana dashboard per k8s namespace is mostly ready: https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources

@bd808 feel free to add a link in https://tools.wmflabs.org/k8s-status

Change 562837 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: prometheus: fix regex for cadvisor in the new k8s cluster

https://gerrit.wikimedia.org/r/562837

Change 562838 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: prometheus: scrape metrics from each individual ingress pod

https://gerrit.wikimedia.org/r/562838

Change 562837 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: prometheus: fix regex for cadvisor in the new k8s cluster

https://gerrit.wikimedia.org/r/562837

Change 562838 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: prometheus: scrape metrics from each individual ingress pod

https://gerrit.wikimedia.org/r/562838

I think we are good for now. Closing task.

aborrero closed subtask T238096: Toolforge: prometheus: refresh setup as Resolved. Feb 7 2020, 10:56 AM
