
Improve Grafana scalability
Open, Needs Triage, Public

Description

Currently Grafana runs as a single VM in each DC. If one instance becomes saturated (as we saw today - see T414604) or otherwise impaired, we simply lose the ability to query Grafana. This isn't a sustainable situation, and we need to investigate ways to scale Grafana horizontally.

As a first step, we'll need to migrate Grafana's data from SQLite to MariaDB so that state can be shared between instances.
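For reference, pointing Grafana at a shared MariaDB backend is a configuration change in grafana.ini. A minimal sketch (the hostname and credentials below are placeholders, not real infrastructure names):

```ini
# grafana.ini - use a shared MariaDB backend instead of the default
# embedded SQLite file, so multiple instances can serve the same state.
[database]
type = mysql
host = grafana-db.example.wmnet:3306   ; placeholder hostname
name = grafana
user = grafana
; in practice the password would be injected from secrets, not stored here
password = CHANGE_ME
```

The existing SQLite data would still need a one-off migration into MariaDB before instances can be pointed at it.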

We'll also need to evaluate the risk around where Grafana itself runs. If it's a case of multiple Ganeti VMs (an extension of the existing setup), then we don't really introduce any additional complexity.

However, if we want scaling that is reliable without manual management, a natural home for the Grafana components is probably Kubernetes. We need to investigate whether running Grafana in Kubernetes carries an unacceptable risk of losing dashboard access during a severe Kubernetes outage - exactly when we'd need Grafana most.
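To make the Kubernetes option concrete, a minimal sketch of what a multi-replica Grafana Deployment could look like once state lives in MariaDB (image name, hostnames, and replica count are illustrative assumptions, not a proposal):

```yaml
# Hypothetical sketch: Grafana scaled horizontally in Kubernetes.
# Replicas are interchangeable because all state is in MariaDB.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 3                      # horizontal scaling, safe with a shared DB
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: registry.example/grafana:latest   # placeholder image
          ports:
            - containerPort: 3000
          env:
            # Grafana's standard GF_DATABASE_* env overrides
            - name: GF_DATABASE_TYPE
              value: mysql
            - name: GF_DATABASE_HOST
              value: grafana-db.example.wmnet:3306  # placeholder hostname
```

This is only an illustration of the architecture under discussion; the Kubernetes-outage risk noted above still applies regardless of how the manifest is written.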
(We could in theory run an emergency Ganeti instance pooled alongside the Kubernetes nodes, but that gets quite messy: we'd be manually managing pools unless service.yaml supported a mechanism for saying "kubesvc + grafana1003.eqiad.wmnet". We'd also have to maintain both Helm and Puppet config for Grafana, which is kinda gross.)
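To illustrate the mechanism that does not currently exist: a purely hypothetical service.yaml fragment for mixing a kubesvc pool with an out-of-band standby host (the keys and structure here are invented for illustration, not the real schema):

```yaml
# HYPOTHETICAL - no such mechanism exists in service.yaml today.
# Sketch of what pooling "kubesvc + a Ganeti standby" might look like.
grafana:
  port: 3000
  backends:
    - type: kubesvc                      # normal Kubernetes-backed pool
    - type: host                         # emergency standby outside Kubernetes
      name: grafana1003.eqiad.wmnet
```

Even if something like this were supported, the dual Helm/Puppet maintenance burden mentioned above would remain.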

Related Objects

Event Timeline

For visibility, today's outage was a "Grafana consumed all the memory" condition. https://grafana-next.wikimedia.org, which links to the read-only backup in the standby DC, remained available and was used to diagnose the primary Grafana host.