Currently Grafana runs as a single VM in each DC. If one instance becomes saturated (as we saw today - see T414604) or otherwise impaired, we lose the ability to query Grafana. This isn't sustainable, and we need to investigate ways to scale Grafana horizontally.
As a first step, we'll need to migrate Grafana data from SQLite to MariaDB so that multiple instances can share the same data.
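As a rough sketch of what pointing every instance at a shared database could look like: Grafana lets settings in its `[database]` section be overridden with `GF_DATABASE_*` environment variables, and MariaDB is configured through the `mysql` database type. The helper name, the hostname, and the default database/user names below are placeholders for illustration only, not our actual setup.

```python
def grafana_db_env(host, name="grafana", user="grafana", password=""):
    """Build GF_DATABASE_* overrides for a shared MariaDB backend.

    Hypothetical helper: the defaults here are assumptions, not real values.
    """
    return {
        "GF_DATABASE_TYPE": "mysql",  # MariaDB uses Grafana's "mysql" database type
        "GF_DATABASE_HOST": host,     # host:port of the shared database
        "GF_DATABASE_NAME": name,
        "GF_DATABASE_USER": user,
        "GF_DATABASE_PASSWORD": password,
    }


if __name__ == "__main__":
    # Placeholder hostname; every Grafana instance would get the same values.
    env = grafana_db_env("mariadb.example.invalid:3306")
    for key in sorted(env):
        if key != "GF_DATABASE_PASSWORD":  # avoid echoing secrets
            print(f"{key}={env[key]}")
```

Whether these settings end up being rendered by Puppet or by a Helm chart depends on where Grafana runs, which is the open question below.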
We'll need to evaluate the risk around where Grafana itself runs. If it's a case of multiple Ganeti VMs (an extension of the existing setup), then we don't really introduce any additional complexity.
However, if we want to scale more reliably, in a way that we don't need to manage manually, I think a natural home for the Grafana components is probably Kubernetes. We need to investigate whether the downsides of running Grafana in Kubernetes are acceptable, in particular the risk of losing access to Grafana during a severe Kubernetes outage.
(We could in theory run an emergency Ganeti instance that is pooled alongside the Kubernetes nodes, but that gets quite messy: we'd be managing pools manually unless service.yaml had a mechanism for saying kubesvc + grafana1003.eqiad.wmnet. We'd also have to maintain both Helm and Puppet config for Grafana, which is kinda gross.)