
Grafana shows zero EventLogging events for around 44 hours around January 15
Closed, Resolved (Public)

Description

Seen for several schemas, including

Grafana EventLogging-schema ReadingDepth Jan 1-31, 2019 (Screenshot from 2019-02-10.png)

However, the data in Hive and Druid for the same schemas seems fine; see e.g. VirtualPageView in Turnilo. So presumably this is an issue with Grafana/Graphite itself.

(Discovered by @Jdlrobson during the web team's chores)

Event Timeline

If I follow the links I can see the hole in ReadingDepth only sometimes. The first thought that comes to mind is that, since this data is backed by Grafana/Prometheus, we might see the hole only when Grafana hits one of the Prometheus masters that lacks the metrics (Grafana uses a load balancer endpoint, so it is not aware of how many hosts are serving metrics).

fgiunchedi subscribed.

I can confirm what @elukey was seeing / saying, namely that the data seems to be missing from only one Prometheus instance (hitting d and then r in Grafana reloads the dashboard). This is of course suboptimal and will be resolved once we have in place something like Thanos, which is able to merge responses from multiple Prometheus hosts. See also T213918: Investigate distributed and long term storage solutions for Prometheus. On the specific issue at hand, I checked SAL and there was no maintenance on Prometheus eqiad at the time, so definitely something happened to only one of the hosts. Leaving the task open for now.
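To illustrate why the hole shows up only on some reloads, here is a toy sketch in Python. It assumes two replicas scraping the same series, one of which missed a window; the sample data and the function names (find_gaps, merge_replicas) are hypothetical and made up for illustration. The merge here is a naive union by timestamp and is not how Thanos deduplicates internally.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def find_gaps(samples: List[Sample], start: int, end: int, step: int) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) ranges of expected timestamps with no sample."""
    present = {ts for ts, _ in samples}
    gaps = []
    gap_start = None
    last_missing = None
    for ts in range(start, end + 1, step):
        if ts not in present:
            if gap_start is None:
                gap_start = ts
            last_missing = ts
        elif gap_start is not None:
            gaps.append((gap_start, last_missing))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, last_missing))
    return gaps


def merge_replicas(*replicas: List[Sample]) -> List[Sample]:
    """Naive union by timestamp: the first replica that has a sample wins."""
    merged = {}
    for samples in replicas:
        for ts, value in samples:
            merged.setdefault(ts, value)
    return sorted(merged.items())


# Hypothetical data: replica_a is complete, replica_b missed three scrapes.
replica_a = [(ts, 1.0) for ts in range(0, 541, 60)]
replica_b = [(ts, 1.0) for ts in range(0, 541, 60) if ts not in (180, 240, 300)]

gaps_a = find_gaps(replica_a, 0, 540, 60)
gaps_b = find_gaps(replica_b, 0, 540, 60)
gaps_merged = find_gaps(merge_replicas(replica_a, replica_b), 0, 540, 60)
```

A dashboard served from replica_b alone shows the hole, one served from replica_a does not, and the merged view has no hole as long as at least one replica has each sample.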

fgiunchedi claimed this task.

It is indeed, thanks @Volans! Tentatively resolving.